We've been reporting for a while that it's 'easily' possible to get more than
1 Million Transactions Per Second out of Memcached and OpenOnload.
Here's the HOWTO.
Many thanks to Rip Sohan who did the work / tuning. Rip has also taken a look at how
to make memcached go faster - take a look at his patches on the memcached Google group.
His results (reproduced at the end of this post) are pretty stunning.
Maximising Memcached Performance On SF OpenOnload
=================================================
1. Prerequisite tools
a. cpuset (1.5.5)
b. Openonload (201104)
c. SF mycset script
d. memcached 1.4.5
2. Principles:
a. Avoid inter-node memory accesses: Idea is to run a copy of
memcached per-node and only on core-local nodes.
b. Minimise CPU cache pollution: Idea is to prevent cache pollution by
pinning server threads to cores and _only_ running server threads on
the cores.
3. Preparation for setup:
[[I'm going to use my hardware configuration which is similar to the
one you sent me, adjust accordingly]]
[numactl --hardware]
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 3059 MB
node 0 free: 2824 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 3071 MB
node 1 free: 2844 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
So, I have a machine with 2 nodes, 8 CPUs per node, and 3 GB of memory
attached locally to each node, as shown above.
Ideally, I'd have 2 copies of memcached running, with threads for the
first copy running on 0,2,4,6,8..etc and for the second on 1,3,5,7...
NOTE: If [numactl --hardware] does not list your node-local
cores, you can obtain them from the /sys filesystem, e.g.:
[grep --with-filename "," /sys/devices/system/node/node*/cpulist]
which provides the output below:
/sys/devices/system/node/node0/cpulist:0,2,4,6,8,10,12,14
/sys/devices/system/node/node1/cpulist:1,3,5,7,9,11,13,15
indicating node0 has 0,2,4... and node1 has 1,3,5,7...
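On other machines the cpulist files may use ranges (e.g. 0-3,8-11) rather than
a plain comma-separated list. The following is a minimal sketch, not part of the
original setup, that expands such a list and prints the CPUs for each node. The
function names expand_cpulist and print_node_cpus are my own, and the optional
directory argument exists only so the function can be tried against a copy of
the /sys tree.

```shell
# Expand a sysfs cpulist such as "0-3,8" into one CPU id per line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while read -r part; do
        [ -n "$part" ] || continue
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;  # range: first-last
            *)   echo "$part" ;;                   # single CPU id
        esac
    done
}

# Print "nodeN: cpu cpu ..." for every node directory found.
# The body runs in a subshell so its variables do not leak into the caller.
print_node_cpus() (
    dir=${1:-/sys/devices/system/node}
    for f in "$dir"/node*/cpulist; do
        [ -r "$f" ] || continue
        node=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$node" "$(expand_cpulist "$(cat "$f")" | paste -s -d ' ' -)"
    done
)
```

Running print_node_cpus with no argument reads the live /sys tree.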
3. Setup
a. General setup:
i. Turn off swapping: [swapoff -a]
b. CPU Shielding:
i. Create a ringfence around application CPUs. [/root/mycset]
ii. Check your cpu shielding: [cset set -l]. You should get output
similar to the following:
cset:
        Name       CPUs-X    MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
        root       0-15 y     0-1 y   104    2 /
      system          0 y       0 n    47    0 /system
        user       1-15 y       0 n     0    0 /user
iii. Create two new shields, one for memcached on node0 and one for
memcached on node1. This involves a little manipulation of the
pre-configured shield.
a. Destroy the existing user cpuset [cset set --set=user -d]
b. Create a couple of new cpusets for memcached on node 0 and 1.
[cset set --cpu=2,4,6,8,10,12,14 --mem=0 --set=memcached0 --cpu_exclusive
cset set --cpu=1,3,5,7,9,11,13,15 --mem=1 --set=memcached1 --cpu_exclusive]
Note: I start the memcached0 cpuset from cpu 2 because cpu 0 is exclusively
allocated to the system cpuset.
c. Check your cpusets [cset set -l]. You should get output similar to the following:
cset:
        Name               CPUs-X    MEMs-X Tasks Subs Path
------------ ------------------ - ------- - ----- ---- ----------
        root               0-15 y     0-1 y   105    3 /
  memcached0   2,4,6,8,10,12,14 y       0 n     0    0 /memcached0
  memcached1 1,3,5,7,9,11,13,15 y       1 n     0    0 /memcached1
      system                  0 y       0 n    46    0 /system
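If your node layout differs, the two cset commands above can be derived from
each node's cpulist. This is a rough sketch under a few assumptions: the helper
name cset_cmd_for_node is hypothetical, the cpulist is a plain comma-separated
list (no ranges), and the CPU reserved for the system cpuset defaults to 0 as
in my configuration. It only prints the command, so you can inspect it before
running anything.

```shell
# Print the cset command for one node, leaving out the CPU that is
# reserved for the system cpuset (third argument, default 0).
# Flags are the same ones used in the commands above.
cset_cmd_for_node() (
    node=$1 cpus=$2 reserved=${3:-0} kept=""
    for c in $(echo "$cpus" | tr ',' ' '); do
        [ "$c" = "$reserved" ] && continue
        kept="${kept:+$kept,}$c"   # append with a comma separator
    done
    echo "cset set --cpu=$kept --mem=$node --set=memcached$node --cpu_exclusive"
)
```

For example, cset_cmd_for_node 0 0,2,4,6,8,10,12,14 prints the memcached0
command shown above.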
d. Turn off adaptive rx moderation and set interrupt moderation to 60 us.
[ethtool -C eth4 adaptive-rx off rx-usecs-irq 60]
e. Export the relevant OpenOnload variables.
export EF_POLL_USEC=1000
export EF_FDS_MT_SAFE=0
export EF_NETIF_COUNT=4
export EF_EPOLL_SPIN=1
export EF_POLL_SPIN=1
export EF_SELECT_SPIN=1
export EF_STACK_PER_THREAD=1
[For debugging purposes you can set EF_NO_FAIL=0 as well]
f. Change locked memory and file limits.
[ulimit -l unlimited
ulimit -n 2048]
g. Start memcached on node0 in the proper shield.
[cset proc --set memcached0 --exec -- onload \
/home/rs/mp/tmp/memcached-1.4.5/memcached -m 2500 -k -u rs -L -t 2 -p 11212]
Of note in the above command are:
-m 2500 = use 2.5G for memcached. You cannot use more memory than is
physically attached on the node. I usually keep this a little smaller
than the maximum physical memory on the node in order to provide a
buffer for kernel/other process memory allocations.
-k Lock all paged memory, preventing it from being paged out.
-u rs Run the program as user identity rs (You'll probably have to change this to the
memcached user id on your system)
-L Use large (4MB) pages if possible.
-t 2 Use two threads (In my configuration 2 threads were more than sufficient to
provide 500MB/s, you may need to experiment with your config).
-p 11212 Run on port 11212.
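The -m value can be worked out from the per-node size reported by
[numactl --hardware]. Below is a small sketch of that arithmetic; the function
name memcached_mem_mb and the default headroom of 512 MB are my own
assumptions, not a recommendation, so adjust the headroom to whatever buffer
suits your system.

```shell
# Read `numactl --hardware` output on stdin and print a candidate -m value
# (in MB) for the given node: node size minus a headroom for the kernel and
# other processes. The 512 MB default headroom is an illustrative assumption.
memcached_mem_mb() {
    awk -v n="$1" -v h="${2:-512}" \
        '$1 == "node" && $2 == n && $3 == "size:" { print $4 - h }'
}
```

Usage: numactl --hardware | memcached_mem_mb 0
With the 3059 MB node above and a headroom of 559 MB this gives the 2500 used
in the command line.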
h. Once it's running, you need to pin the server threads to CPUs. So, for
example, assuming we know the memcached server is running with PID
3301, we can retrieve its thread ids (tids) with the command [ps -p
3301 -L], which provides output similar to the following:
PID LWP TTY TIME CMD
3301 3301 pts/0 00:00:00 memcached
3301 3321 pts/0 00:00:00 memcached
3301 3322 pts/0 00:00:00 memcached
3301 3323 pts/0 00:00:00 memcached
indicating that the additional LWPs (thread ids) are 3321, 3322 and 3323.
You pin each thread to a core using the taskset program
[taskset -pc 2 3321] - pins tid 3321 to core 2
[taskset -pc 4 3322]
[taskset -pc 6 3323]
You will notice that even though I asked for 2 server threads,
memcached (1.4.5) has started up three threads. Memcached always
starts up (t+1) total threads, where the last thread is used for
hashtable expansion. I have many more CPUs than threads, so this
is not a problem. If you're short of cores you can skip pinning the
final thread.
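The pinning step can be scripted. The sketch below is an illustration rather
than what I ran: the function name pin_cmds is hypothetical, it expects the
output of [ps -p PID -L] on stdin and a list of cores as arguments, and it
skips the line where LWP equals PID, as in the manual example above. It prints
the taskset commands instead of executing them so that you can check them
first.

```shell
# stdin: output of `ps -p PID -L`; arguments: cores to pin to, in order.
# Prints one taskset command per thread and stops when cores run out.
pin_cmds() {
    while read -r pid lwp rest; do
        case $pid in ''|*[!0-9]*) continue ;; esac  # skip the header line
        [ "$pid" = "$lwp" ] && continue             # skip the LWP == PID line
        [ "$#" -gt 0 ] || break                     # no cores left to hand out
        echo "taskset -pc $1 $lwp"
        shift                                       # move on to the next core
    done
}
```

Usage: ps -p 3301 -L | pin_cmds 2 4 6
Once the printed commands look right, pipe them to sh to apply them.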
i. You need to repeat steps g and h for memcached on node1. Remember to
change the appropriate parameters (--set memcached1 and -p 11213).
At this point you should be able to get approx 1e6 TPS using memslap.
4. Note on setting up memslap.
-I used shielding on the machine generating the load as well.
-I used OpenOnload to run the memslap benchmark (same environment vars as
described in 3.e).
-My memslap command line is:
/home/rs/mp/build/bin/memslap -s dellr610e-sf-p1:11212,dellr610e-sf-p1:11213 -t 90s \
-T 8 -c 256 -B -S10M -S1s
----
Rip says:
Hi,
I've been trying to optimise the performance of memcached (1.6) on 10G
Ethernet and in doing so have created a series of patches that enable
it to scale to link speed.
Presently, memcached is unable to provide more than 450K
transactions-per-second (TPS) (as measured with Membase's memslap
benchmark on a set of Solarflare SFC 9020 NICs) with the kernel TCP/IP
stack and about 600K TPS with Solarflare's OpenOnload TCP/IP stack.
With the patches it scales to about 850K TPS with the kernel TCP/IP
stack and 1100K TPS with Solarflare's OpenOnload TCP/IP stack, as
illustrated in the graph at http://www.cl.cam.ac.uk/~rss39/mm_comp.pdf
I have tried to keep the changes as self-contained and small as
possible and have tested them as extensively as I can, but I look
forward to your feedback and comments on the set.
Kind regards,
Ripduman Sohan.
---