Tuning suggestions for high-core-count Linux servers

Browne, Stuart Stuart.Browne at neustar.biz
Fri Jun 2 07:12:09 UTC 2017


Just some interesting investigation results. One of the URLs Matthew Ian Eis linked to talked about using a tool called 'perf'. For the hell of it, I gave it a shot.

Sure enough it tells some very interesting things.
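For anyone who wants to try the same thing, the capture was along these lines (I'm writing this from memory, so treat it as a sketch rather than the exact invocation):

    # Sample call stacks from the running named process for the length of
    # the test run (180s), writing the samples out to perf.data.
    perf record -g -p $(pgrep -x named) -- sleep 180

    # Summarize which symbols ate the most CPU time; the Overhead / Command /
    # Shared Object / Symbol columns further down come from this report.
    perf report --stdio | head -40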

When BIND was restricted to using a single NUMA node, the biggest call (to _raw_spin_lock) showed 7.05% overhead.

When BIND was allowed to use both NUMA nodes, the same call showed 49.74% overhead; an astonishing difference.
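(In case anyone asks how the restriction was done: numactl is the obvious way. The invocation below is a sketch with example paths and node numbers, not necessarily the exact command used here.)

    # Pin both CPU scheduling and memory allocation to NUMA node 0 so that
    # named never touches the remote node.
    numactl --cpunodebind=0 --membind=0 /usr/sbin/named -u named -f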

While it was running unrestricted, memory was in use on both nodes:

[root@kr20s2601 ~]# numastat -p 22441

Per-node process memory usage (in MBs) for PID 22441 (named)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.45            0.12            0.57
Stack                        0.71            0.64            1.35
Private                      5.28         9415.30         9420.57
----------------  --------------- --------------- ---------------
Total                        6.43         9416.07         9422.50

Given the numbers here (nearly all of named's memory sits on Node 1 anyway), you wouldn't expect it to make much of a difference.

Sadly, I didn't capture which CPU the UDP listener threads were attached to.
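If anyone wants to grab that next time, something along these lines should show which CPU each named thread last ran on (just a suggestion; I didn't run it during these tests):

    # One line per named thread: thread id (lwp), last CPU it ran on (psr),
    # and the thread name.
    ps -T -o lwp,psr,comm -p $(pgrep -x named)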

Anyway, what I've changed so far (a sysctl sketch for applying these follows the list):

    vm.swappiness = 0
    vm.dirty_ratio = 1
    vm.dirty_background_ratio = 1
    kernel.sched_min_granularity_ns = 10000000
    kernel.sched_migration_cost_ns = 5000000
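These can be applied on the fly like so; putting the same key = value lines into a file under /etc/sysctl.d/ makes them persist across reboots (the file name below is just my choice):

    # Apply the values above immediately; to persist them, drop the same
    # lines into e.g. /etc/sysctl.d/90-bind-tuning.conf and run 'sysctl --system'.
    sysctl -w vm.swappiness=0
    sysctl -w vm.dirty_ratio=1
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w kernel.sched_min_granularity_ns=10000000
    sysctl -w kernel.sched_migration_cost_ns=5000000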

Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

Stuart

'perf' data collected during a 3 minute test run:

[root@kr20s2601 ~]# ls -al perf.data*
-rw-------. 1 root root  717350012 Jun  2 08:36 perf.data.24
-rw-------. 1 root root 1366620296 Jun  2 08:53 perf.data.48

'perf' top 5 (24 cores, numa restricted):

Overhead  Command  Shared Object         Symbol
   7.05%  named    [kernel.kallsyms]     [k] _raw_spin_lock
   6.96%  named    libpthread-2.17.so    [.] pthread_mutex_lock
   3.84%  named    libc-2.17.so          [.] vfprintf
   2.36%  named    libdns.so.165.0.7     [.] dns_name_fullcompare
   2.02%  named    libisc.so.160.1.2     [.] isc_log_wouldlog

'perf' top 5 (48 cores, numa nodes unrestricted):

Overhead  Command  Shared Object         Symbol
  49.74%  named    [kernel.kallsyms]     [k] _raw_spin_lock
   4.52%  named    libpthread-2.17.so    [.] pthread_mutex_lock
   3.09%  named    libisc.so.160.1.2     [.] isc_log_wouldlog
   1.84%  named    [kernel.kallsyms]     [k] _raw_spin_lock_bh
   1.56%  named    libc-2.17.so          [.] vfprintf

