Enable systemd hardening options for named

Mon Jan 15 20:14:39 UTC 2018

Tony Finch wrote:
> Ludovic Gasc <gmludo at gmail.com> wrote:
> >
> > 1. The list of minimal capabilities needed for bind to run correctly:
> > http://man7.org/linux/man-pages/man7/capabilities.7.html
> 
> named already drops capabilities - have a look at the code around here:
> https://source.isc.org/cgi-bin/gitweb.cgi?p=bind9.git;a=blob;f=bin/named/unix/os.c;hb=v9_11_2#l234
> 
> Note that it's a bit clever - the privileges are dropped in two stages,
> right at the start, and after the server has been configured.

I checked just now to see what that code actually ends up doing, and on
my system I ended up with:

    $ grep -h ^Cap /proc/$(pidof named)/**/status | sort | uniq -c
          6 CapAmb:	0000000000000000
          6 CapBnd:	0000003fffffffff
          6 CapEff:	0000000001000400
          6 CapInh:	0000000000000000
          6 CapPrm:	0000000001000400
    $

That decodes to:

 - The effective and permitted capability sets were reduced to
   CAP_NET_BIND_SERVICE and CAP_SYS_RESOURCE.

 - The ambient and inheritable capability sets were cleared.

 - The capability bounding set was left completely open-ended.
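
For anyone who wants to reproduce the decoding, here is a minimal sketch
in Python. The CAP_NAMES table and the decode_caps() helper are
illustrative names of mine, not anything shipped with BIND; the table
assumes the bit numbering in <linux/capability.h> for kernels of this
era (bits 0 through 37). If libcap's capsh is installed,
"capsh --decode=<mask>" should give an equivalent answer.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
# Assumption: a kernel that defines capabilities 0..37 (CAP_AUDIT_READ last).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE",
    "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND",
    "CAP_AUDIT_READ",
]


def decode_caps(hex_mask):
    """Turn a hex mask from /proc/<pid>/status into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Bits beyond the table are reported by number, not guessed.
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append("bit %d" % bit)
        mask >>= 1
        bit += 1
    return names


if __name__ == "__main__":
    # Masks copied from the grep output above.
    for label, mask in [("CapEff", "0000000001000400"),
                        ("CapBnd", "0000003fffffffff")]:
        print(label, " ".join(decode_caps(mask)) or "(none)")
```

The effective mask has only bits 10 and 24 set, which is where the two
names above come from; the bounding mask has every bit in the table set.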

It's not clear why CAP_SYS_RESOURCE needs to be retained past startup:

        /*
         * XXX  We might want to add CAP_SYS_RESOURCE, though it's not
         *      clear it would work right given the way linuxthreads work.
         * XXXDCL But since we need to be able to set the maximum number
         * of files, the stack size, data size, and core dump size to
         * support named.conf options, this is now being added to test.
         */
        SET_CAP(CAP_SYS_RESOURCE);

See commits 5e4b7294d88ab58371d8c98e05ea80086dcb67cd,
108490a7f8529aff50a0ac7897580b59a73d9845. "[T]o test"?

CAP_SYS_RESOURCE is documented as permitting:

   CAP_SYS_RESOURCE
          * Use reserved space on ext2 filesystems;
          * make ioctl(2) calls controlling ext3 journaling;
          * override disk quota limits;
          * increase resource limits (see setrlimit(2));
          * override RLIMIT_NPROC resource limit;
          * override maximum number of consoles on console allocation;
          * override maximum number of keymaps;
          * allow more than 64hz interrupts from the real-time clock;
          * raise msg_qbytes limit for a System V message queue above  the
            limit in /proc/sys/kernel/msgmnb (see msgop(2) and msgctl(2));
          * allow  the  RLIMIT_NOFILE resource limit on the number of "in-
            flight" file descriptors to  be  bypassed  when  passing  file
            descriptors  to  another process via a UNIX domain socket (see
            unix(7));
          * override the /proc/sys/fs/pipe-size-max limit when setting the
            capacity of a pipe using the F_SETPIPE_SZ fcntl(2) command.
          * use  F_SETPIPE_SZ to increase the capacity of a pipe above the
            limit specified by /proc/sys/fs/pipe-max-size;
          * override /proc/sys/fs/mqueue/queues_max  limit  when  creating
            POSIX message queues (see mq_overview(7));
          * employ the prctl(2) PR_SET_MM operation;
          * set  /proc/[pid]/oom_score_adj to a value lower than the value
            last set by a process with CAP_SYS_RESOURCE.

I would guess that retaining CAP_NET_BIND_SERVICE and CAP_SYS_RESOURCE
during the process runtime permits open-ended reloading of the config at
runtime (e.g., binding to a new IP address on port 53 without needing to
restart the daemon). So even though BIND drops some capabilities, it's
still running with elevated privileges compared to a traditional
non-root user.

systemd permits a nice pattern for network daemons that want to run as
an unprivileged user but still bind to a privileged port, without using
socket activation and without ever starting the process as root.
Basically, you put something like this in the unit file:

    [Service]
    User=…
    Group=…
    CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SYS_CHROOT CAP_SETPCAP
    AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_SYS_CHROOT CAP_SETPCAP
    …

Any needed filesystem directories and permissions need to be set up
correctly beforehand. The service is started by the init system as the
unprivileged User/Group specified in the unit file, so there's no need
to change UID/GID. CAP_NET_BIND_SERVICE is then used to bind to a
privileged port, CAP_SYS_CHROOT is used to perform the chroot, and
CAP_SETPCAP is used to drop all remaining capabilities from the
capability sets and the capability bounding set, so you end up with a
completely unprivileged process at runtime. (Alternatively you could
keep CAP_NET_BIND_SERVICE and drop CAP_SYS_CHROOT and CAP_SETPCAP, if
you wanted to retain the capability to perform privileged binds at
runtime. Or you could eliminate CAP_SYS_CHROOT and use other systemd
functionality to make parts of the filesystem inaccessible, etc.) This
pattern might be a bit hard to retrofit into BIND at this point, though,
other than by adding more knobs.
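
One way to confirm which end state a daemon actually reached is to read
the Cap* lines back from /proc/<pid>/status once startup has finished. A
minimal sketch in Python follows; parse_cap_sets() and retained_mask()
are hypothetical helper names, and the code only inspects the masks, it
does not drop anything itself.

```python
CAP_NET_BIND_SERVICE = 10  # bit number from <linux/capability.h>


def parse_cap_sets(status_text):
    """Return {'CapInh': int, 'CapPrm': int, ...} from status file text."""
    sets = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            name, _, value = line.partition(":")
            sets[name] = int(value.strip(), 16)
    return sets


def retained_mask(sets):
    """OR together every capability set, including the bounding set."""
    mask = 0
    for value in sets.values():
        mask |= value
    return mask


def is_fully_unprivileged(sets):
    """True when no capability is left in any set."""
    return retained_mask(sets) == 0


def keeps_only_bind(sets):
    """True when CAP_NET_BIND_SERVICE is the only capability left anywhere."""
    return retained_mask(sets) == 1 << CAP_NET_BIND_SERVICE


if __name__ == "__main__":
    # Inspect the current process; substitute a daemon's PID for "self".
    with open("/proc/self/status") as f:
        sets = parse_cap_sets(f.read())
    for name in sorted(sets):
        print("%s: %016x" % (name, sets[name]))
    print("fully unprivileged:", is_fully_unprivileged(sets))
```

Including CapBnd in the check is deliberate: a process with empty
effective and permitted sets but an open bounding set has not reached
the "completely unprivileged" state described above.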

-- 
Robert Edmonds