I was recently dockerizing tests for a program I’m working on that uses fanotify and noticed that the tests, which worked in a VM, failed in the container. The logs showed the culprit: fanotify_init could not be called in the container. TL;DR: you need to run the container with the CAP_SYS_ADMIN capability. But I’m new to the Linux kernel and wanted to know why this was necessary.

To replicate the issue at home, use the following Dockerfile:

FROM ubuntu:19.10

WORKDIR /fanotify
COPY . /fanotify
RUN apt-get update && apt-get install -y build-essential
RUN gcc fanotify_example.c -o fanotify_example
CMD ["/fanotify/fanotify_example", "/fanotify"]

The file fanotify_example.c is copied verbatim from the first example on the fanotify(7) man page. Building the image should work, but running the container yields an error:

user@linux:~/fanotify$ docker build -t test_container .
...
user@linux:~/fanotify$ docker run --rm test_container
fanotify_init: Operation not permitted
Press enter key to terminate.

Docker and seccomp

I found this issue on the Docker GitHub page. It describes problems running Chrome in a container without either disabling Chrome’s sandbox or giving the container CAP_SYS_ADMIN. Granting CAP_SYS_ADMIN is not ideal, as it disables some of Docker’s security features. Further down the thread someone shows it working with a custom seccomp profile, but before I got to that I needed to know what seccomp was.

From reading the Docker docs and the Wikipedia page I gathered that seccomp restricts the syscalls available to a given process. This made sense: the default security profile was restricting the container’s access to the fanotify_init syscall. I checked whether seccomp was enabled in my kernel, and it was:

user@linux:~/fanotify$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y

I guess I really should have read the manual first.
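The kernel config only says that seccomp is available, not whether a filter is applied to a given process. Here is a minimal sketch for checking the latter from inside the container. It assumes /proc is mounted and the kernel is 3.8 or newer, which is when the Seccomp field was added to /proc/[pid]/status. The function names are made up for this example.

```c
#include <stdio.h>

/* Parse one line of /proc/<pid>/status. Returns the seccomp mode if the
 * line is the "Seccomp:" field (0 = disabled, 1 = strict, 2 = filter),
 * or -1 for any other line. Hypothetical helper, not a library call. */
int parse_seccomp_line(const char *line)
{
    int mode;

    if (sscanf(line, "Seccomp: %d", &mode) == 1)
        return mode;
    return -1;
}

/* Returns the seccomp mode of the calling process, or -1 if it could
 * not be determined (no /proc, or a kernel without the field). */
int current_seccomp_mode(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int mode = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        mode = parse_seccomp_line(line);
        if (mode >= 0)
            break;
    }
    fclose(f);
    return mode;
}
```

Calling current_seccomp_mode() from a small main and printing the result should show 2 when a filter such as Docker’s default profile is active, and 0 when seccomp is disabled for the process.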

The GitHub issue thread mentions that the Chrome Docker seccomp profile whitelists several syscalls not found in the default profile, including the one I need: fanotify_init. So if we copy that profile and use it to run the container it should work, right? I wanted to be a little smarter than that, so I found the default Docker profile and modified it by moving fanotify_init to the whitelist. Interestingly, fanotify_mark was already whitelisted by default.

First, download the profile:

user@linux:~/fanotify$ wget https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json

Then remove fanotify_init from line 571 and add it after fanotify_mark on line 95. The block surrounding line 571 already hints at what’s going on: the syscalls in that list are only allowed when the container is given the CAP_SYS_ADMIN capability.

If you run the container again with this new profile you can see that…

user@linux:~/fanotify$ docker run --rm \
                                  --security-opt seccomp=default.json \
                                  test_container
fanotify_init: Operation not permitted
Press enter key to terminate.

…it still doesn’t work.

Calling fanotify_init

In fact, not only does adding fanotify_init to the whitelist not work, simply whitelisting everything with the unconfined option doesn’t work either:

user@linux:~/fanotify$ docker run --rm \
                                  --security-opt seccomp=unconfined \
                                  test_container
fanotify_init: Operation not permitted
Press enter key to terminate.

The only thing that works is giving the container CAP_SYS_ADMIN:

user@linux:~/fanotify$ docker run --rm --cap-add=CAP_SYS_ADMIN test_container
Press enter key to terminate.
Listening for events.
Listening for events stopped.
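To see the difference from inside the container, you can look at the process’s effective capability set. This is a minimal sketch, assuming a Linux /proc filesystem. The helper names mask_has_cap and read_effective_caps are made up for this example, and the bit number 21 for CAP_SYS_ADMIN comes from <linux/capability.h>.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* CAP_SYS_ADMIN is capability number 21 in <linux/capability.h>. */
#define CAP_SYS_ADMIN_BIT 21

/* Returns 1 if capability number `cap` is set in `mask`, else 0. */
int mask_has_cap(uint64_t mask, int cap)
{
    return (int)((mask >> cap) & 1);
}

/* Reads the hexadecimal CapEff field from /proc/self/status into *out.
 * Returns 0 on success, -1 if the field could not be read. */
int read_effective_caps(uint64_t *out)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int rc = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "CapEff: %" SCNx64, out) == 1) {
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

If read_effective_caps succeeds, mask_has_cap(caps, CAP_SYS_ADMIN_BIT) tells you whether the process holds the capability. I would expect it to be 0 in a container started with default options and 1 once the capability is added.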

Why?

So why is this happening? Nothing in the example code itself produces a permission error. The message comes from the perror("fanotify_init"); line, which only reports the errno set by the failed syscall, so I wanted to see what was actually failing inside the kernel. The first place I looked was where the fanotify_init syscall is defined (fs/notify/fanotify/fanotify_user.c), and to my surprise the answer was right there on line 776:

/* fanotify syscalls */
SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
{
	/* ... */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;
	/* ... */
}

If the calling process does not have the CAP_SYS_ADMIN capability, fanotify_init refuses to run and returns EPERM (“Operation not permitted”). This check happens in the kernel regardless of the seccomp profile, which is why neither the modified profile nor the unconfined option helped. Unfortunately, it seems that the only way to make fanotify work inside a Docker container is to give the container CAP_SYS_ADMIN.
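A stripped-down probe makes the kernel’s answer easy to inspect without the rest of the man page example. This is only a sketch: probe_fanotify_init and explain_probe are names I made up, and the flags are the ones the man page example uses minus O_LARGEFILE. Note that EPERM alone can’t tell you whether the capability check or a seccomp filter rejected the call, since a filter can be configured to return the same errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Try to create a fanotify group. Returns 0 on success, otherwise the
 * errno value that fanotify_init failed with. */
int probe_fanotify_init(void)
{
    int fd = fanotify_init(FAN_CLOEXEC | FAN_CLASS_CONTENT | FAN_NONBLOCK,
                           O_RDONLY);

    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Map the probe result to a short explanation. Only the cases discussed
 * in this post are spelled out; everything else is lumped together. */
const char *explain_probe(int err)
{
    switch (err) {
    case 0:
        return "fanotify_init succeeded";
    case EPERM:
        return "EPERM: missing CAP_SYS_ADMIN or blocked by a seccomp filter";
    case ENOSYS:
        return "ENOSYS: kernel built without fanotify support";
    default:
        return "other failure";
    }
}
```

Printing explain_probe(probe_fanotify_init()) from a small main should report EPERM in a container started with default options, and success once the container has CAP_SYS_ADMIN.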

This made me uncomfortable, so I looked up the history of CAP_SYS_ADMIN. It turns out I’m not the only one who thinks it is overly permissive. According to this LWN article from 2012, CAP_SYS_ADMIN accounted for 30% of the uses of capabilities at the time. Having one capability responsible for so many calls seems, to my amateur eyes at least, to defeat the purpose of capabilities. The capabilities man page even warns against using CAP_SYS_ADMIN for new features:

       *  Don't choose CAP_SYS_ADMIN if you can possibly avoid it!  A vast
          proportion of existing capability checks are associated with this
          capability (see the partial list above).  It can plausibly be
          called "the new root", since on the one hand, it confers a wide
          range of powers, and on the other hand, its broad scope means that
          this is the capability that is required by many privileged
          programs.  Don't make the problem worse.  The only new features
          that should be associated with CAP_SYS_ADMIN are ones that closely
          match existing uses in that silo.

It’s not ideal, but it looks like I don’t really have a choice. Oh well, at least I learned something.