19 Nov 2024

Docker Containers: Privileged, Unprivileged, Rootless

It all started a couple of months ago when I was working on updating a base image for a Docker container … a Docker container that runs systemd inside 😄💀 Yeah, I know that this is a controversial topic that triggers strong emotions in most developers, but today I won’t touch it. Instead, I’ll talk about something different.

The first thing I realized while trying to make systemd work inside Docker is that ~~you should never run systemd in Docker~~ running systemd inside a privileged container is very different from running it inside an unprivileged one. But wait, what do “privileged” and “unprivileged” even mean (apart from being a keyword in docker-compose.yaml)?

To figure it out I ran several experiments, read the docs, and by the end was shocked at how many misconceptions about Docker, Linux namespaces and process capabilities I had.

Maybe while reading this text you will laugh at how stupid I was, maybe you will discover something new. Who knows? In any case, welcome!

Different Flavours of Docker Installation

Before diving into different ways to run Docker containers (unprivileged, privileged, rootless), let’s talk about different ways to install Docker on Linux.

Docker Engine

One way to think about Docker is that it’s made from three components.

Docker CLI (1). It’s just the docker command binary. It doesn’t do any scary system-level stuff; it’s a thin REST API client talking to the Docker daemon (2), dockerd.

Docker CLI can have plugins, for example, Docker Compose v2 is implemented as a CLI plugin.

Usually Docker CLI is installed on the same host as Docker daemon and they communicate through a unix domain socket. Remember those usermod / newgrp commands that you must Google every time you install Docker? They are needed to give your unprivileged user access to the socket file owned by root.
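To make the “thin REST API client” point concrete, here is a minimal Python sketch that asks the daemon for its version over the Unix socket, roughly what docker version does under the hood. The socket path /var/run/docker.sock is the usual default and an assumption about your setup; /version is an endpoint of the Docker Engine API; the names UnixHTTPConnection and docker_get are made up for this example.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5):
        # The host name is only used for the Host header; no TCP is involved.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.socket_path)


def docker_get(path, socket_path="/var/run/docker.sock"):
    """Send a GET request to the Docker daemon and return the decoded JSON."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

With a running daemon and a user that is allowed to open the socket, docker_get("/version")["Version"] should return the same engine version that docker version prints. Without access to the socket the call fails with an OSError, which is exactly the permission problem that the usermod / newgrp dance solves.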

But it’s not the only possible scenario: the Docker daemon can be exposed over TCP and/or run on a different host. For example, that’s the normal case when you’re using Docker-in-Docker, aka dind: the CLI resides in one container (unprivileged) and the Docker daemon in another (privileged).

Another common case of Docker CLI and daemon on different hosts is Docker Desktop. But we’ll talk about it later.

dockerd knows how to chat with a Docker Registry (e.g. Docker Hub), and delegates the rest of the work to containerd (3). containerd prepares the network and filesystem for the container and invokes runc (“run container”) to execute the process in the containerized environment. runc is usually considered a part of containerd, so we don’t count it as a 4th component.

You can read more about it in the Docker Blog.

One picture is worth a thousand words, so here it is:

Docker Engine

According to the Docker documentation the whole picture above is a “Docker Engine”, but I like to think that only the part inside the dotted box is a “Docker Engine” and CLI is CLI. It just makes more sense to think this way especially when you take Docker Desktop into the picture. From now on I’ll refer to the Docker Engine in this sense.

Third-Party and Official Packages

I will talk about installation from the POV of Ubuntu, but on other distributions the idea is the same.

There are two typical ways to install Docker Engine:

Using the Ubuntu-maintained docker.io package, which pulls in Ubuntu’s own containerd package. Just call apt install docker.io
Using the Docker-maintained docker-ce and docker-ce-cli packages, which come with the containerd.io package (maintained by Containerd team). They can come from Docker’s apt repository or be downloaded and installed manually. Detailed instructions can be found here.

Docker-maintained packages are called “Docker Engine Community Edition” and your docker version will show something like this:

Client: Docker Engine - Community
 Version:           26.1.4
 API version:       1.45
 Go version:        go1.21.11
 Git commit:        5650f9b
 Built:             Wed Jun  5 11:28:57 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.4
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       de5c9cf
  Built:            Wed Jun  5 11:28:57 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Docker installed from Ubuntu-maintained packages will look like this¹:

docker version
Client:
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.21.1
 Git commit:        24.0.7-0ubuntu2~20.04.1
 Built:             Wed Mar 13 20:29:24 2024
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.21.1
  Git commit:       24.0.7-0ubuntu2~20.04.1
  Built:            Wed Mar 13 20:29:24 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.12
  GitCommit:        
 runc:
  Version:          1.1.0+dev
  GitCommit:        v1.1.0-919-g0c5a7353
 docker-init:
  Version:          0.19.0
  GitCommit:

I don’t know of any benefits of using the Ubuntu-maintained packages, but the downside is that they offer a very limited set of versions:

$ apt-cache policy docker.io
docker.io:
  Installed: (none)
  Candidate: 24.0.7-0ubuntu2~20.04.1
  Version table:
     24.0.7-0ubuntu2~20.04.1 500
        500 http://fi.archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages
     20.10.21-0ubuntu1~20.04.2 500
        500 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages
     19.03.8-0ubuntu1 500
        500 http://fi.archive.ubuntu.com/ubuntu focal/universe amd64 Packages

Compare it with the Docker CE repo (I piped the output through awk and sed to make it more compact):

$ apt-cache policy docker-ce | awk '/500$/ {print $0}' | sed -n 's/.*:\([0-9.]*\)-.*~ubuntu.*/\1/p'
27.3.1
27.3.0
27.2.1
27.2.0
27.1.2
27.1.1
27.1.0
27.0.3
27.0.2
27.0.1
26.1.4
26.1.3
26.1.2
26.1.1
26.1.0
26.0.2
26.0.1
26.0.0
25.0.5
25.0.4
25.0.3
25.0.2
25.0.1
25.0.0
24.0.9
24.0.8
24.0.7
24.0.6
24.0.5
24.0.4
24.0.3
24.0.2
24.0.1
24.0.0
23.0.6
23.0.5
23.0.4
23.0.3
23.0.2
23.0.1
23.0.0

If you stay at the tip of the Docker iceberg and your life is like a stock photo of happy people in business suits shaking hands and telling each other that “now the ‘it works on my machine’ problem is solved by Docker” … in that case apt-get install docker.io is enough.

In other cases, especially when you need to care about Docker & Containerd minor versions, it makes sense to install Docker-maintained packages. From now on I’ll assume that we do that.

Docker Desktop on Linux

Containers are Linux. They run on Linux (using Linux Kernel namespaces) and they have Linux inside. At the same time, you know that you can install and use Docker on your Windows or Mac. It works by spawning a virtual machine with Linux. Docker CLI then connects to the Docker Engine inside that VM.

Docker Desktop

Using Docker Desktop is inevitable on Windows and Mac, but not everyone knows that you can also run Docker Desktop on Linux! Yes, it will spawn a VM with Linux on your Linux. But why would we do that?

There could be several possible reasons:

To have the same environment as your colleague with Windows/Mac.
To enhance security without involving more sophisticated userns-remap mode or Rootless Docker.
For plugins. Just look at all these beautiful plugins that come with Docker Desktop installation 😍

Client: Docker Engine - Community
 Version:    27.2.0
 Context:    desktop-linux
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.1-desktop.1
    Path:     /usr/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.1-desktop.1
    Path:     /usr/lib/docker/cli-plugins/docker-compose
  debug: Get a shell into any image or container (Docker Inc.)
    Version:  0.0.34
    Path:     /usr/lib/docker/cli-plugins/docker-debug
  desktop: Docker Desktop commands (Alpha) (Docker Inc.)
    Version:  v0.0.14
    Path:     /usr/lib/docker/cli-plugins/docker-desktop
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.1.2
    Path:     /usr/lib/docker/cli-plugins/docker-dev
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.25
    Path:     /usr/lib/docker/cli-plugins/docker-extension
  feedback: Provide feedback, right in your terminal! (Docker Inc.)
    Version:  v1.0.5
    Path:     /usr/lib/docker/cli-plugins/docker-feedback
  init: Creates Docker-related starter files for your project (Docker Inc.)
    Version:  v1.3.0
    Path:     /usr/lib/docker/cli-plugins/docker-init
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
    Version:  0.6.0
    Path:     /usr/lib/docker/cli-plugins/docker-sbom
  scout: Docker Scout (Docker Inc.)
    Version:  v1.11.0
    Path:     /usr/lib/docker/cli-plugins/docker-scout

Downsides of Docker Desktop on Linux:

The main advantage of containers is that they are more lightweight than VMs. By using Docker Desktop you cancel most of this advantage because a VM is involved anyway.
Unlike Docker Engine CE, Docker Desktop is commercial software, and you may need a paid subscription if you use it for commercial purposes. It might be difficult to convince your boss to buy it when there is a FOSS version that does basically the same thing.
Docker Desktop on Linux involves user ID mapping when working with bind mounts. It might be a deal-breaker in some cases.

Anyway, Docker Desktop on Linux is a curious creature and we’ll take a look at it later.

One thing I need to warn you about is that the official documentation about installing Docker Desktop on Linux is horrible. There are three lengthy articles (general steps, Ubuntu-specific steps and FAQ), and the volume of the text can easily repel you from even trying.

Don’t be afraid:

Add Docker apt repo for Ubuntu.
Make sure that package uidmap is installed and /etc/subuid has some sane values (see later).
Don’t worry too much about whether your Ubuntu version is officially (tm) supported or not. According to the documentation, the only supported version at the moment is 22, but in reality it works fine on 20 and I suspect that with a few tweaks it will work on 23 and 24.
Just download the .deb and install it through apt.

Docker Desktop Launcher

You need to launch Docker Desktop through the UI at least once to do a proper initialization and make it work.

Docker Desktop GUI

In the CLI you should see that your “context” has switched to Docker Desktop.

$ docker context ls
NAME              DESCRIPTION                               DOCKER ENDPOINT                                 ERROR
default           Current DOCKER_HOST based configuration   unix:///var/run/docker.sock                     
desktop-linux *   Docker Desktop                            unix:///home/igor/.docker/desktop/docker.sock

To switch between Docker Engine and Docker Desktop call docker context use <context-name>.

There Is More To Cover

On top of the different ways to install Docker, there are different ways to configure and run it and we’ll dive into that soon! But for now, enough with the installation lore, let’s go to the main topic of the article.

Container Privileges

So, you tell me that there are “privileged” and “unprivileged” containers? Let’s do an experiment and check what privileges (aka capabilities) a process inside a container actually has. Let’s take a look at the Docker Engine CE with the default configuration.

I reused this code in the following sections to test user ID mapping, so I added extra stuff. We’ll create a new user inside the container and touch files on a bind mount as a container root and container user. After that we will sleep forever to make it possible to investigate the process status.

Dockerfile:

FROM alpine:latest

ARG UID=1000
RUN apk add su-exec
RUN adduser --disabled-password  --shell "/bin/sh" --gecos "" --uid "$UID" appuser

docker-compose.yaml ² :

services:
  mapping-test:
    container_name: mapping-test
    build:
      context: .
    command: sh -c "touch /host_mount/hello-root; su-exec appuser touch /host_mount/hello-user; tail -f /dev/null"
    volumes:
      - ./volume:/host_mount

Now let’s run our container with docker compose up -d and make sure that it’s running:

$ docker ps
CONTAINER ID   IMAGE                               COMMAND               CREATED         STATUS         PORTS     NAMES
a20aa66656cd   user-mapping-example-mapping-test   "tail -f /dev/null"   3 minutes ago   Up 3 minutes             mapping-test

Let’s find out the PID of tail -f /dev/null inside the “mapping-test” container:

$ docker top mapping-test
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                14090               14067               0                   20:12               ?                   00:00:00            tail -f /dev/null

Okay, the PID is 14090. Let’s learn more about this process using the /proc filesystem:

/proc/14090/status :

Name:	tail
Umask:	0022
State:	S (sleeping)
Tgid:	14090
Ngid:	0
Pid:	14090
PPid:	14067
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	64
Groups:	0 1 2 3 4 6 10 11 20 26 27 
...
CapInh:	0000000000000000
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
...

This process is running as root (UID=0), but its set of capabilities is limited. The capabilities mask can be deciphered using capsh:

$ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
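If capsh is not at hand, the mask is easy to decode by hand: bit N of the hex value corresponds to capability number N. Below is a small Python sketch of that idea. The list of names follows the numbering in the kernel’s linux/capability.h as I know it (0 = CAP_CHOWN … 40 = CAP_CHECKPOINT_RESTORE); the function name decode_caps is made up for this example, and bits beyond 40 are silently ignored.

```python
# Capability names in bit order (bit 0 first), without the "cap_" prefix.
CAP_NAMES = """
chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod
lease audit_write audit_control setfcap mac_override mac_admin syslog
wake_alarm block_suspend audit_read perfmon bpf checkpoint_restore
""".split()


def decode_caps(mask_hex):
    """Turn a CapEff/CapPrm/CapBnd hex string into a list of capability names."""
    mask = int(mask_hex, 16)
    return [
        "cap_" + name
        for bit, name in enumerate(CAP_NAMES)
        if mask & (1 << bit)
    ]


if __name__ == "__main__":
    print(",".join(decode_caps("00000000a80425fb")))
```

For the mask above this yields the same 14 names as capsh, and cap_sys_admin (bit 21) is not among them.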

The meaning of these flags can be found in man capabilities. For example:

CAP_CHOWN
              Make arbitrary changes to file UIDs and GIDs (see chown(2)).
CAP_FOWNER
              * Bypass  permission  checks on operations that normally require
                the filesystem UID of the process to match the UID of the file
                (e.g., chmod(2), utime(2)), excluding those operations covered
                by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH;
              * set inode flags (see ioctl_iflags(2)) on arbitrary files;
              * set Access Control Lists (ACLs) on arbitrary files;
              * ignore directory sticky bit on file deletion;
              * modify user extended attributes on sticky directory  owned  by
                any user;
              * specify O_NOATIME for arbitrary files in open(2) and fcntl(2).

Already sounds pretty scary, right? And this is the default, “unprivileged” container³ 😄

If we add privileged: true to our service in docker-compose.yaml the CapEff will change to the full house mask 000001ffffffffff with all possible capabilities enabled, including CAP_SYS_ADMIN.

But wait. It can’t be that scary, right? 😇 After all, we know that containers use kernel namespaces for isolation. So, probably this root is a root only inside the container and these capabilities are all but smoke and mirrors, right? 😅 Right? 😅😅😅

Let’s check namespace IDs of the parent shell (runs in the init namespace):

$ realpath /proc/self/ns/{cgroup,mnt,pid,user,ipc,net,uts}
/proc/15191/ns/cgroup:[4026531835]
/proc/15191/ns/mnt:[4026531840]
/proc/15191/ns/pid:[4026531836]
/proc/15191/ns/user:[4026531837]
/proc/15191/ns/ipc:[4026531839]
/proc/15191/ns/net:[4026532000]
/proc/15191/ns/uts:[4026531838]

Now let’s check namespace IDs of the process inside the container.

# realpath /proc/14090/ns/{cgroup,mnt,pid,user,ipc,net,uts}
/proc/14090/ns/cgroup:[4026533050]
/proc/14090/ns/mnt:[4026532931]
/proc/14090/ns/pid:[4026532934]
/proc/14090/ns/user:[4026531837]
/proc/14090/ns/ipc:[4026532933]
/proc/14090/ns/net:[4026532936]
/proc/14090/ns/uts:[4026532932]

WTF?! /proc/14090/ns/user:[4026531837] Don’t tell me that Docker doesn’t use user namespaces 🤦

To sum it up, this root inside the container is the root on the host. And the privileges are privileges for real, not only inside the container.
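The comparison above can be automated. Here is a minimal Python sketch that reads the same /proc/PID/ns/* links and reports which namespaces two processes share. The function names are made up for this example, and reading the links of a process you don’t own generally requires root.

```python
import os

NAMESPACES = ("cgroup", "mnt", "pid", "user", "ipc", "net", "uts")


def namespace_ids(pid):
    """Return {namespace name: link target} for one process, e.g. 'user:[4026531837]'."""
    ids = {}
    for name in NAMESPACES:
        try:
            ids[name] = os.readlink(f"/proc/{pid}/ns/{name}")
        except OSError:
            # Not readable (no permission, no such process, old kernel): skip it.
            continue
    return ids


def shared_namespaces(pid_a, pid_b):
    """Return {namespace name: True if both processes are in the same namespace}."""
    a, b = namespace_ids(pid_a), namespace_ids(pid_b)
    return {name: a[name] == b[name] for name in a if name in b}
```

Called as shared_namespaces("self", 14090) from a root shell on the host, it should report True only for user in the experiment above, and False for everything else.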

With privileged: true it’s more or less equal (security-wise) to giving up on containers altogether and simply running your process on the host as root. With a CapEff of 000001ffffffffff in the initial user namespace, a process inside the container can do whatever it wants, including mounting arbitrary filesystems inside the container and changing host settings.

Even for “unprivileged” containers without CAP_SYS_ADMIN in the set, the filesystem permission checks will let root inside the container access bind mounts on the host as the host’s root:

$ ll volume/
total 8
drwxrwxrwx 2 igor igor 4096 loka    7 11:24 ./
drwxrwxr-x 4 igor igor 4096 loka    7 11:24 ../
-rw-r--r-- 1 root root    0 loka    7 11:24 hello-root
-rw-r--r-- 1 igor igor    0 loka    7 11:24 hello-user

Of course, we don’t have to run processes inside the container as root. The “best practice” is to add USER appuser to the Dockerfile, so the process will run as an unprivileged user. There is also some syscall filtering, so you can cross your fingers and hope that it won’t allow too scary things from inside a container.

But still, the more you think about it, the more it feels “not like advertised on TV”. For example, one of the common post-installation steps is to add yourself to the docker group in order to get access to the Docker socket. After that you can run the docker command without sudo. What prevents you from asking it to run a privileged container with / mounted inside? Right, nothing. In effect, access to the Docker socket is equal to root access to the host system. When you added yourself to the docker group you gave yourself passwordless sudo. It should be written in capital letters next to the post-installation steps, ~~but for some reason it isn’t~~⁴.

Is Docker a scam? Weren’t we promised isolation by namespaces? User namespaces are capable of making root inside the container act like a regular user with zero privileges on the host. Why not use them?

The good news is that it’s possible to make Docker run containers in its own user namespaces. And it’s even possible to use unprivileged user namespaces ⁵. The latter mode is called “Rootless Docker” and containers running on it are referred to as “rootless” containers.

The bad news … well, let’s not hurry.

User Namespaces ID Mapping

This topic is important for everything that we’ll see later in this article, so let’s talk about it before moving forward.

If you learned about user namespaces by reading man user_namespaces or any article focusing on the kernel API, you might be surprised when you start reading the Docker documentation. It all revolves around /etc/subuid, some “subordinate users” and the newuidmap / newgidmap programs.

Did we miss something in our story about user namespaces? No. User IDs are mapped from one namespace to another based only on the /proc/PID/uid_map and /proc/PID/gid_map files. That’s all and it’s really that simple.

Then what is /etc/subuid etc.? To explain I’ll do a recap of how uid_map and gid_map work.

`uid_map` and `gid_map` format

The format of those files is dead simple. It’s a text file with one or more lines. Each line has exactly 3 integers, separated by whitespace.

<ID range start in the child namespace> <ID range start in the parent namespace> <range length>

A few examples of UID mappings:

0 0 4294967295

Maps everything from child to parent 1-to-1. This is e.g. what the “dummy” uid_map and gid_map look like for a process that is already in the topmost (“initial”) user namespace.

0 1000 1

Maps the user 0 (root) inside the namespace to the user 1000 (by a systemd convention, the first non-system user).

0 1000 1
1 10000 65536

Maps the user 0 inside the namespace to the user 1000 outside, user 1 inside to the user 10000, 2 to 10001 and so on.
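The lookup rule behind these examples fits in a few lines of Python. This is only a sketch of the arithmetic, not the kernel’s code: find the range that contains the inner ID and add its offset to the start of the matching outer range. The function names parse_id_map and to_parent are made up for this example.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        ranges.append((inside, outside, length))
    return ranges


def to_parent(inner_id, ranges):
    """Translate an ID inside the namespace to the parent namespace, or None if unmapped."""
    for inside, outside, length in ranges:
        if inside <= inner_id < inside + length:
            return outside + (inner_id - inside)
    return None


if __name__ == "__main__":
    mapping = parse_id_map("0 1000 1\n1 10000 65536")
    for uid in (0, 1, 2):
        print(uid, "->", to_parent(uid, mapping))
```

With the last mapping above, 0 translates to 1000, 1 to 10000 and 2 to 10001, while an ID that falls outside every range has no counterpart in the parent namespace at all.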

Permissions to write UID and GID maps

man user_namespaces documents the rules on who is allowed to write to uid_map and gid_map, when and how. I made (or at least tried to make) a flowchart of these rules:

uid_map write permissions

I can’t guarantee that it’s 100% accurate, but at least it highlights one important thing: to make arbitrary changes, you need “real” privileges (not ones obtained by switching into a new user namespace).

An important exception is the case when you map your own effective user ID: a process may write a single-line mapping from an ID inside the namespace to its own effective user ID outside. This is how the example from my previous article worked. So, ~~One Piece is real~~ it’s possible to start as a user with no privileges at all and end up as root inside the namespace.

Here we should make a side note that this ability is considered too scary by some people. On Ubuntu 24.04 unprivileged user namespaces are disabled by default, and not by a kernel knob, but by an AppArmor profile. It’s not because there’s something wrong with unprivileged user namespaces; it’s “just in case”.

`newuidmap` and `/etc/subuid`

But normally, as I said, you need capabilities to write to those files. One way to get them is to use setuid helpers and that’s what newuidmap and newgidmap are.

newuidmap pid uid loweruid count [uid loweruid count [ ... ]]

— man newuidmap

“uid” = UID inside the namespace
“loweruid” = UID “outside”
“count” = length of the range

If you think about UIDs from the POV of the user namespaces hierarchy (they are nested), the name “loweruid” for the UIDs in the parent namespace feels wrong, because outer namespaces are higher in the hierarchy. This mystery has a simple explanation: you need to think not in terms of namespaces, but in terms of users.

Imagine that you created a user namespace as a user A. Then, using newuidmap and newgidmap you can set a mapping of some ID range inside to some ID range outside. It means that now you can act as those IDs in the outside world, i.e. these users and groups are somehow “subordinates” of user A, or in other words, they are “lower” in rank.

As with every setuid program, newuidmap and newgidmap check whether the action (i.e. writing a particular ID map) is allowed. And this is where /etc/subuid (and its twin brother /etc/subgid) come into play.

/etc/subuid is a policy file that describes the set of “subordinate users” (see above) that each user is allowed to have. Again, this is not a kernel interface, it’s just a policy file for the newuidmap setuid helper.

The policies have the following format:

<login name or UID>:<first subordinate UID in the parent namespace>:<range length>
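To illustrate what “policy file” means here, below is a simplified Python sketch of the kind of check a helper like newuidmap has to make: is the requested outer range fully covered by one of the ranges granted to the user? It is not the real implementation, which has more rules, and a request spanning two adjacent grants is rejected by this sketch. The function names and the sample file content are assumptions for the example.

```python
def parse_subid(text):
    """Parse /etc/subuid-style content into {owner: [(first_id, count), ...]}."""
    grants = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        owner, first_id, count = line.split(":")
        grants.setdefault(owner, []).append((int(first_id), int(count)))
    return grants


def is_allowed(grants, owner, lower_id, count):
    """True if [lower_id, lower_id + count) lies inside a single grant of the owner."""
    return any(
        first_id <= lower_id and lower_id + count <= first_id + length
        for first_id, length in grants.get(owner, [])
    )


if __name__ == "__main__":
    grants = parse_subid("igor:100000:65536\nlfs:165536:65536\n")
    print(is_allowed(grants, "igor", 100000, 65536))  # whole granted range
    print(is_allowed(grants, "igor", 165536, 1))      # belongs to another user
```

The first call is accepted and the second one is refused, because UID 165536 is a subordinate of a different user.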

The values in /etc/subuid should (ideally) be set through usermod:

-v, --add-subuids FIRST-LAST
 Add a range of subordinate UIDs to the user's account.

-V, --del-subuids FIRST-LAST
 Remove a range of subordinate UIDs from the user's account.

-w, --add-subgids FIRST-LAST
 Add a range of subordinate gids to the user's account.

-W, --del-subgids FIRST-LAST
 Remove a range of subordinate gids from the user's account.

… and usermod consults /etc/login.defs in order to check that the requested changes are permitted 😄

The relationship between all these entities can be illustrated by the following diagram:

usermod

Here I should make a note. If you already have the necessary capabilities (i.e. you already run as root), you don’t have to use newuidmap & newgidmap at all: you could just write /proc/PID/uid_map and /proc/PID/gid_map directly, without consulting any policy files. Despite this, Docker prefers to work through these helpers 🤷

Now, after all this theory let’s finally see how to enable user namespaces support in Docker.

userns-remap mode in Docker

userns-remap is a precursor for the Rootless Docker. Everything else works the same (dockerd & containerd still run as root), but containers are in their own user namespaces. In theory it should make things work easier than in Rootless Docker, but in practice it’s ass.

The Docker documentation about userns-remap mode is here. The article is super long and hard to read, but the essence is trivial: just create (if it doesn’t exist) a file /etc/docker/daemon.json with

{"userns-remap":"<your username>"}

and restart Docker 🤷

Ah, as with everything involving user namespaces in Docker, newuidmap and newgidmap must be installed and subordinate users configured. But if you read the previous section you should already be an expert in this topic 😉

So, again, EDITOR=vim sudoedit /etc/docker/daemon.json and put there something similar to:

{
  "userns-remap": "igor"
}
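If daemon.json already exists and contains other settings, you want to add the key rather than overwrite the file. A minimal Python sketch of that merge is below. The function name with_userns_remap is made up for this example, and log-driver in the usage lines is just a stand-in for whatever options you already have.

```python
import json


def with_userns_remap(existing_json, user):
    """Return daemon.json content with "userns-remap" set, keeping other keys intact."""
    config = json.loads(existing_json) if existing_json.strip() else {}
    config["userns-remap"] = user
    return json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    # Empty or missing file: the result contains only the new key.
    print(with_userns_remap("", "igor"))
    # Existing settings are preserved.
    print(with_userns_remap('{"log-driver": "journald"}', "igor"))
```

The sketch only produces the new file content; writing it to /etc/docker/daemon.json still requires root, e.g. through sudoedit as shown above.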

Before restarting Docker let’s (as an experiment) list the available images:

$ docker image ls
REPOSITORY    TAG       IMAGE ID       CREATED         SIZE
ubuntu        24.04     edbfe74c41f8   4 weeks ago     78.1MB
busybox       latest    87ff76f62d36   15 months ago   4.26MB
hello-world   latest    d2c94e258dcb   16 months ago   13.3kB
...

Now call

sudo systemctl restart docker.service

If you did everything right, Docker is now in userns-remap mode. To confirm it, call docker image ls again. The list should be empty (if you are doing it for the first time), because your data directory changed from the default /var/lib/docker to something like /var/lib/docker/100000.100000.

Otherwise, the userns-remap mode is indicated in docker info under “Security Options”:

$ docker info
...
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  userns
...
 Docker Root Dir: /var/lib/docker/100000.100000

Btw, 100000 is the ID of the first “subordinate” user in my /etc/subuid. If you configured your mapping differently, these numbers will be different for you.

$ cat /etc/subuid
igor:100000:65536
lfs:165536:65536

Data directory has curious permissions:

# ls -lA /var/lib/docker
total 48
drwx--x--- 12 root 100000 4096 syys    3 16:19 100000.100000

At this point you might start to feel that something is off. Let’s repeat our experiment. Run docker compose up -d and find the PID of tail -f /dev/null using docker top mapping-test:

# cat /proc/8281/status
Name:	tail
...
Uid:	100000	100000	100000	100000
Gid:	100000	100000	100000	100000
...
CapInh:	0000000000000000
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000

$ sudo cat /proc/8281/uid_map 
         0     100000      65536

So yeah, container root is now visible on the host system as the first subordinate user. File system access was mapped accordingly:

$ ll volume/
...
-rw-r--r-- 1 100000 100000    0 loka    8 19:35 hello-root
-rw-r--r-- 1 101000 101000    0 loka    8 19:35 hello-user

I don’t know what you expected, but I expected root to be mapped to “igor” (the user that we set as the master in /etc/docker/daemon.json) and user 1000 inside the container to be mapped to 101000 or 100999. As we’ll see later, this is how Docker Desktop and Rootless Docker map users. The way userns-remap maps users is weird, and of all the Docker modes that involve user ID mapping it’s the most inconvenient.

I don’t know why Docker chose this particular way to map users in userns-remap mode, especially considering that in this mode dockerd and containerd still run as root and can do whatever they want. For example, they could write /proc/PID/uid_map directly without newuidmap, i.e. Docker could have its own rules for user ID mapping per container.

One possible reason that I can think about is that userns-remap is so unpopular that it’s basically abandoned. For example the documentation is outdated:

Docker uses only the first five mappings [of the /etc/subuid & /etc/subgid], in accordance with the kernel’s limitation of only five entries in /proc/self/uid_map and /proc/self/gid_map.

This claim about a kernel limitation of 5 entries is out of date:

There is a limit on the number of lines in the file. In Linux 4.14 and earlier, this limit was (arbitrarily) set at 5 lines. Since Linux 4.15, the limit is 340 lines. In addition, the number of bytes written to the file must be less than the system page size

— man user_namespaces

Maybe userns-remap was abandoned around kernel 4.14 time 🤷 The sad part is that if it’s true, it’s a vicious circle: userns-remap is unpopular because of the clumsy way it maps users … which is not gonna be fixed because it’s unpopular… But this is all just my theory 😄

After all, there is a simpler and more obvious reason why userns-remap is unpopular. If you have decided to bother with user namespaces and ID mappings, live life to the fullest: run Docker ~~ruthlessly~~ rootlessly!

Rootless Docker

In the rootless mode all components (CLI, dockerd and containerd) run under an unprivileged user. All namespaces are created from the unprivileged user namespace.

The fundamental requirement for Rootless Docker is that unprivileged user namespaces are enabled and allowed.

Let’s start with the installation. The documentation is here. Basically, to be able to run Docker in rootless mode you need the docker-ce-rootless-extras package. If you installed Docker using the convenience script, it will be installed automatically.

Don’t be afraid to install docker-ce-rootless-extras: by default it does nothing, it just adds several scripts that need to be run manually:

docker-ce-rootless-extras_26.1.4-1~ubuntu.24.04~noble_amd64/
├── DEBIAN
│   ├── control
│   └── md5sums
└── usr
    ├── bin
    │   ├── dockerd-rootless-setuptool.sh
    │   ├── dockerd-rootless.sh
    │   ├── rootlesskit
    │   └── rootlesskit-docker-proxy
    └── share
        └── doc
            └── docker-ce-rootless-extras
                └── changelog.Debian.gz

7 directories, 7 files

The description of the package also suggests that either VPNKit or slirp4netns should be installed:

Description: Rootless support for Docker.
  Use dockerd-rootless.sh to run the daemon.
  Use dockerd-rootless-setuptool.sh to setup systemd for dockerd-rootless.sh .
  This package contains RootlessKit, but does not contain VPNKit.
  Either VPNKit or slirp4netns (>= 0.4.0) needs to be installed separately.

In my case slirp4netns was already installed.

The recommended way to start rootless Docker is by calling dockerd-rootless-setuptool.sh install. Don’t hesitate to use the --force option if the script complains about rootful Docker; there’s no harm in running rootless and rootful Docker side by side.

The script should give you the output that ends similarly to this ⁶:

+ systemctl --user enable docker.service
Created symlink /home/igor/.config/systemd/user/default.target.wants/docker.service → /home/igor/.config/systemd/user/docker.service.
[INFO] Installed docker.service successfully.
[INFO] To control docker.service, run: `systemctl --user (start|stop|restart) docker.service`
[INFO] To run docker.service on system startup, run: `sudo loginctl enable-linger igor`

[INFO] Creating CLI context "rootless"
Successfully created context "rootless"
[INFO] Using CLI context "rootless"
Current context is now "rootless"

[INFO] Make sure the following environment variable(s) are set (or add them to ~/.bashrc):
export PATH=/usr/bin:$PATH

[INFO] Some applications may require the following environment variable too:
export DOCKER_HOST=unix:///run/user/1000/docker.sock

Another nice thing is that this script adds docker context for the Rootless Docker, so you will be able to switch between your Docker Engine, Docker Desktop and Rootless Docker:

$ docker context ls
NAME            DESCRIPTION                               DOCKER ENDPOINT                                 ERROR
default         Current DOCKER_HOST based configuration   unix:///var/run/docker.sock                     
desktop-linux   Docker Desktop                            unix:///home/igor/.docker/desktop/docker.sock   
rootless *      Rootless mode                             unix:///run/user/1000/docker.sock

Check docker info | grep "Docker Root Dir". It should show a path in your home directory ⁷:

Docker Root Dir: /home/igor/.local/share/docker

Once again, we repeat our experiment from above. Run docker-compose up -d and take a look at the volume:

$ ll volume/
...
-rw-r--r-- 1 igor   igor      0 loka   23 19:38 hello-root
-rw-r--r-- 1 100999 100999    0 loka   23 19:38 hello-user

Now it’s much better than with userns-remap mode. The root inside the container was mapped to the user igor that I used to start Rootless Docker.

It’s a little bit confusing that ID 1000 inside the container was mapped to 100999, but it makes sense if we remember that ID 0 takes the single slot of the user igor and ID 1 is mapped to 100000, so ID 1000 lands 999 positions further.
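The arithmetic behind both listings can be reproduced with a small sketch. The two maps below are my reconstruction from the file owners shown in this article, not a dump of the real /proc/PID/uid_map files; in particular the range length 65536 is an assumed (though common) default, and the names ROOTLESS_MAP, USERNS_REMAP_MAP and to_host_id are made up for illustration.

```python
from typing import List, Optional, Tuple

# Each entry mirrors one line of /proc/PID/uid_map:
# (first ID inside the namespace, first ID outside, length of the range).
UidMap = List[Tuple[int, int, int]]

# Assumed maps, reconstructed from the listings in this article.
# The range length 65536 is an assumption, not taken from the text.
ROOTLESS_MAP: UidMap = [(0, 1000, 1), (1, 100000, 65536)]
USERNS_REMAP_MAP: UidMap = [(0, 100000, 65536)]


def to_host_id(container_id: int, uid_map: UidMap) -> Optional[int]:
    """Translate an ID seen inside the container to the ID seen on the host."""
    for inside, outside, length in uid_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None  # the ID is not mapped at all


if __name__ == "__main__":
    for cid in (0, 1, 1000):
        print(cid, to_host_id(cid, ROOTLESS_MAP), to_host_id(cid, USERNS_REMAP_MAP))
```

Under these assumptions container ID 1000 becomes 100000 + 999 = 100999 in rootless mode and 100000 + 1000 = 101000 in userns-remap mode, which matches the owners of hello-user in both listings.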

To save you some scrolling, here is how it looks under userns-remap mode:

$ ll volume/
...
-rw-r--r-- 1 100000 100000    0 loka    8 19:35 hello-root
-rw-r--r-- 1 101000 101000    0 loka    8 19:35 hello-user

You can decide for yourself what looks better.

The final thing. Let’s find out the set of capabilities of the process inside the container:

$ docker top mapping-test 
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                19576               19542               0                   20:07               ?                   00:00:00            tail -f /dev/null

and:

$ cat /proc/19576/status
...
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb

Wow. Again, it’s not what I expected to see.

This is basically the capabilities set of the normal “unprivileged” root inside the container running on the default Docker Engine configuration.
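If you don’t read hex masks fluently, here is a minimal sketch that turns a value from /proc/PID/status into capability names. The bit numbers follow include/uapi/linux/capability.h as far as I know them; the function name decode_caps is mine, and bits beyond the table are silently ignored. On systems with libcap installed, capsh --decode=00000000a80425fb should give an equivalent answer.

```python
from typing import List

# Capability names indexed by bit number (include/uapi/linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE",
    "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND",
    "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex: str) -> List[str]:
    """Return the names of all capability bits set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    for cap in decode_caps("00000000a80425fb"):
        print(cap)
```

For 00000000a80425fb this yields 14 names, among them CAP_CHOWN, CAP_FOWNER and CAP_NET_RAW, but not CAP_SYS_ADMIN or CAP_SYS_PTRACE.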

For me it was even a bigger surprise than when I found that Docker by default does not employ user namespaces.

Even scarier are the privileges of the containerd process (the Rootless Docker documentation promises containerd running as an unprivileged user):

Rootless Docker htop

$ cat /proc/19542/status
...
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
...

Have we been scammed again? Did we somehow get a privileged process running under our user?

It’s not that easy this time. Let’s first notice that Rootless Docker processes run in their own user namespace (as promised):

$ realpath /proc/self/ns/user
/proc/20993/ns/user:[4026531837] # Shell
$ realpath /proc/19542/ns/user
/proc/19542/ns/user:[4026532538] # containerd
$ realpath /proc/19576/ns/user
/proc/19576/ns/user:[4026532538] # tail -f

and then move on to the explanation.

Process Capabilities vs. User Namespaces

The source of my confusion with capabilities vs. namespaces was the idea that a process has different sets of capabilities in different user namespaces:

User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs … and capabilities.

In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace …

in other words, the process has full privileges … inside the user namespace, but is unprivileged … outside the namespace.

— man user_namespaces

It tells about different user and group IDs, but not about different capabilities, although you could think that it’s implied by “isolate”. For example in my previous article I wrote:

user namespaces separate not only user and group IDs but also capabilities, i.e. process in one namespace can have one set of capabilities and another set in another

Turns out the statement above is simply wrong 😅 “Isolate” in the context of user namespaces means “apply” or “does not apply”. In other words, it’s impossible for a process to have the capabilities mask 0000001fffffffff in one namespace, 00000000a80425fb in a second, and 0000000000000000 in a third. It’s always the same mask, and on the kernel level it’s stored in the cred field of the task structure:

/* Objective and real subjective task credentials (COW): */
const struct cred __rcu		*real_cred;

— include/linux/sched.h

struct cred has a reference to the “owning” namespace:

struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */

— include/linux/cred.h

but the /proc filesystem does not take it into account when rendering capabilities:

static inline void task_cap(struct seq_file *m, struct task_struct *p)
{
	const struct cred *cred;
	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
			cap_bset, cap_ambient;

	rcu_read_lock();
	cred = __task_cred(p);
	cap_inheritable	= cred->cap_inheritable;
	cap_permitted	= cred->cap_permitted;
	cap_effective	= cred->cap_effective;
	cap_bset	= cred->cap_bset;
	cap_ambient	= cred->cap_ambient;
	rcu_read_unlock();

	render_cap_t(m, "CapInh:\t", &cap_inheritable);
	render_cap_t(m, "CapPrm:\t", &cap_permitted);
	render_cap_t(m, "CapEff:\t", &cap_effective);
	render_cap_t(m, "CapBnd:\t", &cap_bset);
	render_cap_t(m, "CapAmb:\t", &cap_ambient);
}

— fs/proc/array.c

Considering how it all works together:

A process has an “owning” namespace, to which it “belongs”.
Process capabilities apply in this namespace and all child namespaces.
They do not apply in the parent and sibling namespaces.

At least this is my best understanding of the interaction between process capabilities and user namespaces so far 😄 If you have a better one, let me know 🤓
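The three rules above can be written down as a toy model. This is a sketch of my simplified understanding, not of the kernel logic: the namespace names and the PARENT table are hypothetical, and the model deliberately leaves out the additional rule from man user_namespaces that the owner of a namespace, sitting in its parent, has all capabilities in it.

```python
from typing import Dict, Optional

# Hypothetical namespace tree: child -> parent (None marks the initial namespace).
PARENT: Dict[str, Optional[str]] = {
    "init": None,
    "rootless": "init",
    "nested": "rootless",
    "sibling": "init",
}


def is_same_or_descendant(target: str, owning: str) -> bool:
    """Walk from target towards the root and look for the owning namespace."""
    current: Optional[str] = target
    while current is not None:
        if current == owning:
            return True
        current = PARENT[current]
    return False


def capabilities_apply(cap_eff: int, owning_ns: str, target_ns: str) -> bool:
    """Simplified rule: a non-empty mask counts only in the owning namespace
    and in the namespaces below it."""
    return cap_eff != 0 and is_same_or_descendant(target_ns, owning_ns)


if __name__ == "__main__":
    mask = 0xA80425FB  # what /proc/PID/status shows for the container root
    for ns in ("rootless", "nested", "init", "sibling"):
        print(ns, capabilities_apply(mask, "rootless", ns))
```

With the owning namespace set to "rootless", the same mask counts in "rootless" and "nested" but not in "init" or "sibling", even though /proc would print the identical hex value in every case.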

Here’s the diagram that shows if a process is privileged or not. Process box has a solid border in the “owning” namespace and dashed in all others:

Privileged or Unprivileged

So, despite /proc/PID/status showing us non-zero CapEff in Rootless Docker, these capabilities should not work outside the user namespace. It’s not difficult to prove, but this time I “leave it as an exercise for the reader” 😄.

The sad part is that there is no easy way to tell if the capabilities we see in /proc/PID/status are “real” or not. In addition to reading status we need to get the user namespace of the process and find out where this namespace is located in the namespace tree. I think it would be a nice improvement if the /proc filesystem could do these calculations for us. Who knows, maybe it will be you, my dear reader, who will make this improvement.
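The first half of that check is easy to script. The sketch below only compares the user namespace of two processes by reading the same /proc/PID/ns/user links that realpath printed earlier; the helper names are mine. It does not locate a namespace in the tree: for that you would need the NS_GET_PARENT ioctl described in man ioctl_ns, which is not shown here.

```python
import os
import re
from typing import Union

Pid = Union[int, str]  # a numeric PID or the literal "self"


def ns_inode(link_target: str) -> int:
    """Extract the inode number from a link target such as 'user:[4026531837]'."""
    match = re.fullmatch(r"user:\[(\d+)\]", link_target)
    if match is None:
        raise ValueError(f"not a user namespace link: {link_target!r}")
    return int(match.group(1))


def user_ns_of(pid: Pid) -> int:
    """Return the user namespace inode of a process (Linux only)."""
    return ns_inode(os.readlink(f"/proc/{pid}/ns/user"))


def same_user_ns(pid_a: Pid, pid_b: Pid) -> bool:
    """True if both processes live in the same user namespace."""
    return user_ns_of(pid_a) == user_ns_of(pid_b)


if __name__ == "__main__" and os.path.exists("/proc/self/ns/user"):
    # A process always shares a user namespace with itself.
    print(same_user_ns("self", os.getpid()))
```

If same_user_ns("self", PID) returns False for a process that shows a fat CapEff, that is a strong hint the capabilities are relative to some other namespace, and it’s time to look at the tree.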

Docker Desktop, Again

I already wanted to wrap up this article, but suddenly remembered that I promised to show Docker Desktop in action. We already ran through the installation steps earlier, so now we can simply switch the context to Docker Desktop:

$ docker context use desktop-linux
desktop-linux
Current context is now "desktop-linux"

and repeat our test. This time the processes run inside the VM, so there is nothing to show on the host. But check how the file access was mapped in the bind mount:

$ ll volume/
...
-rw-r--r-- 1 igor   igor      0 loka   23 21:32 hello-root
-rw-r--r-- 1 100999 100999    0 loka   23 21:32 hello-user

As you see, it’s the exact same as with Rootless Docker. The rules are taken from the same /etc/subuid, but this time they are applied by something called Virtiofs.

Conclusion

The key takeaways for those who TLDR the article:

Use Docker-maintained packages and not distro-maintained packages if you need flexibility with Docker versions.
By default Docker does not run containers in user namespaces.
Access to the Docker socket is equal to root access to the host system: when you added yourself to the docker group in the post-installation steps you gave yourself passwordless sudo.
It’s possible to force Docker to employ user namespaces either by the userns-remap mode of rootful Docker or by running Docker rootlessly.
Another way to harden security is to use Docker Desktop, a Docker Engine inside a VM. Yes, it’s also possible to run it on Linux.
Docker doesn’t make any app “just work” on the other machine. The host OS and the way Docker is installed and configured⁸ matter.

✨✨✨

I hope you learned something new today or at least this article entertained you 🙏 If you find any errors in the text feel free to drop me an email. You can also say thanks by buying me a coffee.

Sorry for the dev version of runc, I was hacking around. ↩︎
You need to create the volume manually and chmod o+wx it. ↩︎
That said, this “unprivileged” root is a really weird beast. On one hand it’s quite powerful (e.g. CAP_FOWNER), on the other it misses many capabilities of a real root, and sometimes it can be very confusing. For example, the root inside the “unprivileged” container can’t read /proc/PID/environ of other processes if they are started by a non-root (turns out you need CAP_SYS_PTRACE for that). ↩︎
Nowadays they at least give a warning 😄 ↩︎
I wrote about them in Containers and Threads From the POV of clone. ↩︎
Honestly, before trying Rootless Docker I had no idea that systemd can have per-user units. ↩︎
Btw, if you want to configure Rootless Docker, you can do it through ~/.config/docker/daemon.json. ↩︎
And sometimes even the minor versions, but it’s for another article. ↩︎
