
Containers and Threads from the POV of `clone`

This article discusses such notions as process, thread, and container through the prism of the clone system call.

The article is rather educational and intended for a person who is familiar with Unix/Linux system programming concepts in theory but has never dived into them. I assume that the reader understands what an OS kernel and a system call are, and knows how to write and compile a C program.

If you have experience with Unix/Linux system programming or have even done kernel development, you are unlikely to find any new information here. Nevertheless, I would be happy if you read the article, and I would appreciate any feedback and reports of errors in the text.

Introduction

In the old times, when computers weighed several tonnes and consumed dozens of kilowatts of energy, a program usually had direct access to all memory and processor time.

— Old-school computer from the 70s, National Museum of Technology in Warsaw.

Even in those ancient times, several programs usually resided in computer memory (e.g. the loader and the program, or the shell and the program).

Therefore, the need for sharing computation time and system resources between different programs arose very early in computer history, long before the advent of personal computers. Multitasking in early computers required a great deal of "good behavior" from programs, because of their simple and unrestricted access to system resources.

Implementing secure and reliable multitasking required efforts both on the hardware and the software side. One of the most important steps was the introduction of virtual memory, which gives each program its own address space, independent from other processes, and hides the real, "physical" memory layout.

Let's land our TARDIS at some specific point in time and space, let's say Unix circa 1989, and look at the "classical" API it offered back then.

fork & exec

So far I have used the term "program" loosely. A process is an instance of a running program. Multiple processes can coexist at the same time, either by running in parallel on different CPUs or by taking turns on the same CPU.

The OS kernel keeps a list of all processes and their resources (such as open files) and distributes processor time between the currently running processes. To create a new process, an existing process (for example, a shell) has to ask the kernel to do so.

The classical Unix API for creating a new process is the fork system call:

#include <sys/types.h>
#include <unistd.h>

void doChild(void); // child logic, defined elsewhere

int main(int argc, const char* argv[]) {
    pid_t pid = fork();
    if (pid == -1) {
        // Error
    } else if (pid == 0) {
        // We are inside the child process
        doChild();
    } else {
        // We are inside the parent process, we can do other stuff,
        // for example wait for the child to complete
    }
    return 0;
}

fork creates a new process whose memory layout is an exact copy of the calling process. Then fork returns in both processes (the parent and the child) and execution continues in both, the only difference being the returned value.

"Exact copy of memory" is not as scary as it sounds, because this copy is supposed to be a lazy copy-on-write operation. fork is therefore a relatively lightweight call.

The child process has two options for what to do after fork:

— Continue execution of the code that is already loaded into memory, just switching to a special branch that contains the logic for the child.

— Invoke an exec system call that loads a completely different program. In this case, the entry in the kernel process list stays the same, but the whole memory layout is discarded and a new executable image is loaded.

In any case, after the fork, the child and parent have independent memory, isolated from each other, and copies of other attributes, such as the table of open file descriptors.

In some cases such isolation is too much, because the only way for the child and parent to interact and share data is through some form of IPC. Trying to solve this inconvenience leads to the idea of threads.

In other cases this level of isolation is not enough: although processes have isolated memory, they still share e.g. the same view of the file system and network (only the list of open descriptors is independent). Addressing these issues leads to containers.

To provide more control over which attributes of a process are shared and which are not, the Linux kernel introduced a new system call, clone. In modern-day Linux, fork is implemented as a call to clone. Let's see how the fork example would look using the clone system call directly:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <signal.h>

int main(int argc, char* argv[]) {
    int flags = SIGCHLD;
    // On x86-64 the argument order is: flags, stack, parent_tid, child_tid, tls
    pid_t pid = syscall(SYS_clone, flags, NULL, NULL, NULL, NULL);
    if (pid == -1) {
        // Error
    } else if (pid == 0) {
        // We are inside the child process
    } else {
        // We are inside the parent process
    }
    return 0;
}

You can see that it's pretty much the same, except some flags and a bunch of NULLs.

To emulate fork we don't need any flags except SIGCHLD (it specifies which signal is sent to the parent when the child terminates). In this form clone is equivalent to fork and doesn't need any extra parameters, so we leave them as NULLs.

In the following sections, we'll see more interesting examples of flags and what you can achieve with them.

clone and Threads

The idea of splitting an application into several threads is similar to the idea of splitting an application into several processes, except threads benefit from sharing the same memory and other process resources.

From the kernel's point of view, a thread is the same type of entity as a process, and under the hood of a threading library a thread is created by the same clone call.

Consider the following example:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/mman.h>

#include <sched.h>

static int childFunc(void* arg) {
    // We are inside the child process, which shares VM with the parent
    return 0;
}

int main(int argc, char* argv[]) {
    int flags = CLONE_VM; // We want parent and child to share virtual memory

    const int STACK_SIZE = 1024 * 1024;
    char* stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    char* stackTop = stack + STACK_SIZE; // The stack grows downward on most architectures

    pid_t pid = clone(childFunc, stackTop, flags, NULL);
    if (pid == -1) {
        // Error
    } else {
        // Child created
    }
    return 0;
}

Whoa, whoa! You probably have questions 😂 Where is "the same clone call" that I promised? We use some clone function, and it doesn't seem to fork the process but takes a callback!

No panic.

Normally when we invoke a system call, we do it through a libc wrapper function. Most of the wrappers are thin and exist only for convenience, for example instead of syscall(SYS_getpid) we use getpid().

In the previous and next sections, I don't use a clone wrapper and invoke syscall directly, because I want to show that clone behaves exactly like fork.

But in this section, we have to switch to a glibc wrapper because we want to do something more tricky than just creating a child process. We create a child that shares memory with the parent!

Do you remember that after fork (i.e. clone) both processes return at the same point? When you don't have shared memory it's fine, because the two control flows can't interact with each other. Things become more difficult when the memory is shared.

For example, consider the line pid_t pid = clone(...). In the parent clone returns the PID of the child and in the child it returns 0. But since both processes share the same memory a write to pid in the child will affect pid in the parent, so how is it supposed to work?

To make it work, the raw clone system call requires (if you use CLONE_VM) a pointer to a new stack that the caller must allocate before the call. In the example from the previous section (when we emulated fork), it was one of the NULLs in syscall(SYS_clone, flags, NULL, NULL, NULL, NULL).

When the clone syscall returns in the child, the stack pointer register points to a different region of memory than in the parent, and therefore both processes can coexist without breaking each other.

The only trouble with this approach is that one does not simply walk into Mordor to set up a new stack. High-level languages like C don't offer the necessary facilities, such as the ability to adjust CPU registers. That's why the glibc wrapper exists: it's written in assembly and does the necessary stack adjustments for us.

If you are interested in how the clone wrapper is implemented, you can check out the code; it's not difficult.

I'm sorry for the lengthy detour; the point was to show that there's no hidden magic, and glibc clone is still the same good old fork-like clone behind the scenes.


Now that we've clarified these details, let's try to put some meat on the bones and experiment with the example above.

The following code checks whether the modification of a variable in the parent process affects the child. Option -m adds CLONE_VM to the list of clone flags.

Let's try it first without sharing virtual memory:

./clone_vm
Parent PID: 7426
Child PID: 7427
Global variable (first read): 0
Parent, setting variable
Global variable (second read): 0
Child Bye!
child 7427 exited with exit status 0
Parent Bye!

As expected, the child and parent don't share memory, and the modification of the global variable by the parent is invisible in the child.

The -m flag forces clone to share virtual memory between the processes:

./clone_vm -m
Parent PID: 7451
Child PID: 7452
Global variable (first read): 0
Parent, setting variable
Global variable (second read): 42
Child Bye!
child 7452 exited with exit status 0
Parent Bye!

We can see that now when the parent modifies a global variable the child can see the modification.

It resembles threads, but something feels wrong… First of all, we can see that the parent and child still have different PIDs. With "real" pthreads, getpid() must return the same PID for every thread.

Second, a careful reader may notice that the parent and child each got their own copies of the pipe end descriptors (notice that we need to close the write ends of the pipes both in the parent and in the child, otherwise we wouldn't see EOF).

If we dug deeper we would discover more differences, for example, separate signal handlers and separate current and root directories. Sharing these attributes (except PIDs) can be enabled by passing more flags to clone, like this:

flags |= CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND;

This is more or less how the old-school Linux threading implementation called LinuxThreads worked. In LinuxThreads, threads had separate PIDs and in many other aspects behaved as independent processes.

The "real" (POSIX) threads specification requires threads to look and feel like "parts" of the parent process, not like individual entities.

To make the child behave as a thread, we would need to add CLONE_THREAD to the above list of flags (note that CLONE_THREAD requires CLONE_SIGHAND, which in turn requires CLONE_VM). It makes getpid() return the same PID in the child and the parent (and getppid() the same parent PID):

./clone_vm -m
Parent PID: 7530
Child PID: 7530
Global variable (first read): 0
Parent, setting variable
Global variable (second read): 42
Child Bye!

Yay! 🎉 Can we now go to school and brag that we implemented our own threading library? 😄

Unfortunately no.

  • Specifying CLONE_THREAD makes the child unwaitable, and no signal is sent upon its termination. If we wanted to implement our own threading library, we would need to track child termination through some other mechanism (in the NPTL implementation it's AFAIK done with futexes).
  • We would need to set up thread-local storage for each thread.
  • We would need to keep the threads' user and group IDs synchronized (in NPTL it's AFAIK done with signals, see man 7 nptl).
  • Probably a ton of other stuff that I'm not aware of.

But if we close our eyes to these technical details, threads are the same type of kernel scheduler entity as processes and are created by the same clone system call.

Small Note on fork & exec

Before we move on to the next topic, I'd like to make a small note. Earlier I mentioned:

The child process has two options for what to do after fork:

— Continue execution of the code that is already loaded into memory, just switching to a special branch that contains the logic for the child.

Indeed, if you read "classical" Unix books you might see examples of such code. For example, a TCP server accepts a connection, forks itself, and then serves the client in a subroutine.

Unfortunately, in the modern world of runtimes and multithreaded applications this approach is not a good idea. The only 100% safe option is to call execve shortly after the fork.

One of the reasons is that when a process forks itself, only the thread that invoked fork is duplicated. It's logical if you remember that threads are "processes in disguise". As a consequence, if your application uses worker threads (directly, or through a runtime or a third-party library), it has a very good chance of ending up in a broken state: for example, a mutex locked by another thread at the moment of fork will never be unlocked in the child.

Linux man 2 fork allows the child of a multithreaded application to use only async-signal-safe functions until it calls execve.

macOS man 2 fork even suggests calling execve with the same executable as the parent to avoid possible caveats. So if you want to be portable and safe, it may be better to do so.

I'm not trying to scare you, just to warn you 😅 Okay, let's move on to the next topic!

clone and Kernel Namespaces

Apart from virtual memory, classical Unix processes did not have much isolation of other system resources.

For example, all processes in the OS share the same users and groups. All processes see the same file system tree (although some parts may be restricted by access permissions). All processes access the network through the same list of adapters and have access to the same protocols, routing tables, and firewall rules.

To change this situation, Linux introduced the concept of namespaces. There are (at the moment I'm writing this article) seven different types of namespaces, each allowing "lightweight virtualization" of a certain type of resource.

One way to put a process into a separate namespace is the good old clone with one of the CLONE_NEW* flags. Each flag corresponds to one of the namespace types.

TL;DR:

  • CLONE_NEWNS Mount namespaces. Allows a process to have its own file system tree with its own mounts, visible only inside the namespace.
  • CLONE_NEWUSER User namespaces. Allows a process to be root inside a container.
  • CLONE_NEWPID PID namespaces. Allows processes inside a namespace to use their own range of process IDs, e.g. the first process in the namespace can get PID = 1 with all the consequences.
  • CLONE_NEWNET Network namespaces. Provides isolation of network devices, stacks, ports, etc.
  • CLONE_NEWIPC IPC namespaces. Provides isolation of IPC objects that are not represented by filesystem pathnames (System V IPC objects and POSIX message queues).
  • CLONE_NEWUTS UTS namespaces. Allows a process to use a hostname and NIS domain name different from the ones used by the global namespace.
  • CLONE_NEWCGROUP Cgroup namespaces. Cgroups are a mechanism for setting quotas and throttling system resource consumption. They are not often visible from a user perspective, but they are essential for serious container platforms like Kubernetes.

For example, let's see how to clone a process with a new PID namespace:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <signal.h>
#include <sched.h>

// waitChild() is a small helper around waitpid, omitted for brevity

static int _clone(int flags) {
    return syscall(SYS_clone, flags, NULL, NULL, NULL, NULL);
}

int main(int argc, char* argv[]) {
    int pid = _clone(SIGCHLD | CLONE_NEWPID);
    if (pid == -1) {
        // Error
    } else if (pid == 0) {
        // Child
        printf("Child, PID: %d\n", getpid());
        exit(EXIT_SUCCESS);
    } else {
        // Parent
        printf("Parent, Child PID: %d\n", pid);
        waitChild(pid);
    }
    return 0;
}

Output:

sudo ./clone_container 
[sudo] password for igor: 
Parent, Child PID: 17115
Child, PID: 1
child 17115 exited with exit status 0

We can see that getpid() in the child process returned 1, while for the parent the child is identified by PID = 17115.

As you noticed, we used sudo. That's because to clone a process with one of the namespace flags, the calling process must hold the CAP_SYS_ADMIN capability (in plain English: you must run it as root).

User namespaces provide a loophole to this rule; we'll talk about it in the next section.

Before we continue I want to make several notes:

— For security reasons, most of the thread-related clone flags (e.g. CLONE_FILES) are incompatible with the namespace-related flags (e.g. CLONE_NEWNS). It's logical if you consider that they have opposite purposes: to share vs. to isolate.

— From this point on, I'll use the term "container" loosely for a process (or processes) that uses a namespace for any resource. A container from the point of view of a container engine like Docker or Podman might be something more complicated than just processes inside a namespace. We'll discuss it in What's Next.

— Joining a namespace affects how a process sees itself and the global resources, compared to how it is visible to processes outside of that namespace (consider the PID namespace example above). To distinguish between these two points of view I'll use the terms "inside" and "outside" of a container.

User Namespaces

User namespaces are important because they influence the behavior of other namespaces and the permissions of a process.

On the surface, their goal is simple: to have an independent namespace for user and group IDs, i.e. the ability to allocate user IDs independently from the host ones, pretty much like we can have our own PID 1 inside a PID namespace.

This simplicity is deceptive:

— In Unix, a user with ID 0 is a superuser and by Unix tradition is automatically granted all capabilities. To solve this issue, user namespaces separate not only user and group IDs but also capabilities, i.e. a process can have one set of capabilities in one namespace and a different set in another.

— File system permissions in Unix are based on user and group IDs. A process that creates a new user namespace should establish a mapping between host and container users and groups. When a process inside a container reads or writes files, the ownership IDs are translated by the kernel using those mappings. User namespaces in this aspect intertwine with mount namespaces.

— All other types of namespaces (mount, network, etc.) are owned by user namespaces, and this ownership is used to resolve permissions over the resources managed by those other namespaces.

— User namespaces are nested. There's an "initial" or "root" user namespace. Permissions for system resources that are not managed by any namespace type (e.g. the system time) are determined by the capabilities that the process has in that "root" namespace. Moreover, despite mount points being managed by mount namespaces, permission to mount a block device is also determined by the "initial" namespace.

Another important property of user namespaces is that they allow unprivileged users to create other namespaces (normally it requires CAP_SYS_ADMIN):

  1. You can create a user namespace as an unprivileged user (if the kernel supports it and it's not explicitly forbidden by the system administrator).
  2. You become root inside that user namespace.
  3. You can now create other namespaces, because you are root and hold CAP_SYS_ADMIN there.

The steps above don't have to be performed one by one: if the combination of clone flags includes CLONE_NEWUSER, the kernel guarantees that the user namespace is created first.

Now we're ready to see a complete example of creating a user namespace (I omit the code for synchronizing the parent and the child):

static int _clone(int flags) {
    return syscall(SYS_clone, flags, NULL, NULL, NULL, NULL);
}

int main(int argc, char* argv[]) {
    int pid = _clone(SIGCHLD | CLONE_NEWUSER | CLONE_NEWPID);
    if (pid == -1) {
        // Error
        printf("clone returned -1\n");
        exit(EXIT_FAILURE); 
    } else if (pid == 0) {
        // Child
        // TODO: Wait for the parent to close write end of the pipe

        printf("Child, PID: %d\n", getpid());
        printf("Child, UID: %d\n", getuid());
        
        exit(EXIT_SUCCESS);
    } else {
        printf("Parent, Child PID: %d\n", pid);
        printf("Parent, UID: %d\n", getuid());
        
        updateUIDmap(pid);
        updateGIDmap(pid);

        // TODO: Signal child that maps are updated

        waitChild(pid);
    }
    return 0;
}

Output:

./user_namespace 
Parent, Child PID: 20859
Parent, UID: 1000
Child, PID: 1
Child, UID: 0
Child 20859 exited with exit status 0

As we can see:

  1. The child’s UID is 0 inside the container.
  2. We don't need sudo anymore to create a PID namespace 🎉

updateUIDmap and updateGIDmap can be implemented as follows (error handling omitted):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/types.h>

#define MAP_BUFF_SIZE 64

static void writeFile(const char* fileName, const char* content) {
    int fd = open(fileName, O_RDWR);
    write(fd, content, strlen(content));
    close(fd);
}

static void updateUIDmap(pid_t pid) {
    char path[PATH_MAX] = {0};
    snprintf(path, PATH_MAX, "/proc/%d/uid_map", pid);

    char userMap[MAP_BUFF_SIZE] = {0};
    snprintf(userMap, MAP_BUFF_SIZE, "0 %d 1", getuid());

    writeFile(path, userMap);
}

// See man 7 user_namespaces
static void denySetgroups(pid_t pid) {
    char path[PATH_MAX] = {0};
    snprintf(path, PATH_MAX, "/proc/%d/setgroups", pid);

    writeFile(path, "deny");
}

static void updateGIDmap(pid_t pid) {
    denySetgroups(pid);

    char path[PATH_MAX] = {0};
    snprintf(path, PATH_MAX, "/proc/%d/gid_map", pid);

    char groupMap[MAP_BUFF_SIZE] = {0};
    snprintf(groupMap, MAP_BUFF_SIZE, "0 %d 1", getgid());
    writeFile(path, groupMap);
}

The full code lives here.

Mount Namespaces

When I started to write the previous section, I initially wanted to say that user namespaces are the trickiest of them all. But then I considered mount namespaces and changed my mind 😄

As expected, to create a new mount namespace we need to clone with a certain flag. The name of the flag is CLONE_NEWNS (not CLONE_NEWMOUNT, as you might expect). The reason is historical: mount namespaces were implemented before all the others and used to be the only namespace type.

I mentioned that mount namespaces provide the ability to have an isolated FS tree with its own list of mounts, but achieving it in practice is not as straightforward as it may look:

  • When we clone a process with CLONE_NEWNS, the new mount namespace contains a copy of the parent's mounts.
  • By default, mount and unmount events are propagated between mount namespaces, so if we need any isolation we'll have to disable the propagation manually.
  • If we used the user namespace loophole to create a mount namespace as an unprivileged user, we won't be able to unmount anything inherited.
  • In the latter case we also won't be able to mount anything, except creating bind mounts of directories or mounting a few memory-only FS types. In other words, being a "container root" doesn't grant much power over mounts.

In practice, to achieve working FS isolation, containers use the pivot_root API, which makes an existing directory the root of the file system. This API has two main use cases:

  • Setting up the root directory after the system boot.
  • Setting up a root directory for a container.

Long story short, switching to an isolated FS would need the following (as usual, unrelated pieces are skipped):

static int pivot_root(const char *new_root, const char *put_old) {
    return syscall(SYS_pivot_root, new_root, put_old);
}

static void doPivotRoot(const char* newRoot) {
    const char *oldRoot = "/old_root";

    // Recursively change mount propagation to "private"
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

    // Make sure that the new root is a mount point
    mount(newRoot, newRoot, NULL, MS_BIND, NULL);

    // Create directory to which old root will be pivoted
    char path[PATH_MAX] = {0};
    snprintf(path, sizeof(path), "%s/%s", newRoot, oldRoot);
    mkdir(path, 0777);

    pivot_root(newRoot, path);

    chdir("/");
    umount2(oldRoot, MNT_DETACH);
    rmdir(oldRoot);
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <new-root>\n", argv[0]);
        return 1;
    }

    const char* newRoot = argv[1];
    fprintf(stdout, "New root: %s\n", newRoot);

    int pid = _clone(SIGCHLD | CLONE_NEWNS);
    if (pid == -1) {
        // Error
        fprintf(stderr, "clone returned -1\n");
        exit(EXIT_FAILURE); 
    } else if (pid == 0) {
        // Child
        printf("Child, PID: %d\n", getpid());

        doPivotRoot(newRoot);

        // Execute a shell inside a container (we assume the new FS has it)
        char* newExe = "/bin/sh";
        char* newArgs[] = {newExe, NULL};
        execv(newExe, newArgs);
        
        fprintf(stderr, "execv failed\n");
        exit(EXIT_FAILURE);
    } else {
        printf("Parent, Child PID: %d\n", pid);
        waitChild(pid);
    }
    return 0;
}

The full code is here.

For testing the code above, let's cheat and use a file system from a ready-made busybox Docker image. The following command downloads and extracts the busybox file system into a local folder:

mkdir -p busybox
docker export $(docker create busybox) | tar -C busybox -xvf -

Now we can invoke our example:

sudo ./mount_namespace_sudo busybox/
New root: busybox/
Parent, Child PID: 20758
Child, PID: 20758
/ # ls
bin    dev    etc    home   lib    lib64  proc   root   sys    tmp    usr    var
/ # exit

To make this example work as an unprivileged user, we would need to combine it with the user namespace example from the previous section.
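A sketch of what would change (an assumption on my part about how the pieces fit together; the synchronization and the ID-map updates stay the same as in the user namespace example):

```c
// Rootless variant: CLONE_NEWUSER guarantees the user namespace is
// created first, so the process gets CAP_SYS_ADMIN over the new
// mount namespace without needing sudo
int pid = _clone(SIGCHLD | CLONE_NEWUSER | CLONE_NEWNS);

// Parent: write uid_map/gid_map as before, then signal the child.
// Child: wait until the maps are written, then call doPivotRoot(newRoot)
// as "container root".
```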

What's Next?

We just saw an example of isolating a process with user and mount namespaces. Can we say that we implemented our own container?

The answer is not as simple as with threads. The reason is that, unlike POSIX threads, there's no single standard or definition of what a container is. The kernel does not have a concept of a container either. To this day, it's up to the implementation to decide what to call a container, what isolation means, and how to use it.

Other things to note:

  • Namespaces are not the only mechanism for process isolation; e.g. runc can optionally filter a container's system calls using seccomp (which is based on BPF).
  • Containerized applications usually require bundling of the root FS and configuration (i.e. making images). I haven't touched on this topic at all, but for a real-world container engine like Docker or LXC it's an essential matter.

Further Reading

— man 2 clone, man 7 namespaces, man 7 user_namespaces, man 7 mount_namespaces, etc.

— Michael Kerrisk's series of articles on namespaces, with examples in C. It dates back to 2013, but after reading the Red Hat series and a few others with a similar idea I can say that people haven't moved too far from it. A true gem, I highly recommend it.

— The Red Hat corporate blog's series of articles "Building a container by hand using namespaces" (using shell commands like unshare).

— A series of articles "Unprivileged containers on Go" from one of the ex-maintainers of runc. It shows the use of network namespaces, which is another curious topic that I skipped in this article.

— For a glimpse of the present and future of Linux namespaces, I recommend taking a look at Christian Brauner's personal blog (he is one of the LXC maintainers).

Conclusion

Try the APIs mentioned in this article yourself if you want to understand them. Namespaces are a topic that seems clear and logical after reading the manual, but after trying to use them in practice you may discover that you had understood them completely wrong.

I'm not a kernel developer and am still learning about the Linux kernel and Linux APIs myself. There might be follow-up articles and amendments to this one.

Thank you for reading! If you have feedback or notice any errors in the text, you are welcome to contact me by email.

If you find this article useful, you can buy me a coffee.