Rootless containers with systemd

I decided to attend the last Yunohost devcamp in Paris last August to learn how to package Zusam for it.
It was one week long, and I didn't know beforehand that I would spend most of it trying to make Yunohost work in an unprivileged container. I learnt a lot during that process: here's the full story.

The issue

In order to test the packaging of Zusam in Yunohost, I needed to run Yunohost on my laptop.
The solution proposed by the Yunohost team is ynh-dev, which uses LXC in the background.
I had never used LXC before and didn't want to learn it right away, so I decided to see if it was possible to run Yunohost in a docker container. Luckily, someone had already made a docker image for it that I could use as a starting point. The issue is that this image has to be run with --privileged, which is a bad thing: it basically means that the process running inside has root rights over the host.
This was out of the question for me, so I started to look for an answer.

First try with docker

Yunohost relies on systemd (and Debian) to work. The difficulty lies in getting systemd to work correctly inside a docker container.
If you try the following dockerfile:

FROM debian:9.9

RUN set -xe \
    && export DEBIAN_FRONTEND=noninteractive \
    && apt-get update \
    && apt-get install -y --no-install-recommends systemd \
    && rm -rf /var/lib/apt/lists/*

CMD ["/bin/systemd"]

You will get the following error:

Failed to mount tmpfs at /run: Operation not permitted
Failed to mount tmpfs at /run/lock: Operation not permitted
[!!!!!!] Failed to mount API filesystems.
Freezing execution.

This is because systemd needs a certain execution environment, which is described here. In our case, /run and /run/lock are not mounted as tmpfs, and since the container is not started with --privileged, systemd does not have the capability to mount them itself.
We could do some work and adapt the command starting the container to be systemd-friendly, but there's an easier solution: podman!
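For the curious, the mount requirement described above can be checked from a shell inside a container. This is only a minimal sketch: check_tmpfs_mounts is a hypothetical helper name, and it assumes the usual /proc/mounts column layout (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# check_tmpfs_mounts [MOUNTS_FILE]
# Reports whether /run and /run/lock are tmpfs mounts (defaults to /proc/mounts).
# Returns non-zero if at least one of them is not.
check_tmpfs_mounts() {
    mounts_file="${1:-/proc/mounts}"
    rc=0
    for target in /run /run/lock; do
        # Column 2 is the mount point, column 3 the filesystem type.
        if awk -v t="$target" \
            '$2 == t && $3 == "tmpfs" { found = 1 } END { exit !found }' \
            "$mounts_file"; then
            echo "$target: tmpfs"
        else
            echo "$target: not tmpfs"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling check_tmpfs_mounts with no argument reads the live mount table; passing a file makes it easy to try against a saved copy.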

Enter podman

Podman is a container engine following the OCI standard. It is developed by Red Hat along with buildah and skopeo, and lets you run containers as an unprivileged user.
Even better, it makes it easy to run systemd in the container.

Podman will be available on CentOS 8 (out soon!) but we can also install it on Debian. I'm following podman's install guide.
Start by installing the requirements:

sudo apt-get install \
    btrfs-tools \
    git \
    go-md2man \
    golang-go \
    iptables \
    libassuan-dev \
    libc6-dev \
    libdevmapper-dev \
    libglib2.0-dev \
    libgpg-error-dev \
    libgpgme-dev \
    libostree-dev \
    libprotobuf-c-dev \
    libprotobuf-dev \
    libseccomp-dev \
    libselinux1-dev \
    libsystemd-dev \
    pkg-config \
    runc \
    software-properties-common \
    uidmap

Installing runc will remove docker if previously installed. You can revert to docker by reinstalling it:

sudo apt-get install docker-ce docker-ce-cli containerd.io

(and then switch back to podman by reinstalling runc)
Then we add the podman PPA by creating the file /etc/apt/sources.list.d/projectatomic-ubuntu-ppa-bionic.list:

deb http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
# deb-src http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main

Import the signing key of the PPA:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 8BECF1637AD8C79D

Install buildah and podman:

sudo apt install buildah podman

Next, we want to add the docker registry to podman (which is wired to the fedora one by default). Create the file /etc/containers/registries.conf:

# This is a system-wide configuration file used to
# keep track of registries for various container backends.
# It adheres to TOML format and does not support recursive
# lists of registries.

# The default location for this configuration file is /etc/containers/registries.conf.

# The only valid categories are: 'registries.search', 'registries.insecure', 
# and 'registries.block'.

[registries.search]
registries = ['docker.io', 'quay.io', 'registry.fedoraproject.org', 'registry.access.redhat.com']
#registries = ['registry.access.redhat.com']

# If you need to access insecure registries, add the registry's fully-qualified name.
# An insecure registry is one that does not have a valid SSL certificate or only does HTTP.
[registries.insecure]
registries = []


# If you need to block pull access from a registry, uncomment the section below
# and add the registries fully-qualified name.
#
# Docker only
[registries.block]
registries = []

This way, we can access docker.io, quay.io, registry.fedoraproject.org and registry.access.redhat.com registries.

Now we can start using podman and buildah, but we have to make a choice: should we execute them as root or not?
I'm going to start with the root option since that's how I started, but we'll see the unprivileged one later.

Let's rebuild our image using buildah and run it with podman:

sudo buildah bud --format=docker -f Dockerfile -t systemd .
sudo podman run -it --rm --name systemd systemd

It starts fine! To stop the container (which also removes it, thanks to --rm), open a new terminal and run sudo podman stop systemd.

Let's run Yunohost

Tiny disclaimer

You will soon see that the container is not fully functional and that I'm only doing some very light testing.
The goal for me was to have a dev environment to package my app, which is a simple PHP app.
I'm not sure if this container (and the following ones) can be used to extensively test Yunohost, let alone run it in production!

Back to our container

Now we will write a dockerfile for Yunohost:

FROM debian:9.9

RUN set -xe \
    && export DEBIAN_FRONTEND=noninteractive \
    # avoid relinking /etc/resolv.conf
    && echo "resolvconf resolvconf/linkify-resolvconf boolean false" | debconf-set-selections \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates \
        curl \
        systemd \
    && rm -rf /var/lib/apt/lists/*

ARG INSTALL_TYPE=unstable
RUN set -xe \
    && curl -sL https://install.yunohost.org | bash -s -- -a -d ${INSTALL_TYPE}

CMD ["/bin/systemd"]

Yunohost is not yet compatible with buster, so we're using stretch. I'm using the unstable version of Yunohost because some things were adapted live by the Yunohost team to help me in my endeavor.
The trick about resolv.conf comes from the docker image created by aymhce. /etc/resolv.conf is mounted as tmpfs and this trick allows resolvconf to be installed.

Another thing: Yunohost stores executables in /tmp and executes them from there, but podman mounts /tmp as tmpfs with the noexec flag. The workaround I found is to create a volume and mount it over /tmp.
Yunohost also uses iptables to manipulate the network, so we need to add the NET_ADMIN capability to the container.
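To see the noexec flag for yourself, a small helper like the one below can be run from a shell inside the container. It is a sketch under assumptions: has_mount_option is a hypothetical name, and it simply reads the options column of /proc/mounts.

```shell
#!/bin/sh
# has_mount_option MOUNTPOINT OPTION [MOUNTS_FILE]
# Succeeds if the last mount listed for MOUNTPOINT carries OPTION.
has_mount_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            # A later mount on the same path shadows earlier ones, so reset.
            found = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit !found }
    ' "${3:-/proc/mounts}"
}

# Example: report how /tmp is mounted in the current environment.
if has_mount_option /tmp noexec; then
    echo "/tmp is mounted noexec"
else
    echo "/tmp is not mounted noexec"
fi
```

Options are compared as whole words, so asking for exec does not accidentally match noexec.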

Let's build the image, create the volume and run the container:

sudo buildah bud --format=docker -f Dockerfile -t ynh .
sudo podman volume create --opt type=tmpfs --opt device=local --opt o=rw,exec tmp_ynh
sudo podman run -it --rm --name ynh --net=host -h x250.home --cap-add=NET_ADMIN --mount 'type=volume,source=tmp_ynh,target=/tmp' ynh

Note that I'm defining x250.home as the hostname for the container: this will be useful for Yunohost management later (you can set it to whatever you want, just make sure that it contains a dot and that you add it to your /etc/hosts).
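The /etc/hosts step can be scripted. The following is a minimal sketch, not part of Yunohost or podman: ensure_hosts_entry is a hypothetical helper, and mapping the name to 127.0.0.1 is an assumption that only makes sense because the container is started with --net=host.

```shell
#!/bin/sh
# ensure_hosts_entry NAME [HOSTS_FILE]
# Appends "127.0.0.1 NAME" to HOSTS_FILE unless NAME is already listed.
ensure_hosts_entry() {
    name="$1"
    hosts_file="${2:-/etc/hosts}"

    # The hostname has to contain a dot.
    case "$name" in
        *.*) ;;
        *) echo "'$name' does not contain a dot" >&2; return 1 ;;
    esac

    # Look for NAME among the host names (fields 2 and up) of non-comment lines.
    if awk -v n="$name" '
        $1 ~ /^#/ { next }
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break      # trailing comment: stop scanning
                if ($i == n) found = 1
            }
        }
        END { exit !found }
    ' "$hosts_file"; then
        return 0
    fi

    # Assumption: with --net=host the container shares the host's loopback.
    printf '127.0.0.1\t%s\n' "$name" >> "$hosts_file"
}
```

Run as root, ensure_hosts_entry x250.home adds the entry once and does nothing on later calls.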

The container is now running.

If it fails to start the cgroup proxy service, carry on, it should be fine:

[FAILED] Failed to start Cgroup management proxy.
See 'systemctl status cgproxy.service' for details.

I'm doing some very light testing to see if the container is kinda OK (the first command opens a shell in the container, the others are run inside it):

sudo podman exec -it ynh bash
yunohost tools postinstall -d $(cat /etc/hostname) -p Yunohost
yunohost user create yuno -f yuno -l yuno -m yuno@$(cat /etc/hostname) -p Yunohost
yunohost app install wordpress --args "domain=$(cat /etc/hostname)&path=/blog&admin=yuno&language=en_US&multisite=1&is_public=1"

During the postinstall, slapd may fail, but that didn't seem to be an issue for my limited testing:

Warning: Job for slapd.service failed because the control process exited with error code.
Warning: See "systemctl status slapd.service" and "journalctl -xe" for details.
Error: Script execution failed: /usr/share/yunohost/hooks/conf_regen/06-slapd

Going further

User namespaces

So this is all well and good, but there's still an issue.
You see, the container we just ran is not really unprivileged. To be frank, it depends on your definition of unprivileged for a container.
At first I thought that unprivileged was simply the opposite of privileged, which for docker means using the --privileged argument. But apart from that argument, docker doesn't really talk about unprivileged containers.

A real definition comes from LXC:

Privileged containers are defined as any container where the container uid 0 is mapped to the host's uid 0.

They consider privileged containers unsafe and default to unprivileged ones.
The issue for us is that docker/podman containers don't do uid/gid remapping by default and can therefore be seen as privileged from LXC's standpoint.
This privileged/unprivileged distinction and the risks of running privileged containers are further explained here.
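LXC's definition can be tested from inside a container by looking at /proc/self/uid_map, where each line reads "first uid inside, first uid outside, range length". The helper below is a sketch with a hypothetical name (root_maps_to_host_root); it assumes a single level of user namespace, since the map is expressed relative to the parent namespace.

```shell
#!/bin/sh
# root_maps_to_host_root [UID_MAP_FILE]
# Succeeds if uid 0 in this namespace maps to uid 0 outside it.
root_maps_to_host_root() {
    # uid 0 is the lowest uid, so a range covering it must start at 0 inside.
    awk '$1 == 0 && $2 == 0 && $3 > 0 { found = 1 } END { exit !found }' \
        "${1:-/proc/self/uid_map}"
}

if root_maps_to_host_root; then
    echo "privileged by LXC's definition: uid 0 here is uid 0 outside"
else
    echo "unprivileged by LXC's definition: uid 0 here is remapped"
fi
```

Without any remapping the map is typically a single line starting with "0 0", which is the privileged case.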

Podman has an option called --uidmap that should do what we want (docker has a comparable user namespace remapping feature), but when using it, systemd doesn't start. It wants to be able to read /sys/fs/cgroup, and this is not possible since podman mounts /sys/fs/cgroup from the host as a read-only tmpfs that is therefore owned by host uid 0.
Apparently, this could be solved in the future by the new cgroup specification: cgroups v2.
LXC has managed to solve it using pam_cgfs.so, but this isn't likely to be supported in podman. Cgroups v2 seems to be the only answer and is not yet implemented in podman. We have to wait.
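To find out which cgroup flavour a machine is running, one common check is to look for the cgroup.controllers file, which only exists in a cgroups v2 hierarchy. A minimal sketch (cgroup_version is a hypothetical helper name; anything that is not a pure v2 mount, including hybrid setups, is reported as v1 here):

```shell
#!/bin/sh
# cgroup_version [CGROUP_ROOT]
# Prints "v2" if CGROUP_ROOT is a unified (cgroups v2) hierarchy, "v1" otherwise.
cgroup_version() {
    root="${1:-/sys/fs/cgroup}"
    # cgroup.controllers is specific to cgroups v2.
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2
    else
        # Assumption: legacy and hybrid layouts are both treated as v1.
        echo v1
    fi
}

cgroup_version
```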

Running as an unprivileged user

So what can we do? Actually, one of the big advantages of podman over docker is the possibility to run containers as an unprivileged user. It's easy enough to create a user dedicated to this task, or to create podman service units that take advantage of systemd's dynamic users.

There is a list of shortcomings to keep in mind when using rootless containers. For example, we cannot bind ports < 1024 on the host, and if /etc/subuid is not set up for the user, this can make LDAP fail.
Let's try it anyway.
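Before going further, it may be worth checking that your user actually has a subordinate uid range. Lines in /etc/subuid have the form user:start:count. The snippet below is a sketch with a hypothetical helper name (subuid_range), and it only matches entries written with the user name, not with a numeric uid.

```shell
#!/bin/sh
# subuid_range USER [SUBUID_FILE]
# Prints "<first subordinate uid> <count>" for USER, or fails if there is none.
subuid_range() {
    awk -F: -v u="$1" '
        $1 == u { print $2, $3; found = 1 }
        END { exit !found }
    ' "${2:-/etc/subuid}"
}

user=$(id -un)
if range=$(subuid_range "$user"); then
    echo "$user has subordinate uids (start count): $range"
else
    echo "$user has no entry in /etc/subuid" >&2
fi
```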

First, if like me you are on buster, you'll need to allow unprivileged user namespaces in the kernel:

sudo su -
echo 'kernel.unprivileged_userns_clone=1' > /etc/sysctl.d/00-local-userns.conf
# reload the sysctl configuration so that the setting applies without a reboot
sysctl --system

The runc executable gets installed in /usr/sbin and is therefore not in the PATH of unprivileged users. I added a symlink to it as a quick and dirty fix:

sudo ln -s /usr/sbin/runc /bin/runc

Now we can rebuild and run the container as done previously but without sudo:

buildah bud --format=docker -f Dockerfile -t ynh .
podman volume create --opt type=tmpfs --opt device=local --opt o=rw,exec tmp_ynh
podman run -it --rm --name ynh -h x250.home --cap-add=NET_ADMIN --mount 'type=volume,source=tmp_ynh,target=/tmp' ynh

I've removed --net=host since our rootless container cannot bind ports < 1024 on the host.
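The 1024 limit is not hard-coded: on reasonably recent Linux kernels it is exposed as the net.ipv4.ip_unprivileged_port_start sysctl. Here is a small sketch to check whether a given port is within reach of an unprivileged user (port_is_unprivileged is a hypothetical helper name, and the three ports are only examples):

```shell
#!/bin/sh
# port_is_unprivileged PORT [SYSCTL_FILE]
# Succeeds if PORT can be bound without extra privileges.
port_is_unprivileged() {
    # First port that unprivileged processes may bind (1024 by default).
    start=$(cat "${2:-/proc/sys/net/ipv4/ip_unprivileged_port_start}") || return 2
    [ "$1" -ge "$start" ]
}

for port in 80 443 8080; do
    if port_is_unprivileged "$port"; then
        echo "port $port: fine for a rootless container"
    else
        echo "port $port: needs privileges on the host"
    fi
done
```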

You can then launch the postinstall like last time (from a shell inside the container, opened with podman exec -it ynh bash):

yunohost tools postinstall -d $(cat /etc/hostname) -p Yunohost

This time, there are more errors:

Warning: Job for fail2ban.service failed because the control process exited with error code.
Warning: See "systemctl status fail2ban.service" and "journalctl -xe" for details.
Error: Script execution failed: /usr/share/yunohost/hooks/conf_regen/52-fail2ban

This will be an issue because a lot of Yunohost packages, like wordpress, configure fail2ban and therefore need it to be running.
Once again, since my goal is not to have something completely functional, I'm fixing it the dirty way. The issue is that some files are missing (related to postfix) and fail2ban needs them to start the related jails.

sed -i 's/enabled = true/enabled = false/g' /etc/fail2ban/jail.d/yunohost-jails.conf
systemctl start fail2ban

This disables the fail2ban jails and starts the service again.
Now we're good to go!

yunohost user create yuno -f yuno -l yuno -m yuno@$(cat /etc/hostname) -p Yunohost
yunohost app install wordpress --args "domain=$(cat /etc/hostname)&path=/blog&admin=yuno&language=en_US&multisite=1&is_public=1"

Conclusion

Obviously, we cannot yet replace LXC with podman to fully test Yunohost, let alone run it in production that way.
There's still some maturing needed in the rootless container world, and maybe some tweaking on the Yunohost side to accommodate such scenarios (if they're willing to support them).

I'm happy to have dabbled with rootless containers and will certainly use them in the future. My current personal CentOS 7 server runs all its services in docker containers; I'll migrate everything to systemd units with podman once CentOS 8 is out.