Jailing your apps using Linux Namespaces

If you have a recent Linux kernel (>2.6.26) you can use its support for namespaces.

What are namespaces?

They are an elegant (more elegant than most of the jailing systems you might find in other operating systems) way to “detach” your processes from a specific layer of the kernel and assign them to a new one.

The ‘chroot’ system available on UNIX/POSIX systems is a primitive form of namespaces: a process sees a completely new file system root and has no access to the original one.

Linux extends this concept to the other OS layers (PIDs, users, IPC, networking etc.), so a specific process can live in a “virtual OS” with a new group of pids, a new set of users, a completely unshared IPC system (semaphores, shared memory etc.), a dedicated network interface and its own hostname.

uWSGI got full namespaces support in the 1.9/2.0 development cycle.

clone() vs unshare()

To place a process in a new namespace you have two syscalls: the venerable clone(), which creates a new process in the specified namespaces, and the newer unshare(), which changes the namespaces of the currently running process.

clone() can be used by the Emperor to directly spawn vassals in new namespaces:

[uwsgi]
emperor = /etc/uwsgi/vassals
emperor-use-clone = fs,net,ipc,uts,pid

will run each vassal with a dedicated filesystem, networking, SysV IPC, UTS and PID view

while

[uwsgi]
unshare = ipc,uts
...

will run the instance in the specified namespaces.

Some namespace subsystems require additional steps (see below).
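To get a feel for what unshare() does underneath, here is a minimal Python sketch (it is not uWSGI code). It assumes a Linux libc reachable through ctypes; the helper name try_unshare_in_child is made up for this example. Because unshare() changes the calling process, the sketch forks first, so only the child leaves the original namespace.

```python
import ctypes
import errno
import os

CLONE_NEWUTS = 0x04000000  # value from <linux/sched.h>


def try_unshare_in_child(flags):
    """Fork, call unshare(flags) in the child, return 0 or the child's errno."""
    pid = os.fork()
    if pid == 0:
        # Child: try to detach from the parent's namespaces, then report back
        # through the exit status. os._exit() skips any parent cleanup code.
        code = 0
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(flags) != 0:
                code = ctypes.get_errno() or errno.EPERM
        except Exception:
            code = errno.ENOSYS  # no usable unshare() on this platform
        finally:
            os._exit(code & 0xFF)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1  # child was killed by a signal
```

Calling try_unshare_in_child(CLONE_NEWUTS) returns 0 when the caller is allowed to create the namespace (normally root) and typically errno.EPERM otherwise.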

Supported namespaces

fs -> CLONE_NEWNS, filesystems

ipc -> CLONE_NEWIPC, sysv ipc

pid -> CLONE_NEWPID, when used with unshare() it requires an additional fork(), so use one of the --refork-* options

uts -> CLONE_NEWUTS, hostname

net -> CLONE_NEWNET, new networking. UNIX sockets from different namespaces are still usable, so they are a good way to communicate between namespaces

user -> CLONE_NEWUSER, still complex to manage (and its behaviour differs between kernel versions), use with caution
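The names above are shorthand for kernel flags that get OR-ed together before being handed to clone() or unshare(). The following Python sketch shows that mapping. The flag values come from <linux/sched.h>; the function name namespace_mask and the parsing rules are assumptions made for illustration, not uWSGI's actual parser.

```python
# Flag values as defined in <linux/sched.h>.
NAMESPACE_FLAGS = {
    "fs": 0x00020000,    # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_mask(spec):
    """Turn a list such as 'ipc,uts' into the OR of the matching flags."""
    mask = 0
    for name in spec.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate stray commas
        if name not in NAMESPACE_FLAGS:
            raise ValueError("unknown namespace: %r" % name)
        mask |= NAMESPACE_FLAGS[name]
    return mask
```

For example, namespace_mask("ipc,uts") yields CLONE_NEWIPC | CLONE_NEWUTS, the bitmask an unshare() call for those two namespaces would receive.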

setns()

In addition to creating new namespaces for a process you can attach to already running ones using the setns() call.

Each process exposes its namespaces via the /proc/self/ns directory. The setns() syscall uses the file descriptors obtained by opening the files in that directory to attach to namespaces.
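To look at what a process exposes there, a small Python sketch like the one below can help. It assumes a Linux system with /proc mounted and returns an empty dict elsewhere; list_namespaces is a name invented for this example.

```python
import os


def list_namespaces(pid="self"):
    """Map each entry of /proc/<pid>/ns to its link target."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, /proc not mounted, or no such process
    for name in names:
        try:
            # Each entry is a symlink whose target identifies the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable with our privileges
    return result
```

Two processes live in the same namespace when the link targets for that entry are identical, so comparing the output for two PIDs is a quick way to check whether a jail really is detached.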

As we have already seen, UNIX sockets are a good way to communicate between namespaces. The uWSGI setns() feature works by creating a UNIX socket that receives requests from processes wanting to join its namespace. As UNIX sockets allow file descriptor passing, the “client” only needs to call setns() on the descriptors it receives.

setns-socket <addr> exposes /proc/self/ns on the specified unix socket address

setns <addr> connects to the specified unix socket address, gets the file descriptors and calls setns() on them

setns-preopen if enabled, the /proc/self/ns files are opened on startup (before privileges drop) and cached. This is useful to avoid running the main instance as root.

setns-socket-skip <name> some files in /proc/self/ns can create problems (mostly the ‘user’ one). You can skip them by specifying their name (you can specify this option multiple times).
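The file descriptor passing that these options rely on can be tried out with a few lines of Python. This is only a sketch of the mechanism (SCM_RIGHTS over a UNIX socket), not the protocol uWSGI speaks on its setns socket. It assumes Python 3.9 or newer for socket.send_fds()/socket.recv_fds(), and it passes a pipe instead of a /proc/self/ns file so that it runs without privileges.

```python
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    socket.send_fds(sock, [b"ns"], [fd])


def recv_fd(sock):
    """Receive one descriptor; the kernel installs a new fd number for it."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def demo():
    """Pass the read end of a pipe over a socket pair and read through it."""
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_fd(server, read_end)
        received = recv_fd(client)  # a duplicate of read_end
        os.write(write_end, b"hello")
        data = os.read(received, 5)
        os.close(received)
        return data
    finally:
        os.close(read_end)
        os.close(write_end)
        server.close()
        client.close()
```

In the real case the descriptor would refer to an open file from /proc/self/ns, and the receiving side would pass it to setns() (available as os.setns() from Python 3.12) with enough privileges to join the namespace.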

pivot_root

This option allows you to change the rootfs of your currently running instance.

It is better than chroot as it allows you to access the old filesystem tree before (manually) unmounting it.

It is a bit complex to master, as it relies on a couple of assumptions:

pivot_root <new> <old>

<new> is the directory to mount as the new rootfs and <old> is where to access the old tree.

<new> must be a mounted filesystem, and <old> must be under this filesystem.
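These two conditions can be checked before starting the instance. The Python sketch below reads mount points from lines in /proc/self/mountinfo format (the mount point is the fifth field), which also catches bind mounts. The function name and the idea of pre-checking are assumptions of this example; uWSGI itself simply calls pivot_root and reports the kernel error.

```python
import os


def pivot_root_args_look_valid(new_root, old_root, mountinfo_lines):
    """Check that <new> is a mount point and that <old> lies under it."""
    if not (os.path.isabs(new_root) and os.path.isabs(old_root)):
        return False
    new_root = os.path.normpath(new_root)
    old_root = os.path.normpath(old_root)
    # Field 5 (index 4) of every mountinfo line is the mount point.
    mount_points = set()
    for line in mountinfo_lines:
        fields = line.split()
        if len(fields) > 4:
            mount_points.add(fields[4])
    if new_root not in mount_points:
        return False  # <new> must be a mounted filesystem
    # <old> must be inside the <new> tree.
    return os.path.commonpath([new_root, old_root]) == new_root
```

On a live system you would pass the open file itself, for example pivot_root_args_look_valid("/ns", "/ns/.old_root", open("/proc/self/mountinfo")), after the bind mount has been made.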

A common pattern is:

[uwsgi]
unshare = fs
hook-post-jail = mount:none /distros/precise /ns bind
pivot_root = /ns /ns/.old_root
...

Remember to create /ns and /distros/precise/.old_root.

Once you have created the new filesystem layout, you can unmount /.old_root recursively:

[uwsgi]
unshare = fs
hook-post-jail = mount:none /distros/precise /ns bind
pivot_root = /ns /ns/.old_root
; bind mount some useful fs like /dev and /proc
hook-as-root = mount:proc none /proc nodev hidepid=2
hook-as-root = mount:none /.old_root/dev /dev bind
hook-as-root = mount:none /.old_root/dev/pts /dev/pts bind
; umount the old tree
hook-as-root = umount:/.old_root rec,detach

Why not lxc?

Lxc is a project that lets you build full subsystems using Linux namespaces. You may ask why we are “reinventing the wheel” when lxc already implements a fully “virtualized” system. Apples and oranges.

The objective of lxc is to give users the view of a virtual server. uWSGI namespaces support is lower level: you can use it to detach single components (for example, you may only want to unshare IPC) to increase security and isolation.

Not every scenario requires a full system-like view (in a lot of cases it is suboptimal, while in others it is the best approach). Try to see namespaces as a way to increase security and isolation: when you need to (and can) isolate a component, do it with clone/unshare. When you want to give users full system-like access, go with lxc.

The old way: the --namespace option

Before 1.9/2.0, full-featured system-like namespace support was added. It works as a chroot() on steroids.

It should be moved to an external plugin pretty soon, but it will always be part of the main distribution, as a lot of people use it for its simplicity.

You basically need to set a root filesystem and a hostname to start your instance in a new namespace.

Let’s start by creating a new root filesystem for our jail. You’ll need debootstrap. We place our rootfs in /ns/001 and then create a ‘uwsgi’ user that will run the uWSGI server. We use the chroot command to run ‘adduser’ inside the new rootfs, and we install the Flask package, which uwsgicc requires.

(All this needs to be executed as root)

mkdir -p /ns/001
debootstrap maverick /ns/001
chroot /ns/001
# in the chroot jail now
adduser uwsgi
apt-get install git python-flask
su - uwsgi
# as uwsgi now
# clone into ~/uwsgi so the path matches the --chdir used later
git clone https://github.com/unbit/uwsgicc.git uwsgi
exit # out of uwsgi
exit # out of the jail

Now on your real system run

uwsgi --socket 127.0.0.1:3031 --chdir /home/uwsgi/uwsgi --uid uwsgi --gid uwsgi --module uwsgicc --master --processes 4 --namespace /ns/001:mybeautifulhostname

If all goes well, uWSGI will set /ns/001 as the new root filesystem, assign mybeautifulhostname as the hostname and hide the PIDs and IPC of the host system.

The first thing you should note is that the uWSGI master becomes PID 1 (the “init” process). All processes generated by the uWSGI stack will be reparented to it if something goes wrong. If the master dies, all jailed processes die.

Now point your web browser to your web server and you should see the uWSGI Control Center interface.

Pay attention to the information area. The nodename (used by the cluster subsystem) matches the real hostname, as it does not make sense to have multiple jails in the same cluster group. In the hostname field you will instead see the hostname you have set.

Another important thing is that you can see all the jail processes from your real system (they will have a different set of PIDs), so if you want to take control of the jail you can easily do it.

Note

A good way to limit hardware usage of jails is to combine them with the cgroups subsystem.

Reloading uWSGI

When running jailed, uWSGI uses another system for reloading: it’ll simply tell workers to bugger off and then exit. The parent process living outside the namespace will see this and respawn the stack in a new jail.

How secure is this sort of jailing?

Hard to say! All software tends to be secure until a hole is found.

Additional filesystems

When an app is jailed in a namespace, it only has access to its virtual jail root filesystem. If any other filesystem is mounted inside the jail directory, it won’t be accessible unless you use namespace-keep-mount.

# app1 jail is located here
namespace = /apps/app1

# nfs share mounted on the host side
namespace-keep-mount = /apps/app1/nfs

This will bind /apps/app1/nfs into the jail, so that the jailed app can access it under the /nfs directory.

# app1 jail is located here
namespace = /apps/app1

# nfs share mounted on the host side
namespace-keep-mount = /mnt/nfs1:/nfs

If the filesystem we want to bind is mounted at a path that is not inside our jail, then we can use the “<source>:<dest>” syntax for --namespace-keep-mount. In this case /mnt/nfs1 will be bound to the /nfs directory inside the jail.
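The two forms of the option can be summed up with a small Python sketch that resolves a value into the host path and the path seen inside the jail. This only illustrates the syntax described above and is not uWSGI's own parsing code; the function name and the error raised for a path outside the jail are assumptions of this example.

```python
import os


def resolve_keep_mount(spec, jail_root):
    """Return (source_on_host, destination_inside_jail) for a keep-mount value."""
    jail_root = os.path.normpath(jail_root)
    if ":" in spec:
        # "<source>:<dest>" form: the source may live anywhere on the host.
        source, dest = spec.split(":", 1)
        return os.path.normpath(source), os.path.normpath(dest)
    # Single-path form: the mount already sits under the jail directory,
    # so the destination is the same path relative to the jail root.
    source = os.path.normpath(spec)
    if os.path.commonpath([jail_root, source]) != jail_root:
        raise ValueError("%s is not inside %s" % (source, jail_root))
    return source, "/" + os.path.relpath(source, jail_root)
```

With the jail at /apps/app1, the value /apps/app1/nfs resolves to /nfs inside the jail, and /mnt/nfs1:/nfs resolves to the same destination from a source outside the jail.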