When you run a container, the process IDs are namespaced (so they differ between the container and the host), the network stack is namespaced, and the filesystem mounts are namespaced, but a process running as root in the container is still running as root outside the container. This is a risk because many Linux privilege escalation vulnerabilities can be exploited through that shared user ID.
Linux user namespaces aim to mitigate the risks of running a process as root, or as any other UID shared with the host, where a vulnerability could allow a containerized process to escape its namespaces and hold those privileges on the host. Without user namespaces, a process running as root in a container is also root on the host, and the same goes for any process whose UID exists on the host.
User namespaces address this by mapping IDs, so UID=0 inside the container is actually something like UID=12356231 on the host. Thus, a breakout is not as bad as it could be without user namespaces.
In this post, I'm going to walk through how I use Kyverno, a Kubernetes-native policy engine, to enable user namespaces on pods wherever they can be used.
Release Schedule
In release v1.30, Kubernetes announced beta support for user namespaces. In release v1.32, Kubernetes enabled it by default.
If you're running v1.30 or v1.31, enable the feature gate on the API server with the --feature-gates=UserNamespacesSupport=true argument.
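For a kubeadm-style control plane, where the API server runs as a static pod, that means adding the flag to the API server manifest. This is only a sketch: the file path, image tag, and surrounding fields are assumptions for illustration, so adapt it to however your cluster is managed.

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (typical kubeadm location; adjust as needed)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.31.0  # example version tag
    command:
    - kube-apiserver
    - --feature-gates=UserNamespacesSupport=true   # enable user namespaces support
    # ... all of your existing flags stay unchanged
```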
Thinking about policies
Kyverno policies are defined as Kubernetes resources. When any workload resource is created or changed, Kyverno receives an admission webhook call and makes changes based on the policy.
What I want is to automatically set PodSpec.hostUsers=false when a pod is created. However, due to limitations, I can't use it with hostNetwork: true, hostIPC: true, hostPID: true, or volumeDevices. Also, any volumes that get mounted into the pod have to support idmap. If I set hostUsers: false on every pod, then I'll get a bunch of failures and have a bad day.
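For reference, the field the policy will be setting looks like this when written into a pod by hand (the pod name and image below are just placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo     # placeholder name
spec:
  hostUsers: false      # give the pod its own user namespace
  containers:
  - name: app
    image: nginx        # placeholder image
```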
I'm going to walk through how I created the policy. If you're just interested in the result, skip down to The Final Policy below.
Creating the policy
Mutate all the things
First, I start with a basic policy that enables user namespaces on all pods that are created in the cluster.
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  annotations:
    pod-policies.kyverno.io/autogen-controllers: none
  name: enable-userns-when-able
spec:
  background: false
  rules:
  - match:
      any:
      - resources:
          kinds:
          - Pod
    mutate:
      patchStrategicMerge:
        spec:
          hostUsers: false
    name: enable-userns
    preconditions:
      all:
      - key: '{{ request.operation || ''BACKGROUND'' }}'
        operator: Equals
        value: CREATE
  validationFailureAction: Audit
```
However, this will cause some pods to fail to create, because user namespaces aren't supported with every volume type or with host networking.
Skip for pods with other host namespacing
User namespaces aren't supported for pods using the host network or host PID namespace. My guess is that those would enable a container to see user IDs it can't know about.
I want to exclude pods that have these values set. This can be done with a precondition. Kyverno's precondition language gets quite strange for complex conditions, but these ones are trivial.
```yaml
preconditions:
  all:
  # ... previous precondition
  - key: '{{ request.object.spec.hostUsers || ''false'' }}'
    operator: NotEquals
    value: 'true'
  - key: '{{ request.object.spec.hostIPC || ''false'' }}'
    operator: Equals
    value: 'false'
  - key: '{{ request.object.spec.hostPID || ''false'' }}'
    operator: Equals
    value: 'false'
  - key: '{{ request.object.spec.hostNetwork || ''false'' }}'
    operator: Equals
    value: 'false'
```
Now when I create any pods with hostNetwork=true or hostIPC=true, or pods that explicitly set hostUsers=true, nothing is changed.
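For example, a pod like this hypothetical node agent is left untouched because it uses the host network:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-agent      # hypothetical name
spec:
  hostNetwork: true     # trips the precondition, so hostUsers is never injected
  containers:
  - name: agent
    image: busybox      # placeholder image
    command: ["sleep", "infinity"]
```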
Problems with PVCs
However, I still have problems when I start working with volumes. If I mount a hostPath volume, it works because the host's filesystems support user namespaces, but not all volumes do. For example, Longhorn, my storage provider in my home lab, uses NFS (Network File System) to mount volumes, and it didn't support idmap:
```
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: failed to set MOUNT_ATTR_IDMAP on /var/lib/kubelet/pods/7ea01f58-57f9-4504-87b5-898ca80bd980/volumes/kubernetes.io~csi/pvc-48bee7b4-82b5-4119-a927-ceb0b0939c20/mount: invalid argument (maybe the filesystem used doesn't support idmap mounts on this kernel?)
```
Looking further, I believe this issue is limited to just ReadWriteMany volumes (volumes that can be mounted on multiple nodes at once) as other volumes are able to work with user namespaces.
I also encountered issues with pods that mount devices from the host's /dev, like my Home Assistant pod that talks to a USB Zigbee adapter.
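For illustration, the problematic shape looks something like this (the device path, names, and image are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zigbee-bridge                  # hypothetical name
spec:
  containers:
  - name: bridge
    image: example/home-automation     # placeholder image
    volumeMounts:
    - name: zigbee
      mountPath: /dev/ttyUSB0
  volumes:
  - name: zigbee
    hostPath:
      path: /dev/ttyUSB0               # device node passed through from the host
      type: CharDevice
```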
I need a way to skip mutating pods that use volumes that don’t support user namespaces.
Skip for unsupported volumes
This is where Kyverno’s policy language gets very messy and unintuitive.
Skip host hardware devices
First, we check whether there are any mounts that use hostPath to mount something from under the host's /dev/ directory. If it finds one or more, the rule skips mutating because user namespaces aren't supported there. This uses a query language called JMESPath to search the JSON/YAML document, somewhat like XPath.
```yaml
spec:
  rules:
  - context:
    - name: hasdevice
      variable:
        default: 0
        jmesPath: >-
          request.object.spec.volumes[?hostPath].hostPath.path[?starts_with(@,
          '/dev/')] | length(@)
    # match:
    # mutate:
    preconditions:
      # - ... other preconditions
      - key: '{{ hasdevice }}'
        operator: Equals
        value: 0
```
Skip for Longhorn
Next, we look for any volume mounts that reference a Longhorn RWX volume. A pod doesn’t directly say what kind of PVC is being mounted. It just says the name of the PVC. What can we do to figure this out?
```yaml
apiVersion: v1
kind: Pod
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: paperless
```
Luckily, Kyverno has a feature that calls back to the Kubernetes API to fetch one or more resources. I have the name of the PVC; can that tell me what kind of volume I have?
```yaml
# kubectl get pvc -n paperless paperless -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    volume.kubernetes.io/storage-provisioner: driver.longhorn.io
  labels:
    recurring-job-group.longhorn.io/default: enabled
    recurring-job.longhorn.io/weekly: enabled
  name: paperless
  namespace: paperless
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  storageClassName: longhorn
```
It has accessModes, so we can use that to skip mutating RWX volumes. The spec.storageClassName looks relevant at first, but you can create additional StorageClasses with any name that use Longhorn, and it's common to define classes with different node or disk selectors, so while matching on it would work, it's fragile and easy to break. There's also metadata.annotations."volume.kubernetes.io/storage-provisioner", but this isn't present on my older volumes; only the beta annotation is there, and relying on deprecated annotations feels dirty. Thus, I have to look at both annotations.
The following tells Kyverno to go back to Kubernetes and fetch all PVCs in the pod's namespace, filter them locally using JMESPath to find the PVCs that are managed by Longhorn and use ReadWriteMany, and store the names in a variable called longhornpvcs:
```yaml
spec:
  rules:
  - context:
    - apiCall:
        jmesPath: >-
          items[?((metadata.annotations."volume.kubernetes.io/storage-provisioner" == 'driver.longhorn.io' || metadata.annotations."volume.beta.kubernetes.io/storage-provisioner" == 'driver.longhorn.io') && contains(spec.accessModes, 'ReadWriteMany'))].metadata.name
        method: GET
        urlPath: /api/v1/namespaces/{{ request.namespace }}/persistentvolumeclaims
      name: longhornpvcs
    # match:
    # mutate:
```
Then, in the preconditions, we iterate over all the volumes in the Pod, get their claim names, and check whether any of them exist in the set defined above. If any do, the rule skips mutation. It also skips mutation for any pod that mounts a volume as a volumeDevice (a low-level block device mount), regardless of provisioner, since that isn't supported by user namespaces either.
```yaml
spec:
  rules:
  - # context:
    # match:
    # mutate:
    preconditions:
      all:
      # ... other preconditions
      - key: >-
          {{
          request.object.spec.volumes[?persistentVolumeClaim].persistentVolumeClaim.claimName
          || `[]` }}
        message: Longhorn PVCs may not support user namespaces
        operator: AllNotIn
        value: '{{ longhornpvcs }}'
      - key: '{{ request.object.spec.containers[?volumeDevices] | length(@) }}'
        message: "volumeDevices don't support user namespaces"
        operator: Equals
        value: 0
```
The Final Policy
Here’s the final policy put together:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  annotations:
    kyverno.io/kyverno-version: 1.9.0
    pod-policies.kyverno.io/autogen-controllers: none
    policies.kyverno.io/category: Pod Security
    policies.kyverno.io/description: This policy ensures that user namespaces are enabled for pods that can use them
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/title: Enable User Namespaces when possible
  name: enable-userns-when-able
spec:
  background: false
  rules:
  - context:
    - name: hasdevice
      variable:
        default: 0
        jmesPath: >-
          request.object.spec.volumes[?hostPath].hostPath.path[?starts_with(@,
          '/dev/')] | length(@)
    - apiCall:
        jmesPath: >-
          items[?((metadata.annotations."volume.kubernetes.io/storage-provisioner" == 'driver.longhorn.io' || metadata.annotations."volume.beta.kubernetes.io/storage-provisioner" == 'driver.longhorn.io') && contains(spec.accessModes, 'ReadWriteMany'))].metadata.name
        method: GET
        urlPath: /api/v1/namespaces/{{ request.namespace }}/persistentvolumeclaims
      name: longhornpvcs
    exclude:
      any:
      - resources:
          namespaces:
          - kube-system
          - longhorn-system
          - calico-system
    match:
      any:
      - resources:
          kinds:
          - Pod
    mutate:
      patchStrategicMerge:
        spec:
          hostUsers: false
    name: enable-userns
    preconditions:
      all:
      - key: '{{ request.operation || ''BACKGROUND'' }}'
        operator: Equals
        value: CREATE
      - key: '{{ request.object.spec.hostUsers || ''false'' }}'
        message: Skipping because hostUsers is explicitly set to true
        operator: NotEquals
        value: true
      - key: '{{ request.object.spec.hostIPC || ''false'' }}'
        operator: Equals
        value: 'false'
      - key: '{{ request.object.spec.hostPID || ''false'' }}'
        operator: Equals
        value: 'false'
      - key: '{{ request.object.spec.hostNetwork || ''false'' }}'
        message: "User namespaces can't be used when hostNetwork=true"
        operator: Equals
        value: 'false'
      - key: '{{ hasdevice }}'
        operator: Equals
        value: 0
      - key: '{{ request.object.spec.containers[?volumeDevices] | length(@) }}'
        message: "volumeDevices don't support user namespaces"
        operator: Equals
        value: 0
      - key: >-
          {{
          request.object.spec.volumes[?persistentVolumeClaim].persistentVolumeClaim.claimName
          || `[]` }}
        message: Longhorn PVCs may not support user namespaces
        operator: AllNotIn
        value: '{{ longhornpvcs }}'
  validationFailureAction: Audit
```
See it in action
The policy only applies when pods are created, so existing pods are not modified. To see it in action, we need to redeploy some pods.
To see which pods are using user namespaces with kubectl, run the following command. If the HostUsers column shows <none> or true, the pod is not using them. If it shows false, the pod has its own separate user namespace.
```
kubectl get --all-namespaces pods -o custom-columns=Namespace:.metadata.namespace,Name:.metadata.name,HostUsers:.spec.hostUsers
NAMESPACE          NAME                                         USERNS
authelia           authelia-6dd4b4b75c-zvm7m                    false
calico-apiserver   calico-apiserver-7dbc7798c5-nvrqp            <none>
calico-system      calico-kube-controllers-5778576475-b6gr8     <none>
calico-system      calico-node-92t8f                            <none>
calico-system      calico-typha-55dfc8c999-ckldk                <none>
calico-system      csi-node-driver-9jhvb                        <none>
cattle-system      cattle-cluster-agent-746f6d788f-h6n2q        <none>
cattle-system      kube-api-auth-cqrcz                          <none>
cattle-system      rancher-webhook-5d4b7cd966-phdj2             <none>
cattle-system      system-upgrade-controller-57b66d6cbd-6p92n   <none>
```
|