Auto enable user namespaces in Kubernetes

When you run a container, the process IDs are namespaced (different in the container vs the host), the network stack is namespaced, and the filesystem mounts are namespaced, but a process running as root in the container is running as root outside the container. This is a risk because many privilege escalation vulnerabilities in Linux can be exploited through this shared user ID.

Linux user namespaces aim to mitigate the risks of running a process as root, or as any other user ID shared with the host, where a vulnerability could allow a containerized process to escape its namespaces and hold real privileges on the host. Without user namespaces, a process running as root in a container is also root on the host, and a process whose UID exists on the host carries that host user's privileges if it escapes.

User namespaces attempt to fix this by saying that UID=0 inside the container is actually UID=12356231 on the host. Thus, a breakout is not as bad as it would be without user namespaces.
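
To make the mapping concrete: inside a pod running with hostUsers: false, the kernel exposes the mapping in /proc/self/uid_map (the host UID below is made up for illustration):

cat /proc/self/uid_map
         0     882688      65536

Here, container UIDs 0-65535 map to host UIDs 882688-948223, so "root" in the container is just an unprivileged high UID on the host.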

In this post, I’m going to walk through how I use Kyverno, a Kubernetes-native policy engine, to automatically enable user namespaces in pods wherever possible.

Release Schedule

In release v1.30, Kubernetes announced beta support for user namespaces. In release v1.33, Kubernetes enabled the feature by default.

If you’re running v1.30-v1.32, then enable the feature gate on the kube-apiserver with the --feature-gates=UserNamespacesSupport=true arg.
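
The kubelet needs the same feature gate; if you manage it through a KubeletConfiguration file, that looks like this:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  UserNamespacesSupport: true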

Thinking about policies

Kyverno policies are defined as Kubernetes resources; when any workload resource is created or changed, the API server calls Kyverno’s admission webhook, and Kyverno mutates or validates the resource based on the policy.

What I want is to automatically set PodSpec.hostUsers=false when a pod is created. However, due to limitations, I can’t use it with hostNetwork: true, hostIPC: true, hostPID: true, or volumeDevices. Also, any volumes that get mounted into the pod have to support idmap. If I set hostUsers: false on every pod, then I’ll get a bunch of failures and have a bad day.
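
For reference, this is the shape of a PodSpec after mutation; a minimal sketch with a made-up pod name:

apiVersion: v1
kind: Pod
metadata:
  name: userns-demo  # hypothetical name
spec:
  hostUsers: false  # the field the policy will inject
  containers:
    - name: app
      image: nginx  # any image; nothing special is required of it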

I’m going to walk through how I created the policy. If you’re just interested in the final policy, skip to The Final Policy section below.

Creating the policy

Mutate all the things

First, I start with a basic policy that enables user namespaces on all pods that are created in the cluster.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  annotations:
    pod-policies.kyverno.io/autogen-controllers: none
  name: enable-userns-when-able
spec:
  background: false
  rules:
    - match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            hostUsers: false
      name: enable-userns
      preconditions:
        all:
          - key: '{{ request.operation || ''BACKGROUND'' }}'
            operator: Equals
            value: CREATE
  validationFailureAction: Audit
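
Applying this stage is the same as applying any other Kubernetes resource (the filename here is arbitrary):

kubectl apply -f enable-userns-when-able.yaml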

However, this will cause some pods to fail to create, because user namespaces aren’t supported with every volume type or with host namespaces like hostNetwork.

Skip for pods with other host namespacing

User namespaces aren’t supported in pods that use the host network, host PID, or host IPC namespaces. My guess is that those would let a container see user IDs it has no mapping for.

I want to exclude pods that have these values set. This can be done with a precondition. Kyverno’s precondition language gets to be quite strange with complex preconditions, but these are trivial.

preconditions:
  all:
    # ... previous precondition
    - key: '{{ request.object.spec.hostUsers || ''false'' }}'
      operator: NotEquals
      value: 'true'
    - key: '{{ request.object.spec.hostIPC || ''false'' }}'
      operator: Equals
      value: 'false'
    - key: '{{ request.object.spec.hostPID || ''false'' }}'
      operator: Equals
      value: 'false'
    - key: '{{ request.object.spec.hostNetwork || ''false'' }}'
      operator: Equals
      value: 'false'

Now when I create a pod with hostNetwork: true, hostIPC: true, or hostPID: true, or with hostUsers: true explicitly set, nothing is changed.
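
For example, a pod like this hypothetical monitoring agent is skipped because the hostNetwork precondition no longer evaluates to 'false':

apiVersion: v1
kind: Pod
metadata:
  name: node-agent  # hypothetical name
spec:
  hostNetwork: true  # fails the precondition, so the rule doesn't mutate
  containers:
    - name: agent
      image: example.com/agent  # placeholder image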

Problems with PVCs

However, I still have problems when I start working with volumes. If I mount a hostPath volume, it works because the host’s filesystem supports idmap mounts, but not all volumes do. For example, Longhorn, my storage provider in my home lab, uses NFS (Network File System) to mount volumes, and it didn’t support idmap:

failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: failed to set MOUNT_ATTR_IDMAP on /var/lib/kubelet/pods/7ea01f58-57f9-4504-87b5-898ca80bd980/volumes/kubernetes.io~csi/pvc-48bee7b4-82b5-4119-a927-ceb0b0939c20/mount: invalid argument (maybe the filesystem used doesn't support idmap mounts on this kernel?)

Looking further, I believe this issue is limited to just ReadWriteMany volumes (volumes that can be mounted on multiple nodes at once), as other volumes work with user namespaces.
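
If you want to check which PVCs in your cluster are ReadWriteMany, kubectl’s custom columns make that easy:

kubectl get pvc --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,MODES:.spec.accessModes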

I also encountered issues with pods that mount devices from the host’s /dev, like my Home Assistant pod, which talks to a USB Zigbee adapter.

I need a way to skip mutating pods that use volumes that don’t support user namespaces.

Skip for unsupported volumes

This is where Kyverno’s policy language gets very messy and unintuitive.

Skip host hardware devices

First, we check whether any volumes use hostPath to mount something from under the host’s /dev/ folder. If the rule finds one or more, it skips mutating, because user namespaces aren’t supported there. This uses a query language called JMESPath to search the JSON/YAML document, kind of like XPath.

spec:
  rules:
    - context:
        - name: hasdevice
          variable:
            default: 0
            jmesPath: >-
              request.object.spec.volumes[?hostPath].hostPath.path[?starts_with(@,
              '/dev/')] | length(@)
      # match:
      # mutate:
      preconditions:
        all:
          # ... other preconditions
          - key: '{{ hasdevice }}'
            operator: Equals
            value: 0
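
To see how that JMESPath expression behaves, here’s a hypothetical volumes list and what each step of the query produces (evaluated relative to request.object.spec):

volumes:
  - name: zigbee
    hostPath:
      path: /dev/ttyUSB0  # matches starts_with(@, '/dev/')
  - name: config
    hostPath:
      path: /etc/app      # filtered out
# volumes[?hostPath].hostPath.path yields ["/dev/ttyUSB0", "/etc/app"],
# the starts_with filter keeps ["/dev/ttyUSB0"], and length(@) returns 1,
# so hasdevice is 1 and the precondition skips this pod.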

Skip for Longhorn

Next, we look for any volume mounts that reference a Longhorn RWX volume. A pod doesn’t directly say what kind of PVC is being mounted; it just gives the name of the PVC. What can we do to figure this out?

apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: paperless

Luckily, Kyverno has a feature that calls back to the Kubernetes API to fetch one or more resources. I have the name of the PVC; can that tell me what kind of volume I have?

# kubectl get pvc -n paperless paperless -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    volume.kubernetes.io/storage-provisioner: driver.longhorn.io
  labels:
    recurring-job-group.longhorn.io/default: enabled
    recurring-job.longhorn.io/weekly: enabled
  name: paperless
  namespace: paperless
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  storageClassName: longhorn

It has accessModes, so we can use that to skip mutating pods with RWX volumes. The spec.storageClassName looks relevant at first, but you can create more StorageClasses with any name that use Longhorn, and it’s common to define classes with different node or disk selectors, so while matching on the class name would work, it’s fragile and easy to break.

There’s also metadata.annotations."volume.kubernetes.io/storage-provisioner", but this isn’t present on my older volumes: only the beta annotation is there, and relying on a deprecated annotation feels dirty. Thus, I have to look at both annotations.

The following tells Kyverno to go back to Kubernetes and fetch all PVCs in the same namespace, then filter locally using JMESPath to find all PVCs that are managed by Longhorn and use ReadWriteMany, storing the names in a variable called longhornpvcs:

spec:
  rules:
    - context:
        - apiCall:
            jmesPath: >-
              items[?((metadata.annotations."volume.kubernetes.io/storage-provisioner" == 'driver.longhorn.io' || metadata.annotations."volume.beta.kubernetes.io/storage-provisioner" == 'driver.longhorn.io') && contains(spec.accessModes, 'ReadWriteMany'))].metadata.name
            method: GET
            urlPath: /api/v1/namespaces/{{ request.namespace }}/persistentvolumeclaims
          name: longhornpvcs
      # match:
      # mutate:
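
Given the paperless PVC shown earlier, which carries the Longhorn provisioner annotations and has ReadWriteMany in its accessModes, this query would return ["paperless"].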

Then, in the preconditions, we iterate over all the volumes in the Pod, get their claim names, and check whether any of them appear in the set defined above. If any do, the rule skips mutation. It also skips mutation for any pod, from any provisioner, that mounts a volume as a volumeDevice, which is a low-level block device mount not supported by user namespaces.

spec:
  rules:
    - # context:
      # match:
      # mutate:
      preconditions:
        all:
          # ... other preconditions
          - key: >-
              {{
              request.object.spec.volumes[?persistentVolumeClaim].persistentVolumeClaim.claimName
              || `[]` }}
            message: Longhorn PVCs may not support user namespaces
            operator: AllNotIn
            value: '{{ longhornpvcs }}'
          - key: '{{ request.object.spec.containers[?volumeDevices] | length(@) }}'
            message: "volumeDevices don't support user namespaces"
            operator: Equals
            value: 0
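
For reference, a volumeDevices mount looks like this (hypothetical names); the claim is consumed as a raw block device rather than a mounted filesystem, so there’s no mount for idmap to apply to:

spec:
  containers:
    - name: db
      volumeDevices:
        - name: data
          devicePath: /dev/xvda  # raw block device inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: db-data  # a PVC with volumeMode: Block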

The Final Policy

Here’s the final policy put together:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  annotations:
    kyverno.io/kyverno-version: 1.9.0
    pod-policies.kyverno.io/autogen-controllers: none
    policies.kyverno.io/category: Pod Security
    policies.kyverno.io/description: This policy ensures that user namespaces are enabled for pods that can support them
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/title: Enable User Namespaces when possible
  name: enable-userns-when-able
spec:
  background: false
  rules:
    - context:
        - name: hasdevice
          variable:
            default: 0
            jmesPath: >-
              request.object.spec.volumes[?hostPath].hostPath.path[?starts_with(@,
              '/dev/')] | length(@)
        - apiCall:
            jmesPath: >-
              items[?((metadata.annotations."volume.kubernetes.io/storage-provisioner" == 'driver.longhorn.io' || metadata.annotations."volume.beta.kubernetes.io/storage-provisioner" == 'driver.longhorn.io') && contains(spec.accessModes, 'ReadWriteMany'))].metadata.name
            method: GET
            urlPath: /api/v1/namespaces/{{ request.namespace }}/persistentvolumeclaims
          name: longhornpvcs
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                - longhorn-system
                - calico-system
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            hostUsers: false
      name: enable-userns
      preconditions:
        all:
          - key: '{{ request.operation || ''BACKGROUND'' }}'
            operator: Equals
            value: CREATE
          - key: '{{ request.object.spec.hostUsers || ''false'' }}'
            message: Skipping because hostUsers is explicitly set to true
            operator: NotEquals
            value: true
          - key: '{{ request.object.spec.hostIPC || ''false'' }}'
            operator: Equals
            value: 'false'
          - key: '{{ request.object.spec.hostPID || ''false'' }}'
            operator: Equals
            value: 'false'
          - key: '{{ request.object.spec.hostNetwork || ''false'' }}'
            message: "User namespaces can't be used when hostNetwork=true"
            operator: Equals
            value: 'false'
          - key: '{{ hasdevice }}'
            operator: Equals
            value: 0
          - key: '{{ request.object.spec.containers[?volumeDevices] | length(@) }}'
            message: "volumeDevices don't support user namespaces"
            operator: Equals
            value: 0
          - key: >-
              {{
              request.object.spec.volumes[?persistentVolumeClaim].persistentVolumeClaim.claimName
              || `[]` }}
            message: Longhorn PVCs may not support user namespaces
            operator: AllNotIn
            value: '{{ longhornpvcs }}'
  validationFailureAction: Audit

See it in action

The policy only applies to new pod creation, so existing pods are not modified. To see the policy take effect, we need to redeploy existing pods.
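
For anything managed by a Deployment, a rolling restart is enough to recreate the pods and trigger the mutation (the deployment name here is just an example):

kubectl rollout restart deployment -n paperless paperless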

To see which pods are using user namespaces with kubectl, run the following command, which surfaces spec.hostUsers in a USERNS column. If the column shows <none> or true, the pod is not in its own user namespace; if it shows false, the pod has its own separate user namespace.

kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,USERNS:.spec.hostUsers

NAMESPACE               NAME                                                           USERNS  
authelia                authelia-6dd4b4b75c-zvm7m                                      false  
calico-apiserver        calico-apiserver-7dbc7798c5-nvrqp                              <none>  
calico-system           calico-kube-controllers-5778576475-b6gr8                       <none>  
calico-system           calico-node-92t8f                                              <none>  
calico-system           calico-typha-55dfc8c999-ckldk                                  <none>  
calico-system           csi-node-driver-9jhvb                                          <none>  
cattle-system           cattle-cluster-agent-746f6d788f-h6n2q                          <none>  
cattle-system           kube-api-auth-cqrcz                                            <none>  
cattle-system           rancher-webhook-5d4b7cd966-phdj2                               <none>  
cattle-system           system-upgrade-controller-57b66d6cbd-6p92n                     <none>  