--userns=keep-id storage-chown-by-maps kills machine with large images #16541

@Hi-Angel

/kind bug

Description

While migrating a CI from Docker to Podman, I'm occasionally stumbling upon freezes of Podman commands. They may take dozens (!!!) of minutes, with Podman not doing anything at all.

The hangs aren't specific to any command. E.g. right as I'm writing this text, I see two jobs, one with podman run … and another with podman inspect, both frozen. So I connected to the server over ssh and tried running time podman inspect foobar (literally a request for a non-existent foobar image), and it hung as well. podman ps hangs, and even podman version hangs!!

Basically, to be able to create this report I had to kill a podman process. I had 2 podman run processes and 2 podman inspect processes. I killed one of the podman inspect processes, and a little later CI finally proceeded and podman commands started working again.
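For anyone trying to catch the hang in CI, a minimal sketch of a probe that runs a cheap command such as podman version under a time limit and reports whether it finished. It assumes timeout(1) from GNU coreutils is available; the function name probe and the PROBE_LIMIT variable are made up for this example and are not Podman settings.

```shell
#!/bin/sh
# probe CMD...: run CMD with a time limit and report whether it finished.
# PROBE_LIMIT (seconds) is a hypothetical knob for this sketch, default 30.
probe() {
    limit="${PROBE_LIMIT:-30}"
    start=$(date +%s)
    status=0
    # GNU timeout exits with status 124 when the limit is reached.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    elapsed=$(( $(date +%s) - start ))
    if [ "$status" -eq 124 ]; then
        echo "HUNG after ${elapsed}s: $*"
        return 1
    fi
    echo "finished in ${elapsed}s (exit $status): $*"
    return 0
}
```

Calling probe podman version before each CI job would at least log when the freeze starts. It only bounds the probe itself and does nothing about whatever process is causing the hang.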

Steps to reproduce the issue:

I'm afraid I couldn't find any. It seems to happen when multiple podman processes run at the same time, but my attempts to simulate that in different ways didn't succeed. It just happens from time to time as part of CI, in which case CI basically breaks completely.

Steps to reproduce were found as part of this duplicate issue and are copied below:

  1. This is the "fairly large" image:

    podman pull ghcr.io/martinpitt/swaypod:latest
    time podman create --userns=keep-id ghcr.io/martinpitt/swaypod:latest
    
  2. This is the image that adds TeXlive (which makes it a few hundred MB larger):

    podman pull ghcr.io/martinpitt/swaypod:allpkgs
    time podman create --userns=keep-id ghcr.io/martinpitt/swaypod:allpkgs
    

Describe the results you received:

Step 1 takes 4 s on a Fedora 37 cloud VM (2 CPUs, 4 GiB RAM) with the default btrfs. On a standard RHEL 9.2 VM with XFS and on my laptop's Fedora 37 VM with /home being on ext4, it takes about 20 seconds. In top I see a process called "exe" which is taking 100% CPU:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
972 admin     20   0 1351936  65172  28028 S  96.0   1.7   0:12.33 exe

That is really this:

admin       1972 95.0  1.3 1351680 49344 pts/0   Sl+  04:04   0:01 storage-chown-by-maps /home/admin/.local/share/containers/storage/overlay/3cc2d72c07248c18a9185b6a5bba0e7932b0ce5c26dbc763e476eb50c2a7ea94/merged
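The name exe in top is only the short process name. A minimal sketch of how the full command line can be recovered for a given PID on Linux, by reading /proc/PID/cmdline; the helper name real_cmd is made up for this example. Why the helper shows up as exe in the first place is not confirmed in this report; presumably it is started by re-executing the running binary, but treat that as an assumption.

```shell
#!/bin/sh
# real_cmd PID: print the full command line of PID as recorded by the kernel.
real_cmd() {
    # /proc/PID/cmdline separates arguments with NUL bytes; turn them into spaces.
    tr '\0' ' ' < "/proc/$1/cmdline"
    echo
}
```

For the process above this would be called as real_cmd 1972. ps -o args= -p 1972 gives the same information.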

With the larger image in step 2, the Fedora 37 btrfs VM takes merely 6 s. However, on both the RHEL 9.2 XFS VM and my ext4 real-iron Fedora 37 laptop, the storage-chown-by-maps process never ends. After maybe half a minute it kills the VM (ssh is dead, and I cannot log into the virsh console either), and my laptop becomes really sluggish; I cannot even start top any more. Trying to kill -9 or even sudo kill -9 (!) that storage-chown-by-maps process does not work either, it's just unkillable.

Describe the results you expected:

The storage-chown-by-maps process should finish eventually, and ideally reasonably fast. This is more or less a glorified chown -R, no? That shouldn't take more than a few seconds.
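To make the "glorified chown -R" comparison concrete, here is a rough sketch of what such a walk amounts to: every entry under a directory has its owner translated through an ID mapping range and then needs one chown. This is not the actual storage-chown-by-maps code. The function names map_id and plan_chown are made up, only a single mapping range and only UIDs are handled, GNU stat is assumed, and the sketch prints the chown commands instead of running them.

```shell
#!/bin/sh
# map_id ID CONTAINER_START HOST_START SIZE
# Translate ID through one mapping range; fail if ID lies outside the range.
map_id() {
    n=$1; cstart=$2; hstart=$3; size=$4
    if [ "$n" -ge "$cstart" ] && [ "$n" -lt $((cstart + size)) ]; then
        echo $((hstart + n - cstart))
    else
        return 1
    fi
}

# plan_chown DIR CONTAINER_START HOST_START SIZE
# Dry run: print the chown command each entry under DIR would need.
# Limitation of this sketch: paths containing newlines are not handled.
plan_chown() {
    dir=$1; shift
    find "$dir" -print | while IFS= read -r path; do
        uid=$(stat -c %u "$path")   # GNU stat: numeric owner of the entry
        if new=$(map_id "$uid" "$@"); then
            echo "chown -h $new $path"
        fi
    done
}
```

With the second uidmap range from the podman info output further down (container_id 1, host_id 10000, size 65536), map_id 1000 1 10000 65536 prints 10999. The point of the sketch is that the work is one stat and one chown per file, so the cost grows with the number of files in the image.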

Output of podman version:

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.18.1
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.28.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2:2.1.5-0ubuntu22.04+obs14.3_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: '
  cpuUtilization:
    idlePercent: 96.22
    systemPercent: 0.76
    userPercent: 3.03
  cpus: 4
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: file
  hostname: node29
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
  kernel: 5.15.0-52-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 1288302592
  memTotal: 67404197888
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun_1.7-0ubuntu22.04+obs47.1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.7
      commit: 40d996ea8a827981895ce22886a9bac367f87264
      rundir: /run/user/998/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/998/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.0-0ubuntu22.04+obs10.15_amd64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 8575172608
  swapTotal: 8589930496
  uptime: 223h 19m 11.00s (Approximately 9.29 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/gitlab-runner/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/gitlab-runner/.local/share/containers/storage
  graphRootAllocated: 983350071296
  graphRootUsed: 645746360320
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 202
  runRoot: /tmp/podman-run-998/containers
  volumePath: /home/gitlab-runner/.local/share/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 0
  BuiltTime: Thu Jan  1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.18.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1

Package info:

$ apt list podman
Listing... Done
podman/unknown,now 4:4.3.1-0ubuntu22.04+obs64.3 amd64 [installed]
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 arm64
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 armhf
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 s390x

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes
