/kind bug
Description
While migrating a CI setup from Docker to Podman, I occasionally run into Podman commands freezing. The freezes can last dozens (!!!) of minutes, with Podman not doing anything at all.
The hangs aren't specific to any command. E.g. right as I'm writing this text, I see two jobs, one with podman run … and another with podman inspect, both frozen. So I connected to the server over ssh and tried running time podman inspect foobar (literally a request for a non-existent foobar image), and it hung as well. podman ps hangs, and even podman version hangs!!
Basically, to be able to create this report I had to kill a podman process. I had 2 podman run processes and 2 podman inspect processes. I killed one of the podman inspect processes, and a little later CI finally proceeded and podman commands started working again.
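To tell a hang apart from a slow command without tying up a terminal, something like the following could be used. This is only a sketch: the probe helper and the PROBE_TIMEOUT variable are names made up for illustration, and it assumes GNU coreutils timeout is installed.

```shell
# probe CMD...: run CMD with a time limit and report whether it answered.
# PROBE_TIMEOUT (hypothetical knob, in seconds) defaults to 30.
probe() {
    limit="${PROBE_TIMEOUT:-30}"
    # -k 5: follow up with SIGKILL if the command ignores SIGTERM for 5 s.
    timeout -k 5 "$limit" "$@" >/dev/null 2>&1
    status=$?
    case "$status" in
        # GNU timeout exits 124 on timeout, 137 if SIGKILL was needed.
        124|137) echo "HUNG (no answer within ${limit}s): $*" ;;
        *)       echo "returned (exit $status): $*" ;;
    esac
}

# Example use on the affected host (not run here):
#   probe podman version
#   probe podman ps
#   probe podman inspect foobar
```

A non-zero exit for podman inspect foobar is expected since the image doesn't exist; the interesting case is the HUNG line.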
Steps to reproduce the issue:
I'm afraid I couldn't find any. It seems to happen when multiple podman processes run at the same time, but my attempts to simulate that in different ways didn't succeed. It just happens from time to time as part of CI, in which case CI basically breaks completely.
Steps to reproduce were found as part of this duplicate issue and are copied below:
1. This is the "fairly large" image:
podman pull ghcr.io/martinpitt/swaypod:latest
time podman create --userns=keep-id ghcr.io/martinpitt/swaypod:latest
2. This is the image that adds TeXlive (which makes it a few hundred MB larger):
podman pull ghcr.io/martinpitt/swaypod:allpkgs
time podman create --userns=keep-id ghcr.io/martinpitt/swaypod:allpkgs
Describe the results you received:
Step 1 takes 4 s on a Fedora 37 cloud VM (2 CPUs, 4 GiB RAM) with the default btrfs. On a standard RHEL 9.2 VM with XFS and on my laptop's Fedora 37 VM with /home being on ext4, it takes about 20 seconds. In top I see a process called "exe" which is taking 100% CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
972 admin 20 0 1351936 65172 28028 S 96.0 1.7 0:12.33 exe
That is really this:
admin 1972 95.0 1.3 1351680 49344 pts/0 Sl+ 04:04 0:01 storage-chown-by-maps /home/admin/.local/share/containers/storage/overlay/3cc2d72c07248c18a9185b6a5bba0e7932b0ce5c26dbc763e476eb50c2a7ea94/merged
With the larger image in step 2, the Fedora 37 btrfs VM takes a mere 6 s. However, on both the RHEL 9.2 XFS VM and my ext4 real-iron Fedora 37 laptop, the storage-chown-by-maps process never ends. After maybe half a minute it kills the VM (ssh dead, cannot log into the virsh console either), and my laptop becomes really sluggish; I cannot even start top any more. Trying to kill -9 or even sudo kill -9 (!) that storage-chown-by-maps process does not work either; it's just unkillable.
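A process that ignores kill -9 is often stuck in uninterruptible sleep (state D in ps). Whether that is the case here is an assumption, the report doesn't confirm it. A rough way to check, assuming procps ps; the dstate_only helper name is made up for illustration:

```shell
# dstate_only: read ps output on stdin and keep the header line plus every
# row whose second column (STAT) starts with "D" (uninterruptible sleep).
dstate_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example use on the affected host (not run here); wchan shows the kernel
# function the process is waiting in, which may hint at where it is stuck:
#   ps -eo pid,stat,wchan:32,args | dstate_only
```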
Describe the results you expected:
The storage-chown-by-maps process should finish eventually, and ideally reasonably fast. This is more or less a glorified chown -R, no? That shouldn't take more than a few seconds.
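To make the "glorified chown -R" comparison concrete: as far as I understand it, the only extra work per file is translating the owner through a UID/GID map. The sketch below shows just that arithmetic; it is not the actual storage-chown-by-maps code, map_id is a hypothetical helper, and the triples in the usage comment are the "container_id host_id size" values from the podman info output in this report (the machines used for the reproducer may have different maps).

```shell
# map_id CONTAINER_ID: translate a container UID/GID to a host ID.
# Reads "container_id host_id size" triples on stdin, one per line.
# Prints the host ID, or exits 1 if no range covers CONTAINER_ID.
map_id() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Example with the ranges from the podman info output in this report:
#   printf '0 998 1\n1 10000 65536\n' | map_id 1000
```

A per-file lookup like this is cheap, which is why a runtime of minutes (or never finishing) looks like something other than the mapping itself.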
Output of podman version:
Client: Podman Engine
Version: 4.3.1
API Version: 4.3.1
Go Version: go1.18.1
Built: Thu Jan 1 00:00:00 1970
OS/Arch: linux/amd64
Output of podman info:
host:
  arch: amd64
  buildahVersion: 1.28.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2:2.1.5-0ubuntu22.04+obs14.3_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: '
  cpuUtilization:
    idlePercent: 96.22
    systemPercent: 0.76
    userPercent: 3.03
  cpus: 4
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: file
  hostname: node29
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 998
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
  kernel: 5.15.0-52-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 1288302592
  memTotal: 67404197888
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun_1.7-0ubuntu22.04+obs47.1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.7
      commit: 40d996ea8a827981895ce22886a9bac367f87264
      rundir: /run/user/998/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/998/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.0-0ubuntu22.04+obs10.15_amd64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 8575172608
  swapTotal: 8589930496
  uptime: 223h 19m 11.00s (Approximately 9.29 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/gitlab-runner/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/gitlab-runner/.local/share/containers/storage
  graphRootAllocated: 983350071296
  graphRootUsed: 645746360320
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 202
  runRoot: /tmp/podman-run-998/containers
  volumePath: /home/gitlab-runner/.local/share/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 0
  BuiltTime: Thu Jan 1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.18.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1
Package info:
$ apt list podman
Listing... Done
podman/unknown,now 4:4.3.1-0ubuntu22.04+obs64.3 amd64 [installed]
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 arm64
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 armhf
podman/unknown 4:4.3.1-0ubuntu22.04+obs64.3 s390x
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)
Yes