#890824 Container: unsets cgroup memory limit on user login

Package:
systemd
Source:
systemd
Description:
system and service manager
Submitter:
Maximilian Philipps
Date:
2022-07-06 13:09:03 UTC
Severity:
normal
Tags:
#890824#5
Date:
2018-02-19 12:09:13 UTC
From:
To:
Hi

I have an issue with systemd unsetting the memory limit for my container,
whereupon programs like free and htop report having access to 8 exabytes
of memory.

The setup is the following:

Host:
Release: Debian jessie
Kernel: 4.9.65-3+deb9u2~bpo8+1 (jessie backports)
Container provider: libvirt 3.0.0-4~bpo8+1 (jessie backports)
Systemd: 215-17+deb8u7 (jessie)
cgroup hierarchy: legacy

Guest:
Release: Debian stretch
Systemd: 232-25+deb9u1 (stretch)

There are several containers running on the host, but this problem
occurs only with the Debian stretch containers. Containers running Debian
jessie or the older Ubuntu 12.04 aren't affected.
Each container is configured with a cgroup-enforced memory limit in its
libvirt domain file.
Example:
<memory unit='KiB'>4194304</memory>
<memory unit='KiB'>2097152</memory>

Steps to reproduce + observations:
1) start a container with virsh -c lxc:// start container.example.com
2) virsh -c lxc:// memtune container.example.com
    reports a hard_limit of 2097152
3) cat "/sys/fs/cgroup/memory/machine.slice/machine-<container-name>.scope/memory.limit_in_bytes"
    outputs 2147483648
4) nsenter -t <pid> -m -u -i -n -p free
    reports 2097152 kB
5) ssh container.example.com free
    reports 9007199254740991 kB
6) cat "/sys/fs/cgroup/memory/machine.slice/machine-<container-name>.scope/memory.limit_in_bytes"
    outputs 9223372036854771712
7) nsenter -t <pid> -m -u -i -n -p free
    reports 9007199254740991 kB
8) virsh -c lxc:// memtune container.example.com
    reports a hard_limit of unlimited
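
The comparison between the libvirt hard_limit (in KiB) and the cgroup value (in bytes) in the steps above can be scripted. The following is a minimal sketch, not part of the original report; the helper names are hypothetical, and it assumes the cgroup v1 file memory.limit_in_bytes holds a single integer.

```python
from pathlib import Path


def cgroup_limit_bytes(path):
    """Read a cgroup v1 memory.limit_in_bytes file as an integer."""
    return int(Path(path).read_text().strip())


def matches_hard_limit(limit_bytes, hard_limit_kib):
    """True when the cgroup limit equals libvirt's hard_limit given in KiB."""
    return limit_bytes == hard_limit_kib * 1024


# Values taken from the steps above.
print(matches_hard_limit(2147483648, 2097152))           # before login
print(matches_hard_limit(9223372036854771712, 2097152))  # after login
```

Before the login the two values agree (2097152 KiB is 2147483648 bytes); after the login they no longer do.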

As far as I can tell, systemd unsets the cgroup memory limit when
creating the user session. However, I don't know why it gets set to
9223372036854771712 instead of the 255G of the host.
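
One plausible reading of the odd number, offered as an assumption and not confirmed anywhere in this report: the value matches the largest signed 64-bit integer rounded down to a page boundary, which is how a cgroup v1 memory controller with no limit would show up on a system with 4096-byte pages. The arithmetic can be checked with a short sketch:

```python
PAGE_SIZE = 4096      # assumption: 4 KiB pages on the host
LONG_MAX = 2**63 - 1  # largest signed 64-bit value

# Largest page-aligned byte count that fits in a signed 64-bit integer.
unlimited_bytes = LONG_MAX // PAGE_SIZE * PAGE_SIZE
print(unlimited_bytes)  # 9223372036854771712

# The same ceiling expressed in KiB, the unit free reports.
unlimited_kib = LONG_MAX // 1024
print(unlimited_kib)    # 9007199254740991

# Both are a hair under 2**63 bytes, i.e. 8 EiB.
print(unlimited_bytes < 2**63, unlimited_kib < 2**53)
```

Under that reading the limit is not being set to a specific size at all; it is simply removed, and the tools print the "no limit" sentinel.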


In any case I am looking forward to a better solution than resetting the
limits through cron every minute.

#890824#10
Date:
2018-02-19 12:50:46 UTC
From:
To:
On 19.02.2018 at 13:09, Maximilian Philipps wrote:
systemd v232) resets the memory limits on the host (running v215)?

#890824#15
Date:
2018-02-19 13:07:37 UTC
From:
To:
No, the host still sees the 255GB. The systemd in the guest resets
the limits for the container when someone logs in.
In terms of the cgroup hierarchy, /sys/fs/cgroup/memory/memory.limit_in_bytes
is always 9223372036854771712, which appears to be treated as no
restriction on the host.
However, the memory.limit_in_bytes within the machine scope does change.

#890824#20
Date:
2018-02-19 13:12:10 UTC
From:
To:
On second thought, maybe you assumed that the cgroup namespace is
unshared?
This is not the case: cgroup namespaces are fairly new and, as far as I
know, not supported by libvirt-lxc.

#890824#25
Date:
2018-09-09 04:56:30 UTC
From:
To:
Would you mind testing with systemd v239 from unstable/testing and
eventually raising this upstream at https://github.com/systemd/systemd?

tbh, I'm not sure what the expected behaviour is in that regard, or
whether this may be just a configuration issue.

#890824#34
Date:
2019-10-25 09:23:52 UTC
From:
To:
Recently updated one of the hosts and the containers running on it from
stretch to buster.

With buster's 241-7~deb10u1 the issue still exists. I have tried working
around this issue by setting a memory limit on the -.slice from within
the container, but this is fairly unreliable.

#890824#39
Date:
2019-10-25 12:00:38 UTC
From:
To:
hi,

After digging a bit more, it appears that after the update from stretch
to buster we are using some mix of cgroupv1 and cgroupv2.

/sys/fs/cgroup/ is still a tmpfs and /sys/fs/cgroup/unified/ exists, but
has no controllers. So apparently systemd should still use the
controllers from v1 with the hierarchy from v2?


Can anyone confirm that memory resource management works at all on buster?

#890824#44
Date:
2019-10-25 14:35:02 UTC
From:
To:
hi

I can now reliably trigger the 8 exabyte issue. When I start a
libvirt-lxc container, libvirt sets the memory limit.

This can be seen with:

cat /sys/fs/cgroup/memory/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope/memory.limit_in_bytes

2147483648

If I now call systemctl daemon-reload on the host, the memory limit jumps to

9223372036854771712

I can prevent this by setting MemoryMax for the scope on the host:

systemctl set-property --runtime "machine-lxc\x2d27166\x2dhost.domain.tld.scope" MemoryMax=2147483648

I need to know the pid used in the machine name and therefore can really
only set it at runtime.
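
Since the scope name depends on the pid, the set-property call has to be assembled at runtime. A minimal sketch of that step follows. The function names are hypothetical, the lxc-<pid>-<name> pattern is inferred from the single unit name shown above, and the escaping is a simplified approximation of systemd's unit name escaping (it only keeps ASCII alphanumerics, ":", "_" and "." and hex-escapes everything else).

```python
def escape_unit_component(name):
    """Simplified approximation of systemd unit name escaping."""
    out = []
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch in ":_."):
            out.append(ch)
        else:
            # e.g. "-" becomes \x2d
            out.append("".join(f"\\x{b:02x}" for b in ch.encode("utf-8")))
    return "".join(out)


def machine_scope(pid, name, driver="lxc"):
    """Build the scope unit name, assuming the <driver>-<pid>-<name> pattern."""
    return f"machine-{escape_unit_component(f'{driver}-{pid}-{name}')}.scope"


def set_memory_max_command(pid, name, limit_bytes):
    """Return the argv for the systemctl call used in this message."""
    return ["systemctl", "set-property", "--runtime",
            machine_scope(pid, name), f"MemoryMax={limit_bytes}"]


print(machine_scope(27166, "host.domain.tld"))
```

The returned list could be passed to subprocess.run on the host once the container's pid is known; whether the property survives later reloads is exactly what this report questions.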

However, this isn't enough to prevent the 8 exabyte issue. For some
reason, when I do a systemctl daemon-reload on the host, systemd also
changes the cgroup membership of some processes. Prior to reloading there
were 3 processes directly in the machine-lxc...scope: a
/usr/lib/libvirt/libvirt_lxc process, the /sbin/init process of the
container, and another process that I can't find in /proc/. Maybe a pid
from within the container?

After reloading only the /sbin/init process remains in the scope, the
libvirt_lxc process gets kicked back to the libvirtd.service cgroup and
the "ghost" task disappears.

Before reload:

11:blkio:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
10:freezer:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
9:perf_event:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
8:pids:/system.slice/libvirtd.service
7:cpu,cpuacct:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
6:rdma:/
5:devices:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
4:memory:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
2:cpuset:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
1:name=systemd:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
0::/system.slice/libvirtd.service

After reload:

11:blkio:/system.slice/libvirtd.service
10:freezer:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
9:perf_event:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
8:pids:/system.slice/libvirtd.service
7:cpu,cpuacct:/system.slice/libvirtd.service
6:rdma:/
5:devices:/system.slice/libvirtd.service
4:memory:/system.slice/libvirtd.service
3:net_cls,net_prio:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
2:cpuset:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
1:name=systemd:/machine.slice/machine-lxc\x2d27166\x2dhost.domain.tld.scope
0::/system.slice/libvirtd.service
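
The two listings above are easier to compare mechanically. The following is a minimal sketch (hypothetical helper names) that parses text in the /proc/<pid>/cgroup format and reports which controllers changed path. It assumes each line has the form id:controllers:path.

```python
def parse_proc_cgroup(text):
    """Map each controller list (e.g. "cpu,cpuacct") to its cgroup path."""
    result = {}
    for line in text.strip().splitlines():
        _hierarchy_id, controllers, path = line.strip().split(":", 2)
        result[controllers] = path
    return result


def moved_controllers(before_text, after_text):
    """Return {controllers: (old_path, new_path)} for entries that changed."""
    before = parse_proc_cgroup(before_text)
    after = parse_proc_cgroup(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


before = "4:memory:/machine.slice/m.scope\n8:pids:/system.slice/libvirtd.service"
after = "4:memory:/system.slice/libvirtd.service\n8:pids:/system.slice/libvirtd.service"
print(moved_controllers(before, after))
```

Applied to the full listings, this would flag blkio, cpu,cpuacct, devices and memory as having moved from the machine scope to libvirtd.service, while freezer, perf_event, cpuset and name=systemd stay put.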

#890824#49
Date:
2021-03-28 02:52:08 UTC
From:
To:
Hi Maximilian,

can you please check whether you can still reproduce the issue on bullseye,
where cgroupv2, i.e. unified, is the default cgroup hierarchy?

Regards,
Michael

On 25.10.2019 at 16:35, Maximilian Philipps wrote:

#890824#54
Date:
2021-03-29 05:49:24 UTC
From:
To:
hi Michael,

I currently can't test that. Given that bullseye isn't released yet, I
don't have a test environment here.

When bullseye is released I will try to test it again. For the time being I
have moved all libvirt-lxc containers to lxc.

Regards,

Maximilian Philipps

#890824#59
Date:
2022-07-05 17:22:34 UTC
From:
To:
Any updates here?
Ideally, if you run bullseye and you still encounter the problem,
install systemd v250 from bullseye-backports and if the problem
persists, file it upstream at https://github.com/systemd/systemd/issues/
and report back with the issue number

Regards,
Michael

#890824#64
Date:
2022-07-06 07:17:26 UTC
From:
To:
hi,
I can't reproduce this anymore because we have migrated away from
libvirt-lxc. We are now using 'lxc', which appears to be more
reliable.

#890824#69
Date:
2022-07-06 13:05:02 UTC
From:
To:
On 06.07.22 at 09:17, Maximilian Philipps wrote:

Ok, let's close this bug report then.
It's possible that libvirt-lxc was doing things behind systemd's back
which triggered this issue.

Regards,
Michael