Notes: OOM? Node Crash
Symptoms
A Kubernetes (EKS) node becomes "Not Ready" and the Latch Pod running on it becomes unresponsive when certain GPU workloads run.
- Impossible to get a K8s Node Shell
- AWS Instance Connect not connecting (unclear if this is an issue with all our nodes or just the crashed ones)
- Terminating the instance in AWS takes a very long time (>10 min)
- Pod is stuck in "Terminating" for a long time (>30 min)
- New pod cannot run because it needs the EBS volume (PersistentVolumeClaim) to be available, but it is stuck attached to the old dead instance
Note: this also looks similar to a confirmed unresolved OOM crash on Forch nodes. This comes up later.
Hot Fix
Force-delete the k8s pod. Terminate the AWS instance. Force-detach the EBS volume (risks corruption? does this matter with journaling on ext4? does overlayfs have proper recovery? it has caused no issues so far). Occasionally we also have to delete the new "Initializing" k8s pod for it to reschedule onto a live node (presumably because this re-triggers the "try to attach the PVC" process).
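Roughly, in commands (the pod, instance, and volume identifiers are placeholders, not real values):
kubectl delete pod <stuck-pod> --grace-period=0 --force
aws ec2 terminate-instances --instance-ids <i-instance-id>
aws ec2 detach-volume --volume-id <vol-volume-id> --force
# if the replacement pod is stuck "Initializing", delete it to re-trigger the PVC attach
kubectl delete pod <new-pod>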
1
Rahul suggested this is due to an OOM in the pod. The assumption is that some important process (network connectivity?) dies to the OOM killer and kubelet dies (killing everything k8s-related), then AWS can't reach the instance to do a "clean" shutdown and it gets stuck terminating until some timeout triggers a "forced" shutdown (which apparently cannot be triggered manually).
2
Made a new Latch Pod to investigate OOM-killer settings (OOM scores) for processes on the node. Node-shell in via AptaKube.
Reading everything from /proc.
Logging:
- PID (the directory name)
- process name from /comm
- command line from /cmdline
- OOM score from /oom_score
- OOM score adjustment from /oom_score_adj (note that oom_adj is legacy)
We are going to print stuff hierarchically (like ps -H).
Running the script using uv, placing it on the node by copy-pasting into nano (with rm -f oom_info.py && nano oom_info.py).
# /// script
# requires-python = ">=3.13"
# dependencies = []
# ///

import re
import shlex
from argparse import ArgumentParser
from dataclasses import dataclass
from pathlib import Path

argp = ArgumentParser()
_ = argp.add_argument("--extended", action="store_true")
args = argp.parse_args()

pid_re = re.compile(r"^\d+$")
status_line_re = re.compile(r"^(?P<key>[^:]+):\s*(?P<val>.*)$")


@dataclass(kw_only=True)
class Proc:
    pid: int
    ppid: int
    status: dict[str, str]
    cmd_line: list[str]
    comm: str
    oom_score: int
    oom_score_adj: int


data: dict[int, Proc] = {}
child_procs: dict[int, list[int]] = {}

proc_p = Path("/proc/")
for x in proc_p.iterdir():
    if pid_re.match(x.name) is None:
        continue

    pid = int(x.name)
    try:
        status_str = (x / "status").read_text()
        cmd_line = (part.strip() for part in (x / "cmdline").read_text().split("\0"))
        comm = (x / "comm").read_text().strip()
        oom_score = int((x / "oom_score").read_text())
        oom_score_adj = int((x / "oom_score_adj").read_text())
    except (NotADirectoryError, FileNotFoundError):
        # the process went away between listing /proc and reading its files
        continue

    status: dict[str, str] = {}
    for l in status_str.split("\n"):
        l = l.strip()
        if l == "":
            continue

        m = status_line_re.match(l)
        if m is None:
            raise RuntimeError(f"could not match /proc/{pid}/status line: {l!r}")

        status[m.group("key")] = m.group("val")

    ppid = int(status["PPid"])
    data[pid] = Proc(
        pid=pid,
        ppid=ppid,
        status=status,
        cmd_line=[x for x in cmd_line if x != ""],
        comm=comm,
        oom_score=oom_score,
        oom_score_adj=oom_score_adj,
    )
    child_procs.setdefault(ppid, []).append(pid)

for xs in child_procs.values():
    xs.sort()

# print(child_procs)


def render(pid: int, *, indent: str = "") -> None:
    x = data[pid]
    if x.comm == "kthread" and x.ppid == 1:
        return

    cmd_line = " ".join(shlex.quote(x) for x in x.cmd_line)
    if args.extended:
        print(f"{indent}[{x.pid}] {x.comm}")
        print(f"{indent} {cmd_line}")
        print(f"{indent} {x.oom_score} ({x.oom_score_adj:+})")
    else:
        print(f"{indent}[{x.pid}] {x.comm}: {x.oom_score} ({x.oom_score_adj:+})")

    for c in child_procs.get(pid, []):
        render(c, indent=indent + " ")


render(1)
This produces the following:
[1] systemd: 0 (+0)
[682] systemd-journal: 500 (-250)
[714] systemd-udevd: 0 (-1000)
[1058] rpcbind: 666 (+0)
[1059] systemd-timesyn: 666 (+0)
[1124] systemd-network: 666 (+0)
[1127] systemd-resolve: 666 (+0)
[1226] acpid: 666 (+0)
[1231] crio: 1 (-999)
[1232] cron: 666 (+0)
[1233] dbus-daemon: 67 (-900)
[1242] networkd-dispat: 666 (+0)
[1246] amazon-ssm-agen: 666 (+0)
[1637] ssm-agent-worke: 666 (+0)
[1310] sysbox-mgr: 334 (-500)
[1334] systemd-logind: 666 (+0)
[1360] atd: 666 (+0)
[1373] agetty: 666 (+0)
[1374] agetty: 666 (+0)
[1395] unattended-upgr: 666 (+0)
[1501] sysbox-fs: 334 (-500)
[1537] sh: 666 (+0)
[1565] sleep: 666 (+0)
[1545] containerd: 1 (-999)
[9328] kubelet: 1 (-999)
<SNIP other pods>
[999004] conmon: 0 (-1000)
[999026] systemd: 2 (-997)
[107945] packagekitd: 2 (-997)
[107957] polkitd: 2 (-997)
[999716] systemd-journal: 500 (-250)
[999990] systemd-resolve: 2 (-997)
[1003069] dbus-daemon: 67 (-900)
[1003126] security.py: 2 (-997)
[1003156] python: 2 (-997)
[1003147] sshd_keys_monit: 2 (-997)
[1003170] networkd-dispat: 2 (-997)
[1003222] systemd-logind: 2 (-997)
[1003237] containerd: 1 (-999)
[1003239] sshd: 2 (-997)
[1003271] dockerd: 334 (-500)
[1004617] master: 2 (-997)
[1004619] qmgr: 2 (-997)
[1808057] pickup: 2 (-997)
[1004816] unattended-upgr: 2 (-997)
[1004852] systemd: 2 (-997)
[1004861] (sd-pam): 2 (-997)
[1004927] jupyter-lab: 2 (-997)
[1038371] bash: 2 (-997)
[1888242] python: 2 (-997)
[1887631] pyright-langser: 2 (-997)
[1887657] MainThread: 2 (-997)
[2061993] python: 2 (-997)
[2062665] python: 2 (-997)
[2062668] python: 2 (-997)
[2062673] python: 2 (-997)
[2062677] python: 2 (-997)
[2062681] python: 2 (-997)
[2062685] python: 2 (-997)
[2062689] python: 2 (-997)
[2062693] python: 2 (-997)
[2062696] python: 2 (-997)
[1004951] rserver: 2 (-997)
[1004961] ssh-agent: 2 (-997)
[1004980] dbus-daemon: 2 (-997)
[1282005] snapd: 67 (-900)
[1293746] sshd: 0 (-1000)
<SNIP other pods>
[2057270] polkitd: 666 (+0)
Notes:
- The processes inside the pod (under conmon) all have a very low score and a very highly negative adjustment. This is unexpected.
  - These are under cgroups, so the way the OOM killer treats them could be different (it's not necessarily killing the highest-score process like it would if everything were in one cgroup); see the check below
- amazon-ssm-agent has a relatively high score, which might account for why the instance does not terminate properly
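A quick sanity check of which cgroup one of the in-pod processes actually lives in (PID taken from the tree above; any in-pod PID works):
cat /proc/999026/cgroup
cat /proc/999026/oom_score /proc/999026/oom_score_adj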
3
Some reading on OOM killer:
- The killer only triggers if memory over-provisioning is enabled (see kernel docs). If you do not overcommit you will get NULL from malloc and never trigger system-level OOM.
- The only explanation of the Linux OOM handling seems to be an ancient article. This definitely predates cgroups
- cgroups v1 had some basic OOM-killer controls. See "10. OOM Control" in the kernel docs.
- cgroup v2 docs are here
- cgroups v1/v2 also both have memory pressure stuff. v2 uses "Pressure Stall Information"
- cgroup v2 memory.max can trigger the OOM killer. This might explain why we see the issue primarily on GPU pods, since those nodes are used by exactly one Latch Pod, meaning the node could actually reach a system OOM before the pod reaches the cgroup memory max. This also would connect it to Forch, since Forch allocates an instance of exactly the right size for each workload.
- The only "OOM" setting in cgroup v2 is memory.oom.group. If this is 1, then if anything in the cgroup is OOM-killed the whole cgroup is killed.
- We can look at the cgroup memory event stats to see if the OOM killer has ever been triggered (see the snippet after this list)
- Teaching the OOM killer about control groups details multiple proposals for how the killer should interact with cgroups. Notably, the stuff that finally shipped in the kernel could be different from any of the described proposals
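Checking the event stats is just reading the cgroup v2 memory.events file for the pod's cgroup; the oom / oom_kill counters say whether the killer has ever fired there (counter values below are illustrative, not from a real run):
cat /sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556/memory.events
# low 0
# high 0
# max 0
# oom 0
# oom_kill 0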
Our pod's cgroup has memory.oom.group=1:
$ cat /sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556/crio-2c6ca22c5968ca227b2b77480c87fa56044a98fb92280ce03e613d95b6ea505b/memory.oom.group
1
There is also a nested systemd cgroup v2 hierarchy under init.scope there which does not have it, but it shouldn't matter.
There are userspace OOM killers too (e.g. systemd-oomd). And you can use eBPF for OOM. This can be useful if we need to patch the kernel OOM process or for debugging.
4
We want to know how modern kernels actually deal with OOM when cgroups are present. We read the source code:
- The OOM killer is in the kernel source code as out_of_memory
- It generally just kills the process with the highest score
- But note that it already takes a specific cgroup to kill. There are also many other "out of memory" things that are not related to the system OOM state, that you can find in the source code, like allocator-specific OOM procedures
- See pagefault_out_of_memory in the same file for the OOM killer entrypoint triggered by a "generic" system-level OOM. This one does not have a cgroup chosen yet. I do not understand how this actually ends up doing anything though, since it contains nothing but rate-limiting code. There is probably some sort of callback system
- There is a cgroup OOM handler called mem_cgroup_oom. This calls out_of_memory internally but also already has a cgroup reference. try_charge_memcg is what calls it, but it again has already selected a cgroup
  - This all seems unrelated because I think this is for the non-overcommitted flow
The actual handler turns out to be the innocuously named mem_cgroup_oom_synchronize.
- This reads the current->memcg_in_oom to get the cgroup. It comes from memcg1_oom_prepare
- Since that only gets set when the cgroup has the OOM killer disabled, I don't think it will actually be the one to die. Unclear
After reading the comment above pagefault_out_of_memory, I think it is actually a "secondary" handler that is called by tasks that are inside a cgroup, when they get an OOM pagefault due to cgroup limits. Specifically:
- it does not handle tasks that are not part of a cgroup,
- it expects the current task's cgroup to already be marked as OOM, otherwise it will refuse to handle it,
- the cgroup is marked as OOM due to reaching memory.max in the allocation process,
- a different OOM path happens during allocation for non-cgroup-related OOMs.
5
Searching for out_of_memory calls reveals:
- lru_gen_age_node, which apparently is the Multi-Gen LRU page reclamation system. This looks promising: "page reclaim decides the kernel's caching policy and ability to overcommit memory". On a closer look, it iterates over cgroups but never submits a cgroup to out_of_memory, instead only setting gfp_mask.
- __alloc_pages_may_oom from page_alloc.c, which looks interesting but never uses a cgroup.
One final lead is the mm_fault_error from a stack trace on a StackOverflow post. Note that the post itself is about cgroup v1 and also seemingly a memory.max OOM, not a system OOM.
6
Time for some experiments on a fresh AWS instance running an Amazon Linux-based AMI.
oom_info.py:
[1] systemd: 0 (+0)
[1068] systemd-journal: 520 (-250)
[1725] systemd-udevd: 0 (-1000)
[2398] auditd: 0 (-1000)
[2400] systemd-resolve: 667 (+0)
[2421] nfsdcld: 666 (+0)
[2452] rpc.idmapd: 666 (+0)
[2491] dbus-broker-lau: 67 (-900)
[2497] dbus-broker: 67 (-900)
[2530] irqbalance: 666 (+0)
[2537] rngd: 666 (+0)
[2538] sysbox-mgr: 335 (-500)
[2541] systemd-homed: 667 (+0)
[2544] systemd-logind: 667 (+0)
[2581] systemd-network: 667 (+0)
[2720] agent: 683 (+0)
[2722] trace-agent: 670 (+0)
[2807] gssproxy: 666 (+0)
[4227] sshd: 0 (-1000)
[2172743] sshd: 667 (+0)
[2174918] sshd: 666 (+0)
[2174919] bash: 666 (+0)
[2175385] uv: 669 (+0)
[2175453] python: 668 (+0)
[2175459] process: 0 (+0)
[4255] agetty: 666 (+0)
[4256] agetty: 666 (+0)
[4257] rpcbind: 666 (+0)
[4291] rpc.statd: 666 (+0)
[4305] rpc.mountd: 666 (+0)
[4495] chronyd: 666 (+0)
[4594] sysbox-fs: 335 (-500)
[4667] sleep: 666 (+0)
[4681] containerd: 6 (-999)
[4822] dockerd: 341 (-500)
[2174519] systemd-userdbd: 666 (+0)
[2174581] systemd-userwor: 667 (+0)
[2174582] systemd-userwor: 667 (+0)
[2174583] systemd-userwor: 667 (+0)
[2174689] systemd: 734 (+100)
[2174733] (sd-pam): 733 (+100)
Note that a lot of things are different because the EKS nodes are Ubuntu-based. I'm connected using AWS Instance Connect (shows up as an SSH session).
With a container:
[2176408] containerd-shim: 4 (-998)
[2176452] sh: 666 (+0)
This is running /bin/sh under alpine (with a TTY attached).
Using sysbox-runc vs the default runtime doesn't change anything.
Running a container with a low memory limit of 6m (the lowest that Docker allows), and then running head -c 5000m /dev/zero | tail to trigger OOM, produces the following dmesg logs:
[425561.202382] tail invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[425561.203483] CPU: 1 PID: 2179683 Comm: tail Not tainted 6.1.132-147.221.amzn2023.x86_64 #1
[425561.204467] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[425561.205279] Call Trace:
[425561.205737] <TASK>
[425561.206158] dump_stack_lvl+0x34/0x48
[425561.206699] dump_header+0x4c/0x22d
[425561.207223] oom_kill_process.cold+0xb/0x10
[425561.207805] out_of_memory+0xed/0x2e0
[425561.208344] ? sched_clock_cpu+0x9/0xb0
[425561.208890] mem_cgroup_out_of_memory+0x136/0x150
[425561.209517] try_charge_memcg+0x7d7/0x890
[425561.210073] charge_memcg+0x35/0xe0
[425561.210581] __mem_cgroup_charge+0x29/0x80
[425561.211132] do_anonymous_page+0x108/0x570
[425561.211688] __handle_mm_fault+0x513/0x5e0
[425561.212249] handle_mm_fault+0xc5/0x2b0
[425561.212792] do_user_addr_fault+0x1a1/0x570
[425561.213358] exc_page_fault+0x62/0x140
[425561.213878] asm_exc_page_fault+0x22/0x30
[425561.214418] RIP: 0033:0x7f8844e6d46e
[425561.214933] Code: 89 7d 10 49 83 ef 14 48 83 c7 10 49 81 e0 00 f0 ff ff 31 d2 25 ff 0f 00 00 4c 09 c0 48 89 45 20 4a 8d 2c 37 4c 01 ff 4d 29 f7 <c6> 07 00 45 29 e7 74 1b 44 89 f8 f7 d8 48 98 c6 04 07 00 41 83 ff
[425561.217045] RSP: 002b:00007ffc9fefcdd0 EFLAGS: 00010202
[425561.217719] RAX: 0000000000537fe0 RBX: 00007f88445a8270 RCX: 00007f8844e7d690
[425561.218526] RDX: 0000000000000000 RSI: 0000000000536000 RDI: 00007f8844adeffc
[425561.219352] RBP: 00007f88445a8270 R08: 0000000000537000 R09: 0000000000000000
[425561.220170] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000536000
[425561.220987] R13: 00007f8844adde70 R14: 0000000000000260 R15: 0000000000536d8c
[425561.221803] </TASK>
[425561.222263] memory: usage 6144kB, limit 6144kB, failcnt 146
[425561.223222] swap: usage 0kB, limit 6144kB, failcnt 0
[425561.223865] Memory cgroup stats for /system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope:
[425561.223878] anon 5775360
file 0
kernel 516096
kernel_stack 49152
pagetables 110592
sec_pagetables 0
percpu 1328
sock 0
vmalloc 12288
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 5763072
active_anon 12288
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 97984
slab_unreclaimable 182768
slab 280752
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 6
pgsteal 1
pgscan_kswapd 0
pgscan_direct 6
pgsteal_kswapd 0
pgsteal_direct 1
pgfault 3644
pgmajfault 0
pgrefill 5
pgactivate 5
pgdeactivate 5
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 0
thp_collapse_alloc 0
[425561.237528] Tasks state (memory values in pages):
[425561.238085] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[425561.238941] [2178864] 0 2178864 427 306 40960 0 0 sh
[425561.239790] [2179682] 0 2179682 407 1 40960 0 0 head
[425561.240624] [2179683] 0 2179683 1741 1503 57344 0 0 tail
[425561.241449] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,mems_allowed=0,oom_memcg=/system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,task_memcg=/system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,task=tail,pid=2179683,uid=0
[425561.245090] Memory cgroup out of memory: Killed process 2179683 (tail) total-vm:6964kB, anon-rss:5188kB, file-rss:824kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
We are seeing the following logs:
- dump_header
- dump_oom_victim
- mem_cgroup_print_oom_context
- Final line of out_of_memory
Now we try the same on the instance itself, with no containers. It has 8Gi of memory, so we try head -c 9000m /dev/zero | tail. dmesg for that:
[426123.555128] systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[426123.556841] CPU: 1 PID: 1 Comm: systemd Not tainted 6.1.132-147.221.amzn2023.x86_64 #1
[426123.557998] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[426123.558998] Call Trace:
[426123.559603] <TASK>
[426123.560191] dump_stack_lvl+0x34/0x48
[426123.560930] dump_header+0x4c/0x22d
[426123.561651] oom_kill_process.cold+0xb/0x10
[426123.562448] out_of_memory+0xed/0x2e0
[426123.563194] __alloc_pages_slowpath.constprop.0+0x6d2/0x9a0
[426123.564154] __alloc_pages+0x308/0x330
[426123.564914] folio_alloc+0x17/0x50
[426123.565645] __filemap_get_folio+0x1cd/0x330
[426123.566468] filemap_fault+0x43d/0x7a0
[426123.567237] __do_fault+0x34/0x1c0
[426123.567979] do_read_fault+0xee/0x1c0
[426123.568751] do_fault+0xa3/0x380
[426123.569472] __handle_mm_fault+0x513/0x5e0
[426123.570280] ? asm_sysvec_call_function_single+0x16/0x20
[426123.571216] handle_mm_fault+0xc5/0x2b0
[426123.572011] do_user_addr_fault+0x1a1/0x570
[426123.572867] exc_page_fault+0x62/0x140
[426123.573661] asm_exc_page_fault+0x22/0x30
[426123.574482] RIP: 0033:0x7ffbc66940a4
[426123.575274] Code: Unable to access opcode bytes at 0x7ffbc669407a.
[426123.576316] RSP: 002b:00007ffcee3b26c8 EFLAGS: 00010293
[426123.577248] RAX: 0000000000000001 RBX: 00005616dc39b010 RCX: 00005616dc39b010
[426123.578387] RDX: 00007ffbc6776880 RSI: 00007ffcee3b3650 RDI: 00005616dc301380
[426123.579526] RBP: 0000000000000000 R08: 00007ffcee3b3650 R09: 0000000000000000
[426123.580673] R10: 00007ffcee3b2c40 R11: 0000000000000010 R12: 00005616dc399450
[426123.581799] R13: 00005616dc39b010 R14: 00007ffcee3b3650 R15: 00005616dc301380
[426123.582918] </TASK>
[426123.583549] Mem-Info:
[426123.584168] active_anon:317 inactive_anon:1937229 isolated_anon:0
active_file:0 inactive_file:64 isolated_file:0
unevictable:469 dirty:0 writeback:0
slab_reclaimable:7306 slab_unreclaimable:11656
mapped:23 shmem:2699 pagetables:4943
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:25275 free_pcp:0 free_cma:0
[426123.590458] Node 0 active_anon:1268kB inactive_anon:7748916kB active_file:0kB inactive_file:256kB unevictable:1876kB isolated(anon):0kB isolated(file):0kB mapped:92kB dirty:0kB writeback:0kB shmem:10796kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3024kB pagetables:19772kB sec_pagetables:0kB all_unreclaimable? no
[426123.594938] Node 0 DMA free:14336kB boost:0kB min:128kB low:160kB high:192kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[426123.598628] lowmem_reserve[]: 0 2910 7770 7770 7770
[426123.599534] Node 0 DMA32 free:44920kB boost:0kB min:25260kB low:31572kB high:37884kB reserved_highatomic:0KB active_anon:0kB inactive_anon:2953972kB active_file:0kB inactive_file:52kB unevictable:0kB writepending:0kB present:3125680kB managed:3006516kB mlocked:0kB bounce:0kB free_pcp:376kB local_pcp:0kB free_cma:0kB
[426123.603889] lowmem_reserve[]: 0 0 4860 4860 4860
[426123.604824] Node 0 Normal free:41788kB boost:0kB min:42188kB low:52732kB high:63276kB reserved_highatomic:0KB active_anon:1268kB inactive_anon:4794944kB active_file:0kB inactive_file:520kB unevictable:1876kB writepending:0kB present:5122048kB managed:4977204kB mlocked:0kB bounce:0kB free_pcp:76kB local_pcp:0kB free_cma:0kB
[426123.609485] lowmem_reserve[]: 0 0 0 0 0
[426123.610362] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
[426123.612462] Node 0 DMA32: 1016*4kB (U) 609*8kB (UM) 248*16kB (U) 81*32kB (UM) 12*64kB (UM) 3*128kB (U) 2*256kB (M) 0*512kB 1*1024kB (M) 1*2048kB (U) 6*4096kB (M) = 44808kB
[426123.614997] Node 0 Normal: 2511*4kB (UE) 1472*8kB (UME) 755*16kB (UME) 232*32kB (UME) 14*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42220kB
[426123.617490] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[426123.618914] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[426123.620302] 2831 total pagecache pages
[426123.621205] 0 pages in swap cache
[426123.622074] Free swap = 0kB
[426123.622890] Total swap = 0kB
[426123.623708] 2065931 pages RAM
[426123.624526] 0 pages HighMem/MovableOnly
[426123.625433] 66161 pages reserved
[426123.626278] 0 pages hwpoisoned
[426123.627108] Tasks state (memory values in pages):
[426123.628126] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[426123.679216] [2178864] 0 2178864 434 50 40960 0 0 sh
[426123.680452] [2182349] 0 2182349 55499 47 53248 0 0 head
[426123.681694] [2182350] 0 2182350 1938542 1883091 15147008 0 0 tail
<SNIP>
[426123.687918] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=tail,pid=2182350,uid=0
[426123.690422] Out of memory: Killed process 2182350 (tail) total-vm:7754168kB, anon-rss:7532364kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:14792kB oom_score_adj:0
Notes:
- Interesting that systemd is the process that invoked the OOM. Note that tail is technically running in its own cgroup (from the default systemd hierarchy). This means that using the vm.oom_kill_allocating_task sysctl could kill the entire system if a systemd process without the -1000 adjustment were to trigger the OOM by accident.
- We can see that oc->memcg is NULL based on the global_oom log.
Running a container, setting oom_score_adj to 1000, then running uv run python with " " * 1024 * 1024 * 1024 a few times to slowly OOM (note: this has to be gradual or you will get MemoryError due to malloc refusing to overallocate). dmesg:
[427777.436077] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427777.437496] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427777.438869] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.650448] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.651924] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.653289] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427824.651962] python3.13 invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
<SNIP>
[427824.725737] Tasks state (memory values in pages):
[427824.726733] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[427824.785987] [2193190] 0 2193190 434 50 49152 0 1000 sh
<SNIP>
[427824.788483] [2193819] 0 2193819 2951126 1879504 15200256 0 0 python3.13
[427824.789801] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-29180f1c319cd7a3b712d416d319e8c33b64dded655598bec859c83db1ef590e.scope,task=sh,pid=2193190,uid=0
[427824.793134] Out of memory: Killed process 2193190 (sh) total-vm:1736kB, anon-rss:196kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:1000
[427824.819737] systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250
<SNIP>
[427824.893390] Tasks state (memory values in pages):
[427824.894385] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[427824.954894] [2193819] 0 2193819 2951126 1879632 15200256 0 0 python3.13
[427824.956213] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=systemd-journald.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=python3.13,pid=2193819,uid=0
[427824.959263] Out of memory: Killed process 2193819 (python3.13) total-vm:11804504kB, anon-rss:7518524kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:14844kB oom_score_adj:0
So it looks like the process selection might be entirely cgroup-unaware, because sh died despite its cgroup using almost no memory. Note that the OOM killer then ran a second time because we OOMed again immediately (predictably).
7
Final test:
- Run a system-level tail -f /dev/null process, with its OOM score adjusted to be very high.
- Run a python in a container, with its OOM score adjusted to be lower.
- See which process gets killed by the OOM killer. If it is the system-level one (the tail), then the OOM theory for the EKS pod issues seems correct (since those nodes have very-low-OOM-score container processes as well).
Results:
[428419.671161] python3.13 invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
<SNIP>
[428419.808514] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-b0a5f84e13d54629ba327a7962556d19bd640fdc6a9b910b5b486ead1de21ac0.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=tail,pid=2197507,uid=0
[428419.811942] Out of memory: Killed process 2197507 (tail) total-vm:222016kB, anon-rss:192kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:999
[428419.900720] systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
<SNIP>
[428420.037888] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-b0a5f84e13d54629ba327a7962556d19bd640fdc6a9b910b5b486ead1de21ac0.scope,task=python3.13,pid=2198276,uid=0
[428420.041222] Out of memory: Killed process 2198276 (python3.13) total-vm:8421584kB, anon-rss:7495808kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:14772kB oom_score_adj:0
As expected, the system-level process died.
Final test: run another python in a container, with a pushed-down OOM score adjustment (-999), and see if the OOM killer will kill a system-level daemon instead. Note that this needs Sysbox.
>>> import os
>>> pid = int(os.getpid())
>>> from pathlib import Path
>>> Path(f"/proc/{pid}/oom_score_adj").write_text("-100\n")
>>> x = " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
# OOM
The Instance Connect terminal stopped updating at this point. I was able to reconnect. dmesg:
<SNIP>
[429104.973530] bash invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
<SNIP>
[429105.100050] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=sshd,pid=2174918,uid=0
[429105.102616] Out of memory: Killed process 2174918 (sshd) total-vm:16360kB, anon-rss:1764kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429105.290658] Out of memory: Killed process 2172743 (sshd) total-vm:16360kB, anon-rss:1740kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429105.655799] Out of memory: Killed process 2581 (systemd-network) total-vm:237528kB, anon-rss:1364kB, file-rss:0kB, shmem-rss:0kB, UID:192 pgtables:88kB oom_score_adj:0
[429105.985873] Out of memory: Killed process 2537 (rngd) total-vm:166744kB, anon-rss:1264kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429106.308905] Out of memory: Killed process 2541 (systemd-homed) total-vm:17440kB, anon-rss:1084kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
[429106.533138] Out of memory: Killed process 2202136 (systemd-userwor) total-vm:17292kB, anon-rss:956kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429106.552411] Out of memory: Killed process 2202148 (systemd-userwor) total-vm:17292kB, anon-rss:956kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429106.679037] Out of memory: Killed process 2202142 (systemd-userwor) total-vm:17292kB, anon-rss:952kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429106.697317] Out of memory: Killed process 2174519 (systemd-userdbd) total-vm:16936kB, anon-rss:952kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429107.166732] Out of memory: Killed process 4305 (rpc.mountd) total-vm:14404kB, anon-rss:908kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
[429107.404013] Out of memory: Killed process 4257 (rpcbind) total-vm:13012kB, anon-rss:868kB, file-rss:0kB, shmem-rss:0kB, UID:32 pgtables:68kB oom_score_adj:0
<SNIP>
[429109.636079] Out of memory: Killed process 4255 (agetty) total-vm:221968kB, anon-rss:104kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:0
<SNIP>
[429109.644505] containerd-shim invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-998
<SNIP>
[429109.660350] Out of memory: Killed process 4667 (sleep) total-vm:221984kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:0
<SNIP>
Notes:
- The system indeed got butchered.
- The python process is also dead, though I don't see a message in dmesg.
- The systemd user session processes all got killed together, likely due to the cgroup memory.oom.group setting.
- This did not occur with less extreme score adjustments (tried up to -500); python would get killed first.
Desired behavior
- Never hit system-level OOM in the first place, because everything is running in cgroups with appropriate memory.max settings.
- When the system is choosing a process for the OOM killer, we want it to strictly prefer containerized workloads (K8s pods/Latch Pods).
- We want per-process OOM killing inside the pod itself (no memory.oom.group).
- We don't want to unnecessarily restrict access to oom_score_adj provided by Sysbox, as it would be confusing to people who expect Latch Pods to act like VMs. It might also cause some workloads (e.g. Kubernetes, apparently) to crash inside the pod on startup, as they treat Permission Denied errors on the oom_score_adj write as fatal.
8
Everything should actually already be fine on our EKS cluster, because the crashing pod has a limit (cgroup memory.max) of 178Gi and the host instance has 192Gi of memory. This means that the cgroup should trigger its own OOM kill long before the system OOMs.
Possible reasons for this still being an issue:
- The cgroup itself never reaches the limit. Some other process triggers the OOM killer, which then goes killing system daemons. In that case there must be at least 192Gi - 178Gi = 14Gi of RAM used up by something else, more if the cgroup is not full (check memory.min to see if the cgroup forbids reclaiming any of its memory; see the check sketched after this list).
  - This could be kubelet, a container runtime process (Sysbox), or the NVIDIA driver (these all run outside the pod's cgroup).
- We are running into the "under certain circumstances, the usage may go over the limit temporarily" caveat in the kernel.
- Something to do with the cgroup file being open O_NONBLOCK, which does a bunch of weird things, potentially by systemd running within the container.
- It's unclear if memory.max refers to resident memory or committed memory (under overcommit conditions). This post seems to imply it's resident memory. Does this matter to us?
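A minimal headroom check on a live node, assuming the same pod cgroup path as before (the numbers are whatever the node reports; nothing here is from a real run):
cgroup=/sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556
cat "$cgroup/memory.max" "$cgroup/memory.min" "$cgroup/memory.current"
grep -E 'MemTotal|MemAvailable' /proc/meminfo
# if MemAvailable collapses while memory.current is still well under memory.max,
# something outside the pod cgroup is eating the ~14Gi of headroom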
9
Trying to get rid of the score adjustment on the user session processes.
Checking if spawning a child process inherits oom_score_adj: yes, it does.
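A minimal check (raising the adjustment is allowed without extra privileges; run it in a throwaway shell so it doesn't stick around):
echo 500 > /proc/self/oom_score_adj
sh -c 'cat /proc/self/oom_score_adj'   # prints 500: the child inherits the adjustment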
Need to figure out why systemd gets a score adjustment. It does not when running as a proper PID 1. Trying strace on /sbin/init in an ubuntu container. Using apt install init to get systemd as /sbin/init. You can use Ctrl+D to skip the region and timezone config. systemd exits immediately with "Couldn't find an alternative telinit implementation to spawn." This is seemingly because it wants to have PID 1 and must be run with exec /sbin/init (and probably use Sysbox). Some processes actually don't have any OOM score adjustments.
This might actually be due to something K8s is doing, because all other pods on the node also have either 1000 or -997 as the score adjustment.
Checking the other K8s cluster to see if this happens for a more basic configuration or has something to do with our crio setup. Looks like all the pods here also have it so this definitely has nothing to do with SysBox or systemd.
This actually appears to be documented in an archived k8s proposal.
Testing adjusting a Latch Pod deployment to set its OOM score in the boot script. Works fine; this is a quick workaround to at least prevent nodes from crashing.
Fixing the OOM Settings
We already deploy a DaemonSet for all Latch Pods called latch-pod-runtime, so our memory.oom.group adjustment will go in there.
First I modernize everything to use uv, a recent Dockerfile for Python projects, and newer CI/CD workflows.
Actual new code is basically this:
proc = await asyncio.subprocess.create_subprocess_exec(
    "/mnt/usr/local/bin/crictl",
    "--runtime-endpoint",
    "unix:///var/run/crio/crio.sock",
    "pods",
    "-o",
    "json",
    stdout=asyncio.subprocess.PIPE,
)
stdout, _ = await proc.communicate()
pods_data = json.loads(stdout)

for pod in pods_data["items"]:
    if pod["labels"]["app"] != "latch-pods":
        continue

    uid = pod["metadata"]["uid"]
    cgroup = Path(f"/sys/fs/cgroup/kubepods/pod{uid}")

    # turn off group kill for every container cgroup in the pod
    for child in cgroup.iterdir():
        if not child.is_dir():
            continue

        (child / "memory.oom.group").write_text("0\n")
I also have to change the DaemonSet spec to mount in /var/run/crio, since for some reason it is not exposed by just mounting in the entire node root.
We change the pod base image startup script to include a simple oom_score_adj reset:
echo '900' > /proc/self/oom_score_adj
Since this is PID 1 (before it execs systemd), all other processes will inherit the score adjustment.
The only remaining things are to test everything, then backport the base image changes to all previous base images so that pod templates benefit from the fix.
10
Remaining tasks:
- Check if the fix addressed the actual customer issue
- Figure out why the system-level OOM happens in the first place
- Write an eBPF OOM handler that always chooses something from a container. We currently fall back to the old bad behavior if people go around maliciously overriding oom_score_adj values. They shouldn't be able to crash the entire node no matter what.
Pod Base Image Backports
Ubuntu is by far the most familiar and ubiquitous Linux distro for "regular people". It is an obvious choice for any interactive environment that we build for end users. Since Latch Pods are containers, we maintain Ubuntu-flavored base images which use apt to manage packages. We install a bunch of system stuff because users expect things to mostly "just work". Most Dockerfiles on the internet do something like this:
run apt update
run apt install -y curl
This is wholly inappropriate for complicated base images because it makes the end result cache-dependent. If you make a change by e.g. adding
run apt install -y gzip
and build the new image, Docker might find the previous 2 layers in your local cache, and thus reuse the old apt package archive (the thing downloaded with apt update), or it might find nothing in cache and download brand new archives.
In simplistic cases like above, you probably don't care about all your system dependencies updating at unpredictable times. Various versions of curl, gzip, etc. are all pretty compatible and you don't care about libc updating a few versions either. There might still be problems if somebody uses a cache that is a year old since the new packages aren't necessarily compatible with the system libraries pulled in by old packages, but a clean rebuild will resolve it (again, you just upgrade everything and forget about it).
In our case, we really need the versions to not change spontaneously. I experimented at some point with saving the apt archive somewhere for future use and adding some notion of lockfiles to apt, but it's really not practical without a lot of custom infrastructure on top. Ubuntu has their own snapshot service for recent distros, but they claim to only keep 2 years' worth of snapshots, which makes it rather useless.
Keep in mind that the apt archive is not the only obstacle to reproducible builds. A lot of packages themselves go missing over time. NVIDIA in particular gave us a lot of trouble by randomly withdrawing old driver versions from apt. Various maintainers also regularly rotate their package keys and break existing builds.
The only real solution to this stuff is to forget about declarative Dockerfiles altogether and start doing "incremental" Dockerfiles:
from myapp:v1 as base
run apt install -y gzip
# publish as myapp:v2
When building v3 we merge the commands from v2 into the base Dockerfile to keep a reasonable "build-from-scratch" script, even though we never expect to run it again. This is mainly for reference, answering questions like "what apt packages do we install?". Everything is also in git, obviously.
A further complication is that we allow snapshots and backups of our running containers' root file-system (with some overlayfs tricks), which are not generally compatible across major base image changes. This means we have to maintain a giant backlog of base images for all the pod backups that might one day be restored.
We sometimes make incompatible changes to the platform or fix bugs in pods that then require changes to all base images. Originally we were doing this by directly overriding the images in the container registry because we were building them from scratch anyway. But eventually apt made it impossible by losing all our old package versions. We also ran into a bunch of new build errors regularly due to bitrot in software distribution and build systems.
I was going to simply bring over incremental builds, and override the old base images with the incremental rebuilds. However, this has an extra problem that does not occur with the master version updates, which is that destructively overriding images means that after one incremental build, rebuilding the same file again will actually add the increment twice (building from scratch did not have this problem because the result would be the same). So we actually have to stop overriding images entirely. Instead we patch the base image to an updated version of it (with a new tag) when the pod gets used.