Notes: OOM? Node Crash
Symptoms
A Kubernetes (EKS) node becomes "Not Ready" and the Latch Pod running on it becomes unresponsive when certain GPU workloads run.
- Impossible to get a K8s Node Shell
- AWS Instance Connect not connecting (unclear if this is an issue with all our nodes or just the crashed ones)
- Terminating the instance in AWS takes a very long time (>10 min)
- Pod is stuck in "Terminating" for a long time (>30 min)
- New pod cannot run because it needs the EBS volume (PersistentVolumeClaim) to be available, but it is stuck attached to the old dead instance
Note: this also looks similar to a confirmed unresolved OOM crash on Forch nodes. This comes up later.
Hot Fix
Force-delete the k8s pod. Terminate the AWS instance. Force-detach the EBS volume (risks corruption? does this matter with journaling on ext4? does overlayfs have proper recovery? it has caused no issues so far). Occasionally we also have to delete the new "Initializing" k8s pod for it to reschedule onto a live node (presumably because this re-triggers the "try to attach the PVC" process).
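Roughly, in commands (the pod, instance, and volume identifiers are placeholders, not real values):
kubectl delete pod <stuck-pod> --grace-period=0 --force
aws ec2 terminate-instances --instance-ids <i-instance-id>
aws ec2 detach-volume --volume-id <vol-volume-id> --force
# if the replacement pod is stuck "Initializing", delete it to re-trigger the PVC attach
kubectl delete pod <new-pod>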
1
Rahul suggested this is due to an OOM in the pod. The assumption is that some important process (network connectivity?) dies to the OOM killer and kubelet dies (killing everything k8s-related), then AWS can't reach the instance to do a "clean" shutdown and it gets stuck terminating until some timeout triggers a "forced" shutdown (which apparently cannot be triggered manually).
2
Made a new Latch Pod to investigate OOM-killer settings (OOM scores) for processes on the node. Node-shell in via AptaKube.
Reading everything from /proc.
Logging:
- PID (the directory name)
- process name from /comm
- command line from /cmdline
- OOM score from /oom_score
- OOM score adjustment from /oom_score_adj (note that oom_adj is legacy)
We are going to print stuff hierarchically (like ps -H).
Running the script using uv, placing it on the node by copy-pasting into nano (with rm -f oom_info.py && nano oom_info.py).
# /// script
# requires-python = ">=3.13"
# dependencies = []
# ///

import re
import shlex
from argparse import ArgumentParser
from dataclasses import dataclass
from pathlib import Path

argp = ArgumentParser()
_ = argp.add_argument("--extended", action="store_true")
args = argp.parse_args()

pid_re = re.compile(r"^\d+$")
status_line_re = re.compile(r"^(?P<key>[^:]+):\s*(?P<val>.*)$")


@dataclass(kw_only=True)
class Proc:
    pid: int
    ppid: int
    status: dict[str, str]
    cmd_line: list[str]
    comm: str
    oom_score: int
    oom_score_adj: int


data: dict[int, Proc] = {}
child_procs: dict[int, list[int]] = {}

proc_p = Path("/proc/")
for x in proc_p.iterdir():
    if pid_re.match(x.name) is None:
        continue

    pid = int(x.name)
    try:
        status_str = (x / "status").read_text()
        cmd_line = (part.strip() for part in (x / "cmdline").read_text().split("\0"))
        comm = (x / "comm").read_text().strip()
        oom_score = int((x / "oom_score").read_text())
        oom_score_adj = int((x / "oom_score_adj").read_text())
    except (NotADirectoryError, FileNotFoundError):
        # the process went away between listing /proc and reading its files
        continue

    status: dict[str, str] = {}
    for l in status_str.split("\n"):
        l = l.strip()
        if l == "":
            continue

        m = status_line_re.match(l)
        if m is None:
            raise RuntimeError(f"could not match /proc/{pid}/status line: {l!r}")

        status[m.group("key")] = m.group("val")

    ppid = int(status["PPid"])
    data[pid] = Proc(
        pid=pid,
        ppid=ppid,
        status=status,
        cmd_line=[x for x in cmd_line if x != ""],
        comm=comm,
        oom_score=oom_score,
        oom_score_adj=oom_score_adj,
    )
    child_procs.setdefault(ppid, []).append(pid)

for xs in child_procs.values():
    xs.sort()

# print(child_procs)


def render(pid: int, *, indent: str = "") -> None:
    x = data[pid]
    if x.comm == "kthread" and x.ppid == 1:
        return

    cmd_line = " ".join(shlex.quote(x) for x in x.cmd_line)
    if args.extended:
        print(f"{indent}[{x.pid}] {x.comm}")
        print(f"{indent} {cmd_line}")
        print(f"{indent} {x.oom_score} ({x.oom_score_adj:+})")
    else:
        print(f"{indent}[{x.pid}] {x.comm}: {x.oom_score} ({x.oom_score_adj:+})")

    for c in child_procs.get(pid, []):
        render(c, indent=indent + " ")


render(1)
This produces the following:
[1] systemd: 0 (+0)
[682] systemd-journal: 500 (-250)
[714] systemd-udevd: 0 (-1000)
[1058] rpcbind: 666 (+0)
[1059] systemd-timesyn: 666 (+0)
[1124] systemd-network: 666 (+0)
[1127] systemd-resolve: 666 (+0)
[1226] acpid: 666 (+0)
[1231] crio: 1 (-999)
[1232] cron: 666 (+0)
[1233] dbus-daemon: 67 (-900)
[1242] networkd-dispat: 666 (+0)
[1246] amazon-ssm-agen: 666 (+0)
[1637] ssm-agent-worke: 666 (+0)
[1310] sysbox-mgr: 334 (-500)
[1334] systemd-logind: 666 (+0)
[1360] atd: 666 (+0)
[1373] agetty: 666 (+0)
[1374] agetty: 666 (+0)
[1395] unattended-upgr: 666 (+0)
[1501] sysbox-fs: 334 (-500)
[1537] sh: 666 (+0)
[1565] sleep: 666 (+0)
[1545] containerd: 1 (-999)
[9328] kubelet: 1 (-999)
<SNIP other pods>
[999004] conmon: 0 (-1000)
[999026] systemd: 2 (-997)
[107945] packagekitd: 2 (-997)
[107957] polkitd: 2 (-997)
[999716] systemd-journal: 500 (-250)
[999990] systemd-resolve: 2 (-997)
[1003069] dbus-daemon: 67 (-900)
[1003126] security.py: 2 (-997)
[1003156] python: 2 (-997)
[1003147] sshd_keys_monit: 2 (-997)
[1003170] networkd-dispat: 2 (-997)
[1003222] systemd-logind: 2 (-997)
[1003237] containerd: 1 (-999)
[1003239] sshd: 2 (-997)
[1003271] dockerd: 334 (-500)
[1004617] master: 2 (-997)
[1004619] qmgr: 2 (-997)
[1808057] pickup: 2 (-997)
[1004816] unattended-upgr: 2 (-997)
[1004852] systemd: 2 (-997)
[1004861] (sd-pam): 2 (-997)
[1004927] jupyter-lab: 2 (-997)
[1038371] bash: 2 (-997)
[1888242] python: 2 (-997)
[1887631] pyright-langser: 2 (-997)
[1887657] MainThread: 2 (-997)
[2061993] python: 2 (-997)
[2062665] python: 2 (-997)
[2062668] python: 2 (-997)
[2062673] python: 2 (-997)
[2062677] python: 2 (-997)
[2062681] python: 2 (-997)
[2062685] python: 2 (-997)
[2062689] python: 2 (-997)
[2062693] python: 2 (-997)
[2062696] python: 2 (-997)
[1004951] rserver: 2 (-997)
[1004961] ssh-agent: 2 (-997)
[1004980] dbus-daemon: 2 (-997)
[1282005] snapd: 67 (-900)
[1293746] sshd: 0 (-1000)
<SNIP other pods>
[2057270] polkitd: 666 (+0)
Notes:
- The processes inside the pod (under conmon) all have a very low score and a very highly negative adjustment. This is unexpected.
  - These are under cgroups, so the way the OOM killer treats them could be different (it's not necessarily killing the highest-score process like it would if everything were in one cgroup); see the check below
- amazon-ssm-agent has a relatively high score, which might account for why the instance does not terminate properly
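A quick sanity check of which cgroup one of the in-pod processes actually lives in (PID taken from the tree above; any in-pod PID works):
cat /proc/999026/cgroup
cat /proc/999026/oom_score /proc/999026/oom_score_adj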
3
Some reading on OOM killer:
- The killer only triggers if memory over-provisioning is enabled (see kernel docs). If you do not overcommit you will get NULL from malloc and never trigger system-level OOM.
- The only explanation of the Linux OOM handling seems to be an ancient article. This definitely predates cgroups
- cgroups v1 had some basic OOM-killer controls. See "10. OOM Control" in the kernel docs.
- cgroup v2 docs are here
- cgroups v1/v2 also both have memory pressure stuff. v2 uses "Pressure Stall Information"
- cgroup v2 memory.max can trigger the OOM killer. This might explain why we see the issue primarily on GPU pods, since those nodes are used by exactly one Latch Pod, meaning the node could actually reach a system OOM before the pod reaches the cgroup memory max. This also would connect it to Forch, since Forch allocates an instance of exactly the right size for each workload.
- The only "OOM" setting in cgroup v2 is memory.oom.group. If this is 1, then if anything in the cgroup is OOM-killed the whole cgroup is killed.
- We can look at the cgroup memory event stats to see if the OOM killer has ever been triggered (see the snippet after this list)
- Teaching the OOM killer about control groups details multiple proposals for how the killer should interact with cgroups. Notably, the stuff that finally shipped in the kernel could be different from any of the described proposals
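Checking the event stats is just reading the cgroup v2 memory.events file for the pod's cgroup; the oom / oom_kill counters say whether the killer has ever fired there (counter values below are illustrative, not from a real run):
cat /sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556/memory.events
# low 0
# high 0
# max 0
# oom 0
# oom_kill 0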
Our pod's cgroup has memory.oom.group=1:
$ cat /sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556/crio-2c6ca22c5968ca227b2b77480c87fa56044a98fb92280ce03e613d95b6ea505b/memory.oom.group
1
There is also a nested systemd cgroup v2 hierarchy under init.scope there which does not have it, but it shouldn't matter.
There are userspace OOM killers too (e.g. systemd-oomd). And you can use eBPF for OOM. This can be useful if we need to patch the kernel OOM process or for debugging.
4
We want to know how modern kernels actually deal with OOM when cgroups are present. We read the source code:
- The OOM killer is in the kernel source code as out_of_memory
- It generally just kills the process with the highest score
- But note that it already takes a specific cgroup to kill. There are also many other "out of memory" things that are not related to the system OOM state, that you can find in the source code, like allocator-specific OOM procedures
- See pagefault_out_of_memory in the same file for the OOM killer entrypoint triggered by a "generic" system-level OOM. This one does not have a cgroup chosen yet. I do not understand how this actually ends up doing anything though, since it contains nothing but rate-limiting code. There is probably some sort of callback system
- There is a cgroup OOM handler called mem_cgroup_oom. This calls out_of_memory internally but also already has a cgroup reference. try_charge_memcg is what calls it, but it again has already selected a cgroup
  - This all seems unrelated because I think this is for the non-overcommitted flow
The actual handler turns out to be the innocuously named mem_cgroup_oom_synchronize.
- This reads the current->memcg_in_oom to get the cgroup. It comes from memcg1_oom_prepare
- Since that only gets set when the cgroup has the OOM killer disabled, I don't think it will actually be the one to die. Unclear
After reading the comment above pagefault_out_of_memory, I think it is actually a "secondary" handler that is called by tasks that are inside a cgroup, when they get an OOM pagefault due to cgroup limits. Specifically:
- it does not handle tasks that are not part of a cgroup,
- it expects the current task's cgroup to already be marked as OOM, otherwise it will refuse to handle it,
- the cgroup is marked as OOM due to reaching memory.max in the allocation process,
- a different OOM path happens during allocation for non-cgroup-related OOMs.
5
Searching for out_of_memory calls reveals:
- lru_gen_age_node, which apparently is the Multi-Gen LRU page reclamation system. This looks promising: "page reclaim decides the kernel's caching policy and ability to overcommit memory". On a closer look, it iterates over cgroups but never submits a cgroup to out_of_memory, instead only setting gfp_mask.
- __alloc_pages_may_oom from page_alloc.c, which looks interesting but never uses a cgroup.
One final lead is the mm_fault_error from a stack trace on a StackOverflow post. Note that the post itself is about cgroup v1 and also seemingly a memory.max OOM, not a system OOM.
6
Time for some experiments on a fresh AWS instance running an Amazon Linux-based AMI.
oom_info.py:
[1] systemd: 0 (+0)
[1068] systemd-journal: 520 (-250)
[1725] systemd-udevd: 0 (-1000)
[2398] auditd: 0 (-1000)
[2400] systemd-resolve: 667 (+0)
[2421] nfsdcld: 666 (+0)
[2452] rpc.idmapd: 666 (+0)
[2491] dbus-broker-lau: 67 (-900)
[2497] dbus-broker: 67 (-900)
[2530] irqbalance: 666 (+0)
[2537] rngd: 666 (+0)
[2538] sysbox-mgr: 335 (-500)
[2541] systemd-homed: 667 (+0)
[2544] systemd-logind: 667 (+0)
[2581] systemd-network: 667 (+0)
[2720] agent: 683 (+0)
[2722] trace-agent: 670 (+0)
[2807] gssproxy: 666 (+0)
[4227] sshd: 0 (-1000)
[2172743] sshd: 667 (+0)
[2174918] sshd: 666 (+0)
[2174919] bash: 666 (+0)
[2175385] uv: 669 (+0)
[2175453] python: 668 (+0)
[2175459] process: 0 (+0)
[4255] agetty: 666 (+0)
[4256] agetty: 666 (+0)
[4257] rpcbind: 666 (+0)
[4291] rpc.statd: 666 (+0)
[4305] rpc.mountd: 666 (+0)
[4495] chronyd: 666 (+0)
[4594] sysbox-fs: 335 (-500)
[4667] sleep: 666 (+0)
[4681] containerd: 6 (-999)
[4822] dockerd: 341 (-500)
[2174519] systemd-userdbd: 666 (+0)
[2174581] systemd-userwor: 667 (+0)
[2174582] systemd-userwor: 667 (+0)
[2174583] systemd-userwor: 667 (+0)
[2174689] systemd: 734 (+100)
[2174733] (sd-pam): 733 (+100)
Note that a lot of things are different because the EKS nodes are Ubuntu-based. I'm connected using AWS Instance Connect (shows up as an SSH session).
With a container:
[2176408] containerd-shim: 4 (-998)
[2176452] sh: 666 (+0)
This is running /bin/sh under alpine (with a TTY attached).
Using sysbox-runc vs the default runtime doesn't change anything.
Running a container with a low memory limit of 6m (the lowest that Docker allows), and then running head -c 5000m /dev/zero | tail to trigger OOM, produces the following dmesg logs:
[425561.202382] tail invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[425561.203483] CPU: 1 PID: 2179683 Comm: tail Not tainted 6.1.132-147.221.amzn2023.x86_64 #1
[425561.204467] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[425561.205279] Call Trace:
[425561.205737] <TASK>
[425561.206158] dump_stack_lvl+0x34/0x48
[425561.206699] dump_header+0x4c/0x22d
[425561.207223] oom_kill_process.cold+0xb/0x10
[425561.207805] out_of_memory+0xed/0x2e0
[425561.208344] ? sched_clock_cpu+0x9/0xb0
[425561.208890] mem_cgroup_out_of_memory+0x136/0x150
[425561.209517] try_charge_memcg+0x7d7/0x890
[425561.210073] charge_memcg+0x35/0xe0
[425561.210581] __mem_cgroup_charge+0x29/0x80
[425561.211132] do_anonymous_page+0x108/0x570
[425561.211688] __handle_mm_fault+0x513/0x5e0
[425561.212249] handle_mm_fault+0xc5/0x2b0
[425561.212792] do_user_addr_fault+0x1a1/0x570
[425561.213358] exc_page_fault+0x62/0x140
[425561.213878] asm_exc_page_fault+0x22/0x30
[425561.214418] RIP: 0033:0x7f8844e6d46e
[425561.214933] Code: 89 7d 10 49 83 ef 14 48 83 c7 10 49 81 e0 00 f0 ff ff 31 d2 25 ff 0f 00 00 4c 09 c0 48 89 45 20 4a 8d 2c 37 4c 01 ff 4d 29 f7 <c6> 07 00 45 29 e7 74 1b 44 89 f8 f7 d8 48 98 c6 04 07 00 41 83 ff
[425561.217045] RSP: 002b:00007ffc9fefcdd0 EFLAGS: 00010202
[425561.217719] RAX: 0000000000537fe0 RBX: 00007f88445a8270 RCX: 00007f8844e7d690
[425561.218526] RDX: 0000000000000000 RSI: 0000000000536000 RDI: 00007f8844adeffc
[425561.219352] RBP: 00007f88445a8270 R08: 0000000000537000 R09: 0000000000000000
[425561.220170] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000536000
[425561.220987] R13: 00007f8844adde70 R14: 0000000000000260 R15: 0000000000536d8c
[425561.221803] </TASK>
[425561.222263] memory: usage 6144kB, limit 6144kB, failcnt 146
[425561.223222] swap: usage 0kB, limit 6144kB, failcnt 0
[425561.223865] Memory cgroup stats for /system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope:
[425561.223878] anon 5775360
file 0
kernel 516096
kernel_stack 49152
pagetables 110592
sec_pagetables 0
percpu 1328
sock 0
vmalloc 12288
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 5763072
active_anon 12288
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 97984
slab_unreclaimable 182768
slab 280752
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 6
pgsteal 1
pgscan_kswapd 0
pgscan_direct 6
pgsteal_kswapd 0
pgsteal_direct 1
pgfault 3644
pgmajfault 0
pgrefill 5
pgactivate 5
pgdeactivate 5
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 0
thp_collapse_alloc 0
[425561.237528] Tasks state (memory values in pages):
[425561.238085] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[425561.238941] [2178864] 0 2178864 427 306 40960 0 0 sh
[425561.239790] [2179682] 0 2179682 407 1 40960 0 0 head
[425561.240624] [2179683] 0 2179683 1741 1503 57344 0 0 tail
[425561.241449] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,mems_allowed=0,oom_memcg=/system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,task_memcg=/system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,task=tail,pid=2179683,uid=0
[425561.245090] Memory cgroup out of memory: Killed process 2179683 (tail) total-vm:6964kB, anon-rss:5188kB, file-rss:824kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
We are seeing the following logs:
- dump_header
- dump_oom_victim
- mem_cgroup_print_oom_context
- Final line of out_of_memory
Now we try the same on the instance itself, with no containers. It has 8Gi of memory, so we try head -c 9000m /dev/zero | tail. dmesg for that:
[426123.555128] systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[426123.556841] CPU: 1 PID: 1 Comm: systemd Not tainted 6.1.132-147.221.amzn2023.x86_64 #1
[426123.557998] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[426123.558998] Call Trace:
[426123.559603] <TASK>
[426123.560191] dump_stack_lvl+0x34/0x48
[426123.560930] dump_header+0x4c/0x22d
[426123.561651] oom_kill_process.cold+0xb/0x10
[426123.562448] out_of_memory+0xed/0x2e0
[426123.563194] __alloc_pages_slowpath.constprop.0+0x6d2/0x9a0
[426123.564154] __alloc_pages+0x308/0x330
[426123.564914] folio_alloc+0x17/0x50
[426123.565645] __filemap_get_folio+0x1cd/0x330
[426123.566468] filemap_fault+0x43d/0x7a0
[426123.567237] __do_fault+0x34/0x1c0
[426123.567979] do_read_fault+0xee/0x1c0
[426123.568751] do_fault+0xa3/0x380
[426123.569472] __handle_mm_fault+0x513/0x5e0
[426123.570280] ? asm_sysvec_call_function_single+0x16/0x20
[426123.571216] handle_mm_fault+0xc5/0x2b0
[426123.572011] do_user_addr_fault+0x1a1/0x570
[426123.572867] exc_page_fault+0x62/0x140
[426123.573661] asm_exc_page_fault+0x22/0x30
[426123.574482] RIP: 0033:0x7ffbc66940a4
[426123.575274] Code: Unable to access opcode bytes at 0x7ffbc669407a.
[426123.576316] RSP: 002b:00007ffcee3b26c8 EFLAGS: 00010293
[426123.577248] RAX: 0000000000000001 RBX: 00005616dc39b010 RCX: 00005616dc39b010
[426123.578387] RDX: 00007ffbc6776880 RSI: 00007ffcee3b3650 RDI: 00005616dc301380
[426123.579526] RBP: 0000000000000000 R08: 00007ffcee3b3650 R09: 0000000000000000
[426123.580673] R10: 00007ffcee3b2c40 R11: 0000000000000010 R12: 00005616dc399450
[426123.581799] R13: 00005616dc39b010 R14: 00007ffcee3b3650 R15: 00005616dc301380
[426123.582918] </TASK>
[426123.583549] Mem-Info:
[426123.584168] active_anon:317 inactive_anon:1937229 isolated_anon:0
active_file:0 inactive_file:64 isolated_file:0
unevictable:469 dirty:0 writeback:0
slab_reclaimable:7306 slab_unreclaimable:11656
mapped:23 shmem:2699 pagetables:4943
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:25275 free_pcp:0 free_cma:0
[426123.590458] Node 0 active_anon:1268kB inactive_anon:7748916kB active_file:0kB inactive_file:256kB unevictable:1876kB isolated(anon):0kB isolated(file):0kB mapped:92kB dirty:0kB writeback:0kB shmem:10796kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3024kB pagetables:19772kB sec_pagetables:0kB all_unreclaimable? no
[426123.594938] Node 0 DMA free:14336kB boost:0kB min:128kB low:160kB high:192kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[426123.598628] lowmem_reserve[]: 0 2910 7770 7770 7770
[426123.599534] Node 0 DMA32 free:44920kB boost:0kB min:25260kB low:31572kB high:37884kB reserved_highatomic:0KB active_anon:0kB inactive_anon:2953972kB active_file:0kB inactive_file:52kB unevictable:0kB writepending:0kB present:3125680kB managed:3006516kB mlocked:0kB bounce:0kB free_pcp:376kB local_pcp:0kB free_cma:0kB
[426123.603889] lowmem_reserve[]: 0 0 4860 4860 4860
[426123.604824] Node 0 Normal free:41788kB boost:0kB min:42188kB low:52732kB high:63276kB reserved_highatomic:0KB active_anon:1268kB inactive_anon:4794944kB active_file:0kB inactive_file:520kB unevictable:1876kB writepending:0kB present:5122048kB managed:4977204kB mlocked:0kB bounce:0kB free_pcp:76kB local_pcp:0kB free_cma:0kB
[426123.609485] lowmem_reserve[]: 0 0 0 0 0
[426123.610362] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
[426123.612462] Node 0 DMA32: 1016*4kB (U) 609*8kB (UM) 248*16kB (U) 81*32kB (UM) 12*64kB (UM) 3*128kB (U) 2*256kB (M) 0*512kB 1*1024kB (M) 1*2048kB (U) 6*4096kB (M) = 44808kB
[426123.614997] Node 0 Normal: 2511*4kB (UE) 1472*8kB (UME) 755*16kB (UME) 232*32kB (UME) 14*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42220kB
[426123.617490] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[426123.618914] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[426123.620302] 2831 total pagecache pages
[426123.621205] 0 pages in swap cache
[426123.622074] Free swap = 0kB
[426123.622890] Total swap = 0kB
[426123.623708] 2065931 pages RAM
[426123.624526] 0 pages HighMem/MovableOnly
[426123.625433] 66161 pages reserved
[426123.626278] 0 pages hwpoisoned
[426123.627108] Tasks state (memory values in pages):
[426123.628126] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[426123.679216] [2178864] 0 2178864 434 50 40960 0 0 sh
[426123.680452] [2182349] 0 2182349 55499 47 53248 0 0 head
[426123.681694] [2182350] 0 2182350 1938542 1883091 15147008 0 0 tail
<SNIP>
[426123.687918] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=tail,pid=2182350,uid=0
[426123.690422] Out of memory: Killed process 2182350 (tail) total-vm:7754168kB, anon-rss:7532364kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:14792kB oom_score_adj:0
Notes:
- Interesting that systemd is the process that invoked the OOM. Note that tail is technically running in its own cgroup (from the default systemd hierarchy). This means that using the vm.oom_kill_allocating_task sysctl could kill the entire system if a systemd process without the -1000 adjustment were to trigger the OOM by accident.
- We can see that oc->memcg is NULL based on the global_oom log.
Running a container, setting oom_score_adj to 1000, then running uv run python with " " * 1024 * 1024 * 1024 a few times to slowly OOM (note: this has to be gradual or you will get MemoryError due to malloc refusing to overallocate). dmesg:
[427777.436077] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427777.437496] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427777.438869] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.650448] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.651924] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.653289] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427824.651962] python3.13 invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
<SNIP>
[427824.725737] Tasks state (memory values in pages):
[427824.726733] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[427824.785987] [2193190] 0 2193190 434 50 49152 0 1000 sh
<SNIP>
[427824.788483] [2193819] 0 2193819 2951126 1879504 15200256 0 0 python3.13
[427824.789801] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-29180f1c319cd7a3b712d416d319e8c33b64dded655598bec859c83db1ef590e.scope,task=sh,pid=2193190,uid=0
[427824.793134] Out of memory: Killed process 2193190 (sh) total-vm:1736kB, anon-rss:196kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:1000
[427824.819737] systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250
<SNIP>
[427824.893390] Tasks state (memory values in pages):
[427824.894385] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[427824.954894] [2193819] 0 2193819 2951126 1879632 15200256 0 0 python3.13
[427824.956213] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=systemd-journald.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=python3.13,pid=2193819,uid=0
[427824.959263] Out of memory: Killed process 2193819 (python3.13) total-vm:11804504kB, anon-rss:7518524kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:14844kB oom_score_adj:0
So it looks like the process selection might be entirely cgroup-unaware, because sh died despite its cgroup using almost no memory. Note that the OOM killer then ran a second time because we OOMed again immediately (predictably).
7
Final test:
- Run a system-level tail -f /dev/null process, with its OOM score adjusted to be very high.
- Run a python in a container, with its OOM score adjusted to be lower.
- See which process gets killed by the OOM killer. If it is the system-level one (the tail), then the OOM theory for the EKS pod issues seems correct (since those nodes have very-low-OOM-score container processes as well).
Results:
[428419.671161] python3.13 invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
<SNIP>
[428419.808514] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-b0a5f84e13d54629ba327a7962556d19bd640fdc6a9b910b5b486ead1de21ac0.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=tail,pid=2197507,uid=0
[428419.811942] Out of memory: Killed process 2197507 (tail) total-vm:222016kB, anon-rss:192kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:999
[428419.900720] systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
<SNIP>
[428420.037888] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-b0a5f84e13d54629ba327a7962556d19bd640fdc6a9b910b5b486ead1de21ac0.scope,task=python3.13,pid=2198276,uid=0
[428420.041222] Out of memory: Killed process 2198276 (python3.13) total-vm:8421584kB, anon-rss:7495808kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:14772kB oom_score_adj:0
As expected, the system-level process died.
Final test: run another python in a container, with a pushed-down OOM score adjustment (-999), and see if the OOM killer will kill a system-level daemon instead. Note that this needs Sysbox.
>>> import os
>>> pid = int(os.getpid())
>>> from pathlib import Path
>>> Path(f"/proc/{pid}/oom_score_adj").write_text("-100\n")
>>> x = " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
# OOM
The Instance Connect terminal stopped updating at this point. I was able to reconnect. dmesg:
<SNIP>
[429104.973530] bash invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
<SNIP>
[429105.100050] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=sshd,pid=2174918,uid=0
[429105.102616] Out of memory: Killed process 2174918 (sshd) total-vm:16360kB, anon-rss:1764kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429105.290658] Out of memory: Killed process 2172743 (sshd) total-vm:16360kB, anon-rss:1740kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429105.655799] Out of memory: Killed process 2581 (systemd-network) total-vm:237528kB, anon-rss:1364kB, file-rss:0kB, shmem-rss:0kB, UID:192 pgtables:88kB oom_score_adj:0
[429105.985873] Out of memory: Killed process 2537 (rngd) total-vm:166744kB, anon-rss:1264kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429106.308905] Out of memory: Killed process 2541 (systemd-homed) total-vm:17440kB, anon-rss:1084kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
[429106.533138] Out of memory: Killed process 2202136 (systemd-userwor) total-vm:17292kB, anon-rss:956kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429106.552411] Out of memory: Killed process 2202148 (systemd-userwor) total-vm:17292kB, anon-rss:956kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429106.679037] Out of memory: Killed process 2202142 (systemd-userwor) total-vm:17292kB, anon-rss:952kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429106.697317] Out of memory: Killed process 2174519 (systemd-userdbd) total-vm:16936kB, anon-rss:952kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429107.166732] Out of memory: Killed process 4305 (rpc.mountd) total-vm:14404kB, anon-rss:908kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
[429107.404013] Out of memory: Killed process 4257 (rpcbind) total-vm:13012kB, anon-rss:868kB, file-rss:0kB, shmem-rss:0kB, UID:32 pgtables:68kB oom_score_adj:0
<SNIP>
[429109.636079] Out of memory: Killed process 4255 (agetty) total-vm:221968kB, anon-rss:104kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:0
<SNIP>
[429109.644505] containerd-shim invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-998
<SNIP>
[429109.660350] Out of memory: Killed process 4667 (sleep) total-vm:221984kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:0
<SNIP>
Notes:
- The system indeed got butchered.
- The python process is also dead, though I don't see a message in dmesg.
- The systemd user session processes all got killed together, likely due to the cgroup memory.oom.group setting.
- This did not occur with less extreme score adjustments (tried up to -500); python would get killed first.
Desired behavior
- Never hit system-level OOM in the first place, because everything is running in cgroups with appropriate memory.max settings.
- When the system is choosing a process for the OOM killer, we want it to strictly prefer containerized workloads (K8s pods/Latch Pods).
- We want per-process OOM killing inside the pod itself (no memory.oom.group).
- We don't want to unnecessarily restrict access to oom_score_adj provided by Sysbox, as it would be confusing to people who expect Latch Pods to act like VMs. It might also cause some workloads (e.g. Kubernetes, apparently) to crash inside the pod on startup, as they treat Permission Denied errors on the oom_score_adj write as fatal.
8
Everything should actually already be fine on our EKS cluster, because the crashing pod has a limit (cgroup memory.max) of 178Gi and the host instance has 192Gi of memory. This means that the cgroup should trigger its own OOM kill long before the system OOMs.
Possible reasons for this still being an issue:
- The cgroup itself never reaches the limit. Some other process triggers the OOM killer, which then goes killing system daemons. In that case there must be at least 192Gi - 178Gi = 14Gi of RAM used up by something else, more if the cgroup is not full (check memory.min to see if the cgroup forbids reclaiming any of its memory; see the check sketched after this list).
  - This could be kubelet, a container runtime process (Sysbox), or the NVIDIA driver (these all run outside the pod's cgroup).
- We are running into the "under certain circumstances, the usage may go over the limit temporarily" caveat in the kernel.
- Something to do with the cgroup file being open O_NONBLOCK, which does a bunch of weird things, potentially by systemd running within the container.
- It's unclear if memory.max refers to resident memory or committed memory (under overcommit conditions). This post seems to imply it's resident memory. Does this matter to us?
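A minimal headroom check on a live node, assuming the same pod cgroup path as before (the numbers are whatever the node reports; nothing here is from a real run):
cgroup=/sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556
cat "$cgroup/memory.max" "$cgroup/memory.min" "$cgroup/memory.current"
grep -E 'MemTotal|MemAvailable' /proc/meminfo
# if MemAvailable collapses while memory.current is still well under memory.max,
# something outside the pod cgroup is eating the ~14Gi of headroom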
9
Trying to get rid of the score adjustment on the user session processes.
Checking if spawning a child process inherits oom_score_adj: yes, it does.
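A minimal check (raising the adjustment is allowed without extra privileges; run it in a throwaway shell so it doesn't stick around):
echo 500 > /proc/self/oom_score_adj
sh -c 'cat /proc/self/oom_score_adj'   # prints 500: the child inherits the adjustment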
Need to figure out why systemd gets a score adjustment. It does not when running as a proper PID 1. Trying strace on /sbin/init in an ubuntu container. Using apt install init to get systemd as /sbin/init. You can use Ctrl+D to skip the region and timezone config. systemd exits immediately with "Couldn't find an alternative telinit implementation to spawn." This is seemingly because it wants to have PID 1 and must be run with exec /sbin/init (and probably use Sysbox). Some processes actually don't have any OOM score adjustments.
This might actually be due to something K8s is doing, because all other pods on the node also have either 1000 or -997 as the score adjustment.
Checking the other K8s cluster to see if this happens for a more basic configuration or has something to do with our crio setup. Looks like all the pods here also have it so this definitely has nothing to do with SysBox or systemd.
This actually appears to be documented in an archived k8s proposal.
Testing adjusting a Latch Pod deployment to set its OOM score in the boot script. Works fine; this is a quick workaround to at least prevent nodes from crashing.
Fixing the OOM Settings
We already deploy a DaemonSet for all Latch Pods called latch-pod-runtime, so our memory.oom.group adjustment will go in there.
First I modernize everything to use uv, a recent Dockerfile for Python projects, and newer CI/CD workflows.
Actual new code is basically this:
proc = await asyncio.subprocess.create_subprocess_exec(
    "/mnt/usr/local/bin/crictl",
    "--runtime-endpoint",
    "unix:///var/run/crio/crio.sock",
    "pods",
    "-o",
    "json",
    stdout=asyncio.subprocess.PIPE,
)
stdout, _ = await proc.communicate()
pods_data = json.loads(stdout)

for pod in pods_data["items"]:
    if pod["labels"]["app"] != "latch-pods":
        continue

    uid = pod["metadata"]["uid"]
    cgroup = Path(f"/sys/fs/cgroup/kubepods/pod{uid}")

    # turn off group kill for every container cgroup in the pod
    for child in cgroup.iterdir():
        if not child.is_dir():
            continue

        (child / "memory.oom.group").write_text("0\n")
I also have to change the DaemonSet spec to mount in /var/run/crio, since for some reason it is not exposed by just mounting in the entire node root.
We change the pod base image startup script to include a simple oom_score_adj reset:
echo '900' > /proc/self/oom_score_adj
Since this is PID 1 (before it execs systemd), all other processes will inherit the score adjustment.
The only remaining things are to test everything, then backport the base image changes to all previous base images so that pod templates benefit from the fix.
10
Remaining tasks:
- Check if the fix addressed the actual customer issue
- Figure out why the system-level OOM happens in the first place
- Write an eBPF OOM handler that always chooses something from a container. We currently fall back to the old bad behavior if people go around maliciously overriding oom_score_adj values. They shouldn't be able to crash the entire node no matter what.
Pod Base Image Backports
Ubuntu is by far the most familiar and ubiquitous Linux distro for "regular people". It is an obvious choice for any interactive environment that we build for end users. Since Latch Pods are containers, we maintain Ubuntu-flavored base images which use apt to manage packages. We install a bunch of system stuff because users expect things to mostly "just work". Most Dockerfiles on the internet do something like this:
run apt update
run apt install -y curl
This is wholly inappropriate for complicated base images because it makes the end result cache-dependent. If you make a change by e.g. adding
run apt install -y gzip
and build the new image, Docker might find the previous 2 layers in your local cache, and thus reuse the old apt package archive (the thing downloaded with apt update), or it might find nothing in cache and download brand new archives.
In simplistic cases like above, you probably don't care about all your system dependencies updating at unpredictable times. Various versions of curl, gzip, etc. are all pretty compatible and you don't care about libc updating a few versions either. There might still be problems if somebody uses a cache that is a year old since the new packages aren't necessarily compatible with the system libraries pulled in by old packages, but a clean rebuild will resolve it (again, you just upgrade everything and forget about it).
In our case, we really need the versions to not change spontaneously. I experimented at some point with saving the apt archive somewhere for future use and adding some notion of lockfiles to apt, but it's really not practical without a lot of custom infrastructure on top. Ubuntu has their own snapshot service for recent distros, but they claim to only keep 2 years' worth of snapshots, which makes it rather useless.
Keep in mind that the apt archive is not the only obstacle to reproducible builds. A lot of packages themselves go missing over time. NVIDIA in particular gave us a lot of trouble by randomly withdrawing old driver versions from apt. Various maintainers also regularly rotate their package keys and break existing builds.
The only real solution to this stuff is to forget about declarative Dockerfiles altogether and start doing "incremental" Dockerfiles:
from myapp:v1 as base
run apt install -y gzip
# publish as myapp:v2
When building v3 we merge the commands from v2 into the base Dockerfile to keep a reasonable "build-from-scratch" script, even though we never expect to run it again. This is mainly for reference, answering questions like "what apt packages do we install?". Everything is also in git, obviously.
A further complication is that we allow snapshots and backups of our running containers' root file-system (with some overlayfs tricks), which are not generally compatible across major base image changes. This means we have to maintain a giant backlog of base images for all the pod backups that might one day be restored.
We sometimes make incompatible changes to the platform or fix bugs in pods that then require changes to all base images. Originally we were doing this by directly overriding the images in the container registry because we were building them from scratch anyway. But eventually apt made it impossible by losing all our old package versions. We also ran into a bunch of new build errors regularly due to bitrot in software distribution and build systems.
I was going to simply bring over incremental builds, and override the old base images with the incremental rebuilds. However, this has an extra problem that does not occur with the master version updates, which is that destructively overriding images means that after one incremental build, rebuilding the same file again will actually add the increment twice (building from scratch did not have this problem because the result would be the same). So we actually have to stop overriding images entirely. Instead we patch the base image to an updated version of it (with a new tag) when the pod gets used.