Hephaestian Sea

Notes: OOM? Node Crash

Symptoms

A Kubernetes (EKS) node becomes "Not Ready", and the Latch Pod running on it becomes unresponsive, when certain GPU workloads run.

Note: this also looks similar to a confirmed (but unresolved) OOM crash on Forch nodes. This comes up later.

Hot Fix

Force-delete the k8s pod. Terminate the AWS instance. Force-detach the EBS volume (risks corruption? does this matter with journaling on ext4? does overlayfs have proper recovery? it has caused no issues so far). Occasionally we also have to delete the new "Initializing" k8s pod for it to reschedule onto a live node (presumably because this re-triggers the "try to attach the PVC" process).
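
Roughly, the manual recovery (a sketch; placeholder names and IDs):

kubectl delete pod <pod-name> --grace-period=0 --force
aws ec2 terminate-instances --instance-ids <instance-id>
aws ec2 detach-volume --volume-id <volume-id> --force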

1

Rahul suggested this is due to an OOM in the pod. The assumption: some important process is killed by the OOM killer (network connectivity?), kubelet dies (taking everything k8s-related with it), and then AWS can't reach the instance to do a "clean" shutdown, so it gets stuck terminating until some timeout triggers a "forced" shutdown (which apparently cannot be triggered manually).

2

Made a new Latch Pod to investigate OOM-killer settings (OOM scores) for processes on the node. Node-shell in via AptaKube.

Reading everything from /proc. We are going to print the processes hierarchically (like ps -H).

Running the script using uv, placing it on the node by copy-pasting into nano (with rm -f oom_info.py && nano oom_info.py).

# /// script
# requires-python = ">=3.13"
# dependencies = []
# ///

import re
import shlex
from argparse import ArgumentParser
from dataclasses import dataclass
from pathlib import Path

argp = ArgumentParser()
_ = argp.add_argument("--extended", action="store_true")

args = argp.parse_args()

pid_re = re.compile(r"^\d+$")
status_line_re = re.compile(r"^(?P<key>[^:]+):\s*(?P<val>.*)$")


@dataclass(kw_only=True)
class Proc:
    pid: int
    ppid: int
    status: dict[str, str]
    cmd_line: list[str]
    comm: str
    oom_score: int
    oom_score_adj: int


data: dict[int, Proc] = {}
child_procs: dict[int, list[int]] = {}

proc_p = Path("/proc/")
# walk /proc, collecting status + OOM info for every numeric (PID) entry
for x in proc_p.iterdir():
    if pid_re.match(x.name) is None:
        continue

    pid = int(x.name)

    try:
        status_str = (x / "status").read_text()
        cmd_line = (part.strip() for part in (x / "cmdline").read_text().split("\0"))
        comm = (x / "comm").read_text().strip()
        oom_score = int((x / "oom_score").read_text())
        oom_score_adj = int((x / "oom_score_adj").read_text())
    except (NotADirectoryError, FileNotFoundError, ProcessLookupError):
        # the process vanished while we were reading its files
        continue

    status: dict[str, str] = {}
    for l in status_str.split("\n"):
        l = l.strip()
        if l == "":
            continue

        m = status_line_re.match(l)
        if m is None:
            raise RuntimeError(f"could not match /proc/{pid}/status line: {l!r}")
        status[m.group("key")] = m.group("val")

    ppid = int(status["PPid"])
    data[pid] = Proc(
        pid=pid,
        ppid=ppid,
        status=status,
        cmd_line=[x for x in cmd_line if x != ""],
        comm=comm,
        oom_score=oom_score,
        oom_score_adj=oom_score_adj,
    )
    child_procs.setdefault(ppid, []).append(pid)

for xs in child_procs.values():
    xs.sort()

# print(child_procs)


def render(pid: int, *, indent: str = "") -> None:
    x = data[pid]
    if x.comm == "kthreadd":
        # skip kernel threads (children of kthreadd; not interesting here)
        return

    cmd_line = " ".join(shlex.quote(x) for x in x.cmd_line)

    if args.extended:
        print(f"{indent}[{x.pid}] {x.comm}")
        print(f"{indent}  {cmd_line}")
        print(f"{indent}  {x.oom_score} ({x.oom_score_adj:+})")
    else:
        print(f"{indent}[{x.pid}] {x.comm}: {x.oom_score} ({x.oom_score_adj:+})")

    for c in child_procs.get(pid, []):
        render(c, indent=indent + "  ")


render(1)
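
Invocation (uv runs the inline-metadata script directly):

uv run oom_info.py              # compact tree
uv run oom_info.py --extended   # also print full command lines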

This produces the following:

[1] systemd: 0 (+0)
  [682] systemd-journal: 500 (-250)
  [714] systemd-udevd: 0 (-1000)
  [1058] rpcbind: 666 (+0)
  [1059] systemd-timesyn: 666 (+0)
  [1124] systemd-network: 666 (+0)
  [1127] systemd-resolve: 666 (+0)
  [1226] acpid: 666 (+0)
  [1231] crio: 1 (-999)
  [1232] cron: 666 (+0)
  [1233] dbus-daemon: 67 (-900)
  [1242] networkd-dispat: 666 (+0)
  [1246] amazon-ssm-agen: 666 (+0)
    [1637] ssm-agent-worke: 666 (+0)
  [1310] sysbox-mgr: 334 (-500)
  [1334] systemd-logind: 666 (+0)
  [1360] atd: 666 (+0)
  [1373] agetty: 666 (+0)
  [1374] agetty: 666 (+0)
  [1395] unattended-upgr: 666 (+0)
  [1501] sysbox-fs: 334 (-500)
  [1537] sh: 666 (+0)
    [1565] sleep: 666 (+0)
  [1545] containerd: 1 (-999)
  [9328] kubelet: 1 (-999)
  <SNIP other pods>
  [999004] conmon: 0 (-1000)
    [999026] systemd: 2 (-997)
      [107945] packagekitd: 2 (-997)
      [107957] polkitd: 2 (-997)
      [999716] systemd-journal: 500 (-250)
      [999990] systemd-resolve: 2 (-997)
      [1003069] dbus-daemon: 67 (-900)
      [1003126] security.py: 2 (-997)
        [1003156] python: 2 (-997)
      [1003147] sshd_keys_monit: 2 (-997)
      [1003170] networkd-dispat: 2 (-997)
      [1003222] systemd-logind: 2 (-997)
      [1003237] containerd: 1 (-999)
      [1003239] sshd: 2 (-997)
      [1003271] dockerd: 334 (-500)
      [1004617] master: 2 (-997)
        [1004619] qmgr: 2 (-997)
        [1808057] pickup: 2 (-997)
      [1004816] unattended-upgr: 2 (-997)
      [1004852] systemd: 2 (-997)
        [1004861] (sd-pam): 2 (-997)
        [1004927] jupyter-lab: 2 (-997)
          [1038371] bash: 2 (-997)
            [1888242] python: 2 (-997)
          [1887631] pyright-langser: 2 (-997)
            [1887657] MainThread: 2 (-997)
          [2061993] python: 2 (-997)
            [2062665] python: 2 (-997)
            [2062668] python: 2 (-997)
            [2062673] python: 2 (-997)
            [2062677] python: 2 (-997)
            [2062681] python: 2 (-997)
            [2062685] python: 2 (-997)
            [2062689] python: 2 (-997)
            [2062693] python: 2 (-997)
            [2062696] python: 2 (-997)
        [1004951] rserver: 2 (-997)
        [1004961] ssh-agent: 2 (-997)
        [1004980] dbus-daemon: 2 (-997)
  [1282005] snapd: 67 (-900)
  [1293746] sshd: 0 (-1000)
  <SNIP other pods>
  [2057270] polkitd: 666 (+0)

Notes:

3

Some reading on OOM killer:

Our pod's cgroup has memory.oom.group=1:

$ cat /sys/fs/cgroup/kubepods/pod30e633b5-4140-4774-9122-1b9e140c8556/crio-2c6ca22c5968ca227b2b77480c87fa56044a98fb92280ce03e613d95b6ea505b/memory.oom.group
1

There is also a nested systemd cgroupv2 hierarchy under init.scope there which does not have it set, but that shouldn't matter.
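
To survey the whole kubepods hierarchy at once (a sketch):

find /sys/fs/cgroup/kubepods -name memory.oom.group -exec grep -H . {} +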

There are userspace OOM killers too (e.g. systemd-oomd), and eBPF can be used for OOM handling as well. This could be useful if we need to patch the kernel OOM process, or for debugging.

4

We want to know how modern kernels actually deal with OOM when cgroups are present. We read the source code:

The actual handler turns out to be the innocuously named mem_cgroup_oom_synchronize.

After reading the comment above pagefault_out_of_memory, I think it is actually a "secondary" handler, called by tasks inside a cgroup when they get an OOM page fault due to cgroup limits. Specifically:

5

Searching for out_of_memory calls reveals:

One final lead is mm_fault_error, from a stack trace in a StackOverflow post. Note that the post itself is about cgroupv1, and also seemingly about a memory-limit (memory.max) OOM rather than a system OOM.

6

Time for some experiments on a fresh AWS instance running an Amazon Linux-based AMI.

oom_info.py:

[1] systemd: 0 (+0)
  [1068] systemd-journal: 520 (-250)
  [1725] systemd-udevd: 0 (-1000)
  [2398] auditd: 0 (-1000)
  [2400] systemd-resolve: 667 (+0)
  [2421] nfsdcld: 666 (+0)
  [2452] rpc.idmapd: 666 (+0)
  [2491] dbus-broker-lau: 67 (-900)
    [2497] dbus-broker: 67 (-900)
  [2530] irqbalance: 666 (+0)
  [2537] rngd: 666 (+0)
  [2538] sysbox-mgr: 335 (-500)
  [2541] systemd-homed: 667 (+0)
  [2544] systemd-logind: 667 (+0)
  [2581] systemd-network: 667 (+0)
  [2720] agent: 683 (+0)
  [2722] trace-agent: 670 (+0)
  [2807] gssproxy: 666 (+0)
  [4227] sshd: 0 (-1000)
    [2172743] sshd: 667 (+0)
      [2174918] sshd: 666 (+0)
        [2174919] bash: 666 (+0)
          [2175385] uv: 669 (+0)
            [2175453] python: 668 (+0)
              [2175459] process: 0 (+0)
  [4255] agetty: 666 (+0)
  [4256] agetty: 666 (+0)
  [4257] rpcbind: 666 (+0)
  [4291] rpc.statd: 666 (+0)
  [4305] rpc.mountd: 666 (+0)
  [4495] chronyd: 666 (+0)
  [4594] sysbox-fs: 335 (-500)
  [4667] sleep: 666 (+0)
  [4681] containerd: 6 (-999)
  [4822] dockerd: 341 (-500)
  [2174519] systemd-userdbd: 666 (+0)
    [2174581] systemd-userwor: 667 (+0)
    [2174582] systemd-userwor: 667 (+0)
    [2174583] systemd-userwor: 667 (+0)
  [2174689] systemd: 734 (+100)
    [2174733] (sd-pam): 733 (+100)

Note that a lot of things are different because the EKS nodes are Ubuntu-based. I'm connected using EC2 Instance Connect (it shows up as an SSH session).

With a container:

  [2176408] containerd-shim: 4 (-998)
    [2176452] sh: 666 (+0)

This is running /bin/sh under alpine (with a TTY attached).

Using sysbox-runc vs the default runtime doesn't change anything.

Next: running a container with a low memory limit of 6m (the lowest Docker allows), then running head -c 5000m /dev/zero | tail inside it to trigger an OOM.
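
A minimal sketch of the repro (-m caps the container's memory):

docker run --rm -it -m 6m alpine /bin/sh
# inside the container:
head -c 5000m /dev/zero | tail

This produces the following dmesg logs: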

[425561.202382] tail invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[425561.203483] CPU: 1 PID: 2179683 Comm: tail Not tainted 6.1.132-147.221.amzn2023.x86_64 #1
[425561.204467] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[425561.205279] Call Trace:
[425561.205737]  <TASK>
[425561.206158]  dump_stack_lvl+0x34/0x48
[425561.206699]  dump_header+0x4c/0x22d
[425561.207223]  oom_kill_process.cold+0xb/0x10
[425561.207805]  out_of_memory+0xed/0x2e0
[425561.208344]  ? sched_clock_cpu+0x9/0xb0
[425561.208890]  mem_cgroup_out_of_memory+0x136/0x150
[425561.209517]  try_charge_memcg+0x7d7/0x890
[425561.210073]  charge_memcg+0x35/0xe0
[425561.210581]  __mem_cgroup_charge+0x29/0x80
[425561.211132]  do_anonymous_page+0x108/0x570
[425561.211688]  __handle_mm_fault+0x513/0x5e0
[425561.212249]  handle_mm_fault+0xc5/0x2b0
[425561.212792]  do_user_addr_fault+0x1a1/0x570
[425561.213358]  exc_page_fault+0x62/0x140
[425561.213878]  asm_exc_page_fault+0x22/0x30
[425561.214418] RIP: 0033:0x7f8844e6d46e
[425561.214933] Code: 89 7d 10 49 83 ef 14 48 83 c7 10 49 81 e0 00 f0 ff ff 31 d2 25 ff 0f 00 00 4c 09 c0 48 89 45 20 4a 8d 2c 37 4c 01 ff 4d 29 f7 <c6> 07 00 45 29 e7 74 1b 44 89 f8 f7 d8 48 98 c6 04 07 00 41 83 ff
[425561.217045] RSP: 002b:00007ffc9fefcdd0 EFLAGS: 00010202
[425561.217719] RAX: 0000000000537fe0 RBX: 00007f88445a8270 RCX: 00007f8844e7d690
[425561.218526] RDX: 0000000000000000 RSI: 0000000000536000 RDI: 00007f8844adeffc
[425561.219352] RBP: 00007f88445a8270 R08: 0000000000537000 R09: 0000000000000000
[425561.220170] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000536000
[425561.220987] R13: 00007f8844adde70 R14: 0000000000000260 R15: 0000000000536d8c
[425561.221803]  </TASK>
[425561.222263] memory: usage 6144kB, limit 6144kB, failcnt 146
[425561.223222] swap: usage 0kB, limit 6144kB, failcnt 0
[425561.223865] Memory cgroup stats for /system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope:
[425561.223878] anon 5775360
                file 0
                kernel 516096
                kernel_stack 49152
                pagetables 110592
                sec_pagetables 0
                percpu 1328
                sock 0
                vmalloc 12288
                shmem 0
                zswap 0
                zswapped 0
                file_mapped 0
                file_dirty 0
                file_writeback 0
                swapcached 0
                anon_thp 0
                file_thp 0
                shmem_thp 0
                inactive_anon 5763072
                active_anon 12288
                inactive_file 0
                active_file 0
                unevictable 0
                slab_reclaimable 97984
                slab_unreclaimable 182768
                slab 280752
                workingset_refault_anon 0
                workingset_refault_file 0
                workingset_activate_anon 0
                workingset_activate_file 0
                workingset_restore_anon 0
                workingset_restore_file 0
                workingset_nodereclaim 0
                pgscan 6
                pgsteal 1
                pgscan_kswapd 0
                pgscan_direct 6
                pgsteal_kswapd 0
                pgsteal_direct 1
                pgfault 3644
                pgmajfault 0
                pgrefill 5
                pgactivate 5
                pgdeactivate 5
                pglazyfree 0
                pglazyfreed 0
                zswpin 0
                zswpout 0
                thp_fault_alloc 0
                thp_collapse_alloc 0
[425561.237528] Tasks state (memory values in pages):
[425561.238085] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[425561.238941] [2178864]     0 2178864      427      306    40960        0             0 sh
[425561.239790] [2179682]     0 2179682      407        1    40960        0             0 head
[425561.240624] [2179683]     0 2179683     1741     1503    57344        0             0 tail
[425561.241449] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,mems_allowed=0,oom_memcg=/system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,task_memcg=/system.slice/docker-a3325687f48efe8662de5e8e6c2a2f81c48bccd39940116f88884a757cddc33b.scope,task=tail,pid=2179683,uid=0
[425561.245090] Memory cgroup out of memory: Killed process 2179683 (tail) total-vm:6964kB, anon-rss:5188kB, file-rss:824kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0

Now we try the same on the instance itself, with no containers. It has 8Gi of memory, so we try head -c 9000m /dev/zero | tail. dmesg for that:

[426123.555128] systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[426123.556841] CPU: 1 PID: 1 Comm: systemd Not tainted 6.1.132-147.221.amzn2023.x86_64 #1
[426123.557998] Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
[426123.558998] Call Trace:
[426123.559603]  <TASK>
[426123.560191]  dump_stack_lvl+0x34/0x48
[426123.560930]  dump_header+0x4c/0x22d
[426123.561651]  oom_kill_process.cold+0xb/0x10
[426123.562448]  out_of_memory+0xed/0x2e0
[426123.563194]  __alloc_pages_slowpath.constprop.0+0x6d2/0x9a0
[426123.564154]  __alloc_pages+0x308/0x330
[426123.564914]  folio_alloc+0x17/0x50
[426123.565645]  __filemap_get_folio+0x1cd/0x330
[426123.566468]  filemap_fault+0x43d/0x7a0
[426123.567237]  __do_fault+0x34/0x1c0
[426123.567979]  do_read_fault+0xee/0x1c0
[426123.568751]  do_fault+0xa3/0x380
[426123.569472]  __handle_mm_fault+0x513/0x5e0
[426123.570280]  ? asm_sysvec_call_function_single+0x16/0x20
[426123.571216]  handle_mm_fault+0xc5/0x2b0
[426123.572011]  do_user_addr_fault+0x1a1/0x570
[426123.572867]  exc_page_fault+0x62/0x140
[426123.573661]  asm_exc_page_fault+0x22/0x30
[426123.574482] RIP: 0033:0x7ffbc66940a4
[426123.575274] Code: Unable to access opcode bytes at 0x7ffbc669407a.
[426123.576316] RSP: 002b:00007ffcee3b26c8 EFLAGS: 00010293
[426123.577248] RAX: 0000000000000001 RBX: 00005616dc39b010 RCX: 00005616dc39b010
[426123.578387] RDX: 00007ffbc6776880 RSI: 00007ffcee3b3650 RDI: 00005616dc301380
[426123.579526] RBP: 0000000000000000 R08: 00007ffcee3b3650 R09: 0000000000000000
[426123.580673] R10: 00007ffcee3b2c40 R11: 0000000000000010 R12: 00005616dc399450
[426123.581799] R13: 00005616dc39b010 R14: 00007ffcee3b3650 R15: 00005616dc301380
[426123.582918]  </TASK>
[426123.583549] Mem-Info:
[426123.584168] active_anon:317 inactive_anon:1937229 isolated_anon:0
                 active_file:0 inactive_file:64 isolated_file:0
                 unevictable:469 dirty:0 writeback:0
                 slab_reclaimable:7306 slab_unreclaimable:11656
                 mapped:23 shmem:2699 pagetables:4943
                 sec_pagetables:0 bounce:0
                 kernel_misc_reclaimable:0
                 free:25275 free_pcp:0 free_cma:0
[426123.590458] Node 0 active_anon:1268kB inactive_anon:7748916kB active_file:0kB inactive_file:256kB unevictable:1876kB isolated(anon):0kB isolated(file):0kB mapped:92kB dirty:0kB writeback:0kB shmem:10796kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3024kB pagetables:19772kB sec_pagetables:0kB all_unreclaimable? no
[426123.594938] Node 0 DMA free:14336kB boost:0kB min:128kB low:160kB high:192kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[426123.598628] lowmem_reserve[]: 0 2910 7770 7770 7770
[426123.599534] Node 0 DMA32 free:44920kB boost:0kB min:25260kB low:31572kB high:37884kB reserved_highatomic:0KB active_anon:0kB inactive_anon:2953972kB active_file:0kB inactive_file:52kB unevictable:0kB writepending:0kB present:3125680kB managed:3006516kB mlocked:0kB bounce:0kB free_pcp:376kB local_pcp:0kB free_cma:0kB
[426123.603889] lowmem_reserve[]: 0 0 4860 4860 4860
[426123.604824] Node 0 Normal free:41788kB boost:0kB min:42188kB low:52732kB high:63276kB reserved_highatomic:0KB active_anon:1268kB inactive_anon:4794944kB active_file:0kB inactive_file:520kB unevictable:1876kB writepending:0kB present:5122048kB managed:4977204kB mlocked:0kB bounce:0kB free_pcp:76kB local_pcp:0kB free_cma:0kB
[426123.609485] lowmem_reserve[]: 0 0 0 0 0
[426123.610362] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
[426123.612462] Node 0 DMA32: 1016*4kB (U) 609*8kB (UM) 248*16kB (U) 81*32kB (UM) 12*64kB (UM) 3*128kB (U) 2*256kB (M) 0*512kB 1*1024kB (M) 1*2048kB (U) 6*4096kB (M) = 44808kB
[426123.614997] Node 0 Normal: 2511*4kB (UE) 1472*8kB (UME) 755*16kB (UME) 232*32kB (UME) 14*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42220kB
[426123.617490] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[426123.618914] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[426123.620302] 2831 total pagecache pages
[426123.621205] 0 pages in swap cache
[426123.622074] Free swap  = 0kB
[426123.622890] Total swap = 0kB
[426123.623708] 2065931 pages RAM
[426123.624526] 0 pages HighMem/MovableOnly
[426123.625433] 66161 pages reserved
[426123.626278] 0 pages hwpoisoned
[426123.627108] Tasks state (memory values in pages):
[426123.628126] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[426123.679216] [2178864]     0 2178864      434       50    40960        0             0 sh
[426123.680452] [2182349]     0 2182349    55499       47    53248        0             0 head
[426123.681694] [2182350]     0 2182350  1938542  1883091 15147008        0             0 tail
<SNIP>
[426123.687918] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=tail,pid=2182350,uid=0
[426123.690422] Out of memory: Killed process 2182350 (tail) total-vm:7754168kB, anon-rss:7532364kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:14792kB oom_score_adj:0

Notes:

Running a container, setting its oom_score_adj to 1000, then running uv run python and allocating " " * 1024 * 1024 * 1024 a few times to slowly OOM (it has to be gradual, or malloc refuses to overallocate and you get a MemoryError instead).
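
The allocation pattern, roughly (a sketch; same idea as the transcript further down):

>>> x = " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> # ...repeat until the global OOM killer fires

dmesg: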

[427777.436077] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427777.437496] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427777.438869] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.650448] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.651924] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427803.653289] __vm_enough_memory: pid: 2193819, comm: python3.13, no enough memory for the allocation
[427824.651962] python3.13 invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
<SNIP>
[427824.725737] Tasks state (memory values in pages):
[427824.726733] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[427824.785987] [2193190]     0 2193190      434       50    49152        0          1000 sh
<SNIP>
[427824.788483] [2193819]     0 2193819  2951126  1879504 15200256        0             0 python3.13
[427824.789801] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-29180f1c319cd7a3b712d416d319e8c33b64dded655598bec859c83db1ef590e.scope,task=sh,pid=2193190,uid=0
[427824.793134] Out of memory: Killed process 2193190 (sh) total-vm:1736kB, anon-rss:196kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:1000
[427824.819737] systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250
<SNIP>
[427824.893390] Tasks state (memory values in pages):
[427824.894385] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<SNIP>
[427824.954894] [2193819]     0 2193819  2951126  1879632 15200256        0             0 python3.13
[427824.956213] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=systemd-journald.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=python3.13,pid=2193819,uid=0
[427824.959263] Out of memory: Killed process 2193819 (python3.13) total-vm:11804504kB, anon-rss:7518524kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:14844kB oom_score_adj:0

So it looks like victim selection might be entirely cgroup-unaware: sh died despite its cgroup using almost no memory (its oom_score_adj of 1000 dominated). The OOM killer then predictably ran a second time because we immediately OOMed again.

7

Final test:

Results:

[428419.671161] python3.13 invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
<SNIP>
[428419.808514] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-b0a5f84e13d54629ba327a7962556d19bd640fdc6a9b910b5b486ead1de21ac0.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=tail,pid=2197507,uid=0
[428419.811942] Out of memory: Killed process 2197507 (tail) total-vm:222016kB, anon-rss:192kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:999

[428419.900720] systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
<SNIP>
[428420.037888] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-b0a5f84e13d54629ba327a7962556d19bd640fdc6a9b910b5b486ead1de21ac0.scope,task=python3.13,pid=2198276,uid=0
[428420.041222] Out of memory: Killed process 2198276 (python3.13) total-vm:8421584kB, anon-rss:7495808kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:14772kB oom_score_adj:0

As expected, the system-level process died.

One last test: run another Python in a container with a pushed-down (negative) OOM score adjustment, and see if the OOM killer will kill a system-level daemon instead. Note that this needs Sysbox.

>>> import os
>>> pid = int(os.getpid())
>>> from pathlib import Path
>>> Path(f"/proc/{pid}/oom_score_adj").write_text("-100\n")
>>> x = " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
>>> x += " " * 1024 * 1024 * 1024
# OOM

The Instance Connect terminal stopped updating at this point. I was able to reconnect.

dmesg:

<SNIP>
[429104.973530] bash invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
<SNIP>
[429105.100050] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=sshd,pid=2174918,uid=0
[429105.102616] Out of memory: Killed process 2174918 (sshd) total-vm:16360kB, anon-rss:1764kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429105.290658] Out of memory: Killed process 2172743 (sshd) total-vm:16360kB, anon-rss:1740kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429105.655799] Out of memory: Killed process 2581 (systemd-network) total-vm:237528kB, anon-rss:1364kB, file-rss:0kB, shmem-rss:0kB, UID:192 pgtables:88kB oom_score_adj:0
[429105.985873] Out of memory: Killed process 2537 (rngd) total-vm:166744kB, anon-rss:1264kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429106.308905] Out of memory: Killed process 2541 (systemd-homed) total-vm:17440kB, anon-rss:1084kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
[429106.533138] Out of memory: Killed process 2202136 (systemd-userwor) total-vm:17292kB, anon-rss:956kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0
[429106.552411] Out of memory: Killed process 2202148 (systemd-userwor) total-vm:17292kB, anon-rss:956kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429106.679037] Out of memory: Killed process 2202142 (systemd-userwor) total-vm:17292kB, anon-rss:952kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429106.697317] Out of memory: Killed process 2174519 (systemd-userdbd) total-vm:16936kB, anon-rss:952kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0
[429107.166732] Out of memory: Killed process 4305 (rpc.mountd) total-vm:14404kB, anon-rss:908kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:56kB oom_score_adj:0
[429107.404013] Out of memory: Killed process 4257 (rpcbind) total-vm:13012kB, anon-rss:868kB, file-rss:0kB, shmem-rss:0kB, UID:32 pgtables:68kB oom_score_adj:0
<SNIP>
[429109.636079] Out of memory: Killed process 4255 (agetty) total-vm:221968kB, anon-rss:104kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:0
<SNIP>
[429109.644505] containerd-shim invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-998
<SNIP>
[429109.660350] Out of memory: Killed process 4667 (sleep) total-vm:221984kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:0
<SNIP>

Notes:

Desired behavior

8

Everything should actually already be fine on our EKS cluster, because the crashing pod has a limit (cgroup memory.max) of 178Gi and the host instance has 192Gi of memory. This means the cgroup should OOM-kill within itself long before the system OOMs (there is ~14Gi of headroom for everything else on the node).

Possible reasons for this still being an issue:

9

Trying to get rid of the score adjustment on the user session processes.

Checking whether spawned child processes inherit oom_score_adj. They do (see the sketch below).
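
A quick way to check (a sketch): raise our own adjustment, then read it back from a child process.

import subprocess
from pathlib import Path

# raising your own oom_score_adj never requires privileges
Path("/proc/self/oom_score_adj").write_text("123\n")

# the child reads its *own* oom_score_adj; it prints 123, confirming inheritance
out = subprocess.run(["cat", "/proc/self/oom_score_adj"], capture_output=True, text=True)
print(out.stdout)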

Need to figure out why systemd gets a score adjustment. It does not get one when running as a proper PID 1. Trying strace on /sbin/init in an Ubuntu container, using apt install init to get systemd as /sbin/init. (You can use Ctrl+D to skip the region and timezone config.) systemd exits immediately with "Couldn't find an alternative telinit implementation to spawn." This is seemingly because it wants to have PID 1, so it must be run with exec /sbin/init (and probably under Sysbox). Some processes actually don't have any OOM score adjustment at all.

This might actually be due to something k8s is doing, because all other pods on the node also have either 1000 or -997 as the score adjustment.

Checking the other k8s cluster to see if this happens in a more basic configuration, or if it has something to do with our crio setup. Looks like all the pods there have it too, so this definitely has nothing to do with Sysbox or systemd.

This actually appears to be documented in an archived k8s proposal.
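
For reference, the kubelet assigns oom_score_adj by QoS class; a sketch from memory of kubelet's qos package (exact constants and clamps may differ by version):

def kubelet_oom_score_adj(qos: str, memory_request: int, machine_memory: int) -> int:
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: bigger memory requests get a lower (safer) score,
    # clamped to stay strictly between the two extremes
    adj = 1000 - (1000 * memory_request) // machine_memory
    return min(max(adj, 3), 999)

This matches the -997 and 1000 values observed on our nodes.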

Testing a Latch Pod deployment adjusted to set its own OOM score in the boot script. Works fine; this serves as a quick workaround to at least prevent nodes from crashing.

Fixing the OOM Settings

We already deploy a DaemonSet for all Latch Pods, called latch-pod-runtime, so our memory.oom.group adjustment will go in there.

First I modernize everything to use uv, a recent Dockerfile for Python projects, and newer CI/CD workflows.

Actual new code is basically this:

import asyncio
import json
from pathlib import Path

# (this fragment runs inside the DaemonSet's async main)
# ask crio for every pod sandbox on the node
proc = await asyncio.subprocess.create_subprocess_exec(
    "/mnt/usr/local/bin/crictl",
    "--runtime-endpoint",
    "unix:///var/run/crio/crio.sock",
    "pods",
    "-o",
    "json",
    stdout=asyncio.subprocess.PIPE,
)
stdout, _ = await proc.communicate()

pods_data = json.loads(stdout)

for pod in pods_data["items"]:
    # other pods may not carry an "app" label at all
    if pod.get("labels", {}).get("app") != "latch-pods":
        continue

    uid = pod["metadata"]["uid"]
    cgroup = Path(f"/sys/fs/cgroup/kubepods/pod{uid}")

    # clear memory.oom.group on each container cgroup so a memcg OOM kills
    # only the offending process instead of the entire pod
    for child in cgroup.iterdir():
        if not child.is_dir():
            continue

        (child / "memory.oom.group").write_text("0\n")

I also have to change the DaemonSet spec to additionally mount /var/run/crio, since (for some reason) it is not exposed by just mounting the entire node root.

We change the pod base image startup script to include a simple oom_score_adj reset:

echo '900' > /proc/self/oom_score_adj

Since this runs as PID 1 (before it execs systemd), all other processes will inherit the score adjustment.

The only remaining things are to test everything, then backport the base image changes to all previous base images so that pod templates benefit from the fix.

10

Remaining tasks:

Pod Base Image Backports

Ubuntu is by far the most familiar and ubiquitous Linux distro for "regular people". It is an obvious choice for any interactive environment that we build for end users. Since Latch Pods are containers, we maintain Ubuntu-flavored base images which use apt to manage packages. We install a bunch of system stuff because users expect things to mostly "just work". Most Dockerfiles on the internet do something like this:

run apt update
run apt install -y curl

This is wholly inappropriate for complicated base images because it makes the end result cache-dependent. If you make a change, e.g. adding

run apt install -y gzip

and build the new image, Docker might find the previous 2 layers in your local cache, and thus reuse the old apt package archive (the thing downloaded with apt update), or it might find nothing in cache and download brand-new archives.

In simple cases like the above, you probably don't care about your system dependencies updating at unpredictable times. Various versions of curl, gzip, etc. are all pretty compatible, and you don't care about libc moving a few versions either. There might still be problems if somebody uses a year-old cache, since the new packages aren't necessarily compatible with the system libraries pulled in by old ones, but a clean rebuild resolves it (again, you just upgrade everything and forget about it).

In our case, we really need the versions to not change spontaneously. I experimented at one point with saving the apt archive for future use and adding some notion of lockfiles to apt, but it's really not practical without a lot of custom infrastructure on top. Ubuntu has its own snapshot service for recent distros, but they claim to keep only 2 years' worth of snapshots, which makes it rather useless here.

Keep in mind that the apt archive issue is not the only obstacle to reproducible builds. A lot of packages themselves go missing over time. NVIDIA in particular gave us a lot of trouble by randomly withdrawing old driver versions from apt. Various maintainers also regularly rotate their package signing keys and break existing builds.

The only real solution to all this is to forget about declarative Dockerfiles altogether and start doing "incremental" Dockerfiles:

from myapp:v1 as base

run apt install -y gzip

# publish as myapp:v2

When building v3, we merge the commands from v2 into the base Dockerfile to keep a reasonable "build-from-scratch" script, even though we never expect to run it again. This is mainly for reference, answering questions like "what apt packages do we install?" (see the illustration below). Everything is also in git, obviously.
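
Illustratively, the merged "build-from-scratch" reference for v2 would look like this (the v1 contents here are hypothetical):

from ubuntu:22.04 as base
run apt update
run apt install -y curl
run apt install -y gzip

# never actually rebuilt; kept so "what do we install?" has an answer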


A further complication is that we allow snapshots and backups of our running containers' root filesystem (via some overlayfs tricks), and these are not generally compatible across major base image changes. This means we have to maintain a giant backlog of base images for all the pod backups that might one day be restored.

We sometimes make incompatible changes to the platform, or fix bugs in pods, that then require changes to all base images. Originally we did this by directly overriding the images in the container registry, since we were building them from scratch anyway. But eventually apt made that impossible by losing all our old package versions, and we also regularly ran into new build errors due to bitrot in software distribution and build systems.

I was going to simply bring over incremental builds, and override the old base images with the incremental rebuilds. However, this has an extra problem that does not occur with the master version updates, which is that destructively overriding images means that after one incremental build, rebuilding the same file again will actually add the increment twice (building from scratch did not have this problem because the result would be the same). So we actually have to stop overriding images entirely. Instead we patch the base image to an updated version of it (with a new tag) when the pod gets used.