Description
Describe the bug
Unfortunately, on some systems parca-agent triggers a rare upstream kernel BUG because it calls bpf_probe_read_user() inside the perf_event IRQ. With some kernel configs (e.g., CONFIG_HARDENED_USERCOPY), bpf_probe_read_user() calls copy_from_user_nofault() > access_ok() > ... > find_vmap_area(), which attempts to acquire vmap_area_lock. If the interrupt arrived while the lock was already held (e.g., during alloc_vmap_area() in the clone() syscall), find_vmap_area() never returns. The lock held by clone() is therefore never released, and every other CPU attempting to acquire it spins in an infinite loop. Eventually this happens on all CPUs and the whole machine locks up.
To Reproduce
Start a machine using the affected upstream kernel code (tested with v6.1, but I believe the bug is also present in most other kernels). To reproduce, you can for example use an AWS EC2 c6a.16xlarge (64 vCPUs) instance with the AMI al2023-ami-2023.0.20230503.0-kernel-6.1-x86_64. Having more CPUs makes the bug trigger more quickly.
$ curl -sL https://github.com/parca-dev/parca-agent/releases/download/v0.19.0/parca-agent_0.19.0_`uname -s`_`uname -m`.tar.gz | tar xvfz -
$ sudo ./parca-agent --node=test --remote-store-address=localhost:7070 --remote-store-insecure
To trigger the bug quickly, execute some code that will also use vmap_area_lock. For example, the clone() syscall:
$ while true ; do
ls -al > /dev/null # do not use true which is a shell builtin
done
Within 10 minutes, the CPU soft lockup messages should appear on the serial console.
Expected behavior
The machine does not lock up. A BPF program should not be able to lock up the machine, but because of the kernel bug it happens anyway.
Logs
Here's an annotated log from the serial console. Other traces are also printed (from the other CPUs attempting to acquire the lock), but I believe this one shows the root cause:
[253905.544838] Sending NMI from CPU 27 to CPUs 55:
[253905.545371] NMI backtrace for cpu 55
[253905.545375] CPU: 55 PID: 3316 Comm: spawn Tainted: G L 6.1.25-37.47.amzn2023.x86_64 #1
[253905.545377] Hardware name: Amazon EC2 c6a.16xlarge/, BIOS 0 10/16/2017
[253905.545378] RIP: 0010:native_queued_spin_lock_slowpath+0x32/0x2c0
[253905.545384] Code: 54 55 48 89 fd 53 66 90 ba 01 00 00 00 8b 45 00 85 c0 75 14 f0 0f b1 55 00 85 c0 75 f0 5b 5d 41 5c 41 5d c3 cc cc cc cc f3 90 <eb> e1 81 fe 00 01 00 00 74 50 40 30 f6 85 f6 75 73 f0 0f ba 6d 00
[253905.545385] RSP: 0018:ffffc3edc6e68bc0 EFLAGS: 00000002
[253905.545387] RAX: 0000000000000001 RBX: ffffffffa1777ccc RCX: 0000000000000010
[253905.545388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffa1777ccc
[253905.545388] RBP: ffffffffa1777ccc R08: 0000000000000001 R09: 000004c6af4181a9
[253905.545389] R10: 0000000000000000 R11: ffffc3edc6e68ff8 R12: 0000000000000008
[253905.545390] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000080000000
[253905.545393] FS: 00007fd4a28d8600(0000) GS:ffffa057e99c0000(0000) knlGS:0000000000000000
[253905.545394] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[253905.545395] CR2: 00000000004040b0 CR3: 00000002461a8001 CR4: 00000000003706e0
[253905.545398] Call Trace:
[253905.545399] <IRQ>
#
#
# https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L1861
#
[253905.545401] _raw_spin_lock+0x30/0x40
[253905.545403] find_vmap_area+0x17/0x60
#
#
# Likely requires https://elixir.bootlin.com/linux/v6.1.28/K/ident/CONFIG_HARDENED_USERCOPY
#
[253905.545407] check_heap_object+0xd4/0x150
[253905.545409] __check_object_size.part.0+0x47/0xd0
#
#
# This does pagefault_disable() (like perf_callchain_user()), which should make the actual copy IRQ-safe.
#
# But it calls access_ok() before pagefault_disable(), which is apparently not IRQ-safe.
# https://elixir.bootlin.com/linux/v6.1.28/source/arch/x86/include/asm/uaccess.h#L41
#
[253905.545411] copy_from_user_nofault+0x65/0x90
[253905.545413] bpf_probe_read_user+0x18/0x50
[253905.545416] bpf_prog_2448819a7219e528_profile_cpu+0x354/0x9fd
[253905.545421] bpf_overflow_handler+0xad/0x170
[253905.545424] __perf_event_overflow+0x102/0x1e0
[253905.545426] ? __perf_event_overflow+0x1e0/0x1e0
[253905.545427] perf_swevent_hrtimer+0x12b/0x140
[253905.545430] ? update_load_avg+0x7e/0x740
[253905.545433] ? enqueue_entity+0x1b2/0x520
[253905.545435] __hrtimer_run_queues+0x112/0x2b0
[253905.545439] hrtimer_interrupt+0x106/0x220
[253905.545442] __sysvec_apic_timer_interrupt+0x7f/0x170
[253905.545445] sysvec_apic_timer_interrupt+0x9d/0xd0
[253905.545448] </IRQ>
[253905.545449] <TASK>
[253905.545449] asm_sysvec_apic_timer_interrupt+0x16/0x20
[253905.545452] RIP: 0010:insert_vmap_area.constprop.0+0x34/0x120
[253905.545453] Code: 4b 03 41 55 41 54 55 53 48 89 fb 48 85 c0 0f 84 d3 00 00 00 4c 8b 4f 08 eb 10 48 8b 48 10 48 8d 50 10 48 85 c9 74 29 48 8b 02 <48> 8b 48 f0 49 39 c9 76 e7 48 8b 33 40 f8 4c 39 c6 0f 82 88
[253905.545454] RSP: 0018:ffffc3ede3b23bf8 EFLAGS: 00000282
[253905.545455] RAX: ffffa039ec903d10 RBX: ffffa039ec9030c0 RCX: ffffa039ec903d10
[253905.545456] RDX: ffffa0492d825520 RSI: ffffc3edf0f08000 RDI: ffffa039ec9030c0
[253905.545456] RBP: ffffa048c77e8400 R08: ffffc3edf0efd000 R09: ffffc3edf0f0d000
[253905.545457] R10: ffffc3edf0f05000 R11: 0000000000036b00 R12: 0000000000005000
[253905.545458] R13: 0000000000003fff R14: ffffa039ec9030c0 R15: ffffc3edc0000000
#
#
# https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L1634
#
[253905.545460] alloc_vmap_area+0x330/0x820
[253905.545463] __get_vm_area_node+0xb8/0x170
[253905.545464] __vmalloc_node_range+0xa6/0x220
[253905.545466] ? dup_task_struct+0x57/0x1a0
[253905.545470] alloc_thread_stack_node+0xcd/0x130
[253905.545472] ? dup_task_struct+0x57/0x1a0
[253905.545474] dup_task_struct+0x57/0x1a0
[253905.545476] copy_process+0x1bd/0x15c0
[253905.545479] kernel_clone+0x9b/0x3b0
[253905.545482] __do_sys_clone+0x66/0x90
[253905.545485] do_syscall_64+0x3b/0x90
[253905.545487] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[253905.545489] RIP: 0033:0x7fd4a2718a27
[253905.545490] Code: 00 00 00 f3 0f 1e fa 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 41 89 c0 85 c0 75 2a 64 48 8b 04 25 10 00
[253905.545491] RSP: 002b:00007ffc648c1158 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[253905.545492] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd4a2718a27
[253905.545493] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[253905.545494] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[253905.545494] R10: 00007fd4a28d88d0 R11: 0000000000000246 R12: 0000000000000000
[253905.545495] R13: 00000000004010b0 R14: 0000000000403e00 R15: 00007fd4a2914000
[253905.545497] </TASK>
Software (please complete the following information):
- Parca Agent Version: v0.19.0, also tested git tree from last week
- Parca Server Version (if applicable): NA
Workload (please complete the following information):
- Runtime (if applicable):
- Compiler (if applicable):
Environment (please complete the following information):
- Linux Distribution (tested on the following, others are likely also affected):
$ cat /etc/*-release
Amazon Linux release 2023 (Amazon Linux)
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"
Amazon Linux release 2023 (Amazon Linux)
- Linux Version: 6.1.25-37.47.amzn2023.x86_64
- Arch: x86_64
- Kubernetes Version (if applicable): NA
- Container Runtime (if applicable): NA
Additional context
I believe this is neither a bug in Amazon Linux nor in Parca, but an upstream kernel bug. I have not reported it upstream yet (you are free to do so yourself; it would be great if you CC gerhorst@amazon.de and linux-kernel@luisgerhorst.de if you do). I was not able to find an existing report on LKML. I am reporting this here because parca-agent is affected, and you will likely want to change your BPF program even if the bug is fixed upstream (as it will take time for the fix to propagate).
The best fix for you is likely to stop using the BPF helper for now. Maybe you can also detect the specific conditions that trigger the bug and only avoid calling the helper when these are present.
To fix the kernel bug itself, it may be possible to disable IRQs during alloc_vmap_area() and similar functions, or to make access_ok() IRQ-safe.