fix: low probability IO hang problem #123

Merged
eryugey merged 1 commit into cloud-hypervisor:master from aftersnow:fix-io-hang-issue
May 8, 2023

@aftersnow aftersnow commented Apr 21, 2023

We found that some applications may encounter an I/O hang:

Call Trace:
[] ? __schedule+0x23b/0x780
[] schedule+0x36/0x80
[] request_wait_answer+0xc0/0x1f0 [fuse]
[] ? prepare_to_wait_event+0x100/0x100
[] __fuse_request_send+0x84/0x90 [fuse]
[] fuse_request_send+0x27/0x30 [fuse]
[] fuse_simple_request+0xcf/0x1a0 [fuse]
[] fuse_dentry_revalidate+0x18b/0x300 [fuse]
[] lookup_fast+0x2eb/0x310
[] walk_component+0x47/0x310
[] path_lookupat+0x67/0x120
[] ? generic_file_read_iter+0x699/0x960
[] filename_lookup+0xb1/0x180
[] ? kmem_cache_alloc+0x146/0x1a0
[] ? getname_flags+0x4f/0x1f0
[] ? getname_flags+0x6f/0x1f0
[] user_path_at_empty+0x36/0x40
[] vfs_fstatat+0x66/0xc0
...

The root cause may be that the FUSE daemon encounters an error (a bytes_to_cstr() error) while processing a request but does not reply to the kernel, which leaves the caller blocked in request_wait_answer().

This patch fixes the issue by adding a reply. Because the exception does not produce any logs, we cannot be 100% sure that it is the cause. However, after applying this patch in our production environment, the issue no longer occurs.
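
To illustrate the failure mode described above, here is a minimal, self-contained Rust sketch. It is not the project's actual code: `Reply`, `parse_name`, `handle_lookup_buggy`, and `handle_lookup_fixed` are hypothetical names, and `parse_name` only stands in for a helper like bytes_to_cstr(). It assumes the error comes from a name buffer that is not a valid NUL-terminated string.

```rust
use std::ffi::CStr;

// Hypothetical reply type: what a daemon would send back for one request.
#[derive(Debug, PartialEq)]
enum Reply {
    Ok(String),
    Error(i32), // errno value reported to the kernel
}

const EINVAL: i32 = 22;

// Hypothetical stand-in for a bytes_to_cstr()-style helper. It fails when the
// buffer is not exactly one NUL-terminated string.
fn parse_name(buf: &[u8]) -> Result<&CStr, i32> {
    CStr::from_bytes_with_nul(buf).map_err(|_| EINVAL)
}

// Buggy shape: a parse failure returns early with no reply at all, so the
// kernel side would keep waiting for an answer that never arrives.
fn handle_lookup_buggy(buf: &[u8]) -> Option<Reply> {
    let name = parse_name(buf).ok()?;
    Some(Reply::Ok(name.to_string_lossy().into_owned()))
}

// Fixed shape: every path, including the error path, produces a reply.
fn handle_lookup_fixed(buf: &[u8]) -> Reply {
    match parse_name(buf) {
        Ok(name) => Reply::Ok(name.to_string_lossy().into_owned()),
        Err(errno) => Reply::Error(errno),
    }
}

fn main() {
    let good = b"file.txt\0";
    let bad = b"file.txt"; // missing NUL terminator

    println!("buggy, good input: {:?}", handle_lookup_buggy(good));
    println!("buggy, bad input:  {:?}", handle_lookup_buggy(bad));
    println!("fixed, good input: {:?}", handle_lookup_fixed(good));
    println!("fixed, bad input:  {:?}", handle_lookup_fixed(bad));
}
```

The point of the fixed variant is that its return type makes a missing reply unrepresentable: the handler cannot finish without handing something back to the requester.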

@aftersnow aftersnow force-pushed the fix-io-hang-issue branch 2 times, most recently from 721692b to a19613a on April 21, 2023 03:50
We found that some applications may encounter an I/O hang:

Call Trace:
[<ffffffff8173ca3b>] ? __schedule+0x23b/0x780
[<ffffffff8173cfb6>] schedule+0x36/0x80
[<ffffffffa0377840>] request_wait_answer+0xc0/0x1f0 [fuse]
[<ffffffff810d5350>] ? prepare_to_wait_event+0x100/0x100
[<ffffffffa03779f4>] __fuse_request_send+0x84/0x90 [fuse]
[<ffffffffa0377a27>] fuse_request_send+0x27/0x30 [fuse]
[<ffffffffa037adaf>] fuse_simple_request+0xcf/0x1a0 [fuse]
[<ffffffffa037c80b>] fuse_dentry_revalidate+0x18b/0x300 [fuse]
[<ffffffff8125b79b>] lookup_fast+0x2eb/0x310
[<ffffffff8125c527>] walk_component+0x47/0x310
[<ffffffff8125d7c7>] path_lookupat+0x67/0x120
[<ffffffff811b6e99>] ? generic_file_read_iter+0x699/0x960
[<ffffffff81260421>] filename_lookup+0xb1/0x180
[<ffffffff812235c6>] ? kmem_cache_alloc+0x146/0x1a0
[<ffffffff8126001f>] ? getname_flags+0x4f/0x1f0
[<ffffffff8126003f>] ? getname_flags+0x6f/0x1f0
[<ffffffff81254896>] vfs_fstatat+0x66/0xc0
[<ffffffff81254e51>] SYSC_newlstat+0x31/0x60

The root cause may be that the FUSE daemon encounters an error (a
bytes_to_cstr() error) while processing a request but does not reply to the
kernel.

This patch fixes the issue by adding a reply. Because the exception does not
produce any logs, we cannot be 100% sure that it is the cause. However, after
applying this patch in our production environment, the issue no longer occurs.

Signed-off-by: winters.zc <winters.zc@antfin.com>

@bergwolf bergwolf left a comment

lgtm! @eryugey do you have more comments on the PR?


@eryugey eryugey left a comment

LGTM

@eryugey eryugey merged commit 89c5fa7 into cloud-hypervisor:master May 8, 2023