lib: tst_device: sleep before unbinding the loop device #866

Open, wants to merge 1 commit into base: master

@YinboZhu YinboZhu commented Sep 4, 2021

When running the ltp/ltpstress tests, the kernel generates I/O errors
on the loop device because loop I/O requests have not finished
dispatching before the loop device is unbound. This patch fixes the
I/O error by sleeping for a short period before unbinding the loop
device.

Signed-off-by: Yinbo Zhu <zhuyinbo@loongson.cn>

@YinboZhu
Author

YinboZhu commented Sep 4, 2021

@metan-ucw

@metan-ucw
Member

What exact error did you get?

You should handle the error correctly rather than moving sleep() around and hoping that you will not hit it.

@YinboZhu
Author

Hi metan-ucw,

The ltpstress I/O error is "print_req_error: I/O error, dev loop0, sector 0". It occurs because loop I/O requests have not finished dispatching before the loop device is unbound. Request submission is asynchronous to request dispatch. When CPU pressure increases, the dispatch path is delayed, so the submitting process can complete its work first, and the test case then unbinds the loop device while requests are still waiting to be dispatched. That is why I added a short sleep before unbinding the loop device.

Later I found that, over a large number of test runs, this does not make the problem disappear completely; it only reduces the probability of the loop I/O error. So I will drop this patch.

My conclusion is that this loop I/O error is expected when running ltpstress. Because the CPU time consumed by other processes cannot be predicted, the kernel cannot guarantee that the test case's loop I/O requests have all been dispatched before the device is unbound. Do you have a different view of the loop I/O error "print_req_error: I/O error, dev loop0, sector 0"?
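The timing argument above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions, not kernel code: the function name `count_failed_requests`, the notion of discrete "ticks", and the rule that one request is dispatched per tick are all hypothetical simplifications. It shows why a fixed sleep before unbinding only helps while the dispatch delay stays below the sleep duration.

```c
#include <assert.h>

/*
 * Toy model (hypothetical, not kernel code).
 * - nreq requests are submitted at tick 0.
 * - The dispatcher is delayed by dispatch_delay ticks, then handles
 *   one request per tick.
 * - The test unbinds the device at unbind_tick (the "sleep" length).
 * A request dispatched at or after unbind_tick is counted as failed.
 */
int count_failed_requests(int nreq, int dispatch_delay, int unbind_tick)
{
    int failed = 0;

    assert(nreq >= 0 && dispatch_delay >= 0 && unbind_tick >= 0);

    for (int i = 0; i < nreq; i++) {
        int dispatch_tick = dispatch_delay + i;

        if (dispatch_tick >= unbind_tick)
            failed++;
    }
    return failed;
}
```

In this model, `count_failed_requests(4, 0, 4)` reports no failures, while the same sleep with a dispatch delay of 10 ticks fails all four requests. In other words, a longer sleep shifts the threshold but cannot remove it as long as the delay is unbounded.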

@metan-ucw
Member

The "print_req_error: I/O error, dev loop0, sector 0" is a kernel error, right?

What is the output from the testcases? There should be some kind of error in there as well.

@metan-ucw
Member

After a bit of debugging over IRC we found that the problem seems to be in the fallback with a loop device for the needs_rofs flag. It seems that some tests fail to clean up properly when the test is skipped early, such as chown04_16.

@YinboZhu
Author

Hi metan-ucw,

Yes, "print_req_error: I/O error, dev loop0, sector 0" is a kernel error. My previous comment analyzed the conditions under which this loop error occurs. The corresponding code is as follows:

    static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd)
    {
            ...
            if (lo->lo_state != Lo_bound)
                    return BLK_STS_IOERR;
            ...
    }

The following code shows how the loop I/O error arises. The function "blk_mq_dispatch_rq_list" is responsible for I/O dispatch, and "q->mq_ops->queue_rq" is initialized to "loop_queue_rq". When it returns an error, "blk_mq_end_request" calls "print_req_error(req, error)", and the kernel reports "print_req_error: I/O error, dev loop0, sector 0":

    bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget)
    {
            ...
            ret = q->mq_ops->queue_rq(hctx, &bd);
            if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                    blk_mq_handle_dev_resource(rq, list);
                    break;
            }

            if (unlikely(ret != BLK_STS_OK)) {
                    errors++;
                    blk_mq_end_request(rq, BLK_STS_IOERR);

                    continue;
            }

            ...
    }
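
As a rough illustration of the two snippets above, here is a minimal user-space model of the same control flow. It is a sketch only: every name prefixed with `model_` or `MODEL_` is hypothetical, the resource-busy branch is left out, and the message printed to stderr merely imitates the kernel log line. The point is that once the modelled device is no longer bound, every request still in the dispatch list is ended with an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's loop state and block status. */
enum model_lo_state { MODEL_LO_UNBOUND, MODEL_LO_BOUND };
enum model_status { MODEL_STS_OK, MODEL_STS_IOERR };

struct model_loop {
    enum model_lo_state state;
};

/* Mirrors the state check quoted from loop_queue_rq(). */
enum model_status model_queue_rq(const struct model_loop *lo)
{
    if (lo->state != MODEL_LO_BOUND)
        return MODEL_STS_IOERR;
    return MODEL_STS_OK;
}

/*
 * Mirrors the error path quoted from blk_mq_dispatch_rq_list():
 * each request that cannot be queued is counted and reported.
 * Returns the number of requests ended with an error.
 */
int model_dispatch(const struct model_loop *lo, int nreq)
{
    int errors = 0;

    assert(lo != NULL && nreq >= 0);

    for (int i = 0; i < nreq; i++) {
        if (model_queue_rq(lo) != MODEL_STS_OK) {
            errors++;
            fprintf(stderr, "model: I/O error, dev loop0, request %d\n", i);
            continue;
        }
    }
    return errors;
}
```

Calling `model_dispatch()` on a bound device returns 0, whereas flipping `state` to `MODEL_LO_UNBOUND` before the call makes every pending request fail, which matches the ordering problem described in this thread.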

According to a large number of ltpstress test results, almost all test cases that perform I/O on a loop device encounter this problem. Among them, the open12 test case has the highest probability of hitting I/O errors; the other test cases recorded as reporting errors are rename11, lchown03, mmap16, utime06, mknod07 and ftruncate04.
In addition, I added logic to some functions on the I/O dispatch path to delay dispatch and then ran a single test case. The loop I/O errors also occurred, which confirms my earlier analysis.
