lib: tst_device: sleep before unbinding the loop device #866

YinboZhu · 2021-09-04T07:24:28Z

When running the ltp/ltpstress tests, the kernel generates I/O errors
on the loop device because loop I/O requests have not finished
dispatching before the loop device is unbound. This patch fixes the
I/O errors by sleeping for a short period before unbinding the loop
device.

Signed-off-by: Yinbo Zhu <zhuyinbo@loongson.cn>


YinboZhu · 2021-09-04T09:33:47Z

@metan-ucw

metan-ucw · 2021-09-06T08:01:58Z

What exact error did you get?

You should handle the error correctly rather than moving sleep() around and hoping that you will not hit it.

YinboZhu · 2021-09-17T09:38:28Z

Hi metan-ucw,

The ltpstress I/O error is "print_req_error: I/O error, dev loop0, sector 0". It occurs because loop I/O requests have not finished dispatching before the loop device is unbound.

When CPU pressure increases, the I/O dispatch process delays dispatching I/O requests. Because I/O request submission is asynchronous to I/O dispatch, the submitting process can complete its work before the dispatch process does, and the test case then unbinds the loop device. At that point, loop I/O requests may still be waiting for dispatch. That is why I added the logic to sleep for a short time before unbinding the loop device.

Later, across a large number of test runs, I found that this approach does not eliminate the problem; it only reduces the probability of the loop I/O error, so I will drop this patch. My conclusion is that the loop I/O error above is expected when running ltpstress: the CPU time consumed by other processes is unpredictable, so the kernel cannot guarantee that the loop I/O dispatch process has finished dispatching the test case's I/O requests before the device is unbound.

Do you have a different view of the loop I/O error "print_req_error: I/O error, dev loop0, sector 0"?
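As an illustration of waiting for outstanding I/O explicitly instead of sleeping for a fixed time, here is a minimal sketch. The helper name `detach_loop_flushed` and its `max_tries` parameter are hypothetical and not part of LTP. It assumes that `fsync()` on the loop device file descriptor flushes buffered writes before `LOOP_CLR_FD` is issued, and that a busy device reports `EBUSY`; whether this closes the race described above has not been verified.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

/*
 * Hypothetical helper (not from LTP): flush the loop device, then try to
 * unbind it, retrying a bounded number of times while it is busy.
 * Returns 0 on success, -1 on failure with errno set.
 */
int detach_loop_flushed(int dev_fd, int max_tries)
{
	/* Wait for buffered writes to reach the backing file first. */
	if (fsync(dev_fd) < 0)
		return -1;

	for (int i = 0; i < max_tries; i++) {
		if (ioctl(dev_fd, LOOP_CLR_FD, 0) == 0)
			return 0;

		/* ENXIO: the device is already unbound, nothing left to do. */
		if (errno == ENXIO)
			return 0;

		/* Anything other than "busy" is a real error. */
		if (errno != EBUSY)
			return -1;

		usleep(50000); /* back off for 50 ms before retrying */
	}

	errno = EBUSY;
	return -1;
}
```

The bounded retry keeps the wait tied to an observable condition (the ioctl result) rather than to a guessed delay, and an invalid descriptor fails immediately instead of sleeping.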

metan-ucw · 2021-09-17T10:05:35Z

The "print_req_error: I/O error, dev loop0, sector 0" is a kernel error, right?

What is the output from the testcases? There should be some kind of error in there as well.

metan-ucw · 2021-09-17T12:18:54Z

After a bit of debugging over IRC we found that the problem seems to be in the fallback to a loop device for the needs_rofs flag. It seems that some tests fail to clean up properly when the test is skipped early, such as chown04_16.

YinboZhu · 2021-09-18T01:57:27Z

Hi metan-ucw,

Yes, the "print_req_error: I/O error, dev loop0, sector 0" is a kernel error. In my previous description I analyzed the conditions that lead to this loop error. The corresponding code is as follows:

    static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd)
    {
            ...
            if (lo->lo_state != Lo_bound)
                    return BLK_STS_IOERR;
            ...
    }

The following code shows how the loop I/O error arises. The function "blk_mq_dispatch_rq_list" is responsible for I/O dispatch, and "q->mq_ops->queue_rq" is initialized to "loop_queue_rq". When it returns an error, "blk_mq_end_request" leads to "print_req_error(req, error)", and the kernel reports "print_req_error: I/O error, dev loop0, sector 0".

    bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget)
    {
            ...
            ret = q->mq_ops->queue_rq(hctx, &bd);
            if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                    blk_mq_handle_dev_resource(rq, list);
                    break;
            }

            if (unlikely(ret != BLK_STS_OK)) {
                    errors++;
                    blk_mq_end_request(rq, BLK_STS_IOERR);
                    continue;
            }
            ...
    }

According to a large number of ltpstress test results, almost all test cases that use loop devices and perform I/O on them can hit this problem. Among them, the open12 test case has the highest probability of hitting I/O errors; the other test cases recorded as reporting errors are rename11, lchown03, mmap16, utime06, mknod07, and ftruncate04.
In addition, I added some logic to several functions in the I/O dispatch queue to delay I/O dispatch, and then executed a single test case. The loop I/O errors occur in that setup as well, which supports my previous analysis.
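The ordering dependency described above can be sketched with a small userspace toy model. Every name here (`toy_loop`, `toy_queue_rq`, and so on) is hypothetical; the model only mirrors the shape of the quoted kernel snippets and is not kernel code. It assumes that submission merely queues a request and that the bound-state check happens at dispatch time.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical, none are kernel APIs. */
enum toy_lo_state { TOY_LO_UNBOUND, TOY_LO_BOUND };
enum toy_status { TOY_STS_OK, TOY_STS_IOERR };

struct toy_loop {
	enum toy_lo_state state;
	size_t pending; /* requests submitted but not yet dispatched */
};

/* Submission only queues requests; it returns before they are dispatched. */
void toy_submit(struct toy_loop *lo, size_t nr)
{
	lo->pending += nr;
}

/* Mirrors the lo_state check quoted from loop_queue_rq(). */
enum toy_status toy_queue_rq(const struct toy_loop *lo)
{
	return lo->state == TOY_LO_BOUND ? TOY_STS_OK : TOY_STS_IOERR;
}

/* Dispatch every pending request and return how many of them failed. */
size_t toy_dispatch_all(struct toy_loop *lo)
{
	size_t errors = 0;

	assert(lo != NULL);
	while (lo->pending > 0) {
		lo->pending--;
		if (toy_queue_rq(lo) != TOY_STS_OK)
			errors++;
	}
	return errors;
}

void toy_unbind(struct toy_loop *lo)
{
	lo->state = TOY_LO_UNBOUND;
}

/* Dispatch completes before the unbind: no request fails. */
size_t toy_run_dispatch_first(size_t nr)
{
	struct toy_loop lo = { TOY_LO_BOUND, 0 };

	toy_submit(&lo, nr);
	size_t errors = toy_dispatch_all(&lo);
	toy_unbind(&lo);
	return errors;
}

/* The unbind wins the race: every still-pending request fails. */
size_t toy_run_unbind_first(size_t nr)
{
	struct toy_loop lo = { TOY_LO_BOUND, 0 };

	toy_submit(&lo, nr);
	toy_unbind(&lo);
	return toy_dispatch_all(&lo);
}
```

In this model, `toy_run_dispatch_first(3)` reports no errors while `toy_run_unbind_first(3)` reports three. A fixed sleep before `toy_unbind()` would only make the first ordering more likely; it would not enforce it.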

wangli5665 force-pushed the master branch from b28d946 to 3490c28 Compare June 16, 2023 02:50

pevik force-pushed the master branch from 5cfb7f5 to 3fe59ef Compare May 7, 2024 17:51
