Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC 2021] Introduce io_uring dump/restore support #1597

Draft
wants to merge 121 commits into
base: criu-dev
Choose a base branch
from

Conversation

kkdwivedi
Copy link

@kkdwivedi kkdwivedi commented Aug 30, 2021

Currently, missing features are:

  • Registered files support
  • Registered eventfd support
  • Registered buffers support
  • Registered workqueue fd support
  • ... and some other minor things depending on the above.

All of these depend on the eBPF iterator patches that are in https://github.com/kkdwivedi/linux/tree/criu-iter.

The following blockers exist currently:

  • For registered files support, the io_uring_file iterator can be used to match on files present in the fdtable of task by matching directly on the struct file pointer. To make this safe against closing of files (so that such an iteration to gather data could potentially be moved to pre_dump stage and update itself dynamically), a new file local storage map would be utilized.
  • For eventfd support, finding the backing fd would rely on matching the eventfd context pointer with the one stashed in f_private of the fds held open by the task. This would require pairing the io_uring iterator and task_file iterator.
  • Buffer registration takes reference to struct page internally, which means that the actual mapping can be destroyed after buffers are registered. This makes the task of mapping back it to a vma hard. The proposed solution to resolve is to introduce a eBPF iterator for io_uring buffers. Paired with with task_vma iterator, this should allow matching on the struct page backing the vma and the mapped buffer.
  • Finding the backing workqueue for an io_uring (registered using wq_fd, another io_uring) relies on matching ctx->sq_data inside the kernel, which would again rely on eBPF iterator support.

For gathering data about io_uring itself:

  • We can lift data like e.g. restrictions from the io_uring iteration stage itself, but there are stability issues with eBPF, so if e.g. member name is changed across kernel versions, it would require updating CRIU. The solution to this may be shipping stable eBPF iterators with the kernel (like modules) which maintain a stable output but can be modified internally across kernel versions. Right now the stopgap solution chosen is using fdinfo interface.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

@mihalicyn
Copy link
Member

Hi, Kumar!

Please take a look on tests failures.
Looks like on CentOS 7 we need to install liburing-devel package and use header liburing/io_uring.h from liburing.
On CentOS 8 we get 362 tests failed:

un criu dump
=[log]=> dump/zdtm/transition/fifo_loop/54/1/dump.log
------------------------ grep Error ------------------------
b'(00.010197) ----------------------------------------'
b'(00.010199)'
b'(00.010200) Collecting fds (pid: 54)'
b'(00.010201) ----------------------------------------'
b'(00.010208) Error (criu/cr-dump.c:214): pidfd_open system call not supported'
b'(00.010211) Error (criu/cr-dump.c:1326): Collect fds (pid: 54) failed with -1'
b'(00.014873) \tUnseizing 57 into 1'
b'(00.014877) \tUnseizing 58 into 1'
b'(00.014880) \tUnseizing 59 into 1'
b'(00.014883) \tUnseizing 60 into 1'
b'(00.014900) Error (criu/cr-dump.c:1834): Dumping FAILED.'

Regards,
Alex

criu/cr-dump.c Outdated Show resolved Hide resolved
@adrianreber
Copy link
Member

I started the CI tests which were not running.

@kkdwivedi kkdwivedi force-pushed the io-uring branch 2 times, most recently from 7ca36e7 to cab53ef Compare August 31, 2021 07:44
@kkdwivedi
Copy link
Author

This should fix everything except the 'header not found' problem for <linux/io_uring.h>.

@@ -704,6 +709,8 @@ int prepare_mm_pid(struct pstree_item *i)
ret = collect_filemap(vma);
else if (vma_area_is(vma, VMA_AREA_SOCKET))
ret = collect_socket_map(vma);
else if (vma_area_is(vma, VMA_AREA_IO_URING))
ret = collect_io_uring_map(vma);
Copy link
Member

@rst0git rst0git Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure that CRIU continues to work with older kernel versions that don't support io_uring. Adding a kerndat check for that might help. For example, consider the case when a process tree (container) has been migrated to a system with older kernel version where io_uring is not supported.

@@ -792,6 +817,7 @@ int parse_smaps(pid_t pid, struct vm_area_list *vma_area_list, dump_filemap_t du
if (handle_vma(pid, vma_area, str + path_off, map_files_dir, &vfi, &prev_vfi, &vm_file_fd))
goto err;

//pr_info("dump_filemap=%p IO_URING=%d?\n", dump_filemap, (int) vma_area_is(vma_area, VMA_AREA_IO_URING));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be pr_debug?

@rst0git
Copy link
Member

rst0git commented Aug 31, 2021

This should fix everything except the 'header not found' problem for <linux/io_uring.h>.

You can add linux/io_uring.h in criu/include/linux/.
For example, f68e5a6 adds linux/userfaultfd.h to solve a similar problem.

IoUringFileEntry *iofe = io_uring_get_iofe(ctx);
unsigned int nr;
pid_t pid;
int r;
Copy link
Member

@rst0git rst0git Sep 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str, r, and f are not very descriptive. Maybe we could use better variable names?

@kkdwivedi
Copy link
Author

@rst0git Thanks for the review. I'll update this with your suggestions applied once I get the kernel side patches in order, which shouldn't take too long.

@kkdwivedi
Copy link
Author

A prerequisite series for this and other stuff planned for later got in, now the iterators are left. After that (leaving out some specific things on CRIU side, like buffer support), this can move forward.

@kkdwivedi
Copy link
Author

kkdwivedi commented Oct 12, 2021

The iterators are in the criu-iter branch, so if someone is feeling bored, I'd appreciate review from as many people :). Note that this is feature complete , i.e. all tests part of the branch pass on BPF CI (when run locally using vmtest.sh in tools/testing/selftests/bpf). The only thing missing is bio_vec iteration for io_uring_ubuf, but that requires more thought (implement natively in the verifier, or add a helper), and I'll probably only touch that after more discussion with others. Also some stuff required to possibly do such iteration natively using a for loop (and making verifier happy about it) depends on BTF tag support, which is undergoing some change upstream for now.

Note:

  • If you are trying out the tests, please use latest clang to avoid any issues (Ubuntu and Fedora have nightly packages).
  • Use git pull --rebase to sync as this branch is force pushed/rebased onto bpf-next whenever there is a conflict.

prakritigoyal19 and others added 16 commits October 14, 2021 12:51
Change made through this commit:
- Include copy of flog as a seperate tree.
- Modify the makefile to add and compile flog code.

Signed-off-by: prakritigoyal19 <prakritigoyal19@gmail.com>
CID 302713 (checkpoint-restore#1 of 1): Missing varargs init or cleanup (VARARGS)
 va_end was not called for argptr.

Signed-off-by: Adrian Reber <areber@redhat.com>
Separate commit for easier criu-dev <-> master transfer.

Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Support for external net namespaces has been introduced with
commit c2b21fb (criu: add support for external net namespaces).

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Currently CRIU cannot handle Checkpoint Restore operations when a device
file is involved in a process, however, CRIU allows flexible extensions
via special plugins but still, for certain complex devices such as a GPU,
the existing hooks are not sufficient. This introduces few new hooks
that will be used to support Checkpoint Restore operation with AMD GPU
devices and potentially to other similar devices too.

 - HANDLE_DEVICE_VMA
 - UPDATE_VMA_MAP
 - RESUME_DEVICES_LATE

 *HANDLE_DEVICE_VMA:
	Hook to detect a suitable plugin to handle device file VMA with
	PF | IO mappings.

 *UPDATE_VMA_MAP:
	Hook to handle VMAs during a device file restore.

	When restoring VMAs for the device files, criu runs sys_mmap in
	the pie restore context but the offsets and file path within a
	device file may change during restore operation so it needs to be
	adjusted properly.

 *RESUME_DEVICES_LATE:
	Hook to do some special handling in late restore phase.

	During criu restore phase when a device is getting restored with
	the help of a plugin, some device specific operations might need
	to be delayed until criu finalizes the VMA placements in address
	space of the target process. But by the time criu finalizes this,
	its too late since pie phase is over and control is back to criu
	master process. This hook allows an external trigger to each
	resuming task to check whether it has a device specific operation
	pending such as issuing an ioctl call? Since this is called from
	criu master process context, supply the pid of the target process
	and give a chance to each plugin registered to run device
	specific operation if the target pid is valid.

A future patch will add consumers for these plugin hooks to support AMD
GPUs.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
This is just a placeholder dummy plugin and will be replaced by a proper
plugin that implements support for AMD GPU devices. This just
facilitates the initial pull request and CI build test trigger for early
code review of CRIU specific changes. Future PRs will bring in more
support for amdgpu_plugin to enable CRIU with AMD ROCm.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Restore operation fails when we perform CR operation of multiple
independent proceses that have device files  because criu caches
the ids for the device files with same mnt_ids, inode pair. This
change ensures that even in case of a cached id found for a device, a
unique subid is generated and returned which is used for dumping.

Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Since commit e42f5e0 ("tcp: allow to specify --tcp-close on dump"),
--tcp-close option can be used when checkpointing. This option skips
checkpointing established socket's state (including once established
but now closed socket). However, when restoring, we still try to
restore closed socket's state. As a result, a non-existent protobuf
image is opened.

This commit skips TCP_CLOSE socket when restoring established TCP
connection and removes the redundant check for TCP_LISTEN socket as
TCP_LISTEN socket cannot reach this function.

Suggested-by: Andrei Vagin <avagin@gmail.com>
Suggested-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
The expected behavior of --tcp-close option when dumpping is to close
all established tcp connections including connection that is once
established but now closed. This adds an explicit description about
that behavior.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Resolve the following python3 portability issues:

1) Python 3 needs explicit relative import path.

2) Coredumps are binary data, not unicode strings. Use byte strings
(b"" instead of "") and open files in binary format.

3) Some functions (for example: filter) return a list in python 2,
but an iterator in python 3. Port code to a common subset of python 2
and python 3 using itertool.

4) Division operator / changed meaning in Python 3. Use explicit
integer division (//) where appropriate.

Signed-off-by: Andrey Vyazovtsev <viazovtsev.av@phystech.edu>
Previous commit added support for python3 in criu-coredump. For convenience,
add two files (coredump-python2 and coredump-python3) that start
criu-coredump with respective python version. Edit env.sh accordingly.

Signed-off-by: Andrey Vyazovtsev <viazovtsev.av@phystech.edu>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
PEP8 recommends for comparisons to singletons like None to always be
done with 'is' or 'is not', never the equality operators.

https://python.org/dev/peps/pep-0008/#programming-recommendations

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Snorch and others added 14 commits January 1, 2022 11:50
New get_sdev_from_fd helper first gets mnt_id from fd using fdinfo and
then converts mnt_id to sdev using mountinfo.

By default mnt_id to sdev conversion only works for mounts in mntinfo.

If parse_mountinfo argument is true, will also parse current process
mountinfo when looking for mount sdev, this should be used only with
temporary mounts just created by criu in current mntns.

v3: add argument to parse self mountinfo for auxiliary mounts

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We face that btrfs returns anonymous device in stat instead of real
superblock dev for volumes, thus all btrfs volume mounts does not pass
check_mountpoint_fd due to dev missmatch between stat and mountinfo. We
can use special helper get_sdev_from_fd instead of stat to try to get
real dev of fd for btrfs.

We move check_mountpoint_fd from open_mountpoint into get_clean_fd and
ns_open_mountpoint to the point where temporary mount we open fd to is
still in mountinfo, thus get_sdev_from_fd would be able to find tmp
mount in mountinfo.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
All the bugs that were in the way got fixed. We can enable all tests.

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CI badges and logo are already present in the readme file.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Adrian Reber <areber@redhat.com>
A couple of months (or years) ago I looked into lgtm.com for CRIU. Today
on a pull request I saw result from lgtm.com for the first time and it
failed. Not sure what triggered the lgtm.com message into the CRIU
repository, but with the .lgtm.yml file in this commit lgtm.com can
actually build CRIU.

Signed-off-by: Adrian Reber <areber@redhat.com>
This commit adds feature check support to libcriu. It already exists in
the CLI and RPC and this just extends it to libcriu.

This commit provides one function to do all possible feature checks in
one call. The parameter to the feature check function is a structure and
the user can enable which features should be checked.

Using a structure makes the function extensible without the need to
break the API/ABI in the future.

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
When requested iovs are huge, criu needs to invoke more then one
preadv()s. In this situation criu truncates memory image with
offset of first preadv() and length of last one, which leads
to leakage of memory image. This patch fixs truncating with right
offset and length.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
This case sometimes will cause SIGILL signal in arm64 platform.

<<ARM Coretex-A series Programmer's Guide for ARMv8-A>> notes:
  The ARM architecture does not require the hardware to ensure coherency
  between instruction caches and memory, even for locations of shared
  memory.

Therefore, we need flush dcache and icache for self-modifying code.

- https://developer.arm.com/documentation/den0024/a/Caches/Point-of-coherency-and-unification

Signed-off-by: fu.lin <fulin10@huawei.com>
This is a confusing change as it seems the original code was just wrong.
GCC 12 complains with:

In function ‘__conv_val’,
    inlined from ‘std_strtoul’ at compel/plugins/std/string.c:202:7:
compel/plugins/std/string.c:154:24: error: array subscript 97 is above array bounds of ‘const char[37]’ [-Werror=array-bounds]
  154 |                 return &conv_tab[__tolower(c)] - conv_tab;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~
compel/plugins/std/string.c: In function ‘std_strtoul’:
compel/plugins/std/string.c:10:19: note: while referencing ‘conv_tab’
   10 | static const char conv_tab[] = "0123456789abcdefghijklmnopqrstuvwxyz";
      |                   ^~~~~~~~
cc1: all warnings being treated as errors

Which sounds correct. The array conv_tab has just 37 elements.

If I understand the code correctly we are trying to convert anything
that is character between a-z and A-Z to a number for cases where
the base is larger than 10. For a base 11 conversion b|B should return 11.
For a base 35 conversion z|Z should return 35. This is all for a strtoul()
implementation.

The original code was:

  static const char conv_tab[] = "0123456789abcdefghijklmnopqrstuvwxyz";

  return &conv_tab[__tolower(c)] - conv_tab;

and that seems wrong. If conv_tab would have been some kind of hash it could
have worked, but '__tolower()' will always return something larger than
97 ('a') which will always overflow the array.

But maybe I just don't get that part of the code.

I replaced it with

  return __tolower(c) - 'a' + 10;

which does the right thing: 'A' = 10, 'B' = 11 ... 'Z' = 35

Signed-off-by: Adrian Reber <areber@redhat.com>
This fixes:

criu/config.c: In function ‘parse_statement’:
criu/config.c:232:43: error: the comparison will always evaluate as ‘true’ for the pointer operand in ‘*(configuration + (sizetype)((long unsigned int)i * 8)) + ((sizetype)offset + 1)’ must not be NULL [-Werror=address]
  232 |         if (configuration[i] + offset + 1 != 0 && strchr(configuration[i] + offset, ' ')) {
      |                                           ^~

Signed-off-by: Adrian Reber <areber@redhat.com>
Parasite creation started to fail with GCC 12:

On x86_64 with:
 ./compel/compel-host hgen -f criu/pie/restorer.built-in.o -o criu/pie/restorer-blob.h
 Error (compel/src/lib/handle-elf-host.c:337): Unexpected undefined symbol: `strlen'. External symbol in PIE?

On aarch64 with:
 ld: criu/pie/restorer.o: in function `lsm_set_label':
 /drone/src/criu/pie/restorer.c:174: undefined reference to `strlen'

Line 174 is: "for (len = 0; label[len]; len++)"

Adding '-ffreestanding' to parasite compilation fixes these errors
because, according to GCC developers:

"strlen is a standard C function, so I don't see any bug in that being used
unless you do a freestanding compilation (-nostdlib isn't that)."

Signed-off-by: Adrian Reber <areber@redhat.com>
@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

rst0git and others added 3 commits January 28, 2022 11:59
Running cross compile tests with Debian unstable sometimes
fails due to missing or outdated packages.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
autofs.c:66:17: error: pointer 'str' may be used after 'realloc' [-Werror=use-after-free]

autofs.c: In function 'check_automount':
../lib/zdtmtst.h:131:9: error: pointer 'mountpoint' may be used after 'free' [-Werror=use-after-free]
  131 |         test_msg("ERR: %s:%d: " format " (errno = %d (%s))\n", __FILE__, __LINE__, ##arg, errno, strerror(errno))
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
autofs.c:277:17: note: in expansion of macro 'pr_perror'
  277 |                 pr_perror("%s: failed to close fd %d", mountpoint, p->fd);
      |                 ^~~~~~~~~
autofs.c:268:9: note: call to 'free' here
  268 |         free(mountpoint);
      |         ^~~~~~~~~~~~~~~~

Fixes: checkpoint-restore#1731

v2: (@Snorch) always update `str` after successful realloc()

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Ashutosh Mehra <asmehra@redhat.com>
This commit introduces basic support in CRIU to make use of eBPF kernel
features to aid in the checkpoint/restore process. The immediate usecase
is to provide an API to find fd to file pointer mapping, and vice-versa,
for quick lookup from one file set (e.g. task, epoll, io_uring) to
another. This is done by making use of eBPF iterator.

This makes use of task, epoll, and io_uring file iterator features
to be introduced in upcoming linux kernel versions.

No dependency on clang's BPF toolchain or libbpf is taken, as we don't
need those features just yet. It might be inevitable as we make more use
of BPF, but for now we can tolerate just writing raw BPF assembly.

To this end, also import bpf_insn.h from kernel's samples/bpf directory.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
@xinyangge-db
Copy link

As the io_uring interface is getting mature, is there any plan to resume the effort in supporting io_uring in CRIU? We have a data-intensive application that is going to significantly benefit from it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet