Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[draft] support of bindmounted unix dgram sockets #1001

Draft
wants to merge 277 commits into
base: criu-dev
Choose a base branch
from

Conversation

mihalicyn
Copy link
Member

This patchset adds support of bindmounted unix DGRAM sockets

Will be ready for review/updated after #992 will be merged

rst0git and others added 30 commits January 24, 2020 10:46
Drop the constants for default cache host/port and page size because
they are not used anywhere.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
1) fix sfle memory leak on get_fle_for_scm error
2) fix gfd open descriptor leak on get_fle_for_scm error
3-6) fix buf memory leak on read and pwrite errors

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Show error message when image-{cache,proxy} is called without --port
and image-proxy without --address argument.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
criu/sk-packet.c:443:3: error: 'strncpy' output may be truncated
copying 14 bytes from a string of length 15
   strncpy(addr_spkt.sa_data, req.ifr_name, sa_data_size);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
criu/img-remote.c:383:3: error: 'strncpy' specified bound 4096
equals destination size
   strncpy(snapshot_id, li->snapshot_id, PATHLEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
criu/img-remote.c:384:3: error: 'strncpy' specified bound 4096
equals destination size
   strncpy(path, li->name, PATHLEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
criu/files.c:288:3: error: 'strncpy' output may be truncated copying
4095 bytes from a string of length 4096
   strncpy(buf, link->name, PATH_MAX - 1);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
criu/sk-unix.c:239:36: error: '/' directive output may be truncated
writing 1 byte into a region of size between 0 and 4095
   snprintf(path, sizeof(path), ".%s/%s", dir, sk->name);
                                    ^
criu/sk-unix.c:239:3: note: 'snprintf' output 3 or more bytes
(assuming 4098) into a destination of size 4096
   snprintf(path, sizeof(path), ".%s/%s", dir, sk->name);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
criu/mount.c:2563:3: error: 'strncpy' specified bound 4096 equals
destination size
   strncpy(path, m->mountpoint, PATH_MAX);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
criu/cr-restore.c:3647:2: error: 'strncpy' specified bound 16 equals
destination size
  strncpy(task_args->comm, core->tc->comm, sizeof(task_args->comm));
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
CID 190778 (checkpoint-restore#1 of 1): Read from pointer after free (USE_AFTER_FREE)
7. deref_after_free: Dereferencing freed pointer rop.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Both functions check_pending_forwards() and check_pending_reads()
have very similar functionality. This patch aims to reduce the
duplication of code by merging both functions into check_pending()

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The macro PATHLEN is redundant. It is defined such that its replacement
token sequence is the token PATH_MAX.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Acked-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The function `setup_TCP_server_socket` (defined in img-remote.c) and
`setup_tcp_server` (defined in util.c) have very similar functionality.

Replace setup_TCP_server_socket() with setup_tcp_server() to reduce
code duplication and to enable IPv6 support for the image-cache action
of CRIU.

We set SO_REUSEADDR flag to allow reuse of local addresses.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Remove setup_TCP_client_socket() and use setup_tcp_server() instead as
both functions have very similar functionality.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Check only once if (sockfd < 0)

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When CRIU calls the ip tool on restore, it passes the fd of remote
socket by replacing the STDIN before execvp. The stdin is used by the
ip tool to receive input. However, the ip tool calls ftell(stdin)
which fails with "Illegal seek" since UNIX sockets do not support file
positioning operations. To resolve this issue, read the received
content from the UNIX socket and store it into temporary file, then
replace STDIN with the fd of this tmp file.

 # python test/zdtm.py run -t zdtm/static/env00 --remote -f ns
 === Run 1/1 ================ zdtm/static/env00

 ========================= Run zdtm/static/env00 in ns ==========================
 Start test
 ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
 Adding image cache
 Adding image proxy
 Run criu dump
 Run criu restore
 =[log]=> dump/zdtm/static/env00/31/1/restore.log
 ------------------------ grep Error ------------------------
 RTNETLINK answers: File exists
 (00.229895)      1: do_open_remote_image RDONLY path=route-9.img snapshot_id=dump/zdtm/static/env00/31/1
 (00.230316)      1: 	Running ip route restore
 Failed to restore: ftell: Illegal seek
 (00.232757)      1: Error (criu/util.c:712): exited, status=255
 (00.232777)      1: Error (criu/net.c:1479): IP tool failed on route restore
 (00.232803)      1: Error (criu/net.c:2153): Can't create net_ns
 (00.255091) Error (criu/cr-restore.c:1177): 105 killed by signal 9: Killed
 (00.255307) Error (criu/mount.c:2960): mnt: Can't remove the directory /tmp/.criu.mntns.dTd7ak: No such file or directory
 (00.255339) Error (criu/cr-restore.c:2119): Restoring FAILED.
 ------------------------ ERROR OVER ------------------------
 ################# Test zdtm/static/env00 FAIL at CRIU restore ##################
 ##################################### FAIL #####################################

Fixes checkpoint-restore#311

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In Py2 `range` returns a list and `xrange` creates a sequence object
that evaluates lazily. In Py3 `range` is equivalent to `xrange` in Py2.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
As discussed on the mailing list, current .py files formatting does not
conform to the world standard, so we should better reformat it. For this
the yapf tool is used. The command I used was

yapf -i $(find -name *.py)

Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Most of the typos were found by codespell.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This patch does not introduce any functional changes.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The function accept_proxy_to_cache() is a wrapper around accept().
Use accept() directly instead.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Reduce code duplication by introducing a strflags() macro which maps
a flag to corresponding string.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
This patch makes various private variables and procedures static.

These changes conform to the following code style conventions:

 When declaring pointer data or a function that returns a pointer type,
 the preferred use of ``*`` is adjacent to the data name or function
 name and not adjacent to the type name.

 Statements longer than 80 columns will be broken into sensible chunks,
 unless exceeding 80 columns significantly increases readability and
 does not hide information. Descendants are always substantially
 shorter than the parent and are placed substantially to the right.
 The same applies to function headers with a long argument list.

from https://www.kernel.org/doc/Documentation/process/coding-style.rst

The function declarations {send,recv}_image() from img-remote.h are
removed because they do not have a corresponding implementation.

The following functions are made static because they are not used
outside img-remote.c:
* {send,recv}_image_async()
* {read,write}_remote_header()
* socket_set_non_blocking()

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
There is no need to print an error message, __xalloc() does that.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
When an interrupt signal (even SIGWINCH when strace is used) is
received while epoll_wait() sleeps, it will return a value of
-1 and set errno to EINTR, which is not an error and should be
ignored.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
There is no need to pass the values for address and port as arguments
when creating a TCP server. The external `opts` object, which provides
opts.addr and opts.port, is accessible in all components that require
these values.

With this change, a value specified with the `--address` option will
used by image-cache in the same way as with page-server.

Example:
criu image-cache --address 127.0.0.1 --port 1234
criu page-server --address 127.0.0.1 --port 1234

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
The variable name 'remote_sk' is shorter than 'proxy_to_cache_fd' and
it is more similar to 'page_server_sk' (used in criu/page-xfer.c).

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
cyrillos and others added 18 commits March 24, 2020 16:23
Mount points might be beindmount to some resources (say unix binded
sockets) thus when times come to do real bind mount call we need
to prepare appropriate resource first.

On dump procedure we walk over all bind-mounts and check if
the mountpoint is a unix socket saving the mnt_id into
the image then. To distinguish such sockets from others
we use UNIX_UFLAGS__BINDMOUNT flag.

Note at moment we support only DGRAM closed sockets.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Because we need to gather unix sockets earlier than
we start creating mount tree. Thus we will be able
to handle bindmounted sockets.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
So when we will be examinating mount points and
sockets we won't waste time on non-bindmounted.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
We will need to take mutex when bind()
bindmounted sockets. Strictly speaking we
won't support bindmounted and deleted sockets
for now but better prepare this scaffolds early.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Some unix sockets might be bindmounted (say /dev/log
bound to another place). So to handle it we need to
change the logic we open such sockets especially
because we create mount tree earlier than we start
to restore files.

Thus here what we do:

 - on dump mark such sockets with UNIX_UFLAGS__BINDMOUNT
   flag so we would distinguish them on restore;

 - collect unix sockets before creating mount tree;
   note that at this moment we able to simply gather
   this sockets into own @unix_mnt_sockets list and
   nothing more because setting up the peers and such
   happens later in that named post action procedures;

 - when we need to create a bindmount point we enter
   into unix engine and figure out if there a socket
   to bindmount over; if found we pre-allocate the
   socketpair, bind it and save inside fdstore engine;
   using socketpair is important because later we need
   both peers to restore queued data;

 - finally when we start restoring files we simply
   fetch the socket from the fdstore and use it
   directly.

All this scheme is working simply because we support
dgram standalone sockets only, adding support for
streamed sockets requires a way more engine rework
and hopefully we won't need it in near future.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
To test a case where unix socket is bind mounted to somewhere
so restore may fail if socket has not been created.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Otherwise the mount call may fail since bindmounted unix socket
wont be created.

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
To elimitane compilation warnings with gcc-8.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
For better debug

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
No need to check for pair argument, it is rather confusing.

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Currently bind_unix_sk is used in two contexts: to bind freshly
created socket pairs and to bind name for sockets which are
to be queued into fdstore (binmounted sockets). For first
case we should notify the waiting side immediately but
in turn bindmount sockets are created early and there might
be the case where peers are not yet even opened and notification
may simply lost or even cause sigsegv since file list is yet
empty. In turn we should defer it until we do a real bindmount
socket opening right after we fetched it from the fdstore.

https://jira.sw.ru/browse/PSBM-88274

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
The unlink procedure is rather a cleanup before we
start creating new sockets, but bindmounted sockets
are pre-created early so we should not touch them.

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Once socket is bounded we should allow to connect to us
via relative name.

https://jira.sw.ru/browse/PSBM-88274

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Also full rework to make sure we can test the situation
where client is not yet opened when we're restoring
bindmount.

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Previously we always created sockets in root mount
namespace but in 019ebec
we've tried to resolve this problem setting up proper
mount namespace. This is not always possible though:
the root task from which we trying to switch ns might
already have a number of children with CLONE_FS|CLONE_FILES
set and kernel doesn't allow to do that.

Lets disable this ability until we find a proper solution.

https://jira.sw.ru/browse/PSBM-89126

Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
@avagin avagin requested a review from Snorch April 12, 2020 23:24
Copy link
Member

@Snorch Snorch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is yet a draft, so let's draft-review it =)

Please see in-line comments. upd: Also can you please merge bugfixes to the corresponding base patches?


pr_info("%d restore sndbuf %d rcv buf %d\n", sk, soe->so_sndbuf, soe->so_rcvbuf);

/* setsockopt() multiplies the input values by 2 */
ret |= userns_call(sk_setbufs, UNS_ASYNC, bufs, sizeof(bufs), sk);
return userns_call(sk_setbufs, UNS_ASYNC, bufs, sizeof(bufs), sk);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fine to use async call if we want to set queue size strictly before restoring the queue?


if (switch_ns(mi->nsid->ns_pid, &mnt_ns_desc, &ns_old) < 0) {
pr_err("Can't switch ns to mnt_id %d", mi->mnt_id);
if (restore_ns(ns_old, &mnt_ns_desc)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't need to restore_ns here if switch ns failed, we've just restored initial mntns in the end of previous iteration, and if switch_ns fails it does not actually do the switch.

if (restore_ns(ns_old, &mnt_ns_desc)) {
pr_err("Can't switch mount ns back from %d\n", mi->nsid->ns_pid);
return -1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also "ns_old = -1" after all restore_ns calls would be nice as the fd is closed, not good if we will accidentally reuse it.

return -1;
}

if (stat(mi->mountpoint, &st)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mountpoint is overmounted you can stat not what you need (some other file from overmount). So we either need to use open+fdinfo to check that we stat the file from right mount or just fail if mnt_is_overmounted(mi).

}
} else {
ue->mnt_id = d->mnt_id;
ue->has_mnt_id = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we loose this way the information about the original socket file mount id? Maybe we can add another field for it?

mi->mountpoint);
goto err;
}

return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hunk is supper strange, please fix it =) You better merge "unix: bindmount -- Move mounting code to be called before first mount" to this commit.

@github-actions
Copy link

github-actions bot commented Jan 9, 2021

A friendly reminder that this PR had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet