Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor criu plugin baseline v1 #2295

Open
wants to merge 3 commits into
base: criu-dev
Choose a base branch
from

Conversation

rerrabolu
Copy link
Contributor

No description provided.

…vices

Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
@rerrabolu
Copy link
Contributor Author

Could someone advise me as to what is blocking? The 3 tests that are failing are doing so in "build" phase. It is not clear how that is connected to my change. What are the next steps

@Snorch
Copy link
Member

Snorch commented Nov 10, 2023

# 1  DCO
Commit sha: [ff42a4f](https://github.com/checkpoint-restore/criu/pull/2295/commits/ff42a4f97f41849c45f5331dfa3c1b5840baa028), Author: Ramesh Errabolu, Committer: GitHub; The sign-off is missing.

# 2 CentOS Stream 8 based test
------------------------ grep Error ------------------------
b'(00.011876)     53: cg: setting cgns prefix to /system.slice/google-startup-scripts.service'
b'(00.011883)     53: cg: setting cgns prefix to /system.slice/google-startup-scripts.service'
b'(00.011890)     53: cg: setting cgns prefix to /system.slice/google-startup-scripts.service'
b'(00.011898)     53: cg: setting cgns prefix to /test'
b"(00.011905)     53: Error (criu/cgroup.c:1135): cg: Can't move 53 into zdtmtst//test/cgroup.procs (-1/-1): No such file or directory"
b"(00.011907)     53: Error (criu/cgroup.c:1191): cg: couldn't set cgns prefix zdtmtst//test/cgroup.procs: No such file or directory"
b'(00.011909)     53: Error (criu/cgroup.c:1282): cg: failed preparing cgns'
b'(00.012183) Error (criu/cr-restore.c:1513): 53 exited, status=1'
b'(00.012190) Error (criu/cr-restore.c:2557): Restoring FAILED.'
b'(00.016876) Error (criu/cgroup.c:1970): cg: cgroupd: recv req error: No such file or directory'
------------------------ ERROR OVER ------------------------
################ Test zdtm/static/cgroupns FAIL at CRIU restore ################

# 3 CentOS Stream 9 based test
gcc -g -Wall -Werror -D _GNU_SOURCE -shared -nostartfiles -fPIC -DCR_PLUGIN_DEFAULT="/usr/local/lib/criu" -I ../../compel/include/uapi amdgpu_plugin.c amdgpu_plugin_drm.c amdgpu_plugin_topology.c amdgpu_plugin_util.c criu-amdgpu.pb-c.c -o amdgpu_plugin.so -iquote../../include -iquote../../criu/include -iquote../../criu/arch/x86/include/ -iquote../../ -lpthread -lrt -ldrm -ldrm_amdgpu -I/usr/include/libdrm
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:52: multiple definition of `fd_next'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:69: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:54: multiple definition of `kfd_fw_version_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:50: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:55: multiple definition of `kfd_sdma_fw_version_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:52: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:56: multiple definition of `kfd_caches_count_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:54: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:57: multiple definition of `kfd_num_gws_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:56: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:58: multiple definition of `kfd_vram_size_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:58: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:59: multiple definition of `kfd_numa_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:60: first defined here
/usr/bin/ld: /tmp/ccrGAnOx.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_util.c:60: multiple definition of `kfd_capability_check'; /tmp/cczVlqFR.o:/tmp/criu/plugins/amdgpu/amdgpu_plugin_topology.c:62: first defined here
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:32: amdgpu_plugin.so] Error 1
make[2]: *** [Makefile:347: amdgpu_plugin] Error 2
make[2]: Leaving directory '/tmp/criu'
make[1]: *** [Makefile:4: run] Error 2
make[1]: Leaving directory '/tmp/criu/test/others/make'
+ cleanup_cgroup
+ ./test/zdtm_umount_cgroups 12456
make: *** [Makefile:2: local] Error 2
make: Leaving directory '/tmp/criu/scripts/ci'

# 4 Vagrant Fedora Rawhide based test 
------------------------ grep Error ------------------------
b'(00.031043)     59: net: \tRunning ip rule delete table local'
b'(00.034514)     59: net: \tRunning ip rule restore'
b'(00.051113)     59: net: \tRunning iptables-restore -w for iptables-restore -w'
b'(00.055342)     59: net: \tRunning ip6tables-restore -w for ip6tables-restore -w'
b'(00.060395)     59: Error (criu/libnetlink.c:54): -16 reported by netlink: Device or resource busy'
b"(00.060445)     59: Error (criu/util.c:1495): Can't wait or bad status: errno=0, status=65280"
b'(00.060792) Error (criu/cr-restore.c:2557): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
######### Test zdtm/static/socket-tcp-nfconntrack FAIL at CRIU restore #########

You can just enter details section of each check and see errors by yourself. 1 and 3 are obviously introduced by your code, so you should fix them. 4 is definitely unrelated to your code as I saw it in other PRs, 2 is likely unrelated - I triggered a rerun for it. I also approved all checks for your PR, so you may expect some more fails.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c474816) 70.55% compared to head (3ad74b3) 70.62%.

❗ Current head 3ad74b3 differs from pull request most recent head ff42a4f. Consider uploading reports for the commit ff42a4f to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2295      +/-   ##
============================================
+ Coverage     70.55%   70.62%   +0.06%     
============================================
  Files           132      133       +1     
  Lines         33508    33312     -196     
============================================
- Hits          23642    23525     -117     
+ Misses         9866     9787      -79     

see 21 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

plugins/amdgpu/amdgpu_plugin_topology.c Show resolved Hide resolved
plugins/amdgpu/amdgpu_plugin.c Show resolved Hide resolved
plugins/amdgpu/criu-amdgpu.proto Show resolved Hide resolved
plugins/amdgpu/amdgpu_plugin_util.h Show resolved Hide resolved
return ret;
}

void print_kfd_bo_stat(int bo_cnt, struct kfd_criu_bo_bucket *bo_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not used anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add this as a helper method that could be used in debugging if needed. Is this not allowed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with adding unused debugging functions upstream is that other people are not going to use them and test if they still work when new changes are being introduced. This makes it more difficult to maintain the code base and creates confusion for new contributors working on the project.

plugins/amdgpu/amdgpu_plugin_drm.c Show resolved Hide resolved
plugins/amdgpu/amdgpu_plugin_util.c Show resolved Hide resolved
plugins/amdgpu/amdgpu_plugin_util.c Show resolved Hide resolved
plugins/amdgpu/amdgpu_plugin_util.c Show resolved Hide resolved
plugins/amdgpu/amdgpu_plugin_util.c Show resolved Hide resolved
Copy link

A friendly reminder that this PR had no activity for 30 days.

{
dev_file_cnt--;
pr_info("\n");
pr_info("Sairam: %s(), Number of Checkpoints LEFT: %d\n", __func__, dev_file_cnt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all prints with Sairam:

@@ -68,6 +68,13 @@ struct vma_metadata {

extern int fd_next;

// FD of KFD device used to checkpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use consistent commenting style. I think rest of this file uses this style:

/**
 * Comments
 */


/* Helper macros to Checkpoint and Restore a ROCm file */
#define HSAKMT_SHM_PATH "/dev/shm/hsakmt_shared_mem"
#define HSAKMT_SHM "/hsakmt_shared_mem"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mis-alignment for the right-hand-sides here

return (dev_file_cnt == 0);
}

void decrement_checkpoint_count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this function in the other commit as it is used there

@@ -840,6 +840,9 @@ void topology_free(struct tp_system *sys)
list_del(&p2pgroup->listm_system);
xfree(p2pgroup);
}

/* Update Topology as being freed */
sys->parsed = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think parsed is used anywhere else?

@dayatsin-amd
Copy link
Contributor

Overall, the general idea/concept of this refactor is fine.

@github-actions github-actions bot removed the stale-pr label Dec 14, 2023
Copy link

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants