integrity check fails but no info provided on why #364

Open
dtenenba opened this issue Nov 12, 2020 · 2 comments

@dtenenba (Contributor)

A user ran a copy job followed by an integrity check job; the integrity check failed but provided no information about what failed.
This is what the user saw:

[screenshot: the integrity check job shown as FAILED, with a progress bar and color legend]

Also note that the color of "FAILED" and of the progress bar does not exactly match any of the colors in the legend, though it comes closest to "Missing MD5 Hash".

@dtenenba (Contributor, Author)

A bit more on this. I would expect to see two "rclone md5sum" commands in the log for this integrity check: one for the source and one for the destination.

I did find the one for the source:

[2020-11-12 12:25:22,551: INFO/ForkPoolWorker-7]  sudo -E -u ehatch /usr/local/bin/rclone --config=/dev/null md5sum /home/ehatch/hatchlab data/Leica SD_storage --exclude=\.snapshot/

In fact it is in there three times, because the user retried the integrity check several times.

However, there is no corresponding entry for the destination. I would expect to see something like this (I just constructed this from a similar line in the log):

[2020-11-09 18:50:35,336: INFO/ForkPoolWorker-5] RCLONE_CONFIG_SRC_TYPE='s3' RCLONE_CONFIG_SRC_REGION='us-west-2' RCLONE_CONFIG_SRC_ACCESS_KEY_ID='*******' RCLONE_CONFIG_SRC_SECRET_ACCESS_KEY='***' sudo -E -u ehatch /usr/local/bin/rclone --config=/dev/null md5sum src:/fh-pi-hatch-e/raw image data/Leica SD_storage

But such a line never occurs. So it seems that the rclone md5sum command for the destination was never run, and yet according to the screenshot above the integrity check did complete, with status FAILED.
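
(For anyone following along, my mental model of the integrity check is roughly the sketch below: run rclone md5sum on each side, parse the output, and compare. This is only an illustration, not the actual Motuz code, and the paths/remotes are placeholders.)

import subprocess

def md5sums(target, env=None):
    # Run "rclone md5sum" on one side and parse each "<md5>  <path>"
    # output line into a {path: md5} dict. For a remote target, env
    # would carry the RCLONE_CONFIG_* variables seen in the log.
    cmd = ["rclone", "--config=/dev/null", "md5sum", target]
    out = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    hashes = {}
    for line in out.stdout.splitlines():
        md5, _, path = line.partition("  ")
        hashes[path.strip()] = md5
    return hashes

src = md5sums("/path/to/local/source")      # placeholder path
dst = md5sums("dst:/bucket/destination")    # placeholder remote
different = sorted(p for p in src.keys() & dst.keys() if src[p] != dst[p])
missing = sorted(src.keys() - dst.keys())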

In looking through the logs for the ID of the hashsum job (527), I did find this:

[2020-11-10 16:26:58,671: ERROR/ForkPoolWorker-2] list modified during sort
Traceback (most recent call last):
  File "/app/src/backend/api/tasks/celery_tasks.py", line 124, in hashsum_job
    result_src = _hashsum_job_single(self, hashsum_job, side='src', start_time=start_time)
  File "/app/src/backend/api/tasks/celery_tasks.py", line 245, in _hashsum_job_single
    f'progress_{side}_tree': get_hashsum_tree(),
  File "/app/src/backend/api/tasks/celery_tasks.py", line 225, in get_hashsum_tree
    tree = generate_file_tree(files)
  File "/app/src/backend/api/utils/file_utils.py", line 24, in generate_file_tree
    data.sort(key=lambda d: d["Name"])
ValueError: list modified during sort
[2020-11-10 16:26:58,775: INFO/ForkPoolWorker-2] Sent notification email to ehatch@fredhutch.org
[2020-11-10 16:26:58,788: INFO/ForkPoolWorker-2] Task motuz.api.tasks.hashsum_job[527] succeeded in 2926.9183557303622s: {'error_text': 'list modified during sort'}

So that seems to indicate a problem. It's the same error we see in #360.
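
(Side note on that traceback: "ValueError: list modified during sort" is what CPython raises when the list handed to list.sort() is mutated while the sort is running, for example by another thread that is still appending parsed rclone output. A minimal reproduction, my own example rather than Motuz code:)

data = [{"Name": "b"}, {"Name": "a"}]

def key(d):
    data.append({"Name": "c"})   # mutating the list mid-sort...
    return d["Name"]

data.sort(key=key)               # ...raises ValueError: list modified during sort

If that is what happens inside generate_file_tree, sorting a snapshot copy of the list would avoid the crash, though it would only paper over the underlying race.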

Also, above that, I see entries like the following. I am not sure whether they are related to the same job, because log entries do not yet include the job they belong to (but I think we have a separate issue filed for that).

[2020-11-10 16:09:21,652: ERROR/ForkPoolWorker-15] ERROR : bamFiles/03-082C3_NORMAL.dedup.newRG.realigned.recal.bam: corrupted on transfer: sizes differ 20368000222 vs 16106127360
[2020-11-10 16:20:37,349: ERROR/ForkPoolWorker-1] ERROR : filtered/01_087_N.R1.fastq: corrupted on transfer: sizes differ 8848948331 vs 5368709120
[2020-11-10 16:20:44,190: ERROR/ForkPoolWorker-1] ERROR : filtered/01_087_T.R2.fastq: corrupted on transfer: sizes differ 7002110812 vs 0
[2020-11-10 16:22:52,588: ERROR/ForkPoolWorker-1] ERROR : filtered/02-065N.R1.fastq: corrupted on transfer: sizes differ 7887889900 vs 5368709120
[2020-11-10 16:26:15,657: ERROR/ForkPoolWorker-1] ERROR : filtered/02-065N.R2.fastq: corrupted on transfer: sizes differ 7887889900 vs 5368709120

I don't know whether that is the normal, expected output when md5sums don't match, or whether it indicates another problem.

@aicioara (Collaborator)

Thank you for filing this issue. I will look into it. The command in the log looks suspicious. Also, we have two statuses:

  • "FAILED" (like this job) means there was a problem with the checksum job itself
  • "DIFFERENT" status would mean that the md5sum job completed successfully, but the two entities did not match.

This particular error seems to indicate that the job itself failed. Checking...
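
(To make the distinction concrete for anyone reading along, a toy illustration with made-up names, not the actual Motuz status handling:)

def integrity_status(job_error, src_hashes, dst_hashes):
    if job_error is not None:
        return "FAILED"       # the checksum job itself broke (this issue)
    if src_hashes != dst_hashes:
        return "DIFFERENT"    # both sides hashed fine, but they disagree
    return "SUCCESS"          # placeholder name for the all-good status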
