Chunks show encoding a Fail status, but not any apparent problems #929

Open
KyleMaas opened this issue Dec 2, 2023 · 6 comments · May be fixed by #938

KyleMaas commented Dec 2, 2023

Describe the issue
Often, when transcoding to an MP4 profile, one of the chunks of an encoding comes back with a Fail status, yet the video seems to play fine if I pull the actual encoded file. The logs show no errors and look the same as those for successful chunks: Retries is 0, Status is Fail, and Total run time is also 0. But if I click on Encoding file, it plays just fine and everything looks okay. I also never see this with VP9 encoding.

To Reproduce
I'm not sure how to reproduce this. However, I did not see the problem until I moved to a server setup with much higher capacity but also higher disk latency. I've raised the Celery timeouts to allow each task to run considerably longer, far longer than successful chunks take to encode, but the issue persists.

Expected behavior
I would expect the chunk, which by all appearances did in fact encode successfully, to be marked as Success instead. What I find most odd is that Total run time is 0; even for a failure, that should presumably be filled in with some value.

Screenshots
Can't share screenshots of this setup unfortunately.

Environment (please complete the following information):

  • OS: Ubuntu Linux
  • Installation method: single server install



KyleMaas commented Dec 2, 2023

Aha, caught one in the act. From the Celery log:

[2023-12-02 15:06:34,881: ERROR/ForkPoolWorker-4] Task encode_media[91a5fb7e-4692-4c6f-b872-cfbd757d8985] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "(path redacted)/files/tasks.py", line 365, in encode_media
    encoding.media_file.save(content=myfile, name=output_name)
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/fields/files.py", line 99, in save
    self.instance.save()
  File "(path redacted)/files/models.py", line 1171, in save
    super(Encoding, self).save(*args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 814, in save
    self.save_base(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 892, in save_base
    post_save.send(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 176, in send
    return [
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "(path redacted)/files/models.py", line 1569, in encoding_file_save
    encoding.save()
  File "(path redacted)/files/models.py", line 1171, in save
    super(Encoding, self).save(*args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 814, in save
    self.save_base(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 892, in save_base
    post_save.send(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 176, in send
    return [
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "(path redacted)/files/models.py", line 1629, in encoding_file_save
    instance.media.post_encode_actions(encoding=instance, action="add")
  File "(path redacted)/files/models.py", line 643, in post_encode_actions
    tasks.create_hls(self.friendly_token)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/local.py", line 182, in __call__
    return self._get_current_object()(*a, **kw)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 761, in __protected_call__
    return orig(self, *args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/task.py", line 411, in __call__
    return self.run(*args, **kwargs)
  File "(path redacted)/files/tasks.py", line 445, in create_hls
    shutil.rmtree(output_dir)
  File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '(path redacted)/media_files/hls/5033c4cda075419cb11bd712b21a88c66Di5uhLjQ'
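
So the failure is the shutil.rmtree(output_dir) call in create_hls. A minimal guard for that call would look like this (a sketch only; it masks the symptom rather than fixing the underlying cause, and shutil.rmtree(output_dir, ignore_errors=True) would be a blunter alternative):

try:
    shutil.rmtree(output_dir)
except FileNotFoundError:
    # The directory was already removed, e.g. by a concurrent create_hls() run.
    pass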


KyleMaas commented Dec 2, 2023

I'm wondering at this point whether there are somehow two concurrent create_hls() calls going on at the same time, with one erasing the directory while the other is copying it.


KyleMaas commented Dec 3, 2023

It is most definitely running multiple instances of create_hls() concurrently. I added logging at the beginning and end of it, and also caught and ignored the FileNotFoundError when rmtree() runs, logging whenever that happens. Here's the result:

[2023-12-03 12:04:42,607: INFO/ForkPoolWorker-9] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:42,702: INFO/ForkPoolWorker-8] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:42,760: INFO/ForkPoolWorker-2] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:43,277: INFO/ForkPoolWorker-7] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:43,792: INFO/ForkPoolWorker-6] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,253: INFO/ForkPoolWorker-2] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,314: INFO/ForkPoolWorker-8] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,348: INFO/ForkPoolWorker-2] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,379: INFO/ForkPoolWorker-9] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,513: INFO/ForkPoolWorker-7] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,411: INFO/ForkPoolWorker-8] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,597: INFO/ForkPoolWorker-7] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,628: INFO/ForkPoolWorker-9] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:48,376: INFO/ForkPoolWorker-6] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:49,661: INFO/ForkPoolWorker-6] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:07:50,874: INFO/ForkPoolWorker-2] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:00,191: INFO/ForkPoolWorker-2] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:00,432: INFO/ForkPoolWorker-9] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:01,998: INFO/ForkPoolWorker-2] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:05,885: INFO/ForkPoolWorker-9] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:05,974: INFO/ForkPoolWorker-9] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:06,536: INFO/ForkPoolWorker-7] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:06,520: INFO/ForkPoolWorker-6] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,076: INFO/ForkPoolWorker-8] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,947: INFO/ForkPoolWorker-6] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,011: INFO/ForkPoolWorker-6] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,980: INFO/ForkPoolWorker-7] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,138: INFO/ForkPoolWorker-7] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,490: INFO/ForkPoolWorker-8] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,638: INFO/ForkPoolWorker-8] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,336: INFO/ForkPoolWorker-13] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,929: INFO/ForkPoolWorker-14] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,995: INFO/ForkPoolWorker-16] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,944: INFO/ForkPoolWorker-21] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:05,599: INFO/ForkPoolWorker-22] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:05,920: INFO/ForkPoolWorker-13] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,015: INFO/ForkPoolWorker-13] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,362: INFO/ForkPoolWorker-16] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,424: INFO/ForkPoolWorker-16] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,393: INFO/ForkPoolWorker-14] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,515: INFO/ForkPoolWorker-14] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,611: INFO/ForkPoolWorker-21] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,697: INFO/ForkPoolWorker-21] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:09,680: INFO/ForkPoolWorker-22] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:10,411: INFO/ForkPoolWorker-22] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
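
For reference, the instrumentation that produced the log above was along these lines (a sketch of what I added to create_hls in files/tasks.py; the elided parts and exact variable names are approximations):

import logging
import shutil

logger = logging.getLogger(__name__)

def create_hls(friendly_token):
    # ... media lookup and HLS generation elided ...
    logger.info("Starting HLS creation for media UID %s", media_uid)
    # ... copy generated files into place ...
    try:
        shutil.rmtree(output_dir)
    except FileNotFoundError:
        # Ignore, but record, a directory already removed by a concurrent run.
        logger.info("rmtree failure because file not found for media UID %s", media_uid)
    logger.info("Finished HLS creation for media UID %s", media_uid)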


KyleMaas commented Dec 5, 2023

Need to do some more debugging, but from reading through the code I think this assumption is not true:

# this should run only once!

KyleMaas commented:

So, I think what's happening is that a chunk gets to this point:

for f in chunks_paths:

If disk throughput is low or the video is very long, there is nothing preventing another encoding save() action from coming along and triggering another copy of this loop. Basically, any time an encoding is saved, the post_save handler rechecks whether all of the chunks are complete and, if so, triggers another recombine of all of the chunks. This is triggered not only when a chunk goes from in-progress to successful, but any time a chunk is saved during or after being marked successful, so any update can set it off. After the chunks are combined they are deleted, but if several of them are saved while the combine is running, you end up with multiple concurrent combine operations, and therefore multiple concurrent HLS creations when the combines finish. A simplified sketch of the race is below.
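
To make the shape of the problem concrete, here is a simplified sketch of the suspected race (illustrative only: the real receiver is encoding_file_save in files/models.py, the field names are approximations, and combine_chunks_and_create_hls is a hypothetical stand-in for the actual combine logic):

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Encoding)
def encoding_file_save(sender, instance, **kwargs):
    if instance.chunk and instance.media_file:
        chunks = Encoding.objects.filter(media=instance.media, chunk=True)
        # Race: this condition is true for *every* chunk save that lands after
        # the last chunk gets its media file, not just the first one, so
        # several workers can enter the combine step at the same time.
        if all(c.media_file for c in chunks):
            combine_chunks_and_create_hls(instance.media)  # hypothetical helper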

KyleMaas commented:

It's possible that changing these check loops:

for chunk in orig_chunks:

for chunk in chunks:

so that they verify each chunk is in a "success" state, rather than just that it has media, would prevent concurrent combines. A sketch of that check is below.
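
Roughly (the status field name and value here are assumptions based on the UI labels above):

chunks = Encoding.objects.filter(media=media, chunk=True)
# Require every chunk to be fully finished, not merely to have a media file,
# so that only the save which completes the final chunk can start a combine.
if all(c.status == "success" and c.media_file for c in chunks):
    ...  # safe to proceed with combining the chunk files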
