New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remote runner stuck #6282
Comments
I've modified the code of the runner to enable VAAPI hardware acceleration, this may be causing it. Ignore for now until I've isolated the bug further. |
Let it run overnight when it was stuck. Ctrl-Cing after 8+ hours it spat out an [01:37:02.442] INFO (1729443): Checking available jobs on https://pcbu.nl
[01:37:02.443] DEBUG (1729443): Requesting jobs on https://pcbu.nl for runner z97x
[01:37:02.635] INFO (1729443): [https://pcbu.nl] Processing job of type vod-hls-transcoding: 9529dd81-e4af-4c5f-8893-d2bcce2fa194
[01:37:02.637] INFO (1729443): Downloading input file https://pcbu.nl/api/v1/runners/jobs/9529dd81-e4af-4c5f-8893-d2bcce2fa194/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality for HLS transcoding job ptrjt-0dcc43c4-c422-46b3-8b87-aab030592c63
[01:37:02.637] DEBUG (1729443): Downloading file https://pcbu.nl/api/v1/runners/jobs/9529dd81-e4af-4c5f-8893-d2bcce2fa194/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality
[01:47:59.891] INFO (1729443): Downloaded input file https://pcbu.nl/api/v1/runners/jobs/9529dd81-e4af-4c5f-8893-d2bcce2fa194/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality for job ptrjt-0dcc43c4-c422-46b3-8b87-aab030592c63. Running HLS transcoding.
[01:47:59.892] DEBUG (1729443): Will run transcode.
[01:48:00.755] DEBUG (1729443): Apply ffmpeg params from libfdk_aac for audio stream of input /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/8d82c27b-d6b9-4d97-b776-570dc4b4f81b using default profile.
[01:48:00.756] DEBUG (1729443): Apply ffmpeg params from h264_vaapi for video stream of input /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/8d82c27b-d6b9-4d97-b776-570dc4b4f81b using default profile.
[01:48:00.761] DEBUG (1729443): ffmpeg command: ffmpeg -n 20 /usr/bin/ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -hwaccel_device /dev/dri/renderD128 -i /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/8d82c27b-d6b9-4d97-b776-570dc4b4f81b -y -acodec libfdk_aac -vcodec h264_vaapi -threads 2 -f mp4 -movflags faststart -max_muxing_queue_size 1024 -map_metadata -1 -vbr 5 -filter:a loudnorm=I=-13:LRA=20:TP=-2 -vf hwupload,scale_vaapi=w=-2:h=1080:format=nv12 -maxrate:v 6220877 -bufsize:v 12441754 -b_strategy 1 -bf 16 -qp 18 -r 30 -g:v 60 -hls_time 4 -hls_list_size 0 -hls_playlist_type vod -hls_segment_filename /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/f68005ce-9112-4622-9533-f7b7913c943d-1080-fragmented.mp4 -hls_segment_type fmp4 -f hls -hls_flags single_file /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/f68005ce-9112-4622-9533-f7b7913c943d-1080.m3u8
[02:18:12.175] DEBUG (1729443): FFmpeg command ended.
^C[13:24:37.518] INFO (1729443): Cleaning up after program exit
[13:24:37.580] ERROR (1729443): Expected status 204, got 400.
The server responded: "Job is not in "processing" state".
You may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs
err: {
"type": "Error",
"message": "Expected status 204, got 400. \nThe server responded: \"Job is not in \"processing\" state\".\nYou may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs",
"stack":
Error: Expected status 204, got 400.
The server responded: "Job is not in "processing" state".
You may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs
at buildRequest (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:75874:14)
at makePostBodyRequest (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:75816:10)
at RunnerJobsCommand.postBodyRequest (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:84785:12)
at RunnerJobsCommand.abort (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:85268:17)
at RunnerServer.onExit (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:90635:33)
at process.<anonymous> (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:90471:20)
at process.emit (node:events:519:28)
----
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:75877:19
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50823:17
at Test._assertFunction (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50809:17)
at Test.assert (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50699:27)
at localAssert (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50667:18)
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50671:11
at Request3.callback (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50240:7)
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50392:22
at IncomingMessage.<anonymous> (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:49643:11)
at IncomingMessage.emit (node:events:531:35)
"res": {
"req": {
"method": "POST",
"url": "https://pcbu.nl/api/v1/runners/jobs/281aeb0f-a03e-46c4-85e5-80a68d2b4f9e/abort",
"data": {
"reason": "Runner stopped",
"jobToken": "ptrjt-713f7a1f-c7d4-4ed1-b151-bda5acd11c42",
"runnerToken": "ptrt-5cabc452-26e8-4c8f-9c03-13d81a1da642"
},
"headers": {
"content-type": "application/json"
}
},
"header": {
"server": "nginx",
"date": "Thu, 21 Mar 2024 12:24:37 GMT",
"content-type": "application/problem+json; charset=utf-8",
"content-length": "262",
"connection": "close",
"x-powered-by": "PeerTube",
"tk": "N",
"access-control-allow-origin": "*",
"access-control-allow-credentials": "true",
"access-control-expose-headers": "Retry-After",
"x-ratelimit-limit": "50",
"x-ratelimit-remaining": "48",
"x-ratelimit-reset": "1711023883",
"etag": "W/\"106-IgZHWE3xH2agTCXd1tuwbcgGDlQ\"",
"x-protocol": "HTTP/1.1"
},
"status": 400,
"text": "{\"type\":\"https://docs.joinpeertube.org/api-rest-reference.html#section/Errors/runner_job_not_in_processing_state\",\"detail\":\"Job is not in \\\"processing\\\" state\",\"status\":400,\"code\":\"runner_job_not_in_processing_state\",\"error\":\"Job is not in \\\"processing\\\" state\"}"
}
}
13:24 ewout@z97x ~/workspace/personal/PeerTube% (git)-[feature/ffmpeg-hwaccel]
13:25 ewout@z97x ~/workspace/personal/PeerTube% DRI_PRIME=1 LIBVA_DRIVER_NAME=i965 node apps/peertube-runner/dist/peertube-runner.js --verbose server (git)-[feature/ffmpeg-hwaccel]
[13:25:15.512] INFO (2098655): Running PeerTube runner in server mode
[13:25:15.516] INFO (2098655): Using /home/ewout/.config/peertube-runner-nodejs/default/config.toml as configuration file
[13:25:15.536] INFO (2098655): Loading registered instance https://pcbu.nl
[13:25:15.540] INFO (2098655): IPC socket created on /home/ewout/.local/share/peertube-runner-nodejs/default/peertube-runner.sock
[13:25:15.640] INFO (2098655): Connected to https://pcbu.nl/runners socket
[13:25:17.205] INFO (2098655): Using /home/ewout/.cache/peertube-runner-nodejs/default/transcoding for transcoding directory
[13:25:17.205] INFO (2098655): Checking available jobs on https://pcbu.nl
[13:25:17.205] DEBUG (2098655): Requesting jobs on https://pcbu.nl for runner z97x
[13:25:17.364] INFO (2098655): [https://pcbu.nl] Processing job of type vod-hls-transcoding: f8daf235-a0ae-4635-9c41-ee6e3799e8dd
[13:25:17.367] INFO (2098655): Downloading input file https://pcbu.nl/api/v1/runners/jobs/f8daf235-a0ae-4635-9c41-ee6e3799e8dd/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality for HLS transcoding job ptrjt-354a2278-924d-4866-b03f-eac2d663d149
[13:25:17.368] DEBUG (2098655): Downloading file https://pcbu.nl/api/v1/runners/jobs/f8daf235-a0ae-4635-9c41-ee6e3799e8dd/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality
[13:25:19.870] INFO (2098655): Checking available jobs on https://pcbu.nl
[13:25:19.871] DEBUG (2098655): Requesting jobs on https://pcbu.nl for runner z97x
[13:25:20.183] INFO (2098655): [https://pcbu.nl] Processing job of type vod-hls-transcoding: 0c9cfddf-2c33-43d0-8323-da6d48db1002
[13:25:20.184] INFO (2098655): Downloading input file https://pcbu.nl/api/v1/runners/jobs/0c9cfddf-2c33-43d0-8323-da6d48db1002/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality for HLS transcoding job ptrjt-fd572096-4f97-410c-8c63-f970ce213419
[13:25:20.184] DEBUG (2098655): Downloading file https://pcbu.nl/api/v1/runners/jobs/0c9cfddf-2c33-43d0-8323-da6d48db1002/files/videos/05fa3c9a-cd71-4e7f-bbff-2fab27b9d96b/max-quality Screenshot of remote runner jobs list: |
Okay, after a day debugging it was caused by a Intel VAAPI bug and ffmpeg silently crashing. Switching to Mesa VAAPI (AMD) instantly fixed it. Apologies for false positive. |
Hello, Can you try with peertube runner 0.0.12? |
Instead of develop branch? |
04:42 ewout@z97x ~/workspace/personal/PeerTube% LIBVA_DRIVER_NAME=radeonsi node apps/peertube-runner/dist/peertube-runner.js --verbose server (git)-[feature/ffmpeg-hwaccel]
[04:42:50.281] INFO (320428): Running PeerTube runner in server mode
[04:42:50.287] INFO (320428): Using /home/ewout/.config/peertube-runner-nodejs/default/config.toml as configuration file
[04:42:50.309] INFO (320428): Loading registered instance https://pcbu.nl
[04:42:50.314] INFO (320428): IPC socket created on /home/ewout/.local/share/peertube-runner-nodejs/default/peertube-runner.sock
[04:42:50.401] INFO (320428): Connected to https://pcbu.nl/runners socket
[04:42:50.633] INFO (320428): Using /home/ewout/.cache/peertube-runner-nodejs/default/transcoding for transcoding directory
[04:42:50.633] INFO (320428): Checking available jobs on https://pcbu.nl
[04:42:50.634] DEBUG (320428): Requesting jobs on https://pcbu.nl for runner z97x
[04:42:50.784] INFO (320428): [https://pcbu.nl] Processing job of type vod-hls-transcoding: 6255d3d1-fbc1-49da-a5ee-16879de0eb8d
[04:42:50.787] INFO (320428): Downloading input file https://pcbu.nl/api/v1/runners/jobs/6255d3d1-fbc1-49da-a5ee-16879de0eb8d/files/videos/402cb23f-0671-4506-8f00-63b0cc47ea4f/max-quality for HLS transcoding job ptrjt-de1ee6a6-ceee-4b9f-8844-4d16315849b0
[04:42:50.788] DEBUG (320428): Downloading file https://pcbu.nl/api/v1/runners/jobs/6255d3d1-fbc1-49da-a5ee-16879de0eb8d/files/videos/402cb23f-0671-4506-8f00-63b0cc47ea4f/max-quality
[04:48:54.820] INFO (320428): Downloaded input file https://pcbu.nl/api/v1/runners/jobs/6255d3d1-fbc1-49da-a5ee-16879de0eb8d/files/videos/402cb23f-0671-4506-8f00-63b0cc47ea4f/max-quality for job ptrjt-de1ee6a6-ceee-4b9f-8844-4d16315849b0. Running HLS transcoding.
[04:48:54.821] DEBUG (320428): Will run transcode.
[04:48:56.154] INFO (320428): Built supported ffmpeg encoders.
[04:48:56.155] DEBUG (320428): Apply ffmpeg params from libfdk_aac for audio stream of input /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/e5bd5811-afc8-48ff-9da8-bab4cece61be using default profile.
[04:48:56.155] DEBUG (320428): Apply ffmpeg params from h264_vaapi for video stream of input /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/e5bd5811-afc8-48ff-9da8-bab4cece61be using default profile.
[04:48:56.512] DEBUG (320428): ffmpeg command: ffmpeg -n 20 /usr/bin/ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -hwaccel_device /dev/dri/renderD129 -i /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/e5bd5811-afc8-48ff-9da8-bab4cece61be -y -acodec libfdk_aac -vcodec h264_vaapi -threads 2 -f mp4 -movflags +faststart -max_muxing_queue_size 1024 -map_metadata -1 -vbr 5 -af loudnorm=I=-13:LRA=20:TP=-2 -vf hwupload,scale_vaapi=w=1920:h=1080:mode=hq:format=nv12 -qp 18 -r 30 -g:v 60 -hls_time 4 -hls_list_size 0 -hls_playlist_type vod -hls_segment_filename /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/155fbea0-4c04-4478-ba25-33676a791962-1080-fragmented.mp4 -hls_segment_type fmp4 -f hls -hls_flags single_file /home/ewout/.cache/peertube-runner-nodejs/default/transcoding/155fbea0-4c04-4478-ba25-33676a791962-1080.m3u8
[05:51:46.234] DEBUG (320428): FFmpeg command ended.
^C[14:56:21.026] INFO (320428): Cleaning up after program exit
[14:56:21.111] ERROR (320428): Expected status 204, got 400.
The server responded: "Job is not in "processing" state".
You may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs
err: {
"type": "Error",
"message": "Expected status 204, got 400. \nThe server responded: \"Job is not in \"processing\" state\".\nYou may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs",
"stack":
Error: Expected status 204, got 400.
The server responded: "Job is not in "processing" state".
You may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs
at buildRequest (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:75874:14)
at makePostBodyRequest (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:75816:10)
at RunnerJobsCommand.postBodyRequest (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:84785:12)
at RunnerJobsCommand.abort (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:85268:17)
at RunnerServer.onExit (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:90650:33)
at process.<anonymous> (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:90486:20)
at process.emit (node:events:519:28)
----
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:75877:19
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50823:17
at Test._assertFunction (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50809:17)
at Test.assert (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50699:27)
at localAssert (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50667:18)
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50671:11
at Request3.callback (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50240:7)
at file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:50392:22
at IncomingMessage.<anonymous> (file:///home/ewout/workspace/personal/PeerTube/apps/peertube-runner/dist/peertube-runner.js:49643:11)
at IncomingMessage.emit (node:events:531:35)
"res": {
"req": {
"method": "POST",
"url": "https://pcbu.nl/api/v1/runners/jobs/6255d3d1-fbc1-49da-a5ee-16879de0eb8d/abort",
"data": {
"reason": "Runner stopped",
"jobToken": "ptrjt-de1ee6a6-ceee-4b9f-8844-4d16315849b0",
"runnerToken": "ptrt-5cabc452-26e8-4c8f-9c03-13d81a1da642"
},
"headers": {
"content-type": "application/json"
}
},
"header": {
"server": "nginx",
"date": "Mon, 25 Mar 2024 13:56:21 GMT",
"content-type": "application/problem+json; charset=utf-8",
"content-length": "262",
"connection": "close",
"x-powered-by": "PeerTube",
"tk": "N",
"access-control-allow-origin": "*",
"access-control-allow-credentials": "true",
"access-control-expose-headers": "Retry-After",
"x-ratelimit-limit": "50",
"x-ratelimit-remaining": "25",
"x-ratelimit-reset": "1711374985",
"etag": "W/\"106-IgZHWE3xH2agTCXd1tuwbcgGDlQ\"",
"x-protocol": "HTTP/1.1"
},
"status": 400,
"text": "{\"type\":\"https://docs.joinpeertube.org/api-rest-reference.html#section/Errors/runner_job_not_in_processing_state\",\"detail\":\"Job is not in \\\"processing\\\" state\",\"status\":400,\"code\":\"runner_job_not_in_processing_state\",\"error\":\"Job is not in \\\"processing\\\" state\"}"
}
}
14:56 ewout@z97x ~/workspace/personal/PeerTube% |
Can you try with the official runner release, and so without modified ffmpeg command? |
I don't see how that'd solve the issue, given the From the logs above: [05:51:46.234] DEBUG (320428): FFmpeg command ended. I'll try to debug the server more using the instructions specified here. As this leads me to believe the fault is there and it has something to do with the job scheduling being out of sync: [14:20:38.894] ERROR (27914): Expected status 204, got 400.
The server responded: "Job is not in "processing" state".
You may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs
err: {
"type": "Error",
"message": "Expected status 204, got 400. \nThe server responded: \"Job is not in \"processing\" state\".\nYou may take a closer look at the logs. To see how to do so, check out this page: https://github.com/Chocobozzz/PeerTube/blob/develop/support/doc/development/tests.md#debug-server-logs", |
I don't know modifications you made so I prefer to try to debug your issue with the official runner first. After the |
Issue should be fixed by 93b09bf included in runner v0.0.16 :) |
I had something that looks a lot like this bug, on my standard (no ffmpeg change) remote-runner running on 0.0.13. I have 6 runners connected to ~10 peertubes, and the runners get stuck pretty quickly, at least once a day, even when they have nothing to do. Will look at 0.0.16 and if it's still stuck, i'll open a new bug, thanks Chocobozzz! |
I just updated 3 of my 6 runners to 0.0.16, will tell you if they get stucked again or not ;) |
1 of the 3 non-updated has been stucked :/ looks fixed to me |
Not fixed, ran into the same behavior as previously posted here (the 204 expected, 400 received error on Ctrl-C and it not fetching new jobs for over 8+ hours). Tracked the bug down I think in PR #6318.
Edit: nevermind, got stuck after the second again.. It may be a relevant detail that i'm running the runner with a concurrency of 1 btw. |
I think I narrowed it down: during debugging I observed the large upload (>3GB) (after transcode/encode) to PeerTube never resolving the result promise. It seems to be a upstream issue as others reported the same thing ladjs/supertest#727. Now rewriting parts of the code to utilize node-fetch instead to see if that fixes the issue. |
Describe the current behavior
My PeerTube instance has been configured to remotely transcode everything, however, it gets stuck.
Steps to reproduce
Describe the expected behavior
The runner not getting stuck.
Additional information
PeerTube instance:
Browser name, version and platforms on which you could reproduce the bug:
Link to browser console log if relevant:
Link to server log if relevant (
journalctl
or/var/www/peertube/storage/logs/
):The text was updated successfully, but these errors were encountered: