ability to skip n-sig calculation #32687

Open
3 tasks done
wellsyw opened this issue Jan 9, 2024 · 4 comments

@wellsyw

wellsyw commented Jan 9, 2024

Checklist

  • I'm reporting a site feature request
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've searched the bugtracker for similar site feature requests including closed ones

Description

The n-sig calculation takes roughly 4-5 seconds each time for me (twice for each video), so solving it makes youtube-dl spend about 10 seconds of CPU time per URL. It would be nice to have an option to skip the calculation when it is not strictly necessary or, even better, automatic detection of whether it is needed. For instance, I'm fairly sure the --get-title, --get-duration and (possibly) -F options do not require solving the signature.

Below is a simple helper script that displays the channel IDs for YouTube URLs; it just spent 42 seconds getting the information for five videos. This is, of course, unacceptable performance.

#!/bin/sh
FORMAT='%(id)s %(channel_id)s %(uploader)s'

exec python youtube_dl/__main__.py \
	-o "$FORMAT" --get-filename \
	"$@"

Of course, making the calculation about 100 times faster would be preferred, but until that happens, the ability to skip it on demand would be an acceptable substitute.

@dirkf
Contributor

dirkf commented Jan 10, 2024

The problem is that the extractor code doesn't know whether it needs to calculate valid format URLs or not, since this is only known in the calling code (YoutubeDL.extract_info() in youtube_dl/YoutubeDL.py).

To achieve what you suggest with the current YT extraction scheme, we'd have to do one of these:

  1. define a method of YoutubeDL that the extractor code could call to determine whether valid format links are needed, or
  2. define a way of returning the format links as a continuation that is either evaluated by YoutubeDL.process_video_result() (say), or discarded, depending on whether valid format links are needed.
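
Option 2 could look roughly like the sketch below. Everything here is hypothetical: `LazyFormatURL`, `process_formats` and the `lazy_url` key are illustrative names, not existing youtube-dl code, and `process_formats` only stands in for whatever the caller would actually do.

```python
class LazyFormatURL:
    """Hypothetical wrapper that defers the expensive URL fix-up until needed."""

    def __init__(self, raw_url, resolver):
        self._raw_url = raw_url
        self._resolver = resolver  # e.g. a function that solves the n-sig
        self._resolved = None

    def resolve(self):
        # Run the expensive step at most once, on first demand.
        if self._resolved is None:
            self._resolved = self._resolver(self._raw_url)
        return self._resolved


def process_formats(formats, need_valid_urls):
    """Stand-in for the caller's decision; not the real YoutubeDL method."""
    processed = []
    for fmt in formats:
        fmt = dict(fmt)  # leave the extractor's dict untouched
        lazy = fmt.pop('lazy_url')
        if need_valid_urls:
            fmt['url'] = lazy.resolve()
        # Otherwise the continuation is simply discarded.
        processed.append(fmt)
    return processed
```

With `need_valid_urls=False` the resolver is never called, which is the case the --get-filename use above would fall into under this assumption.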

Or the extraction scheme could be modified so that a different YT client could be used, as in yt-dlp and as already used for age-gated videos. But that client gets fewer, if unthrottled, formats.

The n-sig processing is meant to cache results: when the same n-sig value is seen in a whole lot of formats for one video it should only be computed once.
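
As a minimal illustration of that caching idea (not the actual youtube-dl implementation; `make_cached_solver` and `solve` are hypothetical names), a solver can be wrapped so that each distinct n value is computed only once:

```python
def make_cached_solver(solve):
    """Wrap an expensive solver so each distinct n value is solved once."""
    cache = {}

    def cached(n_value):
        if n_value not in cache:
            cache[n_value] = solve(n_value)  # expensive call happens here only
        return cache[n_value]

    return cached
```

If every format of a video carries the same n value, the wrapped solver runs the expensive step once and serves the rest from the dictionary.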

I tried this command:

$ time python -m youtube_dl -o '%(id)s %(channel_id)s %(uploader)s' test:YouTube --get-filename

On this machine, low-spec by today's standards, distro Py3.11 seems to be almost 3x as fast as the miniconda Py2.7 (on another machine, distro Py2.7 is much closer to PPA Py3.9). Disabling n-sig processing cuts execution time from 9s to 3s with Py3 and 26s to 5s with Py2.

If someone were to try profiling the current code (we did that when the n-sig processing was first implemented), it might indicate some unsuspected hog.
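
One way to do that with the standard library, sketched for Python 3: `profile_call` is a hypothetical helper, and the function passed to it would be whatever entry point is being investigated.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats('cumulative').print_stats(15)
    return result, buf.getvalue()
```

Printing the returned report lists the 15 entries with the highest cumulative time, which is usually enough to spot an unexpected hog.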

One known execution time driver is that, whenever YT changes its player JS, we have to fetch that, a 2MB download, in addition to the bloated page and/or API JSON. This is still going to be a small part of the total run time with a typical modern internet connection.

@wellsyw
Author

wellsyw commented Jan 11, 2024

If I understood your response correctly, you say that:

  1. the code is decoupled enough that passing a command-line argument to the relevant code would be somewhat difficult to implement, and
  2. upgrading the Python version may or may not give a performance boost.

A third, hackish approach would be trivial to implement: an environment variable could easily be passed to the code and achieve the same effect, I guess.
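
A rough sketch of that idea follows. The variable name `YTDL_SKIP_NSIG` and both helper functions are made up for illustration; youtube-dl does not read any such variable.

```python
import os


def nsig_skip_requested(environ=None):
    """Return True if the (hypothetical) YTDL_SKIP_NSIG variable is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get('YTDL_SKIP_NSIG', '')
    return value.strip().lower() in ('1', 'true', 'yes')


def maybe_solve(url, solve, environ=None):
    """Leave the URL untouched when skipping is requested, else run the solver."""
    if nsig_skip_requested(environ):
        return url
    return solve(url)
```

The skipped URLs would presumably be throttled, so this only makes sense for runs that never download anything.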

For what it's worth, I have an Athlon II ("Rana" core), so it is not the newest thing around. Using YouTube with the new 'polymer' interface is often unbearably slow, so I mostly use youtube-dl to make up for it.

This is a digression, but the normal runtime for youtube-dl with --simulate is about 13 seconds for me, so the n-sig calculation takes ~70% of the runtime. I also found some videos where the runtime is much longer, 25-30 seconds or more per video. Judging from a debug print or two, that is not caused by the n-sig processing but by something that happens between the two instances of n-sig solving. I'll file a new bug for that.

@dirkf
Contributor

dirkf commented Jan 16, 2024

Please try #32695, or a new nightly build that incorporates it after I merge it.

@wellsyw
Author

wellsyw commented Jan 16, 2024

Well, I'll be. Three seconds or a bit under.

But, er, all downloads seem to be throttled now?

I spotted a &n=%3Cfunction+inner+at+0x808a01d70%3E& in the URL.
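
Decoding that parameter shows what ended up in the URL. This small check uses only the standard library; the query string is the fragment quoted above with the surrounding ampersands removed.

```python
from urllib.parse import parse_qs

# The n parameter exactly as it appeared in the URL.
query = 'n=%3Cfunction+inner+at+0x808a01d70%3E'

n_value = parse_qs(query)['n'][0]
print(n_value)  # <function inner at 0x808a01d70>
```

The decoded value looks like Python's string form of a function object rather than a solved n value, which would fit with the downloads being throttled.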
