[unige] Add extractor for UniGE mediaserver #32596
base: master
Conversation
This enables extracting videos posted to the University of Geneva mediaserver, located at https://mediaserver.unige.ch. It supports:
- Single courses/videos
- Whole course "playlists" (accessible by clicking on the course's title)

Signed-off-by: Frank Villaro-Dixon <frank@villaro-dixon.eu>
Thanks for your work! Before progressing this PR (and sorry for the slow response), I'd like to check that content is generally accessible outside the University. If not, it would be better to serve a local yt-dl fork for UniGe users.
Thanks for your reply!
It depends on the videos. Some of them are public, and others need authentication.
Cheers
On December 2, 2023, 2:33:31 PM GMT+01:00, dirkf wrote:
> Thanks for your work! Before progressing this PR (and sorry for the slow response), I'd like to check that content is generally accessible outside the University. If not, it would be better to serve a local yt-dl fork for UniGe users.
Thanks for your work!
I've made a few comments and suggestions. Have a look.
Also, I forgot to say before: please paste a completed Site Support issue template into the PR: it's where you affirm that the site meets the conditions to be supported by yt-dl (spoiler: I expect that it does).
```python
    urlencode_postdata,
)

from youtube_dl.compat import (
```
Prefer:

```diff
-from youtube_dl.compat import (
+from ..compat import (
```
```python
        self.raise_login_required('You need a username/pwd to access this video')

        try:
            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
```
None of this fancy f-string stuff here, please (breaks Python 2)! The line could be outside the `try` block.

```diff
-            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
+            secure_wp = 'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id)
```
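As a quick sanity check on the suggested replacement, here is a minimal standalone sketch (sample `video_id` value; not extractor code) showing that `str.format()` with a repeated `{0}` placeholder builds the same URL the f-string produced, while staying Python 2-compatible:

```python
# Standalone sketch: str.format() with a repeated {0} placeholder
# interpolates the same value twice, matching the f-string output.
def build_secure_url(video_id):
    return 'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id)

print(build_secure_url('196613'))
# https://mediaserver.unige.ch/proxy/196613/secure.php?view=play&id=196613
```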
```python
        try:
            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
            self._download_webpage(
                secure_wp, None, 'Logging in',
                data=urlencode_postdata({
                    'httpd_username': username,
                    'httpd_password': password,
                }), headers={
                    'Referer': secure_wp,
                    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                })
        except ExtractorError as e:
            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
                raise ExtractorError(
                    'Unable to login: incorrect username and/or password',
                    expected=True)
            raise
```
Although this should work, there is another pattern that you may find simpler (at least, it should be equivalent, but I didn't test this specific code). Also, as we're not reading the page, use `_request_webpage()` instead of `_download_webpage()`. Like so:
```diff
-        try:
-            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
-            self._download_webpage(
-                secure_wp, None, 'Logging in',
-                data=urlencode_postdata({
-                    'httpd_username': username,
-                    'httpd_password': password,
-                }), headers={
-                    'Referer': secure_wp,
-                    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
-                })
-        except ExtractorError as e:
-            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
-                raise ExtractorError(
-                    'Unable to login: incorrect username and/or password',
-                    expected=True)
-            raise
+        secure_wp = 'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id)
+        urlh = self._request_webpage(
+            secure_wp, None, 'Logging in',
+            data=urlencode_postdata({
+                'httpd_username': username,
+                'httpd_password': password,
+            }), headers={
+                'Referer': secure_wp,
+                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
+            }, expected_status=400)
+        if urlh.getcode() == 400:
+            raise ExtractorError(
+                'Unable to login: incorrect username and/or password',
+                expected=True)
```
But actually, this is being done twice, so it should be pulled out:
```diff
-        try:
-            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
-            self._download_webpage(
-                secure_wp, None, 'Logging in',
-                data=urlencode_postdata({
-                    'httpd_username': username,
-                    'httpd_password': password,
-                }), headers={
-                    'Referer': secure_wp,
-                    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
-                })
-        except ExtractorError as e:
-            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
-                raise ExtractorError(
-                    'Unable to login: incorrect username and/or password',
-                    expected=True)
-            raise
+        ret_code = self._check_login(
+            video_id, 'Logging in',
+            data=urlencode_postdata({
+                'httpd_username': username,
+                'httpd_password': password,
+            }), headers={
+                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
+            }, expected_status=400)
+        if ret_code == 400:
+            raise ExtractorError(
+                'Unable to login: incorrect username and/or password',
+                expected=True)
```
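To illustrate why the `expected_status` pattern reads more simply than catching the exception, here is a toy model using hypothetical stubs (`FakeResponse`, `fake_request` are made up for illustration; this is not yt-dl code): statuses listed as expected are returned for inspection instead of being raised.

```python
# Toy model of the expected_status pattern: the request helper returns
# a response whose status the caller inspects, rather than raising for
# the one error code the caller anticipates.
class FakeResponse(object):
    def __init__(self, code):
        self._code = code

    def getcode(self):
        return self._code


def fake_request(status, expected_status=None):
    # An anticipated status is returned normally; other errors raise.
    if status >= 400 and status != expected_status:
        raise IOError('HTTP Error %d' % status)
    return FakeResponse(status)


# No try/except needed for the anticipated 400 response:
urlh = fake_request(400, expected_status=400)
login_failed = urlh.getcode() == 400
print(login_failed)
# True
```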
```python
        'url': 'https://mediaserver.unige.ch/proxy/196613/VN3-2569-2023-2024-09-19.mp4',
        'only_matching': True,
    }]
```
This suggestion from below can be pulled out. `kwargs` are going to match those of `_request_webpage()`, though maybe `expected_status` could be made an explicit argument:

```python
    def _check_login(self, video_id, note, **kwargs):
        urlh = self._request_webpage(
            'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id),
            None, note, **kwargs)
        return urlh.getcode()
```
```python
        try:
            # This dumb download only checks if we need to login, as authentication
            # is unique (and sometimes optional) for each video
            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
            self._download_webpage(secure_wp, f'secure_{video_id}')
        except ExtractorError as e:
            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 401:
                self._login(video_id)
            else:
                # The video doesn't require login
                pass
```
Using the proposed new method:

```diff
-        try:
-            # This dumb download only checks if we need to login, as authentication
-            # is unique (and sometimes optional) for each video
-            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
-            self._download_webpage(secure_wp, f'secure_{video_id}')
-        except ExtractorError as e:
-            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 401:
-                self._login(video_id)
-            else:
-                # The video doesn't require login
-                pass
+        # This dumb download only checks if we need to login, as authentication
+        # is unique (and sometimes optional) for each video
+        if self._check_login(video_id, 'secure_{0}'.format(video_id), expected_status=401) == 401:
+            self._login(video_id)
+        else:
+            # The video doesn't require login
+            pass
```
```python
        course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
        course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')
```
Several points here:
- the extraction shouldn't crash if these optional fields are missing
- the regexes could be improved
- since 1e8ccdd, we can search for two things at once, like in yt-dlp.

```diff
-        course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
-        course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')
+        course_title, course_id = self._search_regex(
+            r'''<a\s[^>]*\bhref\s*=\s*('|")?/collection/(?P<courseid>[+\w-]+)(?(1)\1|(?![+-])\b)[^>]*>(?P<course>.*)</a>''',
+            webpage, 'course details', group=('course', 'courseid'), default=(None, None))
```
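A quick standalone check of the two-group pattern, with plain `re` standing in for the extractor helper (the HTML snippet and collection id are made-up samples):

```python
import re

# Plain-re sketch of the suggested pattern pulling both named groups
# from a single match; clean_html() post-processing is left out here.
COURSE_RE = r'''<a\s[^>]*\bhref\s*=\s*('|")?/collection/(?P<courseid>[+\w-]+)(?(1)\1|(?![+-])\b)[^>]*>(?P<course>.*)</a>'''

webpage = '<div class="title"><a href="/collection/VN3-2569">Thermodynamique</a></div>'
m = re.search(COURSE_RE, webpage)
course_title, course_id = (m.group('course'), m.group('courseid')) if m else (None, None)
print(course_title, course_id)
# Thermodynamique VN3-2569
```

The `(?(1)\1|…)` conditional makes the closing quote mandatory only when an opening quote was matched, so both quoted and unquoted `href` values are handled.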
```python
        return {
            'id': video_id,
            'title': video_title,
            'url': video_url,
            'channel': course_title,
            'channel_id': course_id,
        }
```
If you've set an `info` dict according to the suggestion above:

```diff
-        return {
-            'id': video_id,
-            'title': video_title,
-            'url': video_url,
-            'channel': course_title,
-            'channel_id': course_id,
-        }
+        info.update({
+            'id': video_id,
+            'title': video_title,
+            'channel': course_title,
+            'channel_id': course_id,
+        })
+        return info
```
```python
            'id': video_id,
            'title': video_title,
            'url': video_url,
            'channel': course_title,
```
If this was extracted with `_search_regex()` according to the suggestion above:

```diff
-            'channel': course_title,
+            'channel': clean_html(course_title),
```
```python
                # The video doesn't require login
                pass

        video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
```
Maybe better to allow matching `\n` in the heading text and rely on `clean_html()`, called by the search method, to strip unwanted characters.

```diff
-        video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
+        video_title = self._html_search_regex(r'(?s)<h1>(.+?)</h1>', webpage, 'title')
```

Although the above strips white space at the start and end of the match, this does it explicitly and crashes if the putative title would be empty:

```diff
-        video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
+        video_title = self._html_search_regex(r'(?s)<h1>\s*(\S.*?)\s*</h1>', webpage, 'title')
```
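A standalone check of the two regex variants, with plain `re` in place of `_html_search_regex` and a made-up heading:

```python
import re

# Plain-re sketch: with (?s), '.' also matches newlines, so a heading
# split across lines is captured without webpage.replace('\n', '');
# the \S.*? variant additionally trims surrounding whitespace and
# refuses to match an empty title.
webpage = '<h1>\n  Analyse I\n</h1>'

loose = re.search(r'(?s)<h1>(.+?)</h1>', webpage).group(1)
strict = re.search(r'(?s)<h1>\s*(\S.*?)\s*</h1>', webpage).group(1)

print(repr(loose))   # still includes the surrounding newlines/indent
print(repr(strict))  # 'Analyse I'
```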
```python
class UnigeIE(InfoExtractor):
    _VALID_URL = r'https://mediaserver.unige.ch/play/(?P<id>\d+)'
    _TESTS = [{
        'url': 'https://mediaserver.unige.ch/play/196613',
```
I hope this is a public video?
Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
This enables extracting videos posted to the University of Geneva mediaserver, located at https://mediaserver.unige.ch. It supports:
- Single courses/videos
- Whole course "playlists" (accessible by clicking on the course's title)
Thanks and cheers!