Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[unige] Add extractor for UniGE mediaserver #32596

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Frankkkkk
Copy link

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

This enables to extract videos posted to the University of Geneva mediaserver, located at https://mediaserver.unige.ch. It supports:

  • Single courses/videos
  • Whole course "playlists" (accessible by clicking on the course's title)

Thanks and cheers !

This enables to extract videos posted to the University of Geneva mediaserver,
located at https://mediaserver.unige.ch. It supports:
- Single courses/videos
- Whole course "playlists" (accessible by clicking on the course's
  title)

Signed-off-by: Frank Villaro-Dixon <frank@villaro-dixon.eu>
@dirkf
Copy link
Contributor

dirkf commented Dec 2, 2023

Thanks for your work!

Before progressing this PR (and sorry for the slow response), I'd like to check that content is generally accessible outside the University. If not, it would be better to serve a local yt-dl fork for UniGe users.

@Frankkkkk
Copy link
Author

Frankkkkk commented Dec 2, 2023 via email

Copy link
Contributor

@dirkf dirkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work!

I've made a few comments and suggestions. Have a look.

Also, I forgot to say before: please paste a completed Site Support issue template into the PR: it's where you affirm that the site meets the conditions to be supported by yt-dl (spoiler: I expect that it does).

urlencode_postdata,
)

from youtube_dl.compat import (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer:

Suggested change
from youtube_dl.compat import (
from ..compat import (

self.raise_login_required('You need a username/pwd to access this video')

try:
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this fancy f-string stuff here, please (breaks Python 2)! The line could be outside the try block.

Suggested change
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
secure_wp = 'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id)

Comment on lines +37 to +53
try:
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
self._download_webpage(
secure_wp, None, 'Logging in',
data=urlencode_postdata({
'httpd_username': username,
'httpd_password': password,
}), headers={
'Referer': secure_wp,
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
})
except ExtractorError as e:
if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
raise ExtractorError(
'Unable to login: incorrect username and/or password',
expected=True)
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this should work there is another pattern that you may find simpler (at least, it should be equivalent, but I didn't test this specific code).

Also, as we're not reading the page, use _request_webpage() instead of _download_webpage_handle().

Like so:

Suggested change
try:
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
self._download_webpage(
secure_wp, None, 'Logging in',
data=urlencode_postdata({
'httpd_username': username,
'httpd_password': password,
}), headers={
'Referer': secure_wp,
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
})
except ExtractorError as e:
if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
raise ExtractorError(
'Unable to login: incorrect username and/or password',
expected=True)
raise
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
_, urlh = self._request_webpage(
secure_wp, None, 'Logging in',
data=urlencode_postdata({
'httpd_username': username,
'httpd_password': password,
}), headers={
'Referer': secure_wp,
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}, expected_status=400)
if urlh.getcode() == 400:
raise ExtractorError(
'Unable to login: incorrect username and/or password',
expected=True)

But actually, this is being done twice, so it should be pulled out:

Suggested change
try:
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
self._download_webpage(
secure_wp, None, 'Logging in',
data=urlencode_postdata({
'httpd_username': username,
'httpd_password': password,
}), headers={
'Referer': secure_wp,
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
})
except ExtractorError as e:
if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
raise ExtractorError(
'Unable to login: incorrect username and/or password',
expected=True)
raise
ret_code = self._check_login(
video_id, 'Logging in',
data=urlencode_postdata({
'httpd_username': username,
'httpd_password': password,
}), headers={
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}, expected_status=400)
if ret_code == 400:
raise ExtractorError(
'Unable to login: incorrect username and/or password',
expected=True)

'url': 'https://mediaserver.unige.ch/proxy/196613/VN3-2569-2023-2024-09-19.mp4',
'only_matching': True,
}]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion from below can be pulled out. kwargs are going to match those of _request_webpage(), though maybe expected_status could made an explicit argument:

Suggested change
def _check_login(self, video_id, note, **kwargs):
_, urlh = self._request_webpage(
'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id),
None, note, **kwargs)
return urlh.getcode()

Comment on lines +59 to +69
try:
# This dumb download only checks if we need to login, as authentication
# is unique (and sometimes optional) for each video
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
self._download_webpage(secure_wp, f'secure_{video_id}')
except ExtractorError as e:
if isinstance(e.cause, compat_HTTPError) and e.cause.code == 401:
self._login(video_id)
else:
# The video doesn't require login
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the proposed new method:

Suggested change
try:
# This dumb download only checks if we need to login, as authentication
# is unique (and sometimes optional) for each video
secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
self._download_webpage(secure_wp, f'secure_{video_id}')
except ExtractorError as e:
if isinstance(e.cause, compat_HTTPError) and e.cause.code == 401:
self._login(video_id)
else:
# The video doesn't require login
pass
try:
# This dumb download only checks if we need to login, as authentication
# is unique (and sometimes optional) for each video
if self._check_login(video_id, 'secure_{0}'.format(video_id), 401) == 401:
self._login(video_id)
else:
# The video doesn't require login
pass

Comment on lines +72 to +73
course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several points here:

  • the extraction shouldn't crash if these optional fields are missing
  • the regexes could be improved
  • since 1e8ccdd, we can search for two things at once, like in yt-dlp.
Suggested change
course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')
course_title, course_id = self._search_regex(
r'''<a\s[^>]*\bhref\s*=\s*('|")?/collection/(?P<courseid>[+\w-]+)(?(1)\1|(?![+-])\b)[^>]*>(?P<course>.*)</a>''',
webpage, 'course details', group=('course', 'courseid'), default=(None, None))

Comment on lines +78 to +84
return {
'id': video_id,
'title': video_title,
'url': video_url,
'channel': course_title,
'channel_id': course_id,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you've set an info according to the suggestion above:

Suggested change
return {
'id': video_id,
'title': video_title,
'url': video_url,
'channel': course_title,
'channel_id': course_id,
}
info.update({
'id': video_id,
'title': video_title,
'channel': course_title,
'channel_id': course_id,
})
return info

'id': video_id,
'title': video_title,
'url': video_url,
'channel': course_title,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was extracted with _search_regex() according to the suggestion above:

Suggested change
'channel': course_title,
'channel': clean_html(course_title),

# The video doesn't require login
pass

video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to allow matching \n in the heading text and rely on clean_html() called by the search method to strip unwanted characters.

Suggested change
video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
video_title = self._html_search_regex(r'(?s)<h1>(.+?)</h1>, 'title')

Although the above strips white space at the start and end of the match, this does it explicitly and crashes if the putative title would be empty:

Suggested change
video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
video_title = self._html_search_regex(r'(?s)<h1>\s*(\S.*?)\s*</h1>, 'title')

class UnigeIE(InfoExtractor):
_VALID_URL = r'https://mediaserver.unige.ch/play/(?P<id>\d+)'
_TESTS = [{
'url': 'https://mediaserver.unige.ch/play/196613',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this is a public video?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants