[unige] Add extractor for UniGE mediaserver #32596

Frankkkkk · 2023-10-12T23:05:58Z

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

This enables to extract videos posted to the University of Geneva mediaserver, located at https://mediaserver.unige.ch. It supports:

Single courses/videos
Whole course "playlists" (accessible by clicking on the course's title)

Thanks and cheers !

This enables to extract videos posted to the University of Geneva mediaserver, located at https://mediaserver.unige.ch. It supports: - Single courses/videos - Whole course "playlists" (accessible by clicking on the course's title) Signed-off-by: Frank Villaro-Dixon <frank@villaro-dixon.eu>

dirkf · 2023-12-02T13:33:20Z

Thanks for your work!

Before progressing this PR (and sorry for the slow response), I'd like to check that content is generally accessible outside the University. If not, it would be better to serve a local yt-dl fork for UniGe users.

Frankkkkk · 2023-12-02T13:51:34Z

Thanks for your reply ! It depends on the videos. Some of them are public, and others need authentication. Cheers

…

On December 2, 2023 2:33:31 PM GMT+01:00, dirkf ***@***.***> wrote: Thanks for your work! Before progressing this PR (and sorry for the slow response), I'd like to check that content is generally accessible outside the University. If not, it would be better to serve a local yt-dl fork for UniGe users. -- Reply to this email directly or view it on GitHub: #32596 (comment) You are receiving this because you authored the thread. Message ID: ***@***.***>

dirkf

Thanks for your work!

I've made a few comments and suggestions. Have a look.

Also, I forgot to say before: please paste a completed Site Support issue template into the PR: it's where you affirm that the site meets the conditions to be supported by yt-dl (spoiler: I expect that it does).

dirkf · 2023-12-05T23:04:33Z

youtube_dl/extractor/unige.py

+    urlencode_postdata,
+)
+
+from youtube_dl.compat import (


Prefer:

Suggested change

from youtube_dl.compat import (

from ..compat import (

dirkf · 2023-12-05T23:07:15Z

youtube_dl/extractor/unige.py

+            self.raise_login_required('You need a username/pwd to access this video')
+
+        try:
+            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'


None of this fancy f-string stuff here, please (breaks Python 2)! The line could be outside the try block.

Suggested change

secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'

secure_wp = 'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id)

dirkf · 2023-12-05T23:28:47Z

youtube_dl/extractor/unige.py

+        try:
+            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
+            self._download_webpage(
+                secure_wp, None, 'Logging in',
+                data=urlencode_postdata({
+                    'httpd_username': username,
+                    'httpd_password': password,
+                }), headers={
+                    'Referer': secure_wp,
+                    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
+                })
+        except ExtractorError as e:
+            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:
+                raise ExtractorError(
+                    'Unable to login: incorrect username and/or password',
+                    expected=True)
+            raise


Although this should work there is another pattern that you may find simpler (at least, it should be equivalent, but I didn't test this specific code).

Also, as we're not reading the page, use _request_webpage() instead of _download_webpage_handle().

Like so:

Suggested change

try:

secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'

self._download_webpage(

secure_wp, None, 'Logging in',

data=urlencode_postdata({

'httpd_username': username,

'httpd_password': password,

}), headers={

'Referer': secure_wp,

'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',

})

except ExtractorError as e:

if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:

raise ExtractorError(

'Unable to login: incorrect username and/or password',

expected=True)

raise

secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'

_, urlh = self._request_webpage(

secure_wp, None, 'Logging in',

data=urlencode_postdata({

'httpd_username': username,

'httpd_password': password,

}), headers={

'Referer': secure_wp,

'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',

}, expected_status=400)

if urlh.getcode() == 400:

raise ExtractorError(

'Unable to login: incorrect username and/or password',

expected=True)

But actually, this is being done twice, so it should be pulled out:

Suggested change

try:

secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'

self._download_webpage(

secure_wp, None, 'Logging in',

data=urlencode_postdata({

'httpd_username': username,

'httpd_password': password,

}), headers={

'Referer': secure_wp,

'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',

})

except ExtractorError as e:

if isinstance(e.cause, compat_HTTPError) and e.cause.code == 400:

raise ExtractorError(

'Unable to login: incorrect username and/or password',

expected=True)

raise

ret_code = self._check_login(

video_id, 'Logging in',

data=urlencode_postdata({

'httpd_username': username,

'httpd_password': password,

}), headers={

'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',

}, expected_status=400)

if ret_code == 400:

raise ExtractorError(

'Unable to login: incorrect username and/or password',

expected=True)

dirkf · 2023-12-05T23:48:24Z

youtube_dl/extractor/unige.py

+        'url': 'https://mediaserver.unige.ch/proxy/196613/VN3-2569-2023-2024-09-19.mp4',
+        'only_matching': True,
+    }]
+


This suggestion from below can be pulled out. kwargs are going to match those of _request_webpage(), though maybe expected_status could made an explicit argument:

Suggested change

def _check_login(self, video_id, note, **kwargs):

_, urlh = self._request_webpage(

'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id),

None, note, **kwargs)

return urlh.getcode()

dirkf · 2023-12-06T00:03:00Z

youtube_dl/extractor/unige.py

+        try:
+            # This dumb download only checks if we need to login, as authentication
+            # is unique (and sometimes optional) for each video
+            secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
+            self._download_webpage(secure_wp, f'secure_{video_id}')
+        except ExtractorError as e:
+            if isinstance(e.cause, compat_HTTPError) and e.cause.code == 401:
+                self._login(video_id)
+            else:
+                # The video doesn't require login
+                pass


Using the proposed new method:

Suggested change

try:

# This dumb download only checks if we need to login, as authentication

# is unique (and sometimes optional) for each video

secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'

self._download_webpage(secure_wp, f'secure_{video_id}')

except ExtractorError as e:

if isinstance(e.cause, compat_HTTPError) and e.cause.code == 401:

self._login(video_id)

else:

# The video doesn't require login

pass

try:

# This dumb download only checks if we need to login, as authentication

# is unique (and sometimes optional) for each video

if self._check_login(video_id, 'secure_{0}'.format(video_id), 401) == 401:

self._login(video_id)

else:

# The video doesn't require login

pass

dirkf · 2023-12-06T00:48:11Z

youtube_dl/extractor/unige.py

+        course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
+        course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')


Several points here:

the extraction shouldn't crash if these optional fields are missing

the regexes could be improved

since 1e8ccdd, we can search for two things at once, like in yt-dlp.

Suggested change

course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')

course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')

course_title, course_id = self._search_regex(

r'''<a\s[^>]*\bhref\s*=\s*('|")?/collection/(?P<courseid>[+\w-]+)(?(1)\1|(?![+-])\b)[^>]*>(?P<course>.*)</a>''',

webpage, 'course details', group=('course', 'courseid'), default=(None, None))

dirkf · 2023-12-06T00:53:48Z

youtube_dl/extractor/unige.py

+        return {
+            'id': video_id,
+            'title': video_title,
+            'url': video_url,
+            'channel': course_title,
+            'channel_id': course_id,
+        }


If you've set an info according to the suggestion above:

Suggested change

return {

'id': video_id,

'title': video_title,

'url': video_url,

'channel': course_title,

'channel_id': course_id,

}

info.update({

'id': video_id,

'title': video_title,

'channel': course_title,

'channel_id': course_id,

})

return info

dirkf · 2023-12-06T00:55:22Z

youtube_dl/extractor/unige.py

+            'id': video_id,
+            'title': video_title,
+            'url': video_url,
+            'channel': course_title,


If this was extracted with _search_regex() according to the suggestion above:

Suggested change

'channel': course_title,

'channel': clean_html(course_title),

dirkf · 2023-12-06T01:09:09Z

youtube_dl/extractor/unige.py

+                # The video doesn't require login
+                pass
+
+        video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')


Maybe better to allow matching \n in the heading text and rely on clean_html() called by the search method to strip unwanted characters.

Suggested change

video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')

video_title = self._html_search_regex(r'(?s)<h1>(.+?)</h1>, 'title')

Although the above strips white space at the start and end of the match, this does it explicitly and crashes if the putative title would be empty:

Suggested change

video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')

video_title = self._html_search_regex(r'(?s)<h1>\s*(\S.*?)\s*</h1>, 'title')

dirkf · 2023-12-06T01:12:37Z

youtube_dl/extractor/unige.py

+class UnigeIE(InfoExtractor):
+    _VALID_URL = r'https://mediaserver.unige.ch/play/(?P<id>\d+)'
+    _TESTS = [{
+        'url': 'https://mediaserver.unige.ch/play/196613',


I hope this is a public video?

dirkf requested changes Dec 6, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[unige] Add extractor for UniGE mediaserver #32596

[unige] Add extractor for UniGE mediaserver #32596

Frankkkkk commented Oct 12, 2023

dirkf commented Dec 2, 2023

Frankkkkk commented Dec 2, 2023 via email

dirkf left a comment •

edited

dirkf Dec 5, 2023

dirkf Dec 5, 2023

dirkf Dec 5, 2023

dirkf Dec 5, 2023

dirkf Dec 6, 2023

dirkf Dec 6, 2023

dirkf Dec 6, 2023

dirkf Dec 6, 2023

dirkf Dec 6, 2023

dirkf Dec 6, 2023

	secure_wp = f'https://mediaserver.unige.ch/proxy/{video_id}/secure.php?view=play&id={video_id}'
	secure_wp = 'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id)

+    def _check_login(self, video_id, note, **kwargs):
+        _, urlh = self._request_webpage(
+            'https://mediaserver.unige.ch/proxy/{0}/secure.php?view=play&id={0}'.format(video_id),
+            None, note, **kwargs)
+        return urlh.getcode()

		course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
		course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')

-        course_title = self._html_search_regex(r'<a href="/collection/[-\w+]+">(?P<course>.*)</a></div>', webpage, 'unige')
-        course_id = self._html_search_regex(r'<a href="/collection/(?P<courseid>[-\w+]+)">', webpage, 'unige')
+        course_title, course_id = self._search_regex(
+            r'''<a\s[^>]*\bhref\s*=\s*('|")?/collection/(?P<courseid>[+\w-]+)(?(1)\1|(?![+-])\b)[^>]*>(?P<course>.*)</a>''',
+            webpage, 'course details', group=('course', 'courseid'), default=(None, None))

	'channel': course_title,
	'channel': clean_html(course_title),

	video_title = self._html_search_regex(r'<h1>(.+?)</h1>', webpage.replace('\n', ''), 'unige')
	video_title = self._html_search_regex(r'(?s)<h1>(.+?)</h1>, 'title')

[unige] Add extractor for UniGE mediaserver #32596

Are you sure you want to change the base?

[unige] Add extractor for UniGE mediaserver #32596

Conversation

Frankkkkk commented Oct 12, 2023

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

What is the purpose of your pull request?

Description of your pull request and other information

dirkf commented Dec 2, 2023

Frankkkkk commented Dec 2, 2023 via email

dirkf left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dirkf left a comment •

edited