[subscribestar] Refactoring extractor and handling audio content #5580

WyohKnott · 2024-05-11T13:17:30Z

New support for embedded audio
New support for external links compatible with yt-dlp
Add a content_type field at the post level for directory creation
Major rework of the logic
Added a check_if_supported_by_ytdlp helper function in util.py for yt-dlp external links handling

WyohKnott · 2024-05-11T13:18:10Z

gallery_dl/extractor/subscribestar.py

-            "content"    : (extr(
-                '<div class="post-content', '<div class="post-uploads')
-                .partition(">")[2]),
-        }


We remove this because it duplicates what the base class already does.

WyohKnott · 2024-05-11T13:19:46Z

gallery_dl/extractor/subscribestar.py

+                    break
+                else:
+                    content_type = "image"
+


This part checks what type of content the post contains and adds a content_type to the post data, so that it can be used in the directory name. It is not perfect, since a post could probably contain multiple content types, but I do not have enough samples to test.
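As a rough illustration of the for/else pattern in this hunk, here is a minimal standalone sketch. The `CONTENT_MARKERS` table and the `detect_content_type` name are assumptions made for the example, not the names used in the PR. It also shows the limitation mentioned above: the first matching marker wins, so a post with mixed content gets only one content_type.

```python
# Hypothetical marker table; the detection strings in the actual
# extractor may differ.
CONTENT_MARKERS = (
    ("audio", 'type="audio/'),
    ("link", 'data-href="'),
)


def detect_content_type(html):
    """Return the first matching content type, or "image" as a fallback."""
    for content_type, marker in CONTENT_MARKERS:
        if marker in html:
            # First match wins; later markers are not checked.
            break
    else:
        # The loop finished without a break: no marker matched.
        content_type = "image"
    return content_type
```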

WyohKnott · 2024-05-11T13:22:45Z

gallery_dl/extractor/subscribestar.py

+            "link": ('data-href="', '"', self._process_media_item),
+            "audio": ('<source src="', '" type="audio/',
+                      self._process_media_item),
+        }


Here we define 4 content types:

gallery: already handled before

attachments: already handled before

link: a new type that extracts links from post bodies

audio: for posts that have embedded audio

For each type, we have:

the string where the detection rule begins

the string where the detection rule ends

the function that will process the returned content
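To make the (begin, end, processor) layout concrete, here is a small standalone sketch. The two marker pairs are copied from the hunk above; the local `extr` helper, the `passthrough` processor and `apply_rules` are stand-ins written for this example and are not the code in the PR.

```python
def extr(txt, begin, end, default=""):
    """Return the text between the first 'begin' and the next 'end'."""
    try:
        first = txt.index(begin) + len(begin)
        return txt[first:txt.index(end, first)]
    except ValueError:
        return default


def passthrough(value, media_type):
    # Hypothetical processor: wrap the extracted text without any checks.
    return {"url": value, "type": media_type}


# media type -> (begin marker, end marker, processor)
RULES = {
    "link": ('data-href="', '"', passthrough),
    "audio": ('<source src="', '" type="audio/', passthrough),
}


def apply_rules(html):
    """Apply every rule once and collect the processed results."""
    results = []
    for media_type, (begin, end, processor) in RULES.items():
        value = extr(html, begin, end)
        if value:
            results.append(processor(value, media_type))
    return results
```

Keeping the markers and the processor together in one table means a new content type only needs a new entry, without touching the loop.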

WyohKnott · 2024-05-11T13:24:19Z

gallery_dl/extractor/subscribestar.py

+                    if segment[key]:
+                        content = processor(segment, key)
+                        if content:
+                            media.append(content)


We send the extracted text to the processor and, if it returns a non-empty result, we append it to media.
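A standalone sketch of this filtering step, assuming each segment is a dict keyed by media type. `collect_media` and the two processors are hypothetical names for the example; the point is only that a processor can reject an item by returning None.

```python
def process_link(segment, key):
    # Hypothetical processor: keep only links that look like media files.
    url = segment[key]
    if url.endswith((".mp3", ".mp4")):
        return {"url": url, "type": key}
    return None


def process_audio(segment, key):
    # Hypothetical processor: embedded audio is always kept.
    return {"url": segment[key], "type": key}


def collect_media(segments, processors):
    """Run each processor on non-empty fields and keep truthy results."""
    media = []
    for segment in segments:
        for key, processor in processors.items():
            if segment.get(key):
                content = processor(segment, key)
                if content:
                    media.append(content)
    return media
```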

WyohKnott · 2024-05-11T13:25:04Z

gallery_dl/extractor/subscribestar.py

+        for media in gallery_list:
+            if "/previews" in media["url"]:
+                self._warn_preview()
+            return {"url": media["url"], "type": media_type}


Gallery processing, not much changed here.

WyohKnott · 2024-05-11T13:25:52Z

gallery_dl/extractor/subscribestar.py

+            "name": text.unescape(text.extr(item, 'doc_preview-title">', "<")),
+            "url": text.unescape(text.extr(item, 'href="', '"')),
+            "type": media_type,
+        }


Attachment processing, not much changed here. I haven't been able to test it, as I have never seen this type of content.

WyohKnott · 2024-05-11T13:26:49Z

gallery_dl/extractor/subscribestar.py

+                item[media_type]):
+            return {"url": "ytdl:" + item[media_type], "type": media_type}
+        elif media_type == "audio":
+            return {"url": item[media_type], "type": media_type}


Here we process the newly handled types:

link: if the URL is downloadable by yt-dlp, we prefix it with ytdl: and append it

audio: we append the URL as is
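The two branches can be sketched as a standalone function. The yt-dlp support check is passed in as `is_supported`, a hypothetical stand-in for the `check_if_supported_by_ytdlp` helper (its implementation is not shown in this thread), so the example runs without yt-dlp installed.

```python
def process_media_item(item, media_type, is_supported):
    """Return a media dict for supported links and audio, else None."""
    if media_type == "link" and is_supported(item[media_type]):
        # The "ytdl:" prefix hands the URL over to the ytdl downloader.
        return {"url": "ytdl:" + item[media_type], "type": media_type}
    elif media_type == "audio":
        # Embedded audio is a direct file URL and is downloaded as is.
        return {"url": item[media_type], "type": media_type}
    # Unsupported links fall through and are skipped by the caller.
    return None


def looks_like_drive(url):
    # Stand-in check for the example only; NOT how yt-dlp decides support.
    return "drive.google.com" in url
```

For instance, `process_media_item({"link": "https://example.org/"}, "link", looks_like_drive)` returns None, so that link would never reach the media list.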

WyohKnott · 2024-05-11T13:27:55Z

gallery_dl/extractor/subscribestar.py

+            "link": True,
+            "audio": True,
+        }
+        media = self._extract_media(html, media_types)


We have rewritten the function, splitting it into several smaller ones that each handle one media type, instead of having one big function.

WyohKnott · 2024-05-11T13:34:11Z

So one VA artist I sub to was pwned from Patreon and moved to SubscribeStar. Having no experience with SubscribeStar, I searched for a scraper, starting with my faithful gallery-dl. However, it seems that the SubscribeStar backend didn't handle audio content embedded in posts, which is quite problematic when scraping a VA artist.

Moreover, this VA artist also embedded Google Drive links to some audio files, and I am too lazy to download them manually. I'd rather spend an evening rewriting the backend instead.

It works for that one artist I subbed to, but I have no other reference point to check against. I don't know, for example, whether gallery-type content can contain both videos and pictures.

Anyway, it works well so far.


WyohKnott changed the title ~~Rework on subscribestar extractor~~ [subscribestar] Rework on extractor May 11, 2024

WyohKnott force-pushed the fix_subscribestar branch from 259842f to e5e752d Compare May 11, 2024 13:35

WyohKnott changed the title ~~[subscribestar] Rework on extractor~~ [subscribestar] Refactoring extractor and handling audio content May 11, 2024
