[subscribestar] Refactoring extractor and handling audio content #5580

Open: 1 commit to be merged into base: master
WyohKnott

  • Add support for embedded audio
  • Add support for external links that yt-dlp can download
  • Add a content_type field at the post level, usable in directory names
  • Rework most of the extraction logic
  • Add a check_if_supported_by_ytdlp helper function in util.py to handle yt-dlp-compatible external links
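
A rough sketch of how a helper like check_if_supported_by_ytdlp might work. The PR's actual implementation is not shown in this conversation, so the body below is an assumption. It relies on yt-dlp's gen_extractors() and each extractor's suitable() method, and skips the catch-all "generic" extractor. The extractors parameter is hypothetical, added only so the check can be exercised without yt-dlp installed.

```python
def check_if_supported_by_ytdlp(url, extractors=None):
    """Return True if a dedicated yt-dlp extractor claims the URL.

    Assumption: this mirrors the intent of the helper named in the PR,
    not its actual code.
    """
    if extractors is None:
        try:
            from yt_dlp.extractor import gen_extractors
        except ImportError:
            # yt-dlp is optional; without it no link counts as supported
            return False
        extractors = gen_extractors()

    for extractor in extractors:
        # The "generic" extractor accepts almost any URL, so it is skipped
        if extractor.IE_NAME == "generic":
            continue
        if extractor.suitable(url):
            return True
    return False
```

Walking the full extractor list on every link can be slow, so a real implementation would probably cache the list or the per-URL result.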

"content" : (extr(
'<div class="post-content', '<div class="post-uploads')
.partition(">")[2]),
}
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We remove this override because it is identical to the one in the base class.

break
else:
content_type = "image"


The goal of this part is to check what type of content the post contains and add a content_type field to the post data, so it can be used in the directory name. This is not perfect, as a post could probably contain multiple content types, but I do not have enough samples to test.
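
A minimal sketch of this kind of detection, assuming a first-match-wins scan over marker strings. The function name, the markers parameter and the video marker are hypothetical; only the audio marker and the "image" fallback come from the diff excerpts in this PR.

```python
def detect_content_type(html, markers=None):
    """Return the first content type whose marker appears in the post HTML."""
    if markers is None:
        markers = (
            ("audio", 'type="audio/'),   # matches the audio rule in this PR
            ("video", 'type="video/'),   # hypothetical marker
        )
    for content_type, marker in markers:
        if marker in html:
            break
    else:
        # No marker matched: fall back to "image", as in the diff excerpt
        content_type = "image"
    return content_type
```

Because the first match wins, a post mixing several types still gets a single label, which is the limitation mentioned above.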

    "link": ('data-href="', '"', self._process_media_item),
    "audio": ('<source src="', '" type="audio/',
              self._process_media_item),
}

Here we define four content types:

  • gallery: already handled before
  • attachments: already handled before
  • link: a new type, used to extract links from post bodies
  • audio: a new type, used when a post has embedded audio

For each type, we have:

  • the string that marks where the detection rule begins
  • the string that marks where the detection rule ends
  • the function that will process the returned content
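
The table-driven approach described above can be sketched as follows. This is an illustration, not the PR's code: extract_all, process_item, MEDIA_RULES and extract_media are hypothetical stand-ins, and only the begin/end marker strings are taken from the diff excerpt.

```python
def extract_all(html, begin, end):
    """Yield every substring of html found between begin and end."""
    pos = 0
    while True:
        start = html.find(begin, pos)
        if start < 0:
            return
        start += len(begin)
        stop = html.find(end, start)
        if stop < 0:
            return
        yield html[start:stop]
        pos = stop + len(end)


def process_item(value, media_type):
    # Hypothetical processor: wrap the raw match in a dict, or skip it
    return {"url": value, "type": media_type} if value else None


# (begin marker, end marker, processor) for the two new types
MEDIA_RULES = {
    "link": ('data-href="', '"', process_item),
    "audio": ('<source src="', '" type="audio/', process_item),
}


def extract_media(html, rules=MEDIA_RULES):
    media = []
    for media_type, (begin, end, processor) in rules.items():
        for value in extract_all(html, begin, end):
            content = processor(value, media_type)
            if content:
                media.append(content)
    return media
```

A plain substring search like this can over-match, for instance when a non-audio <source> tag precedes an audio one, so the markers need to be chosen with the real page markup in mind.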

if segment[key]:
    content = processor(segment, key)
    if content:
        media.append(content)

We pass the extracted text to the processor and, if it returns something other than None, we append the result to media.

for media in gallery_list:
if "/previews" in media["url"]:
self._warn_preview()
return {"url": media["url"], "type": media_type}

Gallery processing, not much changed here.

"name": text.unescape(text.extr(item, 'doc_preview-title">', "<")),
"url": text.unescape(text.extr(item, 'href="', '"')),
"type": media_type,
}

Attachment processing, not much changed here. I haven't been able to test it, as I have never seen a post with this type of content.

        item[media_type]):
    return {"url": "ytdl:" + item[media_type], "type": media_type}
elif media_type == "audio":
    return {"url": item[media_type], "type": media_type}

Here we process the newly handled types:

  • link: if the URL is downloadable by yt-dlp, we return it with a ytdl: prefix
  • audio: we return the source URL as-is
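
A self-contained sketch of this branch, based on the diff excerpt. The is_supported parameter is a hypothetical stand-in for the PR's check_if_supported_by_ytdlp helper, and returning None for unsupported links is an assumption about the part of the function that is not shown.

```python
def process_media_item(item, media_type, is_supported=lambda url: False):
    """Turn an extracted link or audio URL into a media dict, or None."""
    url = item[media_type]
    if media_type == "link":
        if is_supported(url):
            # The "ytdl:" prefix marks the URL for download through yt-dlp
            return {"url": "ytdl:" + url, "type": media_type}
        # Assumption: links yt-dlp cannot handle are skipped
        return None
    if media_type == "audio":
        return {"url": url, "type": media_type}
    return None
```

Returning None for anything unhandled lets the caller filter results with a simple truthiness check before appending to the media list.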

"link": True,
"audio": True,
}
media = self._extract_media(html, media_types)

We have rewritten the function, splitting it into several smaller ones that each handle one media type, instead of having a single big one.

@WyohKnott

So one VA artist I sub to was pwned from Patreon and moved to SubscribeStar. Having no experience with SubscribeStar, I searched for a scraper, starting with my faithful gallery-dl. However, it seems that its SubscribeStar extractor didn't handle audio content embedded in posts, which is quite a problem when scraping a VA artist.

Moreover, this VA artist also embedded Google Drive links to some audio files, and I am too lazy to download them manually. I'd rather spend an evening rewriting the extractor instead.

It works for the one artist I'm subbed to, but I have no other reference point to check against. I don't know, for example, whether gallery-type content can contain both videos and pictures.

Anyway, it works well so far.

WyohKnott changed the title from "Rework on subscribestar extractor" to "[subscribestar] Rework on extractor" on May 11, 2024
 - New support for embedded audios
 - New support for external links compatible with yt-dlp
 - Add a content_type field at the post level for directory creation
 - Major rework of the logic
 - Added a check_if_supported_by_ytdlp helper function in util.py
   for yt-dlp external links handling
WyohKnott changed the title from "[subscribestar] Rework on extractor" to "[subscribestar] Refactoring extractor and handling audio content" on May 11, 2024