
Additional crawlers progress tracker #29

Open
Gabisonfire opened this issue Feb 2, 2024 · 16 comments

@Gabisonfire
Collaborator

Gabisonfire commented Feb 2, 2024

Re-implement scrapers from the upstream repo:

  • 1337x
  • torrent9
  • nyaaPantsu
  • nyaaSis
  • eztv
@purple-emily
Collaborator

Looking for suggestions on the easiest way to get started with this. 1337x is a pain. Anyone got any bright ideas?

@purple-emily
Collaborator

purple-emily commented Feb 23, 2024

Right I have an update!

In its current form you'll need some dev experience to get this running, so if you are a casual user, please be wary.

Here's a full EZTV scraper. You'll need to run it on a system that can access the Knight Crawler Postgres database and an instance of RabbitMQ. This can be the same RabbitMQ that Knight Crawler uses or a separate temporary one.

https://github.com/purple-emily/knight-crawler-scrapers-dirty

It uses Python with Poetry to install the dependencies. If anyone wants a quick guide on how to run it, let me know.

Start one producer and one or two consumers and you should be good.

This is generally a single-use script, as Knight Crawler already picks up the most recent releases from EZTV. You can abort and resume at any time; the script takes care of this for you.

Based on initial runs, this will add at least 200,000 new torrents. Final numbers to be confirmed later.

This is essentially an alpha release, so use it with caution. Back up Postgres before running.

It should take between one and two hours to fetch the data; there are no confirmed numbers yet on processing it all.

It runs on any system with Python. I have provided a start script for each service: ./start_producer.sh (or the .ps1 equivalent on Windows), and the same for start_a_consumer.
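
For anyone who wants to see the shape of it before cloning the repo, here is a minimal sketch of the producer/consumer pattern described above. It assumes RabbitMQ on localhost and the public EZTV JSON API; the queue name, page size, and response field names are illustrative assumptions, not necessarily what knight-crawler-scrapers-dirty actually uses.

```python
# Minimal sketch only. Assumptions: RabbitMQ reachable on localhost,
# the public EZTV JSON API at eztv.re, and a queue named "eztv_pages".
import json

import pika
import requests

EZTV_API = "https://eztv.re/api/get-torrents"
QUEUE = "eztv_pages"
PAGE_SIZE = 100


def produce() -> None:
    """Enqueue one message per EZTV API page so consumers can fetch them in parallel."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    first = requests.get(EZTV_API, params={"limit": PAGE_SIZE, "page": 1}, timeout=30).json()
    pages = first["torrents_count"] // PAGE_SIZE + 1
    for page in range(1, pages + 1):
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps({"limit": PAGE_SIZE, "page": page}),
            properties=pika.BasicProperties(delivery_mode=2),  # persist messages so a run can be resumed
        )
    conn.close()


def consume() -> None:
    """Pull pages off the queue, fetch them, and hand the torrents to the database layer."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    def on_message(ch, method, properties, body):
        params = json.loads(body)
        torrents = requests.get(EZTV_API, params=params, timeout=30).json().get("torrents", [])
        # ...insert/update the torrents in the Knight Crawler Postgres database here...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```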

@purple-emily purple-emily pinned this issue Feb 23, 2024
@purple-emily purple-emily changed the title Create crawlers to get us 1-1 with master (1337x, ...) Additional crawlers progress tracker Feb 23, 2024
@purple-emily purple-emily self-assigned this Feb 23, 2024
@purple-emily
Collaborator

Taking requests for what everyone would like me to prioritise next.

@iPromKnight I don't know if you want to take the logic I have created and convert it to C#. Once we have done a single "full scrape" we don't really have to repeat it. Following the RSS feed gets us all the new releases anyway.

@sleeyax
Collaborator

sleeyax commented Feb 23, 2024

Once we have done a single "full scrape" we don't really have to repeat it.

That's only when the database is shared (or importable) right?

@purple-emily
Collaborator

Once we have done a single "full scrape" we don't really have to repeat it.

That's only when the database is shared (or importable) right?

Essentially run the scraper once to get all of the history, and then the RSS feed crawler will keep it up to date.

@sleeyax
Collaborator

sleeyax commented Feb 23, 2024

You already said that in your previous comment, and I get that. What I mean is: where is this scraped history stored? If it's stored only in your local database, then no one else can access it unless they also scrape it themselves.

What I'm trying to get at is this: if KC users are expected to run the EZTV scraper themselves to fetch all of the initial history, I think it would make sense to rewrite your POC in C# for consistency. If the DB is somehow shared, then it doesn't matter as much imo.

@Gabisonfire
Collaborator Author

@sleeyax It's stored in the local database. We don't have any sort of database sharing at the moment; it's definitely something I'd like to see happen, but it's going to be a lot of work.

As far as the language is concerned, I don't see a big issue with supporting multiple languages as long as it's pretty much plug and play. All that matters is that the database schema is respected.
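
To illustrate that plug-and-play point, a scraper in any language ultimately boils down to something like the sketch below. The table and column names here are hypothetical placeholders, not the real Knight Crawler schema, so check the actual migrations before writing anything.

```python
# Hypothetical sketch: "ingested_torrents" and its columns are placeholder names,
# not the real Knight Crawler schema.
import psycopg2


def save_torrent(conn, torrent: dict) -> None:
    """Upsert one scraped torrent; duplicates (same info hash) are ignored."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO ingested_torrents (name, info_hash, size, seeders, source)
            VALUES (%(name)s, %(info_hash)s, %(size)s, %(seeders)s, %(source)s)
            ON CONFLICT (info_hash) DO NOTHING
            """,
            torrent,
        )
    conn.commit()


conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/knightcrawler")
save_torrent(conn, {
    "name": "Some.Show.S01E01.1080p.WEB.x264",
    "info_hash": "0123456789abcdef0123456789abcdef01234567",
    "size": 1_500_000_000,
    "seeders": 42,
    "source": "eztv",
})
```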

@iPromKnight
Collaborator

iPromKnight commented Feb 24, 2024

The problem with having a shared database is that we then become susceptible to DMCA actions.

When media takedown requests are issued, they are against the hash for the magnet as well as the content.

That's why I've been reluctant to implement anything for that and prefer to rely solely on external sources.

I'm toying with the idea of taking #45 and expanding on it so that, as a preseed action, it could fetch the list of IMDb ids known to Cinemeta and process lookups for them in parallel using the Helios-compatible provider definitions.
This would make scraping outside of RSS much more maintainable, as we'd have a generic processing pipeline with scrape actions defined in JSON. It would also mean that users can easily add their own.
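
Purely as a sketch of what that could look like: the provider-definition shape, URLs, and field names below are made up for illustration and are not the real Helios format or Cinemeta endpoints.

```python
# Illustrative only: the definition shape and URLs below are assumptions,
# not the actual Helios provider-definition format or Cinemeta endpoints.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

# A scrape action defined in JSON: which tracker to hit and how to read its response.
provider = json.loads("""
{
    "name": "example-tracker",
    "search_url": "https://tracker.example/search?imdb={imdb_id}",
    "results_path": "torrents"
}
""")


def lookup(imdb_id: str) -> list:
    """Run one provider definition against a single IMDb id."""
    url = provider["search_url"].format(imdb_id=imdb_id)
    response = requests.get(url, timeout=30)
    return response.json().get(provider["results_path"], [])


# Preseed: walk a list of known IMDb ids (e.g. pulled from Cinemeta) in parallel.
imdb_ids = ["tt0111161", "tt0068646", "tt0468569"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for results in pool.map(lookup, imdb_ids):
        for torrent in results:
            print(torrent)
```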

@iPromKnight
Collaborator

One of the reasons I wanted to redo the consumer in TypeScript rather than rewrite it in C# was to show that, with a service bus in place, it doesn't matter what tech we write services in 😃

@purple-emily
Collaborator

One of the reasons I wanted to redo the consumer in TypeScript rather than rewrite it in C# was to show that, with a service bus in place, it doesn't matter what tech we write services in 😃

I'm going to do a refactor of the "deep EZTV" crawler I've written and then try to use it as a framework for making more. nyaa.si has an RSS feed. How easy would it be to add it to the C# scraper?

@iPromKnight
Collaborator

iPromKnight commented Feb 24, 2024

If it's RSS, really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.
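
The real abstract scraper lives in the C# codebase, but here is a rough Python analogue of the derive-and-override idea; the class names, method names, and the nyaa.si feed URL are assumptions for illustration.

```python
# Rough Python analogue of the derive-and-override idea; the actual abstract
# XML scraper is in the C# codebase and its names will differ.
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod

import requests


class XmlRssScraper(ABC):
    """Shared logic: fetch the feed, parse the XML, hand each <item> to the subclass."""

    feed_url: str  # subclasses set this

    @abstractmethod
    def parse_item(self, item: ET.Element) -> dict:
        """Turn one <item> element into a torrent record."""

    def scrape(self) -> list:
        xml = requests.get(self.feed_url, timeout=30).text
        root = ET.fromstring(xml)
        return [self.parse_item(item) for item in root.iter("item")]


class NyaaSiScraper(XmlRssScraper):
    feed_url = "https://nyaa.si/?page=rss"  # assumption: nyaa.si's public RSS feed

    def parse_item(self, item: ET.Element) -> dict:
        return {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }


print(NyaaSiScraper().scrape()[:3])
```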

@purple-emily
Collaborator

If it's RSS, really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.

Do you think that's something you could do? Or does anyone else want to offer to do it? I can make that the next deep scraper, as it's our most requested in Discord.

@Gabisonfire
Collaborator Author

@purple-emily I can take care of it. I was going to do Torrent9, but I can probably do both.

@purple-emily
Collaborator

@iPromKnight are you not able to make a throwaway account and join Discord, even if it's just to keep it on mute and never speak in the group context, so that Gabi or I can keep in contact with you?

@purple-emily
Collaborator

As per #98, we now support new releases from nyaa.si.

Support for scraping old releases is to come.

@dmitrc

dmitrc commented May 19, 2024

If it's RSS, really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.

Did we add that abstract XML scraper by any chance?
That could be useful for adding niche trackers with rich catalogs, like RuTracker etc.
