
Additional crawlers progress tracker #29

Open
Gabisonfire opened this issue Feb 2, 2024 · 16 comments

@Gabisonfire
Collaborator

Gabisonfire commented Feb 2, 2024

Re-implement scrapers from the upstream repo:

  • 1337x
  • torrent9
  • nyaaPantsu
  • nyaaSis
  • eztv
@purple-emily
Collaborator

Looking for suggestions on the easiest way to get started with this. 1337x is a pain. Anyone got any bright ideas?

@purple-emily
Collaborator

purple-emily commented Feb 23, 2024

Right I have an update!

In its current form you'll need some dev experience to get this running, so if you are a casual user, please be wary.

Here's a full EZTV scraper. You'll need to run it on a system that can access the Knight Crawler Postgres database and an instance of RabbitMQ. This can be the same RabbitMQ that Knight Crawler uses or a separate temporary one.

https://github.com/purple-emily/knight-crawler-scrapers-dirty

It uses Python with Poetry to install the dependencies. If anyone wants a quick guide on how to run it, let me know.

Start one producer and one or two consumers and you should be good.

This is generally a single-use script, as Knight Crawler already picks up the most recent releases from EZTV. You can abort and resume at any time; the script takes care of this for you.

Based on initial runs, this will add at least 200,000 new torrents. Final numbers to be confirmed later.

This is essentially an alpha release, so use it with caution. Back up Postgres before running.

It should take between one and two hours to fetch the data; there are no confirmed numbers yet on processing it all.

It runs on any system with Python. I have provided a start script for each service: ./start_producer.sh (or the .ps1 equivalent on Windows), and the same for start_a_consumer.
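
For anyone who wants to see the shape of it before cloning the repo, here is a minimal sketch of the producer/consumer pattern described above. It assumes RabbitMQ on localhost and the public EZTV JSON API; the queue name, page size, and response field names are illustrative assumptions, not necessarily what knight-crawler-scrapers-dirty actually uses.

```python
# Minimal sketch only. Assumptions: RabbitMQ reachable on localhost,
# the public EZTV JSON API at eztv.re, and a queue named "eztv_pages".
import json

import pika
import requests

EZTV_API = "https://eztv.re/api/get-torrents"
QUEUE = "eztv_pages"
PAGE_SIZE = 100


def produce() -> None:
    """Enqueue one message per EZTV API page so consumers can fetch them in parallel."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    first = requests.get(EZTV_API, params={"limit": PAGE_SIZE, "page": 1}, timeout=30).json()
    pages = first["torrents_count"] // PAGE_SIZE + 1
    for page in range(1, pages + 1):
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps({"limit": PAGE_SIZE, "page": page}),
            properties=pika.BasicProperties(delivery_mode=2),  # persist messages so a run can be resumed
        )
    conn.close()


def consume() -> None:
    """Pull pages off the queue, fetch them, and hand the torrents to the database layer."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    def on_message(ch, method, properties, body):
        params = json.loads(body)
        torrents = requests.get(EZTV_API, params=params, timeout=30).json().get("torrents", [])
        # ...insert/update the torrents in the Knight Crawler Postgres database here...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```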

@purple-emily purple-emily pinned this issue Feb 23, 2024
@purple-emily purple-emily changed the title Create crawlers to get us 1-1 with master (1337x, ...) Additional crawlers progress tracker Feb 23, 2024
@purple-emily purple-emily self-assigned this Feb 23, 2024
@purple-emily
Collaborator

Taking requests for what everyone would like me to prioritise next.

@iPromKnight I don't know if you want to take the logic I have created and convert it to C#. Once we have done a single "full scrape" we don't really have to repeat it. Following the RSS feed gets us all the new releases anyway.

@sleeyax
Collaborator

sleeyax commented Feb 23, 2024

Once we have done a single "full scrape" we don't really have to repeat it.

That's only when the database is shared (or importable) right?

@purple-emily
Collaborator

Once we have done a single "full scrape" we don't really have to repeat it.

That's only when the database is shared (or importable) right?

Essentially run the scraper once to get all of the history, and then the RSS feed crawler will keep it up to date.

@sleeyax
Collaborator

sleeyax commented Feb 23, 2024

You already said that in your previous comment, and I get that. What I mean is: where is this scraped history stored? If it's stored only in your local database, then no one else can access it unless they also scrape it themselves.

What I'm trying to get at is this: if KC users are expected to run the EZTV scraper themselves to fetch all of the initial history, I think it would make sense to rewrite your POC in C# for consistency. If the DB is somehow shared, then it doesn't matter as much imo.

@Gabisonfire
Collaborator Author

@sleeyax It's stored in the local database. We don't have any sort of database sharing at the moment; it's definitely something I'd like to see happen, but it's going to be a lot of work.

As far as the language is concerned, I don't see a big issue with supporting multiple languages as long as it's pretty much plug and play. All that matters is that the database schema is respected.
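
To illustrate that plug-and-play point, a scraper in any language ultimately boils down to something like the sketch below. The table and column names here are hypothetical placeholders, not the real Knight Crawler schema, so check the actual migrations before writing anything.

```python
# Hypothetical sketch: "ingested_torrents" and its columns are placeholder names,
# not the real Knight Crawler schema.
import psycopg2


def save_torrent(conn, torrent: dict) -> None:
    """Upsert one scraped torrent; duplicates (same info hash) are ignored."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO ingested_torrents (name, info_hash, size, seeders, source)
            VALUES (%(name)s, %(info_hash)s, %(size)s, %(seeders)s, %(source)s)
            ON CONFLICT (info_hash) DO NOTHING
            """,
            torrent,
        )
    conn.commit()


conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/knightcrawler")
save_torrent(conn, {
    "name": "Some.Show.S01E01.1080p.WEB.x264",
    "info_hash": "0123456789abcdef0123456789abcdef01234567",
    "size": 1_500_000_000,
    "seeders": 42,
    "source": "eztv",
})
```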

@iPromKnight
Collaborator

iPromKnight commented Feb 24, 2024

The problem with having a shared database is that we then become susceptible to DMCA actions.

When media takedown requests are issued, they are against the hash for the magnet as well as the content.

That's why I've been reluctant to implement anything for that and prefer to rely solely on external sources.

I'm toying with the idea of taking #45 and expanding on it so that, as a preseed action, it could fetch the list of IMDb ids known to Cinemeta and process lookups for them in parallel using the Helios-compatible provider definitions.
This would make scraping outside of RSS much more maintainable, as we'd have a generic processing pipeline with scrape actions defined in JSON. It would also mean that users can easily add their own.
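
Purely as a sketch of what that could look like: the provider-definition shape, URLs, and field names below are made up for illustration and are not the real Helios format or Cinemeta endpoints.

```python
# Illustrative only: the definition shape and URLs below are assumptions,
# not the actual Helios provider-definition format or Cinemeta endpoints.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

# A scrape action defined in JSON: which tracker to hit and how to read its response.
provider = json.loads("""
{
    "name": "example-tracker",
    "search_url": "https://tracker.example/search?imdb={imdb_id}",
    "results_path": "torrents"
}
""")


def lookup(imdb_id: str) -> list:
    """Run one provider definition against a single IMDb id."""
    url = provider["search_url"].format(imdb_id=imdb_id)
    response = requests.get(url, timeout=30)
    return response.json().get(provider["results_path"], [])


# Preseed: walk a list of known IMDb ids (e.g. pulled from Cinemeta) in parallel.
imdb_ids = ["tt0111161", "tt0068646", "tt0468569"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for results in pool.map(lookup, imdb_ids):
        for torrent in results:
            print(torrent)
```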

@iPromKnight
Collaborator

One of the reasons I wanted to redo the consumer in TypeScript rather than rewrite it in C# was to show that, with a service bus in place, it doesn't matter what tech we write services in 😃

@purple-emily
Collaborator

One of the reasons I wanted to redo the consumer in TypeScript rather than rewrite it in C# was to show that, with a service bus in place, it doesn't matter what tech we write services in 😃

I'm going to do a refactor of the "deep EZTV" crawler I've written and then try to use it as a framework for making more. nyaa.si has an RSS feed. How easy would it be to add it to the C# scraper?

@iPromKnight
Collaborator

iPromKnight commented Feb 24, 2024

If it's RSS, really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.
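
The real abstract scraper lives in the C# codebase, but here is a rough Python analogue of the derive-and-override idea; the class names, method names, and the nyaa.si feed URL are assumptions for illustration.

```python
# Rough Python analogue of the derive-and-override idea; the actual abstract
# XML scraper is in the C# codebase and its names will differ.
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod

import requests


class XmlRssScraper(ABC):
    """Shared logic: fetch the feed, parse the XML, hand each <item> to the subclass."""

    feed_url: str  # subclasses set this

    @abstractmethod
    def parse_item(self, item: ET.Element) -> dict:
        """Turn one <item> element into a torrent record."""

    def scrape(self) -> list:
        xml = requests.get(self.feed_url, timeout=30).text
        root = ET.fromstring(xml)
        return [self.parse_item(item) for item in root.iter("item")]


class NyaaSiScraper(XmlRssScraper):
    feed_url = "https://nyaa.si/?page=rss"  # assumption: nyaa.si's public RSS feed

    def parse_item(self, item: ET.Element) -> dict:
        return {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }


print(NyaaSiScraper().scrape()[:3])
```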

@purple-emily
Collaborator

If it's RSS, really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.

Do you think that's something you could do? Or does anyone else want to offer to do it? I can make that the next deep scraper, as it's our most requested in Discord.

@Gabisonfire
Collaborator Author

@purple-emily I can take care of it. I was going to do Torrent9, but I can probably do both.

@purple-emily
Collaborator

@iPromKnight are you not able to make a throwaway account and join Discord, even if it's just to keep it on mute and never speak in the group context, so that Gabi or I can keep in contact with you?

@purple-emily
Collaborator

As per #98, we now support new releases from nyaa.si.

Support for scraping old releases is to come.

@dmitrc

dmitrc commented May 19, 2024

If it's RSS, really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.

Did we add that abstract XML scraper by any chance?
That could be useful for adding niche trackers with rich catalogs, like RuTracker etc.
