
Enhancement - End Engine's task once it's done scraping and reached all the target pages available. #20

Open
bogdan799 opened this issue Jul 11, 2023 · 1 comment

Hello,

First of all, I very much appreciate the work you've put into this project. I've just tried the library and, man, it's cool: it works like magic and is very configurable.

As far as I understand, the main use case is a long-running engine that collects lots of data from a website and stores it somewhere.
However, it could also be very useful when there's a finite amount of data to retrieve and the job has to finish in finite time, say navigating a few pages, parsing some data, and returning it. As far as I can tell, that's currently hard to achieve.

The two ways I found to get the data and parse it into an object are either using Subscribe() and deserializing the JObject into an object, or implementing my own IScraperSink and storing the data there for further use. I'm fine with both solutions and I've tested them: both work perfectly. However, once the engine starts it never stops, even when there's nothing left to parse, because nothing ever closes the Channel; it stays open forever, so the AsyncEnumerable never ends.
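
To illustrate the first approach, here is roughly the callback I pass to Subscribe(). The Listing record and its properties are placeholders I made up for this example; only the JObject-to-object step is the real Newtonsoft.Json call:

using System;
using System.Collections.Concurrent;
using Newtonsoft.Json.Linq;

var results = new ConcurrentBag<Listing>();

// The kind of callback handed to Subscribe(): take the scraped JObject
// and turn it into a typed object that can be used once the run is over.
Action<JObject> onScraped = json => results.Add(json.ToObject<Listing>()!);

// Simulating the engine emitting one parsed page:
onScraped(JObject.Parse("{ \"Title\": \"Sample\", \"Price\": 9.99 }"));

// Placeholder result type, made up just for this example.
record Listing(string Title, decimal Price);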

Therefore, I propose a change where the engine stores the current parse status as a tree; once we reach the state where all leaf pages are of the TargetPage PageCategory, we close the channel, which lets Parallel.ForEachAsync finish its execution, returns control from the engine, and makes it possible to actually await the engine's run before retrieving the results. It might not be perfect, and I'm sure it isn't; it's just the first thing that came to mind that could work. Maybe you have different ideas.
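
To make the mechanics concrete, here is the generic .NET pattern I have in mind, using plain System.Threading.Channels rather than the library's actual internals: once the producer marks the channel writer as complete, ReadAllAsync() ends, Parallel.ForEachAsync returns, and the whole run becomes awaitable.

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<string>();

// Consumer: mirrors how the engine drains the channel with Parallel.ForEachAsync.
var consumer = Parallel.ForEachAsync(
    channel.Reader.ReadAllAsync(),
    async (url, ct) =>
    {
        Console.WriteLine($"processing {url}");
        await Task.Yield();
    });

// Producer: once all target pages have been reached, complete the channel.
await channel.Writer.WriteAsync("https://example.com/page/1");
channel.Writer.Complete();   // without this call, the consumer loop never ends

await consumer;              // now this actually returns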

Please let me know what you think about this and whether you have plans and time for this enhancement.

Thank you,
Bogdan

pavlovtech (Owner) commented Aug 11, 2023

Hi Bogdan,

Appreciate your feedback and suggestions! Things have been quite intense at work, so I've had little time for improvements.

I like your idea and plan to implement it one way or another. At the moment, the only way to stop the engine is to specify a page crawl limit beforehand:

var engine = await new ScraperEngineBuilder()
    ...
    .PageCrawlLimit(100)
    .BuildAsync();
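
With that limit in place, the engine stops once 100 pages have been crawled, so the run can be awaited and anything collected through Subscribe() or a custom IScraperSink can be read afterwards.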
