
Enhancement - End Engine's task once it's done scraping and reached all the target pages available. #20

Open
bogdan799 opened this issue Jul 11, 2023 · 1 comment

Hello,

First of all, I very much appreciate the work you've put into this project. I've just tried the library and, man, it's cool: it works like magic and is very configurable.

As far as I understand, the main use case is a long-running engine that collects lots of data from a website and stores it somewhere.
However, it could also be very useful when there's a finite amount of data to retrieve and the job has to finish in finite time, say navigating a few pages, parsing some data, and returning it. As far as I can tell, that's currently hard to achieve.

The two ways I found to get the data and parse it into an object are either using Subscribe() and deserializing the JObject into an object, or implementing my own IScraperSink and storing the data there for further use. I'm fine with both solutions and I've tested them: both work perfectly. However, once the engine starts it never stops, even when there's nothing left to parse, because nothing ever closes the Channel; it stays open forever, so the AsyncEnumerable never ends.
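
To illustrate the first approach, here is roughly the callback I pass to Subscribe(). The Listing record and its properties are placeholders I made up for this example; only the JObject-to-object step is the real Newtonsoft.Json call:

using System;
using System.Collections.Concurrent;
using Newtonsoft.Json.Linq;

var results = new ConcurrentBag<Listing>();

// The kind of callback handed to Subscribe(): take the scraped JObject
// and turn it into a typed object that can be used once the run is over.
Action<JObject> onScraped = json => results.Add(json.ToObject<Listing>()!);

// Simulating the engine emitting one parsed page:
onScraped(JObject.Parse("{ \"Title\": \"Sample\", \"Price\": 9.99 }"));

// Placeholder result type, made up just for this example.
record Listing(string Title, decimal Price);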

Therefore, I propose a change where the engine stores the current parse status as a tree; once we reach the state where all leaf pages are of the TargetPage PageCategory, we close the channel, which lets Parallel.ForEachAsync finish its execution, returns control from the engine, and makes it possible to actually await the engine's run before retrieving the results. It might not be perfect, and I'm sure it isn't; it's just the first thing that came to mind that could work. Maybe you have different ideas.
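
To make the mechanics concrete, here is the generic .NET pattern I have in mind, using plain System.Threading.Channels rather than the library's actual internals: once the producer marks the channel writer as complete, ReadAllAsync() ends, Parallel.ForEachAsync returns, and the whole run becomes awaitable.

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<string>();

// Consumer: mirrors how the engine drains the channel with Parallel.ForEachAsync.
var consumer = Parallel.ForEachAsync(
    channel.Reader.ReadAllAsync(),
    async (url, ct) =>
    {
        Console.WriteLine($"processing {url}");
        await Task.Yield();
    });

// Producer: once all target pages have been reached, complete the channel.
await channel.Writer.WriteAsync("https://example.com/page/1");
channel.Writer.Complete();   // without this call, the consumer loop never ends

await consumer;              // now this actually returns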

Please let me know what you think about this and whether you have plans and time for this enhancement.

Thank you,
Bogdan

pavlovtech (Owner) commented Aug 11, 2023

Hi Bogdan,

Appreciate your feedback and suggestions! Things have been quite intense at work, so I've had little time for improvements.

I like your idea and plan to implement it one way or another. At the moment, the only way to stop the engine is to specify a page crawl limit beforehand:

var engine = await new ScraperEngineBuilder()
    ...
    .PageCrawlLimit(100)
    .BuildAsync();
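
With that limit in place, the engine stops once 100 pages have been crawled, so the run can be awaited and anything collected through Subscribe() or a custom IScraperSink can be read afterwards.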
