rozetka 3 actors added #533

vladyslav-n · 2021-04-14T12:39:35Z

These are 3 new TypeScript actors for the Hlidac Rozetka project by Geniusee, with edits made after the previous pull request attempt.

rarous · 2021-04-14T12:53:03Z

Please do not use a custom editorconfig, eslintrc, or prettierrc. Also, please do not use TypeScript. Actors have to be directly usable without any compilation steps.

vladyslav-n · 2021-04-14T13:27:10Z

Dear Aleš, I'm a developer at Geniusee, and we are partners with Apify. We usually develop our actors for Apify in TypeScript with a dist folder included, so that the actor doesn't require any build steps before it runs. Does that still mean we should change the TypeScript code to JavaScript? If it's crucial for the project requirements, we'll surely do so. Sincerely yours, Vlad

vladyslav-n · 2021-04-14T14:08:29Z

Oh, I see Hlidac has the dist/ directory in .gitignore, so I missed that the dist directory hadn't actually been pushed to the repo. Sorry for that. I could simply rename it to 'build/' so that it gets pushed for sure. In that case, will TypeScript still need to be replaced with JavaScript? Thanks for your time!

vladyslav-n · 2021-04-16T11:09:44Z

Hey @rarous,
I ported the Rozetka actors to JavaScript. Please check them out.
Have a nice day!

metalwarrior665

Looks pretty good. I haven't checked everything in depth, nor have I tested whether it works. Please look at the comments. Thanks.

actors/rozetka-count/Dockerfile

actors/rozetka-count/src/main.js

metalwarrior665 · 2021-04-16T20:43:04Z

actors/rozetka-count/src/main.js

+    await crawler.run();
+    log.info('Crawl finished.');
+
+    await Apify.pushData({ OUTPUT: await getOrIncStatsValue() })


@zpelechova Is this how the dataset should look?

@zpelechova Since the specs gave no details on the output of the count actor, it would be great to know the correct way of doing this. I saw an example in another actor as { totalCount: value }. Is this the way it should be done?

actors/rozetka-count/src/tools.js

metalwarrior665 · 2021-04-16T20:50:52Z

actors/rozetka-daily/src/consts.js

@@ -0,0 +1,42 @@
+export const LABELS = {


Most of the code in this second actor looks copy-pasted from the first one. That will make maintenance harder. I would use a single folder for both and just change the behaviour via input or an env var. cc @rarous

Yeah, it looks like having separate actors for the count functionality is not a good idea.

Well, it could be multiple actors, but it definitely should not be multiple folders. But I will leave that up to you.

I guess it would be most efficient to count the results and scrape the data in parallel, considering that the crawling logic has to be very similar. I just don't clearly understand how we should handle the output if one actor serves both purposes (though there are plenty of ways to handle this; for example, we could save the results to separate named datasets for the count and the daily actors). But reusing the code is great, and I'm very glad that I'm allowed to do that here.
So, to sum up, which way should I choose: separate actors with some shared code, or a single actor (in which case some more specs about the output should be provided)?

We already have a common library for reusable code, but it should be for code that is reusable across all or most actors.

This case should IMHO be handled just by a type: "COUNT" input parameter. It will be one actor with several modes (we already have this for Black Friday scraping in older actors). It will be scheduled with different input parameters. This mode will just write to the Dataset but skip the Keboola upload step.

Sorry for the inconvenience. We are still figuring out the process and shape; the count functionality is a new requirement for internal benchmarking of scraped data.

I agree with @rarous's approach; it will be the simplest.
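
For illustration only, the single-actor, multiple-mode idea described above could be sketched roughly as follows. This is not the code from this PR: `finishRun`, `pushData`, and `uploadToKeboola` are hypothetical names injected as plain functions, the `{ totalCount }` record shape is only the example mentioned earlier in the thread, and defaulting to "DAILY" is an assumption.

```javascript
// Mode names as they appear in this thread; any other mode is an assumption.
const MODES = { COUNT: 'COUNT', DAILY: 'DAILY' };

/**
 * Finish a run according to the `type` input parameter.
 * `pushData` and `uploadToKeboola` are injected placeholders (hypothetical),
 * standing in for whatever storage and upload helpers the actor really uses.
 */
async function finishRun({ type = MODES.DAILY, items, pushData, uploadToKeboola }) {
    if (!Object.values(MODES).includes(type)) {
        throw new Error(`Unsupported type "${type}"`);
    }
    if (type === MODES.COUNT) {
        // COUNT mode: write one summary record to the dataset, skip the upload.
        await pushData({ totalCount: items.length });
        return { uploaded: false };
    }
    // DAILY mode: store every scraped item, then run the upload step.
    await pushData(items);
    await uploadToKeboola(items);
    return { uploaded: true };
}
```

With this shape, the same actor could be scheduled twice with different inputs, e.g. `{ "type": "COUNT" }` and `{ "type": "DAILY" }`, and only the final step differs.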

Hey, I guess it is not properly explained in the docs, but I don't think it makes sense to have an actor for count which does the same as the main actor. The idea behind it is to double-check the result, i.e. to find and reasonably use the numbers which tell how many items there are in each category, like with rozetka here:


vladyslav-n · 2021-04-21T12:05:08Z

@rarous @metalwarrior665 Hi guys!
I suppose the code is ready for the second round of code review.

metalwarrior665

Regarding my points, it is done. But @rarous needs to check that it works with the system, and @zpelechova needs to check the output.

janfiedler · 2021-07-12T10:02:57Z

@vladyslav-n I tried running the actor with "type" = "COUNT" and "type" = "DAILY", and I am getting errors quite frequently:

ArgumentError: Expected property string "url" to be a URL, got "/laboratornoe-osnashchenie/c4644808/page=11/" in object "requestLike"
      at Object.ow [as default] (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/ow/dist/index.js:19:23)
      at RequestQueue.addRequest (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/storages/request_queue.js:173:21)
      at enqueueLastPage (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/routes/helpers/enqueueLastPage.js:17:24)
      at countProductsOrSplitPriceRange (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/routes/helpers/countProductsOrSplitPriceRange.js:21:15)
      at handleProductList (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/routes/handleProductList.js:43:15)
      at CheerioCrawler.handlePageFunction [as userProvidedHandler] (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/main.js:68:27)
      at CheerioCrawler._handleRequestFunction (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/crawlers/cheerio_crawler.js:452:49)
      at processTicksAndRejections (node:internal/process/task_queues:96:5)
      at async CheerioCrawler._runTaskFunction (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/crawlers/basic_crawler.js:423:13)
      at async AutoscaledPool._maybeRunTask (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/autoscaling/autoscaled_pool.js:399:13)

It looks like this happens only when part of the URL is /page=x/

vladyslav-n · 2021-07-12T11:22:33Z

@janfiedler Hi!
It seems there's an issue with relative paths in URLs. I will look at it and give some feedback tomorrow.
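
As a minimal sketch of one common way to handle this kind of error (not necessarily the fix that was applied in this PR), a relative href can be resolved against the URL of the page it was found on before it is added to the request queue. `toAbsoluteUrl` is a hypothetical helper name and `https://example.com` is a placeholder base, not the real site address.

```javascript
/**
 * Resolve a possibly relative href against the URL of the page it came from.
 * Returns an absolute URL string suitable for a request's `url` property.
 */
function toAbsoluteUrl(href, pageUrl) {
    // The WHATWG URL constructor resolves relative paths against the base
    // and leaves already absolute URLs unchanged.
    return new URL(href, pageUrl).href;
}

// Placeholder base URL (assumption); the path is the one from the error above.
const absolute = toAbsoluteUrl(
    '/laboratornoe-osnashchenie/c4644808/page=11/',
    'https://example.com/some-category/',
);
console.log(absolute);
// prints "https://example.com/laboratornoe-osnashchenie/c4644808/page=11/"
```

Passing the current request URL as the base keeps the helper independent of any hard-coded domain.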

vladyslav-n · 2021-08-02T17:04:49Z

@janfiedler Hi!
Sorry for the delay. I made some updates to the actor's crawling and also explicitly added a new allowed content-type to the actor; it seems that without it the actor failed to work at all in both modes, COUNT and DAILY. The crawling changes were needed because of changes in the site's category logic.

Vladyslav Nazarenko added 2 commits April 14, 2021 15:33

rozetka 3 actors added

7415bc6

Docker images tags specified

ff2a737

Port Rozetka actors from typescript to javascript

6f8e07b

metalwarrior665 requested changes Apr 16, 2021

View reviewed changes

Actors 1 and 3 combined into one. Minor style changes to the "info" a…

4ad45ab

…ctor

vladyslav-n requested a review from metalwarrior665 April 22, 2021 19:07

metalwarrior665 approved these changes Apr 22, 2021

View reviewed changes

crawling update, allowed content-types update

575613b

metalwarrior665 Apr 16, 2021

vladyslav-n Apr 19, 2021

metalwarrior665 Apr 16, 2021

rarous Apr 19, 2021

metalwarrior665 Apr 19, 2021

vladyslav-n Apr 19, 2021

rarous Apr 19, 2021

metalwarrior665 Apr 19, 2021

zpelechova Apr 20, 2021 (edited)
