
Duplicate requests being dispatched even with RequestDeduplicationMiddleware in place #36

Open · awebartisan opened this issue Apr 24, 2022 · 16 comments
Labels: accepted (The issue has been accepted and is ready to be worked on), bug (Something isn't working)

@awebartisan

I have a list of URLs in the database and I'm scraping specific information from these URLs.
I have split the URLs into batches of 50 and dispatch a job for each batch, passing it the database offset to start from.

Each job fetches its 50 URLs from the database and the spider starts sending requests: 2 concurrent requests with a 1 second delay.
At some point it starts sending duplicate requests, as can be seen below, and the deduplication middleware doesn't report or drop them. Not sure what's going on here. Any thoughts?

[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
@awebartisan (Author)

Is it possible that multiple instances of the same Spider are using the same requests?

@ksassnowski (Contributor)

Are these logs from multiple spider runs or are they all from the same run? The RequestDeduplicationMiddleware only looks at requests that have been sent during the current run. So if you start multiple spiders with the same URLs, they will all scrape the same site.
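
To illustrate that point, per-run deduplication conceptually looks something like this (a simplified sketch, not RoachPHP's actual implementation): the set of seen URIs lives on the middleware instance, which exists only for the duration of a single run.

    final class PerRunDeduplicationSketch
    {
        /** @var array<string, bool> URIs seen during this run only */
        private array $seenUris = [];

        public function shouldDropRequest(string $uri): bool
        {
            if (isset($this->seenUris[$uri])) {
                return true; // duplicate within the same run: drop it
            }

            $this->seenUris[$uri] = true;

            return false; // first occurrence in this run: let it through
        }
    }

A second spider run gets a fresh instance with an empty map, so URLs already crawled by another run are never considered duplicates.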

My first guess would be that you are dispatching multiple jobs at the same time and they all query the same records from the database. Can you maybe show what the code that dispatches your jobs looks like?

@awebartisan (Author) commented Apr 24, 2022

This is how I am dispatching jobs from a console command:

    public function handle(): int
    {
        for ($offset = 1; $offset <= 1000; $offset = $offset + 50) {
            dispatch(new ScrapeStoreSocialLinksJob($offset));
        }

        return 0;
    }

Below is what my job looks like:

    public $timeout = 300;

    public function __construct(public int $offset)
    {}

    public function handle()
    {
        Roach::startSpider(StoreSocialLinksSpider::class, context: ['offset' => $this->offset]);
    }

These logs are from different runs, but from the logs I can see that these runs start at the same time and end at the same time.

I have even tried to chain these jobs so that the next job only gets dispatched after the previous one completes, but I still get duplicate runs.
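
For reference, chaining the jobs with Laravel's job chaining would look something like this (a sketch reusing the same ScrapeStoreSocialLinksJob; jobs in a Bus::chain run sequentially, each dispatched only after the previous one completes successfully):

    use Illuminate\Support\Facades\Bus;

    // Each job only runs after the previous one in the chain has finished.
    Bus::chain([
        new ScrapeStoreSocialLinksJob(1),
        new ScrapeStoreSocialLinksJob(51),
        new ScrapeStoreSocialLinksJob(101),
    ])->dispatch();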

@ksassnowski (Contributor)

Can you show what the initialRequests method of your spider looks like?

@awebartisan (Author)

    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    "https://" . $shopifyStore->url,
                    [$this, 'parse']
                );
                return $request->withMeta('store_id', $shopifyStore->id);
            })->toArray();
    }

Behaviour I noticed in the logs:

  • When the first 5 jobs are dispatched, everything works as expected.
  • When one of the first 5 jobs is completed and the 6th is dispatched, I see 2 requests being duplicated.
  • When a second job from the first 5 is completed and the 7th is dispatched, I see 3 requests being duplicated.

Below are some stats from the logs:

[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":150,"requests.dropped":0,"items.scraped":146,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":100,"requests.dropped":0,"items.scraped":98,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":50,"requests.dropped":0,"items.scraped":48,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run finished
[2022-04-25 05:31:36] local.INFO: Run finished
[2022-04-25 05:31:36] local.INFO: Run finished

@ksassnowski (Contributor)

This may be a silly question, but does your ShopifyStore model contain any duplicates? I can't really see what could be going wrong otherwise. It's also a little strange how the requests.sent and items.scraped both change by exactly 50 (which is also your limit). Does your parse method dispatch additional requests for certain responses?

@awebartisan (Author)

After your comment I went ahead and checked for duplicates in the table. There were indeed some duplicates, so I removed them.
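
For reference, this is one way to check the table for duplicate URLs (a sketch assuming a url column on the ShopifyStore model):

    // Lists every URL that appears more than once in the table.
    $duplicateUrls = ShopifyStore::query()
        ->select('url')
        ->groupBy('url')
        ->havingRaw('COUNT(*) > 1')
        ->pluck('url');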

But the problem is still happening.

Below is my Spider's full source code:

<?php

namespace App\Spiders;

use App\Extractors\Stores\AssignCategory;
use App\Extractors\Stores\ExtractContactUsPageLink;
use App\Extractors\Stores\ExtractDescription;
use App\Extractors\Stores\ExtractFacebookProfileLink;
use App\Extractors\Stores\ExtractInstagramProfileLink;
use App\Extractors\Stores\ExtractLinkedInProfileLink;
use App\Extractors\Stores\ExtractTikTokProfileLink;
use App\Extractors\Stores\ExtractTitle;
use App\Extractors\Stores\ExtractTwitterProfileLink;
use App\Models\ShopifyStore;
use App\Processors\SocialLinksDatabaseProcessor;
use Generator;
use Illuminate\Pipeline\Pipeline;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class StoreSocialLinksSpider extends BasicSpider
{
    public array $startUrls = [
        //
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //SocialLinksDatabaseProcessor::class,
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        $storeData = [
            'store_id' => $response->getRequest()->getMeta('store_id')
        ];

        [, $storeData] = app(Pipeline::class)
            ->send([$response, $storeData])
            ->through([
                ExtractTitle::class,
                ExtractDescription::class,
                ExtractTwitterProfileLink::class,
                ExtractFacebookProfileLink::class,
                ExtractInstagramProfileLink::class,
                ExtractTikTokProfileLink::class,
                ExtractLinkedInProfileLink::class,
                ExtractContactUsPageLink::class
            ])
            ->thenReturn();

        yield $this->item($storeData);
    }

    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    "https://" . $shopifyStore->url,
                    [$this, 'parse']
                );
                return $request->withMeta('store_id', $shopifyStore->id);
            })->toArray();
    }
}

The parse() method is not making any additional requests.

My thinking here is that something is going on with the Spider's instance and the container.

@ksassnowski (Contributor)

So my thinking is that the spiders aren't actually sending duplicate requests, but that the extensions (the Logger and StatsCollector, specifically) are reacting to events from different spiders. Couple more questions:

  • Are your jobs actually being queued or do they run on the sync queue?
  • Can you verify that you actually get duplicated items in your SocialLinksDatabaseProcessor?
  • Are you using Laravel Octane?

@awebartisan (Author)

  • I am using Redis + Laravel Horizon for queues
  • I can verify that in a short while (but it can be assumed that this processor just receives the items as they are scraped, so they would contain the duplicates)
  • Not using Laravel Octane

@awebartisan (Author)

Hey @ksassnowski, you are right about the second part. In my SocialLinksDatabaseProcessor I am not getting duplicate items for the duplicate URLs.

So your thinking about the extensions like Logger and StatsCollector sounds right to me.

@code-poel

Just wanted to chime in that I'm experiencing something similar. I have two spiders being executed from a single Laravel Command. Executing one (or the other) results in the StatsCollector outputting the expected results. However, if I execute both spiders, I get a third StatsCollector output that looks like a combination of both. Even if I put a sleep(5) between their execution in the Command, the third, cumulative StatsCollector output still occurs.

@ksassnowski (Contributor)

I understand why this happens in your case, @code-poel. Assuming your handle method looks something like this:

    public function handle()
    {
        Roach::startSpider(MySpider1::class);
        Roach::startSpider(MySpider2::class);
    }

This is because the EventDispatcher that all extensions rely on gets registered as a singleton. So every spider you run in the same PHP "process" will essentially register its extensions as event listeners again. That's why I was wondering if @awebartisan used Laravel Octane or something similar. It sounded like his commands only spawn a single spider per command so that shouldn't happen.
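
To make the root cause concrete, here is a self-contained sketch (hypothetical classes, not RoachPHP's actual code) of how a shared singleton dispatcher accumulates listeners across runs in the same process:

    final class SharedDispatcher
    {
        /** @var list<callable(string): void> */
        private array $listeners = [];

        public function listen(callable $listener): void
        {
            $this->listeners[] = $listener;
        }

        public function dispatch(string $event): void
        {
            foreach ($this->listeners as $listener) {
                $listener($event);
            }
        }
    }

    $dispatcher = new SharedDispatcher(); // the "singleton" shared by all runs

    // Run 1 registers its logger extension and dispatches an event...
    $dispatcher->listen(fn (string $e) => print("run 1 saw: {$e}\n"));
    $dispatcher->dispatch('request.sent'); // logged once

    // Run 2 registers another logger on the *same* dispatcher...
    $dispatcher->listen(fn (string $e) => print("run 2 saw: {$e}\n"));
    $dispatcher->dispatch('request.sent'); // now logged twice

This would also explain the run statistics above: the earliest-registered StatsCollector would count events from all three runs sharing the process (150 = 3 × 50), the second from two runs (100), and the last only from its own (50).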

@ksassnowski (Contributor)

The solution might be to assign every run a unique id and include that as part of the event payload. Then I could scope the events and all corresponding handlers to just that id, even if multiple spiders get started in the same process. I have to check if this can be done without a BC break.
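
A minimal sketch of that idea (the event shape and runId property are assumptions for illustration, not RoachPHP's actual API):

    // Hypothetical: wraps an extension's handler so it only reacts to
    // events carrying the run id it was created for.
    final class ScopedListener
    {
        public function __construct(
            private string $runId,
            private \Closure $handler,
        ) {
        }

        public function __invoke(object $event): void
        {
            // Events from other runs on the same dispatcher are ignored.
            if (($event->runId ?? null) === $this->runId) {
                ($this->handler)($event);
            }
        }
    }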

@code-poel
> I understand why this happens in your case, @code-poel. Assuming your handle method looks something like this:
>
>     public function handle()
>     {
>         Roach::startSpider(MySpider1::class);
>         Roach::startSpider(MySpider2::class);
>     }
>
> This is because the EventDispatcher that all extensions rely on gets registered as a singleton. So every spider you run in the same PHP "process" will essentially register its extensions as event listeners again. That's why I was wondering if @awebartisan used Laravel Octane or something similar. It sounded like his commands only spawn a single spider per command so that shouldn't happen.

Yup, that's exactly right. Thanks for the clarification on the root cause!

@ksassnowski added the "bug" label Jun 18, 2022
@ksassnowski self-assigned this Mar 10, 2023
@ksassnowski added the "accepted" label Mar 10, 2023
@wengooooo

This bug has existed for more than a year. Why hasn't it been fixed by now?

@ksassnowski (Contributor)

Because no one has opened a PR yet to fix it.
