
Is it possible to skip creation of the results files and just report if the links are valid? #103

Open
dingo-d opened this issue Apr 3, 2024 · 3 comments


dingo-d commented Apr 3, 2024

Hi.

I'm wondering if it's possible to use the link checker example to just check for valid links, and maybe store the results in a JSON or CSV file, instead of creating binary files and index.html files inside the results folder?

Should I try to create my own persistence handler for this?

Basically, I'd just like to crawl my website to check whether it has any 404 pages. I'm not necessarily interested in whether any of the links on a page returns 404; I just need to check that all my pages are healthy.


dingo-d commented Apr 3, 2024

I created a JsonPersistenceHandler.php:

<?php

use VDB\Spider\PersistenceHandler\FilePersistenceHandler;
use VDB\Spider\PersistenceHandler\PersistenceHandlerInterface;
use VDB\Spider\Resource;

class JsonPersistenceHandler extends FilePersistenceHandler implements PersistenceHandlerInterface
{
    protected string $defaultFilename = 'data.json';

    #[\Override]
    public function persist(Resource $resource)
    {
        $file = $this->getResultPath() . $this->defaultFilename;

        if (!file_exists($file)) {
            // Create the file if it doesn't exist yet.
            $fileHandler = fopen($file, 'w');
            $results = [];
        } else {
            // Open the existing file for reading and writing without truncating it.
            $fileHandler = fopen($file, 'c+');

            // Check that the file is not empty before reading.
            if (filesize($file) > 0) {
                // Read the file and decode the JSON; fall back to an empty array
                // if the contents are not valid JSON.
                $results = json_decode(fread($fileHandler, filesize($file)), true) ?? [];
            } else {
                $results = [];
            }
        }

        // Record the HTTP status code for this URL.
        $url = $resource->getUri()->toString();
        $statusCode = $resource->getResponse()->getStatusCode();

        $results[$url] = $statusCode;

        // Move the pointer to the beginning of the file before writing.
        rewind($fileHandler);

        // Write to the file, then truncate any leftover bytes in case the new
        // JSON is shorter than what was there before.
        fwrite($fileHandler, json_encode($results));
        ftruncate($fileHandler, ftell($fileHandler));

        // Close the file handler.
        fclose($fileHandler);
    }

    #[\Override]
    public function current(): Resource
    {
        // Kept from the serialized-file handler approach; only relevant when
        // iterating persisted results as Resource objects, which the JSON file
        // written above does not support.
        return unserialize($this->getIterator()->current()->getContents());
    }
}
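
Registering it should work the same way as the bundled file persistence handlers, roughly like this (the seed URL and results directory are just example values):

<?php

use VDB\Spider\Spider;

require 'vendor/autoload.php';
require_once 'JsonPersistenceHandler.php';

// The constructor argument is the directory the results end up in, as with
// the bundled file persistence handlers.
$spider = new Spider('https://example.com/');
$spider->getDownloader()->setPersistenceHandler(
    new JsonPersistenceHandler(__DIR__ . '/results')
);
$spider->crawl();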

And this kind of works. The only thing is that I don't get all the links from the website, only 125.

Can the crawler get the sitemap.xml and try to parse that to get all the links?

mvdbos (Owner) commented Apr 17, 2024

@dingo-d Currently the spider does not support parsing sitemap.xml.
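
A workaround outside the spider would be to fetch and parse the sitemap yourself and check each URL directly. A rough sketch, assuming a flat sitemap.xml with <loc> entries (a nested sitemap index would need an extra pass) and using only PHP's built-in DOM and cURL extensions:

<?php

// Fetch a sitemap and collect the URLs from its <loc> elements.
// Assumes allow_url_fopen is enabled; swap in cURL for the fetch if not.
function fetchSitemapUrls(string $sitemapUrl): array
{
    $xml = @file_get_contents($sitemapUrl);
    if ($xml === false) {
        return [];
    }

    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return [];
    }

    $urls = [];
    foreach ($doc->getElementsByTagNameNS('*', 'loc') as $loc) {
        $urls[] = trim($loc->textContent);
    }

    return $urls;
}

// Check every URL with a HEAD request and record its HTTP status code.
function checkUrls(array $urls): array
{
    $results = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo the response
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // report the final status after redirects
        curl_exec($ch);
        $results[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
    }

    return $results;
}

$statuses = checkUrls(fetchSitemapUrls('https://example.com/sitemap.xml'));
file_put_contents('sitemap-check.json', json_encode($statuses, JSON_PRETTY_PRINT));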

Your approach of a custom persistence handler in combination with the link checker seems correct.

Are you sure there are more than 125 links on the page/website? If so:

  • Are you sure your XPathExpressionDiscoverer is configured correctly to find all links? Are all links in the DOM on page render, or are some added later with JavaScript? Those won't be found, since PHP-spider does not use a headless browser.
  • Did you set a downloadLimit on the Downloader?
  • Did you leave some of the filters in place, such as UriWithHashFragmentFilter or UriWithQueryStringFilter? With those in place, URLs with fragments or query strings are skipped.
  • Did you set the maxDepth on the discovererSet? If so, and it is 1, the spider will limit itself to the current page and siblings, and not descend further.
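
For comparison, a setup that casts a wide net on those points could look roughly like this (based on the bundled examples; the seed, XPath expression, depth and download limit are placeholders to adjust for your site):

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\Filter\Prefetch\AllowedSchemeFilter;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;

$seed = 'https://example.com/';

$spider = new Spider($seed);

// Discover every anchor on each page; narrow the XPath expression if needed.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer('//a'));

// Stay on the same scheme and host, but leave out UriWithHashFragmentFilter
// and UriWithQueryStringFilter so those URLs are not skipped.
$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(['https']));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter([$seed], false));

// A higher maxDepth discovers more pages but makes the crawl (much) slower.
$spider->getDiscovererSet()->maxDepth = 3;

// Raise the download limit so discovery is not cut off early.
$spider->getDownloader()->setDownloadLimit(10000);

$spider->crawl();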

Interested to hear what you find.


dingo-d commented Apr 18, 2024

Are you sure your XPathExpressionDiscoverer is configured correctly to find all links? Are all links in the DOM on page render, or are some added later with JavaScript? Those won't be found, since PHP-spider does not use a headless browser.

The site is a WordPress site, so all links should be present. But it's good to know about the JS-added ones 👍🏼

Did you set a downloadLimit on the Downloader?

Yup, added $spider->getDownloader()->setDownloadLimit(1000000);

Did you leave some of the filters in place, such as UriWithHashFragmentFilter or UriWithQueryStringFilter? With those in place, URLs with fragments or query strings are skipped.

These are the filters I added:

$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('https')));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed), $allowSubDomains));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());
$spider->getDiscovererSet()->addFilter(new UriWithQueryStringFilter());

Did you set the maxDepth on the discovererSet? If so, and it is 1, the spider will limit itself to the current page and siblings, and not descend further.

I had $spider->getDiscovererSet()->maxDepth = 2;. I tried setting it to something like 10, but that took too long. Even with 2, the crawler ran for over an hour and still didn't finish crawling 😅

All in all I did get the JSON file with some 503 statuses.

The idea was to use it as a site-health checker.
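
Reporting on the data.json output is then just a matter of filtering the status codes, along these lines (adjust the path to wherever getResultPath() writes on your setup):

<?php

// Read the "url" => status code map written by JsonPersistenceHandler and
// list every page that did not answer with a 2xx status.
$results = json_decode(file_get_contents(__DIR__ . '/results/data.json'), true) ?? [];

$broken = array_filter(
    $results,
    fn (int $status): bool => $status < 200 || $status >= 300
);

foreach ($broken as $url => $status) {
    printf("%d  %s\n", $status, $url);
}

printf("%d of %d pages are not healthy\n", count($broken), count($results));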
