Skip to content

Latest commit

 

History

History
31 lines (21 loc) · 1.16 KB

8continuous-crawling.md

File metadata and controls

31 lines (21 loc) · 1.16 KB

Continuous Crawling

Prev Home Next

In PulsarRPA, continuous crawling is very simple; you just need to submit links to the UrlPool, and the crawling loop will start automatically. PulsarRPA's infrastructure also ensures core issues such as data quality and scheduling quality.

For small-scale data collection projects, such as monitoring hundreds of competitor product prices, inventory status, and new reviews every day, continuous crawling can be used.

You can start continuous crawling with the following code:

fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: FeaturedDocument ->
        // do something wonderful with the document
        println(document.title + "\t|\t" + document.baseUri)

        // extract more links from the document
        context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
    }

    val urls = LinkExtractors.fromResource("seeds10.txt").map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls).await()
}

Prev Home Next