"How does it work?" diagrams (WIP)

Matteo Cargnelutti edited this page Apr 4, 2023 · 4 revisions

Simplified diagrams to explain how Scoop captures the web.


Bird's-eye view: Capture, main URL is a web page

flowchart LR
    A[Scoop]
    B[Playwright]
    C[Chromium]
    D[Website]
    E[HTTP Proxy]
    A <--> |Controls| B
    B <--> C
    C <--> D
    A <-.-> |Capture| E <-.-> C

Unless specified otherwise, everything the browser "sees" is captured via the HTTP proxy, which allows for the enforcement of time and size constraints while preserving partial responses.
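
To make the "time and size constraints while preserving partial responses" idea concrete, here is a minimal sketch in Python. It is not Scoop's proxy code: the function name, the status strings, and the chunk-based interface are all assumptions made for illustration. The point it shows is that when a limit is hit, the bytes received so far are returned rather than thrown away.

```python
import time
from typing import Callable, Iterable, Tuple


def read_with_limits(
    chunks: Iterable[bytes],
    max_bytes: int,
    deadline: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bytes, str]:
    """Collect chunks until the stream ends or a limit is reached.

    Returns the bytes gathered so far and the reason reading stopped,
    so a truncated body is kept instead of being discarded.
    (Hypothetical helper, not part of Scoop.)
    """
    body = bytearray()
    for chunk in chunks:
        # Time limit: stop, but keep what was already received.
        if clock() >= deadline:
            return bytes(body), "timeout"
        remaining = max_bytes - len(body)
        # Size limit: keep only the part of the chunk that still fits.
        if len(chunk) > remaining:
            body += chunk[:remaining]
            return bytes(body), "size-limit"
        body += chunk
    return bytes(body), "complete"


body, status = read_with_limits(
    [b"abcd", b"efgh", b"ijkl"], max_bytes=6, deadline=time.monotonic() + 5
)
# body == b"abcdef", status == "size-limit"
```

Passing the clock as a parameter is only a convenience for testing the time limit without waiting.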


Bird's-eye view: Capture, main URL is not a web page

flowchart LR
    A[Scoop]
    B[curl, yt-dlp ...]
    C[Resource]
    D[HTTP Proxy]
    A <--> |Controls| B
    B <--> C
    A <-.-> |Capture| D <-.-> B

Unless specified otherwise, everything captured "out of band" goes through the HTTP proxy.
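
One way to picture "out of band" tools going through the proxy is shown below, as a Python sketch. Both `curl` and `yt-dlp` accept a `--proxy` option, but whether Scoop passes the proxy this way is an assumption; the helper name and the proxy address are hypothetical. The sketch only builds the command line and does not run anything.

```python
from typing import List


def proxied_command(tool: str, url: str, proxy_url: str) -> List[str]:
    """Build an argv list that routes the tool's traffic through a proxy.

    Hypothetical helper: it illustrates the idea, not Scoop's invocation.
    """
    if tool in ("curl", "yt-dlp"):
        # Both tools take the proxy address via a --proxy option.
        return [tool, "--proxy", proxy_url, url]
    raise ValueError(f"no proxy recipe for {tool!r}")


# Hypothetical local proxy address, for illustration only.
command = proxied_command("curl", "https://example.com/file.pdf", "http://localhost:9000")
# command == ["curl", "--proxy", "http://localhost:9000", "https://example.com/file.pdf"]
```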


Scoop.constructor() sequence

flowchart TD
    A(Url)
    B(Options)
    A-->C
    B-->C
    C[Scoop class]
    C-->D

    D([Filter options])
    D-->E

    E([Filter url])
    E-->F

    F{{Ready to capture}}

Notes:

  • Filter options: Defaults are used for options that are not explicitly provided.
  • Filter url: The URL must be well-formed and must not match the blocklist.
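
The two filtering steps can be sketched as follows. This is a Python illustration, not Scoop's implementation: the option names, default values, and blocklist entries are invented for the example, and the real blocklist format may differ.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical defaults and blocklist, for illustration only.
DEFAULTS = {"screenshot": True, "dom_snapshot": False, "capture_timeout_ms": 60_000}
BLOCKED_HOSTS = {"localhost"}
BLOCKED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
]


def filter_options(options: dict) -> dict:
    """Fill in defaults for every option the caller did not provide."""
    return {**DEFAULTS, **options}


def filter_url(url: str) -> str:
    """Return the URL if it is well-formed and not blocklisted, else raise."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"invalid url: {url!r}")
    host = parsed.hostname
    if host in BLOCKED_HOSTS:
        raise ValueError(f"blocked host: {host}")
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return url  # Not an IP literal, so only the hostname check applies.
    if any(address in network for network in BLOCKED_NETWORKS):
        raise ValueError(f"blocked address: {host}")
    return url


options = filter_options({"screenshot": False})
url = filter_url("https://example.com/")
```

Only after both calls succeed would an instance be "ready to capture"; a rejected URL fails early, before any browser or proxy is started.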

Scoop.capture() sequence

stateDiagram-v2
    state "Start Browser" as browser
    state "Start Proxy" as intercepter
    state "Detect non-web resource" as nonwebdetect
    state "Capture of non-web resource" as nonwebcapture
    state "Initial page load" as pageload
    state "Capture page info" as pageinfo
    state "Browser scripts" as browserscripts
    state "Network idle" as networkidle
    state "Scroll up" as scrollup
    state "Screenshot" as screenshot
    state "DOM snapshot*" as domsnapshot
    state "PDF snapshot*" as pdfsnapshot
    state "Capture video(s) as attachments" as capturevideo
    state "Detect noarchive directive" as detectnoarchive
    state "Capture of certificates" as certscapture
    state "Gather Provenance Info" as provenanceinfo
    state "Teardown" as teardown

    [*] --> browser
    browser --> intercepter
    intercepter --> nonwebdetect
    nonwebdetect --> nonwebcapture
    nonwebdetect --> pageload

    nonwebcapture --> certscapture

    pageload --> pageinfo
    pageinfo --> browserscripts
    browserscripts --> networkidle
    networkidle --> scrollup
    scrollup --> screenshot
    screenshot --> domsnapshot
    domsnapshot --> pdfsnapshot
    pdfsnapshot --> capturevideo
    capturevideo --> detectnoarchive
    detectnoarchive --> certscapture
    certscapture --> provenanceinfo
    provenanceinfo --> teardown
    teardown --> [*]

Notes:

  • Steps marked with * are deactivated by default.
  • Unless specified otherwise
  • Capture state is used to determine whether the next step should run.
  • Each step counts towards the overall capture time and size limits, unless specified otherwise via an option flag. See options list for details.
  • At the end of this capture process, Scoop holds everything it captured in memory, as state of the Scoop class.
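
The notes above describe a state-gated sequence of steps. A minimal Python sketch of that pattern follows. The class names, state names ("capturing", "partial"), and log format are assumptions for illustration and do not mirror Scoop's internals; only the time limit is modeled, not the size limit.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Capture:
    deadline: float                      # point in time at which the time limit is hit
    state: str = "capturing"             # hypothetical state names
    log: List[str] = field(default_factory=list)


@dataclass
class Step:
    name: str
    run: Callable[[Capture], None]
    enabled: bool = True                 # steps marked with * would default to False


def run_steps(
    capture: Capture,
    steps: List[Step],
    clock: Callable[[], float] = time.monotonic,
) -> Capture:
    """Run enabled steps in order, skipping the rest once the state changes."""
    for step in steps:
        if not step.enabled:
            continue
        # Every step counts against the shared time budget.
        if clock() >= capture.deadline:
            capture.state = "partial"
        # The capture state decides whether the next step runs.
        if capture.state != "capturing":
            capture.log.append(f"skipped {step.name}")
            continue
        step.run(capture)
        capture.log.append(f"ran {step.name}")
    return capture


steps = [
    Step("screenshot", lambda c: None),
    Step("dom snapshot", lambda c: None, enabled=False),
    Step("capture videos", lambda c: None),
]
result = run_steps(Capture(deadline=time.monotonic() + 60), steps)
# result.log == ["ran screenshot", "ran capture videos"]
```

Everything the steps produce stays on the `Capture` object, which matches the idea that the capture is held in memory until it is exported.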

Scoop.toWARC() sequence

flowchart TD
    A(Scoop Instance)
    B(gzip flag*)
    A-->C
    B-->C
    C[scoopToWARC]
    C-->D

    D([Check capture state])
    D-->E

    E([Generate WARC info section])
    E-->F

    F([Generate WARC records section])
    F-->G

    G([Merge sections])
    G-->H

    H[WARC as ArrayBuffer]

Notes:

  • The WARC sections are generated using warcio.js, which also handles per-segment GZIP compression.
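
Per-segment compression relies on a property of the gzip format: independently compressed members can be concatenated and still form a valid gzip stream. The Python sketch below demonstrates that property with the standard library. It does not use warcio.js, and the record payloads are placeholders rather than real WARC records.

```python
import gzip
from typing import Iterable


def gzip_per_record(records: Iterable[bytes]) -> bytes:
    """Compress each record as its own gzip member, then concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


# Placeholder payloads standing in for serialized WARC records.
records = [b"record-1: warcinfo\r\n\r\n", b"record-2: response\r\n\r\n"]
archive = gzip_per_record(records)

# A multi-member gzip stream decompresses to the concatenation of its members.
restored = gzip.decompress(archive)
```

Because each member is self-contained, sections compressed separately can be merged by simple byte concatenation, which fits the "Merge sections" step in the diagram.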

Scoop.toWACZ() sequence

flowchart TD
    A(Scoop Instance)
    B(RAW flag*)
    C(Signing server info*)
    A-->D
    B-->D
    C-->D
    D[scoopToWACZ]
    D-->E

    E([Check capture state])
    E-->F

    F([scoopToWARC])
    F-->G 

    G([js-wacz on WARC])

    H[Signing Server]
    G<-.->|Optional| H

    J[WACZ as ArrayBuffer]
    G-->J

    I[RAW exchanges]
    G<-.->|Optional|I
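
As a rough picture of the packaging step, the Python sketch below wraps WARC bytes in a ZIP container together with a small manifest that records the file's size and SHA-256 digest. It is only loosely modeled on the WACZ layout and is not what js-wacz does: a real WACZ contains more (indexes, page lists, and optionally signature data), none of which is modeled here. The function name and manifest fields are assumptions for illustration.

```python
import hashlib
import io
import json
import zipfile


def pack_wacz_like(warc: bytes) -> bytes:
    """Wrap WARC bytes in a ZIP with a manifest listing the file and its digest.

    Simplified illustration only; the result is not a spec-compliant WACZ.
    """
    manifest = {
        "resources": [
            {
                "name": "data.warc",
                "path": "archive/data.warc",
                "hash": "sha256:" + hashlib.sha256(warc).hexdigest(),
                "bytes": len(warc),
            }
        ]
    }
    buffer = io.BytesIO()
    # Build the ZIP entirely in memory, mirroring an "as ArrayBuffer" result.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("archive/data.warc", warc)
        archive.writestr("datapackage.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


packed = pack_wacz_like(b"placeholder WARC bytes")
```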

Capture in CLI context

stateDiagram-v2
    state "Invoke" as args
    state "Parse options" as options
    state "Scoop.capture(url, options)" as capture
    state "capture.toWARC / toWACZ" as export
    state "Save to disk" as save
    state "Exit with error code" as exit

    [*] --> args
    args --> options
    options --> capture
    capture --> export: Scoop Instance
    export --> save: ArrayBuffer
    save --> exit 
    exit --> [*]
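
The CLI flow above (parse options, capture, export, save, exit) can be sketched in Python as follows. This is not Scoop's CLI: the `--output` and `--timeout` flags are hypothetical, and `capture()` is a stand-in that returns placeholder bytes instead of performing a capture and export.

```python
import argparse
import sys
from pathlib import Path
from typing import List, Optional


def capture(url: str, timeout: int) -> bytes:
    """Stand-in for the capture and export steps; returns placeholder bytes."""
    return f"placeholder archive for {url} (timeout={timeout}s)\n".encode()


def main(argv: Optional[List[str]] = None) -> int:
    """Parse options, capture, save to disk, and return an exit code."""
    parser = argparse.ArgumentParser(description="Capture a URL and save it to disk.")
    parser.add_argument("url")
    parser.add_argument("--output", default="archive.wacz")  # hypothetical flag
    parser.add_argument("--timeout", type=int, default=60)   # hypothetical flag
    args = parser.parse_args(argv)
    try:
        data = capture(args.url, args.timeout)
        Path(args.output).write_bytes(data)
    except Exception as error:
        print(f"capture failed: {error}", file=sys.stderr)
        return 1  # non-zero exit code signals failure
    return 0
```

A script entry point would typically end with `sys.exit(main())`, so that the process exit code reflects whether the capture was saved.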