0.1.0 RC #93

Merged
merged 1 commit into from Mar 22, 2024
5 changes: 3 additions & 2 deletions README.md
@@ -92,12 +92,13 @@ js-wacz create --file cool-beans.warc --output cool-beans.wacz

### --pages, -p

- Pass a specific [pages.jsonl](https://specs.webrecorder.net/wacz/1.1.1/#pages-jsonl) file.
+ Path to a folder containing [pages.jsonl](https://specs.webrecorder.net/wacz/1.1.1/#pages-jsonl) files (`pages.jsonl`, `extraPages.jsonl` ...).

If not provided, **js-wacz** is going to attempt to detect pages in WARC records to build its own `pages.jsonl` index.

```bash
- js-wacz create -f "collection/*.warc.gz" --pages collection/pages.jsonl
+ # Assuming the following file exists: /collections/pages/pages.jsonl
+ js-wacz create -f "collection/*.warc.gz" --pages collection/pages/
```

### --cdxj
10 changes: 5 additions & 5 deletions index.js
@@ -600,7 +600,7 @@ export class WACZ {
}

/**
- * Copies pages.jsonl and extraPages.jsonl files in this.pagesDir into ZIP.
+ * Copies pages.jsonl and extraPages.jsonl files in `this.pagesDir` into ZIP.
* @returns {Promise<void>}
*/
copyPagesFilesToZip = async () => {
@@ -619,8 +619,9 @@ export class WACZ {
const filenameLower = filename.toLowerCase()
const pagesFile = resolve(this.pagesDir, filename)

+ // Ensure file is JSONL
if (!filenameLower.endsWith('.jsonl')) {
- log.warn(`Pages: Skipping file ${pagesFile}, does not end with jsonl extension`)
+ log.warn(`Pages: Skipping file ${basename(pagesFile)}: does not end with jsonl extension.`)
continue
}

@@ -644,7 +645,7 @@ export class WACZ {
} catch (err) {
isValidJSONL = false
log.trace(err)
- log.warn(`Pages: Skipping file ${pagesFile}, not valid JSONL`)
+ log.warn(`Pages: Skipping file ${basename(pagesFile)}: not valid JSONL / page entry.`)
break
}
}
@@ -656,7 +657,7 @@ export class WACZ {
}

/**
- * Streams all the files listes in `this.WARCs` to the output ZIP.
+ * Streams all the files listed in `this.WARCs` to the output ZIP.
* @returns {Promise<void>}
*/
writeWARCsToZip = async () => {
@@ -886,7 +887,6 @@ export class WACZ {
addCDXJ = (cdjx) => {
this.stateCheck()
this.indexFromWARCs = false
-
this.cdxTree.setIfNotPresent(cdjx, true)
}

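For readers following the `copyPagesFilesToZip` changes above, the skip logic can be approximated outside the class. This is only a minimal sketch under stated assumptions: `listValidPagesFiles` is a hypothetical helper that is not part of js-wacz, it uses `console.warn` in place of the project's `log`, and it only checks that each non-empty line parses as JSON (the real method also appears to validate page entries, which is not reproduced here).

```javascript
// Hypothetical helper, NOT part of js-wacz: mirrors the two checks shown in the diff.
const fs = require('node:fs')
const { basename, resolve } = require('node:path')

/**
 * Lists files in `pagesDir` that end with `.jsonl` (case-insensitive)
 * and whose non-empty lines all parse as JSON.
 * @param {string} pagesDir - Folder containing pages files.
 * @returns {string[]} Sorted filenames that passed both checks.
 */
function listValidPagesFiles (pagesDir) {
  const valid = []

  for (const filename of fs.readdirSync(pagesDir).sort()) {
    const pagesFile = resolve(pagesDir, filename)

    // Ensure file is JSONL
    if (!filename.toLowerCase().endsWith('.jsonl')) {
      console.warn(`Pages: Skipping file ${basename(pagesFile)}: does not end with jsonl extension.`)
      continue
    }

    // Every non-empty line must be parseable as JSON
    let isValidJSONL = true

    for (const line of fs.readFileSync(pagesFile, 'utf8').split('\n')) {
      if (line.trim() === '') {
        continue
      }

      try {
        JSON.parse(line)
      } catch (err) {
        isValidJSONL = false
        console.warn(`Pages: Skipping file ${basename(pagesFile)}: not valid JSONL.`)
        break
      }
    }

    if (isValidJSONL) {
      valid.push(filename)
    }
  }

  return valid
}
```

Calling `listValidPagesFiles('collection/pages/')` on a folder holding `pages.jsonl`, `extraPages.jsonl` and a stray `notes.txt` would be expected to return only the two JSONL files, provided each of their lines is valid JSON.
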
4 changes: 3 additions & 1 deletion types.js
@@ -3,7 +3,9 @@
* @typedef {Object} WACZOptions
* @property {string|string[]} input - Required. Path(s) to input .warc or .warc.gz file(s). Glob-compatible.
* @property {string} output - Required. Path to output .wacz file. Will default to PWD + `archive.wacz` if not provided.
- * @property {boolean} [detectPages=true] - If true (default), will attempt to detect pages in WARC records.
+ * @property {boolean} [indexFromWARCs=true] - If true, will attempt to generate CDXJ indexes from processed WARCs. Automatically disabled if `addCDXJ()` is called.
+ * @property {boolean} [detectPages=true] - If true (default), will attempt to detect pages in WARC records. Automatically disabled if `pages` is provided or `addPages()` is called.
+ * @property {?string} pages - Path to a folder containing pages files (pages.jsonl, extraPages.jsonl ...).
* @property {?string} url - If set, will be added to datapackage.json as `mainPageUrl`.
* @property {?string} ts - If set, will be added to datapackage.json as `mainPageDate`. Can be any value that `Date()` can parse.
* @property {?string} title - If set, will be added to datapackage.json as `title`.
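The interplay between the `WACZOptions` fields documented in `types.js` can be illustrated with a small sketch. `normalizeOptions` below is a hypothetical function written for this note, not the library's actual option handling; it only encodes the defaults and the "automatically disabled" rule as the JSDoc describes them.

```javascript
// Hypothetical illustration, NOT js-wacz code: encodes the documented defaults.
const { join } = require('node:path')

/**
 * Applies the defaults described in the WACZOptions typedef.
 * @param {Object} options - Partial WACZOptions-like object.
 * @returns {Object} Options with defaults applied.
 */
function normalizeOptions (options = {}) {
  // A pages folder, when provided, takes precedence over page detection
  const pages = options.pages ?? null

  return {
    input: options.input,
    // Documented default: PWD + `archive.wacz`
    output: options.output ?? join(process.cwd(), 'archive.wacz'),
    // Documented default: true
    indexFromWARCs: options.indexFromWARCs ?? true,
    // Documented default: true, disabled if `pages` is provided
    detectPages: pages ? false : (options.detectPages ?? true),
    pages
  }
}
```

Under these assumptions, `normalizeOptions({ input: 'collection/*.warc.gz', pages: 'collection/pages/' })` yields `detectPages: false`, while omitting `pages` leaves detection enabled.
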