1.7.4

capture_http support for chunk-encoded requests #116
indexer: option to enable verify_http #116
Enable writing block digests for warcinfo records #115
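
For readers unfamiliar with the chunked transfer encoding referred to above, the following is a minimal standard-library sketch of how a chunk-encoded body is reassembled. It is not the capture_http implementation; `dechunk` is a hypothetical name, chunk extensions are discarded, and trailers after the final chunk are ignored.

```python
from io import BytesIO


def dechunk(stream):
    """Reassemble a chunk-encoded HTTP body from a binary stream (hypothetical helper)."""
    body = b""
    while True:
        size_line = stream.readline()
        if not size_line:
            break  # stream ended without a terminating zero-size chunk
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # zero-size chunk marks the end of the body
        body += stream.read(size)
        stream.readline()  # consume the CRLF that terminates each chunk
    return body


encoded = BytesIO(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
print(dechunk(encoded))  # b'Wikipedia'
```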

1.7.3

Fix documentation for capture_http filter_records #110
Fix capture_http with http and https proxies #113

1.7.2

Ensure 1.1 revisit profile used with WARC/1.1 revisits #96
Include record offsets in warcio check output #98
CI fix for Python 2.7, use jinja<3.0.0 #105
Fix in StatusAndHeaders when writing, then reading record #106
Fix issues related to http header re-encoding, ensure correct content-length and %-encoding #106, #107

1.7.1

Windows fixes: Fix reading from stdin, ensure all WARCs/ARCs are treated as binary #86
Fix ensure_digest(block=True) breaking on an existing record, RecordBuilder supports header_filter #85

1.7.0

Docs and Misc Cleanup: add docs for extract tool, correct doc for get_statuscode(), move all CLI tools to separate modules for better reusability.
Support indexing a WARC read from stdin #79
Automatically %-encode urls that have a space in WARC-Target-URI #80
Separate record creation into RecordBuilder class to allow building WARC records without a WARCWriter, which now derives from RecordBuilder #63
Support the ability to optionally check ARC/WARC record's block and payload digests #54, #58, #68, #77
- ArchiveIterator and ArcWarcRecordLoader now accept a check_digests boolean keyword argument at creation, indicating whether each record's digests should be checked; defaults to False
- Core digest checking functionality is provided by DigestChecker and DigestVerifyingReader importable from warcio.digestverifyingreader
- New block and payload digest checking utility class, Checker, has been added and is importable from warcio.checker
- The CLI has been updated to provide warcio check, a command for performing block and payload digest checking
Ensured that ARCHeadersParser's splitting on spaces does not split on spaces inside URIs #62
Move the compute_headers_buffer method and headers_buff property to StatusAndHeaders and fix incorrect digests in some test WARCs #67
Ensured that the BaseWARCWriter does not use a mutable default value for the warc_header_dict keyword argument #70
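
To make the digest-checking entries in this release more concrete, here is a minimal standard-library sketch of what verifying a recorded digest involves. It is not warcio's implementation: `compute_digest` and `digest_matches` are hypothetical helper names, and the sketch assumes the common `sha1:<Base32>` form used for WARC-Block-Digest and WARC-Payload-Digest values.

```python
import base64
import hashlib


def compute_digest(data, algorithm="sha1"):
    """Return a digest string of the form '<algorithm>:<BASE32>' (hypothetical helper)."""
    hasher = hashlib.new(algorithm)
    hasher.update(data)
    return algorithm + ":" + base64.b32encode(hasher.digest()).decode("ascii")


def digest_matches(data, expected):
    """Recompute the digest named in `expected` and compare it to the recorded value."""
    algorithm, _, _ = expected.partition(":")
    return compute_digest(data, algorithm) == expected


payload = b"hello world"
recorded = compute_digest(payload)

print(digest_matches(payload, recorded))         # True
print(digest_matches(payload + b"!", recorded))  # False
```

A checker built this way reports a mismatch whenever the stored bytes differ from those that were hashed when the record was written.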

1.6.3

Make warcio recompress more robust in fixing improperly compressed WARCs, --verbose mode for printing results #52
BufferedReader supports streaming all members of multi-member gzip file with read_all_members=True option.

1.6.2

Ensure any non-ascii data in http headers is %-encoded, even if non-conformant to RFC 8187 #51
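
As an illustration of the behaviour described above, the sketch below %-encodes a header value only when it contains non-ASCII characters. It uses the standard library's `urllib.parse.quote`; `encode_header_value` and its choice of "safe" punctuation are assumptions made for this example, not warcio's actual code.

```python
from urllib.parse import quote


def encode_header_value(value):
    """Percent-encode a header value only if it is not plain ASCII (hypothetical helper)."""
    try:
        value.encode("ascii")
        return value  # already plain ASCII, leave untouched
    except UnicodeEncodeError:
        # Keep common header punctuation readable; other characters,
        # including every non-ASCII one, become %XX escapes of their UTF-8 bytes.
        return quote(value, safe=' /;=,:"')


print(encode_header_value("text/html; charset=utf-8"))
# text/html; charset=utf-8
print(encode_header_value('attachment; filename="caf\u00e9.txt"'))
# attachment; filename="caf%C3%A9.txt"
```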

1.6.1

Fixes for warcio.utils.open() not opening files in binary mode in Python 2.7 on Windows #49
capture_http() various fixes and improvements, default writer, WARC-IP-Address header support #50

1.6.0

Support WARC/1.1 standard WARC records, reading #39 and writing #46 with microsecond precision WARC-Date
Support simplified semantics for capturing http traffic to a WARC #43
Support parsing incorrect wget 1.19 WARCs with angle brackets, e.g. WARC-Target-URI: <uri> #42
Correct encoding of non-ascii HTTP headers per RFC 8187 #45
New Util Added: warcio.utils.open provides exclusive creation mode open(..., 'x') for Python 2.7
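
The difference between a second-precision WARC-Date and the microsecond-precision form supported for WARC/1.1 in this release can be sketched with the standard library alone. `format_warc_date` and its `use_micros` flag are hypothetical names chosen for this example and should not be read as warcio's API.

```python
from datetime import datetime, timezone


def format_warc_date(dt, use_micros=False):
    """Format an aware datetime as a UTC WARC-Date string (hypothetical helper)."""
    dt = dt.astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if use_micros else "%Y-%m-%dT%H:%M:%SZ"
    return dt.strftime(fmt)


stamp = datetime(2018, 3, 1, 12, 30, 45, 123456, tzinfo=timezone.utc)
print(format_warc_date(stamp))                   # 2018-03-01T12:30:45Z
print(format_warc_date(stamp, use_micros=True))  # 2018-03-01T12:30:45.123456Z
```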

1.5.3

ArchiveIterator calls new close_decompressor() function in BufferedReader instead of close() to only close decompressor, not underlying stream. #35

1.5.2

Write any errors during decompression to stderr #31
to_native_str() returns original value unchanged if not a string/bytes type
WARCWriter.create_revisit_record() accepts additional WARC headers dictionary
ArchiveIterator.close() added which calls decompressor.flush() to address possible issues in #34
Switch WARC-Record-ID uuid creation to uuid4() from uuid1()
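
The uuid change above matters because `uuid1()` derives its value from the host's network address and a timestamp, while `uuid4()` is random. A minimal sketch of building a record ID from a version 4 UUID follows; `make_record_id` is a hypothetical helper, and the `<urn:uuid:...>` form shown is the one commonly seen in WARC files.

```python
import uuid


def make_record_id():
    """Build a WARC-Record-ID value from a random (version 4) UUID (hypothetical helper)."""
    return "<urn:uuid:{}>".format(uuid.uuid4())


record_id = make_record_id()
print(record_id)

# The embedded UUID can be parsed back to confirm it is version 4.
print(uuid.UUID(record_id[len("<urn:uuid:"):-1]).version)  # 4
```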

1.5.1

Remove test/data from wheel build, as it breaks latest setuptools wheel installation
Add Content-Length when adding Content-Range via StatusAndHeaders.add_range #29

1.5.0

New extract CLI command #26 (by @nlevitt)
Fix for writing WARC record with no content-type #27 (by @thomaspreece)
Better verification of chunk header before attempting to de-chunk with ChunkedDataReader
MANIFEST.in added (by @pmlandwehr)

1.4.0

Indexing API improvements:
- Indexer class moved to indexer.py and all aspects of indexing process can be extended.
- Support for accessing http headers with http:-prefixed fields #22
- Special fields: filename field and http:status
- JSON offset and length fields returned as strings for consistency.
- ArchiveIterator API: add get_record_offset() and get_record_length() to return current offset/length, iterator now tracks current record
StatusAndHeaders accepts headers in more flexible formats (mapping, byte or string) and normalizes to string tuples #19

1.3.4

The continued read for more data to decompress (introduced in 1.3.2 for brotli decompression) now happens only if no unused data remains; otherwise, the reader is likely at the end of a gzip member.

1.3.3

Set default read block_size to 16384, ensure block_size is never None (caused an issue in py2.7)

1.3.2

Fixes issues with BufferedReader returning empty response due to brotli decompressor requiring additional data, for more details see: #21

1.3.1

Fixes #15, including:
WARCWriter.create_warc_record() works correctly when specifying a payload with no length param.
Writing DNS records now works (tests included).
HTTP headers only expected for writing request, response records if the URI has a http: or https: scheme (consistent with reading).

1.3

Support for reading "streaming" WARC records, with no Content-Length set. Content-Length and digests computed as expected when the record is written.
Additional tests for streaming WARC records, loading HTTP headers+payload from buffer, POST request record, arc2warc conversion.
recompress command now parses records fully and generates correct block and payload digests.
WARCWriter.writer.create_record_from_stream() removed, redundant with ArcWarcRecordLoader()

1.2

Support for special field offset to include WARC record offset when indexing (by @nlevitt, #4)
ArchiveIterator supports full iterator semantics
WARC headers encoded/decoded as UTF-8, with fallback to ISO-8859-1 (see #6, #7)
ArchiveIterator, StatusAndHeaders and WARCWriter now available from package root (by @nlevitt, #10)
StatusAndHeaders supports dict-like API (by @nlevitt, #11)
When reading, http headers never added by default, unless ensure_http_headers=True is set (see #12, #13)
All tests run on Windows, CI using Appveyor
Additional tests for writing/reading resource, metadata records
warcio -V now outputs current version.
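
The UTF-8 decoding with ISO-8859-1 fallback listed for this release can be sketched in a few lines. This is an illustration of the technique only; `decode_header_line` is a hypothetical name, not a warcio function.

```python
def decode_header_line(raw):
    """Decode header bytes as UTF-8, falling back to ISO-8859-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this step cannot fail.
        return raw.decode("iso-8859-1")


print(decode_header_line("caf\u00e9".encode("utf-8")))  # valid UTF-8 input
print(decode_header_line(b"caf\xe9"))                   # invalid UTF-8, decoded as ISO-8859-1
```

Both calls yield the same text, because the second input is the ISO-8859-1 encoding of the same string.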

1.1

Header filtering: support filtering via custom header function, instead of an exclusion list
Add tests for invalid data passed to recompress, remove unused code

1.0

Initial Release!
