feat: WARC record iterator #134

Merged
merged 5 commits into from
May 29, 2024
Conversation

maeb
Member

@maeb maeb commented Apr 10, 2024

The gowarc.WarcFileReader does not provide a simple way to get the size of the next record.

This PR introduces an iterator abstraction for iterating over WARC records. In addition to returning the size of each record, the iterator encapsulates:

  • filtering records
  • limiting the number of records
  • selecting the nth record

This PR only includes the implementation of the iterator. Future pull requests will start using it.
edit: added use cases
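
For readers without the diff at hand, here is a minimal sketch of what such an iterator could look like. It is not the code merged in this PR and does not use gowarc's API: `Record`, `Source`, `Iterator`, `Collect` and `sliceSource` are hypothetical names made up for illustration, and the record size is assumed to be already known to the source.

```go
package main

import (
	"errors"
	"io"
)

// Record is a hypothetical stand-in for a WARC record (not gowarc's type).
type Record struct {
	Type string
	Size int64 // size of the record in bytes, assumed to be known by the source
}

// Source is a hypothetical reader that yields records one at a time
// and returns io.EOF when there are no more records.
type Source interface {
	Next() (Record, error)
}

// Iterator wraps a Source and encapsulates filtering, nth-record
// selection and limiting.
type Iterator struct {
	Src    Source
	Filter func(Record) bool // nil keeps every record
	Nth    int               // if > 0, yield only the nth matching record (1-based)
	Limit  int               // if > 0, stop after yielding this many records

	matched int // records that passed the filter so far
	yielded int // records returned to the caller so far
}

// Next returns the next record that satisfies the iterator's settings,
// or io.EOF once the limit is reached, the nth record has been
// returned, or the source is exhausted.
func (it *Iterator) Next() (Record, error) {
	for {
		if it.Limit > 0 && it.yielded >= it.Limit {
			return Record{}, io.EOF
		}
		if it.Nth > 0 && it.matched >= it.Nth {
			return Record{}, io.EOF
		}
		rec, err := it.Src.Next()
		if err != nil {
			return Record{}, err
		}
		if it.Filter != nil && !it.Filter(rec) {
			continue
		}
		it.matched++
		if it.Nth > 0 && it.matched != it.Nth {
			continue
		}
		it.yielded++
		return rec, nil
	}
}

// Collect drains the iterator into a slice, treating io.EOF as a normal end.
func Collect(it *Iterator) ([]Record, error) {
	var out []Record
	for {
		rec, err := it.Next()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, rec)
	}
}

// sliceSource is an in-memory Source used only to exercise the sketch.
type sliceSource struct {
	recs []Record
	pos  int
}

func (s *sliceSource) Next() (Record, error) {
	if s.pos >= len(s.recs) {
		return Record{}, io.EOF
	}
	rec := s.recs[s.pos]
	s.pos++
	return rec, nil
}
```

In this sketch, `Collect(&Iterator{Src: src, Filter: isResponse, Nth: 2})` would return only the second record accepted by a caller-supplied `isResponse` predicate, together with its size. Keeping the three concerns inside `Next` means callers only deal with a plain read loop.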

@maeb maeb marked this pull request as draft April 11, 2024 07:21
Contributor

trym-b commented Apr 12, 2024

As mentioned IRL yesterday, I think we should not merge this without also merging a use case at the same time (in the same PR). Otherwise, we are basically merging dead code.

@maeb maeb force-pushed the feat/warc-record-iterator branch from 3b95fb8 to 9e00cd7 Compare April 12, 2024 08:21
@maeb maeb force-pushed the feat/warc-record-iterator branch 2 times, most recently from 7364e80 to 857876f Compare April 12, 2024 09:10
@maeb maeb force-pushed the feat/warc-record-iterator branch 4 times, most recently from b943277 to 79a6fc2 Compare May 7, 2024 08:22
@maeb maeb marked this pull request as ready for review May 7, 2024 08:24
@maeb maeb force-pushed the feat/warc-record-iterator branch from 79a6fc2 to e9271b8 Compare May 22, 2024 10:20
maeb added 5 commits May 29, 2024 08:57
The gowarc.WarcFileReader does not provide a simple way to get the size of the next
WARC record.

This commit introduces an iterator abstraction for iterating over WARC records. In
addition to returning the size of each record, the iterator encapsulates filtering
records, limiting the number of records, and selecting the nth record.
@maeb maeb force-pushed the feat/warc-record-iterator branch from e9271b8 to 12d5dc9 Compare May 29, 2024 06:57
@maeb maeb merged commit bb24934 into main May 29, 2024
8 checks passed
@maeb maeb deleted the feat/warc-record-iterator branch May 29, 2024 07:00
2 participants