Thoughts on using SQLite for the WACZ format #62

Open
phiresky opened this issue Sep 19, 2021 · 18 comments

@phiresky

phiresky commented Sep 19, 2021

This is a response to your email, but I thought I'd post it publicly since it might interest other people.


Here's some fairly unstructured thoughts without a real "conclusion":

I'd say SQLite is definitely a possibility for that use case.

  • SQLite has an "integrated" zip-file replacement https://sqlite.org/sqlar/doc/trunk/README.md (not that great for this since it uses DEFLATE and doesn't directly have any other file metadata or indexes, but it's precedent for storing files in SQLite).

  • SQLite has an incremental BLOB I/O API that allows a byte sequence stored in the database to be read as a stream (as opposed to fetching it all at once with a normal SQL query)

  • Audacity recently changed their project format, which consists of many audio tracks, to use SQLite instead of a directory of files: https://www.audacityteam.org/audacity-3-0-0-released/ . These can be pretty large audio files, so it seems to work fine at that scale.

  • SQLite has a limitation of 2GB for one blob which is probably not that relevant for webrecorder but can definitely be hit when using it to archive normal files

  • I've looked at the WACZ format before and it seems well thought out, but a bit unnecessarily complicated - kind of reinventing a database structure inside a ZIP file. I also understand in general why people like plain text etc formats - but is there a specific reason why you need to retain normal WARC files with side-car indexes in your WACZ format?

  • Putting everything in a cleanly normalized SQLite database has a few obvious advantages: for one, it becomes trivial to continuously insert more requests without having to rewrite any indexes; all indexing is managed automatically, and adding a new key by which to query data is trivial. For example, in browsertrix-crawler I couldn't really find out whether I can stop the crawl and resume it later, or whether it always has to run through completely, and I guess it only writes out the final archive at the end?

  • You can write to and read from SQLite database from multiple processes at the same time with ease (single-writer but multiple-reader with journal_mode=WAL)

  • The SQLite database storage format is pretty efficient - for example it transparently uses var-ints for all integers. Text/binary data is not compressed though.

  • I see that CDXJ access uses binary search for lookups - SQLite uses a B-tree (as do basically all databases), which in general should mean a constant factor fewer requests, since the tree depth is lower. It can also be updated more efficiently (afaik). Since request latency is significant for partial fetching, this is probably a good improvement. You can check the depth of the b-tree with sqlite3_analyzer. I just looked at a DB with 2M rows: the SQLite index has depth 4, while a binary search would have depth 20.

  • SQLite always first looks up the rowid in an index b-tree, then the real row in the table b-tree. Since each HTTP request has fairly high overhead, it can make sense to pre-cache the inner nodes of the b-trees beforehand. Someone implemented this HTTP-request-pattern streamlining with a sidecar file in phiresky/sql.js-httpvfs#15 . In your WACZ format, the equivalent would be caching the first few layers of the binary tree induced by the binary search (but it would be less useful, since that tree is much deeper).

  • CDXJ gzips groups of lines - this is something that would be harder with SQLite since you can't compress many rows together without denormalizing the DB in some way (e.g. storing many real rows per DB row).

  • To solve the above I've actually developed https://github.com/phiresky/sqlite-zstd , which allows transparent row-level compression in SQLite with dictionaries - so if you have 10000 similar entries all the common information need only be stored once. It's great, but a bit experimental and right now no wasm support.

  • SQLite would be pretty much a black box in this context - you probably don't want to modify the SQLite core, so anything that falls outside of SQLite's flexibility will be hard to do.

  • One disadvantage of SQLite is that it would be harder to reason about exactly how much data needs to be fetched for a specific query, and having a WASM "blob" between layers of your code is annoying in general.

  • SQLite is very robust and established, but using it from within the browser is not. So if your use case was only native programs I'd say definitely use it instead of a home-brew format, but for web-based it's less clear.

  • To use SQLite as your file format, you'd have to decide on the database structure. So that would be something like:

create table requests(id integer primary key autoincrement, timestamp date, headers text, content blob);
create table entry_points(request_id integer references requests(id), ts date, title text, text text); -- pages  
create table archive_metadata(key text primary key, value text);

You could then also easily add a full-text search index so you can search for websites with specific text in the titles or even content.
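
For illustration, a minimal FTS5 sketch on top of the hypothetical schema above (assuming the SQLite build includes FTS5; the index name and search term are placeholders):

create virtual table entry_points_fts using fts5(title, text); -- full-text index over page titles and extracted text
insert into entry_points_fts(title, text) select title, text from entry_points; -- populate from existing rows
select title from entry_points_fts where entry_points_fts match 'filming revolution'; -- example query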

The "cleanest" way would be to put everything in 3NF. That way you could add indexes on whatever you want, including looking for requests with a specific header value, etc. But you probably want to denormalize the data somewhat, since you always want all the headers together with the content (like in WARC) without any joining overhead. Then it's harder to index the headers individually, but the main use case is (probably) efficient. Putting it in a more normal form would mean something like

create table requests(id integer primary key autoincrement, timestamp date, ...);
create table headers(id integer primary key autoincrement, key text); -- distinct header names
create table header_values(id integer primary key autoincrement, header integer references headers(id), value text);
create table request_headers(request_id integer references requests(id), header_value integer references header_values(id)); -- join table

Note that this is much more complicated, but it also deduplicates the data since most headers and even header values are probably used multiple times.
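
For example, with that normalization, "find all requests with a specific header value" becomes an ordinary join plus an index; a hypothetical sketch against the tables above (the header name, value, and index name are placeholders):

create index header_values_by_value on header_values(header, value);
select r.id, r.timestamp
from requests r
join request_headers rh on rh.request_id = r.id
join header_values hv on hv.id = rh.header_value
join headers h on h.id = hv.header
where h.key = 'content-type' and hv.value = 'text/html; charset=utf-8';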

@phiresky
Author

One more thing: you probably want compression, and might want hash-based indexing as you do in your archiveweb.page system. So that would make it something like

create table requests(id integer primary key autoincrement, timestamp date, ..., headers_compression_format text, headers blob, content_hash references contents(sha256));
create table contents(sha256 text primary key, compression_format text, data blob);
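
The write path would then deduplicate bodies by hash, roughly like this sketch (the :named parameters stand in for values supplied by the archiving tool):

-- store each distinct body only once, keyed by its hash
insert or ignore into contents(sha256, compression_format, data)
  values (:sha256, 'zstd', :compressed_body);
-- the request row then just references the hash
insert into requests(timestamp, headers_compression_format, headers, content_hash)
  values (:timestamp, 'zstd', :compressed_headers, :sha256);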

@phiresky
Author

I'll see if I can convert a warc archive to the above sqlite format later today, then we can check how many http requests it has to do

@ikreymer
Member

Thanks for the detailed response! To clarify, I was thinking specifically of using sqlite in place of the CDXJ index, and possibly page and text indices, not the raw WARC data.

I've looked at the WACZ format before and it seems well thought out, but a bit unnecessarily complicated - kind of reinventing a database structure inside a ZIP file. I also understand in general why people like plain text etc formats - but is there a specific reason why you need to retain normal WARC files with side-car indexes in your WACZ format?

Yes, a key reason is backwards (and forwards) compatibility: there are many PBs of WARC data out there, and it is a standard format that is widely used in the web archiving community. Simply storing the WARCs in the WACZ ensures there is a way to transition between the two for those that only support WARCs. Any new version of the WARC standard can also be accommodated. There are also efforts to support zstd compression in WARCs.

The CDX(J) format, although in common use now, is less standardized, and there is no standard for page or text indices thus far. I am thinking of SQLite in place of the CDX(J) and/or the page/text indices.

The table could look something like:

create table cdx(id integer primary key autoincrement, url text, timestamp date, status integer, mime text, digest text, length integer, offset integer, filename text);
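
For the usual url + timestamp lookup you'd then add a composite index; a minimal sketch (the URL and index name are placeholders):

create index cdx_url_ts on cdx(url, timestamp);
-- latest capture of a given URL
select * from cdx where url = 'https://example.com/' order by timestamp desc limit 1;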

A quick test shows that a sqlite table is about the same size as (slightly smaller than) the uncompressed CDXJ, while a compressed CDXJ would still be a lot smaller... so there is that. A custom grouping approach, as you've experimented with, might still be necessary.
The main use cases for this will probably be around web-based access, so we definitely want to optimize for that.

@phiresky
Author

phiresky commented Sep 19, 2021

I was thinking specifically of using sqlite in place of the CDXJ index

Sure, that makes sense. I still think you could fairly easily make a bidirectional conversion from a set of WARC files to an extensible SQLite DB and ensure there's no information loss, and a specification of that could be the next version of the WARC file format. But I can understand the desire to keep the original files around and to have a less drastic transition.

A quick test shows that a sqlite table is about the same size

That sounds kind of unexpected to me since the JSON format has a lot of overhead (when not compressed).
I just tried it with the index in fr.wacz (8000 entries)

Converted with the following command:

rg '(.*?) (\d+) \{(.*)\}' cdx-00000 --replace '{"revurl": "$1", "ts": $2, $3}' > cdx-00000.jsonl
jq -s . cdx-00000.jsonl > cdx-00000.json
sqlite-utils insert cdx-00000.sqlite3 cdx cdx-00000.json
ls -lh cdx*

2,7M Sep 19 21:26 cdx-00000.jsonl
2,1M Sep 19 21:27 cdx-00000.sqlite3
2,5M Sep 19 21:19 cdx-00000

So trivially converted, that's about 16% smaller than the uncompressed CDXJ, though when I add an index on revurl it is also 2.6MByte, so I guess you're pretty much correct.

But there's also a ton of redundancy in there because it's not normalized.

With queries like

create table filenames(id integer primary key not null, filename text not null unique);
insert or ignore into filenames(filename) select filename from cdx;
update cdx set filename = (select id from filenames where filenames.filename = cdx.filename);

The sqlite size goes down to 1.3MByte, a 48% size reduction over the uncompressed file. Still 5x larger than the compressed file, but that doesn't allow random access.
That 1.3MByte consists of 30% for the URLs, 33% for the index on the domain-reversed URL, 9% for the digests, and 28% for everything else.
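
The test below queries a revurls table; I'm assuming the domain-reversed URL column was normalized the same way as the filenames above, roughly:

create table revurls(id integer primary key not null, revurl text not null unique);
insert or ignore into revurls(revurl) select revurl from cdx;
update cdx set revurl = (select id from revurls where revurls.revurl = cdx.revurl);
create index cdx_revurl on cdx(revurl); -- so the outer lookup by revurl id is also a b-tree search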

Testing with

strace -e pread64 sqlite3 -cmd 'pragma mmap_size=0;' cdx-00000.sqlite3 "select * from cdx where revurl = (select id from revurls where revurl = 'org,supdigital,filmingrevolution)/img/bullet-proof-vest.png')"

This shows that finding the entry for one URL takes 7 HTTP requests of 4kB each: 3 for the revurl b-tree lookup and 4 for the cdx table lookup. Not sure if that's good or bad. I guess with "only" 8000 entries it's still better to just compress and download the whole thing.

Attached is the SQLite db and the original cdxj file:
cdx-sqlite.zip

@phiresky
Author

One obvious reason to use SQLite for the indexes would be that you can easily add a fairly efficient full-text-search index for fuzzy-matching in e.g. page titles (or contents) without downloading all of them. No idea how you would build that otherwise; I don't think there's any precedent for full-text / inverted-index lookups from a static HTTP server.

Other than that the indexes should be more efficient and more flexible than self-written ones.

@ikreymer
Member

Thanks for checking it with this file; that seems like a good test case, and I can definitely find something bigger with a lot more data!

By comparison, inside that WACZ the compressed size is cdx-00000.gz (263,902 bytes) + cluster.idx (2,430 bytes) = 266,332 bytes, so ~266K, which is about 10x smaller.

For CDX lookup, there should only be 2 HTTP requests: one for the compressed block (from cdx-00000.gz) and one to get the WARC record from one of the WARCs. (There are a few other requests initially for the WACZ itself.) 7 HTTP requests sounds not so great by comparison, so this doesn't seem like a good alternative based on this dataset.

Of course, this index is particularly optimized for url + timestamp lookup, so other types of queries would be more tricky.

One obvious reason to use SQLite for the indexes would be that you can easily add a fairly efficient full-text-search index for fuzzy-matching in e.g. page titles (or contents) without downloading all of them. No idea how you would build that otherwise; I don't think there's any precedent for full-text / inverted-index lookups from a static HTTP server.

Perhaps application to page index / text index might be more interesting. Yes, I was thinking of some sort of incremental sharded loading that is populated into the local FlexSearch instance, but you're right, not aware of any existing standard approach for this!

@ikreymer
Member

The sqlite size goes down to 1.3MByte, a 48% size reduction over the uncompressed file. Still 5x larger than the compressed file, but that doesn't allow random access.
That 1.3MByte consists of 30% for the URLs, 33% for the index on the domain-reversed URL, 9% for the digests, and 28% for everything else.

Hm, we may actually be able to skip the domain-reversed URL, as wabac.js currently doesn't use it - we're experimenting to see whether replay at this scale can be done sufficiently well without it. Most other operations are done as range queries in IndexedDB with extra filtering.

To solve the above I've actually developed https://github.com/phiresky/sqlite-zstd , which allows transparent row-level compression in SQLite with dictionaries - so if you have 10000 similar entries all the common information need only be stored once. It's great, but a bit experimental and right now no wasm support.

Yeah, I think probably some sort of batch row compression like this would be needed to get comparable storage size to the compressed cdx...

@phiresky
Author

7 HTTP requests sounds not so great by comparison, so this doesn't seem like a good alternative based on this dataset.

The cluster.idx meta-index is basically like caching the first few layers of the b-tree, so the comparison isn't totally fair. The second request to the SQLite db will already make fewer b-tree reads without any pre-caching (2 HTTP requests into the revurl index, 2 requests into the cdx table). If a website does 20 requests it'll probably amortize fairly quickly.

domain-reversed URL [...] wabac.js currently doesn't use it

Huh, so it just does exact matching by URL?

I just tried again with a 64kB page size and querying by URL, I get 4 http requests for the first lookup by URL, and 2 for the second lookup with a different random URL. With similar URLs, the next query has 0 or 1 http requests.
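
(For reference, the page-size change above is just a pragma plus a rebuild:)

pragma page_size = 65536; -- 64 kB pages
vacuum;                   -- rewrites the database file using the new page size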

So yeah, just comparing these it doesn't look like SQLite has much of an advantage, and only gets the same "performance" as the manually-tuned cdxj index with some tuning of its own.

inside that WACZ, the size is compressed 266K

Yes, you're not going to get that same size with SQLite without pretty hacky stuff. Though I'm not sure if the total size here is too much of an issue, considering it's only fetched in smaller parts, and the total size will in any case be much smaller than the rest of the WARC files.

Yes, I was thinking of some sort of incremental sharded loading that is populated into the local FlexSearch instance, but you're right, not aware of any existing standard approach for this!

I'd say SQLite FTS scales to maybe 5-50 GB of plain text content with this method (static hosting), though the DB will get pretty large. quickwit-oss/tantivy#1067 scales to searching in multiple terabytes of text, fetching only 1-2 MB of data from the index.

@ikreymer
Member

7 HTTP requests sounds not so great by comparison, so this doesn't seem like a good alternative based on this dataset.

The cluster.idx meta-index is basically like caching the first few layers of the b-tree, so the comparison isn't totally fair. The second request to the SQLite db will already make fewer b-tree reads without any pre-caching (2 HTTP requests into the revurl index, 2 requests into the cdx table). If a website does 20 requests it'll probably amortize fairly quickly.

Yes, you're right, the .idx file is sort of equivalent to the b-tree cache!

domain-reversed URL [...] wabac.js currently doesn't use it

Huh, so it just does exact matching by URL?

It does a lot of prefix search, usually from the '?', and then finds the best match based on a few heuristics and rules.
The rev-url form was needed for certain locality, but it's actually not as important in the general case. When needed, a URL can simply be transformed into a special 'fuzzy redirect URL' that then points to the original; this is done for YouTube.

I just tried again with a 64kB page size and querying by URL, I get 4 http requests for the first lookup by URL, and 2 for the second lookup with a different random URL. With similar URLs, the next query has 0 or 1 http requests.

So yeah, just comparing these it doesn't look like SQLite has much of an advantage, and only gets the same "performance" as the manually-tuned cdxj index with some tuning of its own.

Yeah, perhaps not worth the switch just for this... If additional querying capability were needed, that could be worth it, but this is intended primarily for replay. Thanks again for trying this out!

inside that WACZ, the size is compressed 266K

Yes, you're not going to get that same size with SQLite without pretty hacky stuff. Though I'm not sure if the total size here is too much of an issue, considering it's only fetched in smaller parts, and the total size will in any case be much smaller than the rest of the WARC files.

Yeah, that's probably right at this scale. It matters when dealing with PBs of data, but not at the scale of most WACZ files.

Yes, I was thinking of some sort of incremental sharded loading that is populated into the local FlexSearch instance, but you're right, not aware of any existing standard approach for this!

I'd say SQLite FTS scales to maybe 5-50 GB of plain text content with this method (static hosting), though the DB will get pretty large. quickwit-oss/tantivy#1067 scales to searching in multiple terabytes of text, fetching only 1-2 MB of data from the index.

That is interesting! Perhaps optimization / index-format improvements should focus on search, as we don't yet have a good scalable solution for full-text search! I wasn't familiar with this project before (hope that PR gets merged!)

@ikreymer
Member

ikreymer commented Apr 2, 2022

Interested in exploring the idea of using SQLite just for the full-text search... In this way, it'll be an optional enhancement, but not required.
The CDXJ approach works well for the core replay, now scaling to over 1TB, ex: https://replayweb.page/?source=https%3A%2F%2Ffiles.sucho.org%2Farchives%2Fchnu-cv-ua.wacz#view=pages&url=http%3A%2F%2Fchnu.cv.ua%2Findex.php%3Fpage%3Dua&ts=20220324190736

but for full-text search we don't yet have a scalable solution.

The idea would be to essentially convert the pages.jsonl / extra-pages.jsonl to a SQLite DB. The table would consist of the page id, url, timestamp, title and extracted text.
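
A rough sketch of what that could look like (names are made up; FTS5 with an external-content table so the extracted text isn't stored twice):

create table pages(id text primary key, url text, ts date, title text, text text); -- one row per entry in pages.jsonl / extra-pages.jsonl
create virtual table pages_fts using fts5(title, text, content='pages');
insert into pages_fts(rowid, title, text) select rowid, title, text from pages; -- keep the FTS index in sync on import
select rowid, title from pages_fts where pages_fts match 'archiv*' order by rank limit 20; -- prefix query, best matches first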

I'd say SQLite FTS scales to maybe 5-50 GB of plain text content with this method (static hosting), though the DB will get pretty large. quickwit-oss/tantivy#1067 scales to searching in multiple terabytes of text, fetching only 1-2 MB of data from the index.

If we're talking about just text content, I think that scale seems sufficient!

@DiegoPino

DiegoPino commented Apr 2, 2022 via email

@ikreymer
Member

ikreymer commented Apr 2, 2022

Ilya, this might be interesting for you quickwit-oss/tantivy#1067 (Hint: faster and more efficient than sql lite)

yes, that PR is from @phiresky , who mentions it above :)

It doesn't seem like it's going to get merged though, based on the lack of activity there :/

Another factor is that this would add a dependency on tantivy, which is not as ubiquitous as sqlite.

@phiresky
Author

phiresky commented Apr 2, 2022

It doesn't seem like it's going to get merged though, based on the lack of activity there :/

I haven't checked in a while, but if I remember correctly the authors of tantivy added some other changes that should make a WASM build easier (a smaller PR with far fewer changes). It's still implementation work someone would need to do, though, and the tantivy authors themselves probably aren't too interested in doing it.

The idea would be to essentially convert the pages.jsonl / extra-pages.jsonl to a SQLite DB

Can you send a (large) pages.jsonl file so I can play around? https://files.sucho.org/archives/chnu-cv-ua.wacz doesn't contain the plain text extraction

@phiresky
Author

phiresky commented Apr 2, 2022

@ikreymer here's the python code I used for a different project that loads a set of WARC files into an SQLite database https://github.com/phiresky/warc-sqlite (with a suboptimal format and without advanced functionality right now). I added a small CLI and docs. Try it with pipx install git+https://github.com/phiresky/warc-sqlite

@ikreymer
Member

ikreymer commented Apr 3, 2022

Can you send a (large) pages.jsonl file so I can play around? https://files.sucho.org/archives/chnu-cv-ua.wacz doesn't contain the plain text extraction

Yes, we turned off text extraction due to scaling issues for now.

Here are some WACZ files that do have text in the pages.jsonl:
https://files.sucho.org/archives/dako-gov-ua.wacz
https://files.sucho.org/archives/echo-msk-ru.wacz

Can send the separate pages.jsonl later. Thanks so much for taking a look (whenever you have time!)

@phiresky
Author

phiresky commented Apr 3, 2022 via email

@ikreymer
Member

ikreymer commented Apr 4, 2022

Thanks, I extracted them from the links you sent. Do you think the search should rather happen within the service worker or on the client side? In the service worker there might be a bit of an issue in that you can't create another (web)worker from the service worker, so running the sqlite engine in there might block other queries while it's running

That's a good point. I was imagining it would be part of wabac.js, or an add-on to it, since it handles the WACZ loading. wabac.js can also be run as a web worker - it is used that way for the initial loading. Then replayweb.page could just perform search queries via the /api endpoints that wabac.js provides. But if it's easier to prototype it differently, we can also move it later!

@phiresky
Author

I implemented a prototype: webrecorder/wabac.js#63

We'll need bigger pages.jsonl files to really test it; the 50MB ones aren't too interesting.
