Support xz-compressed files #46

joverlee521 · 2022-08-24T01:17:33Z

I am planning to use pyfastx within Nextstrain's Augur to support a new data curation command, and it would be really helpful if pyfastx could read xz-compressed files. Would you be open to extending pyfastx to support them?

Groups working with large files are using xz to save space because xz has a better compression ratio than gzip. For example, Nextstrain hosts a file of all GenBank SARS-CoV-2 genomes that is xz-compressed.

Random access to xz-compressed files is possible, provided the file was originally compressed in multiple short blocks. python-xz is an example of this in pure Python and xz-random-access is an example of this in C.
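To illustrate why compressing in short independent units makes random access possible, here is a minimal sketch using only Python's standard `lzma` module. It is an assumption-laden toy, not how python-xz or xz-random-access work internally: it writes concatenated xz streams rather than multiple blocks inside one stream, and the function names (`compress_in_chunks`, `read_range`) and the in-memory index are hypothetical.

```python
import lzma


def compress_in_chunks(data: bytes, chunk_size: int):
    """Compress `data` as concatenated xz streams, one per chunk.

    Returns the compressed blob and a hypothetical in-memory index of
    (uncompressed_start, compressed_start, compressed_length) tuples.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        comp = lzma.compress(data[start:start + chunk_size])
        index.append((start, len(blob), len(comp)))
        blob += comp
    return bytes(blob), index


def read_range(blob: bytes, index, chunk_size: int, offset: int, size: int) -> bytes:
    """Return `size` uncompressed bytes starting at `offset`.

    Only the chunks that overlap the requested range are decompressed.
    """
    out = bytearray()
    end = offset + size
    first = offset // chunk_size
    last = (end - 1) // chunk_size
    for u_start, c_start, c_len in index[first:last + 1]:
        chunk = lzma.decompress(blob[c_start:c_start + c_len])
        # Slice out the part of this chunk that falls inside the range.
        out += chunk[max(offset - u_start, 0):end - u_start]
    return bytes(out)


if __name__ == "__main__":
    fasta = b">seq1\nACGT\n" * 1000
    blob, index = compress_in_chunks(fasta, chunk_size=4096)
    print(read_range(blob, index, 4096, offset=5000, size=20))
```

The trade-off is that smaller chunks give cheaper seeks but a worse compression ratio, since each chunk is compressed without context from the others.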

Thank you!

lmdu · 2022-11-24T13:53:41Z

Thank you! In the future, I will consider adding support for parsing xz-compressed FASTA/Q files.

corneliusroemer · 2022-12-06T03:38:52Z

Great to hear @lmdu! We are slowly moving from xz to zstd because it compresses and decompresses faster with no loss in compression ratio compared to xz.

Just like xz random access, zstd random access seems to be possible as well. I've found these resources:

joverlee521 mentioned this issue Dec 6, 2022

Add RKI data to open (genbank) data nextstrain/ncov-ingest#365

Merged

6 tasks

joverlee521 mentioned this issue May 17, 2023

ENH(curate): Support zstd compressed fasta, metadata and/or ndjson nextstrain/augur#1219

Open