some sequences are missing in pyfastx.Fasta object #41

Open
dawnmy opened this issue Mar 12, 2022 · 4 comments

dawnmy commented Mar 12, 2022

I loaded a FASTA file containing 4542 sequences with an average length of 2.5 kb, but only 4539 sequences were in the pyfastx.Fasta object.

import pyfastx
fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540']  # raises KeyError

Also, I could access a sequence, e.g. fa['contig_999'], the first time, but when I tried to access it again I got a KeyError.

I used pyfastx 0.8.4 with Python 3.7.
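
One way to check whether records are really absent from the index is to count the header lines independently of pyfastx. The sketch below is only an illustration: `fasta_names` and `missing_names` are hypothetical helpers, not part of pyfastx, and it assumes an uncompressed FASTA file in which the sequence name is the first whitespace-delimited token after `>`.

```python
def fasta_names(path):
    """Return the sequence names from a FASTA file, in file order.

    Assumption: the name is the first whitespace-delimited token
    after '>' on each header line.
    """
    names = []
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def missing_names(expected, indexed):
    """Return names found in the file but absent from the index."""
    indexed = set(indexed)
    return [name for name in expected if name not in indexed]
```

With pyfastx installed, something like `missing_names(fasta_names('assembly.fasta'), [seq.name for seq in fa])` should list the names the index does not contain, assuming that iterating over `fa` yields records with a `name` attribute.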

Owner

lmdu commented Mar 15, 2022

Thank you for reporting this issue. I will check that. A new version will be released soon.

floccinauc commented Aug 31, 2023

Any updates on this? I'm getting the same error. I'm loading a large FASTA file (~59M entries), and for some entries I get a "key does not exist" error, whether I access them by string key or by integer index. Reloading the file fixes the problem for those keys but shifts it to others.
I'm using pyfastx 1.1.0.

Owner

lmdu commented Aug 31, 2023

Thanks. Could you provide your code and an HTTPS link to your data?

floccinauc commented Aug 31, 2023

I'm using the unzipped version of this file: https://stringdb-downloads.org/download/protein.sequences.v12.0.fa.gz
As for my code, the simple snippet below does not seem to reproduce this error:

import pyfastx
from tqdm import tqdm
FILEPATH = "/dccstor/bmfmbio/datasets/STRING/all/protein.sequences.v12.0.fa"
loaded_fasta = pyfastx.Fasta(FILEPATH)
# Read the first 50 million records one by one, by integer index
for idx in tqdm(range(int(5e7))):
    a = loaded_fasta[idx]

Maybe it has to do with multiple workers accessing the same fasta file? I'm afraid I cannot post the actual code I'm using at this point.
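
If the multiple-workers guess is right (it is unconfirmed in this thread), one thing to try is giving every worker process its own handle instead of sharing a single `pyfastx.Fasta` object across processes. A minimal sketch, where `PerProcessHandle` is a hypothetical helper and not part of pyfastx:

```python
import os


class PerProcessHandle:
    """Lazily create one reader per process.

    The factory is called again whenever the current process ID differs
    from the one that created the cached handle, so a forked worker
    opens its own reader instead of reusing the parent's.
    """

    def __init__(self, factory):
        self._factory = factory  # zero-argument callable that opens a reader
        self._pid = None
        self._handle = None

    def get(self):
        pid = os.getpid()
        if self._handle is None or self._pid != pid:
            self._handle = self._factory()
            self._pid = pid
        return self._handle
```

A worker would then build it once with `fasta = PerProcessHandle(lambda: pyfastx.Fasta(FILEPATH))` and read with `fasta.get()[idx]`. Whether this makes the KeyError go away is untested here.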
