Repeated sequence not being captured entirely, intermittent data issues #128

Open
wlymanambry opened this issue Jan 2, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments

wlymanambry commented Jan 2, 2024

Describe the bug
I'm using seqrepo to load many small protein sequences, on the order of millions. We are finding that the stored data does not always match the input data. We are still investigating, but the stored sequences appear to be missing part of a run of repeated characters (*) at the beginning of the sequence. As an example, this file was loaded:
image

but the returned sequence only contains 7 asterisks instead of the expected 36:
image

This doesn't appear to be a one-off. We are seeing it frequently in our testing.
image
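To make reports like this easier to compare, the leading run of asterisks can be counted on both sides. The following is a minimal sketch, not part of seqrepo: the helper names `leading_run` and `describe_mismatch` are hypothetical, and the example sequences are made up to mirror the 36-versus-7 counts described above.

```python
def leading_run(seq: str, ch: str = "*") -> int:
    """Return the length of the run of `ch` at the start of `seq`."""
    return len(seq) - len(seq.lstrip(ch))


def describe_mismatch(expected: str, observed: str, ch: str = "*") -> str:
    """Summarize how two sequences differ in their leading run of `ch`."""
    exp_run = leading_run(expected, ch)
    obs_run = leading_run(observed, ch)
    # Compare what follows the leading run, to see whether only the run was cut.
    same_tail = expected[exp_run:] == observed[obs_run:]
    tail_note = "matches" if same_tail else "differs"
    return (
        f"leading {ch!r}: expected {exp_run}, observed {obs_run}; "
        f"remainder {tail_note}"
    )


# Made-up data shaped like the report: 36 asterisks followed by residues.
expected = "*" * 36 + "MKTAYIAK"
observed = "*" * 7 + "MKTAYIAK"
print(describe_mismatch(expected, observed))
```

If the remainder matches, only the repeated prefix was shortened, which narrows where to look.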

To Reproduce
Whatever is causing this doesn't appear to be reproducible. Reloading the sequence a second time appears to resolve the issue.

Expected behavior
The stored sequence and the sequence returned on lookup should both match the input sequence.
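Since the problem is intermittent, one way to catch it is a round-trip check after every load. This is a rough sketch under stated assumptions: `find_mismatches` is a hypothetical helper, `fetch` stands for whatever lookup the loader uses (for example a thin wrapper around a SeqRepo instance's fetch call, which is not shown in this issue), and a plain dict stands in for the store here.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    records: Iterable[Tuple[str, str]],
    fetch: Callable[[str], str],
) -> List[Tuple[str, int, int]]:
    """Return (id, expected length, observed length) for each bad round trip."""
    mismatches = []
    for seq_id, expected in records:
        observed = fetch(seq_id)
        if observed != expected:
            mismatches.append((seq_id, len(expected), len(observed)))
    return mismatches


# Stand-in data: p1 lost part of its leading asterisk run, p2 is intact.
records = [("p1", "*" * 36 + "MKT"), ("p2", "MSTNPKPQR")]
store = {"p1": "*" * 7 + "MKT", "p2": "MSTNPKPQR"}

bad = find_mismatches(records, store.__getitem__)
print(bad)  # [('p1', 39, 10)]
```

Running this over each batch right after it is written would record which identifiers are affected, which may help narrow down the trigger.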

jsstevenson added the bug label Jan 2, 2024
wlymanambry (Author) commented

Sorry, I should have added: if I check the compressed file that was written, it also omits some of the repeated sequence:
Preloaded sequence:
image

Compressed sequence after loading:
image
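For anyone who wants to repeat this check on the written file, here is a minimal sketch. It assumes the compressed file is a gzip-compatible FASTA file (BGZF output from bgzip can be read with Python's `gzip` module); the function name `read_fasta_gz`, the file name, and the record are all hypothetical, and the demo writes its own temporary file rather than touching a real store.

```python
import gzip
import os
import tempfile
from typing import Dict, Optional


def read_fasta_gz(path: str) -> Dict[str, str]:
    """Parse a gzip-compatible FASTA file into {record id: sequence}."""
    records: Dict[str, str] = {}
    current: Optional[str] = None
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # The record id is the first whitespace-delimited header token.
                header = line[1:].strip()
                current = header.split()[0] if header else ""
                records[current] = ""
            elif current is not None:
                # Sequence lines may be wrapped; join them back together.
                records[current] += line
    return records


# Demo on a temporary file, so nothing here depends on a real seqrepo store.
expected = "*" * 36 + "MKT"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(path, "wt", encoding="ascii") as out:
        out.write(">p1 demo record\n" + "*" * 36 + "\nMKT\n")
    stored = read_fasta_gz(path)

print(stored["p1"] == expected)
```

Comparing `stored` against the input in this way shows whether the sequence was already truncated when it was written, rather than when it was read back.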
