Issue when fetching large database #31

Open
JejM opened this issue Nov 14, 2019 · 3 comments

JejM commented Nov 14, 2019

When trying to collect the whole ITS2 database for Viridiplantae, the process breaks at batch 22300. It broke at different batches during previous trials (e.g. 23600). Possibly this is outside the scope of BCdatabaser and it cannot handle such a large download.

Specifics:
Ubuntu 18.04.3
BCdatabaser is run through Docker and set up according to the instructions
primer file is identical to the one provided here (i.e. Sickel et al. 2015)
attached the log file: bcdatabaser.log

Member

iimog commented Nov 14, 2019

Hi @JejM, thanks for reporting this. Sorry to hear that you have trouble creating the database you want. In general, bcdatabaser places no limit on the database size. However, we have had problems with network connections to NCBI, especially when sending many or large requests in a short time. Unfortunately, bcdatabaser is not yet very robust against these network problems (see #16). We plan to work on a mechanism to retry failed batches, but this is not yet implemented. Currently, if a single batch fails, the whole bcdatabaser run fails, and in your case there are >2000 batches to download.
One thing you can try to verify that it is indeed a temporary issue is to docker exec into your docker container and re-run the last command to see whether it succeeds this time or whether it produces a reproducible error:

tail -n+22201 viridiplantae.its2.14-11-19_trimmed/list.filtered.txt | head -n 100 | cut -f1 | epost -db nuccore | efetch -format fasta >>viridiplantae.its2.14-11-19_trimmed/sequences.fa

Let me know if it is another error that we can work on to fix, otherwise feel free to add your 👍 to #16 to increase its priority.
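
Until a retry mechanism is part of bcdatabaser, a manual re-run of a single batch can be wrapped in a small retry loop. The following is only a minimal sketch, not how bcdatabaser works internally: the function names `retry` and `fetch_batch` and the variables `RETRY_MAX` and `RETRY_DELAY` are hypothetical, and it assumes that a failed `efetch` either exits non-zero or writes no output.

```shell
# Sketch only: re-run a command until it succeeds, doubling the pause
# between attempts. RETRY_MAX and RETRY_DELAY are hypothetical knobs,
# not bcdatabaser options.
retry() {
  local max="${RETRY_MAX:-5}"
  local delay="${RETRY_DELAY:-10}"
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Fetch one batch of accessions into a temporary file first, so that a
# failed attempt never appends a partial batch to the output FASTA.
# Arguments: list file, 1-based start line, batch size, output FASTA.
# Assumption: a failed efetch exits non-zero or produces an empty file.
fetch_batch() {
  local list="$1"
  local start="$2"
  local size="$3"
  local out="$4"
  local tmp
  tmp="$(mktemp)" || return 1
  if tail -n "+$start" "$list" | head -n "$size" | cut -f1 \
      | epost -db nuccore | efetch -format fasta >"$tmp" && [ -s "$tmp" ]; then
    cat "$tmp" >>"$out"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

For the batch from the command above, this could be called inside the container as `retry fetch_batch viridiplantae.its2.14-11-19_trimmed/list.filtered.txt 22201 100 viridiplantae.its2.14-11-19_trimmed/sequences.fa`. Writing to a temporary file first is a deliberate choice: appending directly with `>>` could leave half a batch in `sequences.fa` if the connection drops mid-download.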

Author

JejM commented Nov 15, 2019

Thanks for the feedback. Reading the other issues properly would have prevented the duplicate, apologies. When I re-ran the fetch, taking only 1 replicate of every taxon, it finished correctly. So, as you said, it is most likely related to network issues with NCBI. With a little 'luck', fetching large databases can still work with the current version of bcdatabaser. It is definitely the most comfortable method out there. Thank you for this.

Collaborator

chiras commented Nov 15, 2019

@JejM You can also download a full ITS2 plant dataset that has already been deposited. It was generated with the BCdatabaser and the web default settings: https://zenodo.org/record/3339029#.Xc6XPC1oTKg

iimog added a commit that referenced this issue Nov 2, 2021
Retry to download failed batches, Issues #16 & #31