Empty *faa|missing files #95

Open
mwylerCH opened this issue Apr 23, 2024 · 5 comments
@mwylerCH

Dear Dev team,
Your tool is pretty handy. However, I noticed that it tries to download files that don't exist.
In particular, when I try to download amino acid sequences (e.g. GCA_016840515.1_ASM1684051v1_protein.faa.gz), I get a file and no error. However, when I look into it, I find an XHTML page with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Object not found!</title>
<link rev="made" href="mailto:%5bno%20address%20given%5d" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/
    body { color: #000000; background-color: #FFFFFF; }
    a:link { color: #0000CC; }
    p, address {margin-left: 3em;}
    span {font-size: smaller;}
/*]]>*/--></style>
</head>

<body>
<h1>Object not found!</h1>
<p>


    The requested URL was not found on this server.



    If you entered the URL manually please check your
    spelling and try again.



</p>
<p>
If you think this is a server error, please contact
the <a href="mailto:%5bno%20address%20given%5d">webmaster</a>.

</p>

<h2>Error 404</h2>
<address>
  <a href="/">ftp.ncbi.nlm.nih.gov</a><br />
  <span>Apache</span>
</address>
</body>
</html>

Of course it would be easy to identify these files, but I think it's an issue when I'm filtering genomes with, for example, -A "species:1".
Please correct me if I'm missing something.
Greetings
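
Identifying such files after a run can be sketched as follows. This is only a minimal sketch and not part of genome_updater: the function names `is_valid_gzip` and `list_invalid_gzip` are hypothetical. It relies on `gzip -t`, which fails for anything that is not a valid gzip stream, such as the saved 404 page above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of genome_updater): succeeds only if the
# file exists and passes gzip's built-in integrity test.
is_valid_gzip() {
    [ -f "$1" ] && gzip -t -- "$1" 2>/dev/null
}

# Hypothetical helper: print every *.gz file under a directory that fails
# the gzip integrity test (for example, a saved HTML error page).
list_invalid_gzip() {
    find "$1" -type f -name '*.gz' ! -exec gzip -t {} \; -print 2>/dev/null
}
```

For example, `list_invalid_gzip "$TEMPDIR/AA_bacteria"` would list the suspicious files in the output directory used later in this thread, so they can be inspected or removed.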

@pirovc
Owner

pirovc commented Apr 24, 2024

Hi, thanks for reporting. I will look into this issue to see if there's a bug. Did you try using the -m parameter to check file integrity? I believe that would solve your issue.

@mwylerCH
Author

mwylerCH commented Apr 24, 2024

Hi,
Yes, I did. I forgot to include the command (with v0.6.3):

NAME=bacteria
$HOME/genome_updater.sh \
   -A "species:1" -d "genbank" -g "$NAME" -f "protein.faa.gz" -o "$TEMPDIR/AA_${NAME}" -t 20 -m -L curl

@mwylerCH
Author

mwylerCH commented Apr 24, 2024

I can further confirm the issue with genomic DNA (GCA_037198385.1_ASM3719838v1_genomic.fna.gz).
The sequence has a "suppressed" status on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037198385.1/).
Command:

$HOME/genome_updater.sh \
   -A "species:1" -d "genbank" -g "viral" \
   -l "complete genome" -f "genomic.fna.gz" \
   -o "$TEMPDIR/virus_RefSeq" -t 50 -m -L curl

File content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Object not found!</title>
<link rev="made" href="mailto:%5bno%20address%20given%5d" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/
    body { color: #000000; background-color: #FFFFFF; }
    a:link { color: #0000CC; }
    p, address {margin-left: 3em;}
    span {font-size: smaller;}
/*]]>*/--></style>
</head>

<body>
<h1>Object not found!</h1>
<p>


    The requested URL was not found on this server.



    If you entered the URL manually please check your
    spelling and try again.



</p>
<p>
If you think this is a server error, please contact
the <a href="mailto:%5bno%20address%20given%5d">webmaster</a>.

</p>

<h2>Error 404</h2>
<address>
  <a href="/">ftp.ncbi.nlm.nih.gov</a><br />
  <span>Apache</span>
</address>
</body>
</html>

@pirovc
Owner

pirovc commented Apr 24, 2024

Indeed, if the MD5 file is not available, genome_updater keeps the downloaded file. For now, you can change the following line to return 1, and genome_updater should then skip those files. I will fix this bug in the next release.

@pirovc pirovc added the bug label Apr 24, 2024
@mwylerCH
Author

Sorry that I'm only replying now, but I still get the same error, even with:

if [ -z "${ftp_md5}" ]; then
    echolog "${file_name} MD5checksum file not available [${md5checksums_url}] - FILE KEPT" "0"
    return 1
else

Run with the same command as stated above.
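
The behaviour discussed above, rejecting a file when no MD5 entry is available, can be illustrated with a standalone sketch. This is not genome_updater's actual code: the function `verify_md5` is hypothetical, the checksum listing is assumed to use the `<md5>  ./<file name>` layout, and `md5sum` (GNU coreutils) is assumed to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not genome_updater's actual code): verify a downloaded
# file against a checksum listing with lines of the form "<md5>  ./<file name>".
# Returns 0 only when an entry exists AND matches; a missing entry or a
# missing listing is treated as a failure so the caller can discard the file.
verify_md5() {
    file="$1"
    checksums="$2"
    name=$(basename "$file")
    # Look up the expected checksum for this file name, ignoring a leading "./".
    expected=$(awk -v n="$name" \
        '{ f = $2; sub(/^\.\//, "", f); if (f == n) { print $1; exit } }' \
        "$checksums" 2>/dev/null)
    if [ -z "$expected" ]; then
        echo "$name: no MD5 entry found - rejecting file" >&2
        return 1
    fi
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

A caller could then run `verify_md5 "$f" md5checksums.txt || rm -f -- "$f"`, so that a saved 404 page without a matching checksum is removed instead of kept.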

2 participants