pyani download is blocked if downloaded file cannot be uncompressed. #383

Closed
widdowquinn opened this issue Mar 9, 2022 · 6 comments · May be fixed by #385
Labels
bug something isn't working how it should

Comments

@widdowquinn
Owner

Summary:

pyani downloads are blocked if a downloaded file cannot be uncompressed.

Description:

pyani download sometimes retrieves corrupt compressed files from NCBI. If gunzip throws an error on one of these, the whole download halts.

What should happen is that the error is noted, and pyani continues with the remaining downloads.
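
A minimal sketch of that behaviour, assuming decompression is done with Python's standard `gzip` module rather than the `gunzip` subprocess call shown in the traceback below. `extract_all` is a hypothetical helper for illustration and is not part of pyani; the same pattern would apply to catching `subprocess.CalledProcessError` around the existing call.

```python
import gzip
import logging
import shutil
import zlib
from pathlib import Path

logger = logging.getLogger(__name__)


def extract_all(archives):
    """Decompress each .gz archive, skipping any that fail.

    Hypothetical helper (not pyani code). Returns two lists:
    paths of extracted files and paths of archives that were skipped.
    """
    extracted, skipped = [], []
    for archive in map(Path, archives):
        target = archive.with_suffix("")  # drop the trailing .gz
        try:
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        except (OSError, EOFError, zlib.error) as exc:
            # Note the failure and move on instead of aborting the whole run.
            logger.warning("Could not extract %s: %s", archive, exc)
            target.unlink(missing_ok=True)  # remove any partial output
            skipped.append(archive)
            continue
        extracted.append(target)
    return extracted, skipped
```

Returning the skipped archives separately lets the caller report them at the end of the run, or retry them later.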

Reproducible Steps:

Three attempts, same error:

pyani download -v -l 01-download -t 1847 -o genomes --email me@dev.null -f
[...]
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving eSummary information for UID 11696811
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving URLs for GCF_021343995.1_ASM2134399v1
GCF_021343995.1_ASM2134399v1_genomic.fna.gz: 4194304it [00:08, 513382.53it/s]
GCF_021343995.1_ASM2134399v1_hashes.txt: 1048576it [00:00, 479455631.87it/s]
[WARNING] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check failed. Please check and retry.
gunzip: data stream error
gunzip: genomes/GCF_021343995.1_ASM2134399v1_genomic.fna.gz: uncompress failed
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/pyani_py39/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/pyani_script.py", line 126, in run_main
    returnval = args.func(args)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/subcommands/subcmd_download.py", line 372, in subcmd_download
    classes, labels, skippedlist = download_data(args, api_key, asm_dict)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/subcommands/subcmd_download.py", line 157, in download_data
    extract_genomes(args, dlstatus, esummary)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/subcommands/subcmd_download.py", line 185, in extract_genomes
    download.extract_contigs(dlstatus.outfname, ename)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/download.py", line 551, in extract_contigs
    return subprocess.run(cmd, stdout=efh, check=True, shell=False)
  File "/Users/lpritc/opt/anaconda3/envs/pyani_py39/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['gunzip', '-c', 'genomes/GCF_021343995.1_ASM2134399v1_genomic.fna.gz']' returned non-zero exit status 1.

Current Output:

[WARNING] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check failed. Please check and retry.
gunzip: data stream error
gunzip: genomes/GCF_021343995.1_ASM2134399v1_genomic.fna.gz: uncompress failed
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/pyani_py39/bin/pyani", line 33, in <module>
    sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/pyani_script.py", line 126, in run_main
    returnval = args.func(args)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/subcommands/subcmd_download.py", line 372, in subcmd_download
    classes, labels, skippedlist = download_data(args, api_key, asm_dict)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/subcommands/subcmd_download.py", line 157, in download_data
    extract_genomes(args, dlstatus, esummary)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/scripts/subcommands/subcmd_download.py", line 185, in extract_genomes
    download.extract_contigs(dlstatus.outfname, ename)
  File "/Users/lpritc/Documents/Development/GitHub/pyani/pyani/download.py", line 551, in extract_contigs
    return subprocess.run(cmd, stdout=efh, check=True, shell=False)
  File "/Users/lpritc/opt/anaconda3/envs/pyani_py39/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['gunzip', '-c', 'genomes/GCF_021343995.1_ASM2134399v1_genomic.fna.gz']' returned non-zero exit status 1.

Expected Output:

The equivalent of the below, for the downloaded genome:

[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving eSummary information for UID 11715191
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving URLs for GCF_000196675.2_ASM19667v2
GCF_000196675.2_ASM19667v2_genomic.fna.gz: 2097152it [00:04, 450203.82it/s]
GCF_000196675.2_ASM19667v2_hashes.txt: 1048576it [00:00, 3168621405.69it/s]
[INFO] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check passed
[INFO] [pyani.scripts.subcommands.subcmd_download]: Label and class file entries
	Label: 2863143433ceedac6d96ff91f7a3797f	GCF_000196675.2_ASM19667v2_genomic	P. dioxanivorans CB1190
	Class: 2863143433ceedac6d96ff91f7a3797f	GCF_000196675.2_ASM19667v2_genomic	Pseudonocardia dioxanivorans

pyani Version:

v0.3-alpha

Python Version:

3.9

Operating System:

macOS 12.2.1

@widdowquinn widdowquinn added the bug something isn't working how it should label Mar 9, 2022
@widdowquinn
Owner Author

I think this is different to #70

@baileythegreen
Contributor

I believe this is also causing tests to fail, specifically these:

tests/test_subcmd_01_download.py::test_download_dry_run FAILED                                                                    [ 70%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia FAILED                                                              [ 71%]
tests/test_subcmd_01_download.py::test_download_kraken FAILED                                                                     [ 72%]

dryrun_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr.../tmp/pytest-of-baileythegreen/pytest-17/test_download_dry_run0/C_blochmannia'), retries=20, taxon='203804', timeout=10)

    def test_download_dry_run(dryrun_namespace):
        """Dry run of C. blochmannia download."""
>       subcommands.subcmd_download(dryrun_namespace)

tests/test_subcmd_01_download.py:128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
    classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
    esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
    summary = entrez_esummary(
pyani/download.py:237: in wrapper
    return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
    record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
    self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Bio.Entrez.Parser.DataHandler object at 0x127a6b280>, tag = 'eSummaryResult', attrs = {}

    def handleMissingDocumentDefinition(self, tag, attrs):
        """Raise an Exception if neither a DTD nor an XML Schema is found."""
>       raise ValueError(
            "As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
        )
E       ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.

../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
--------------------------------------------------------- Captured stderr call ----------------------------------------------------------
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
----------------------------------------------------------- Captured log call -----------------------------------------------------------
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:356 Downloading genomes from NCBI
WARNING  pyani.scripts.subcommands.subcmd_download:subcmd_download.py:360 Dry run only: will not overwrite or download
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:76 Setting Entrez email address: my.email@my.domain
WARNING  pyani.scripts.subcommands.subcmd_download:subcmd_download.py:339 API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:316 Taxon IDs received: ['203804']
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:319 Taxon ID summary
	Query: 203804
	asm count: 9
	UIDs: ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:117 Downloading contigs for Taxon ID ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 8228891
______________________________________________________ test_download_c_blochmannia ______________________________________________________

base_download_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr...ytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia'), retries=20, taxon='203804', timeout=10)

    def test_download_c_blochmannia(base_download_namespace):
        """Test C. blochmannia download."""
>       subcommands.subcmd_download(base_download_namespace)

tests/test_subcmd_01_download.py:133: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
    classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
    esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
    summary = entrez_esummary(
pyani/download.py:237: in wrapper
    return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
    record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
    self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Bio.Entrez.Parser.DataHandler object at 0x1275e7160>, tag = 'eSummaryResult', attrs = {}

    def handleMissingDocumentDefinition(self, tag, attrs):
        """Raise an Exception if neither a DTD nor an XML Schema is found."""
>       raise ValueError(
            "As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
        )
E       ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.

../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
--------------------------------------------------------- Captured stderr call ----------------------------------------------------------
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
----------------------------------------------------------- Captured log call -----------------------------------------------------------
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:356 Downloading genomes from NCBI
INFO     pyani.scripts:__init__.py:39 Creating output directory /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia
WARNING  pyani.scripts:__init__.py:42 Output directory overwrite forced
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:76 Setting Entrez email address: my.email@my.domain
WARNING  pyani.scripts.subcommands.subcmd_download:subcmd_download.py:339 API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:316 Taxon IDs received: ['203804']
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:319 Taxon ID summary
	Query: 203804
	asm count: 9
	UIDs: ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:117 Downloading contigs for Taxon ID ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 8228891
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:139 eSummary information (GCF_014857065.1_ASM1485706v1):
	Species Taxid: 2681987
	TaxID: 2681987
	Accession: GCF_014857065.1
	Name: ASM1485706v1
	Organism: Blochmannia endosymbiont of Colobopsis nipponica
	Genus: Blochmannia
	Species: endosymbiont of Colobopsis nipponica
	Strain: 
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:239 Retrieving URLs for GCF_014857065.1_ASM1485706v1
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:292 Downloaded from URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/857/065/GCF_014857065.1_ASM1485706v1/GCF_014857065.1_ASM1485706v1_genomic.fna.gz
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:293 Wrote assembly to: /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna.gz
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:294 Wrote MD5 hashes to: /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_hashes.txt
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:298 Local MD5 hash: fbd87dfdbb889fad197db147c90790f8
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:299 NCBI MD5 hash: fbd87dfdbb889fad197db147c90790f8
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:301 MD5 hash check passed
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:184 Extracting archive /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna.gz to /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:211 Creating local MD5 hash for /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna
DEBUG    pyani.scripts.subcommands.subcmd_download:subcmd_download.py:214 Writing hash to /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.md5
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:161 Label and class file entries
	Label: fb08eedc0cf49e1cf44a95539ae4fd7c	GCF_014857065.1_ASM1485706v1_genomic	B. endosymbiont of Colobopsis nipponica 
	Class: fb08eedc0cf49e1cf44a95539ae4fd7c	GCF_014857065.1_ASM1485706v1_genomic	Blochmannia endosymbiont of Colobopsis nipponica
INFO     pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 5431901
_________________________________________________________ test_download_kraken __________________________________________________________

kraken_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr.../private/tmp/pytest-of-baileythegreen/pytest-17/test_download_kraken0/kraken'), retries=20, taxon='203804', timeout=10)

    def test_download_kraken(kraken_namespace):
        """C. blochmannia download in Kraken format."""
>       subcommands.subcmd_download(kraken_namespace)

tests/test_subcmd_01_download.py:138: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
    classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
    esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
    summary = entrez_esummary(
pyani/download.py:237: in wrapper
    return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
    record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
    self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Bio.Entrez.Parser.DataHandler object at 0x1039aeb80>, tag = 'eSummaryResult', attrs = {}

    def handleMissingDocumentDefinition(self, tag, attrs):
        """Raise an Exception if neither a DTD nor an XML Schema is found."""
>       raise ValueError(
            "As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
        )
E       ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.

../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
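
The `ValueError` above suggests falling back to a generic XML parser such as ElementTree when a response carries neither a DTD nor a schema. A small sketch of what that inspection could look like follows. The payload is made up for illustration (only the `eSummaryResult` tag comes from the traceback; the `ERROR` element and its text are assumptions), and `inspect_response` is a hypothetical helper, not part of pyani or Biopython.

```python
import xml.etree.ElementTree as ET

# Made-up payload for illustration; a real eSummary response will differ.
payload = b"<eSummaryResult><ERROR>temporarily unavailable</ERROR></eSummaryResult>"


def inspect_response(xml_bytes):
    """Return (root tag, list of ERROR texts) for an XML payload lacking a DTD."""
    root = ET.fromstring(xml_bytes)
    errors = [el.text for el in root.iter("ERROR")]
    return root.tag, errors


print(inspect_response(payload))
```

Logging the root tag and any error text before handing the data to `Entrez.read` may help tell a transient server-side error apart from a genuine parsing problem.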

@widdowquinn
Owner Author

I think we need to investigate why these tests are now failing due to the uncompression, when they were previously working. Is there something about the download that has changed?

@baileythegreen
Contributor

If you are able to reproduce the failures, you are welcome to try. I no longer seem to be able to. I did find this GitHub issue, part of which seemed to indicate something like this could be caused by a temporary issue, but I can't say if that's what happened here.

The traceback I copied above is the only example I have of those tests failing locally.

@widdowquinn
Owner Author

The DTD file issue is different (I've encountered it before).

With the issue I originally raised, several .fna.gz files were downloaded by the script that were corrupt. They could not be opened in pyani download and they could not be opened from the command line, either.
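
One way to confirm that kind of corruption from Python, roughly what `gunzip -t` does on the command line, is to read the whole stream and see whether it decompresses cleanly. This is a sketch only; `gzip_is_readable` is a hypothetical name and not a pyani function.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so truncation and checksum errors surface.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Reading in chunks keeps memory use flat for large genome archives.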

@widdowquinn
Owner Author

On testing the above command again today (2022-03-15) the downloads proceed without error.

I'm calling this a transitory issue, possibly a fault at NCBI's end, and closing the issue.
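
For transient faults of this kind, a generic retry with exponential backoff is one common mitigation. The sketch below is illustrative only: `retry` and its parameters are hypothetical, it is not how pyani handles its own `retries` option, and the `sleep` argument exists purely so the waiting can be stubbed out in tests.

```python
import time


def retry(func, attempts=3, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            # Wait delay, 2*delay, 4*delay, ... between attempts.
            sleep(delay * 2 ** (attempt - 1))
```

A caller would wrap a single download-and-verify step, for example `retry(lambda: fetch_one(url), exceptions=(OSError,))`, where `fetch_one` stands in for whatever function performs the download.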

2 participants