After the interruption, download from beginning #15

Open
opconty opened this issue Aug 21, 2017 · 3 comments

Comments

opconty commented Aug 21, 2017

I had downloaded some of the zip files when an error occurred during the download. When I restart the download process, it starts again from the first file. Here is my temporary solution,
inside the download function:
    for fname, url, request in iter_google_store(ngram_len, verbose=verbose, lang=lang):
        # Added check: skip files that already exist in the output directory
        if os.path.exists(str(output.join(fname))):
            print('already exist')
            continue
        else:
            with output.join(fname).open('wb') as f:
                print(output.join(fname), 'downloading...')
                for num, chunk in enumerate(request.iter_content(1024)):
                    if verbose and not divmod(num, 1024)[1]:
                        sys.stderr.write('.')
                        sys.stderr.flush()
                    f.write(chunk)
Maybe this has already been handled, but are there any better solutions? Thanks.
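
One caveat with a plain existence check: the file that was being written when the download stopped is left on disk in a truncated state, and the check would skip it on the next run. A minimal sketch of one way around that, assuming each file is first written under a temporary `.part` name and renamed only once it is complete. The helper names `save_chunks_atomically` and `download_missing` are hypothetical and not part of google-ngram-downloader; the `get_chunks` callable stands in for something like `request.iter_content(1024)` from the snippet above.

```python
import os


def save_chunks_atomically(dest_path, chunks):
    """Write byte chunks to '<dest_path>.part', then rename to dest_path."""
    part_path = dest_path + ".part"
    with open(part_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    # The final name only appears once every chunk has been written.
    os.replace(part_path, dest_path)


def download_missing(items, output_dir):
    """Fetch only the files that are not yet present in output_dir.

    items yields (fname, get_chunks) pairs, where get_chunks is a
    hypothetical zero-argument callable returning an iterable of bytes.
    Returns the lists of skipped and fetched file names.
    """
    skipped, fetched = [], []
    for fname, get_chunks in items:
        dest = os.path.join(output_dir, fname)
        if os.path.exists(dest):
            # A file under its final name is assumed to be complete.
            skipped.append(fname)
            continue
        # A leftover '.part' file from an earlier run is simply overwritten.
        save_chunks_atomically(dest, get_chunks())
        fetched.append(fname)
    return skipped, fetched
```

With this layout an interrupted run leaves only a `.part` file behind, so the existence check never mistakes a partial download for a finished one.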

tianhuil commented Sep 9, 2019

Hi, there is a pending PR from my fork that solves this, but you can pip install the fork in the meantime:

> pip install git+git://github.com/tianhuil/google-ngram-downloader.git@master

lehigh123 commented Feb 28, 2021

@tianhuil If I install your fork above will I be able to run

google-ngram-downloader download -n 3 -o .

in a directory where I already have some of the length 3 ngrams downloaded? Or do I need to specify that I am using your specific version to get the functionality where the downloads will not restart from the beginning?

lehigh123 commented Mar 2, 2021

In case anyone else happens upon this post: I wanted a way to stop the downloads and then come back and continue downloading where I'd left off. This is useful when downloading any ngrams of size greater than 1, since they take many hours to download. The current implementation just restarts from the very first ngram. If you update util.py and add

  1. This import, to use os functions
import os
  2. This simple if check inside the iter_google_store for loop
        if os.path.isfile(fname):
            sys.stderr.write(fname)
            continue

and re-run the command in an output directory where some ngrams have already been downloaded, it will continue downloading at the next undownloaded ngram.

Here is some sample output in a directory where I'd downloaded the A, B and some of the C ngrams:

/Volumes/Seagate » google-ngram-downloader download -n 3 -o . -v
googlebooks-eng-all-3gram-20120701-0.gz
googlebooks-eng-all-3gram-20120701-1.gz
googlebooks-eng-all-3gram-20120701-2.gz
googlebooks-eng-all-3gram-20120701-3.gz
googlebooks-eng-all-3gram-20120701-4.gz
googlebooks-eng-all-3gram-20120701-5.gz
googlebooks-eng-all-3gram-20120701-6.gz
googlebooks-eng-all-3gram-20120701-7.gz
googlebooks-eng-all-3gram-20120701-8.gz
googlebooks-eng-all-3gram-20120701-9.gz
googlebooks-eng-all-3gram-20120701-aa.gz
googlebooks-eng-all-3gram-20120701-ab.gz
googlebooks-eng-all-3gram-20120701-ac.gz
googlebooks-eng-all-3gram-20120701-ad.gz
googlebooks-eng-all-3gram-20120701-ae.gz
googlebooks-eng-all-3gram-20120701-af.gz
googlebooks-eng-all-3gram-20120701-ag.gz
googlebooks-eng-all-3gram-20120701-ah.gz
googlebooks-eng-all-3gram-20120701-ai.gz
googlebooks-eng-all-3gram-20120701-aj.gz
googlebooks-eng-all-3gram-20120701-ak.gz
googlebooks-eng-all-3gram-20120701-al.gz
googlebooks-eng-all-3gram-20120701-am.gz
googlebooks-eng-all-3gram-20120701-an.gz
googlebooks-eng-all-3gram-20120701-ao.gz
googlebooks-eng-all-3gram-20120701-ap.gz
googlebooks-eng-all-3gram-20120701-aq.gz
googlebooks-eng-all-3gram-20120701-ar.gz
googlebooks-eng-all-3gram-20120701-as.gz
googlebooks-eng-all-3gram-20120701-at.gz
googlebooks-eng-all-3gram-20120701-au.gz
googlebooks-eng-all-3gram-20120701-av.gz
googlebooks-eng-all-3gram-20120701-aw.gz
googlebooks-eng-all-3gram-20120701-ax.gz
googlebooks-eng-all-3gram-20120701-ay.gz
googlebooks-eng-all-3gram-20120701-az.gz
googlebooks-eng-all-3gram-20120701-a_.gz
googlebooks-eng-all-3gram-20120701-ba.gz
googlebooks-eng-all-3gram-20120701-bb.gz
googlebooks-eng-all-3gram-20120701-bc.gz
googlebooks-eng-all-3gram-20120701-bd.gz
googlebooks-eng-all-3gram-20120701-be.gz
googlebooks-eng-all-3gram-20120701-bf.gz
googlebooks-eng-all-3gram-20120701-bg.gz
googlebooks-eng-all-3gram-20120701-bh.gz
googlebooks-eng-all-3gram-20120701-bi.gz
googlebooks-eng-all-3gram-20120701-bj.gz
googlebooks-eng-all-3gram-20120701-bk.gz
googlebooks-eng-all-3gram-20120701-bl.gz
googlebooks-eng-all-3gram-20120701-bm.gz
googlebooks-eng-all-3gram-20120701-bn.gz
googlebooks-eng-all-3gram-20120701-bo.gz
googlebooks-eng-all-3gram-20120701-bp.gz
googlebooks-eng-all-3gram-20120701-bq.gz
googlebooks-eng-all-3gram-20120701-br.gz
googlebooks-eng-all-3gram-20120701-bs.gz
googlebooks-eng-all-3gram-20120701-bt.gz
googlebooks-eng-all-3gram-20120701-bu.gz
googlebooks-eng-all-3gram-20120701-bv.gz
googlebooks-eng-all-3gram-20120701-bw.gz
googlebooks-eng-all-3gram-20120701-bx.gz
googlebooks-eng-all-3gram-20120701-by.gz
googlebooks-eng-all-3gram-20120701-bz.gz
googlebooks-eng-all-3gram-20120701-b_.gz
googlebooks-eng-all-3gram-20120701-ca.gz
googlebooks-eng-all-3gram-20120701-cb.gz
googlebooks-eng-all-3gram-20120701-cc.gz
googlebooks-eng-all-3gram-20120701-cd.gz
googlebooks-eng-all-3gram-20120701-ce.gz
googlebooks-eng-all-3gram-20120701-cf.gz
googlebooks-eng-all-3gram-20120701-cg.gz
googlebooks-eng-all-3gram-20120701-ch.gz
googlebooks-eng-all-3gram-20120701-ci.gz
googlebooks-eng-all-3gram-20120701-cj.gz
googlebooks-eng-all-3gram-20120701-ck.gz
googlebooks-eng-all-3gram-20120701-cl.gz
googlebooks-eng-all-3gram-20120701-cm.gz
googlebooks-eng-all-3gram-20120701-cn.gz
Downloading http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20120701-co.gz .

...continues downloading the rest of the ngrams beginning at co
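
If a run was interrupted partway through a file, that file may still be on disk and would be skipped by an existence check. A minimal sketch of how such a file could be detected before resuming, assuming the downloads are ordinary gzip files as the `.gz` names above suggest. The function name `is_complete_gzip` is hypothetical and not part of google-ngram-downloader.

```python
import gzip
import zlib


def is_complete_gzip(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at path decompresses to the end.

    A truncated or corrupted file, or a missing one, returns False.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data; a truncated stream raises
            # before the end of the file is reached.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

This decompresses the whole file, so it is slow on large ngram archives. Checking only the most recently modified file in the output directory is one way to keep the cost down, since earlier files were finished before the interruption.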
