Fixed issue with links not being found for new google response format #309

FarisHijazi · 2020-04-08T23:12:34Z

The 2020 Google Images update changed where the image information is stored. I found that it is now stored in a script, in the variable `AF_initDataCallback`.

This implementation is backward compatible: it still uses `rg_meta`, and if that doesn't work, it parses the new data instead.
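The fallback order described above could be sketched roughly as follows. This is only an illustration, not the PR's code: the helper names (`parse_legacy`, `find_init_data_scripts`, `extract_raw_metadata`) are hypothetical, and the regular expressions assume that the old format wraps each entry in an element with class `rg_meta` and that the new format calls `AF_initDataCallback` inside a `<script>` tag.

```python
import re


def parse_legacy(html):
    """Return the raw contents of elements with class rg_meta (assumed old format)."""
    return re.findall(r'<div class="rg_meta[^"]*">(.*?)</div>', html, flags=re.S)


def find_init_data_scripts(html):
    """Return the bodies of <script> tags that mention AF_initDataCallback."""
    scripts = re.findall(r'<script[^>]*>(.*?)</script>', html, flags=re.S)
    return [body for body in scripts if 'AF_initDataCallback' in body]


def extract_raw_metadata(html):
    """Try the old format first; fall back to the new script-based format."""
    legacy = parse_legacy(html)
    if legacy:
        return 'rg_meta', legacy
    return 'AF_initDataCallback', find_init_data_scripts(html)
```

The sketch only locates the raw payloads. Decoding the data inside the `AF_initDataCallback` call is a separate step that depends on the actual page layout, which is not reproduced here.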

This code was tested with both Python 3 and Python 2.

…itDataCallback() The Google page contains info in a script variable `AF_initDataCallback`. See the JavaScript that parses it: https://gist.github.com/FarisHijazi/6c9ba3fb315d0ce9bfa62c10dfa8b2f8 This commit is an implementation of that code.

fix-2020-format

I have added an iterator that returns rg_meta objects.

_parse_AF_initDataCallback() The BeautifulSoup library returns text differently under Python 2, and some Unicode decoding had to be done differently for Python 3. There were also some issues with siteAndNameInfo being accessed unsafely; those are fixed as well.

hackgoofer · 2020-04-09T06:43:00Z

can you add bs4 into the requirements.txt?

justin-fay · 2020-04-13T02:30:31Z

Hi, I tried to use this PR locally and am getting errors when running.
My environment:

beautifulsoup4==4.9.0
bs4==0.0.1
-e git+git@github.com:hardikvasa/google-images-download.git@8d60f981d48ee7b5fb46f9541d427f8e81481706#egg=google_images_download
selenium==3.141.0
soupsieve==2.0
urllib3==1.25.8

The command I am using to download images

googleimagesdownload --keywords "Phyllopertha horticola" --limit 10 --chromedriver '/usr/bin/chromedriver'

The exception raised when running

Item no.: 1 --> Item name = Phyllopertha horticola
Evaluating...
Starting Download...
Traceback (most recent call last):
  File "/home/justin/projects/fastai/homework1/env/bin/googleimagesdownload", line 11, in <module>
    load_entry_point('google-images-download', 'console_scripts', 'googleimagesdownload')()
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 1124, in main
    paths,errors = response.download(arguments)  #wrapping response in a variable just for consistency
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 934, in download
    paths, errors = self.download_executor(arguments)
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 1061, in download_executor
    items,errorCount,abs_path = self._get_all_items(raw_html,main_directory,dir_name,limit,arguments)    #get all image items and download images
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 753, in _get_all_items
    self._parse_AF_initDataCallback(page)
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 901, in _parse_AF_initDataCallback
    metas = get_metas(page)
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 858, in get_metas
    entry = entries[-1]
IndexError: list index out of range

hackgoofer · 2020-04-25T19:14:59Z

yup, verified this PR doesn't work.

this won't fix failures, but it will catch any errors in _parse_AF_initDataCallback() and will stop them from rising any higher

FarisHijazi · 2020-05-12T02:09:34Z

> can you add bs4 into the requirements.txt?

done

I didn't want to add it to the requirements because it is an optional dependency.
The code should still run without errors when bs4 is not installed.
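A common way to treat bs4 as optional, as described above, is to guard the import and degrade gracefully when it is missing. The sketch below is an assumption about how that could look, not the PR's actual code; `script_texts` is a hypothetical helper name.

```python
try:
    from bs4 import BeautifulSoup  # optional dependency
except ImportError:
    BeautifulSoup = None  # signal that bs4 is unavailable


def script_texts(html):
    """Return the text of each <script> tag, or [] when bs4 is not installed."""
    if BeautifulSoup is None:
        return []
    soup = BeautifulSoup(html, 'html.parser')
    # tag.string is None when a tag has no single text child, so skip those.
    return [str(tag.string) for tag in soup.find_all('script') if tag.string is not None]
```

With this shape, callers never need to know whether bs4 is present; they simply get no script text when it is absent.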

ghost · 2020-05-14T12:16:04Z

Hi
I also have a problem after installing bs4.

The command that I am running:
googleimagesdownload --keywords "tree" --limit 10 --chromedriver /Users/reza/Downloads/chromedriver/chromedriver

The error that I got:

Item no.: 1 --> Item name = tree
Evaluating...
Starting Download...
WARNING: _parse_AF_initDataCallback failed list index out of range
Traceback (most recent call last):
  File "/Users/reza/projects/tmp-test/my_env/bin/googleimagesdownload", line 11, in <module>
    load_entry_point('google-images-download==2.8.0', 'console_scripts', 'googleimagesdownload')()
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 1129, in main
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 939, in download
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 1066, in download_executor
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 765, in _get_all_items
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 722, in _get_next_item
TypeError: 'NoneType' object is not an iterator
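The `TypeError` above is what Python raises when `next()` is called on `None`, which can happen if a parser failure is caught but the iterator it was supposed to produce is never assigned. One possible way to avoid that, sketched here with hypothetical names (`safe_metadata_iter`, `broken_parser`) rather than the project's real functions, is to always hand back an iterator, even an empty one:

```python
def safe_metadata_iter(parse, page):
    """Run parse(page); on failure or a None result, return an empty iterator."""
    try:
        result = parse(page)
    except Exception as exc:  # broad on purpose: any parser failure is non-fatal here
        print('WARNING: parsing failed:', exc)
        return iter(())
    return iter(result) if result is not None else iter(())


def broken_parser(page):
    """Stand-in parser that fails the same way as the traceback: indexing an empty list."""
    entries = []
    return entries[-1]


items = safe_metadata_iter(broken_parser, '<html></html>')
first = next(items, None)  # None instead of a TypeError
```

Passing a default to `next()` also keeps the caller from having to handle `StopIteration` when nothing was parsed.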

marian-code · 2020-05-19T19:48:43Z

This works for me, and it is the right approach to the problem. It might have a few bugs that need to be sorted out before it works for everyone, but @FarisHijazi is right about AF_initDataCallback: I checked, and the required information is certainly there. It just needs to be parsed out.

cooperdk · 2020-05-29T23:16:12Z

Works, but it thinks every image is a GIF, even when it's not.

cooperdk · 2020-05-30T00:11:53Z

At line 776, before:

if arguments['metadata']:

Insert this:

                imageURL = object['image_link']
                object['image_format'] = imageURL.split(".")[-1]

If you don't, your script will think that all images are in GIF format.
This is an easy fix and it will only look at the original file extension.
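Splitting the whole URL on "." can misfire when the link carries a query string or has no extension at all. A slightly more defensive variant of the same idea, offered only as a sketch (`guess_image_format` is a hypothetical helper, not part of the script), looks at the URL path alone:

```python
import os
from urllib.parse import urlparse  # Python 3; Python 2 has this in the urlparse module


def guess_image_format(image_url, default=''):
    """Return the lowercase file extension of the URL path, or default if there is none."""
    path = urlparse(image_url).path          # drops the query string and fragment
    ext = os.path.splitext(path)[1]          # e.g. '.JPG', or '' when there is no extension
    return ext[1:].lower() if ext else default


fmt = guess_image_format('https://example.com/a/b/photo.JPG?w=100')
```

Like the snippet above, this still trusts the file extension and does not inspect the image data itself.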

Joeclinton1 · 2020-06-27T13:50:53Z

This is a duplicate of #298

FarisHijazi added 3 commits April 8, 2020 17:24

fix: fixed illegal characters like ('?') showing in filenames

a944105

FarisHijazi changed the title ~~Fixed issue with links not being found for new google response format #298~~ Fixed issue with links not being found for new google response format Apr 8, 2020

FarisHijazi added 2 commits May 12, 2020 05:04

fix: surrounded _parse_AF_initDataCallback() with try/except

5d8a38b

this won't fix failures, but it will catch any errors in _parse_AF_initDataCallback() and will stop them from rising any higher

chore(deps): Updated requirements: added bs4

15ed539

FarisHijazi mentioned this pull request Jan 25, 2021

Download Images MauryaRitesh/ImageDownloader#2

Open
