Fixed issue with links not being found for new google response format #309

Open
FarisHijazi wants to merge 5 commits into master

@FarisHijazi FarisHijazi commented Apr 8, 2020

The 2020 Google Images update changed where the image information is stored. I found that it is now stored in a script on the page, inside AF_initDataCallback.

This implementation is backward compatible: it first tries the old format (rg_meta), and if that doesn't work, it parses the new one.

This code was tested with both Python 3 and Python 2.
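
The fallback order described above could be sketched roughly as follows. This is not the code from this PR: the two regular expressions, the `extract_image_info` name, the `"ou"` key, and the toy page strings are all assumptions made for illustration, and the real Google markup is more complex. The sketch only locates the raw argument of each AF_initDataCallback(...) call and does not parse it.

```python
import json
import re

# Both patterns are simplified assumptions about the page markup.
RG_META_RE = re.compile(r'<div class="rg_meta[^"]*">(.*?)</div>', re.S)
AF_CALLBACK_RE = re.compile(r"AF_initDataCallback\((.*?)\);", re.S)


def extract_image_info(page):
    """Try the old rg_meta format first, then fall back to the new one."""
    old_entries = [json.loads(text) for text in RG_META_RE.findall(page)]
    if old_entries:
        return "rg_meta", old_entries
    # New format: return the raw argument text of each AF_initDataCallback(...)
    # call. Parsing that argument is left out of this sketch.
    return "AF_initDataCallback", AF_CALLBACK_RE.findall(page)


# Toy pages (hypothetical, heavily simplified).
old_page = '<div class="rg_meta notranslate">{"ou": "http://example.com/a.jpg"}</div>'
new_page = "<script>AF_initDataCallback({key: 'ds:1', data: [1, 2]});</script>"

print(extract_image_info(old_page)[0])
print(extract_image_info(new_page)[0])
```

Returning the name of the format alongside the data makes it easy to log which path was taken when a download run finds no links.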

…itDataCallback()

The Google page contains the image info in a script, inside `AF_initDataCallback`.
See the JavaScript that parses it: https://gist.github.com/FarisHijazi/6c9ba3fb315d0ce9bfa62c10dfa8b2f8
This commit is an implementation of that code.
fix-2020-format

I have added an iterator that returns rg_meta objects
_parse_AF_initDataCallback()

The BeautifulSoup library returns text differently under Python 2, and some Unicode decoding had to be done differently for Python 3.

Some unsafe accesses to siteAndNameInfo were also fixed.
@FarisHijazi changed the title from "Fixed issue with links not being found for new google response format #298" to "Fixed issue with links not being found for new google response format" on Apr 8, 2020
@hackgoofer

can you add bs4 into the requirements.txt?

@justin-fay

Hi, I tried to use this PR locally and am getting errors when running.
My environment:

beautifulsoup4==4.9.0
bs4==0.0.1
-e git+git@github.com:hardikvasa/google-images-download.git@8d60f981d48ee7b5fb46f9541d427f8e81481706#egg=google_images_download
selenium==3.141.0
soupsieve==2.0
urllib3==1.25.8

The command I am using to download images

googleimagesdownload --keywords "Phyllopertha horticola" --limit 10 --chromedriver '/usr/bin/chromedriver'

The exception raised when running

Item no.: 1 --> Item name = Phyllopertha horticola
Evaluating...
Starting Download...
Traceback (most recent call last):
  File "/home/justin/projects/fastai/homework1/env/bin/googleimagesdownload", line 11, in <module>
    load_entry_point('google-images-download', 'console_scripts', 'googleimagesdownload')()
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 1124, in main
    paths,errors = response.download(arguments)  #wrapping response in a variable just for consistency
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 934, in download
    paths, errors = self.download_executor(arguments)
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 1061, in download_executor
    items,errorCount,abs_path = self._get_all_items(raw_html,main_directory,dir_name,limit,arguments)    #get all image items and download images
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 753, in _get_all_items
    self._parse_AF_initDataCallback(page)
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 901, in _parse_AF_initDataCallback
    metas = get_metas(page)
  File "/home/justin/projects/fastai/homework1/google-images-download/google_images_download/google_images_download.py", line 858, in get_metas
    entry = entries[-1]
IndexError: list index out of range
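
The IndexError above is what Python raises when `[-1]` is applied to an empty list, so `entries` was presumably empty for this page. How `entries` is built inside `get_metas` is not shown in this thread; the sketch below only illustrates the guard, and the `last_or_none` helper is a hypothetical name.

```python
def last_or_none(entries):
    """Return the last element of entries, or None when the list is empty."""
    return entries[-1] if entries else None


# An empty list no longer raises IndexError.
print(last_or_none([]))
print(last_or_none(["first", "second"]))
```

The caller then has to check for None explicitly, which makes the "nothing was found on the page" case visible instead of crashing.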

@hackgoofer

yup, verified this PR doesn't work.

This won't fix the failures, but it catches any errors raised in _parse_AF_initDataCallback() and stops them from propagating any further.
@FarisHijazi
Author

can you add bs4 into the requirements.txt?

done

I didn't want to add it to the requirements because it is an optional dependency.
The code should still run without errors when bs4 is not installed.
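
A common way to treat a package as optional is to guard the import and fall back to something simpler when it is missing. The sketch below is not taken from this PR: the `page_text` helper and its regex fallback are assumptions made for illustration.

```python
import re

try:
    from bs4 import BeautifulSoup  # optional dependency
except ImportError:
    BeautifulSoup = None


def page_text(html):
    """Return the visible text of html, with or without bs4 installed."""
    if BeautifulSoup is not None:
        return BeautifulSoup(html, "html.parser").get_text()
    # Crude fallback: strip anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", html)


print(page_text("<p>hello</p>"))
```

With this pattern the module imports cleanly either way, and only the code paths that need bs4 have to check whether it is available.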

@ghost

ghost commented May 14, 2020

Hi
I also have a problem after installing bs4.

The command that I am running:
googleimagesdownload --keywords "tree" --limit 10 --chromedriver /Users/reza/Downloads/chromedriver/chromedriver

The error that I got:

Item no.: 1 --> Item name = tree
Evaluating...
Starting Download...
WARNING: _parse_AF_initDataCallback failed list index out of range
Traceback (most recent call last):
  File "/Users/reza/projects/tmp-test/my_env/bin/googleimagesdownload", line 11, in <module>
    load_entry_point('google-images-download==2.8.0', 'console_scripts', 'googleimagesdownload')()
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 1129, in main
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 939, in download
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 1066, in download_executor
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 765, in _get_all_items
  File "/Users/reza/projects/tmp-test/my_env/lib/python3.8/site-packages/google_images_download-2.8.0-py3.8.egg/google_images_download/google_images_download.py", line 722, in _get_next_item
TypeError: 'NoneType' object is not an iterator
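
The TypeError above is the message Python gives when `next()` is called on None, which suggests that the error-catching wrapper printed its warning and then returned None where an iterator was expected. One possible way to avoid this, sketched below under that assumption, is to return an empty iterator on failure. The `safe_parse` and `failing_parser` names are hypothetical and not part of this PR.

```python
def failing_parser(page):
    """Hypothetical stand-in for a parser that fails on an unexpected page."""
    entries = []
    return iter([entries[-1]])  # raises IndexError: list index out of range


def safe_parse(page, parser):
    """Run parser(page); on any error, warn and return an empty iterator."""
    try:
        return parser(page)
    except Exception as exc:
        print("WARNING: parser failed: {}".format(exc))
        return iter(())  # empty iterator instead of None


metas = safe_parse("<html></html>", failing_parser)
# next() with a default works on the empty iterator; next(None) would raise TypeError.
print(next(metas, None))
```

Callers that loop over the result or call `next()` with a default then see "no items" rather than a crash.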

@marian-code

This works for me, and it is the right approach to the problem. It might have a few bugs that need to be sorted out before it works for everyone, but @FarisHijazi is right about AF_initDataCallback: I checked, and the required information is certainly there. It just needs to be parsed.

@cooperdk

Works, but it thinks every image is a GIF, even when it's not.

@cooperdk

At line 776, before:

if arguments['metadata']:

Insert this:

                imageURL = object['image_link']
                object['image_format'] = imageURL.split(".")[-1]

If you don't, the script will think that all images are in GIF format.
This is an easy fix, but it only looks at the file extension in the original URL.
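
Splitting the whole URL on "." can return odd values when the URL has a query string or no extension at all. A slightly more defensive variant, sketched here as an assumption rather than as part of this PR, parses the URL first and looks only at its path. The `guess_image_format` helper is a hypothetical name.

```python
import os

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def guess_image_format(image_url, default=""):
    """Guess the image format from the extension of the URL's path."""
    path = urlparse(image_url).path
    ext = os.path.splitext(path)[1]  # e.g. ".JPG", or "" when there is none
    return ext[1:].lower() if ext else default


url = "https://example.com/a/photo.JPG?size=large"
print(url.split(".")[-1])       # naive split keeps the query string
print(guess_image_format(url))  # path-based guess
```

Like the original suggestion, this still trusts the extension in the URL and does not inspect the downloaded file itself.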

@Joeclinton1

This is a duplicate of #298

6 participants