Web-Scraping-Image-Scraping

Techinstr

Scraping product images from a Shopify site named Techinstr.

Data Scraping

  • Used Scrapy to scrape product data from the Techinstr Shopify site. The spider is in products.py.
  • Although the website relies on JavaScript, with JavaScript disabled it falls back to simple pagination. Since Scrapy does not execute JavaScript by default, a simple pagination traversal can be implemented in the spider code.
  • The images are scraped using a custom image pipeline that renames the saved image files: the filenames are slugified versions of the product titles, generated with Python's slugify library (see the sketch after this list).
  • The necessary changes have been added to settings.py to accommodate the custom image pipeline.
  • The downloaded images are 1000x1000, which is too large, so 100x100 thumbnails are generated by adding the IMAGES_THUMBS setting to settings.py.
  • Overall, there are about 130 products spread across roughly 15 pages.
  • The scraped data is saved in a JSON file that includes the product title, price, and image info such as the image file path.
  • The scraped images are saved in a folder determined by the IMAGES_STORE variable in settings.py.
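A minimal sketch of this setup (the class, module, and item field names here are illustrative, not necessarily the ones used in the repo):

    # pipelines.py -- custom image pipeline that names files after slugified product titles
    from scrapy.pipelines.images import ImagesPipeline
    from slugify import slugify


    class SlugifiedImagePipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # e.g. "USB Digital Microscope" -> "usb-digital-microscope.jpg"
            return f"{slugify(item['title'])}.jpg"

and the corresponding settings:

    # settings.py -- image-related settings (values are illustrative)
    ITEM_PIPELINES = {"techinstr.pipelines.SlugifiedImagePipeline": 1}
    IMAGES_STORE = "images"                  # folder where scraped images are saved
    IMAGES_THUMBS = {"small": (100, 100)}    # 100x100 thumbnails of the 1000x1000 originals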

Streamlit App

Streamlit App Link: https://image-mining.streamlit.app/

Features:

  • Price distribution of items (a minimal Streamlit sketch follows this list)
  • Top 10 cheapest and most expensive items
  • 9 random scraped images from the site
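A hedged sketch of the first two features, assuming the Scrapy output is a JSON list of products with "title" and "price" fields (the file and field names are assumptions):

    # Streamlit sketch: price distribution and top-10 cheapest items
    import json

    import matplotlib.pyplot as plt
    import pandas as pd
    import streamlit as st

    df = pd.DataFrame(json.load(open("products.json")))
    # strip any currency symbols before converting prices to numbers
    df["price"] = pd.to_numeric(
        df["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
    )

    st.subheader("Price distribution")
    fig, ax = plt.subplots()
    ax.hist(df["price"].dropna(), bins=20)
    ax.set_xlabel("Price")
    ax.set_ylabel("Number of products")
    st.pyplot(fig)

    st.subheader("Top 10 cheapest items")
    st.table(df.nsmallest(10, "price")[["title", "price"]])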

Books to Scrape

Scraping book cover images and book info from bookstoscrape.

Data Scraping

  • Framework: Scrapy
  • Scrapy spider: books.py
  • The spider crawls through 50 pages and scrapes the info of 1000 books, saving it in books.json. The info contains the price, title, genre, number of copies in stock, rating, the image_url used to scrape the cover image, and image metadata that holds the path where the scraped image is saved (a sketch of the spider follows this list).
  • The images are scraped using a custom image pipeline that renames the saved image files: the filenames are slugified versions of the book titles, generated with Python's slugify library. The necessary changes have been added to settings.py to accommodate the custom image pipeline.
  • The scraped images are saved in a folder determined by the IMAGES_STORE variable in settings.py.
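A rough sketch of such a spider (CSS selectors and field names are illustrative; the actual spider also gathers genre and stock counts):

    # books.py -- simplified spider for books.toscrape.com
    import scrapy


    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["https://books.toscrape.com/"]

        def parse(self, response):
            for book in response.css("article.product_pod"):
                yield {
                    "title": book.css("h3 a::attr(title)").get(),
                    "price": book.css(".price_color::text").get(),
                    "rating": book.css("p.star-rating::attr(class)").get().split()[-1],
                    # the image URL is handed to the image pipeline
                    "image_url": response.urljoin(book.css("img::attr(src)").get()),
                }

            # Follow the "next" link to walk through all 50 listing pages.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)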

Streamlit App

Streamlit App Link: https://image-mining.streamlit.app/

Features:

  • Genre-based histogram and box plot for price and in_stock distribution
  • Bar plot for genre-based ratings distribution
  • Top 10 lists for most expensive, cheapest, highest and lowest rated books
  • Genre-based Scatter Plot and Correlation Heatmap for Price vs Rating
  • Bar plot for genre distribution
  • Show the cover of a selected book along with its title, price, and other info (see the sketch after this list).
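A sketch of the book-cover feature, assuming books.json is a list of dicts and the image pipeline stores its metadata under an "images" key (all field names here are assumptions):

    # Streamlit sketch: show a selected book's cover and basic info
    import json
    import os

    import streamlit as st

    books = json.load(open("books.json"))
    choice = st.selectbox("Pick a book", [b["title"] for b in books])
    book = next(b for b in books if b["title"] == choice)

    # "images" holds the pipeline metadata, including the saved file path
    st.image(os.path.join("images", book["images"][0]["path"]), caption=book["title"])
    st.write({"price": book["price"], "genre": book["genre"], "rating": book["rating"]})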

QuoteFancy

Scraping Top 100 Motivational Quotes from quotefancy.

Data Scraping

  • Framework: Scrapy & Playwright. To use Playwright with Scrapy, the scrapy-playwright Python module had to be installed (see the sketch after this list).
  • Quote images, quote texts, author names, upvotes, and downvotes were scraped. The upvote and downvote counts are rendered with JavaScript, so Playwright was used to scrape them. The spider is image_scraper_2.
  • The scraped data for the top 100 quotes is saved in quotes.json. The scraped quote images are saved in the downloads folder.
  • The quote text is further cleaned by removing stopwords with the nltk library and stripping non-alphabetic symbols with a regex; the result is saved in quote_filtered_text.json. This text is later used to generate the word cloud in the Streamlit app. The code can be found in create_clean_quote_text.py.
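Wiring in scrapy-playwright mostly comes down to registering its download handler in settings.py, as described in that package's documentation:

    # settings.py additions required by scrapy-playwright
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Inside the spider, any page whose content needs JavaScript (here, the vote counts) is requested with meta={"playwright": True} so that Playwright renders it before Scrapy parses the response.

The cleaning step is roughly the following (a sketch of what create_clean_quote_text.py does; the JSON field names are assumptions):

    # create_clean_quote_text.py -- strip non-alphabetic characters and NLTK stopwords
    import json
    import re

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    stop_words = set(stopwords.words("english"))

    quotes = json.load(open("quotes.json"))
    cleaned = []
    for quote in quotes:
        text = re.sub(r"[^a-zA-Z\s]", " ", quote["text"]).lower()
        cleaned.append(" ".join(w for w in text.split() if w not in stop_words))

    with open("quote_filtered_text.json", "w") as f:
        json.dump(cleaned, f)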

Streamlit App

Streamlit App Link: https://image-mining.streamlit.app/

Features:

  • Show the chosen quote's associated image and other related info.
  • Distribution of upvotes (box plot and histogram)
  • Word cloud using the cleaned quote text data from quote_filtered_text.json (see the sketch after this list)
  • Scatter plot between Upvotes and Downvotes
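A sketch of the word-cloud feature, assuming quote_filtered_text.json holds a list of cleaned quote strings (that layout is an assumption):

    # Streamlit sketch: word cloud from the cleaned quote text
    import json

    import matplotlib.pyplot as plt
    import streamlit as st
    from wordcloud import WordCloud

    texts = json.load(open("quote_filtered_text.json"))
    wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(texts))

    fig, ax = plt.subplots()
    ax.imshow(wc, interpolation="bilinear")
    ax.axis("off")
    st.pyplot(fig)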
