Web-Scraping-Image-Scraping

Techinstr

Scraping product images from a Shopify site named Techinstr.

Data Scraping

  • Used Scrapy to scrape product data from the Techinstr Shopify site. The spider is in products.py.
  • Although the website relies on JavaScript, with JavaScript disabled it falls back to simple pagination. Since Scrapy does not execute JavaScript by default, a simple pagination traversal can be implemented in the spider code.
  • The images are scraped using a custom image pipeline that renames the saved image files: the filenames are slugified versions of the product titles, generated with Python's slugify library (see the sketch after this list).
  • The necessary changes have been added to settings.py to accommodate the custom image pipeline.
  • The downloaded images are 1000x1000, which is too large, so 100x100 thumbnails are generated by adding the IMAGES_THUMBS setting to settings.py.
  • Overall, there are about 130 products spread across roughly 15 pages.
  • The scraped data is saved in a JSON file that includes the product title, price, and image info such as the image file path.
  • The scraped images are saved in a folder determined by the IMAGES_STORE variable in settings.py.
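A minimal sketch of this setup (the class, module, and item field names here are illustrative, not necessarily the ones used in the repo):

    # pipelines.py -- custom image pipeline that names files after slugified product titles
    from scrapy.pipelines.images import ImagesPipeline
    from slugify import slugify


    class SlugifiedImagePipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # e.g. "USB Digital Microscope" -> "usb-digital-microscope.jpg"
            return f"{slugify(item['title'])}.jpg"

and the corresponding settings:

    # settings.py -- image-related settings (values are illustrative)
    ITEM_PIPELINES = {"techinstr.pipelines.SlugifiedImagePipeline": 1}
    IMAGES_STORE = "images"                  # folder where scraped images are saved
    IMAGES_THUMBS = {"small": (100, 100)}    # 100x100 thumbnails of the 1000x1000 originals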

Streamlit App

Streamlit App Link: https://image-mining.streamlit.app/

Features:

  • Price distribution of items (a minimal Streamlit sketch follows this list)
  • Top 10 cheapest and most expensive items
  • 9 random scraped images from the site
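A hedged sketch of the first two features, assuming the Scrapy output is a JSON list of products with "title" and "price" fields (the file and field names are assumptions):

    # Streamlit sketch: price distribution and top-10 cheapest items
    import json

    import matplotlib.pyplot as plt
    import pandas as pd
    import streamlit as st

    df = pd.DataFrame(json.load(open("products.json")))
    # strip any currency symbols before converting prices to numbers
    df["price"] = pd.to_numeric(
        df["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
    )

    st.subheader("Price distribution")
    fig, ax = plt.subplots()
    ax.hist(df["price"].dropna(), bins=20)
    ax.set_xlabel("Price")
    ax.set_ylabel("Number of products")
    st.pyplot(fig)

    st.subheader("Top 10 cheapest items")
    st.table(df.nsmallest(10, "price")[["title", "price"]])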

Books to Scrape

Scraping book cover images and book info from bookstoscrape.

Data Scraping

  • Framework: Scrapy
  • Scrapy spider: books.py
  • The spider crawls through 50 pages and scrapes the info of 1000 books, saving it in books.json. The info contains the price, title, genre, number of copies in stock, rating, the image_url used to scrape the cover image, and image metadata that holds the path where the scraped image is saved (a sketch of the spider follows this list).
  • The images are scraped using a custom image pipeline that renames the saved image files: the filenames are slugified versions of the book titles, generated with Python's slugify library. The necessary changes have been added to settings.py to accommodate the custom image pipeline.
  • The scraped images are saved in a folder determined by the IMAGES_STORE variable in settings.py.
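A rough sketch of such a spider (CSS selectors and field names are illustrative; the actual spider also gathers genre and stock counts):

    # books.py -- simplified spider for books.toscrape.com
    import scrapy


    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["https://books.toscrape.com/"]

        def parse(self, response):
            for book in response.css("article.product_pod"):
                yield {
                    "title": book.css("h3 a::attr(title)").get(),
                    "price": book.css(".price_color::text").get(),
                    "rating": book.css("p.star-rating::attr(class)").get().split()[-1],
                    # the image URL is handed to the image pipeline
                    "image_url": response.urljoin(book.css("img::attr(src)").get()),
                }

            # Follow the "next" link to walk through all 50 listing pages.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)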

Streamlit App

Streamlit App Link: https://image-mining.streamlit.app/

Features:

  • Genre-based histogram and box plot for price and in_stock distribution
  • Bar plot for genre-based ratings distribution
  • Top 10 lists for most expensive, cheapest, highest and lowest rated books
  • Genre-based Scatter Plot and Correlation Heatmap for Price vs Rating
  • Bar plot for genre distribution
  • Show the cover of a selected book along with its title, price, and other info (see the sketch after this list).
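A sketch of the book-cover feature, assuming books.json is a list of dicts and the image pipeline stores its metadata under an "images" key (all field names here are assumptions):

    # Streamlit sketch: show a selected book's cover and basic info
    import json
    import os

    import streamlit as st

    books = json.load(open("books.json"))
    choice = st.selectbox("Pick a book", [b["title"] for b in books])
    book = next(b for b in books if b["title"] == choice)

    # "images" holds the pipeline metadata, including the saved file path
    st.image(os.path.join("images", book["images"][0]["path"]), caption=book["title"])
    st.write({"price": book["price"], "genre": book["genre"], "rating": book["rating"]})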

QuoteFancy

Scraping Top 100 Motivational Quotes from quotefancy.

Data Scraping

  • Framework: Scrapy & Playwright. To use Playwright with Scrapy, the scrapy-playwright Python module had to be installed (see the sketch after this list).
  • Quote images, quote texts, author names, upvotes, and downvotes were scraped. The upvote and downvote counts are rendered with JavaScript, so Playwright was used to scrape them. The spider is image_scraper_2.
  • The scraped data for the top 100 quotes is saved in quotes.json. The scraped quote images are saved in the downloads folder.
  • The quote text is further cleaned by removing stopwords with the nltk library and stripping non-alphabetic symbols with a regex; the result is saved in quote_filtered_text.json. This text is later used to generate the word cloud in the Streamlit app. The code can be found in create_clean_quote_text.py.
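Wiring in scrapy-playwright mostly comes down to registering its download handler in settings.py, as described in that package's documentation:

    # settings.py additions required by scrapy-playwright
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Inside the spider, any page whose content needs JavaScript (here, the vote counts) is requested with meta={"playwright": True} so that Playwright renders it before Scrapy parses the response.

The cleaning step is roughly the following (a sketch of what create_clean_quote_text.py does; the JSON field names are assumptions):

    # create_clean_quote_text.py -- strip non-alphabetic characters and NLTK stopwords
    import json
    import re

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    stop_words = set(stopwords.words("english"))

    quotes = json.load(open("quotes.json"))
    cleaned = []
    for quote in quotes:
        text = re.sub(r"[^a-zA-Z\s]", " ", quote["text"]).lower()
        cleaned.append(" ".join(w for w in text.split() if w not in stop_words))

    with open("quote_filtered_text.json", "w") as f:
        json.dump(cleaned, f)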

Streamlit App

Streamlit App Link: https://image-mining.streamlit.app/

Features:

  • Show the chosen quote's associated image and other related info.
  • Distribution of upvotes (box plot and histogram)
  • Word cloud using the cleaned quote text data from quote_filtered_text.json (see the sketch after this list)
  • Scatter plot between Upvotes and Downvotes
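A sketch of the word-cloud feature, assuming quote_filtered_text.json holds a list of cleaned quote strings (that layout is an assumption):

    # Streamlit sketch: word cloud from the cleaned quote text
    import json

    import matplotlib.pyplot as plt
    import streamlit as st
    from wordcloud import WordCloud

    texts = json.load(open("quote_filtered_text.json"))
    wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(texts))

    fig, ax = plt.subplots()
    ax.imshow(wc, interpolation="bilinear")
    ax.axis("off")
    st.pyplot(fig)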
