AndreM92/Social_Media_Crawler_2024

Social_Media_Crawler_2024

This repository contains my new Social Media Crawler for data analytics.
(Written for German social media pages, but easily adaptable to other languages.)
All crawlers are written in Python. I primarily use Selenium as the framework, but I've also created some smaller crawlers using Playwright.

Currently, the project covers the social media platforms Facebook, Instagram, LinkedIn, TikTok, X/Twitter, and YouTube. However, some functionalities, such as capturing comments, are not yet fully implemented for every platform. Updates for these features will follow in the coming days.

The functionalities of the crawlers include:

  • Automated logins
  • Several ways to get around cookie banners
  • Searching for and navigating to a profile (not included in all files)
  • Collection of the profile stats
  • Scrolling down the feed until a specific date is reached
  • Scraping of the date, content, images, links, likes, number of comments and number of shares of every post
  • Saving the data in DataFrames
  • Finally, exporting the DataFrames to an Excel file
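
The "scroll until a specific date" step can be sketched roughly as follows. This is a minimal illustration, not the code used in this repository: `scroll_until_date`, `load_visible_posts` and `scroll_down` are hypothetical names, and the two callables stand in for whatever Selenium or Playwright calls a given crawler uses to read the rendered posts and to scroll the page. Each post is assumed to be a dict with at least an `id` and a `date` key.

```python
from datetime import date
from typing import Callable, Dict, List

Post = Dict[str, object]


def scroll_until_date(
    load_visible_posts: Callable[[], List[Post]],  # hypothetical: reads the posts currently rendered
    scroll_down: Callable[[], None],               # hypothetical: scrolls the feed one step
    stop_date: date,
    max_scrolls: int = 50,
) -> List[Post]:
    """Collect posts until one older than stop_date shows up or the feed stops growing."""
    seen: Dict[str, Post] = {}
    for _ in range(max_scrolls):
        count_before = len(seen)
        posts = load_visible_posts()
        for post in posts:
            # Keep the first version of every post; feeds re-render old posts while scrolling.
            seen.setdefault(post["id"], post)
        reached_stop_date = any(p["date"] < stop_date for p in posts)
        feed_exhausted = len(seen) == count_before
        if reached_stop_date or feed_exhausted:
            break
        scroll_down()
    # Drop everything older than the requested date.
    return [p for p in seen.values() if p["date"] >= stop_date]
```

The returned list of dicts can be passed directly to `pandas.DataFrame(rows)` and then written with `DataFrame.to_excel("posts.xlsx", index=False)`. Note that in this sketch a pinned post older than `stop_date` would end the loop early, so a real crawler would probably need to skip pinned posts first.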

Issues

  • Facebook no longer displays the date and time of a post as text, so I have to fall back on less precise methods such as text scraping, screenshots, and image reading (using Pillow/Pytesseract).
  • Some links are only provided in XML format: "<use xlink:href="#gid111" xmlns:xlink="http://www.w3.org/1999/xlink"></use>". As a result, I cannot scrape their text content directly. I devised a method to associate the visible dates with the posts, but this approach introduces some inaccuracies.
  • Certain crawlers require a headed browser because I use PyAutoGUI to get around bot blocking and to handle unusual page settings.
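
Once a date string has been recovered by text scraping or image reading, it still has to be turned into a real timestamp. The sketch below shows one possible way to do that for German date text. It is an assumption-laden example, not the repository's method: `parse_german_date` is a hypothetical helper, and the handled formats ("12. März 2024", "12. März", "3 Std.", "2 Tage", "1 Woche") are guesses at what a German feed may show, not a verified list. Abbreviated month names and times of day are not handled.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

GERMAN_MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
    "juli": 7, "august": 8, "september": 9, "oktober": 10, "november": 11, "dezember": 12,
}

# Assumed relative units; matched as prefixes, so "tag" also covers "Tage"/"Tagen".
RELATIVE_UNITS = {
    "min": timedelta(minutes=1),
    "std": timedelta(hours=1),
    "tag": timedelta(days=1),
    "woche": timedelta(weeks=1),
}


def parse_german_date(text: str, now: datetime) -> Optional[datetime]:
    """Return a datetime for the given German date text, or None if nothing matches."""
    cleaned = text.strip().lower()

    # Absolute dates such as "12. März 2024" or "12. März" (year missing).
    match = re.search(r"(\d{1,2})\.\s*([a-zäöü]+)\.?(?:\s+(\d{4}))?", cleaned)
    if match and match.group(2) in GERMAN_MONTHS:
        day, month = int(match.group(1)), GERMAN_MONTHS[match.group(2)]
        year = int(match.group(3)) if match.group(3) else now.year
        try:
            parsed = datetime(year, month, day)
            # Without a year, a date in the future is assumed to belong to last year.
            if match.group(3) is None and parsed > now:
                parsed = datetime(year - 1, month, day)
        except ValueError:
            return None  # e.g. a misread day such as "32."
        return parsed

    # Relative dates such as "3 Std." or "vor 2 Tagen".
    match = re.search(r"(\d+)\s*(min|std|tag|woche)", cleaned)
    if match:
        return now - int(match.group(1)) * RELATIVE_UNITS[match.group(2)]

    return None
```

Returning `None` instead of raising makes it easy to flag posts whose date could not be read, so they can be checked manually rather than silently receiving a wrong date.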

I appreciate any suggested solutions to overcome these challenges.
