
Python Web Scraping Project

Scraping Top Repositories for GitHub Topics

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

TODO (Intro):

• Introduction about web scraping
• Introduction about GitHub and the problem statement
• Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)

Here are the steps we'll follow:

• We're going to scrape https://github.com/topics
• We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
• For each topic, we'll get the top 25 repositories in the topic from the topic page
• For each repository, we'll grab the repo name, username, stars and repo URL
• For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

Scrape the list of topics from GitHub

Explain how you'll do it:

• Use requests to download the page
• Use Beautiful Soup to parse and extract information
• Convert the results to a Pandas DataFrame
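As a rough sketch of how this step might look (assuming the requests, beautifulsoup4 and pandas packages are installed; the CSS class names below are assumptions about GitHub's current markup and should be checked against the live page source):

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_topics():
    """Download the GitHub topics page and return the topics as a DataFrame."""
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    soup = BeautifulSoup(response.text, 'html.parser')

    # NOTE: these class names are assumptions; GitHub changes its markup from
    # time to time, so inspect the page and adjust the selectors if needed.
    title_tags = soup.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
    desc_tags = soup.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')
    link_tags = soup.find_all('a', class_='no-underline flex-1 d-flex flex-column')

    return pd.DataFrame({
        'title': [tag.text.strip() for tag in title_tags],
        'description': [tag.text.strip() for tag in desc_tags],
        'url': ['https://github.com' + tag['href'] for tag in link_tags],
    })

topics_df = scrape_topics()
print(topics_df.head())
```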

  1. Introduction to our project

    • We're going to scrape https://github.com/topics
    • We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
    • For each topic, we'll get the top 25 repositories in the topic from the topic page
    • For each repository, we'll grab the repo name, username, stars and repo URL
    • For each topic we'll create a CSV file in the following format:
     Repo Name,Username,Stars,Repo URL
     three.js,mrdoob,69700,https://github.com/mrdoob/three.js
     libgdx,libgdx,18300,https://github.com/libgdx/libgdx 
  2. Use the requests library to download web pages

    • Inspect the website's HTML source and identify the right URLs to download.
    • Download and save web pages locally using the requests library.
    • Create a function to automate downloading for different topics/search queries.
  3. Use Beautiful Soup to parse and extract information

    • Parse and explore the structure of downloaded web pages using Beautiful Soup.
    • Use the right properties and methods to extract the required information.
    • Create functions to extract from the page into lists and dictionaries.
    • (Optional) Use a REST API to acquire additional information if required.
  4. Creating CSV file(s) with the extracted information

    • Create functions for the end-to-end process of downloading, parsing, and saving CSVs (a sketch follows this list).
    • Execute the function with different inputs to create a dataset of CSV files.
    • Verify the information in the CSV files by reading them back using Pandas.
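A minimal sketch of steps 2–4 for a single topic page follows. The helper names (get_topic_page, get_topic_repos, parse_star_count, scrape_topic) and the selectors inside them are illustrative assumptions, not the exact code in the notebook:

```python
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

def get_topic_page(topic_url):
    """Download a topic page and return it as a parsed BeautifulSoup object."""
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    return BeautifulSoup(response.text, 'html.parser')

def parse_star_count(text):
    """Convert GitHub's abbreviated star counts (e.g. '69.7k') to integers."""
    text = text.strip()
    if text.endswith('k'):
        return int(float(text[:-1]) * 1000)
    return int(text.replace(',', ''))

def get_topic_repos(topic_soup):
    """Extract repo name, username, stars and repo URL for each repository listed."""
    # NOTE: the selectors are assumptions about GitHub's current markup:
    # each repo card has an <h3> holding two <a> tags (username, repo name)
    # and a star counter span whose id starts with 'repo-stars-counter'.
    repo_tags = topic_soup.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    star_tags = topic_soup.find_all('span', id=lambda i: i and i.startswith('repo-stars-counter'))

    rows = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        a_tags = repo_tag.find_all('a')
        rows['username'].append(a_tags[0].text.strip())
        rows['repo_name'].append(a_tags[1].text.strip())
        rows['stars'].append(parse_star_count(star_tag.text))
        rows['repo_url'].append(BASE_URL + a_tags[1]['href'])
    return pd.DataFrame(rows)

def scrape_topic(topic_url, path):
    """End-to-end: download one topic page, parse it, and save the repos to a CSV."""
    if os.path.exists(path):
        print(f'The file {path} already exists. Skipping...')
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

# Example run and verification by reading the CSV back with Pandas:
# scrape_topic('https://github.com/topics/machine-learning', 'machine-learning.csv')
# print(pd.read_csv('machine-learning.csv').head())
```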

Notes: Refer to the Jupyter notebook for full documentation. The final code comprises all of the code above, plus scraping of the top 25 repositories from each topic page.
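Putting it together, a small driver along these lines can create one CSV per topic; it assumes the scrape_topics() and scrape_topic() sketches above have already been defined (or imported):

```python
import os

def scrape_topics_repos():
    """Create a CSV of top repositories for every topic on the GitHub topics page."""
    print('Scraping list of topics...')
    topics_df = scrape_topics()          # sketched earlier: topics as a DataFrame
    os.makedirs('data', exist_ok=True)   # one CSV per topic goes under ./data
    for _, row in topics_df.iterrows():
        print(f'Scraping top repositories for "{row["title"]}"')
        scrape_topic(row['url'], f"data/{row['title']}.csv")

# scrape_topics_repos()
```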
