Faculty Scraper is a Python web scraping tool designed to extract data from a faculty directory website. It retrieves information such as faculty names, colleges, email addresses, subjects taught, and research topics. The scraped data is stored in a list of dictionaries and can be exported to a CSV file for further analysis. For those interested in a more in-depth understanding, I highly recommend reading my article: Medium Link. It covers the code implementation, step-by-step explanations, and the benefits of utilizing concurrent features for efficient data extraction.
The following Python packages are required to run the scraper:
- `bs4` (BeautifulSoup): Used for HTML parsing.
- `requests`: Used for sending HTTP requests.
- `concurrent.futures`: Used for concurrent execution of scraping tasks.
- `pandas`: Used for data manipulation and CSV export.
- `re`: Used for email address validation.
- `logging`: Used for error handling and logging.
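Since the list notes that `re` handles email validation, here is a minimal sketch of what such a check might look like. The pattern below is an illustrative assumption, not the one FacultyScraper actually uses:

```python
import re

# Illustrative pattern only; the scraper's real validation may differ.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def is_valid_email(text: str) -> bool:
    """Return True if text looks like a plausible email address."""
    return bool(EMAIL_RE.match(text.strip()))

print(is_valid_email("jdoe@buffalo.edu"))  # True
print(is_valid_email("no-at-sign.edu"))    # False
```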
- Import the `FacultyScraper` class from the `faculty_scraper.FacultyScraper` module.

  ```python
  from faculty_scraper.FacultyScraper import FacultyScraper
  ```
- Create an instance of the `FacultyScraper` class with the URL of the faculty directory website.

  ```python
  url = "https://example.com/faculty-directory"
  scraper = FacultyScraper(url)
  ```
- Scrape the data from the faculty directory website.

  ```python
  data = scraper.scrape_data()
  ```
- Dump the scraped data into a CSV file.

  ```python
  scraper.dump_to_csv("faculty_data.csv")
  ```
- Retrieve the scraped data as a Pandas DataFrame.

  ```python
  df = scraper.return_df()
  ```
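The steps above fit together conceptually: `scrape_data()` returns a list of dictionaries, and `return_df()` presumably builds a DataFrame from it. A minimal sketch with hypothetical field names (the actual keys produced by the scraper may differ):

```python
import pandas as pd

# Hypothetical records mimicking what scrape_data() returns;
# the real column names may differ.
data = [
    {"name": "Jane Doe", "college": "Engineering",
     "email": "jdoe@example.edu", "subjects": "Algorithms",
     "research": "Graph mining"},
    {"name": "John Roe", "college": "Engineering",
     "email": "jroe@example.edu", "subjects": "Databases",
     "research": "Query optimization"},
]

# A list of dicts converts directly to a DataFrame; calling
# df.to_csv() on it would mirror what dump_to_csv() does.
df = pd.DataFrame(data)
print(df.shape)  # (2, 5)
```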
Contributions are welcome! If you would like to contribute to Faculty Scraper, follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes in the branch.
- Commit your changes with descriptive commit messages.
- Push your branch to your forked repository on GitHub.
- Open a pull request from your branch to the main repository.
- Provide a clear and descriptive title for your pull request, along with a detailed description of the changes you have made.
- Wait for the project maintainers to review your pull request. They may provide feedback or ask for additional changes.
- Once your pull request is approved and merged, your changes will become a part of the project.
Please note that by contributing to this project, you agree to abide by the Code of Conduct.
This project is licensed under the MIT License.
Here's an example that demonstrates the usage of the `FacultyScraper` class:

```python
from faculty_scraper.FacultyScraper import FacultyScraper

url = "https://engineering.buffalo.edu/computer-science-engineering/people/faculty-directory/full-time.html"
scraper = FacultyScraper(url)
data = scraper.scrape_data()
scraper.dump_to_csv("Department of Computer Science and Engineering Faculty Data.csv")
df = scraper.return_df()
```
In this example, the `FacultyScraper` is initialized with the URL of the faculty directory website. The `scrape_data()` method is called to extract the faculty information, which is then dumped into a CSV file named `Department of Computer Science and Engineering Faculty Data.csv`. The scraped data is also returned as a Pandas DataFrame for further analysis.
Note: The current implementation of the scraper is specifically designed for the URL "https://engineering.buffalo.edu/computer-science-engineering/people/faculty-directory/full-time.html". If you want to scrape a different faculty directory website, you will need to modify the code accordingly, following the steps in Contributing.
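As the introduction mentions, concurrent features speed up extraction. A minimal sketch of the likely pattern, using `concurrent.futures` to overlap I/O-bound page fetches; `fetch_profile` here is a hypothetical stand-in for a function that would call `requests.get()` and parse the page with BeautifulSoup:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(url: str) -> dict:
    # Stand-in: a real implementation would download the page with
    # requests and extract faculty details with BeautifulSoup.
    return {"url": url, "status": "parsed"}

urls = [f"https://example.com/faculty/{i}" for i in range(5)]

# Threads overlap the network waits, so pages are fetched in
# parallel instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_profile, urls))

print(len(results))  # 5
```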