ortanaV2/Data-Scraper

Data-Scraper

A data scraper that filters the most relevant information out of large amounts of text-based data.

The script prompts for a keyword, then checks each file's name and contents for it. Every file that contains the keyword is recorded as a match, and all matches are printed at the end.
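The matching step described above can be sketched roughly as follows. This is an illustration, not the script's actual code: the function names `is_match` and `find_matches` are hypothetical, and the case-insensitive comparison is an assumption.

```python
def is_match(keyword: str, file_name: str, contents: str) -> bool:
    """Return True if the keyword occurs in the file name or in its contents."""
    needle = keyword.casefold()  # assumption: the comparison ignores case
    return needle in file_name.casefold() or needle in contents.casefold()


def find_matches(keyword: str, files: dict) -> list:
    """Collect the names of all matching files, preserving input order.

    `files` maps a file name to the text already read from that file.
    """
    return [name for name, text in files.items() if is_match(keyword, name, text)]


if __name__ == "__main__":
    # Hypothetical in-memory stand-in for files that were read from disk.
    sample = {
        "invoice_2021.txt": "nothing relevant here",
        "notes.txt": "Remember to pay the INVOICE",
        "todo.txt": "buy milk",
    }
    # Matches are gathered first and only printed at the end.
    for name in find_matches("invoice", sample):
        print(name)
```

Keeping the comparison separate from file reading means the same check applies to every supported file type once its text has been extracted.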

Supported File Types

The scraper can read the contents of the following file types only:

  • .docx
  • .pdf
  • .txt
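A minimal sketch of how text might be extracted from each of these formats, assuming pdfplumber for .pdf and the `docx` module for .docx (the two libraries listed under Requirements). The helper names and the dispatch table are hypothetical, and the third-party imports are deferred so that reading .txt files works without them.

```python
from pathlib import Path


def read_txt(path: Path) -> str:
    # Assumption: skip undecodable bytes instead of aborting the whole search.
    return path.read_text(encoding="utf-8", errors="ignore")


def read_pdf(path: Path) -> str:
    import pdfplumber  # third-party, imported only when a PDF is read

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def read_docx(path: Path) -> str:
    import docx  # third-party module provided by the python-docx package

    document = docx.Document(str(path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


# Map each supported extension to the function that reads it.
READERS = {".txt": read_txt, ".pdf": read_pdf, ".docx": read_docx}


def read_text(path: Path) -> str:
    """Return the text of a supported file, or "" for any other extension."""
    reader = READERS.get(path.suffix.lower())
    return reader(path) if reader else ""
```

Returning an empty string for unsupported files lets a caller still compare the keyword against the file name without special cases.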

Usage

The scraper searches the ./DATA directory by default. To search a different location, edit the directory variable in the script:

Line 9: directory = "./DATA"

Note

The scraper iterates through every file in the directory. To speed up the search, limit the number of files it contains.
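One way to limit the work is to skip files whose contents cannot be read anyway. The sketch below is an assumption about how that could look, not a description of the script: `candidate_files` is a hypothetical helper, it does not descend into subdirectories, and it ignores unsupported files entirely, so their names would no longer be checked against the keyword.

```python
import os

# Extensions whose contents the scraper can read.
SUPPORTED = (".docx", ".pdf", ".txt")


def candidate_files(directory: str = "./DATA") -> list:
    """Return the sorted paths of supported files directly inside `directory`."""
    paths = []
    for entry in os.scandir(directory):
        # Skip subdirectories and any file with an unsupported extension.
        if entry.is_file() and entry.name.lower().endswith(SUPPORTED):
            paths.append(entry.path)
    return sorted(paths)
```

Calling `candidate_files("./DATA")` before the search gives a fixed, ordered list to iterate over, which also makes the output order reproducible between runs.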

Requirements

Install the required libraries with pip:

pip install pdfplumber
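pip install python-docx

The docx module is published on PyPI as python-docx. The PyPI package named docx is an older, unmaintained project.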

Improving

Suggestions for improvements are welcome.
