Frisbee is a small utility to collect email addresses from search engines and other free-form text sources. Frisbee makes it simple to find email addresses posted on the web by taking user-fed input and translating it into an automated search query. Users can extend frisbee by adding modules for new search engines or other obscure data sources.
Install the library:
pip install frisbee
or python setup.py install
Run a search
frisbee search -e bing -d bnpparibas.com -l 50 --greedy --save
Search in bulk
frisbee search -e bing -f domains -l 50 --save
This sample code shows some of the range of functionality within the module:
from frisbee import Frisbee
# Create an instance
frisbee = Frisbee(save=True)
# Describe your job
jobs = [{'engine': 'bing', 'modifier': 'site:github.com',
'domain': 'foo.bar', 'limit': 50}]
# Execute the jobs
frisbee.search(jobs)
# Get the results
results = frisbee.get_results()
Below is an example job result:
[{
"engine": "bing",
"modifier": "site:github.com",
"domain": "blockade.io",
"limit": 50,
"results": {
"start_time": "2018-12-13 16:54:15",
"end_time": "2018-12-13 16:54:19",
"emails": [
"info@blockade.io"
],
"duration": "4",
"processed": 44
},
"project": "zealous_kirch"
}]
- Ability to search for email addresses from search engine results
- Modular design that can be extended easily to include new sources
- Modifier options that can filter or target search query
- Limit option to reduce the number of results parsed
- Greedy option to learn from collected results and fuzzy to find related
- Save output describing job request and results
- Individual or bulk look-ups using the command line utility
- Feature: Added a bulk option to the command line tool to ease usage
- Change: Replaced multiprocessing with concurrent.futures to simplify logic
- Change: Split logic of dynamic module loading and future work outside of the Frisbee class
- Change: Reverted back to the BS4 parsing versus raw text
- Change: Replaced the regular expression processing to be more efficient
- Change: Progressively save results as they come in to avoid any losses from a deadlock
- Change: Randomize the top-level directory to avoid conflicts
- Feature: Clean SERPs to remove files or other formats we can't inspect
- Change: Use text extraction instead of BS4 HTML parsing to get body of websites (ensures clean email extraction)
- Change: Increased logging and timeout parameters
- Feature: Added typing to the core code
- Feature: Added a fuzzy flag to find related domains
- Feature: Activated greedy option to save and output to screen
- Bugfix: Wrapped loading of HTML for cases where data is dirty
- Initial push!