raymelon/tagalog-dictionary-scraper

Tagalog Dictionary Scraper 📒

Let us enrich our dictionary!

Collects Tagalog words from tagalog.pinoydictionary.com, a database of Tagalog words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.

42,723 words (as of Feb 19, 2023)

See the word list at tagalog_dict.txt

License: GPL v3

contributions welcome

API Resource

The scraped words are served through GitHub Pages and are accessible as REST resources.

Host

https://raymelon.github.io/tagalog-dictionary-scraper/

Method

GET

Resources Available

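| Resource | Display    | Endpoint                 |
|----------|------------|--------------------------|
| csv      | default    | /tagalog_dict.csv        |
| csv      | with lines | /tagalog_dict_lines.csv  |
| json     | default    | /tagalog_dict.json       |
| json     | with lines | /tagalog_dict_lines.json |
| txt      | default    | /tagalog_dict.txt        |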
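
As a rough illustration of consuming these resources, the sketch below builds a resource URL from the host and an endpoint, then downloads the plain-text word list using only the Python standard library. The function names (`resource_url`, `parse_word_list`, `fetch_words`) are hypothetical, and the parsing assumes the txt resource holds one word per line, which this README does not state explicitly.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Host taken from the "Host" section of this README.
HOST = "https://raymelon.github.io/tagalog-dictionary-scraper/"


def resource_url(endpoint: str) -> str:
    """Join the host with an endpoint such as '/tagalog_dict.txt'."""
    return urljoin(HOST, endpoint.lstrip("/"))


def parse_word_list(text: str) -> list[str]:
    """Split text into words, ASSUMING one word per line (not confirmed)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_words(endpoint: str = "/tagalog_dict.txt") -> list[str]:
    """Send a GET request for the resource and return the parsed words."""
    with urlopen(resource_url(endpoint)) as response:
        return parse_word_list(response.read().decode("utf-8"))


if __name__ == "__main__":
    words = fetch_words()
    print(f"Fetched {len(words)} words")
```

The csv and json resources can be requested the same way by changing the endpoint; their internal layout is not described here, so inspect a response before writing a parser for them.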

How is it done? 💪

Each web page is loaded and parsed, and the words enclosed in <h2 class='word-entry'> tags are extracted.
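
The extraction step can be sketched as follows. This is a minimal illustration rather than the project's actual code: it swaps in the standard library's `html.parser` so that it runs without extra dependencies, whereas the project itself lists beautifulsoup4 under Tools. The `WordEntryParser` class, the `extract_words` function, and the `SAMPLE` markup are all hypothetical; the sample is made up for the example and is not copied from the dictionary site.

```python
from html.parser import HTMLParser


class WordEntryParser(HTMLParser):
    """Collects the text found inside <h2 class="word-entry"> elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_entry = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "word-entry" in classes:
            self._in_entry = True
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a matching <h2>, including nested tags.
        if self._in_entry:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_entry:
            word = "".join(self._buffer).strip()
            if word:
                self.words.append(word)
            self._in_entry = False


def extract_words(html: str) -> list[str]:
    """Return every word enclosed in an <h2 class='word-entry'> tag."""
    parser = WordEntryParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Hypothetical markup that only imitates the structure described above.
SAMPLE = """
<div><h2 class='word-entry'><a href="#">aba</a></h2><p>definition</p></div>
<div><h2 class="word-entry">abaka</h2></div>
<h2 class="other">skip me</h2>
"""

if __name__ == "__main__":
    print(extract_words(SAMPLE))
```

Headings with any other class are ignored, and text inside nested tags such as links is still captured, so the sketch tolerates small variations in the markup.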

An HTML snippet from tagalog.pinoydictionary.com is included, containing the source of http://tagalog.pinoydictionary.com/list/a/. It serves as a point of reference for how words are extracted from a page.

Disclaimer: I do not own the HTML code cited above; it is owned by tagalog.pinoydictionary.com.

How did the project start? 💭

The main purpose of this project is to build a Tagalog dictionary database for Scrabble®, though the word list may serve other uses as well.

Tools ✏️

  python -m pip install -U pip beautifulsoup4
  python -m pip install -U pip requests-futures

Notes 📌

License

GNU General Public License 3.0
