This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

Return and search for word categories #84

Open
wants to merge 10 commits into
base: master
from

claw89 commented on May 1, 2021

Supersedes #80

For my use case, I need to obtain information on all words in a particular category. For example, I might need to collect etymology information for all English words derived from the Bible.

The two main additions of this PR are:

  • An optional argument "return_categories" in parser.fetch(): when return_categories=True, parser.fetch() returns the word information and a list of the associated categories.
  • A new function parser.fetch_category(): this function returns a list of the words organized under the specified category.

Using these additions, the above use case can be completed with the following code:

bible_words = parser.fetch_category('English terms derived from the Bible')
for word in bible_words:
    print("==============================")
    print(word)
    word_info = parser.fetch(word)
    for item in word_info:
        print(item['etymology'])

Word entries on Wiktionary are associated with various
categories. This commit adds the list of associated
categories to the returned JSON structure.

The new function fetch_category returns the words
included under the provided category. Words are returned
in a list.

A return_categories option is added to the fetch
function, defaulting to false. With this option set
to false, the fetch function returns the
original word information in JSON format.
If this option is set to true, the function
returns a pair of the word information and a list
of its categories. This change was made to make sure
the function passes the unit test.
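
The backward-compatible flag described above can be sketched roughly as follows. This is only an illustration of the return shapes, not the code in this PR: `fetch_stub` and its hard-coded data are hypothetical stand-ins for parser.fetch(), which queries Wiktionary.

```python
def fetch_stub(word, return_categories=False):
    """Hypothetical stand-in for parser.fetch(); all data here is made up."""
    word_info = [{"etymology": f"(etymology text for {word})", "definitions": []}]
    categories = ["English lemmas", "English nouns"]  # illustrative values only

    if return_categories:
        # New behaviour: a pair of (word information, list of categories).
        return word_info, categories
    # Default behaviour: same return value as before, so existing callers
    # (and the existing unit test) are unaffected.
    return word_info


info = fetch_stub("ark")
info_with_cats, cats = fetch_stub("ark", return_categories=True)
```

Keeping the default at false means callers that never pass the flag see no change; only callers that opt in need to unpack a pair.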

Category pages on Wiktionary may have associated subcategories.
This commit adds the option to return these subcategories as
a list along with the category words. The function fetch_category
can now return a pair of lists (i.e., words and subcategories).

Revised the code for parsing words on a category page for consistency
with the approach for parsing subcategories.
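
As a rough sketch of parsing words and subcategories in the same way, the example below uses Python's standard-library `html.parser` instead of the parser's own soup object, and runs on a simplified, made-up HTML snippet. It assumes that member links sit in list items inside containers with the ids `mw-pages` and `mw-subcategories`, as on MediaWiki category pages; the class name `CategoryPageParser` is hypothetical.

```python
from html.parser import HTMLParser


class CategoryPageParser(HTMLParser):
    """Collect link texts from the word and subcategory sections of a page."""

    # Assumed container ids mapped to the attribute that receives the links.
    SECTIONS = {"mw-pages": "words", "mw-subcategories": "subcategories"}

    def __init__(self):
        super().__init__()
        self.words = []
        self.subcategories = []
        self._section = None   # name of the list currently being filled
        self._depth = 0        # <div> nesting depth inside the active section
        self._in_item = False  # inside an <li> of the active section
        self._in_link = False  # inside an <a> within such an <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._section is not None:
                self._depth += 1
            elif attrs.get("id") in self.SECTIONS:
                self._section = self.SECTIONS[attrs["id"]]
                self._depth = 1
        elif tag == "li" and self._section is not None:
            self._in_item = True
        elif tag == "a" and self._in_item:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "div" and self._section is not None:
            self._depth -= 1
            if self._depth == 0:
                self._section = None
        elif tag == "li":
            self._in_item = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only link text inside a list item of a known section is kept,
        # so navigation links such as "next page" are skipped.
        if self._in_link and self._section is not None:
            getattr(self, self._section).append(data.strip())


SAMPLE = """
<div id="mw-subcategories"><ul>
  <li><a href="/wiki/Category:Sub_A">Sub A</a></li>
</ul></div>
<div id="mw-pages">
  <a href="?pagefrom=x">next page</a>
  <div><ul>
    <li><a href="/wiki/alpha">alpha</a></li>
    <li><a href="/wiki/beta">beta</a></li>
  </ul></div>
</div>
"""

page = CategoryPageParser()
page.feed(SAMPLE)
words, subcategories = page.words, page.subcategories
```

Both sections go through the same code path, which is the consistency the commit above is aiming for.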

Wiktionary limits category pages to 200 words per page.
This commit ensures that fetch_category returns all the
words by updating self.soup to the next page of words.

Updated the readme to include examples using categories.

This commit corrects the parser_next_page_links
function, which was limited to Category:English_phrasebook.
The category name is now passed as an argument, so the
function is applicable to all categories.
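
The paging behaviour described in these commits can be sketched as a loop that keeps following the next page until none is left, with the category name passed in rather than hard-coded. The names `fetch_all_words`, `fake_get_page` and `FAKE_SITE` are hypothetical, and the in-memory dictionary stands in for real requests to Wiktionary.

```python
def fetch_all_words(category, get_page):
    """Collect the words of a category across all of its pages.

    get_page(category, page_from) is a hypothetical callable returning
    (words_on_page, next_page_from), where next_page_from is None on the
    last page.
    """
    words = []
    page_from = None  # None means "start at the first page"
    while True:
        page_words, page_from = get_page(category, page_from)
        words.extend(page_words)
        if page_from is None:
            return words


# Made-up data: one category split over two pages, one that fits on one page.
FAKE_SITE = {
    "Category:Example": {None: (["a", "b"], "c"), "c": (["c", "d"], None)},
    "Category:Small": {None: (["x"], None)},
}


def fake_get_page(category, page_from):
    # The category is an argument, so the same code serves every category.
    return FAKE_SITE[category][page_from]


all_words = fetch_all_words("Category:Example", fake_get_page)
small_words = fetch_all_words("Category:Small", fake_get_page)
```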