Word frequency calculation is wrong #46

Open
BALaka-18 opened this issue Jul 29, 2020 · 0 comments

The word frequency distribution is built by this method:

def _build_frequency_dist(self, phrase_list):
    """Builds frequency distribution of the words in the given body of text.
    :param phrase_list: List of List of strings where each sublist is a
                        collection of words which form a contender phrase.
    """

    self.frequency_dist = Counter(chain.from_iterable(phrase_list))

Tracing back to where phrase_list is computed:

def _generate_phrases(self, sentences):
    """Method to generate contender phrases given the sentences of the text
    document.
    :param sentences: List of strings where each string represents a
                      sentence which forms the text.
    :return: Set of string tuples where each tuple is a collection
             of words forming a contender phrase.
    """
    phrase_list = set()
    # Create contender phrases from sentences.
    for sentence in sentences:
        word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
        phrase_list.update(self._get_phrase_list_from_words(word_list))
    return phrase_list

phrase_list is a set, so it holds each contender phrase only once. If a phrase repeats in the text, the repeats are discarded before counting, and the resulting word frequencies, as tested by me, come out too low.
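
A minimal sketch of the effect, using only the standard library. The phrase tuples below are hypothetical stand-ins for what _get_phrase_list_from_words might return, not output from the library itself:

```python
from collections import Counter
from itertools import chain

# Hypothetical contender phrases in document order; ("red", "apple") occurs twice.
phrases = [("red", "apple"), ("green", "pear"), ("red", "apple")]

# Collecting phrases into a set first (as _generate_phrases does) drops the repeat.
freq_from_set = Counter(chain.from_iterable(set(phrases)))

# Counting over the full list keeps every occurrence.
freq_from_list = Counter(chain.from_iterable(phrases))

print(freq_from_set["apple"])   # 1
print(freq_from_list["apple"])  # 2
```

Here "apple" appears twice in the input, but the set-based count reports it once.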

I have modified the Rake() object to ensure the calculations are correct. @csurfer, kindly assign me this issue so I can create a pull request.
