Double-counting the documents containing an item #10

jianle4github · 2017-12-16T22:52:05Z

If an item, for example "Bourgogne", appears multiple times in a given document, such as "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France", your algorithm appends an entry for that same document to the IrIndex.index list and the IrIndex.tf list once per occurrence. These repeated appends inflate len(self.tf[term]), which the following code uses as the total number of documents containing the item:

idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
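
To make the distortion concrete, here is a small sketch. It assumes a corpus of 3 documents in which the item occurs twice in exactly one document; the variable names and the corpus size are illustrative and are not taken from the repository.

```python
from math import log

num_documents = 3          # assumed corpus size, for illustration only

# With the repeated appends, the same document contributes two entries,
# each holding terms.count(term) == 2.
tf_entries_buggy = [2, 2]
# With one entry per containing document, the list has length 1.
tf_entries_fixed = [2]

idf_buggy = log(float(num_documents) / float(len(tf_entries_buggy)))
idf_fixed = log(float(num_documents) / float(len(tf_entries_fixed)))

# The inflated document count lowers the idf of the repeated item.
print(idf_buggy, idf_fixed)
```

The idf computed from the duplicated entries is log(3/2) rather than log(3), so an item that repeats inside a document is treated as more common across the corpus than it really is.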

I changed the code from:

    for term in terms:
        if term not in self.index:
            self.index[term] = []
            self.tf[term] = []

        self.index[term].append(document_pos)
        self.tf[term].append(terms.count(term))

to:

    for term in terms:
        if term not in self.index:
            self.index[term] = []
            self.tf[term] = []

        if document_pos not in self.index[term]:
            self.index[term].append(document_pos)
            self.tf[term].append(terms.count(term))

The added check skips the subsequent append operations when the item is already recorded for that document inside the IrIndex object, so each containing document is counted only once per item.
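
An alternative that avoids the list membership test is to iterate over the distinct terms of each document. The sketch below is only an illustration under stated assumptions: `MiniIndex`, `index_document`, the tokenization, and the second sample document are hypothetical stand-ins, not the repository's actual IrIndex code or data.

```python
from collections import Counter
from math import log


class MiniIndex:
    """Hypothetical stand-in for IrIndex; only the indexing step is sketched."""

    def __init__(self):
        self.index = {}      # term -> list of document positions
        self.tf = {}         # term -> list of per-document term counts
        self.documents = []

    def index_document(self, text):
        # Assumed tokenization: lowercase, drop commas, split on whitespace.
        terms = text.lower().replace(",", " ").split()
        document_pos = len(self.documents)
        self.documents.append(text)
        # Counter yields each distinct term once, together with its count,
        # so every (term, document) pair is appended exactly once.
        for term, count in Counter(terms).items():
            self.index.setdefault(term, []).append(document_pos)
            self.tf.setdefault(term, []).append(count)

    def idf(self, term):
        return log(float(len(self.documents)) / float(len(self.tf[term])))


idx = MiniIndex()
idx.index_document("Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France")
# Invented second document, for illustration only.
idx.index_document("Ridge Zinfandel 2012, California")
print(idx.index["bourgogne"], idx.tf["bourgogne"], idx.idf("bourgogne"))
```

Counting once per document also avoids calling terms.count(term) for every occurrence, which may matter for long documents.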
