feat: refactor idf module, implementing tfidf & bm25 in TagExtracter by strategy pattern #183

CocaineCong · 2023-11-07T01:16:00Z

Please provide Issues links to:

Issues: feature: any plan on implementing the bm25 algorithm ? #181

Provide test code:

I added the stopwords test code in hmm/idf/idf_test.go

err = te.LoadStopWords()
tt.Nil(t, err)

I ran all the stopword-related tests (including hmm/idf/idf_test.go and examples/hmm/main.go) and they pass.

Description

1. stopwords

Stopwords are a standalone concern, so it would be better to extract the stopwords module out of the idf file path.

I also found that TagExtracter uses the stopwords only while cutting words, to skip any stopword:

for _, w := range t.seg.Cut(text, true) {
	w = strings.TrimSpace(w)
	if utf8.RuneCountInString(w) < 2 {
		continue
	}
	if t.stopWord.IsStopWord(w) {
		continue
	}

	if f, ok := freqMap[w]; ok {
		freqMap[w] = f + 1.0
	} else {
		freqMap[w] = 1.0
	}
}
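A minimal, self-contained sketch of the same filtering step, for readers without the repository at hand. The `countTerms` helper and the plain map used as a stop set are hypothetical stand-ins for `t.seg.Cut` and `t.stopWord`; they are not part of this PR.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// countTerms builds a term-frequency map from already-segmented words.
// It skips words shorter than two runes and any word in the stop set.
// The stop set is a plain map standing in for the real StopWord type.
func countTerms(words []string, stop map[string]bool) map[string]float64 {
	freqMap := make(map[string]float64)
	for _, w := range words {
		w = strings.TrimSpace(w)
		if utf8.RuneCountInString(w) < 2 {
			continue
		}
		if stop[w] {
			continue
		}
		// A missing key reads as 0, so no existence check is needed.
		freqMap[w]++
	}
	return freqMap
}

func main() {
	words := []string{"汽车", " 消费 ", "的", "我们", "汽车"}
	stop := map[string]bool{"我们": true}
	fmt.Println(countTerms(words, stop))
}
```

Because a missing map key reads as zero in Go, the `if f, ok := freqMap[w]` branch in the loop above could be reduced to a single increment.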

2. extractor

Extract the extractor module so that we can implement more relevance algorithms on top of it.

before:

// TagExtracter is extract tags struct.
type TagExtracter struct {
	seg gse.Segmenter
	idf *Idf
	stopWord *StopWord
}

after:

// TagExtracter is extract tags struct.
type TagExtracter struct {
	seg gse.Segmenter

	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
}

3. relevance

Refactor the idf module and extract the relevance module using the strategy pattern, to support more relevance algorithms such as idf, tfidf, bm25 and so on.

before:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
}

after:

type Idf struct {
	median float64
	freqs []float64
	Base
}

type BM25 struct {
	K1 float64
	N float64
	Base
}

type Base struct {
	// loading some stop words
	StopWord *stop_word.StopWord

	// loading segmenter for cut word
	Seg gse.Segmenter
}

Then I implement Relevance using the strategy pattern, for example:

// Relevance easily scalable Relevance calculations (for idf, tf-idf, bm25 and so on)
type Relevance interface {
	// AddToken add text, frequency, position on obj
	AddToken(text string, freq float64, pos ...string) error

	// LoadDict load file from incoming parameters,
	// if incoming params no exist, will load file from default file path
	LoadDict(files ...string) error

	// LoadDictStr loading dict file by file path
	LoadDictStr(pathStr string) error

	// LoadStopWord loading word file by filename
	LoadStopWord(fileName ...string) error

	// Freq find the frequency, position, existence information of the key
	Freq(key string) (float64, string, bool)

	// TotalFreq the total number of tokens in the dictionary
	TotalFreq() float64

	// FreqMap get frequency map
	// key: word, value: frequency
	FreqMap(text string) map[string]float64

	// ConstructSeg return the segment with weight
	ConstructSeg(text string) segment.Segments
}
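To make the strategy idea concrete, here is a cut-down sketch. The `Weigher` interface, the `idfOnly` and `tfidf` types, and the `Extractor.Score` method are hypothetical names invented for this illustration; the real `Relevance` interface above has more methods.

```go
package main

import "fmt"

// Weigher is a hypothetical, reduced stand-in for the Relevance interface:
// each strategy turns a term's statistics into a weight.
type Weigher interface {
	Weight(tf, idf float64) float64
}

// idfOnly ignores the term frequency.
type idfOnly struct{}

func (idfOnly) Weight(_, idf float64) float64 { return idf }

// tfidf multiplies term frequency by inverse document frequency.
type tfidf struct{}

func (tfidf) Weight(tf, idf float64) float64 { return tf * idf }

// Extractor depends only on the interface, so strategies can be swapped
// without touching the extractor itself.
type Extractor struct {
	Relevance Weigher
}

// Score delegates the weighting to whichever strategy is plugged in.
func (e Extractor) Score(tf, idf float64) float64 {
	return e.Relevance.Weight(tf, idf)
}

func main() {
	for _, e := range []Extractor{{idfOnly{}}, {tfidf{}}} {
		fmt.Printf("%T -> %v\n", e.Relevance, e.Score(3, 2.5))
	}
}
```

Adding another algorithm then only means adding another type that satisfies the interface.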

default Idf:

func NewIdf() Relevance {
	idf := &Idf{
		freqs: make([]float64, 0),
	}
	idf.StopWord = stop_word.NewStopWord()
	return Relevance(idf)
}

Implementing the interface functions:

// AddToken add a new word with IDF into the dictionary.
func (i *Idf) AddToken(text string, freq float64, pos ...string) error {
	err := i.Seg.AddToken(text, freq, pos...)

	i.freqs = append(i.freqs, freq)
	sort.Float64s(i.freqs)
	i.median = i.freqs[len(i.freqs)/2]
	return err
}

// LoadDict load the idf dictionary
func (i *Idf) LoadDict(files ...string) error {
	if len(files) <= 0 {
		files = i.Seg.GetIdfPath(files...)
	}

	return i.Seg.LoadDict(files...)
}

// Freq return the Idf of the word
func (i *Idf) Freq(key string) (float64, string, bool) {
	return i.Seg.Find(key)
}

....

All of these changes have been run against the test code and pass.

vcaesar · 2023-11-07T15:54:00Z

Just push the full version; a small change like this does not need a PR and review.

CocaineCong · 2023-11-08T01:17:13Z

Just push the full version; a small change like this does not need a PR and review.

OK, PTAL.

:update dict readme

feat: add tfidf & bm25 in TagExtracter

CocaineCong · 2023-11-16T17:23:43Z

Updated on 2023-11-17. PTAL @vcaesar

Please provide Issues links to:

#183

I found that the original dict in data/dict/zh is not sufficient for calculating tfidf and bm25.

The dict only has a tf value and a position value, but no idf value for calculating tfidf and no original corpus for calculating the average document length.

So I added two new dict files in data/dict/zh, and added some information to data/dict/README.md about the source of the dict files.

Description

1. two new dict files

In tf_idf.txt, the first column is the term, the second is the term's word frequency, and the third is its inverse document frequency, so the tf and idf values can easily be read from the default dict file path if the client does not supply a dict file.

tf_idf_origin.txt contains just the original corpus text; it is loaded when a new BM25 is created.

2. details of the tfidf implementation

2.1 loading tfidf dict file

Just like idf, we implement the interface functions following the strategy pattern.

For example, the LoadDict function:

// LoadDict load dict for TFIDF seg
func (t *TFIDF) LoadDict(files ...string) error {
	if len(files) <= 0 {
		files = t.Seg.GetTfIdfPath(files...)
	}
	dictFiles := make([]*types.LoadDictFile, len(files))
	for i, v := range files {
		dictFiles[i] = &types.LoadDictFile{
			FilePath: v,
			FileType: consts.LoadDictTypeTFIDF,
		}
	}

	return t.Seg.LoadTFIDFDict(dictFiles)
}

Unlike idf, here we define GetTfIdfPath() to load the default TFIDF dict path, because the file format is different.

To distinguish between the different kinds of dict file, we define a FileType:

const (
	// dict file type to loading
	// LoadDictTypeIDF dict of IDF to loading
	LoadDictTypeIDF = iota + 1
	// LoadDictTypeTFIDF dict of TFIDF to loading
	LoadDictTypeTFIDF
	// LoadDictTypeBM25 dict of BM25 to loading
	LoadDictTypeBM25
	// LoadDictTypeWithPos dict of with position to loading
	LoadDictTypeWithPos
	// LoadDictCorpus dict of corpus to loading
	LoadDictCorpus
)
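Starting the constants at `iota + 1` leaves the zero value free to mean "not set". The sketch below illustrates that; the named `DictType` type, the shortened constant names, and the `Valid` method are assumptions added for this example, since the PR declares untyped constants.

```go
package main

import "fmt"

// DictType is a hypothetical named type; the PR uses untyped constants.
type DictType int

const (
	TypeIDF DictType = iota + 1 // 1
	TypeTFIDF                   // 2
	TypeBM25                    // 3
	TypeWithPos                 // 4
	TypeCorpus                  // 5
)

// Valid reports whether t is one of the declared dict types.
// The zero value is never valid because the constants start at 1.
func (t DictType) Valid() bool {
	return t >= TypeIDF && t <= TypeCorpus
}

func main() {
	var unset DictType
	fmt.Println(unset.Valid(), TypeTFIDF.Valid(), int(TypeCorpus))
}
```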

LoadTFIDFDict is similar to LoadDict; it just reads from a different dict path:

for i := 0; i < len(dictFiles); i++ {
	err := seg.ReadTFIDF(dictFiles[i])
	if err != nil {
		return err
	}
}

ReadTFIDF opens the dict file for reading:

// ReadTFIDF read the dict file
func (seg *Segmenter) ReadTFIDF(file string) error {
	if !seg.SkipLog {
		log.Printf("Load the gse dictionary: \"%s\" ", file)
	}

	dictFile, err := os.Open(file)
	if err != nil {
		log.Printf("Could not load dictionaries: \"%s\", %v \n", file, err)
		return err
	}
	defer dictFile.Close()

	reader := bufio.NewReader(dictFile)
	return seg.ReaderTFIDF(reader, file)
}

In ReaderTFIDF, we handle the idf value the same way as the tf value:

freq = seg.Size(size, text, freqText)
inverseFreq = seg.Size(size, text, idfText)
if freq == 0.0 || inverseFreq == 0.0 {
	continue
}

// Add participle tokens to the dictionary
words := seg.SplitTextToWords([]byte(text))
token := Token{text: words, freq: freq, inverseFreq: inverseFreq}
seg.Dict.AddToken(token)

2.2 process on tfidf calculation

For the Freq function, we define FindTFIDF to return the tf value, the idf value, and whether the term exists.

// Freq return the TFIDF of the word
func (t *TFIDF) Freq(key string) (float64, interface{}, bool) {
	return t.Seg.FindTFIDF(key)
}

The reason we return an interface{} is to stay compatible with the idf Freq function: idf returns position info, which is a string, while tfidf and bm25 need the idf value, which is a float64.

The tfidf calculation process is as follows:

// calculateWeight calculates the word's weight by TFIDF
func (t *TFIDF) calculateWeight(term string) float64 {
	tf, idf, _ := t.Freq(term)
	return tf * idf.(float64)
}

// ConstructSeg construct segment with weight
func (t *TFIDF) ConstructSeg(text string) segment.Segments {
	// make segment list by total freq num
	ws := make([]segment.Segment, 0)
	for k := range t.FreqMap(text) {
		ws = append(ws, segment.Segment{Text: k, Weight: t.calculateWeight(k)})
	}

	return ws
}
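Since Freq returns an interface{}, a bare `idf.(float64)` panics if the second value is ever something else. Below is a hedged sketch of the same weighting with a checked assertion and a descending sort. The `lookup` function type, the local `Segment` struct, and the `weight` and `rank` helpers are stand-ins invented here, and the numbers are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Segment pairs a term with its weight (a local stand-in for segment.Segment).
type Segment struct {
	Text   string
	Weight float64
}

// lookup is a hypothetical stand-in for Freq: it returns tf, an untyped
// second value, and whether the term exists.
type lookup func(term string) (float64, interface{}, bool)

// weight returns tf*idf, or 0 when the term is missing or the second
// value is not a float64.
func weight(freq lookup, term string) float64 {
	tf, v, ok := freq(term)
	if !ok {
		return 0
	}
	idf, isFloat := v.(float64)
	if !isFloat {
		return 0
	}
	return tf * idf
}

// rank weighs every term and sorts by descending weight.
func rank(freq lookup, terms []string) []Segment {
	segs := make([]Segment, 0, len(terms))
	for _, t := range terms {
		segs = append(segs, Segment{Text: t, Weight: weight(freq, t)})
	}
	sort.Slice(segs, func(i, j int) bool { return segs[i].Weight > segs[j].Weight })
	return segs
}

func main() {
	dict := map[string][2]float64{"汽车": {3, 2.0}, "消费": {5, 1.5}}
	freq := func(term string) (float64, interface{}, bool) {
		e, ok := dict[term]
		return e[0], e[1], ok
	}
	fmt.Println(rank(freq, []string{"汽车", "消费", "未知"}))
}
```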

2.3 result

Result from hmm/relevance/tfidf_test.go:

// output:
// segments:  5 [{消费者 135.35978394451678} {汽车 132.5431762274668} {消费 99.74972568967256} {增强 96.4479152517576} {下跌 62.99878978351253}]
// results:  [{消费 1} {刺激 0.5486451492724487} {下跌 0.4311204839551169} {汽车 0.4095437392771989} {购车 0.4064546007671519}]

That's all for TFIDF.

3. details of the BM25 implementation

BM25 follows the same approach as TFIDF, so I won't repeat the parts that are the same.

3.1 corpus and constant

To calculate BM25 we need the average document length, so we have to load the corpus.

For that we implemented LoadCorpus:

// LoadCorpus for calculate the average length of corpus
func (bm25 *BM25) LoadCorpus(path ...string) (err error) {
	averLength, err := bm25.Seg.LoadCorpusAverLen(path...)
	if err != nil {
		return
	}

	bm25.AverageDocLength = averLength
	return
}

The average length is calculated as follows:

func (seg *Segmenter) ReadCorpus(file string) (corpusAverLen float64, err error) {
	if !seg.SkipLog {
		log.Printf("Load the gse dictionary: \"%s\" ", file)
	}
	var corpusNumber float64 = 0
	var corpusLength float64 = 0
	dictFile, err := os.Open(file)
	if err != nil {
		log.Printf("Could not load dictionaries: \"%s\", %v \n", file, err)
		return
	}
	defer dictFile.Close()

	// new the Scanner to read file content
	scanner := bufio.NewScanner(dictFile)
	// read file content by line
	for scanner.Scan() {
		corpusNumber++
		line := scanner.Text()
		corpusLength += float64(utf8.RuneCountInString(line))
	}
	corpusAverLen = corpusLength / corpusNumber

	return
}

What's more, we define default values for K1 and B, used if the client doesn't define them, in consts/dict_file.go:

const (
	// BM25DefaultK1 default k1 value for calculate bm25
	BM25DefaultK1 = 1.25

	// BM25DefaultB default B value for calculate bm25
	BM25DefaultB = 0.75
)
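For context on where K1, B and the average document length enter the score, here is a sketch of the commonly used Okapi BM25 term formula. The exact formula used in this PR is not shown above, so treat this as the textbook form rather than the PR's code; `bm25Term` and its parameter names are hypothetical.

```go
package main

import "fmt"

const (
	defaultK1 = 1.25 // same value as BM25DefaultK1 above
	defaultB  = 0.75 // same value as BM25DefaultB above
)

// bm25Term scores one term with the usual Okapi BM25 formula:
//
//	idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
//
// It returns 0 when the term does not occur or no average length is known.
func bm25Term(tf, idf, docLen, avgDocLen, k1, b float64) float64 {
	if tf == 0 || avgDocLen == 0 {
		return 0
	}
	norm := 1 - b + b*docLen/avgDocLen
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	// Same term statistics, two document lengths: the longer document
	// is penalised by the length normalisation.
	fmt.Println(bm25Term(2, 1.5, 100, 100, defaultK1, defaultB))
	fmt.Println(bm25Term(2, 1.5, 200, 100, defaultK1, defaultB))
}
```

K1 controls how quickly repeated occurrences of a term saturate, and B controls how strongly documents longer than the average are penalised.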

3.2 result

Result from hmm/relevance/bm25_test.go:

// output:
// segments:  5 [{想象 13.489829905298084} {活力 12.86320693643856} {充满 12.480977334559475} {这里 9.56153393671824} {历史 8.738605467373437}]
// results:  [{积淀 1} {活力 0.7380261680439799} {有 0.6602549059736358} {历史 0.6573229314364966} {想象 0.39804353825110805}]

That's all, thanks for reading and reviewing. This is my first time submitting such a large PR.

vcaesar · 2023-11-16T20:14:31Z

Ok, I will review and test it.

CocaineCong · 2023-12-01T15:28:33Z

@vcaesar hey, are there any problems with this PR? I will fix them if there are. 🫡

feat:extract stopwords module

529689f

CocaineCong changed the title ~~feat:extract stopwords module~~ feat:extract stopwords module from the idf file path Nov 7, 2023

feat:refactor extracker & idf by strategy pattern

2bfba24

CocaineCong changed the title ~~feat:extract stopwords module from the idf file path~~ feat:extract tag extracker module & refactor idf by strategy pattern Nov 8, 2023

vcaesar added the enhancement label Nov 8, 2023

vcaesar added this to the v0.90.0 milestone Nov 8, 2023

vcaesar added the Proposal label Nov 8, 2023

CocaineCong and others added 13 commits November 11, 2023 09:18

feat:support idf

a062fe5

feat:support load tfidf dict file

724124f

feat:support freq return tf idf value

7214b18

feat:support cal term tfidf wight

ee624dc

feat:support bm25

bfb53cb

feat:tf idf origin

b9c4785

feat:load corpus

ef7608b

feat:cal corpus avgerage length

14699cd

feat: support loading corpus from new bm25 function

29806a5

feat

d0edae3

:update dict readme

feat:add README for dict

5ee7803

feat:add output result

b641c6f

Merge pull request #1 from CocaineCong/feature-tfidf

1896473

feat: add tfidf & bm25 in TagExtracter

CocaineCong changed the title ~~feat:extract tag extracker module & refactor idf by strategy pattern~~ feat: refactor idf module, implementing idf & bm25 in TagExtracter by strategy pattern Nov 16, 2023

CocaineCong changed the title ~~feat: refactor idf module, implementing idf & bm25 in TagExtracter by strategy pattern~~ feat: refactor idf module, implementing tfidf & bm25 in TagExtracter by strategy pattern Nov 17, 2023
