
Improve TF-IDF agent by tuning matches threshold #95

Open
xavierfigueroav opened this issue Mar 22, 2022 · 1 comment
@xavierfigueroav
Contributor

Hello.

I've been playing around with some parameters of the TF-IDF agent.

I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 points. However, the threshold does help reduce compute time, since fewer matches have to be sorted at the end of the search. See the piece of code I am talking about (especially lines 126 and 133):

```python
for counter, value in enumerate(all_documents_matrix, start=0):
    sim_score = self.__cosine_similarity(value, search_martix)
    if sim_score >= 0.3:
        matches.append({
            'shortname': self.licenseList.iloc[counter]['shortname'],
            'sim_type': "TF-IDF Cosine Sim",
            'sim_score': sim_score,
            'desc': ''
        })
matches.sort(key=lambda x: x['sim_score'], reverse=True)
if self.verbose > 0:
    print("time taken is " + str(time.time() - startTime) + " sec")
return matches
```
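The `__cosine_similarity` helper isn't shown in the snippet above; a minimal sketch of what such a function typically computes, assuming dense NumPy vectors (the name and signature here are illustrative, not the agent's actual code):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (||a|| * ||b||)
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # An all-zero vector has no direction; treat it as no similarity.
        return 0.0
    return float(np.dot(a, b) / denom)
```

The cost of this dot product and the two norms grows with the vector length, which is why shrinking the vocabulary (see the `max_df` discussion below) speeds the search up.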

Using the evaluation.py script, I've carried out some experiments:

| # | Algorithm | Time elapsed (s) | Accuracy |
|---:|---|---:|---:|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | – | 57.0% |
| 9 | Ngram (BigramCosineSim) | – | 56.0% |
| 10 | Ngram (DiceSim) | – | 55.0% |
| 11 | wordFrequencySimilarity | – | 23.0% |
| 12 | DLD | – | 17.0% |
| 13 | tfidf (ScoreSim) | – | 13.0% |
  • Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
  • Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, simply removing the threshold makes the agent 2x slower, so I continued tuning the threshold, holding on to the largest value that still produces 62.0% accuracy, which is 0.16, shown in row 4.
  • To further decrease the execution time while keeping the accuracy gain, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (the default is 1.0) keeps the accuracy at 62.0% but makes the agent 1.1x faster, shown in row 3.
    • Why does decreasing the max_df value increase the speed? Because the vectorizer ignores every term that appears in more than a max_df fraction of the documents (see docs), i.e., it drops the most frequent terms, so each document vector is shorter and the cosine similarity is cheaper to compute.
    • Why does decreasing the max_df value keep the accuracy high? My explanation is that the terms that appear in most licenses do not help the algorithm distinguish between licenses; the rare terms are what make licenses differ from one another, so they are enough for the algorithm to do a good job.
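To illustrate the max_df effect described above, here is a small sketch using scikit-learn's TfidfVectorizer. The toy documents are my own illustration, not the evaluation set; with max_df=0.5, any term appearing in more than half of the documents (here, "is") is pruned from the vocabulary, so every document vector gets shorter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "permission is hereby granted free of charge",
    "permission granted to redistribute and modify",
    "redistribution and use in source form is permitted",
    "this software is provided as is without warranty",
]

# Default max_df=1.0: no document-frequency pruning.
full = TfidfVectorizer().fit(docs)

# max_df=0.5: drop terms whose document frequency exceeds 50%
# ("is" appears in 3 of 4 documents, so it is pruned).
pruned = TfidfVectorizer(max_df=0.5).fit(docs)

print(len(full.vocabulary_), len(pruned.vocabulary_))
```

The same mechanism at max_df=0.10 over the full license corpus prunes far more aggressively, which is where the 1.1x speedup in row 3 comes from.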

I will be opening a PR so you can reproduce the results in row 3 and merge the changes if you consider them relevant.

Important notes:

  • I've left out the speed times for all the other algorithms, because I ran those experiments in a different environment, so a time comparison wouldn't be fair.
  • All the results differ from the last report I could find. I do not fully understand why some of them are so different; probably changes in the test files or in the algorithms. Either way, 62.0% is the new best result in both reports.
  • My findings may help improve other agents that use thresholds, such as Ngram.
  • This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.
@GMishx
Member

GMishx commented Mar 28, 2022

That's a very detailed evaluation, @xavierfigueroav. Thank you for providing the info.

Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).
