improvement #37

sareaghaei · 2021-05-05T11:23:28Z

Hi Antonin
I am working on OpenTapioca to improve its accuracy to some extent so that we can apply it in our project.
I tried adding other features to the vectors, besides the current ones:
connection_count: connection_count(tag_i) = sum over j of len(tag_i.edges.intersection(hrtag_j.edges)) / len(hrtag_j.edges), where hrtag_j is the highest-ranked tag among the tags detected for phrase j.
hop_count: hop_count(tag_i) = sum over j of (1 - len(tag_i.edges.intersection(tag_j.edges)) / len(tag_i.edges.union(tag_j.edges))), where tag_j is any tag detected for any phrase in the input sentence.
cosine_similarity: using S-BERT to generate embeddings of the tag candidates' descriptions and of the input sentence, then taking the cosine similarity between the generated vectors.
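To make the three definitions above concrete, here is a minimal sketch in plain Python. It assumes that each tag's edges are available as a Python set of neighbour identifiers and that the ratios in the formulas are ratios of set sizes; both are my reading of the formulas, not something taken from the OpenTapioca code. The function and argument names are hypothetical, and the embedding step (S-BERT) is left out: `cosine_similarity` only takes two already computed vectors.

```python
from math import sqrt


def connection_count(tag_edges, top_ranked_edges):
    """Sum, over phrases, of the share of the top-ranked tag's edges
    that the candidate tag also has.

    tag_edges: set of neighbour ids of the candidate tag (assumed shape).
    top_ranked_edges: one edge set per phrase, for its highest-ranked tag.
    """
    total = 0.0
    for other in top_ranked_edges:
        if other:  # skip tags without edges to avoid dividing by zero
            total += len(tag_edges & other) / len(other)
    return total


def hop_count(tag_edges, other_tags_edges):
    """Sum of Jaccard distances between the candidate tag's edges and
    the edges of every other detected tag."""
    total = 0.0
    for other in other_tags_edges:
        union = tag_edges | other
        if union:  # two empty edge sets contribute nothing
            total += 1.0 - len(tag_edges & other) / len(union)
    return total


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Note that under this reading hop_count grows with the number of detected tags in the sentence, so it may need normalising before it is comparable across sentences of different lengths.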
I also used an XGBoost ranker (learning to rank) instead of the SVM classifier.
None of these changes increased the F1 score.
Do you have any suggestions for me?

wetneb · 2021-05-05T12:37:51Z

It's great you tried this! I don't really know to be honest - perhaps you could tell me more about the domain you are looking at (which dataset?). Have you observed a specific problem that motivated the addition of these features?

sareaghaei · 2021-05-05T12:53:13Z

Since the current features of the vectors are independent of the context, I tried to add some context-sensitive features. Currently I am working with the RSS dataset to train the model (although I have tried the merged_RSS_istex dataset as well).
I am not sure which domain we intend to apply OpenTapioca to in our project.
@ziodave, do you know for which domain(s) we will mainly use the tool to annotate?

ziodave · 2021-05-05T12:55:53Z

We don't have a specific domain, we would like this to work on any content.

sareaghaei · 2021-05-05T13:30:30Z

Is there any suggestion regarding which dataset might be more helpful for our goal?
(afaik, RSS is related to news excerpts and ISTEX is about author affiliations of articles)

wetneb · 2021-05-05T14:43:23Z

At the moment there is still some dependency on the context (as we discussed before here) - that was designed to "replace" context-sensitive features, in a sense. But it is totally possible that directly adding context-sensitive features helps too!

If you want to improve the performance of the heuristics, I would do as follows:

1. analyze the errors made on a dataset you care about
2. try to get a sense of what is common to those errors, what information the classifier is missing
3. design a feature which represents this information
4. implement it, train the classifier with it, and go to 1.
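The first two steps of this loop could be sketched as follows. This is only an illustration, assuming that gold and predicted annotations are available as (start, end, item id) tuples over the same text; that representation and both helper names are hypothetical and are not OpenTapioca's own evaluation code.

```python
from collections import Counter


def split_errors(gold, predicted):
    """Separate predictions into false positives and false negatives.

    gold, predicted: iterables of (start, end, item_id) tuples (assumed shape).
    """
    gold, predicted = set(gold), set(predicted)
    return {
        "false_positives": sorted(predicted - gold),
        "false_negatives": sorted(gold - predicted),
    }


def most_common_surface_forms(errors, text, n=10):
    """Count which mention strings show up most often among the errors."""
    forms = Counter(text[start:end].lower() for start, end, _ in errors)
    return forms.most_common(n)


# Illustrative data only: the item ids stand for a company and a fruit.
text = "Apple released a new phone in Paris"
gold = [(0, 5, "Q312"), (30, 35, "Q90")]
predicted = [(0, 5, "Q89"), (30, 35, "Q90")]

errors = split_errors(gold, predicted)
frequent = most_common_surface_forms(errors["false_positives"], text)
```

Grouping errors by surface form is only one possible lens; grouping by item type or by the score gap between the top two candidates may reveal different patterns.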

For me this process is very much examples-driven - I design the features with some examples in mind.

sareaghaei · 2021-05-05T14:51:37Z

Yeah, the goal of the adjacency matrix in your work is to make it context-sensitive.
Yes, the apple examples are what made me think about using context-sensitive features directly.
Thanks for your hints :)
