improvment #37

Open
sareaghaei opened this issue May 5, 2021 · 6 comments

Comments

@sareaghaei

sareaghaei commented May 5, 2021

Hi Antonin,
I am working on OpenTapioca to improve its accuracy to some extent so that we can apply it in our project.
I tried adding other features to the feature vectors besides the current ones:
connection_count: connection_count(tag_i) = sum over j of |tag_i.edges ∩ hrtag_j.edges| / |hrtag_j.edges|, where hrtag_j is the highest-ranked tag among the tags detected for phrase j.
hop_count: hop_count(tag_i) = sum over j of (1 - |tag_i.edges ∩ tag_j.edges| / |tag_i.edges ∪ tag_j.edges|), where tag_j is any tag detected for any phrase in the input sentence.
cosine_similarity: S-BERT generates embeddings of the candidate tags' descriptions and of the input sentence, and the feature is the cosine similarity between the resulting vectors.
I also used an XGBoost ranker (learning to rank) instead of the SVM classifier.
None of these changes increased the F1 score.
Do you have any suggestions for me?
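
For readers following along, the three features described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not OpenTapioca's code: each tag's edges are modelled as a plain set of neighbouring item ids, set sizes are used wherever the formulas divide one edge set by another, and the function names and toy ids are hypothetical. The cosine similarity is shown on plain vectors; the S-BERT embedding step is left out.

```python
import math


def connection_count(tag_edges, top_ranked_edges):
    """Sum over phrases j of |edges(tag) & edges(hrtag_j)| / |edges(hrtag_j)|.

    top_ranked_edges holds one edge set per phrase: the edges of the
    highest-ranked tag detected for that phrase (an assumed representation).
    """
    total = 0.0
    for top_edges in top_ranked_edges:
        if top_edges:  # skip tags without edges to avoid dividing by zero
            total += len(tag_edges & top_edges) / len(top_edges)
    return total


def hop_count(tag_edges, other_tags_edges):
    """Sum of Jaccard distances between this tag's edges and each other tag's."""
    total = 0.0
    for other in other_tags_edges:
        union = tag_edges | other
        if union:  # two empty edge sets contribute a distance of 0
            total += 1.0 - len(tag_edges & other) / len(union)
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data: the ids are placeholders, not real graph neighbourhoods.
candidate = {"Q1", "Q2", "Q3"}
others = [{"Q2", "Q3"}, {"Q3", "Q4", "Q5", "Q6"}]

conn = connection_count(candidate, others)   # 2/2 + 1/4
hops = hop_count(candidate, others)          # (1 - 2/3) + (1 - 1/6)
cos = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```

Note that hop_count as defined is a sum of Jaccard distances, so it grows with the number of detected tags; dividing by that number would make values comparable across sentences of different lengths.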

@wetneb
Member

wetneb commented May 5, 2021

It's great you tried this! I don't really know to be honest - perhaps you could tell me more about the domain you are looking at (which dataset?). Have you observed a specific problem that motivated the addition of these features?

@sareaghaei
Author

Since the current features of the vectors are independent of the context, I tried to add some context-sensitive features. Currently I am training the model on the RSS dataset (although I have also tried the merged_RSS_istex dataset).
I am not sure which domain we intend to apply OpenTapioca to in our project.
@ziodave, do you know which domain(s) we will mainly use the tool to annotate?

@ziodave
Contributor

ziodave commented May 5, 2021

We don't have a specific domain; we would like this to work on any content.

@sareaghaei
Author

Do you have any suggestions about which dataset would be more helpful for our goal?
(As far as I know, RSS consists of news excerpts and ISTEX of author affiliations from articles.)

@wetneb
Member

wetneb commented May 5, 2021

At the moment there is still some dependency on the context (as we discussed before here) - that was designed to "replace" context-sensitive features, in a sense. But it is totally possible that directly adding context-sensitive features helps too!

If you want to improve the performance of the heuristics, I would do as follows:

  1. analyze the errors made on a dataset you care about
  2. try to get a sense of what is common to those errors, what information the classifier is missing
  3. design a feature which represents this information
  4. implement it, train the classifier with it, and go to 1.

For me this process is very much example-driven - I design the features with some examples in mind.
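
Steps 1 and 2 of this loop can be sketched as a small script that sorts the errors on a dataset into buckets, so that the examples in each bucket can be read side by side. This is a minimal sketch under assumptions, not part of OpenTapioca: gold and predicted annotations are assumed to be dictionaries mapping a (start, end) mention span to an entity id, and the function names and sample ids are hypothetical.

```python
def categorize_errors(gold, predicted):
    """Sort disagreements between gold and predicted annotations into buckets.

    Both arguments map a (start, end) mention span to an entity id.
    """
    errors = {"missed": [], "spurious": [], "wrong_entity": []}
    for span, entity in gold.items():
        if span not in predicted:
            errors["missed"].append((span, entity))
        elif predicted[span] != entity:
            errors["wrong_entity"].append((span, entity, predicted[span]))
    for span, entity in predicted.items():
        if span not in gold:
            errors["spurious"].append((span, entity))
    return errors


def f1_score(gold, predicted):
    """F1 where a prediction counts as correct only if span and entity both match."""
    correct = sum(1 for span, entity in predicted.items() if gold.get(span) == entity)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations for one sentence; spans and ids are made up for the example.
gold = {(0, 5): "Q312", (20, 26): "Q90", (40, 46): "Q64"}
predicted = {(0, 5): "Q89", (30, 35): "Q5", (40, 46): "Q64"}

errors = categorize_errors(gold, predicted)
score = f1_score(gold, predicted)
```

Reading through the wrong_entity bucket is usually the quickest way to spot what information the classifier lacks, since each entry pairs the expected entity with the one chosen instead.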

@sareaghaei
Author

sareaghaei commented May 5, 2021

Yeah, the goal of the adjacency matrix in your work is to make it context-sensitive.
Yes, the "apple" examples are what made me think about using context-sensitive features directly.
Thanks for your hints :)
