This repository has been archived by the owner on Oct 20, 2018. It is now read-only.

Feature/adding relevance scores #300

Open
wants to merge 2 commits into
base: master
from

Conversation

tgalery
Member

@tgalery tgalery commented May 6, 2014

This PR adds a feature to dbpedia-spotlight, namely relevance weights associated with the extracted annotations. To enable it, add a line containing relevance_scoring=default to the model.properties file in your model folder. If that line is not present, spotlight behaves as usual.
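
As a small illustration of the opt-in described above, the relevant part of model.properties might look like the fragment below. Only the key and value come from this description; the comment line is added here for readability.

```properties
# Opt in to relevance scoring; leave this line out to keep the usual behaviour.
relevance_scoring=default
```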

We use:

  • the context vector overlap
  • the number of times a topic is spotted in the text
  • the overlap among the context words of the topics, for example "Microsoft" being a context word shared by the topics
  • min-max normalization of the output score to the 0-1 range (this step could be improved)
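
The normalization step can be sketched roughly as follows. This is a minimal Python illustration of min-max normalization, not the code in this PR: the function name `min_max_normalize`, the topic names, and the raw scores are all made up, and the handling of the empty and single-topic cases is an assumption.

```python
def min_max_normalize(scores):
    """Rescale raw relevance scores to the 0-1 range.

    Hypothetical helper for illustration only; `scores` maps a topic
    name to its raw (unnormalized) score.
    """
    if not scores:
        # Assumption: no spotted topics means nothing to score.
        return {}
    lowest = min(scores.values())
    highest = max(scores.values())
    if highest == lowest:
        # Assumption: with a single topic (or all-equal scores) the
        # range is zero, so every topic is given the maximum score.
        return {topic: 1.0 for topic in scores}
    span = highest - lowest
    return {topic: (score - lowest) / span for topic, score in scores.items()}


# Made-up raw scores for three topics.
raw = {"Microsoft": 4.0, "Bill_Gates": 2.0, "Seattle": 1.0}
normalized = min_max_normalize(raw)
```

With plain min-max scaling the lowest-scoring topic always ends up at 0 and the highest at 1, regardless of how relevant either one is in absolute terms, which may be one reason the step is flagged as improvable.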

On some toy data that we had manually annotated, this method gave results closer to human judgements than those given by other topic extractors such as Zemanta or Alchemy. If you think this is a good idea, we could manually annotate an established dataset, such as the Milne and Witten one, and write a paper as an attempt to reproduce the results in a more formal way.

@jodaiber
Member

This looks good, any open issues?

@tgalery
Member Author

tgalery commented May 13, 2014

Hi @jodaiber, no open issues that I know of. I could rebase on master, but I wonder whether the missing config would break the branch, as in the other PR.
It would be good for other people to test this to see if it's working as expected.

@pablomendes
Member

@tgalery, this is great! I have wanted to add relevance for a while now,
but it was always trumped by other more serious issues. Never got to it!

Luis Marujo shared a number of datasets in his LREC 2012 paper that we
could try to use for evaluation, if their definition of relevancy is at
all related to yours.

I would love to help with the paper however I can.

@tgalery
Member Author

tgalery commented May 19, 2014

Thanks @pablomendes! Glad to have such nice feedback. I'm gonna take a look at the Marujo paper when I have a chance and see if I can re-use his dataset. I will keep you posted on how things develop here (I guess I can find your email on Google, yeah?). Meanwhile, if you or @jodaiber want to test the branch more and give feedback on the scores you get back, feel free to do so.

@tgalery
Member Author

tgalery commented Jun 6, 2014

Hi @pablomendes, I was taking a look at the Marujo dataset and apparently it can be found here: https://github.com/snkim/AutomaticKeyphraseExtraction . However, when I opened the data, it didn't seem to contain the relevance scores.

I just took a look at the Roder et al. paper and it seems that they have some publicly available datasets for keyword extraction and disambiguation. I was wondering whether using those in a Mechanical Turk interface to collect the relevance data would be a good idea. Or else maybe contacting Marujo himself.

dav009 and others added 2 commits July 10, 2014 16:34
adding relevance scores to rest flow

outputing the relevance scores

adding field for relevance-score

calculate relevance only if specified in properties

fix for no spot spotted topic cases

only one topic spotted fix

more comments

adding context text interect filter
4 participants