Discount SF Idea #298
Conversation
I think we should have a more detailed conversation about this, because the way the surface form probabilities are discounted drastically hurts recall.
and many many more. To be honest, I don't understand how an SF can be represented in the SF store and yet never be used to retrieve any candidate, given that its annotation probability is too low. To counter this problem, we have been looking for smart ways to generate a list of interesting SFs (cross-referencing with WordNet, and noun extraction from specific corpora) and use it as an input to our Spotlight model editor to increase these SFs' annotation probabilities so they become spottable.

However, we then face a problem due to the nature of the stores. For example, take an SF like 'bus driver'. We would have to generate at least 4 case variations ('Bus Driver', 'Bus driver', 'bus Driver') to make the right candidates available. This seems a bit of overkill (and note that we'd still be missing the all-capitalised versions, and if we add plurals, that would be 8 SFs). Additionally, something interesting happens with articles. Given how people generate the links, an SF like 'The giants' would be associated with candidate X, but an SF like 'giants' would not. This frequently means that the right candidate is missing because of the syntactic context in which the SF occurred. It also might affect the way the annotated counts and total counts for a particular SF are calculated.

To be honest, I don't know what the best way to tackle this issue is, but I'd propose a number of semi-normalisation steps when generating the SF store:
And maybe do not reduce the annotation probability to a level that makes an SF non-spottable. I know that these would generate a lot of junk, but it would be the role of other factors, related to features of the candidate or the context, to rule those out. Maybe we could even have some way to estimate theta values for those features given some annotated data or something. Maybe these steps could be incorporated into the extraction framework, or else into a script that tweaks the raw counts, but I think this is a serious point, and if we solve it right, Spotlight would be miles ahead of the competition.
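By way of illustration only, here is a rough sketch of what such semi-normalisation could look like when generating the SF store. The object and method names are invented for this example, and the exact set of steps is precisely what is up for discussion above.

```scala
object SurfaceFormNormalizer {

  // Hypothetical set of leading articles to strip; not taken from the Spotlight code base.
  private val Articles = Set("the", "a", "an")

  /** Generate case and article variants for one raw surface form, so that e.g.
    * 'The giants' also makes 'giants', 'the giants' and 'The Giants' spottable. */
  def variants(sf: String): Set[String] = {
    val tokens = sf.split("\\s+").toSeq
    val withoutArticle =
      if (tokens.length > 1 && Articles.contains(tokens.head.toLowerCase)) Seq(tokens.tail)
      else Seq.empty
    (Seq(tokens) ++ withoutArticle).flatMap { ts =>
      val joined = ts.mkString(" ")
      Set(
        joined,                             // as seen in the anchor text
        joined.toLowerCase,                 // 'bus driver'
        ts.map(_.capitalize).mkString(" "), // 'Bus Driver'
        joined.capitalize                   // 'Bus driver'
      )
    }.toSet
  }
}
```

Each variant would then get a discounted, but still spottable, annotation probability, which is the part that needs the most thought.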
Thanks for your work on this. I entirely agree that this is a big issue to be solved. In addition to what you mentioned, we also need a reliable way of estimating how likely any of those transformations is, so that we do not overgenerate. You might want to have a look at the Han et al. paper that the statistical backend implements. To tackle this issue, they run a word-alignment tool on surface forms to automatically learn these kinds of transformations. It would be great to have this in Spotlight as well.

Jo
On your bus driver example: If you are really, really strict about only using the SF store (which doesn't make sense in my head), then this might be a problem. But this is easier in the actual implementation. For example, the way the FSA spotter works is that it goes through all known surface forms and converts them to lower-cased stems. Your SF entry would be "Bus driver"; in stem form this would be something like "bus driv". Now, when you spot a text and see "Bus DRIVER", it is first stemmed and lowercased, then the FSA is asked for entries of the form "bus driv". It will return the IDs of all SFs that are mapped to those two stems.

Now all you have to do -- and this is way sub-optimal at the moment -- is take what you see in the text (Bus DRIVER) and compare it to all SF candidates that you got from the spotter. You basically need a function score(Bus DRIVER, Bus driver), i.e. a score for the text SF and the SF in the DB. If the score is high enough, you accept it as a candidate, and if there are multiple matching SFs you select the one with the highest score. The word-alignment-based method from the paper would be another way of calculating this score. And, as you can imagine, while this approach can deal with lowercasing and morphology (plurals), it will not be able to deal with determiners or misspellings unless this is added to the FSA spotter. So my suggestion for dealing with this issue is:
Edit: Sorry, I haven't touched this stuff in too long. What I said is not entirely correct. It should be the case that the spotter gives you a set of candidates, but this is not true currently. It actually just tells you "yes, this is a candidate" at the moment. So to make this actually work like I mentioned, either the spotter would have to remember the SFs or the stemmed version should be added to
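A purely illustrative sketch of the matching step described two comments up (stem-keyed FSA lookup, then a score between the text SF and the stored SF). The toy stemmer and the score values here are placeholders, not the actual Spotlight implementation.

```scala
object SpotMatcher {

  // Placeholder stemmer; the real spotter would use a proper stemmer (e.g. Snowball).
  private def stem(token: String): String =
    if (token.length > 3) token.stripSuffix("s").stripSuffix("ing").stripSuffix("er")
    else token

  /** Lower-cased, stemmed key, e.g. "Bus DRIVER" -> "bus driv". */
  def stemKey(s: String): String =
    s.toLowerCase.split("\\s+").map(stem).mkString(" ")

  /** Crude similarity between the SF seen in the text and an SF in the store:
    * exact match > case-only difference > stem-only match > no match. */
  def score(textSf: String, storedSf: String): Double =
    if (textSf == storedSf) 1.0
    else if (textSf.equalsIgnoreCase(storedSf)) 0.9
    else if (stemKey(textSf) == stemKey(storedSf)) 0.7
    else 0.0

  /** Pick the best stored SF among the candidates the FSA returned for the stem key. */
  def bestMatch(textSf: String, candidates: Seq[String], threshold: Double = 0.5): Option[String] =
    candidates
      .map(c => c -> score(textSf, c))
      .filter { case (_, s) => s >= threshold }
      .sortBy { case (_, s) => -s }
      .headOption
      .map { case (c, _) => c }
}
```

For example, `bestMatch("Bus DRIVER", Seq("Bus driver", "bus drivers"))` would return `Some("Bus driver")`. The word-alignment-based method from the paper would simply be a better `score` function in this picture.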
Cool, that's what we thought too. I was taking a look at my PR on the trimming of whitespaces here, and it got me thinking that the distinction between the SurfaceFormStore and its lower-case counterpart would not fit the proposal you just outlined. Should we instead have a
So we started playing with the ideas discussed here. We actually had a play with a machine translation model as a replacement for a similarity function between the spot and the surface forms in the store, but we haven't plugged it into our working branch yet. The branch we've been playing with at the moment is this: https://github.com/idio/dbpedia-spotlight/tree/feature/TG-betterSurfaceFormMatching-and-relevance . Basically we re-appropriated the lowercasemap structure to hold a map between stems and the associated surface forms. We regenerated the surface form store and the FSA dictionary. It compiles and works, but I think there are still problems. I created a gist for it here: https://gist.github.com/tgalery/ddaa39df6ac272732062 . I think the crux of the problem is this: we are conflating two roles of stem-based spot extraction:
Although I believe a stem-based mechanism can play a meaningful role in (1), it seems too coarse-grained to handle (2), especially if no linguistic knowledge is involved in generating the spots. A good example is a verbal compound like "whisking up", which might be reduced to "whisk" and taken to refer to whiskey. I talked about some of these points in the gist, but I'm eager to see what you guys think of this.
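To make the two roles concrete, here is a rough, hypothetical sketch (names invented) of the stem-to-surface-forms map: stem lookup is fine for generating candidates, but accepting a spot needs a stricter check on top, otherwise an over-stemmed spot like "whisking up" can end up matched to whiskey-related SFs.

```scala
import scala.collection.mutable

// Hypothetical sketch; `stemKey` would be the same stemming used to build the FSA dictionary.
class StemToSurfaceForms(stemKey: String => String) {

  private val index = mutable.Map.empty[String, mutable.Set[String]]

  def add(surfaceForm: String): Unit =
    index.getOrElseUpdate(stemKey(surfaceForm), mutable.Set.empty) += surfaceForm

  /** Role (1), candidate generation: all stored SFs sharing the spot's stem key.
    * This is where over-stemming can pull in unrelated SFs. */
  def candidates(spot: String): Set[String] =
    index.getOrElse(stemKey(spot), mutable.Set.empty[String]).toSet

  /** Role (2), spot validation: only accept a candidate if a stricter similarity agrees,
    * so the stem match alone never decides the mapping. */
  def accept(spot: String, similarity: (String, String) => Double, threshold: Double): Seq[String] =
    candidates(spot).toSeq.filter(sf => similarity(spot, sf) >= threshold)
}
```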
Hey @tgalery, thanks for looking into this. For your point 6 in the doc, I think words that are common terms are not that big of a problem if we consider P(an entity | "in"), for example, because the really common words will be filtered out by the low probability/high entropy of "in" ("in" will stand for a lot of things, but for none of them the majority of the time). I was thinking, since we already have a linear model for the spotter, which has really basic features at the moment (P(annotated|sf), is number?, is acronym?, bias), maybe any of the transformations for the mapping could just be incorporated into this. We could define all common transformations (e.g. dropping a determiner, different case, etc.) and have a log-linear model for which we estimate the weights from the corpus. E.g. if it is a direct mapping ["United", "States"] -> "United States" (SF), the weight would be:
So if there was a direct mapping

To generate more candidates, one other thing to try could be to use a token-based Levenshtein FSA automaton (e.g. via https://github.com/danieldk/dictomaton) that can retrieve all SFs from the lexicon that differ by 1 or 2 edits (i.e. a dropped or added token) in linear time.

Jo
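To sketch the idea (the features, names and sigmoid scoring below are invented for illustration and are not the current spotter model): each transformation between the spot tokens and a stored SF becomes a binary feature, and its weight would be estimated from the corpus.

```scala
object TransformationModel {

  // Hypothetical transformation features between the spot seen in the text and a stored SF.
  case class Features(direct: Double, caseOnly: Double, determinerDropped: Double, bias: Double = 1.0) {
    def asVector: Seq[Double] = Seq(direct, caseOnly, determinerDropped, bias)
  }

  def extract(spotTokens: Seq[String], storedSf: String): Features = {
    val spot = spotTokens.mkString(" ")
    val sfNoDeterminer = storedSf.replaceAll("(?i)^(the|a|an)\\s+", "")
    Features(
      direct            = if (spot == storedSf) 1.0 else 0.0,
      caseOnly          = if (spot != storedSf && spot.equalsIgnoreCase(storedSf)) 1.0 else 0.0,
      determinerDropped = if (sfNoDeterminer != storedSf && spot.equalsIgnoreCase(sfNoDeterminer)) 1.0 else 0.0
    )
  }

  /** Log-linear score: sigma(w . f), with the weights w estimated from annotated data. */
  def score(f: Features, weights: Seq[Double]): Double = {
    val z = f.asVector.zip(weights).map { case (fi, wi) => fi * wi }.sum
    1.0 / (1.0 + math.exp(-z))
  }
}
```

So `extract(Seq("United", "States"), "United States")` fires only the `direct` feature, while `extract(Seq("giants"), "The giants")` fires only `determinerDropped`, and each transformation pays its learned cost.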
I have created a branch that merges the idea presented in this PR with the dissociation between a spotter score and a disambiguator score made in PR #312. My idea was to keep the spotting loose and hope that an increased requirement on the disambiguator score would get rid of the junk. Here's the text example:
There is some good news and some bad news. The bad news is that the disambiguator score seems to do little work in comparison to the spotter score. In this example it starts pruning entities only if we raise it to the 0.9 level, whereas raising the spotter score from 0.2 to 0.35 knocks out quite a lot of entities. This can be seen in the logs:
The good news is that the mechanism that does the filtering seems to work all right. But the output score of the disambiguator seems to be off scale somehow. I wonder whether one of two things might be happening: (i) maybe there's some over- or under-normalisation going on, or (ii) the representation of the context is not ideal for the purposes of scoring. Any thoughts?
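For reference, the two-stage filtering being described has roughly this shape (the type and parameter names are placeholders, not the actual classes in the branch); the observation above is that nearly all of the pruning happens at the first gate.

```scala
object TwoStageFiltering {

  // Placeholder types; the real branch works on Spotlight's own spot/annotation classes.
  case class Spot(surfaceForm: String, spotterScore: Double)
  case class Annotation(spot: Spot, entityUri: String, disambiguatorScore: Double)

  def filterAnnotations(annotations: Seq[Annotation],
                        minSpotterScore: Double,       // raising this from 0.2 to 0.35 already prunes a lot
                        minDisambiguatorScore: Double  // only starts pruning near 0.9 in the example above
                       ): Seq[Annotation] =
    annotations
      .filter(_.spot.spotterScore >= minSpotterScore)         // loose spotting gate
      .filter(_.disambiguatorScore >= minDisambiguatorScore)  // junk-removal gate
}
```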
This is meant to be a review/feedback PR :)
Verbose explanation and details:
https://gist.github.com/dav009/67cfc2787d07761a55d5