
Discount SF Idea #298

Open · wants to merge 1 commit into master

Conversation

@dav009 commented May 2, 2014

This is meant to be a review/feedback PR :)

Verbose explanation and details:
https://gist.github.com/dav009/67cfc2787d07761a55d5

@tgalery commented Jun 5, 2014

I think we should have a more detailed conversation about this, because the way the surface form probabilities are discounted drastically hurts recall.
For example, very frequent unigram SFs are made almost impossible to spot, by virtue of the way data is annotated on Wikipedia. These include:

[
billing, invoice, menu, statistics, headset, debit, credit, bank, cost, job title, buffet,
company, table, family residence, food, Parcelforce, transaction, fee, parcel,
payment, money, account, customer, technology, Direct debit, font, telephony,
supervisor, laboratory
]

and many many more.

To be honest, I don't understand why an SF should be represented in the SF store and yet never used to retrieve any candidate, just because its annotation probability is too low.

To counter this problem, we have been looking for smart ways to generate a list of interesting SFs (cross-referencing with WordNet, and noun extraction from specific corpora) and use it as input to our Spotlight model editor to increase these SFs' annotation probabilities so they become spottable.

However, we then face a problem due to the nature of the stores. For example, if you take an SF like 'bus driver', we would have to generate at least 4 case variations ('Bus Driver', 'Bus driver', 'bus Driver') to make the right candidates available. This seems a bit of an overkill (note that we would still be missing the all-capitalised version, and if we add plurals, that would be 8 SFs).

Additionally, something interesting happens with articles. Given how people generate the links, an SF like 'The giants' would be associated with candidate X, but an SF like 'giants' would not. This frequently means that the right candidate is missing because of the syntactic context in which the SF occurred. This might also affect the way the annotated counts and total counts for a particular SF are calculated.

To be honest, I don't know what the best way to tackle this issue is, but I'd propose a number of semi-normalisation steps when generating the SF store:

  • remove head determiners and prepositions from SFs ('the Giants' -> 'Giants')
  • remove plural endings from SFs ('Giants' -> 'Giant')
  • normalise to lower case ('Giant' -> 'giant') [why not?]
  • no SF that fully coincides with determiners, pronouns, and maybe other stop words should be included in the store [maybe this is debatable]

And maybe do not reduce the annotation probability to a level that makes an SF non-spottable. I know that these steps would generate a lot of junk, but it would be the role of other factors, related to features of the candidate or the context, to rule that junk out. Maybe we could even have some way to estimate theta values for those features given some annotated data or something.
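For concreteness, a minimal sketch of what such a semi-normalisation could look like (a hypothetical helper, not code from the extraction framework; the function-word list and the naive depluralisation are only placeholders):

```scala
// Hypothetical sketch of the proposed semi-normalisation of surface forms:
// strip a head determiner/preposition, crude plural removal, lowercase.
object SurfaceFormNormalizer {
  private val headFunctionWords = Set("the", "a", "an", "of", "in", "on", "at", "for")

  def normalize(sf: String): String = {
    val tokens = sf.trim.split("\\s+").toList
    // drop a leading determiner/preposition ('the Giants' -> 'Giants')
    val withoutHead = tokens match {
      case head :: rest if rest.nonEmpty && headFunctionWords.contains(head.toLowerCase) => rest
      case other => other
    }
    // crude depluralisation of the last token ('Giants' -> 'Giant')
    val depluralized = withoutHead match {
      case init :+ last if last.length > 3 && last.toLowerCase.endsWith("s") => init :+ last.dropRight(1)
      case other => other
    }
    // normalise to lower case ('Giant' -> 'giant')
    depluralized.mkString(" ").toLowerCase
  }
}

// SurfaceFormNormalizer.normalize("the Giants") == "giant"
```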

Maybe these steps could be incorporated into the extraction framework, or else into a script that tweaks the raw counts, but I think this is a serious point, and if we solve it right, Spotlight would be miles ahead of the competition.

@jodaiber commented Jun 5, 2014

Hey @tgalery, @dav009,

thanks for your work on this. I entirely agree that this is a big issue to be solved. In addition to what you mentioned, we also need a reliable way of estimating how likely any of those transformations is, so that we do not overgenerate. You might want to have a look at the Han et al. paper that the statistical backend implements. To tackle this issue, they run a word-alignment tool on surface forms to automatically learn these kinds of transformations. It would be great to have this in Spotlight as well.

Jo

@jodaiber commented Jun 5, 2014

On your bus driver example:

If you are really, really strict about only using the SF store (which doesn't make sense in my head), then this might be a problem. But this is easier in the actual implementation. For example, the way the FSA spotter works is that it goes through all known surface forms and converts them to lowercased stems. Your SF entry would be "Bus driver", which would be something like "bus driv" in stem form. Now, when you spot a text and see "Bus DRIVER", it is first stemmed and lowercased, then the FSA is asked for entries of the form "bus driv". It will return the IDs of all SFs that are mapped to those two stems. Now all you have to do -- and this is very sub-optimal at the moment -- is take what you see in the text (Bus DRIVER) and compare it to all SF candidates that you got from the spotter.

You basically need a function score("Bus DRIVER", "Bus driver"), i.e. a score comparing the SF in the text and the SF in the DB. If the score is high enough, you accept it as a candidate, and if there are multiple matching SFs you select the one with the highest score.
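As a toy illustration (not the current implementation), such a score function could simply rank exact matches above case-only differences, and those above matches that only agree at the stem level:

```scala
// Toy sketch of score(textSf, storedSf): exact match > case-only difference > stem-level match.
def score(textSf: String, storedSf: String): Double = {
  // crude stand-in for a real stemmer
  def stems(s: String): Seq[String] = s.toLowerCase.split("\\s+").map(_.stripSuffix("s")).toSeq

  if (textSf == storedSf) 1.0
  else if (textSf.equalsIgnoreCase(storedSf)) 0.85
  else if (stems(textSf) == stems(storedSf)) 0.6 // e.g. "Bus DRIVERS" vs. "Bus driver"
  else 0.0
}

// score("Bus DRIVER", "Bus driver") == 0.85; accept it if the score clears a threshold,
// and among multiple matching stored SFs keep the one with the highest score.
```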

The word-alignment-based method from the paper would be another way of calculating this score. And, as you can imagine, while this approach can deal with lowercase and morphology (plural), it will not be able to deal with determiners or misspellings unless this is added to the FSA spotter.

So my suggestion for dealing with this issue is:

  • remove determiners/prepositions when creating the candidate mapping for the spotter
  • improve the way that "acceptability" of a SF is scored (at the moment, this happens here)

Edit: Sorry, I haven't touched this stuff in too long. What I said is not entirely correct. It should be the case that the spotter gives you a set of candidates, but this is not true currently. It actually just tells you "yes, this is a candidate" at the moment. So to make this actually work like I mentioned, either the spotter would have to remember the SFs, or the stemmed version should be added to getSurfaceFormsNormalized in MemorySurfaceFormStore. But the idea is still what we should do, I think:

  1. have a coarse mapping from lowercase stems without articles to real surface forms
  2. score all candidates from this mapping with a function that involves adding articles, edit distance, case, etc.
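A rough sketch of that two-step lookup could look like this (names like StemIndex are illustrative, not the actual store classes; the stemming and scoring are again just placeholders):

```scala
// Sketch of the two-step lookup:
// 1) coarse map from lowercased, article-stripped stems to the raw surface forms,
// 2) score every candidate SF against what was actually seen in the text.
class StemIndex(surfaceForms: Seq[String]) {
  private def stemKey(sf: String): String =
    sf.toLowerCase
      .split("\\s+")
      .filterNot(Set("the", "a", "an").contains)  // drop articles from the key
      .map(_.stripSuffix("s"))                    // placeholder for a real stemmer
      .mkString(" ")

  private val index: Map[String, Seq[String]] = surfaceForms.groupBy(stemKey)

  // tiny stand-in score: exact match > case-only difference > anything else sharing the stem key
  private def sim(spotted: String, sf: String): Double =
    if (spotted == sf) 1.0 else if (spotted.equalsIgnoreCase(sf)) 0.85 else 0.5

  // candidate SFs with their scores, best first
  def candidates(spotted: String): Seq[(String, Double)] =
    index.getOrElse(stemKey(spotted), Seq.empty)
      .map(sf => sf -> sim(spotted, sf))
      .sortBy(-_._2)
}

// new StemIndex(Seq("The Giants", "Giants", "giant")).candidates("the giants")
// returns all three candidates, with "The Giants" ranked first (case-only difference)
```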

@tgalery commented Jun 16, 2014

Cool, that's what we thought too. I was taking a look at my PR on the trimming of whitespaces here, and it got me thinking that the distinction between the SurfaceFormStore and its lower-case counterpart would not fit the proposal you just outlined. Should we instead have a RawSFStore for storing the annotations as they occur in Wikipedia (stripping extra white spaces and what not) and a NormalSFStore for storing the normalised SFs without head articles and prepositions, plus the mappings to their RawSF counterparts and the associations with candidates? Does that make sense, or would it create some other sort of problem?

@tgalery mentioned this pull request Jul 11, 2014
@tgalery commented Jul 11, 2014

So we started playing with the ideas discussed here. We actually had a play with a machine translation model as a replacement for the similarity function between a spot and the surface forms in the store, but we haven't plugged it into our working branch yet.

The branch we've been playing with at the moment is this: https://github.com/idio/dbpedia-spotlight/tree/feature/TG-betterSurfaceFormMatching-and-relevance . Basically, we re-appropriated the lowercase map structure to hold a map between stems and the associated surface forms. We regenerated the surface form store and the FSA dictionary. It compiles and works, but I think there are still problems. I created a gist for it here: https://gist.github.com/tgalery/ddaa39df6ac272732062 .

I think the crux of the problem is this: we are conflating two roles of stem-based spot extraction:

  1. A way to extract a spot and measure its similarity to possible surface forms via some sort of similarity function. For example, "The Giants" might be associated with the SF "Giants" because they reduce to the same stem, and as such we can measure their similarity and integrate that output with other components in the entity-mention model.
  2. A way to generate topic associations. For example, it could be the case that "The giants" is associated with the New York Giants, but "Giants" is not. In this case, occurrences of the latter surface form would not be able to retrieve the sports team as a candidate. Now, with a stem-based extraction mechanism, we could get all the surface forms that reduce to a common stem and return the output of the SF similarity function and all their candidates to the next stage in the annotation pipeline.

Although I believe a stem-based mechanism plays a meaningful role in (1), it seems too coarse-grained to handle (2), especially if no linguistic knowledge is involved in generating the spots. A good example is a verbal compound like "whisking up", which might be reduced to "whisk" and taken to refer to whiskey. I talked about some of these points in the gist, but I am eager to see what you guys think of this.

@jodaiber

Hey @tgalery,

thanks for looking into this. For your point 6 in the doc, I think words that are common terms are not that big of a problem if we consider P(an entity | "in"), for example, because the really common words will be filtered out by the low probability/high entropy of "in" ("in" will stand for a lot of things, but for none of them the majority of the time).

I was thinking: since we already have a linear model for the spotter, which has really basic features at the moment (P(annotated|sf), is number?, is acronym?, bias), maybe the transformations for the mapping could just be incorporated into this. We could define all common transformations (e.g. dropping a determiner, different case, etc.) and have a log-linear model for which we estimate the weights from the corpus. E.g. for a mapping ["United", "States"] -> "United States" (SF), the score would be:

P("annotate" | "United States")^w0 x edit_distance_penalty^w1 x determiner_dropped_penalty^w2 ...

So if there was a direct mapping, edit_distance_penalty and determiner_dropped_penalty would be 1.0 and the overall score would reduce to P("annotate" | "United States")^w0. If there is a bigger distance, this score would be discounted (and we could learn by how much from the corpus). Since we would learn the weights, we could just specify all possible changes as features and the model would learn how important each of them is for each language.
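A sketch of that scoring, with made-up feature names and weights (in log space this is just a weighted sum, which is what the weight estimation would operate on):

```scala
// Sketch of the log-linear mapping score:
// score = P(annotate | sf)^w0 * edit_distance_penalty^w1 * determiner_dropped_penalty^w2 * ...
case class MappingFeatures(pAnnotated: Double,
                           editDistancePenalty: Double,
                           determinerDroppedPenalty: Double)

def mappingScore(f: MappingFeatures, w: Array[Double]): Double =
  math.exp(
    w(0) * math.log(f.pAnnotated) +
    w(1) * math.log(f.editDistancePenalty) +
    w(2) * math.log(f.determinerDroppedPenalty)
  )

// For a direct mapping like ["United", "States"] -> "United States" both penalties are 1.0,
// so the score reduces to P("annotate" | "United States")^w0.
```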

To generate more candidates, one other thing to try could be a token-based Levenshtein FSA automaton (e.g. via https://github.com/danieldk/dictomaton) that can retrieve all SFs from the lexicon that differ by 1 or 2 edits (i.e. a dropped or added token) in linear time.
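For intuition, here is a naive token-level edit distance for a single pair of surface forms (a Levenshtein automaton as in dictomaton would answer "all SFs within 1 or 2 token edits" against the whole lexicon instead of one pair at a time):

```scala
// Naive token-level Levenshtein distance between two surface forms,
// counting dropped, added or substituted tokens.
def tokenEditDistance(a: Seq[String], b: Seq[String]): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), dp(i - 1)(j - 1) + cost)
  }
  dp(a.length)(b.length)
}

// tokenEditDistance(Seq("the", "giants"), Seq("giants")) == 1  (one dropped token)
```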

Jo

@tgalery commented Sep 3, 2014

I have created a branch that merges the idea presented in this PR with the dissociation between a spotter score and a disambiguator score made in PR #312. My idea was to keep the spotting loose and hope that an increased requirement on the disambiguator score would get rid of the junk.

Here's the example text:

 Need a tube of toothpaste, but don’t want to wait? Google wants to drone that to you, the Mountain View-based technology giant announced today.Google follows Amazon in announcing that it is building consumer delivery-facing drone technology. Amazon previously disclosed that it is working to build drones that can deliver small parcels to shoppers.The two companies have differing visions, however. Google’s plan appears slanted towards incredibly quick delivery, perhaps in as little as two minutes, a long profile in The Atlantic indicated. Amazon, instead, is focusing on a timeframe closer to thirty minutes.

There is some good news and some bad news. The bad news is that the disambiguator score seems to do little work in comparison to the spotter score. In this example it starts pruning entities only if we raise it to the 0.9 level, whereas raising the spotter score from 0.2 to 0.35 knocks out quite a lot of entities. This can be seen in the logs:

[ConfidenceFilter] - (c=0.99) filtered out by similarity score threshold (0.885<0.990): SurfaceForm[consumer] -0.885-> DBpediaResource[Consumer] - at position *199* in - Text[...  follows Amazon in announcing that it is building consumer delivery-facing drone technology. Amazon  ...]
 INFO 2014-09-03 11:10:11,766 Grizzly-2222(4) [ConfidenceFilter] - (c=0.99) filtered out by similarity score threshold (0.573<0.990): SurfaceForm[companies] -0.573-> DBpediaResource[Corporation] - at position *356* in - Text[... hat can deliver small parcels to shoppers.The two companies have differing visions, however. Google' ...]
 INFO 2014-09-03 11:10:11,767 Grizzly-2222(4) [ConfidenceFilter] - (c=0.99) filtered out by similarity score threshold (0.987<0.990): SurfaceForm[timeframe] -0.987-> DBpediaResource[Timeline] - at position *577* in - Text[... ntic indicated. Amazon, instead, is focusing on a timeframe closer to thirty minutes. ...]
 INFO 2014-09-03 11:10:11,767 Grizzly-2222(4) [ConfidenceFilter] - (c=0.99) filtered out by similarity score threshold (0.959<0.990): SurfaceForm[closer] -0.959-> DBpediaResource[Distance] - at position *587* in - Text[... ated. Amazon, instead, is focusing on a timeframe closer to thirty minutes. ...]

The good news is that the mechanism that does the filtering seems to work all right. But the output score of the disambiguator seems to be off scale somehow. I wonder whether one of two things might be happening: (i) maybe there's some over- or under-normalisation going on, or (ii) the representation of the context is not ideal for the purposes of scoring.

Any thoughts ?
