This repository has been archived by the owner on Feb 22, 2021. It is now read-only.

Wikipedia articles coverage #4

Open
nickvosk opened this issue Apr 30, 2015 · 10 comments
@nickvosk

Hi @dav009, very promising work here!

I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.

Out of 4342357 articles, only 226319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored and because of errors in preprocessing.
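
A minimal sketch of such a coverage check, assuming the model loads with gensim and that entities are stored as tokens of the form DBPEDIA_ID/<Title> (both the path and the token format are assumptions here, not necessarily what the prebuilt model uses):

```python
# Rough coverage check (illustrative sketch, not the exact script):
# count how many Wikipedia article titles have a vector in the prebuilt model.
from gensim.models import Word2Vec

model = Word2Vec.load("en_1000_no_stem/en.model")   # model path is hypothetical
vocab = model.wv.key_to_index                        # gensim >= 4.0

total = covered = 0
with open("wikipedia_titles.txt", encoding="utf-8") as f:  # one title per line
    for line in f:
        title = line.strip().replace(" ", "_")
        total += 1
        if "DBPEDIA_ID/" + title in vocab:
            covered += 1

print("covered %d of %d titles (%.1f%%)" % (covered, total, 100.0 * covered / total))
```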

Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.

Thanks.

@dav009
Contributor

dav009 commented Apr 30, 2015

  • Yes, I think one reason for losing some Wikipedia identifiers is definitely the min threshold that had to be used. It seems that deeplearning4j fixed many of the problems they initially had, so I have to try running it with deeplearning4j and a lower min threshold. Another way could be "manipulating" the corpus to ensure Wikipedia identifiers are above that threshold.
  • There is another reason: be aware that, as it currently stands, it only adds Wikipedia identifiers which have explicit links within the Wikipedia corpus. I.e., suppose you have an article with an id wikipedia_id1 but it is never linked anywhere; then it won't appear in the model that we generate.
    My rough guess is that many of the identifiers that are not in the model are missing because of this.
    I have a ToDo to address this. I want to substitute some of the mentions within wikipedia_id1's article, sort of creating fake links. (Again, those substitutions have to push the identifier above the min threshold from the first point.)
  • Then there is a second part: resolving redirects.

Thanks, this is definitely an important issue to address.

@nickvosk
Author

Yes, the lack of explicit links is definitely one of the problems. I think that doing entity linking inside each article might lead to better coverage (by restricting the candidate entities to the ones that are already linked inside the article). This could introduce some false positives in some cases, though.

Also, the first phrase in a wiki article does not have an explicit link, but it could be linked to the id of the article without much risk. :)

@dav009
Contributor

dav009 commented Apr 30, 2015

  • Creating a fake link at the beginning of the article could be an alternative, but I kinda don't like it, because then the generated vector will be polluted with the dates, locations, etc. that are usually packed into the first line of a Wikipedia article (e.g. the pronunciation).

I was thinking of something like this:

  • Suppose you have the article Barack_Obama; then later on, in some of the paragraphs, either "obama" or "barack" will be mentioned. Just assume that mention refers to Barack_Obama and create a few fake links (see the sketch below). Whether correct or incorrect, the context will probably still make sense as it is within the entity's own article, and it is probably much better context than creating the link at the beginning.
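
A rough sketch of that substitution, with the caveat that the surface-form list and the DBPEDIA_ID/ token format are assumptions, not the actual wiki2vec pipeline:

```python
import re

# Sketch of the "fake link" idea: inside the article for Barack_Obama, turn a few
# plain-text mentions of the entity's own surface forms into entity tokens, so the
# identifier appears in the training corpus even without explicit wiki links.
def add_fake_links(article_text, entity_id, surface_forms, max_links=3):
    token = "DBPEDIA_ID/" + entity_id
    added = 0
    for form in surface_forms:
        if added >= max_links:
            break
        pattern = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
        article_text, n = pattern.subn(token, article_text, count=max_links - added)
        added += n
    return article_text

text = "Obama was born in Honolulu. Later, Obama studied law at Harvard."
print(add_fake_links(text, "Barack_Obama", ["Barack Obama", "Obama", "Barack"]))
```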

@nickvosk
Author

I think that indeed the first line needs some preprocessing to solve these issues, but I don't think that the vectors are gonna be polluted by adding the first line, as it usually contains quite useful context :)

Yes, we are describing almost the same thing for intra-article entity linking :) I am proposing that you can even expand that logic to every link in the article (by considering their corresponding mentions). It would be interesting to create a small collection to evaluate this.

@dav009
Contributor

dav009 commented Apr 30, 2015

Got it.
Well, yeah, considering how little coverage of IDs this currently has, it is worth going for it.

@dav009
Contributor

dav009 commented May 18, 2015

@nickvosk A good reference we could use here might be [1]. As the surface forms referring to the article's entity are usually in bold, just as the PR there suggests, it seems to be a more informed assumption.

[1] dbpedia-spotlight/dbpedia-spotlight#356

Edit: Updated wrong link
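
For illustration, a hedged sketch of pulling those bold surface forms out of the raw wikitext; the '''…''' convention is standard MediaWiki markup, but the function and the lead-length cutoff are just assumptions:

```python
import re

# Sketch: in MediaWiki markup, the article's own surface forms are usually set
# in bold ('''...''') in the first sentence. Extract them so they can anchor
# the fake links for the article's own entity.
BOLD = re.compile(r"'''(.+?)'''")

def bold_surface_forms(wikitext, lead_chars=500):
    lead = wikitext[:lead_chars]   # only the lead, where the bolded names appear
    return [m.strip() for m in BOLD.findall(lead) if m.strip()]

lead = "'''Barack Hussein Obama II''' (born August 4, 1961) is an American politician..."
print(bold_surface_forms(lead))   # ['Barack Hussein Obama II']
```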

@nickvosk
Author

Can you elaborate on how this would fix the coverage problem, @dav009?

Also, this paper looks relevant:
Noraset, Thanapon, Chandra Bhagavatula, and Doug Downey. Adding High-Precision Links to Wikipedia.

@dav009
Contributor

dav009 commented May 21, 2015

@nickvosk :) Good reference, I think I saw it before at ACL.
Well, it would help to find the right anchors to create "fake links" for a topic within its own article.

As the paper suggests, we could also run some NEL with very high confidence values to add some extra links, and probably get above the min-count threshold imposed by gensim's word2vec implementation.
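
For context, gensim exposes that threshold as the min_count parameter of Word2Vec; a hedged snippet, where the corpus file and hyperparameter values are assumptions rather than the project's actual training setup:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# min_count drops every token (including DBPEDIA_ID/... entity tokens) that
# occurs fewer than min_count times, which is why rarely-linked entities fall
# out of the model. Lowering it, or adding fake/NEL links, keeps more of them.
corpus = LineSentence("enwiki_processed_corpus.txt")   # hypothetical corpus file
model = Word2Vec(corpus, vector_size=500, window=10, min_count=5, workers=8)
```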

@nickvosk
Author

@dav009 exactly :)

@dav009
Contributor

dav009 commented May 29, 2015

Looking at some old raw counts from the DBpedia Spotlight project, it seems that out of 6M topics in those counts, 4M have fewer than 5 links.

Surprisingly, filtering for topics with more than 50 links gives us 268836, which is similar to our current coverage: 226319.
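
A hedged sketch of that filter, assuming the raw counts are a two-column uri<TAB>count file; the actual Spotlight raw-count format may differ, so treat this as illustrative only:

```python
# Sketch: count how many topics would survive a minimum-link-count filter.
def topics_above(path, min_links):
    kept = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                uri, count = line.rstrip("\n").split("\t")
                if int(count) > min_links:
                    kept += 1
            except ValueError:
                continue   # skip malformed lines
    return kept

print(topics_above("uriCounts.tsv", 50))   # file name is hypothetical
```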

mal added the icebox label on Sep 10, 2015
jsgriffin added and then removed the monster label on Apr 11, 2016
mal removed the fandango label on Jan 10, 2018