This repository has been archived by the owner on Feb 22, 2021. It is now read-only.

Whether "dbc:Earthquakes" or "dbr:Vanilla_Ice" available in English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram #19

Open
zhq2009 opened this issue Jun 19, 2016 · 6 comments

@zhq2009

zhq2009 commented Jun 19, 2016

Hello,

We are currently using the dataset "English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram" and are searching for test cases based on http://dbpedia.org/page/Earthquake. We have confirmed that "DBPEDIA_ID/Vanilla_Ice" is available in the dataset. But when we try "dbc:Earthquakes" or "dbr:Vanilla_Ice", we get the error messages KeyError: "dbc:Earthquakes" not in vocabulary and "dbr:Vanilla_Ice" not in vocabulary. We are wondering whether the dataset stores data as "dbc:" or "dbo:"?

Thank you

@dav009
Contributor

dav009 commented Jun 20, 2016

Hi @zhq2009,

I did not tag categories, so at the moment we don't have any dbc: or dbo: entries.
If these were to be included, should the categories be tagged in the text every time an item of the category occurs?

@dav009
Contributor

dav009 commented Jun 20, 2016

If I understand correctly, you want vectors for categories as well? Category vectors could be easily generated by appending the categories of a mention.

dbo: entries sound a bit more complicated, though.
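
A minimal sketch of what "appending the categories of a mention" could look like, assuming a pre-tokenized corpus in which entity mentions already appear as `DBPEDIA_ID/...` tokens. The `dbc:` token format, the `categories_of` mapping, the function name, and the example category are all hypothetical; none of this is part of the released vectors or the project's code.

```python
def append_categories(tokens, categories_of, entity_prefix="DBPEDIA_ID/"):
    """Return a new token list with category tokens added after each entity mention.

    `categories_of` is an assumed dict: resource name -> list of category tokens.
    """
    out = []
    for tok in tokens:
        out.append(tok)
        if tok.startswith(entity_prefix):
            resource = tok[len(entity_prefix):]
            # Mentions without a known category are left unchanged.
            out.extend(categories_of.get(resource, []))
    return out


# Made-up mapping and sentence, for illustration only.
categories_of = {"Vanilla_Ice": ["dbc:Example_Category"]}
sentence = ["DBPEDIA_ID/Vanilla_Ice", "released", "an", "album"]
augmented = append_categories(sentence, categories_of)
```

Training on the augmented sentences would place each category token in the same context window as its members, which is one way a category could end up with its own vector.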

@zhq2009
Author

zhq2009 commented Jun 21, 2016

Hello David,

Yes, we are looking for the category vectors as well.

Thank you for your help.

Sincerely,

Hanqing


@dav009
Contributor

dav009 commented Jun 21, 2016

If you provide:

  • the map from page/resource to category (input: Wikipedia Dump, output: Category-Resource mapping)
  • a way to load them

I would gladly add the changes needed to support generating such vectors.
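
One possible shape for the "way to load them" part, as a sketch only. It assumes the Category-Resource mapping has been exported as tab-separated `resource<TAB>category` rows; that file format, the function name, and the sample rows are assumptions, not something the project defines.

```python
import csv
from collections import defaultdict


def load_category_map(lines):
    """Parse 'resource<TAB>category' rows into {resource: [categories]}.

    `lines` can be an open text file or any iterable of strings.
    """
    mapping = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        resource, category = row
        mapping[resource].append(category)
    return dict(mapping)


# Made-up sample rows standing in for a real export.
sample = [
    "Vanilla_Ice\tExample_Category_A",
    "Vanilla_Ice\tExample_Category_B",
    "",
    "Earthquake\tExample_Category_C",
]
category_map = load_category_map(sample)
```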

@dav009 dav009 added the icebox label Jun 22, 2016
@zhq2009
Author

zhq2009 commented Jun 27, 2016

Sure, we can discuss later how to map and load them.

We are currently wondering why some entities (dbr) are not available in the DBpedia vectors. If you could provide some suggestions, that would be great.

Thank you.

@dav009
Contributor

dav009 commented Jun 29, 2016

Some reasons:

  • gensim (the tool used to create the vectors) can't deal with a very big vocabulary, so "tokens" with a frequency below a certain threshold were dropped
  • they are not linked in Wikipedia
  • redirects are not resolved (there are no canonicals)
  • the resource may not have existed in the wiki we processed
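
A rough sketch of how a lookup could guard against these cases instead of raising a KeyError. It relies on the `DBPEDIA_ID/` key form reported earlier in this thread; the helper names and the `redirects` dict (with its made-up entry) are hypothetical, and `vocab` stands for any container that supports `in`, such as a set of keys taken from the loaded model.

```python
def strip_dbr(name):
    """Drop a leading 'dbr:' prefix if present."""
    return name[len("dbr:"):] if name.startswith("dbr:") else name


def lookup_key(vocab, name, redirects=None, entity_prefix="DBPEDIA_ID/"):
    """Return the vocabulary key for `name`, or None if it is absent.

    Tries the name itself, then one redirect hop if a redirects dict is given.
    """
    resource = strip_dbr(name)
    candidates = [resource]
    if redirects and resource in redirects:
        candidates.append(redirects[resource])
    for candidate in candidates:
        key = entity_prefix + candidate
        if key in vocab:
            return key
    return None  # dropped by the frequency threshold, unlinked, or not in the dump


# Made-up vocabulary and redirect table, for illustration only.
vocab = {"DBPEDIA_ID/Vanilla_Ice", "music"}
redirects = {"Vanilla_Ice_(rapper)": "Vanilla_Ice"}

direct = lookup_key(vocab, "dbr:Vanilla_Ice")
via_redirect = lookup_key(vocab, "dbr:Vanilla_Ice_(rapper)", redirects)
missing = lookup_key(vocab, "dbc:Earthquakes")
```

A `None` result does not say which of the reasons applies; it only avoids the exception so the caller can decide how to handle the missing entity.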

@mal mal removed the fandango label Jan 10, 2018