Entity matching using wikidata #15

Open
sumit-agrwl opened this issue Apr 30, 2022 · 11 comments

Comments

@sumit-agrwl

I don't want to use Wikipedia for any processing. I just want to use Wikidata for entity matching in different languages. Can you guide me through the steps? I am assuming I only need to work with Wikidata.

@ringgaard
Owner

I have added a new --wikidata_only flag that you can use for the fuse_items and extract_aliases tasks. This excludes inputs from Wikipedia. Please note that this means you will not get entity popularity counts in the alias table.

@sumit-agrwl
Author

So which steps do I need to run? I have run everything up to import_wikidata.

Also, my ultimate aim is this: given a piece of text like “Who is the president of United States?”, extract the Wikidata entity mentioned in it, like “president of United States”. If you can just tell me what needs to be done, it would be helpful. I can see the parse for other stores, but I am not able to find any documentation for Wikidata as such.

@ringgaard
Owner

You need to run the following tasks in addition to import_wikidata:

sling compute_fanin fuse_items build_kb extract_aliases build_phrasetab --wikidata_only

This will produce a knowledge base (kb.sling) and a phrase table (phrase-table.repo). You can use the phrase table to look up matching phrases, see https://github.com/ringgaard/sling/blob/master/doc/guide/pyapi.md#phrase-tables.

Since both the knowledge base and the phrase table are in memory, lookups are pretty fast. You should be able to look up all subphrases up to a certain length (e.g. 10).
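A minimal sketch of the subphrase lookup idea, under some assumptions: the tokenization is a plain whitespace split, the length limit is counted in tokens, and `toy_lookup` with its small dictionary is a hypothetical stand-in for the real phrase table. In actual use, the lookup callable would wrap the phrase table loaded as described in the guide linked above; that wiring is not shown or tested here.

```python
from typing import Callable, Iterator, List, Tuple


def subphrases(tokens: List[str], max_len: int = 10) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, phrase) for every contiguous span of at most max_len tokens."""
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            yield start, end, " ".join(tokens[start:end])


def match_spans(text: str,
                lookup: Callable[[str], List[str]],
                max_len: int = 10) -> List[Tuple[str, List[str]]]:
    """Return (phrase, hits) for every subphrase that the lookup recognizes."""
    # Naive tokenization (an assumption): drop the question mark, split on spaces.
    tokens = text.replace("?", " ").split()
    matches = []
    for _, _, phrase in subphrases(tokens, max_len):
        hits = lookup(phrase)
        if hits:
            matches.append((phrase, hits))
    return matches


# Hypothetical stand-in for the phrase table: lowercased phrase -> item ids.
TOY_TABLE = {
    "president of united states": ["Q11696"],
    "united states": ["Q30"],
}


def toy_lookup(phrase: str) -> List[str]:
    """Case-insensitive lookup in the toy table; returns an empty list on a miss."""
    return TOY_TABLE.get(phrase.lower(), [])


matches = match_spans("Who is the president of United States?", toy_lookup)
for phrase, hits in matches:
    print(phrase, hits)
```

Overlapping matches (here the longer span contains "United States") are all returned; picking the longest or most likely one is left to the caller.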

@sumit-agrwl
Author

Since you have made changes for the flag in the source code, I am assuming I need to build from source.

I cloned the repo and ran setup.sh, but it's giving me the error:

ln: failed to create symbolic link '/usr/lib/python3.7/site-packages/sling': No such file or directory

After that I tried the command below. I am assuming I need to do this:

If you haven't run the setup.sh script already, you then need to link the sling Python module directly to the Python source directory to use it in "developer mode":

sudo ln -s $(realpath python) /usr/lib/python3/dist-packages/sling

For this I ran:

sudo ln -s /usr/bin/python3 /usr/lib/python3/dist-packages/sling

But the sling command is still not working.

@ringgaard
Owner

If your sling directory is /home/bob/sling, I think the ln command should be something like:

sudo ln -s /home/bob/sling/python /usr/lib/python3/dist-packages/sling

You can also just wait until tomorrow, when the changes will have been included in the nightly build.

@sumit-agrwl
Author

Thank you for your prompt responses. I was able to run the command!

@sumit-agrwl
Author

[2022-04-30 16:37:09.474628: F sling/task/task.cc:215] Input config is missing for task fused-items/item-reconciler

@sumit-agrwl sumit-agrwl reopened this Apr 30, 2022
@ringgaard
Owner

It seems like the config is not optional for the item reconciler. Could I get you to try adding the auxin parameter in kb.py:

      return self.wf.mapreduce(input=items,
                               output=output,
                               mapper="item-reconciler",
                               reducer="item-merger",
                               format="message/frame",
                               params={"indexed": True},
                               auxin={"config": self.recon_config()})

@sumit-agrwl
Author

Thanks for your prompt reply. It's running.

@sumit-agrwl
Author

I am not sure I understood this. My question remains: given a query like "Who is the president of United States?", can it extract "president of United States" as an entity that matches "President of the United States" (https://ringgaard.com/kb/Q11696)? I am hoping there is some kind of SLING parser that can do that, but I cannot find any documentation or process for it. It would also be helpful if I could do the entity linking in different languages. I think there is support for that in this project, but I am not able to figure it out. One more question:

I need to change "Who is the president of United States?" to:

Who is the {entity in different language}? For example: Who is the [Presidente de los Estados Unidos]?

(This would use the aliases in different languages.) Currently, after running the steps that you suggested, I could only get the English name and no aliases.

@ringgaard
Owner

If you want to match entity names in other languages, you can use the --language flag when generating the phrase tables (extract_aliases and build_phrasetab). I should note that Wikidata is not "language-dependent" in the same sense as Wikipedia.

While there is a semantic parser and some entity resolution components in SLING, this is not really going to solve your problem. What you are asking for is really a question-answering system. This is a difficult research problem which many researchers and companies are actively working on. There are no simple solutions, but if you search for this, you will find references to many articles describing different approaches to this problem, each with its own strengths and weaknesses.
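For the narrower rewriting step asked about earlier in the thread (replacing a matched span with a name in another language), a rough sketch could look like the following. It assumes the span and the item id have already been found by some matcher, and `LABELS` is a hypothetical hand-written table using the example strings from this thread, not output from SLING or Wikidata. It does not attempt the question-answering part.

```python
from typing import Dict, Tuple

# Hypothetical label table: item id -> {language code -> name}.
# In practice these names would come from the knowledge base.
LABELS: Dict[str, Dict[str, str]] = {
    "Q11696": {
        "en": "President of the United States",
        "es": "Presidente de los Estados Unidos",
    },
}


def rewrite_span(text: str,
                 span: Tuple[int, int],
                 item_id: str,
                 language: str,
                 labels: Dict[str, Dict[str, str]] = LABELS) -> str:
    """Replace text[start:end] with the bracketed name of item_id in the given language.

    Returns the text unchanged when no name is known for that language.
    """
    start, end = span
    label = labels.get(item_id, {}).get(language)
    if label is None:
        return text
    return text[:start] + "[" + label + "]" + text[end:]


question = "Who is the president of United States?"
phrase = "president of United States"
start = question.find(phrase)  # character offset of the matched span
rewritten = rewrite_span(question, (start, start + len(phrase)), "Q11696", "es")
print(rewritten)
```

Falling back to the unchanged text when a language has no name keeps the function safe to call for items that lack aliases in that language.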
