How to run the silver annotation pipeline #5

Open
foolfun opened this issue Nov 26, 2020 · 40 comments

@foolfun

foolfun commented Nov 26, 2020

I ran the DrKIT code, which references 'sling/local/data/distant/facts-0000%d-of-00010.json'. How do I get these files?

@Deriq-Qian-Dong

+1

@ringgaard
Owner

The "silver annotation pipeline" is not yet properly documented as it is still under development, but you should be able to run it. First you run the wiki pipeline as described here. Then you need to build an IDF table using this command:

sling build_idf

Then you can run silver annotation on all the Wikipedia articles:

sling silver_annotation

It takes quite a while to run the silver annotation pipeline (10 hours on my machine). Please let me know if this works for you.

@Deriq-Qian-Dong

Hi, thanks for your reply. When I run sling silver_annotation, I get this error message:
[2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0
Do you have any idea about this error?

@ringgaard
Owner

I remember having seen this error before. Let me check if there are some changes from the dev branch that I haven't submitted to the master branch.

@foolfun
Author

foolfun commented Nov 27, 2020

Thanks for your reply! When I ran 'sling fuse_items', I hit #4. I have no idea why it happened. Can you help me?

@ringgaard ringgaard changed the title How to get 'sling/local/data/distant/facts-0000%d-of-00010.json' How run the silver annotation pipeline Nov 27, 2020
@ringgaard
Owner

It seems like I will have to do a complete test run of the wiki and silver annotation pipelines. I run these in a slightly different mode using wiki snapshots to get a wikidata dump and the reconciler for fusing items. It seems like there is some bug in the old pipeline.

You should check if you have enough disk space. You will need something like 500 GB of free space on your hard drive, including your temp directory (usually /tmp). There have been reports that out-of-disk-space conditions are not always reported correctly. You should also check that you don't have a bunch of temp files left over from runs that crashed. You can remove old temp files using this command:

rm -r /tmp/local.*
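
The two checks above can also be scripted. The following is only a minimal sketch and not part of SLING: it reports the free space on the file system that holds the temp directory and lists leftover local.* entries. The 500 GB figure and the local.* pattern are taken from this comment; the helper names free_gb and stale_temp_entries are made up for illustration.

```python
import glob
import os
import shutil
import tempfile

GB = 1024 ** 3

def free_gb(path):
    """Return free space, in GB, on the file system containing path."""
    return shutil.disk_usage(path).free / GB

def stale_temp_entries(tmpdir, pattern="local.*"):
    """List leftover entries in tmpdir matching pattern (assumed naming)."""
    return sorted(glob.glob(os.path.join(tmpdir, pattern)))

if __name__ == "__main__":
    # tempfile.gettempdir() honors TMPDIR before falling back to /tmp.
    tmpdir = tempfile.gettempdir()
    print("temp dir: %s, free: %.1f GB" % (tmpdir, free_gb(tmpdir)))
    if free_gb(tmpdir) < 500:
        print("warning: less than the roughly 500 GB suggested above")
    for entry in stale_temp_entries(tmpdir):
        print("possibly left over from an earlier run:", entry)
```

The script only lists candidates; deleting them is left to the rm command above so that nothing is removed by accident.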

It is going to take a while to rerun the pipelines, so please be patient. I will try to do this over the weekend. I have a server upgrade on Sunday, which will also delay this.

@ringgaard ringgaard changed the title How run the silver annotation pipeline How to run the silver annotation pipeline Nov 27, 2020
@foolfun
Author

foolfun commented Nov 28, 2020

Thank you so much. I will take your advice to try it again.

@foolfun
Author

foolfun commented Nov 28, 2020

I have enough disk space, removed the old temp files, and ran the commands below. However, I seem to have hit the same problem again. Looking forward to your reply.

export TMPDIR=/mnt/hdd1/tmp

sling build_wiki --lbzip2 --languages en

image

@Deriq-Qian-Dong

Deriq-Qian-Dong commented Nov 28, 2020

@foolfun Try sling build_wiki without the --lbzip2 and --languages flags. It works for me.

@ringgaard
Owner

I think I managed to fix the error that caused fuse_items to crash, so if you sync to HEAD you should be able to run the wiki pipeline. See this commit.

You can just resume from the fuse_items stage, so you don't need to re-run the whole wiki pipeline again:

sling fuse_items build_kb extract_names build_nametab build_phrasetab
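
Resuming from a given stage amounts to slicing an ordered stage list. The sketch below is only an illustration of that idea. The list holds just the five stages named in the command above (the earlier stages of the wiki pipeline are not listed in this thread), and resume_command is a hypothetical helper, not a SLING function.

```python
import subprocess

# Tail of the wiki pipeline, in the order given in the command above.
STAGES = ["fuse_items", "build_kb", "extract_names",
          "build_nametab", "build_phrasetab"]

def resume_command(start, stages=STAGES):
    """Build the sling command line that runs start and all later stages."""
    if start not in stages:
        raise ValueError("unknown stage: %s" % start)
    return ["sling"] + stages[stages.index(start):]

if __name__ == "__main__":
    cmd = resume_command("fuse_items")
    print(" ".join(cmd))
    # Uncomment to actually run; check=True raises if sling exits non-zero.
    # subprocess.run(cmd, check=True)
```

If a later stage crashes, the same helper can be called with that stage's name to pick up from there.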

Next, I will try to see if I can reproduce the CHECK fault in the silver annotation pipeline:
[2020-11-27 17:39:19.446544: F sling/nlp/kb/calendar.cc:41] Check failed: num >= 0

@foolfun
Author

foolfun commented Nov 29, 2020

It works! I have been stuck on this issue for nearly two weeks. Thank you very much!

@foolfun
Author

foolfun commented Dec 1, 2020

I have run the silver annotation pipeline, and the result is shown in the following picture. However, I still cannot find 'local/data/e/silver/en/silver-00000-of-00010.rec'. I don't know whether I missed some important steps. Can you help me?
image

By the way, the files I can find are:
image

@ringgaard
Owner

The output looks correct. The silver-annotated Wikipedia documents are in train-*.rec and eval.rec. Together these contain all the Wikipedia articles. They are split into train and eval because I use this data as noisy training data for the semantic parser.

NB: I did a complete run of the silver annotation pipeline and it did not get the Check failed: num >= 0 error. This error could be due to Wikidata errors in the date items. My version of Wikidata is from Nov 25.

@foolfun
Author

foolfun commented Dec 1, 2020

I cannot find 'e/silver/en/silver-0000%d-of-00010.rec'. How can I get this file?

@ringgaard
Owner

From where did you get the impression that the silver annotations should be in e/silver/en/silver-0000%d-of-00010.rec?

The silver annotations are in local/data/e/silver/en/train-?????-of-00010.rec and local/data/e/silver/en/eval.rec. You can take a look at the data with the codex tool:

bin/codex local/data/e/silver/en/train-00000-of-00010.rec | less

Each record is a Wikipedia article and contains the title, the raw text, the tokens, and the mentions with evoked frames.
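
The ????? in train-?????-of-00010.rec stands for a zero-padded shard number. As a rough illustration, the sketch below expands such a sharded name into the individual file names and reports which ones exist. The five-digit padding and the shard count of 10 are inferred from the file names in this thread, and shard_names is a hypothetical helper rather than part of the SLING API.

```python
import os

def shard_names(directory, base, shards, ext="rec"):
    """Return names like base-00000-of-00010.ext for every shard."""
    return [
        os.path.join(directory, "%s-%05d-of-%05d.%s" % (base, i, shards, ext))
        for i in range(shards)
    ]

if __name__ == "__main__":
    for name in shard_names("local/data/e/silver/en", "train", 10):
        status = "found" if os.path.exists(name) else "missing"
        print(status, name)
```

Run from the root of the repository, this gives a quick way to confirm that all ten train shards were written.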

@Deriq-Qian-Dong

Hi Ringgaard, I'm using google/sling to get the silver annotations, and I get the error "Check failed: num >= 0". I don't have sudo rights to build SLING from your repository. Do you have any idea how to deal with this?

@foolfun
Author

foolfun commented Dec 2, 2020

Sorry, I am trying to run distantly_supervise.py, which needs silver-0000%d-of-00010.rec in line 543. That is why I asked how to get this file.

I tried replacing silver-0000%d-of-00010.rec with train-0000%d-of-00010.rec, but then kb_item is None at line 348, so I guess this approach does not work. Do you have any idea how to deal with this?

@Deriq-Qian-Dong

image
When using google/sling, I got the silver-* files, but they are not correct because the processing did not complete.

@ringgaard
Owner

The problem seems to be that the distantly_supervise.py script expects the silver data to be indexed by QIDs but the silver pipeline assigns random keys in order to shuffle the data set for training. How many documents do you need to extract? Is it all of them or just a small subset?

@Deriq-Qian-Dong

all of them, I think

@ringgaard
Owner

There are basically two solutions: either take the train and eval files and reindex them, or make a new silver workflow that is compatible with the old mode.

Let me first check out how difficult it would be to make a custom silver workflow that produces the output that distantly_supervise.py expects.
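
To make the first option concrete, reindexing means discarding the random shuffle keys and keying each record by its QID instead. The sketch below shows only that idea on toy data. It does not use the SLING record API, and qid_of is a hypothetical callable, since how to recover the QID from a silver document is not spelled out in this thread.

```python
def reindex(records, qid_of):
    """Return a dict mapping QID to record value.

    records: iterable of (key, value) pairs, where key is the random
             shuffle key assigned by the silver pipeline.
    qid_of:  hypothetical callable that extracts the QID from a value.
    """
    by_qid = {}
    for _, value in records:
        qid = qid_of(value)
        if qid is None:
            continue  # skip records without a recoverable QID
        by_qid[qid] = value
    return by_qid

if __name__ == "__main__":
    # Toy stand-in data; real records would come from train-*.rec and eval.rec.
    toy = [("8f3a", {"qid": "Q42", "text": "..."}),
           ("01bc", {"qid": "Q1", "text": "..."})]
    index = reindex(toy, lambda rec: rec.get("qid"))
    print(sorted(index))
```

An in-memory dict is only suitable for a small subset. For all Wikipedia articles, the rekeyed records would need to be written back to record files rather than held in memory.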

@Deriq-Qian-Dong

Thanks a lot!

@ringgaard
Owner

ringgaard commented Dec 2, 2020

With the Python script below you should be able to produce the silver-*.rec output that should be compatible with distantly_supervise.py:

import sling
import sling.flags as flags
import sling.log as log
import sling.task.workflow as workflow
import sling.task.wiki as wiki
import sling.task.corpora as corpora

# Parse command-line flags and start up the workflow system.
flags.parse()
workflow.startup()

language = flags.arg.language
workdir = flags.arg.workdir + "/silver/" + language

wf = workflow.Workflow("silver")
wikiwf = wiki.WikiWorkflow(wf=wf)

# Input: Wikipedia documents. Output: silver documents in 10 shards.
indocs = wikiwf.wikipedia_documents(language)
outdocs = wf.resource("silver@10.rec", dir=workdir, format="records/document")
idf = wf.resource("idf.repo", dir=workdir, format="repository")

# Stopword and blacklist configuration, plus the phrase list.
config = corpora.repository("data/wiki/" + language + "/silver.sling")
phrases = corpora.repository("data/wiki/" + language) + "/phrases.txt"

# Document processor task with the annotators to run on each document.
mapper = wf.task("document-processor", "labeler")
mapper.add_annotator("mentions")
mapper.add_annotator("anaphora")
mapper.add_annotator("phrase-structure")
mapper.add_annotator("relations")

mapper.add_param("resolve", True)
mapper.add_param("language", language)

# Knowledge base and silver configuration are both attached as commons.
mapper.attach_input("commons", wikiwf.knowledge_base())
mapper.attach_input("commons", wf.resource(config, format="store/frame"))

mapper.attach_input("aliases", wikiwf.phrase_table(language))
mapper.attach_input("dictionary", idf)
mapper.attach_input("phrases", wf.resource(phrases, format="lex"))

# Read the input documents, annotate them, and write the output records.
wf.connect(wf.read(indocs), mapper)
output = wf.channel(mapper, format="message/document")
wf.write(output, outdocs)

workflow.run(wf)
workflow.shutdown()

You can check the output with this command:

bin/codex --lex local/data/e/silver/en/silver* 

@ringgaard
Owner

Hmm... My test run seems to indicate that the script above does not read the stopword and blacklists correctly, resulting in many spammy annotations. Let me try to fix this.

@Deriq-Qian-Dong

image

Hmm... When I ran this script, I got the same error.

@ringgaard
Owner

Is there a stack trace below the "Check failed:" line?

@Deriq-Qian-Dong

(core dumped)

@ringgaard
Owner

The CHECK fault indicates that some invalid date is being processed. You could just comment out the CHECK in line 41 of calendar.cc. It would cause some invalid dates in the output annotations, but without further information, I don't know how to fix this.

@ringgaard
Owner

I have updated the Python script above to include the configuration of stopwords and blacklists. The following lines were missing:

config = corpora.repository("data/wiki/" + language + "/silver.sling")
mapper.attach_input("commons", wf.resource(config, format="store/frame"))

This should remove a lot of spammy annotations for common words and phrases.

@foolfun
Author

foolfun commented Dec 2, 2020

Hi, ringgaard! When I ran the script, I met this problem:
image

@ringgaard
Owner

I assume that you run the script from the root directory of the git repo. The silver.sling file is checked into the master branch of the repo here. I don't understand why you don't have this file.

@foolfun
Author

foolfun commented Dec 3, 2020

Hi, ringgaard! I ran the script, but I got an error:
image

@ringgaard
Owner

I haven't been able to reproduce the Check failed: num >= 0 error yet, so I think the best option for now is to replace line 41 in sling/nlp/kb/calendar.cc with:

    DCHECK(num >= 0);

and rebuild the code (tools/buildall.sh). This could result in some bad date annotations, but it would allow you to get on with producing the silver annotations.

@foolfun
Author

foolfun commented Dec 3, 2020

Hello ringgaard. Sorry for disturbing you. I tried many times, but the same error still occurs.

@ringgaard
Owner

@foolfun: Are you still getting the same CHECK fault in sling/nlp/kb/calendar.cc line 41 although you have changed it to a DCHECK, which can only happen in debug mode? Are you sure you recompiled the code using tools/buildall.sh?

@foolfun
Author

foolfun commented Dec 3, 2020

Yes, I did these steps and the same error happened. I found that the error was probably related to my SLING Python API installation and Python environment. After I reconfigured them, the script has now been running for half an hour without any error. Thank you for your patience.

@ringgaard
Owner

It can sometimes be confusing which version of the SLING Python API you are using if you switch between using pip and downloading and installing the code yourself. You can check where it is installed like this:

$ python3 -c "import sling; print(sling)"
<module 'sling' from '/usr/lib/python3/dist-packages/sling/__init__.py'>
$ ls -l /usr/lib/python3/dist-packages/sling
lrwxrwxrwx 1 root root 26 Oct  1 17:54 /usr/lib/python3/dist-packages/sling -> /home/michael/sling/python

In "developer mode" it is important that the python package directory (/usr/lib/python3/dist-packages/sling) is a symlink to the python directory in your repository directory (/home/michael/sling/python). Otherwise, recompilation has no effect.
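
The same check can be done from Python. This is a minimal sketch using only the standard library; package_location is a hypothetical helper, and the package name "sling" is the only SLING-specific assumption. It reports where the package is imported from, what that path resolves to, and whether the package directory itself is a symlink.

```python
import importlib.util
import os

def package_location(name):
    """Return (directory, resolved_directory, is_symlink) for a package."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ImportError("package not found: %s" % name)
    directory = os.path.dirname(spec.origin)
    return directory, os.path.realpath(directory), os.path.islink(directory)

if __name__ == "__main__":
    try:
        directory, target, is_link = package_location("sling")
    except ImportError as err:
        print(err)
    else:
        print("installed at:", directory)
        print("resolves to: ", target)
        print("symlink:     ", is_link)
```

If the resolved path is not the python directory of the repository you are rebuilding, a recompiled CHECK or DCHECK change will not be picked up.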

@foolfun
Author

foolfun commented Dec 3, 2020

Got it! Thank you!

@ringgaard
Owner

Since the distantly_supervise.py script does random lookups in the silver data set, it might be useful to index this to make it faster:

bin/index local/data/e/silver/en/silver*

PS: My run of the silver pipeline completed successfully.

@ringgaard
Owner

I have made prebuilt versions of the knowledge base and alias tables available on the ringgaard.com web site. You can use the sling command to download these, e.g.:

sling fetch --dataset kb,phrasetab
