How to run the silver annotation pipeline #5
+1 |
The "silver annotation pipeline" is not yet properly documented as it is still under development, but you should be able to run it. First you run the wiki pipeline as described here. Then you need to build an IDF table using this command:
Then you can run silver annotation on all the Wikipedia articles:
It takes quite a while to run the silver annotation pipeline (10 hours on my machine). Please let me know if this works for you. |
Hi, thanks for your reply. When I run sling silver_annotation, I get this error message: |
I remember having seen this error before. Let me check if there are some changes from the dev branch that I haven't submitted to the master branch. |
Thanks for your reply! When I ran 'sling fuse_items', I hit the problem described in #4. I have no idea why it happened. Can you help me? |
It seems like I will have to do a complete test run of the wiki and silver annotation pipelines. I run these in a slightly different mode, using wiki snapshots to get a Wikidata dump and the reconciler for fusing items. It seems like there is some bug in the old pipeline. You should check that you have enough disk space. You will need something like 500 GB of free space on your hard drive, including your temp directory (usually /tmp). There have been reports that out-of-disk-space conditions are not always reported correctly. You should also check that you don't have a bunch of temp files left over from runs that crashed. You can remove old temp files using this command:
It is going to take a while to rerun the pipelines, so please be patient. I will try to do this over the weekend. I have a server upgrade on Sunday, which will also delay this. |
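The disk space and leftover temp file checks mentioned above can be sketched in plain Python. This is only an illustration using the standard library, not part of SLING: the 500 GB threshold is the rough figure quoted in this thread, the seven-day age limit is an arbitrary assumption, and the script only lists candidates without deleting anything, since it cannot tell which temp files belong to crashed pipeline runs.

```python
import os
import shutil
import tempfile
import time

REQUIRED_FREE_GB = 500  # rough figure quoted in this thread, not a hard limit


def free_gb(path):
    """Return the free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def stale_entries(tmpdir, max_age_days=7):
    """List entries directly under `tmpdir` not modified for `max_age_days` days.

    The age limit is an assumption; nothing is deleted here.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for entry in os.scandir(tmpdir):
        try:
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
        except OSError:
            continue  # entry vanished or is unreadable; skip it
    return sorted(stale)


if __name__ == "__main__":
    tmpdir = tempfile.gettempdir()  # usually /tmp
    gb = free_gb(tmpdir)
    status = "ok" if gb >= REQUIRED_FREE_GB else "probably too little"
    print(f"{tmpdir}: {gb:.0f} GB free ({status})")
    for path in stale_entries(tmpdir):
        print("stale:", path)
```

If the data directory and the temp directory are on different filesystems, it is worth calling free_gb on both.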
Thank you so much. I will take your advice and try again. |
@foolfun try sling build_wiki without lbzip2 and languages. It works for me. |
I think I managed to fix the error that caused fuse_items to crash, so if you sync to HEAD you should be able to run the wiki pipeline. See this commit. You can just resume from the fuse_items stage, so you don't need to re-run the whole wiki pipeline again:
Next, I will try to see if I can reproduce the CHECK fault in the silver annotation pipeline: |
It works! I have been struggling with this issue for nearly two weeks. Thank you very much! |
I have run the silver annotation pipeline, and the result is shown in the following picture. However, I still cannot find 'local/data/e/silver/en/silver-00000-of-00010.rec'. I don't know whether I missed some important step. Can you help me? |
The output looks correct. The silver-annotated Wikipedia documents are in train-*.rec and eval.rec. Together these contain all the Wikipedia articles. They are split into train and eval because I use this data as noisy training data for the semantic parser. NB: I did a complete run of the silver annotation pipeline and it did not get the |
I cannot find 'e/silver/en/silver-0000%d-of-00010.rec'. How can I get this file? |
From where did you get the impression that the silver annotations should be in The silver annotations are in
Each record is a Wikipedia article and contains the title, the raw text, the tokens, and the mentions with evoked frames. |
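To see which of these files a run actually produced, a small standard-library check along the following lines may help. It is a sketch, not part of SLING: the directory is passed in by the caller because the exact output location depends on your setup, and the function names find_silver_outputs and missing_shards are made up for this example. The '%d' shard pattern is the one quoted earlier in this thread.

```python
import glob
import os
import sys


def find_silver_outputs(directory):
    """Return (sorted train-*.rec paths, eval.rec path or None) in `directory`."""
    train = sorted(glob.glob(os.path.join(directory, "train-*.rec")))
    eval_path = os.path.join(directory, "eval.rec")
    return train, (eval_path if os.path.isfile(eval_path) else None)


def missing_shards(directory, pattern, shards):
    """Return the file names from a '%d' shard pattern that are absent."""
    names = [pattern % i for i in range(shards)]
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


if __name__ == "__main__":
    # The directory is an assumption; pass your own output directory.
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    train, eval_file = find_silver_outputs(directory)
    print(f"{len(train)} train shard(s), eval file: {eval_file}")
    for name in missing_shards(directory, "silver-0000%d-of-00010.rec", 10):
        print("missing:", name)
```

If the train and eval files are present but every silver-*.rec shard is reported missing, the pipeline ran in the newer mode described above rather than failing.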
Hi Ringgaard, I'm using google/sling to get silver annotations, and I get the error "Check failed: num >= 0". I don't have sudo rights to build sling from your repository. Do you have any idea how to deal with this? |
Sorry,I try to run distantly_supervise.py which needs I tried to replace |
The problem seems to be that the |
all of them, I think |
There are basically two solutions: either take the train and eval files and reindex them, or make a new silver workflow that is compatible with the old mode. Let me first check out how difficult it would be to make a custom silver workflow that produces the output that |
Thanks a lot! |
With the Python script below you should be able to produce the silver-*.rec output that should be compatible with distantly_supervise.py:
You can check the output with this command:
|
Hmm... My test run seems to indicate that the script above does not read the stopwords and blacklists correctly, resulting in many spammy annotations. Let me try to fix this. |
Is there a stack trace below the "Check failed:" line? |
(core dumped) |
The CHECK fault indicates that some invalid date is being processed. You could just comment out the CHECK in line 41 of calendar.cc. It would cause some invalid dates in the output annotations, but without further information, I don't know how to fix this. |
I have updated the Python script above to include the configuration of stopwords and blacklists. The following lines were missing:
This should remove a lot of spammy annotations for common words and phrases. |
I assume that you run the script from the root directory of the git repo. The |
I haven't been able to reproduce the
and rebuild the code ( |
Hello ringgaard. Sorry to disturb you. I tried many times, but the same error still happens. |
@foolfun: Are you still getting the same CHECK fault in sling/nlp/kb/calendar.cc line 41 although you have changed it to a DCHECK, which can only happen in debug mode? Are you sure you recompiled the code using |
Yes, I did these steps and the same error happened. I find the error may be related to my |
It can sometimes be confusing which version of the SLING Python API you are using if you switch between using pip and downloading and installing the code yourself. You can check where it is installed like this:
In "developer mode" it is important that the python package directory (/usr/lib/python3/dist-packages/sling) is a symlink to the python directory in your repository directory (/home/michael/sling/python). Otherwise, recompilation has no effect. |
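The symlink requirement can be checked with a few lines of standard-library Python. This is a sketch under assumptions, not a SLING tool: package_dir and points_to are names invented for this example, and the repository path must be supplied by you (the paths quoted above are from one particular machine).

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory a package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


def points_to(package_path, repo_python_dir):
    """True if `package_path` resolves (through symlinks) to `repo_python_dir`."""
    return os.path.realpath(package_path) == os.path.realpath(repo_python_dir)


if __name__ == "__main__":
    path = package_dir("sling")
    if path is None:
        print("sling is not importable from this interpreter")
    else:
        print("sling is imported from:", path)
        print("is a symlink:", os.path.islink(path))
        print("resolves to:", os.path.realpath(path))
```

If the package directory is not a symlink, or resolves somewhere other than the python directory of the repository you are recompiling, the interpreter is most likely loading a different copy, for example one installed with pip. Calling points_to(path, "/path/to/your/sling/python") makes that comparison explicit.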
Got it! Thank you! |
Since the
PS: My run of the silver pipeline completed successfully. |
I have made prebuilt versions of the knowledge base and alias tables available on the ringgaard.com web site. You can use the
|
I ran the DrKIT code, which uses 'sling/local/data/distant/facts-0000%d-of-00010.json'. How can I get this file?