Part_id bug with idp4 dataset from tagtog #113

Open
6 of 7 tasks
abojchevski opened this issue Sep 30, 2015 · 7 comments

Comments

@abojchevski
Collaborator

Indeed, the old doc didn't have s4s1f1p1; its text was somehow included at the end of s4s1p2. Parsing the doc again with the latest version of tagtog does create s4s1f1p1. It may have been an old bug in tagtog. The old doc contained in nala was parsed by NcbiJournalArticleParser_v0_3, whereas tagtog now uses NcbiJournalArticleParser_v0_4 (that is, a newer version). The wrong text placement is what causes the display bug.

Besides, only Ectelion (Rustem) and Shpend had annotated the doc. Ectelion's annotations in that part refer to the correct s4s1f1p1, whereas Shpend also has annotations in that part but they refer to the wrong s4s1p2. Sanjeev has an empty ann.json that can be safely deleted.

Tasks:

  • Backup all data for that doc
  • Delete doc on tagtog
  • Reupload on tagtog
  • Reupload Ectelion's annotations, making sure that the offsets are correct
  • [ ] Reupload Shpend's annotations, correcting the offsets
    --> Decided it's not worth the effort. See comment: Part_id bug with idp4 dataset from tagtog #113 (comment)
  • Reupload html to corpora in nala repo --> Rostlab/nala@a979175
  • Copy links to the ann.json downloads for Ectelion's and Shpend's annotations
  • The nala team (rather than Juanmi) decides whether to replace the html in bootstrap 0, and whether and how to include the ann.json (after a possible merge)

Original description:

For the document aYwkx1JUQj1EJKF0BpuDGtqNMqnK-PMC3613162

  • the .html has NO part with the following part_id s4s1f1p1
  • however, Ectelion's .ann.json has a few entities that reference that part_id
  • and, sanjeevkrn's .ann.json lists the same part_id in annotatable

Additionally, when you open it on tagtog, the following weird thing is going on:
[screenshot: capture]
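
The mismatch described above (a part_id referenced by an .ann.json but absent from the .html) could be detected with a small check like the following. This is only a sketch under assumptions: it supposes that parts appear in the html as elements carrying an `id` attribute, and that the ann.json has an `entities` list whose items have a `part` field plus an `annotatable.parts` list. The function names are hypothetical and are not part of nala or tagtog.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the value of every id="..." attribute in an html document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def part_ids_in_html(html):
    collector = IdCollector()
    collector.feed(html)
    return collector.ids


def referenced_part_ids(ann):
    # Assumed layout: entities carry a "part" field, and
    # "annotatable" -> "parts" lists the annotatable part ids.
    ids = {e["part"] for e in ann.get("entities", []) if "part" in e}
    ids.update(ann.get("annotatable", {}).get("parts", []))
    return ids


def missing_part_ids(html, ann):
    """Part ids that the annotations mention but the html does not define."""
    return referenced_part_ids(ann) - part_ids_in_html(html)


# Toy data that mimics the situation in this issue (not the real document).
toy_html = '<p id="s4s1p1">first</p><p id="s4s1p2">second</p>'
toy_ann = {
    "annotatable": {"parts": ["s4s1p1", "s4s1p2", "s4s1f1p1"]},
    "entities": [{"part": "s4s1f1p1"}, {"part": "s4s1p2"}],
}
print(missing_part_ids(toy_html, toy_ann))
```

On the toy data the only missing id is s4s1f1p1. For real files one would read the html as text and load the ann.json with `json.load` before calling `missing_part_ids`.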

@abojchevski abojchevski modified the milestones: 2-ThisWeek, 0-Inbox Sep 30, 2015
@juanmirocks juanmirocks modified the milestones: 2-ThisWeek, 0-Inbox Oct 2, 2015
@juanmirocks juanmirocks modified the milestones: 3-Next, 2-ThisWeek Nov 2, 2015
@juanmirocks juanmirocks modified the milestones: 2-ThisWeek, 3-Next Nov 25, 2015
@juanmirocks
Collaborator

Unfortunately, Shpend's annotations are rather difficult to recover, since all figure annotation offsets are wrong (due to the old parsing getting all figure captions wrong). Ectelion's still work, so we go with them.

Just in case, here is a backup of Shpend's original annotations. NOTE: the wrong s4s1p2 annotations were already changed to s4s1f1p1, EXCEPT for the relation offsets ...

The relation offset mappings for those should be:

s4s1p2|1040,1046 -> 404,410
s4s1p2|1168,1175 -> 528,535
s4s1p2|1185,1192 -> 549,556
s4s1p2|1164,1166 -> 532,539
s4s1p2|1181,1183 -> 545,547

Shpendi-aYwkx1JUQj1EJKF0BpuDGtqNMqnK-PMC3613162.ann.json.zip
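
If someone does want to apply the relation offset mappings listed above, a lookup table is probably enough. The sketch below copies the table verbatim and rewrites references of the form `part|start,end`. It assumes that the new offsets live in s4s1f1p1 (as the NOTE above suggests) and that relation references can be handled as plain strings in this form; the helper names are hypothetical. It also adds a sanity check that flags rows where the span length changes, since a pure shift should preserve it.

```python
# Mapping table copied as-is from the comment: "part|start,end" -> "start,end".
RELATION_OFFSET_MAP = {
    "s4s1p2|1040,1046": "404,410",
    "s4s1p2|1168,1175": "528,535",
    "s4s1p2|1185,1192": "549,556",
    "s4s1p2|1164,1166": "532,539",
    "s4s1p2|1181,1183": "545,547",
}

# Assumption: the remapped offsets refer to the newly created figure part.
NEW_PART = "s4s1f1p1"


def span_length(offsets):
    """Length of a "start,end" span."""
    start, end = (int(x) for x in offsets.split(","))
    return end - start


def remap(ref):
    """Rewrite a "part|start,end" reference; unmapped references pass through."""
    new_offsets = RELATION_OFFSET_MAP.get(ref)
    if new_offsets is None:
        return ref
    return f"{NEW_PART}|{new_offsets}"


def length_mismatches():
    """Rows of the table whose old and new spans differ in length."""
    flagged = []
    for old_ref, new_offsets in RELATION_OFFSET_MAP.items():
        old_offsets = old_ref.split("|", 1)[1]
        if span_length(old_offsets) != span_length(new_offsets):
            flagged.append(old_ref)
    return flagged


print(remap("s4s1p2|1040,1046"))
print(length_mismatches())
```

With the table as listed, `length_mismatches()` reports s4s1p2|1164,1166, whose span grows from 2 to 7 characters, so that row may deserve a second look before it is applied.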

@juanmirocks
Collaborator

New html:

Rather use: Rostlab/nala@a979175

(this one is in xml form)
new-aYwkx1JUQj1EJKF0BpuDGtqNMqnK-PMC3613162.plain.html.zip

@juanmirocks
Collaborator

@abojchevski @carstenuhlig the doc is corrected on tagtog, and so is the html in nala. Now it's your turn to decide how to put Ectelion's ann.json into the corpus and into bootstrapping 0.

Let me know if you have questions

@juanmirocks
Collaborator

@abojchevski @carstenuhlig did you guys include this in itr0 with the correct files?

@abojchevski
Collaborator Author

I believe so. However it was a long time ago and I can't be sure.

@juanmirocks
Collaborator

See #169
