syntactic_head_ID erroneously references a token in the previous sentence #3

victoryhb · 2022-01-24T05:34:09Z

Hi, thank you for making this great library!
When parsing long documents, the syntactic_head_ID of a token will sometimes reference a token in the previous sentence. For example, in the parsing output in the attached file (dKDD.csv):

0	2	0	28	she	she	122	125	PRON	PRP	nsubj	29	O
0	2	1	29	's	be	126	128	AUX	VBZ	ROOT	29	O
0	2	2	30	not	not	129	132	PART	RB	neg	29	O
0	2	3	31	the	the	133	136	DET	DT	det	32	O
0	2	4	32	one	one	137	140	NOUN	NN	attr	29	O
0	2	5	33	to	to	141	143	PART	TO	aux	34	O
0	2	6	34	write	write	144	149	VERB	VB	relcl	32	O
0	2	7	35	.	.	150	151	PUNCT	.	punct	29	O
0	3	0	36	Yeah	yeah	152	156	INTJ	UH	intj	35	O
0	3	1	37	.	.	157	158	PUNCT	.	punct	36	O

The syntactic_head_ID of token 36 (in sentence 3) is token 35 (sentence 2), which doesn't seem to make sense.
The same happens with tokens 62, 68, 91, 202, 276, 327, 328, 344, 376, 378, 385, 387, 433, 434, 499, 503, 516, 550, 556, 557, 558, 566, 589, 725, 751, 755, 813, 818, 843, 845, 853, 876, 880, 1450, 1502, 1563, 1756, 1881, 1882, 1902, 1926, 1972, 1993, 2054, 2058, 2059, 2086, 2097, 2103, 2488, 2489, 2511.
Is there a way to fix this?
dKDD.csv
dKDD.txt
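
To list every affected row rather than spotting them by eye, something like the following could be used. It is only a sketch: the column positions (sentence ID in column 1, document-level token ID in column 3, syntactic_head_ID in column 11, counting from 0) are inferred from the sample rows above and are not taken from any documented schema, and the function name is made up for illustration. The embedded sample is a four-row excerpt of the output above; heads that fall outside the supplied rows are skipped.

```python
import csv
import io

# Column positions (0-based) inferred from the sample rows in this issue.
# These are assumptions, not a documented schema.
SENTENCE_COL = 1
TOKEN_COL = 3
HEAD_COL = 11

# Four rows copied from the output above (tab-separated).
SAMPLE = (
    "0\t2\t6\t34\twrite\twrite\t144\t149\tVERB\tVB\trelcl\t32\tO\n"
    "0\t2\t7\t35\t.\t.\t150\t151\tPUNCT\t.\tpunct\t29\tO\n"
    "0\t3\t0\t36\tYeah\tyeah\t152\t156\tINTJ\tUH\tintj\t35\tO\n"
    "0\t3\t1\t37\t.\t.\t157\t158\tPUNCT\t.\tpunct\t36\tO\n"
)


def cross_sentence_heads(lines):
    """Return (token_id, head_id) pairs whose head lies in another sentence."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only data rows; a header row (non-numeric token ID) is skipped.
    rows = [r for r in reader if len(r) > HEAD_COL and r[TOKEN_COL].isdigit()]
    sentence_of = {int(r[TOKEN_COL]): int(r[SENTENCE_COL]) for r in rows}
    mismatches = []
    for r in rows:
        token, head = int(r[TOKEN_COL]), int(r[HEAD_COL])
        head_sentence = sentence_of.get(head)
        # Heads not present in the supplied rows cannot be checked.
        if head_sentence is not None and head_sentence != sentence_of[token]:
            mismatches.append((token, head))
    return mismatches


print(cross_sentence_heads(io.StringIO(SAMPLE)))  # prints [(36, 35)]
```

For the full file, passing `open("dKDD.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` object should work, assuming the attachment is tab-separated like the excerpt.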


dbamman · 2022-01-28T23:39:09Z

Thanks for the note! Yes, that does seem weird -- if this is with a version <1.0.7, try upgrading and see if it still happens (I'm running 1.0.7 with the "big" model and am not seeing that issue with dKDD.txt).

victoryhb · 2022-01-29T09:21:43Z

> Thanks for the note! Yes, that does seem weird -- if this is with a version <1.0.7, try upgrading and see if it still happens (I'm running 1.0.7 with the "big" model and am not seeing that issue with dKDD.txt).

I tried again and discovered that this bug only occurs when using en_core_web_lg as the parsing model (which I prefer as it appears to give more accurate results). Any idea why this is happening?

wjbmattingly · 2022-02-28T18:22:18Z

Which version of the en_core_web_lg model are you using @victoryhb?

victoryhb · 2022-03-01T05:19:01Z

@wjbmattingly I am using version 3.1.0
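
When reporting versions in a thread like this, the installed package versions can be read programmatically. The sketch below uses only the standard library and assumes the model was installed as a regular Python package under the name `en_core_web_lg`; the helper name is made up for illustration, and name matching may differ on older Python versions.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the parser library and the model package side by side.
for name in ("spacy", "en_core_web_lg"):
    print(name, installed_version(name) or "not installed")
```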
