syntactic_head_ID erroneously references a token in the previous sentence #3

Open
victoryhb opened this issue Jan 24, 2022 · 4 comments

Comments

@victoryhb

Hi, thank you for making this great library!
When parsing long documents, the syntactic_head_ID will sometimes reference a token in the previous sentence. For example, in the parsing output in the attached file (dKDD.csv):

0	2	0	28	she	she	122	125	PRON	PRP	nsubj	29	O
0	2	1	29	's	be	126	128	AUX	VBZ	ROOT	29	O
0	2	2	30	not	not	129	132	PART	RB	neg	29	O
0	2	3	31	the	the	133	136	DET	DT	det	32	O
0	2	4	32	one	one	137	140	NOUN	NN	attr	29	O
0	2	5	33	to	to	141	143	PART	TO	aux	34	O
0	2	6	34	write	write	144	149	VERB	VB	relcl	32	O
0	2	7	35	.	.	150	151	PUNCT	.	punct	29	O
0	3	0	36	Yeah	yeah	152	156	INTJ	UH	intj	35	O
0	3	1	37	.	.	157	158	PUNCT	.	punct	36	O

The syntactic_head_ID of token 36 (in sentence 3) is token 35 (sentence 2), which doesn't seem to make sense.
The same happens with tokens 62, 68, 91, 202, 276, 327, 328, 344, 376, 378, 385, 387, 433, 434, 499, 503, 516, 550, 556, 557, 558, 566, 589, 725, 751, 755, 813, 818, 843, 845, 853, 876, 880, 1450, 1502, 1563, 1756, 1881, 1882, 1902, 1926, 1972, 1993, 2054, 2058, 2059, 2086, 2097, 2103, 2488, 2489, 2511.
Is there a way to fix this?
dKDD.csv
dKDD.txt

@dbamman
Member

dbamman commented Jan 28, 2022

Thanks for the note! Yes, that does seem weird -- if this is with a version <1.0.7, try upgrading and see if it still happens. (I'm running 1.0.7 and the "big" model and am not seeing that issue with dKDD.txt.)

@victoryhb
Author

victoryhb commented Jan 29, 2022

> Thanks for the note! Yes, that does seem weird -- if this is with a version <1.0.7, try upgrading and see if it still happens. (I'm running 1.0.7 and the "big" model and am not seeing that issue with dKDD.txt.)

I tried again and discovered that this bug only occurs when using en_core_web_lg as the parsing model (which I prefer as it appears to give more accurate results). Any idea why this is happening?

@wjbmattingly

Which version of the en_core_web_lg model are you using @victoryhb?

@victoryhb
Author

@wjbmattingly I am using version 3.1.0.
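
For anyone comparing setups, the installed versions could be read with a short script like the one below. It is a sketch that assumes the library and the model were installed as pip packages under the names shown; `installed_version` is a hypothetical helper, and it returns None when a package is not found.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a pip package, or None if absent."""
    # Try both spellings, since distribution names may use '-' or '_'.
    for candidate in (package_name, package_name.replace("_", "-")):
        try:
            return metadata.version(candidate)
        except metadata.PackageNotFoundError:
            continue
    return None


# Package names are assumptions about how the tools were installed.
for name in ("booknlp", "spacy", "en_core_web_lg"):
    print(name, installed_version(name))
```

If the model is loaded through spaCy, its version should also be available as `nlp.meta["version"]` on the loaded pipeline.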
