
Pyconll throws error when more than 10 columns #125

Open

IHMCBrodie opened this issue Oct 21, 2021 · 9 comments

@IHMCBrodie

IHMCBrodie commented Oct 21, 2021

Describe the bug
When trying to read a CoNLL-U formatted file with more than 10 columns, the following error occurs:
pyconll.exception.ParseError: The number of columns per token line must be 10.

To Reproduce
Run pyconll.load_from_file("/path/to/file") on a file that has CoNLL-formatted data with more than 10 columns.
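For reference, a minimal reproduction sketch (the file path is a placeholder):

import pyconll

# On a file whose token lines have more than 10 columns, this raises:
# pyconll.exception.ParseError: The number of columns per token line must be 10.
corpus = pyconll.load_from_file("/path/to/file")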

Expected behavior
This one is a bit tougher; I suppose the expected behavior would be that it handles any number of extra columns. The reason I am raising this issue is that it is common, especially with semantic role labeling (SRL), for the extra data (SRL labels in this case) to be placed in extra columns after the 10th column. One major example (although it's not exactly CoNLL-U format) is the CoNLL-formatted OntoNotes 5.0. Another example (using CoNLL-U format) is this library for projecting SRL labels. For every predicate identified in the sentence, there is an extra column with the corresponding SRL annotations. So it would be nice if pyconll could read in the file without raising an error.

  • OS: MacOS
  • Version: pyconll 3.1.0
@IHMCBrodie IHMCBrodie added the bug label Oct 21, 2021
@matgrioni
Collaborator

Hi IHMCBrodie,

Thanks for the issue. This is the first time I have seen a use case like this, but it definitely makes sense. My initial approach to the library was to be strict about what it expects, to keep the program invariants simpler to maintain and support costs low. The assumption was that if the data does not follow this format, it should not be expected to parse.

In general, there are a few requests and ideas that have come up against this level of strictness, and I would really like to expand what the library supports. I think a reader options object that parameterizes how reading is done would help a lot here. This would be something you pass in to the load methods, like

rc = ReaderConfig(preserve_extra_columns=True, require_validation=False)
pyconll.load_from_file('my.conll', rc)

It may just be simpler to have the non-validation mode automatically preserve extra columns; I'm not sure of the exact path I would take.

@matgrioni
Collaborator

Since this is by design, I am going to remove the bug tag and change it to a feature request. I will see what can be done. There are a few other things in the library that make this more difficult than it appears at first glance (namely the use of slots on Token, and I'd rather not have an extra columns variable when it's not used).

I'm also curious about your requirements here. Is this something you would like to see soon, or are you able to work around it otherwise? Since a small personal fix is pretty easy, I wonder if a nightly-support type build that works off of specified branches would be helpful for this scenario. That way you are unblocked, but a proper fix can be done on a separate cadence.

@matgrioni matgrioni added enhancement and removed bug labels Oct 22, 2021
@IHMCBrodie
Author

Hello again, haha, and thanks again for the quick responses!

A major organization that makes use of the CoNLL-formatted OntoNotes 5.0 is AllenNLP, for their SRL predictor tool. I totally understand being strict in the initial approach. Also, yes, I am familiar with the format at the Universal Dependencies link; from what I know, the library I linked above for projecting SRL labels follows the CoNLL-U format described on the Universal Dependencies page. I am not sure how prevalent it is for additional columns to be added; maybe this is more exclusive to SRL tasks than I originally thought.

I like your idea of passing in reader options; I think that could be a solid direction to take.

Perfect, yes, I think the enhancement tag is better suited to this issue, and I totally understand the difficulty that can come with tasks like these and making them fit nicely into what has already been done.

We are working toward building a non-English SRL model for AllenNLP and are using the X-SRL library (linked above) as part of a project I helped design for a college course, so as soon as possible would be great. I am aware, of course, that not everything can be dropped for one person, but that is a very generous offer, and I think it would be fantastic to have a specific branch for this issue, with a proper fix coming in the future.

@matgrioni
Collaborator

Well it happens frequently enough that people use pyconll for projects that can be time-sensitive, and given the stable feature base, being able to distribute small tweaks here and there is useful.

I'll have something this weekend. Distributing it properly is honestly more work than anything else for something like this. I'll update this thread once that version is up.

@IHMCBrodie
Author

Thanks! This is super helpful and much appreciated!

@matgrioni
Collaborator

Hello again,

You can find the patched version at https://pypi.org/project/pyconll/3.1.0.dev1/. Since this isn't the most recent version, you can install it with this command: pip install pyconll==3.1.0.dev1. I assumed you are using PyPI/pip; if you are using conda, please let me know and I can also release on that platform as necessary. I'm still determining the best way to handle this scenario, so I didn't want to put too much time into an interim approach.

This version only requires that there be at least 10 columns. Any extra columns for a token are put in Token.extra_cols, so the 11th column is extra_cols[0], the 12th column is extra_cols[1], and so on.

These are also output on serialization. Note that there is no requirement for all tokens to have the same number of extra columns.
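A minimal usage sketch of the behavior described in this comment, using the standard pyconll sentence/token iteration (the file name is a placeholder):

import pyconll

corpus = pyconll.load_from_file('my_srl_data.conllu')
for sentence in corpus:
    for token in sentence:
        # extra_cols[0] holds the 11th column, extra_cols[1] the 12th, and so on.
        # Tokens do not all need to have the same number of extra columns.
        for i, value in enumerate(token.extra_cols):
            print(token.id, 'column', 11 + i, '=', value)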

Please let me know about anything else here!

@matgrioni
Collaborator

I just tested it out. It seems there is another irregularity in the file provided. The 9th column is supposed to be for enhanced dependencies, but in several lines, like this one:
10 desmoronó desmoronar VERB VERB Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 10 ROOT Y fall.01 _ _ _

there is a notation I do not quite understand: it uses 'Y'. I have released another version, 3.1.0.dev2, to unblock this if needed (like other fields, if '_' is the value then None is provided; otherwise it is the raw string). I only see instances of 'Y', so I'm not clear on what it means or why an existing field was co-opted for it.
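A small sketch of inspecting that column under the dev2 behavior described above, assuming the raw value surfaces on Token.deps (the file name is a placeholder):

import pyconll

corpus = pyconll.load_from_file('xsrl_sample.conllu')
for sentence in corpus:
    for token in sentence:
        # Under the described behavior, '_' becomes None and anything else,
        # such as the 'Y' in the line above, is kept as the raw string.
        if token.deps is not None:
            print(token.id, token.form, token.deps)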

If there are other structural irregularities relative to the spec, let me know and we can work through the best way to handle them. Since the file samples on that repo are small, I suspect a larger set would surface more cases.

@IHMCBrodie
Author

Hey, thanks so much for taking care of this, and so quickly! Yes, PyPI/pip is fine, and releasing that version is a very nice way for us to make use of it. Thanks again.

Yeah, I forgot that some of the examples at the link I gave you have that "Y". What it is: they were marking which token is the predicate in the sentence. It shows up in the sample data on their repo, but I notice that when using their text_to_conll.py script it does not output that "Y", so I think for our purposes it should be fine. Thank you for checking it yourself and also creating a release that handles that case.

I will keep you posted if anything else comes up, but again, I immensely appreciate this work and accommodation.

@matgrioni
Collaborator

Version 3.1.0.dev3 was released with these changes (I didn't change enhanced dependency parsing, just in case). The features column now works like the misc column: there is no need for any values to be present, and the requirements are pretty permissive. It shouldn't fail on anything now, although for really odd input (like "SP=" or "SP=,") you won't get back exactly that result, but I don't think that's an issue here.
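A quick sketch of exercising that permissive parsing, assuming the FEATS column is still exposed on Token.feats (the file name is a placeholder):

import pyconll

# Token lines with unusual FEATS entries such as 'SP=' or 'SP=,' should no
# longer raise a ParseError, though those exact values may not round-trip.
corpus = pyconll.load_from_file('xsrl_sample.conllu')
for sentence in corpus:
    for token in sentence:
        print(token.id, token.feats)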
