Support for more CoNLL formats #49
Comments
Hi Marcel, thank you for your issue. I agree that given the similarities of these file types, it would make sense to have one library to read all of them. In the past, I was not able to find definitions of these file types as good as the one for CoNLL-U, which is why I focused on that format. I will take a look at these papers this week to see the differences and get back to you over the weekend with more info!
Thanks! You're not wrong that the definitions aren't always precise. What's more, I've encountered several files that don't fully conform to the specs but have been accepted as submissions to shared tasks etc. (A common violation is having empty columns without a My idea would be to have a Admittedly, I haven't looked into your codebase enough to understand how big/difficult of a change this would be, but it's the approach I took for my own (very hacky & incomplete :)) handling of these formats for now.
It took me longer than I expected to get through your links. I have not thought about it in great depth, but your suggestion sounds like a natural step toward a final solution (especially considering the flexibility in the CoNLL-U Plus format). This also matches the Python csv library, which allows you to create your own dialects and specify formatting changes that are cosmetic or identifier-related and easily swappable. For me the interesting case here is CoNLL-UL, which encodes a very rich representation (seemingly a DAG vs. a simple tree). My current approach to the rich alternative is to provide an immutable structure for viewing and processing, while any writes have to be on the serial, flat structure. In this case, Sentence has to be used for any changes and is a flat list of tokens, but a tree can be easily created from it using Since you are the first to ask for this support, I will also ask you whether there is a certain one of these formats that would be most useful for you for pyconll to support, so that if I run into time issues I can focus on that one for the time being :).
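To make the csv comparison above concrete, here is a minimal sketch of how Python's standard `csv` module lets a named dialect bundle formatting choices. The dialect name `conll_like` and the sample rows are made up for illustration; this is not pyconll code and does not reflect how pyconll is implemented.

```python
import csv
import io

# Register a named dialect that bundles purely cosmetic formatting choices.
# "conll_like" is a hypothetical name chosen for this example.
csv.register_dialect(
    "conll_like",
    delimiter="\t",          # columns are tab-separated
    quoting=csv.QUOTE_NONE,  # quote characters are ordinary data, not syntax
    lineterminator="\n",
)

# Made-up sample data with four tab-separated columns per token.
sample = "1\tThe\tthe\tDET\n2\tcat\tcat\tNOUN\n"

# The reader is configured by dialect name, so swapping formats
# means swapping one identifier rather than rewriting parsing code.
rows = list(csv.reader(io.StringIO(sample), dialect="conll_like"))
```

The same idea would carry over to CoNLL variants: the format description lives in one swappable object while the reading logic stays the same.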
I have finally started to get around to this issue, and am hoping to have it released at the end of next month along with a few other new advanced features. I am curious as to what your usage of these formats is, however, so that I can better understand their status and know what to include by default. For example, it seems like CoNLL-UL has not been updated since 2018 and there's not much activity on their pages or the project. At face value, this project doesn't seem active in the community; however, this could be a mistaken impression. If you have more information on this it would be great, but given my delays, it's very possible you are working on other projects nowadays :).
Thanks @matgrioni! I'm indeed not actively working with this right now, but I've used a hot-patched clone of your library in a paper that will soon be published, and I'm sure I'll work with CoNLL-* files again in the future! :) My use case was reading a variety of CoNLL-ish files coming from a variety of different research projects. It appears that many people use a CoNLL-* format as a starting point for representing their datasets, but then make custom modifications to fit their specific annotation needs. In order to parse and work with these files, I needed a parser that is as flexible and tolerant as possible when it comes to the types of columns that are in the data (as opposed to, say, enforcing strict compliance with a given file format spec).
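As a rough illustration of what "flexible and tolerant" could mean in practice, the sketch below reads tab-separated CoNLL-style lines against a caller-supplied list of column names. The function name `read_conll_like`, its parameters, and the choice of `_` as the placeholder for missing values are all assumptions made for this example; none of it is pyconll's API or the code from the fork mentioned above.

```python
from typing import Dict, Iterable, Iterator, List, Sequence


def read_conll_like(
    lines: Iterable[str],
    columns: Sequence[str],
    empty: str = "_",
) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts.

    Hypothetical helper: blank lines separate sentences, lines starting
    with '#' are skipped as comments, and rows are never rejected.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        # Empty fields become the placeholder instead of raising an error.
        fields = [f if f != "" else empty for f in line.split("\t")]
        # Short rows are padded; zip() silently drops any surplus fields.
        fields += [empty] * (len(columns) - len(fields))
        sentence.append(dict(zip(columns, fields)))
    if sentence:
        yield sentence


# Made-up input with an empty field and a short row.
demo_lines = ["# sent_id = 1\n", "1\tThe\t\tDET\n", "2\tcat\n"]
demo = list(read_conll_like(demo_lines, ("id", "form", "lemma", "upos")))
```

Because the column names are just an argument, a custom variant with extra or renamed columns only needs a different tuple. A stricter reader could raise on short or overlong rows instead of padding and truncating.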
Thanks for the perspective. My approach is similar to what you have already done in your fork: some well-known formats have their specifications already defined in the package, and custom formats can also be defined. Ideally, I want these prepackaged specifications to include the new typing annotations as well, which I already have staged for release with the CoNLL-U-only design and which have been really useful in my own usage for autocomplete in IDEs. The syntax would look something like this for typed annotations (mostly for myself or other contributors to add new support options):
A non-typed option for users would also be available for any newer CoNLL variants. There are still some technical issues with getting this to play nicely with the type annotations, but this is the goal anyway.
As requested in the README, I'd like to leave a note that I'd be very interested in having the library support more than the standard CoNLL-U format.
Specifically, I currently find myself needing to work with all of these formats:
It would be really awesome to have a single library for reading in all these different (yet very similar) file types!