Support for more CoNLL formats #49

Open
mbollmann opened this issue Nov 21, 2019 · 6 comments

@mbollmann

As requested in the README, I'd like to leave a note that I'd be very interested in having the library support more than the standard CoNLL-U format.

Specifically, I currently find myself needing to work with all of these formats:

It would be really awesome to have a single library for reading in all these different (yet very similar) file types!

@matgrioni
Collaborator

Hi Marcel,

Thank you for your issue. I agree that, given the similarities between these file types, it would make sense to have one library that reads all of them. In the past, I could not find definitions of these file types that were as good as the one for CoNLL-U, which is why I focused on that format.

I will take a look at these papers this week to see the differences and get back to you over the weekend with more info!

@mbollmann
Author

Thanks!

You're not wrong that the definitions aren't always precise. What's more, I've encountered several files that don't fully conform to the specs but have been accepted as submissions to shared tasks etc. (A common violation is having empty columns without a _.)

My idea would be to have a ConllFormat class that defines the column names, types, and other relevant features, and use this for reading/writing/instantiating Conll objects. You could predefine some specific ConllFormat instances that correspond to known formats, such as CoNLL-U or CoNLL-X, but if a user encounters a different CoNLL variant in the wild, they could still instantiate their own ConllFormat.
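A minimal sketch of what such a class might look like, just to make the idea concrete. Only the name ConllFormat comes from the suggestion above; the fields and methods (`columns`, `delimiter`, `parse_line`, `format_token`) are assumptions for illustration and not part of pyconll. The column names follow the published CoNLL-U and CoNLL-X column orders.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ConllFormat:
    """Hypothetical description of one CoNLL variant: its column names, in order."""
    name: str
    columns: Tuple[str, ...]
    delimiter: str = "\t"

    def parse_line(self, line: str) -> Dict[str, str]:
        """Split one token line into a {column name: raw value} mapping."""
        values = line.rstrip("\n").split(self.delimiter)
        if len(values) != len(self.columns):
            raise ValueError(
                f"{self.name}: expected {len(self.columns)} columns, got {len(values)}"
            )
        return dict(zip(self.columns, values))

    def format_token(self, token: Dict[str, str]) -> str:
        """Serialize a token mapping back into one line, in column order."""
        return self.delimiter.join(token[name] for name in self.columns)


# Predefined instances for known formats.
CONLLU = ConllFormat(
    "CoNLL-U",
    ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"),
)
CONLLX = ConllFormat(
    "CoNLL-X",
    ("id", "form", "lemma", "cpostag", "postag", "feats", "head", "deprel", "phead", "pdeprel"),
)

# A user-defined variant encountered "in the wild" (made-up example).
MY_NER_FORMAT = ConllFormat("my-ner-variant", ("id", "form", "ner"))
```

With something like this, reading a custom variant would come down to `MY_NER_FORMAT.parse_line("1\tParis\tB-LOC")`, and the same object could drive writing.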

Admittedly, I haven't looked into your codebase enough to understand how big/difficult of a change this would be, but it's the approach I took for my own (very hacky & incomplete :)) handling of these formats for now.

@matgrioni
Collaborator

It took me longer than I expected to get through your links. I have not thought about it in great depth, but your suggestion sounds like a natural step toward a final solution (especially considering the flexibility in the CoNLL-U Plus format).

This also matches the Python csv library, which lets you create your own dialects to specify formatting changes that are cosmetic or identifier-related and easily swappable.
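For reference, this is roughly how dialects work in the standard library's csv module. The dialect name "pipes" and the helper `read_rows` are made up for this example; `csv.Dialect`, `csv.register_dialect`, and the built-in "excel-tab" dialect are part of the standard library.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """A custom dialect: same data model as any CSV, different surface syntax."""
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


# Register under a name so it can be swapped in wherever a dialect is accepted.
csv.register_dialect("pipes", PipeDialect)


def read_rows(text, dialect):
    """Parse text into a list of rows using the named dialect."""
    return list(csv.reader(io.StringIO(text), dialect=dialect))
```

The reading code stays the same and only the dialect name changes, for example `read_rows(text, "pipes")` versus `read_rows(text, "excel-tab")`. A ConllFormat could play the same role for CoNLL variants.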

For me the interesting case here is CoNLL-UL, which encodes a very rich representation (seemingly a DAG rather than a simple tree). My current approach to such rich alternatives is to provide an immutable structure for viewing and processing, while any writes have to go through the serial, flat structure. Currently, Sentence has to be used for any changes and is a flat list of tokens, but a tree can easily be created from it using to_tree. For CoNLL-UL, I can imagine there may be some very useful operations to define on the DAG, and also for modifying the sentence through the DAG. I would like to look more into this case before making changes, to see how the community uses it.
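To illustrate the split between a flat, writable structure and an immutable view, here is a simplified sketch. The `Token` and `TreeNode` classes and this standalone `to_tree` function are stand-ins written for this comment, not pyconll's actual classes, and the sketch only covers the single-root tree case, not a DAG.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    """Mutable token in the flat sentence; head is the parent's id, "0" for the root."""
    id: str
    form: str
    head: str


@dataclass(frozen=True)
class TreeNode:
    """Immutable view of one token and its dependents."""
    token_id: str
    form: str
    children: Tuple["TreeNode", ...]


def to_tree(tokens: List[Token]) -> TreeNode:
    """Build a read-only tree from a flat token list by grouping tokens under their head."""
    children_of: Dict[str, List[Token]] = {}
    for tok in tokens:
        children_of.setdefault(tok.head, []).append(tok)

    roots = children_of.get("0", [])
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")

    def build(tok: Token) -> TreeNode:
        kids = tuple(build(child) for child in children_of.get(tok.id, []))
        return TreeNode(tok.id, tok.form, kids)

    return build(roots[0])
```

Edits go through the flat list (for example `sentence[0].form = "Cats"`), after which the tree is rebuilt. Assigning to an attribute of a `TreeNode` raises an error because the dataclass is frozen.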

Since you are the first to ask for this support, I will also ask whether there is one of these formats that would be most useful to you for pyconll to support. That way, if I run into time constraints, I can focus on that one for the time being :).

@matgrioni self-assigned this Oct 6, 2020
@matgrioni added this to To do in main via automation Oct 6, 2020
@matgrioni
Collaborator

@mbollmann

I have (finally) started to get around to this issue, and am hoping to have it released at the end of next month along with a few other new advanced features. I am curious what your usage of these formats is, however, so that I can better understand their status and know what to include by default.

For example, it seems like CoNLL-UL has not been updated since 2018 and there's not much activity on their pages or the project. At face value, this project doesn't seem active in the community, although this could be a mistaken impression. If you have more information on this, that would be great, but given my delays, it's very possible you are working on other projects nowadays :).

@mbollmann
Author

Thanks @matgrioni! I'm indeed not actively working with this right now, but I've used a hot-patched clone of your library in a paper that will soon be published, and I'm sure I'll work with CoNLL-* files again in the future! :)

My use case was reading a variety of CoNLL-ish files coming from a variety of different research projects. It appears that many people use a CoNLL-* format as a starting point for representing their datasets, but then make custom modifications to fit their specific annotation needs. In order to parse and work with these files, I needed a parser that is as flexible and tolerant as possible when it comes to the types of columns that are in the data (as opposed to, say, enforcing strict compliance with a given file format spec).
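As a rough illustration of what "flexible and tolerant" could mean in practice, here is a sketch of a lenient line parser. The function name `parse_lenient` and the `_extra` key are invented for this example. It accepts empty columns without a `_`, pads missing trailing columns, and keeps surplus columns instead of failing. Note that it also maps a literal `_` in any column to None, which a real parser would likely want to make configurable per column.

```python
from typing import Dict, Sequence


def parse_lenient(line: str, columns: Sequence[str], empty: str = "_") -> Dict[str, object]:
    """Parse one tab-separated token line without enforcing a strict column count."""
    values = line.rstrip("\n").split("\t")
    token: Dict[str, object] = {}

    for i, name in enumerate(columns):
        # Missing trailing columns are treated the same as empty ones.
        raw = values[i] if i < len(values) else ""
        # Both a bare empty string and the empty marker mean "no value".
        token[name] = None if raw in ("", empty) else raw

    # Keep any columns beyond the declared ones rather than raising an error.
    extras = values[len(columns):]
    if extras:
        token["_extra"] = extras

    return token
```

For example, `parse_lenient("1\tDogs\t\tNOUN", ["id", "form", "lemma", "upos", "xpos"])` yields None for both the empty `lemma` column and the missing `xpos` column.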

@matgrioni
Collaborator

Thanks for the perspective. My approach is something similar to what you have already done in your fork: some well-known formats have their specifications already defined in the package, but custom formats can also be defined. Ideally, I want these prepackaged specifications to include the new typing annotations as well, which I already have staged for release with the CoNLL-U-only design and which have been really useful in my own usage for autocomplete in IDEs.

The syntax would look something like this for typed annotations (mostly for myself or other contributors to add new support options):

class ConlluToken:
  id = Str()
  form = NullableStr('_')
  lemma = NullableStr('_')
  ...
  features = Map('|', '=')
  ...
  enhanced = Collection('|', Tuple(4, ','))
  misc = Map('|', '=', empty_value=True)

A non-typed option for users would also be available for any newer conll variants. There are still some technical issues with getting this to play nicely with the type annotations, but this is the goal anyway.
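To give a feel for how declarations like these could drive parsing, here is a toy sketch. It is not the staged pyconll implementation: the `Field` and `TokenSchema` base classes, the `parse` methods, and the `MiniToken` example are all assumptions, and only a simplified `Str`, `NullableStr`, and `Map` are shown (`Collection`, `Tuple`, and the `empty_value` option are left out).

```python
from typing import Any, Dict, Optional


class Field:
    """Base class for column declarations; subclasses turn raw text into a value."""
    def parse(self, raw: str) -> Any:
        raise NotImplementedError


class Str(Field):
    def parse(self, raw: str) -> str:
        return raw


class NullableStr(Field):
    def __init__(self, empty_marker: str) -> None:
        self.empty_marker = empty_marker

    def parse(self, raw: str) -> Optional[str]:
        return None if raw == self.empty_marker else raw


class Map(Field):
    def __init__(self, pair_sep: str, kv_sep: str) -> None:
        self.pair_sep = pair_sep
        self.kv_sep = kv_sep

    def parse(self, raw: str) -> Dict[str, str]:
        if raw == "_":  # assumption: "_" marks an empty map
            return {}
        result = {}
        for pair in raw.split(self.pair_sep):
            key, _, value = pair.partition(self.kv_sep)
            result[key] = value
        return result


class TokenSchema:
    """Collects Field attributes in declaration order and parses lines with them."""
    @classmethod
    def fields(cls) -> Dict[str, Field]:
        return {name: f for name, f in vars(cls).items() if isinstance(f, Field)}

    @classmethod
    def parse(cls, line: str) -> Dict[str, Any]:
        raw_values = line.rstrip("\n").split("\t")
        fields = cls.fields()
        if len(raw_values) != len(fields):
            raise ValueError(f"expected {len(fields)} columns, got {len(raw_values)}")
        return {name: f.parse(raw) for (name, f), raw in zip(fields.items(), raw_values)}


class MiniToken(TokenSchema):
    id = Str()
    form = NullableStr("_")
    features = Map("|", "=")
```

In this sketch, a non-typed variant would simply declare every column as `Str()`, so `MiniToken.parse("1\tDogs\tNumber=Plur")` and an untyped schema share the same machinery.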

@matgrioni moved this from To do to In progress in main Feb 25, 2021
@matgrioni moved this from In progress to To do in main Oct 18, 2021