Rerunning frog on already frogged FoliA #70

Open
kosloot opened this issue May 27, 2019 · 2 comments

@kosloot
Collaborator

kosloot commented May 27, 2019

Frog now assigns provenance data to FoLiA, which, among other things, allows us to detect a rerun of (parts of) Frog on a FoLiA document. BUT:
Handling this is quite risky and needs a lot of thought.

  • assigning useful ID's to all provenance information of the several tools
  • Does a rerun of one or more parts (like MBLEM or NER) mean an extra sub-processor under the old frog processor OR do we add a new Frog-processor?
  • tokenization is a special case, see "Rerunning ucto on already tokenized FoLiA" (ucto#68)
  • etc.

As I don't want to postpone the FoLiA 2.0 Release, I suggest for the time being to just FORBID running Frog again on FoLiA with frog provenance data. That will not break existing cases, and will for sure NOT introduce artifacts that would bother us in the future.
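
A minimal sketch of what such a guard could look like, written in Python for illustration only (Frog itself is not implemented this way). It assumes that provenance is stored as `<processor>` elements with a `name` attribute in the FoLiA namespace; the function names `has_frog_provenance` and `check_rerun_allowed` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # FoLiA XML namespace

# Hypothetical, heavily trimmed FoLiA document used only for this sketch.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0">
  <metadata>
    <provenance>
      <processor xml:id="ucto.1" name="ucto"/>
      <processor xml:id="frog.1" name="frog">
        <processor xml:id="mblem.1" name="mblem"/>
      </processor>
    </provenance>
  </metadata>
</FoLiA>"""


def has_frog_provenance(folia_xml: str, tool_name: str = "frog") -> bool:
    """Return True if any <processor> element, at any depth, names the tool."""
    root = ET.fromstring(folia_xml)
    tag = f"{{{FOLIA_NS}}}processor"
    return any(p.get("name", "").lower() == tool_name for p in root.iter(tag))


def check_rerun_allowed(folia_xml: str) -> None:
    """Refuse to continue when the document already carries Frog provenance."""
    if has_frog_provenance(folia_xml):
        raise RuntimeError("input already contains Frog provenance; rerun is forbidden")
```

The check only looks at the tool name, so a document that was merely tokenized by ucto would still be accepted.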

@proycon Any comments?

@kosloot kosloot added this to the Folia 2.0 milestone May 27, 2019
@proycon
Member

proycon commented May 27, 2019

We do have to consider the realistic case where somebody only runs certain modules of Frog (say PoS-tagging and lemmatisation) and someone else at a later stage wants to add something else, like NER or parsing.

> assigning useful ID's to all provenance information of the several tools

The random component strategy I use is easy and works well for that.
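
The thread does not spell out the exact format of that random component, so the following is only a sketch of the general idea: a tool name followed by a random suffix. The function name `random_processor_id` and the `tool.hex` layout are assumptions made for illustration.

```python
import secrets


def random_processor_id(tool: str, nbytes: int = 4) -> str:
    """Build an ID from the tool name plus a random hexadecimal suffix."""
    # token_hex(nbytes) yields 2 * nbytes hex characters.
    return f"{tool}.{secrets.token_hex(nbytes)}"
```

Collisions are unlikely with such IDs, but two runs on the same input will not produce the same ID.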

> Does a rerun of one or more parts (like MBLEM or NER) mean an extra sub-processor under the old frog processor OR do we add a new Frog-processor?

Add a new one, since it's a completely new Frog run, which can be done at a very different time, by a very different user, and on a very different machine than the older one.
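
As a rough illustration of that choice, the sketch below models provenance as a list of top-level processors, each with its own sub-processors. The `Processor` class, the `record_frog_run` helper, and the sub-processor ID layout are hypothetical and do not reflect any actual Frog or FoLiA library API.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    id: str
    name: str
    children: list["Processor"] = field(default_factory=list)


def record_frog_run(provenance: list[Processor], run_id: str,
                    modules: list[str]) -> Processor:
    """Append a new top-level Frog processor; earlier runs stay untouched."""
    # Placeholder ID layout for sub-processors, chosen only for this sketch.
    subs = [Processor(f"{run_id}.{m}", m) for m in modules]
    run = Processor(run_id, "frog", subs)
    provenance.append(run)
    return run


provenance: list[Processor] = []
record_frog_run(provenance, "frog.1", ["mbpos", "mblem"])  # first run
record_frog_run(provenance, "frog.2", ["ner"])             # later rerun adding NER
```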

> As I don't want to postpone the FoLiA 2.0 Release, I suggest for the time being to just FORBID running Frog again on FoLiA with frog provenance data. That will not break existing cases, and will for sure NOT introduce artifacts that would bother us in the future.

Yes, as a temporary solution that seems quite acceptable; it will take some time for users to run into this issue anyway.

@kosloot
Collaborator Author

kosloot commented May 27, 2019

OK, so running a new (partial) Frog run requires a new provenance record.

Regarding the IDs:

> The random component strategy I use is easy and works well for that.

I already started implementing along these lines, but a big disadvantage is reproducibility. Every run on a given input will generate different IDs, which makes it nearly impossible to check for REAL differences in (integration) tests.

So my solution will probably be to register the IDs in use while processing the document.
Frog will force these IDs into a fixed format, probably 'tool.n', e.g. 'ucto.2' or 'MBLEM.4'.
When new provenance is added, we take the highest ID already used for that tool and increment it by 1.
In this way processing will be deterministic.
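
The scheme described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the eventual Frog implementation: the class name `IdRegistry` and its methods are hypothetical, and IDs that do not match the 'tool.n' pattern are simply ignored.

```python
import re


class IdRegistry:
    """Tracks the highest 'tool.n' counter per tool and hands out the next one."""

    _PATTERN = re.compile(r"^(?P<tool>.+)\.(?P<n>\d+)$")

    def __init__(self) -> None:
        self._highest: dict[str, int] = {}

    def register(self, existing_id: str) -> None:
        """Record an ID already present in the document; skip other formats."""
        m = self._PATTERN.match(existing_id)
        if m:
            tool, n = m["tool"], int(m["n"])
            self._highest[tool] = max(n, self._highest.get(tool, 0))

    def next_id(self, tool: str) -> str:
        """Return the next free ID for a tool: highest seen so far plus one."""
        n = self._highest.get(tool, 0) + 1
        self._highest[tool] = n
        return f"{tool}.{n}"


registry = IdRegistry()
for seen in ["ucto.1", "ucto.2", "MBLEM.3"]:
    registry.register(seen)

print(registry.next_id("ucto"))   # ucto.3
print(registry.next_id("MBLEM"))  # MBLEM.4
print(registry.next_id("NER"))    # NER.1
```

Because the next ID depends only on the IDs already in the document, the same input always yields the same output, which keeps test comparisons stable.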

Within the Frog pipeline this won't be a problem. Other tools might use a different scheme for IDs, but that will be opaque to Frog.

Although this seems relatively easy to implement, I will keep this on the wish-list for after the 2.0 release.
That will also give us time to think about a nice way to have Frog neatly (re)do just one module.
