Rerunning frog on already frogged FoliA #70

kosloot · 2019-05-27T13:40:40Z

Frog now assigns provenance data to FoLiA, which among other things allows us to detect a rerun of (parts of) Frog on a FoLiA document. BUT:
Handling this is quite dangerous and needs a lot of thinking.

assigning useful ID's to all provenance information of the several tools
Does a rerun of one or more parts (like MBLEM or NER) mean an extra sub-processor under the old frog processor OR do we add a new Frog-processor?
tokenization is a special case; see "Rerunning ucto on already tokenized FoLiA" (ucto#68)
etc
As I don't want to postpone the FoLiA 2.0 Release, I suggest for the time being to just FORBID running Frog again on FoLiA with frog provenance data. That will not break existing cases, and will for sure NOT introduce artifacts that would bother us in the future.
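A minimal sketch of what such a guard could look like, for illustration only. It assumes that provenance is stored as `processor` elements carrying a `name` attribute, and that a Frog run is recognisable by the name `frog`; the function name `has_frog_provenance` and the sample document are hypothetical and are not taken from Frog's code.

```python
import xml.etree.ElementTree as ET


def has_frog_provenance(folia_xml: str, tool_name: str = "frog") -> bool:
    """Return True if any processor element is named `tool_name`.

    Assumption: provenance lives in elements whose local tag name is
    'processor' and whose 'name' attribute identifies the tool.
    """
    root = ET.fromstring(folia_xml)
    for elem in root.iter():
        # Strip a possible '{namespace}' prefix to get the local tag name.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "processor" and elem.get("name", "").lower() == tool_name:
            return True
    return False


# Hypothetical, heavily reduced document used only to exercise the check.
SAMPLE_FROGGED = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0">
  <metadata>
    <provenance>
      <processor xml:id="frog.1" name="frog">
        <processor xml:id="mblem.1" name="mblem"/>
      </processor>
    </provenance>
  </metadata>
</FoLiA>"""

SAMPLE_TOKENIZED_ONLY = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0">
  <metadata>
    <provenance>
      <processor xml:id="ucto.1" name="ucto"/>
    </provenance>
  </metadata>
</FoLiA>"""

if __name__ == "__main__":
    for label, doc in [("frogged", SAMPLE_FROGGED),
                       ("tokenized only", SAMPLE_TOKENIZED_ONLY)]:
        verdict = "refuse" if has_frog_provenance(doc) else "accept"
        print(f"{label}: {verdict}")
```

A document that was only tokenized would still be accepted under this check, which matches the intent of not breaking existing cases.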

@proycon Any comments?

proycon · 2019-05-27T16:42:52Z

We do have to consider the realistic case where somebody only runs certain modules of Frog (say PoS-tagging and lemmatisation) and someone else at a later stage wants to add something else, like NER or parsing.

assigning useful ID's to all provenance information of the several tools

The random component strategy I use is easy and works well for that.

Does a rerun of one or more parts (like MBLEM or NER) mean an extra sub-processor under the old frog processor OR do we add a new Frog-processor?

Add a new one, since it's a completely new Frog run, which can be done at a very different time and by a very different user on a very different machine than the older one.

As I don't want to postpone the FoLiA 2.0 Release, I suggest for the time being to just FORBID running Frog again on FoLiA with frog provenance data. That will not break existing cases, and will for sure NOT introduce artifacts that would bother us in the future.

Yes, as a temporary solution that seems quite acceptable. It will take some time for users to run into this issue anyway.

kosloot · 2019-05-27T20:25:07Z

Ok, so running a new (part of) Frog requires a new provenance record.

Regarding the ID's

The random component strategy I use is easy and works well for that.

I already started implementing along these lines, but:
A big disadvantage is the loss of reproducibility. Every run on a given input will generate different ID's, which makes it nearly impossible to check for REAL differences in (integration) tests.

So my solution will probably be to register the used ID's while processing the document.
Frog will force these ID's into a certain format. Probably 'tool.n' so 'ucto.2' or 'MBLEM.4'
When new provenance is added, we use the highest id for a tool and increment by 1.
In this way processing will be deterministic.

Within the Frog pipeline this won't be a problem. Other tools might use a different scheme for ID's, but that will be opaque to Frog.
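The scheme described above could be sketched roughly as follows. This is an illustration under stated assumptions, not Frog's implementation: the class name `ProcessorIdRegistry` and its methods are hypothetical, and ID's that do not match the 'tool.n' format are simply ignored, which is one possible reading of "opaque to Frog".

```python
import re


class ProcessorIdRegistry:
    """Tracks the highest 'tool.n' number seen per tool (hypothetical helper)."""

    _PATTERN = re.compile(r"(.+)\.(\d+)")

    def __init__(self, existing_ids=()):
        self._highest = {}
        for pid in existing_ids:
            self.register(pid)

    def register(self, pid: str) -> None:
        """Record an ID already present in the document."""
        match = self._PATTERN.fullmatch(pid)
        if match is None:
            # ID from another scheme: treated as opaque and skipped.
            return
        tool, number = match.group(1), int(match.group(2))
        self._highest[tool] = max(self._highest.get(tool, 0), number)

    def next_id(self, tool: str) -> str:
        """Return the next ID for `tool`: highest known number plus one."""
        number = self._highest.get(tool, 0) + 1
        self._highest[tool] = number
        return f"{tool}.{number}"


if __name__ == "__main__":
    # ID's found in a document, including one from a foreign scheme.
    registry = ProcessorIdRegistry(["ucto.1", "ucto.2", "MBLEM.3", "proc.f3a9"])
    print(registry.next_id("ucto"))
    print(registry.next_id("MBLEM"))
    print(registry.next_id("NER"))
```

Because the next ID depends only on the ID's already registered for the document, two runs on the same input produce the same ID's, which is what makes the output comparable in tests.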

Although this seems relatively easy to implement, I will keep this on the wish-list for after the 2.0 release.
That will also give us time to think about a nice way to have Frog neatly (re)do just a module.

kosloot added this to the Folia 2.0 milestone May 27, 2019

kosloot assigned proycon and kosloot May 27, 2019
