
refreshing the schemas: freeze the p5subset, add it to our vc, update the syntax in the ODD #62

Open
bansp opened this issue Jan 4, 2021 · 4 comments
Assignees
Labels
schema Things related to the TEI Schemata/ODD

Comments

@bansp
Member

bansp commented Jan 4, 2021

I would like to update the existing ODD in two steps. This ticket covers the first and gentler one: rewriting the current ODD in the current TEI idiom. Ideally that would be a purely cosmetic change that does not affect the extension (i.e., the patterns/grammars defined by RNG, XSD, DTD). In practice, the extension is going to be affected due to the changes in the TEI that have happened over the years, so some tinkering may be in order, along with a lot of test runs across all the databases.

In doing that, I would like to add two files to our version control, for strictly internal purposes: they would let us trace changes in the TEI internals without having to investigate the git history of the TEI itself each time.

Let me sketch some background:

  • the TEI ODD mechanism is in essence a customization / documentation mechanism that targets a set of all the definitions encoded by the TEI Guidelines.
  • that set is not present in a cloned TEI repository, but rather gets derived by the make system (via TEI Stylesheets, which is a set of tools that accompanies the TEI Guidelines) and resides in a cryptically named document called p5subset. It is called an 'integrated ODD'.
  • any typical ODD document created with the appropriate TEI tools is meant to tailor the integrated ODD down to a particular purpose: manuscript description, corpus encoding, dictionary encoding, etc.
  • the application of the Freedict ODD to the integrated ODD (p5subset) silently creates something that can be called Freedict integrated ODD; it is not visible to the outside eyes, because it is regenerated each time that the Freedict ODD is manipulated by the TEI Stylesheets.
  • the 'Freedict integrated ODD' is used (or rather: was used) to derive the schema documents: RNG (of primary use for us), but also XSD and DTD (which we provide more or less out of courtesy -- but I can imagine us not providing these two, to avoid having to address the potential issues if someone decides to use those instead of the RNG)
  • I stress the "was used" because, simplifying the history slightly, that happened once, years ago: I ran the TEI tools on the current Freedict ODD and created the three schema documents. Note the crucial issue: they were run on the p5subset as it was defined by the TEI years ago. So while the Freedict ODD hasn't been modified since then, the result of applying it to the current p5subset is going to be extensionally different from what was used years ago. I don't think it's a major issue (because we only use a very small subset of the TEI), but it's definitely something to be aware of.
  • one more relevant issue and an argument for 'freezing' the p5subset in our version control is that, if one doesn't have full control of the TEI environment, their ODDs may reference the current 'blessed' TEI ODD, recreated after each release in the TEI Vault, or the current snapshot of the TEI under control of their Jenkins environment, or the local p5subset on the user's hard drive; what I propose reduces this potential complexity and adds a lot of transparency.

A hopefully minor complication is that our RNG was edited by hand since it got derived. Since it is version-controlled, I can extract the modifications and reapply them at the ODD level.
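A minimal sketch of how the hand edits could be isolated, assuming two revisions of the RNG have already been exported from version control as text (for instance the originally derived revision and the current one). The function name and the file labels are hypothetical and only serve the illustration.

```python
import difflib


def hand_edits(derived_rng: str, committed_rng: str) -> list[str]:
    """Return unified-diff lines from the derived RNG to the hand-edited one."""
    return list(
        difflib.unified_diff(
            derived_rng.splitlines(),
            committed_rng.splitlines(),
            fromfile="derived.rng",      # hypothetical label, not a real path
            tofile="committed.rng",      # hypothetical label, not a real path
            lineterm="",
        )
    )
```

Lines starting with `+` or `-` in the result are the candidates for reapplying at the ODD level; an empty list means the two revisions are textually identical.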

Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is how to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter and watch for error messages. @humenda, do you sense any trouble in this regard, please?
EDIT: this is now the topic of freedict/tools#28 and I have an interim solution
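For illustration only, a check of this kind could look roughly like the sketch below. It is not the Freedict 'validate' target and not the interim solution mentioned above. It assumes that `xmllint` is installed and that dictionary sources are `*.tei` files somewhere below a common root; all function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_dictionaries(root: Path) -> list[Path]:
    """Collect every *.tei file below root, sorted for stable output."""
    return sorted(root.rglob("*.tei"))


def validation_command(rng: Path, tei: Path) -> list[str]:
    """Build an xmllint call that validates one file against a RELAX NG schema."""
    return ["xmllint", "--noout", "--relaxng", str(rng), str(tei)]


def validate_all(root: Path, rng: Path) -> list[Path]:
    """Run the validator on each dictionary and return the files that failed."""
    failures = []
    for tei in find_dictionaries(root):
        result = subprocess.run(
            validation_command(rng, tei), capture_output=True, text=True
        )
        if result.returncode != 0:  # xmllint exits non-zero on invalid input
            failures.append(tei)
    return failures
```

Returning the list of failing files, rather than stopping at the first error, makes it easier to see at a glance how many databases a schema change affects.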

I mentioned adding two files to the version control. I meant the current p5subset and the Freedict integrated ODD (call it... freedict_p5subset?). The first one freezes the current state of the TEI, so that, in the future, we can diff against it. The second is to expose the Freedict integrated ODD for similar comparisons. I could probably live without the latter, since it depends on the former, but it also depends on the TEI Stylesheets, and those are under constant development as well. Bottom line: when one has to investigate some schema-related issue across time, it's far more convenient to have both these files handy, because both of them can only be recreated in the future after tinkering with two very dynamic repositories (TEI Guidelines and TEI Stylesheets).
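As a rough illustration of the kind of comparison that frozen snapshots make possible, the sketch below lists which element definitions appear in one snapshot but not in another. It assumes the snapshots contain `elementSpec` elements with an `ident` attribute in the TEI namespace; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def element_idents(root: ET.Element) -> set[str]:
    """Collect the @ident of every elementSpec found below root."""
    return {
        spec.get("ident")
        for spec in root.iter(f"{{{TEI_NS}}}elementSpec")
        if spec.get("ident") is not None
    }


def compare_snapshots(old: ET.Element, new: ET.Element) -> dict[str, list[str]]:
    """Report element definitions removed from or added to the newer snapshot."""
    old_ids, new_ids = element_idents(old), element_idents(new)
    return {
        "removed": sorted(old_ids - new_ids),
        "added": sorted(new_ids - old_ids),
    }
```

For files on disk, the roots could be obtained with `ET.parse(path).getroot()`. This only compares the inventory of element names; changes inside a definition would still need a textual diff.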


Envisioned action sequence:

  1. derive the current p5subset (on my disk, against the current snapshot of the TEI and TEI Stylesheets)
  2. freeze the p5subset by adding it to Freedict version control (where? under shared/ or elsewhere?)
  3. derive the current freedict_p5subset by using the current Freedict ODD, with one change: its @source attribute will now point at the p5subset frozen at step (2)
  4. derive the RNG and check if all the databases validate against the RNG
  5. freeze the newly derived freedict_p5subset next to the p5subset; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember to do that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
  6. rewrite the current Freedict ODD, just for the syntactic sugar
  7. (recurring step) derive the RNG and check if all the databases validate against the RNG
  8. commit the newly created freedict_p5subset just to document any modifications that could have crept in at step (6)
  9. check our RNG version history for potential modifications introduced by hand, and see if they need to be handled at the ODD level (it might be that the underlying TEI has caught up with them during the years that have passed); if an ODD rewrite is necessary, then repeat steps (7) and (8)

At this point, after all the above actions, we should still be at the status quo, except with (a) two new files, kept for reproducibility checks, and (b) a newer Freedict ODD, ready to be modified further.
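Step (3) of the sequence above amounts to a one-attribute change in the ODD. The sketch below shows it programmatically, assuming the `@source` attribute in question sits on the ODD's `schemaSpec` element; the function name and the example path are hypothetical. Note that ElementTree drops comments and may rename prefixes of other namespaces used in the ODD, so editing the attribute by hand is likely the cleaner route in practice.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def point_at_frozen_subset(odd_xml: str, frozen_path: str) -> str:
    """Set @source on every schemaSpec so it refers to the frozen p5subset."""
    # Keep TEI as the default namespace when serialising.
    ET.register_namespace("", TEI_NS)
    root = ET.fromstring(odd_xml)
    for spec in root.iter(f"{{{TEI_NS}}}schemaSpec"):
        spec.set("source", frozen_path)  # frozen_path is an assumed location
    return ET.tostring(root, encoding="unicode")
```

With the attribute in place, the derivation no longer depends on whichever p5subset happens to be the default in the local TEI environment.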

@bansp bansp added the schema Things related to the TEI Schemata/ODD label Jan 4, 2021
@bansp bansp self-assigned this Jan 5, 2021
@humenda
Member

humenda commented Jan 9, 2021 via email

@bansp
Member Author

bansp commented Jan 11, 2021

Replying to specific points:

> BTW, if the schemas were in tools/, we wouldn't need to copy / symlink the
> schemas to each dictionary, but they were part of the tooling. Does this sound
> sensible? If so, I would like to make this shift at some point.

I don't think document grammars should be seen as part of the tooling. The ODD and the schemas are what provides the semantic and syntactic rules for the interpretation of dictionary documents. I would definitely advise keeping them within the fd-dictionaries repository and either symlinking them, the way it's done now, or making the dictionaries point to the shared/ directory to identify the schema. I have just posted #66 to outline that. [EDIT: I would be completely comfortable (or even outright happy) with scratching issue #66 and maintaining the current status quo]

> Please go ahead if you have your plan :)

Thanks :-) I understand that some of the above may be unclear (and I think I will reduce the procedure somewhat, to save some time), but indeed, I'm going to work on that in a separate branch, so nothing will be affected until I'm finished and it looks good.

@bansp
Member Author

bansp commented Jan 12, 2021

Trying to keep the off-topic discussion to a single ticket, I am reposting Sebastian's comment from elsewhere. I am not sure if Sebastian had read my reply above before posting that comment.

> I asked in another issue about including the schemas with the tooling. To what
> respect is this not optimal? A dictionary should be buildable with a certain
> version of the tooling. Eng-deu in 0.1 required the possibly oldest version of
> freedict-tools, not versioned back then. eng-deu 1.8.1 requires fd-tools
> 0.5.0. It looks natural to me to include the schemas in each version of the
> tools.

Tools operate on the semi-structured databases (as our XML dictionaries can be treated) in many cases thanks to the document grammars that flesh out the semantics of the particular components or regulate the relationships between components.

Think very early HTML with all the styling info inside. Separating the styling info into CSS leaves us with a skeleton that the styling information from the CSS attaches to. You need to put the two together in order to receive a pleasant, readable web page.
By rough analogy, you need to put bare XML and its schema together to be sure how to interpret the given semi-structured database. They belong together.

If you were to take the schemas away, you would only leave part of the relevant information in fd-dictionaries. They would be half-useless as XML documents, until the schemas were located or (imperfectly) inferred from the existing structure. There is nothing natural at all in snatching schemas away from the dictionary documents. I don't think it is a good approach for an open project to say, "fd-dictionaries contain bare XML documents; in order to make them meaningful, you have to install the other repository as well". That just isn't user-friendly.
fd-dictionaries in the current form (with schemas) do not require the fd-tools in order to become useful to people who do not wish to build distribution packages. They can safely exist on their own and the shared/ directory contains enough information (even if some of it is outdated) to get people started using or even fixing or extending fd-dictionaries with an XML editor. fd-tools make fd-dictionaries even more valuable, but they are not essential for fd-dictionaries to function on their own, if fd-documents are accompanied by their schema and their ODD.[1]

The TEI ODD makes the connection between the XML documents and schemas even more explicit, and it is my fault that I have not maintained our ODD for a long time and have failed to exploit some of its features. I intend to remedy that situation, and this ticket outlines the first steps towards that goal.

Looping back to the beginning of this particular comment: I believe that there exist good arguments for keeping schemas in fd-dictionaries rather than in fd-tools. I would like to suggest that we maintain the status quo in this regard, and don't try to fix something that is not broken.


[1] A minor note: it is part of the TEI compliance requirements that, in order to qualify as a TEI document, an XML document has to be (among other things) accompanied by the ODD document that defines its schema. But I believe that my argument above stands even without this further detail.

@humenda
Member

humenda commented Jan 12, 2021 via email
