xml2ccg #20

shoeffner · 2018-11-29T16:32:17Z

Note: This PR relies on #18 and #19 and thus contains the same commits as well. Once those are merged, it will be slightly smaller. I can also rebase/squash etc. for a shorter history.

xml2ccg

This PR introduces a script xml2ccg, which is roughly the inverse of ccg2xml.
Since the recommended way to edit a grammar is through its ccg file rather than by fiddling with the xml files, the tool should be seen only as a one-off way to regenerate a lost ccg file.

I am looking forward to your review and feedback!

Changelog

Features

xml2ccg.py: a new script to create a ccg file from a directory containing the appropriate grammar xml files. It comes with the same xml2ccg and xml2ccg.bat convenience scripts as ccg2xml. Just like ccg2xml.py, it is copied to the bin directory using the ccg-build process. However, it is not auto-generated.

xml2ccg is tested as follows:

1. Each available ccg grammar (arabic.ccg, tiny.ccg, tinytiny.ccg, grammar_template.ccg, inherit.ccg) was converted to its xml counterpart and put into the test/ccg2xml directory.
   1a. An additional hand-crafted grammar (diaspace, LGPL 2.1+) is used, although no original ccg files exist anymore.
2. The test_xml2ccg.py generates a ccg file from each xml directory and then generates a new xml directory from that temporary directory.
3. The original xml directory and the new xml directory are compared (except for the properties of the root elements and the grammar.xml's file attributes).
4. While this works well for all ccg2xml generated grammars, for the hand-crafted grammar a few looser rules are needed:

- The newly generated grammar is allowed to have more entries. It is possible that some implicit macro definitions were not explicitly written by hand, while the ccg2xml generator adds those. This is especially the case for the types.xml, which lists all macro types explicitly when generated via ccg2xml, while the hand-crafted variant only contains ontology types.
- The ccg2xml tool has a few small inconsistencies with the documentation in tiny.ccg for handling certain situations, especially how case<0>: acc0:p-case; is converted. According to tiny.ccg it should become:
```
<macro name="@Acc0">
    <fs id="0" attr="case" val="p-case"/>
</macro>
```
but instead becomes
```
<macro name="@Acc0">
  <fs id="0">
    <feat attr="case" val="p-case" />
  </fs>
</macro>
```
These two variants, however, encode the same feature: attribute case with value p-case on the feature structure with id 0. So for the final comparison in the test, a macro/fs/feat with val != None is treated the same way as a macro/fs.
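The equivalence of the two macro variants can be illustrated with a minimal sketch. It is not the code used in test_xml2ccg.py; the helper name `flatten_fs` is hypothetical, and the sketch assumes that only the `id`, `attr`, and `val` attributes matter for the comparison.

```python
import xml.etree.ElementTree as ET


def flatten_fs(macro):
    """Return sorted (id, attr, val) triples for every <fs> in a <macro>.

    Hypothetical helper: the flat form <fs id attr val/> and the nested
    form <fs id><feat attr val/></fs> yield the same triples.
    """
    triples = []
    for fs in macro.findall("fs"):
        # Flat variant: attr/val sit directly on the <fs> element.
        if fs.get("attr") is not None:
            triples.append((fs.get("id"), fs.get("attr"), fs.get("val", "")))
        # Nested variant: attr/val sit on a <feat> child with a val attribute.
        for feat in fs.findall("feat"):
            if feat.get("val") is not None:
                triples.append((fs.get("id"), feat.get("attr"), feat.get("val")))
    return sorted(triples)


flat = ET.fromstring(
    '<macro name="@Acc0"><fs id="0" attr="case" val="p-case"/></macro>'
)
nested = ET.fromstring(
    '<macro name="@Acc0"><fs id="0"><feat attr="case" val="p-case"/></fs></macro>'
)

print(flatten_fs(flat) == flatten_fs(nested))  # True
```

Reducing both variants to the same triples keeps the test comparison independent of which form the generator happens to emit.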

Fixes & smaller changes

- ccg2xml discarded multiple entries with similar names, because some complex xml structures were only shallow-copied. These copies are now deep copies (ccg.ply:785, ccg.ply:1881).
- warning_count was not defined or properly used and was thus removed (ccg.ply).
- The executable file permission was removed from various files (ccg.ply, README, arabic.ccg).
- Indentation and whitespace in the build.xml and src/ccg2xml/build.xml were streamlined.
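To illustrate why shallow copies can cause entries to interfere with each other, here is a small sketch. The nested-list structure is purely illustrative and is an assumption; it is not the data structure ccg.ply actually uses.

```python
import copy

# Hypothetical nested structure standing in for a complex xml entry.
entry = ["entry", [("stem", "walk")], ["word", [("form", "walks")]]]

shallow = copy.copy(entry)      # new outer list, but the inner lists are shared
deep = copy.deepcopy(entry)     # fully independent copy

# Modify a nested part of the original entry.
entry[2][1][0] = ("form", "walked")

print(shallow[2][1][0])  # ('form', 'walked'): the shallow copy changed too
print(deep[2][1][0])     # ('form', 'walks'): the deep copy is unaffected
```

With a shallow copy, a later entry that reuses and modifies the nested parts silently changes the earlier one as well, which is consistent with entries appearing to be lost.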

Caveats

ccg2xml ignores macro names when generating xml files and instead uses the entity names, prefixed with an @, as macro names. Thus, an entry MACRO<NOMVAR:MODE>: NAME; results in
```
<macro name="@NAME">
  <lf>
    <satop nomvar="NOMVAR">
      <diamond mode="MODE">
        <prop name="NAME" />
      </diamond>
    </satop>
  </lf>
</macro>
```
instead of using macro name="@MACRO". Thus, in a hand-crafted xml where macro names differ from prop names, the information is converted "properly" to ccg but lost on the conversion back, leading to some strange errors. The only solution to this problem is to change the xml files beforehand, so that the prop names and macro names are already the same (and unique).
The grammar.xml's content is largely ignored: the script assumes all files to be in the same directory instead of following the paths inside grammar.xml.
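The manual preparation described in the first caveat could be sketched roughly as follows. This is not part of xml2ccg; `align_macro_names` is a hypothetical helper, the `<morph>` root in the example is only illustrative, and any references to the old macro names elsewhere in the grammar would still have to be updated by hand using the returned mapping.

```python
import xml.etree.ElementTree as ET


def align_macro_names(root):
    """Rename each LF macro to '@' + its prop name and return the renames.

    Hypothetical helper; raises ValueError if two macros would end up
    with the same name, since the names must stay unique.
    """
    renames = {}
    seen = set()
    for macro in root.iter("macro"):
        prop = macro.find("lf/satop/diamond/prop")
        if prop is None:
            continue  # not an LF macro of the shape discussed above
        new_name = "@" + prop.get("name")
        if new_name in seen:
            raise ValueError(f"macro name {new_name!r} would not be unique")
        seen.add(new_name)
        old_name = macro.get("name")
        if old_name != new_name:
            renames[old_name] = new_name
            macro.set("name", new_name)
    return renames


doc = ET.fromstring(
    '<morph><macro name="@MACRO"><lf><satop nomvar="NOMVAR">'
    '<diamond mode="MODE"><prop name="NAME"/></diamond>'
    '</satop></lf></macro></morph>'
)
print(align_macro_names(doc))  # {'@MACRO': '@NAME'}
```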

@Acc0

case<0>: acc0:p-case; should become

<macro name="@Acc0">
    <fs id="0" attr="case" val="p-case"/>
</macro>

but instead becomes

<macro name="@Acc0">
    <fs id="0">
        <feat attr="case" val="p-case" />
    </fs>
</macro>

This is fine in theory (although tiny.ccg claims the former form is produced), but it causes trouble when comparing the XMLs and when converting back and forth between XML and CCG. However, since xml2ccg's purpose is more or less a one-way recovery, this slight inconsistency is fine and should be permitted by the tests.

@macro

Caveat: If a macro is defined in the form

<macro name="@macro">
    <lf>
        <satop nomvar="NOMVAR">
            <diamond mode="MODE">
                <prop name="NAME" />
            </diamond>
        </satop>
    </lf>
</macro>

the resulting ccg entry looks like this:

MACRO<NOMVAR:MODE>: NAME;

However, ccg2xml drops the macro name and uses the prop name instead:

<macro name="@name">
    <lf>
        <satop nomvar="NOMVAR">
            <diamond mode="MODE">
                <prop name="NAME" />
            </diamond>
        </satop>
    </lf>
</macro>

This is functionally equivalent, as the macro name is only an identifier. However, it can lead to name clashes in certain circumstances as well as issues with unit tests (as the macro names now differ).

shoeffner · 2018-12-05T18:47:00Z

With commit 66aa180 / 2decc3b I identified a couple of problems with the test suite that allowed some wrongly translated xml files to slip through (only the important excerpts are shown below).
I am currently working on a fix for these parses.

arabic morph:

Original:
<fs id="2">
    <feat attr="PERS" val="1st"/>
</fs>

Generated:
<fs attr="PERS" id="2" val="1st"></fs>

arabic lexicon:

Original:
<feat attr="lex" val="[*DEFAULT*]"/>

Generated:
<feat attr="lex" val="*"/>

diaspace lexicon:

Original:
<feat attr="num">
    <featvar name="NUM"/>
</feat>

Generated:
<feat attr="NUM">
    <featvar name="NUM"/>
</feat>

diaspace rules:

Original:
<typeraising dir="forward" useDollar="false">
    <arg>
        <atomcat type="pper"/>
    </arg>
</typeraising>

Generated:
<typeraising dir="forward" useDollar="false"/>

inherit lexicon:

Original:
<feat attr="index">
    <lf>
        <nomvar name="E"/>
    </lf>
</feat>

Generated:
// nothing (other feats are processed, but feats containing lf are not)

tiny exhibits several of the above issues but no new ones.
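A recursive comparison that would catch mismatches like the ones listed above can be sketched as follows. This is a simplified illustration, not the code in test_xml2ccg.py: the helper name `canonical` is hypothetical, element text is ignored, and sorting all children is an assumption that may be too permissive where child order carries meaning.

```python
import xml.etree.ElementTree as ET


def canonical(elem):
    """Return a nested tuple that ignores child order and attribute order."""
    children = sorted(canonical(child) for child in elem)
    return (elem.tag, tuple(sorted(elem.attrib.items())), tuple(children))


# The "arabic morph" excerpt: the generated form differs structurally.
original = ET.fromstring('<fs id="2"><feat attr="PERS" val="1st"/></fs>')
generated = ET.fromstring('<fs attr="PERS" id="2" val="1st"></fs>')
print(canonical(original) == canonical(generated))  # False

# A difference in child and attribute order alone is not reported.
a = ET.fromstring('<type name="x"><p id="1"/><q id="2"/></type>')
b = ET.fromstring('<type name="x"><q id="2"/><p id="1"/></type>')
print(canonical(a) == canonical(b))  # True
```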

shoeffner · 2018-12-07T14:14:00Z

The only remaining problem with the diaspace grammar is now family entries which have features of the following type:

<feat attr="modality">
    <lf>
        <nomvar name="SM:gs-SpatialModality"/>
    </lf>
</feat>

These are currently parsed into [modality], so the information about the nomvar is lost. I am not sure whether modality should even be treated specially like the "index" features, and if so, I guess it would only work with a single uppercase letter as its name.

In either case, I don't know how to represent this in ccg so that it would generate the right output. Maybe the original xml grammar can be changed or this is something the ccg format does not support, while OpenCCG does.

shoeffner · 2018-12-17T14:32:22Z

Similar to the modality attributes mentioned above, the diaspace grammar contains a few index attributes with complex names:

<feat attr="index">
    <lf>
        <nomvar name="GL:gs-GeneralizedLocation"/>
    </lf>
</feat>

Since ccg2xml parses only (so it seems) single uppercase letters properly into index attributes, this index feature gets translated from its current ccg representation

[GL:gs-GeneralizedLocation]

into

<feat attr="GL">
    <featvar name="GL:gs-GeneralizedLocation"/>
</feat>

Is this a limitation of the ccg files? Or are those errors in the grammar which should not be possible in xml either?

These two issues seem ( :-) ) to be the remaining problems for xml2ccg.
Do you have any ideas on how to progress with these?

mwhite14850 · 2018-12-17T16:03:37Z

This may be a limitation of what ccg2xml can parse. But in general the ability to support LF-valued features (beyond the special index feature) is an important part of the native XML grammar format (note that the .ccg format was designed for easier human authoring but was never exhaustively checked against what the native XML format supports). In the flights and comic grammars (under openccg/grammars), LF-valued features are used to propagate the info and owner features from the semantics to the syntax, in order to implement a version of Steedman's theory of communicative structure (theme/rheme and 'kontrast'), which is described in this article [http://aclweb.org/anthology/J10-2001.pdf]. One way to wrap up xml2ccg, of course, would be to emit warnings when a native XML grammar cannot be adequately translated to .ccg; another would be to try to make ccg2xml complete, but that option would not be for the faint of heart.
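The first option, emitting warnings, could be sketched along these lines. This is only an illustration under assumptions taken from the discussion above: `warn_untranslatable` is a hypothetical helper, and the rule that only an index feature with a single uppercase letter round-trips is the observation reported earlier in this thread, not a verified property of ccg2xml.

```python
import re
import warnings
import xml.etree.ElementTree as ET

# Assumption: only single uppercase letters survive as index values.
SIMPLE_INDEX = re.compile(r"[A-Z]")


def warn_untranslatable(root):
    """Warn about LF-valued <feat> elements that may be lost in .ccg."""
    problems = []
    for feat in root.iter("feat"):
        nomvar = feat.find("lf/nomvar")
        if nomvar is None:
            continue  # not an LF-valued feature
        attr, name = feat.get("attr"), nomvar.get("name", "")
        if attr != "index" or not SIMPLE_INDEX.fullmatch(name):
            problems.append((attr, name))
            warnings.warn(
                f"feat {attr!r} with nomvar {name!r} may be lost in .ccg"
            )
    return problems


doc = ET.fromstring(
    '<family>'
    '<feat attr="index"><lf><nomvar name="E"/></lf></feat>'
    '<feat attr="modality"><lf><nomvar name="SM:gs-SpatialModality"/></lf></feat>'
    '</family>'
)
print(warn_untranslatable(doc))  # [('modality', 'SM:gs-SpatialModality')]
```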

shoeffner · 2018-12-17T17:52:49Z

Thank you, I was already afraid that this would be the case. I will consider the options and see if I can find some time over the holidays to implement one or the other.

shoeffner added 30 commits November 18, 2018 09:34

Allowing unicode as a valid string type for lex.py:input.

1177c12

Removing ccg-editor.py in favor of the up-to-date ccg_editor.py

fb64047

Streamlining whitespace in ccg_editor and related build files.

d472650

Additionally removed some outdated imports and comments. One "arg == None" was changed to "arg is None".

Removing trailing whitespace in lex.py.

9ddc011

Converting ccg2xml to Python 3

b5fd0b1

- Using 2to3 and some manual labor - Especially comments can be overlooked - Updating lex.py and yacc.py - Tested the changes with arabic.ccg - The editor buttons are broken with this version

Using self.next instead of self.__next__ for Tree navigation.

b03b7b9

Using ttk to make tkinter work on MacOS.

d102857

It still works fine inside a debian-docker container using XQuartz on the host system. Windows will be tested soon.

Fixes indentation errors from the 2to3 conversion.

bc3fce2

The content of Features and Testbed is presented properly again. Before, only the last items were shown, as the indentation level was one too few.

Call python3 explicitly in ccg2xml

3330877

Adding executables for xml2ccg.

1c4e209

Adding xml2ccg to build.xml.

4a6509a

Adding dummy files for xml2ccg and its test to start tracking them.

1241cbc

Adding readily available test cases.

04a04fe

xml2ccg: Basic program and test structure

9b3902c

Adding testbed conversion.

72773a5

Adding first part of feature xml2ccg conversion.

ebd1388

Adding relation-sorting to feature.

2d9e732

Adding "known failures"-handling (!) to testbed.

c832e04

Adding rules section for ccg files.

29b20a0

Adding default value of 1 for the numOfParses in the testbed, as suggested by tiny.ccg.

3793955

Fixing undefined warning_count

e536b54

Prettyfied XML test output, fixing test case stdin-seek(0) issues.

d09a02e

Ignoring order in type parents during tests.

39de715

Handling multiple inheritance parents in type hierarchy properly.

a6a6775

Adding feature structure ids/syntactic macros to features (the values in <>).

502d24c

Adding ValueError if family/part-of-speech is not found, instead of raising UnboundLocalError for pos in the next line.

d1dd9b9

Adding words declarations.

e69474e

Renaming feature_sec to feature_section to be aligned with other variables.

78cdbf0

Added documentation to Word methods, simplified for loop to map in ccg_words.

19216a1

Adding family/category support.

50f679f

shoeffner added 16 commits November 26, 2018 13:23

Handling additional word attributes "pred", "excluded", and "coart".

4c82b71

Improving family parsing, works for arabic grammar and punctuation now. Adding CategoryParser for complexcat and atomcats.

44b743e

Adding proper rule parsing for typeraise and typechange rules using the CategoryParser.

17a9d17

Removing executable permission from README, arabic.ccg, and ccg.ply.

f23fa08

Sorting of xml tree elements now also takes attributes and their values into account.

3ae471f

Using maybe_quote to avoid issues with . in identifiers.

f9507fd

I tried to keep it to a minimum, but it is possible that more locations are missing. In future iterations, this can be improved to be more conservative and consistent with the places where maybe_quote is used.

Updating the slash handling to match the behaviour expected by ccg2xml more closely.

22660c8

Creating deep instead of shallow copies for morph entries.

c5f77f8

Generated XMLs may be longer than the input.

01093a5

This reflects the fact that in hand-crafted xml files, rules / macros might have been forgotten.

Allowing + and % as normal letters which do not need quotes.

fe2d1d1

Adding diaspace grammar as a 'handcrafted' test case.

b370880

Adding more documentation to the xml2ccg script.

c5511d4

Adding bin/xml2ccg.py to the .gitignore

9d81d7e

Recursively comparing subtrees.

2decc3b

shoeffner force-pushed the feature/xml2ccg branch from 66aa180 to 2decc3b Compare December 5, 2018 18:36

Adding members to families. Fixing nested complexcats to be parenthesized. Fixing typeraise parsing. Fixing several [*DEFAULT*] values and handling nomvars more properly.

3cb25f6

visccg handles family names containing quotes.

c8a587d

Also disallows saveSection on a None text element. If a section was edited but "Done" is clicked on another section, a NoneType has no get exception was thrown. However, a proper fix would be to allow "Done" only on edited sections.
