In the name of Allah
14 January 2013
This is the README for the "TreebankTransform" package, a helper tool for transforming an input CoNLL file into the desired formats. This package was developed by [Mojtaba Khallash](mailto:mkhallash@gmail.com) from Iran University of Science and Technology (IUST).
The home page for the project is: http://nlp.iust.ac.ir
If you use this software for research, please cite this web address in your papers.
The package can be used freely for non-commercial research and educational purposes. It comes with no warranty, but we welcome all comments, bug reports, and suggestions for improvements.
- Compiling
- Example of Usage
- Running the package
  a. Global Transforms
  b. Transforms that Affect Wordform and Lemma Columns
  c. Transforms that Affect POS and CPOS Columns
  d. Transforms that Affect FEATS Column
  e. Transforms that Affect DEPREL and HEAD Columns
- References

- Compiling
Requirements:
- Version 1.7 or later of the [Java 2 SDK](http://java.sun.com)
You must add the Java binary directory to the system path. On Linux, you can open the `~/.bashrc` file and append this line: `PATH=$PATH:/<address-of-bin-folder-of-JRE>`
To compile the code, first decompress the package.
On Linux:
    tar -xvzf TreebankTransform.tgz
    cd TreebankTransform
    sh compile_all.sh
On Windows, decompress TreebankTransform.zip and run:
    compile.bat
You can also open all the projects in NetBeans 7.1 (or possibly a later version).
- Example of Usage
To mark compound verbs, this package uses the "[Valency Lexicon Ver. 2.2](http://dadegan.ir/en)" [1].
- Running the package
This package runs in GUI mode. Simply double-click the jar file or run the following command:
    java -jar TreebankTransform.jar
a. Global Transforms
Two options exist:
- Change direction of word (Left to Right - Right to Left):
This transform is used to increase the diversity of the baseline parsers in order to build a good ensemble system.
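The README does not spell out how the direction change is carried out. One plausible reading is that the word order of each sentence is reversed and the ID and HEAD columns are renumbered to match. The Python sketch below illustrates that reading only; it is not the package's Java code. It assumes the 10-column CoNLL-X layout (ID in column 1, HEAD in column 7), token IDs running from 1 to n, and integer HEAD values. The function name `reverse_sentence` and the sample tokens are made up for illustration.

```python
def reverse_sentence(rows):
    """Reverse token order in one sentence and remap ID and HEAD to match.

    `rows` is a list of token rows, each a list of CoNLL-X column strings.
    Assumes IDs run 1..n and HEAD holds an integer (0 marks the root).
    """
    n = len(rows)
    reversed_rows = []
    for row in reversed(rows):
        new_row = list(row)
        new_row[0] = str(n + 1 - int(row[0]))  # ID column
        head = int(row[6])                     # HEAD column
        new_row[6] = "0" if head == 0 else str(n + 1 - head)
        reversed_rows.append(new_row)
    return reversed_rows


# Made-up three-token sentence whose middle token is the root.
sentence = [
    ["1", "A", "a", "X", "X", "_", "2", "DEP", "_", "_"],
    ["2", "B", "b", "X", "X", "_", "0", "ROOT", "_", "_"],
    ["3", "C", "c", "X", "X", "_", "2", "DEP", "_", "_"],
]
result = reverse_sentence(sentence)
```

After the call, the tokens appear in the order C, B, A, and both dependents still point at B.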
- Mark Compound Verb:
In the annotation of the Persian dependency treebank, the elements of compound verbs are marked as separate words. Using "Valency Lexicon Ver. 2.2", this option finds compound verbs and marks each one as a single word.
b. Transforms that Affect Wordform and Lemma Columns
Three options exist:
- Transform annotation of numbers:
This option can be used to replace each number in the _Wordform_ and _Lemma_ columns by a constant. Two modes are available:
Normal: replace each number by `num`.
Binning: replace each number by one of the following labels:

| Label | Numbers covered |
| --- | --- |
| num-bin0 | the number 0 and numbers ending with 00 |
| num-bin1 | the number 1 and numbers ending with 01 |
| num-bin2 | the number 2 and numbers ending with 02 |
| num-bin3 | the numbers 3-10 and those ending with 03-10 |
| num-bin4 | the numbers 11-99 and those ending with 11-99 |
| num-bin5 | all other number tokens |
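The two number modes can be sketched in Python as follows. This is an illustration of the labels listed above, not the package's Java code. What counts as a "number token" is not defined in this README, so the `NUMBER` pattern (digits, optionally joined by `.`, `,` or `/`) is an assumption, as is the function name `number_label`. Plain integers are binned by their last two digits; any other number token falls into `num-bin5`.

```python
import re

# Assumed definition of a number token: digit groups joined by . , or /
NUMBER = re.compile(r"\d+(?:[.,/]\d+)*")
INTEGER = re.compile(r"\d+")


def number_label(token, binning=False):
    """Return the label for a number token, or the token itself otherwise."""
    if not NUMBER.fullmatch(token):
        return token                  # not a number: leave unchanged
    if not binning:
        return "num"                  # Normal mode
    if not INTEGER.fullmatch(token):
        return "num-bin5"             # e.g. decimals or fractions
    last_two = int(token) % 100       # value of the last two digits
    if last_two <= 2:
        return "num-bin%d" % last_two  # bins 0, 1 and 2
    if last_two <= 10:
        return "num-bin3"
    return "num-bin4"
```

For example, `number_label("100", binning=True)` gives `num-bin0` and `number_label("1999", binning=True)` gives `num-bin4`.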
- Copy Lemma in Wordform:
This option is used to reduce the sparsity of the lexical data.
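As a rough illustration of what this option does to one token line, the Python sketch below overwrites the word form with the lemma. It assumes tab-separated CoNLL-X columns with the word form in column 2 and the lemma in column 3; the function name `copy_lemma_to_form` and the sample line are made up, and this is not the package's Java code.

```python
FORM, LEMMA = 1, 2  # zero-based positions, assuming the CoNLL-X column order


def copy_lemma_to_form(line):
    """Overwrite the word form of one token line with its lemma."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) <= LEMMA:
        return line.rstrip("\n")  # blank sentence separator: keep as-is
    cols[FORM] = cols[LEMMA]
    return "\t".join(cols)
```

Applied to a line whose form is `books` and whose lemma is `book`, both columns read `book` afterwards, so inflected variants collapse into a single lexical item.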
- Remove Space:
In the annotation of the Persian dependency treebank, words can contain spaces. Spaces are not permitted in the CoNLL format, and some tools that use space as a delimiter cannot run on this treebank. This option replaces every space with the underscore `_` character.
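A minimal Python sketch of the space replacement, assuming the columns of a token line are separated by tabs so that any remaining space belongs to a column value. The function name `remove_spaces` and the sample tokens are made up for illustration; this is not the package's Java code.

```python
def remove_spaces(line):
    """Replace every space inside the tab-separated columns with '_'."""
    cols = line.rstrip("\n").split("\t")
    return "\t".join(col.replace(" ", "_") for col in cols)
```

For example, a word form written as `harf zad` becomes `harf_zad`, while the tab delimiters between columns are left untouched.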
c. Transforms that Affect POS and CPOS Columns
Two options exist:
- Copy POS to CPOS
- Copy CPOS to POS
d. Transforms that Affect FEATS Column
The FEATS column contains key=value pairs of attributes separated by a vertical bar `|`; if no attribute exists, an underscore `_` is inserted.
Two options exist:
- Remove attribute:
With this option you can enter a list of keys to remove. To remove all attributes, just enter `all`.
- Add attribute:
With this option you can enter a list of keys to add.
To remove all attributes except one of them, put `all` in the remove list and `your-attribute` in the add list.
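The interaction between the two lists can be sketched in Python as below. The reading used here is an assumption: a key in the add list is kept even when the remove list names it or contains `all`. The function name `filter_feats`, its parameters `remove` and `keep`, and the sample feature string are hypothetical; this is not the package's Java code.

```python
def filter_feats(feats, remove=(), keep=()):
    """Filter one FEATS value, e.g. 'number=SING|person=3'.

    `remove` lists keys to drop ('all' drops every key);
    `keep` lists keys that survive regardless (the "add list").
    """
    if feats == "_":
        return feats                      # no attributes to filter
    remove_all = "all" in remove
    kept = []
    for pair in feats.split("|"):
        key = pair.split("=", 1)[0]
        if key in keep or not (remove_all or key in remove):
            kept.append(pair)
    return "|".join(kept) if kept else "_"
```

With `remove=["all"]` and `keep=["number"]`, the value `number=SING|person=3` is reduced to `number=SING`; with `remove=["all"]` alone it becomes `_`.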
e. Transforms that Affect DEPREL and HEAD Columns
Three options exist:
- Transform Ra notation:
The output of this option differs slightly from the original treebank: the representation of the direct object structure is changed. In this representation, "ra" is not the head of the object word. Instead, "ra" is regarded as the case marker for the direct object (the dependent of "ra" in the original representation). The conversion is done automatically; therefore, there may be some errors.
- Remove:
This option removes the content of the DEPREL and HEAD columns and replaces it with an underscore `_`, so that these values can be predicted by a parser.
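A minimal Python sketch of blanking the two columns, assuming tab-separated CoNLL-X lines with HEAD in column 7 and DEPREL in column 8. The function name `blank_head_and_deprel` is made up for illustration; this is not the package's Java code.

```python
HEAD, DEPREL = 6, 7  # zero-based positions, assuming the CoNLL-X column order


def blank_head_and_deprel(line):
    """Replace the HEAD and DEPREL values of one token line with '_'."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) <= DEPREL:
        return line.rstrip("\n")  # blank sentence separator: keep as-is
    cols[HEAD] = "_"
    cols[DEPREL] = "_"
    return "\t".join(cols)
```

The remaining columns are left untouched, so the output can serve as unannotated input for a parser.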
- Set All Head Deprel to ROOT:
This option unifies the DEPREL of every word that has `HEAD=0` to `ROOT`.
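The root unification can be sketched in Python as follows, again assuming tab-separated CoNLL-X lines with HEAD in column 7 and DEPREL in column 8. The function name `unify_root_label` and the sample label `PRD` are made up for illustration; this is not the package's Java code.

```python
HEAD, DEPREL = 6, 7  # zero-based positions, assuming the CoNLL-X column order


def unify_root_label(line, label="ROOT"):
    """Set DEPREL to `label` on a token line whose HEAD is 0."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) > DEPREL and cols[HEAD] == "0":
        cols[DEPREL] = label
    return "\t".join(cols)
```

Lines whose HEAD is not 0 pass through unchanged, so only the labels of root-attached words are affected.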
- References
[1] M. S. Rasooli, et al., "A Syntactic Valency Lexicon for Persian Verbs The First Steps towards Persian Dependency Treebank", 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan, Poland, pp. 227-231, 2011.