STARK: a tool for dependency-tree extraction and analysis

STARK is a highly customizable tool that extracts different types of syntactic trees from parsed corpora (treebanks) and quantifies them with respect to frequency and other useful statistics, such as the strength of association between the nodes of a tree or its significance in comparison to another treebank. It is primarily aimed at processing treebanks based on the Universal Dependencies annotation scheme, but it also accepts any other single-rooted dependency treebank in the CoNLL-U format as input.

Installation and execution

Install Python 3 on your system (https://www.python.org/downloads/).

Linux users

Install pip and the other libraries required by the program by running the following commands in the terminal:

sudo apt install python3-pip
cd <PATH TO PROJECT DIRECTORY>
pip3 install -r requirements.txt

To run the extraction, move to the project directory and execute the script:

python3 stark.py

Windows users

Download the pip installation file (https://bootstrap.pypa.io/get-pip.py) and install it by double-clicking it.

Install the other required libraries by opening the program directory and double-clicking install.bat. If Windows Defender prevents this file from running, you may have to unblock it: right-click the .bat file -> Properties -> General -> Security -> select Unblock -> select Apply.

Run the extraction by running run.bat (if it is blocked, repeat the same unblocking procedure as for install.bat).

Changing the settings

By default, running the program as described above extracts trees from the sample en_ewt-ud-dev.conllu file (taken from the English EWT UD treebank) as defined by the parameter settings in the config.ini file. To change the settings, either edit the config.ini file directly or create your own configuration file, which is then passed as an argument when running the program in the terminal (example below) or specified in the run.bat file.

python3 stark.py --config_file my-settings.ini

Alternatively, you can change a specific setting by introducing it as a command line argument directly, which overrides the default setting specified in the config.ini configuration file. In the example below, the tool extracts verb-headed trees consisting of exactly three words from a treebank named my-treebank.conllu, while all other options remain the same as in the default configuration file.

python3 stark.py --input my-treebank.conllu --size 3 --head upos=VERB
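When many runs with different settings are needed, the same command line can be assembled programmatically. The following is a minimal sketch, not part of STARK itself: build_stark_command is a hypothetical helper, it uses only the options shown above (--config_file, --input, --size, --head), and it assumes that a custom configuration file and individual overrides can be combined in one call, which the text above does not explicitly confirm.

```python
import shlex


def build_stark_command(overrides, config_file=None, script="stark.py"):
    """Build the argument list for one STARK run (hypothetical helper).

    overrides maps setting names (e.g. "size") to values; each entry is
    turned into a "--name value" pair on the command line.
    """
    command = ["python3", script]
    if config_file is not None:
        # Assumption: --config_file may be combined with other options.
        command += ["--config_file", config_file]
    for name, value in overrides.items():
        command += [f"--{name}", str(value)]
    return command


command = build_stark_command(
    {"input": "my-treebank.conllu", "size": 3, "head": "upos=VERB"}
)
# Print the command as it would be typed in the terminal.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the project directory to start the extraction.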

List of settings

The types of trees to be extracted and the associated output information can be defined through the parameters listed below and described in more detail here.

General settings:

input: location of the input file or directory (parsed corpus in .conllu)
output: location of the output file (list of trees in .tsv)

Tree specification:

size: number of nodes in the tree (integer or range)
node_type: node characteristic under investigation (form, lemma, upos, xpos, feats or deprel)
complete: extraction of full trees only (i.e. heads with all their dependents) rather than all possible subtrees (values yes or no)
labeled: extraction of labeled or unlabeled trees (values yes or no)
fixed: differentiating trees by surface word order (values yes or no)

Tree restrictions:

head: predefined characteristics of the head node (e.g. upos=NOUN)
ignore_labels: predefined list of dependency labels that should be ignored when counting the trees (e.g. punct)
query: predefined tree structure based on the DepSearch query language (e.g. VERB >obl NOUN)

Statistics:

association_measures: calculates the strength of association between nodes by MI, MI3, t-test, logDice, Dice and simple-LL scores (values yes or no)
compare: calculates the keyness of a tree in comparison to another treebank by LL, BIC, log ratio, odds ratio and %DIFF scores (reference treebank in .conllu)

Additional visualization:

example: prints a random sentence containing the tree
grew_match: describes the tree structure using the Grew query language and provides links to examples in Grew-match

For a detailed explanation of these and other settings, see the settings documentation here.

Output

STARK produces a tab-separated (.tsv) file listing all the trees that match the input criteria, sorted by descending frequency. The first few lines of the default sample output below show the five most frequent trees in the sample en_ewt-ud-dev.conllu treebank.

The description of the tree is given in the first column, while subsequent columns include additional information on individual nodes, the absolute and relative frequencies, the surface node order, the number of nodes in the tree and the head. To add other types of information to the output, such as other useful statistics and links to visualised examples, see the list of settings above or the detailed settings documentation here.

Tree	Node A	Node B	Node C	A-Freq	R-Freq	Order	N	Head
DET <det NOUN	DET	NOUN		320	12724.2	AB	2	NOUN
ADP <case DET <det NOUN	ADP	DET	NOUN	190	7555.0	ABC	3	NOUN
ADP <case PROPN	ADP	PROPN		172	6839.2	AB	2	PROPN
ADP <case NOUN	ADP	NOUN		165	6560.9	AB	2	NOUN
ADJ <amod NOUN	ADJ	NOUN		126	5010.1	AB	2	NOUN
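Because the output is plain tab-separated text, it can be post-processed with standard tools. The sketch below reads the sample rows shown above with Python's csv module and keeps only the noun-headed trees. The function name read_trees and the dictionary keys are illustrative choices, and the column names are taken from the sample header; other settings may add or remove columns, so columns are looked up by name rather than by position.

```python
import csv
import io

# The sample output shown above, embedded so the sketch is self-contained.
SAMPLE_OUTPUT = (
    "Tree\tNode A\tNode B\tNode C\tA-Freq\tR-Freq\tOrder\tN\tHead\n"
    "DET <det NOUN\tDET\tNOUN\t\t320\t12724.2\tAB\t2\tNOUN\n"
    "ADP <case DET <det NOUN\tADP\tDET\tNOUN\t190\t7555.0\tABC\t3\tNOUN\n"
    "ADP <case PROPN\tADP\tPROPN\t\t172\t6839.2\tAB\t2\tPROPN\n"
    "ADP <case NOUN\tADP\tNOUN\t\t165\t6560.9\tAB\t2\tNOUN\n"
    "ADJ <amod NOUN\tADJ\tNOUN\t\t126\t5010.1\tAB\t2\tNOUN\n"
)


def read_trees(handle):
    """Read STARK-style TSV rows into a list of dictionaries."""
    # QUOTE_NONE keeps quotation marks in word forms from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    trees = []
    for row in reader:
        trees.append(
            {
                "tree": row["Tree"],
                "abs_freq": int(row["A-Freq"]),
                "rel_freq": float(row["R-Freq"]),
                "size": int(row["N"]),
                "head": row["Head"],
            }
        )
    return trees


trees = read_trees(io.StringIO(SAMPLE_OUTPUT))
noun_headed = [tree for tree in trees if tree["head"] == "NOUN"]
for tree in noun_headed:
    print(tree["tree"], tree["abs_freq"], sep="\t")
```

To process a real output file, replace the io.StringIO object with a file opened as open(path, encoding="utf-8", newline="").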

Description of tree structure

The description of the trees given in the first column of the output is based on the DepSearch query language, which is simple to learn and easy to read:

Dependencies are expressed using the < and > operators, which mimic the "arrows" in the dependency graph:
- A < B means that token A is governed by token B, e.g. rainy < morning
- A > B means that token A governs token B, e.g. read > newspapers
Dependency labels are specified right after the dependency operator:
- A <amod B means that token A is the adjectival modifier of token B, e.g. rainy <amod morning
- A >obj B means that token B is the direct object of token A, e.g. read >obj newspapers
Priority is marked using parentheses:
- A > B > C means that A governs both B and C in parallel, e.g. read > newspapers > people for 'people read newspapers'
- A > (B > C) means that A governs B which, in turn, governs C, e.g. read > (newspapers > interesting) for 'read interesting newspapers'
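The role of the parentheses can be illustrated with a small sketch that renders a nested tree in this notation. This is an illustration of the rules above, not STARK's own serializer: the tuple-based tree representation and the function name render are assumptions made for the example, and only the > operator is used (every dependent is written after its head).

```python
def render(node):
    """Render a tree as a DepSearch-style string using the > operator.

    node is a tuple (token, dependents), where dependents is a list of
    (label, child_node) pairs. An empty label gives an unlabeled edge.
    """
    token, dependents = node
    parts = [token]
    for label, child in dependents:
        child_text = render(child)
        if child[1]:
            # The child governs further tokens, so it is wrapped in
            # parentheses to keep them attached to the child.
            child_text = f"({child_text})"
        parts.append(f">{label} {child_text}")
    return " ".join(parts)


# Parallel dependents: 'read' governs both 'newspapers' and 'people'.
parallel = ("read", [("obj", ("newspapers", [])), ("nsubj", ("people", []))])
# Nested dependents: 'read' governs 'newspapers', which governs 'interesting'.
nested = ("read", [("obj", ("newspapers", [("amod", ("interesting", []))]))])

print(render(parallel))
print(render(nested))
```

Without the parentheses, the second tree would be read as read governing both newspapers and interesting directly, which is why the nested dependent is grouped.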

Acknowledgment

This tool was developed by Luka Krsnik in collaboration with Kaja Dobrovoljc and Marko Robnik Šikonja. Financial and infrastructural support was provided by the Slovenian Research and Innovation Agency, CLARIN.SI and CJVT UL as part of the research projects SPOT: A Treebank-Driven Approach to the Study of Spoken Slovenian (Z6-4617) and Language Resources and Technologies for Slovene (P6-0411), as well as through the 2019 CLARIN.SI Resource and Service Development grant.
