GitHub - boubnanm/Probabilistic-Context-Free-Grammar-Parser: A Probabilistic Context Free Grammar Parser for French based on CYK algorithm

Probabilistic Parser for French

This system provides a Probabilistic Parser for French based on the CYK algorithm, and the PCFG model and that is robust to unknown words thanks an OOV (out-of-vocabulary) module.

System arguments

Input corpus parameters :

--corpus_path: Path to the corpus treebank to take as input
- By default : "sequoia-corpus+fct.mrg_strict" (SEQUOIA treebank)
--train_size: Size of the training corpus to extract
- By default : 0.9 (90%)
--valid_size: Size of the validation corpus to extract
- By default : 0 (0%)
--test_size: Size of the testing corpus to extract
- By default : 0.1 (10%)
--shuffle_split: Shuffle the corpus before splitting the three corpus.
- By default : False

OOV parameters :

--embed_path: Path to word embeddings pickle file
- By default : 'polyglot-fr.pkl' (Polyglot French embeddings)
--lev_candidates: Number of formal similar words candidates for the OOV module proposed.
- By default : 2
--emb_candidates: Number of embedding similar words candidates for the OOV module proposed.
- By default : 20
--use_damereau: Use Damereau-Levenstein distance for the formal similarities
- By default : False
--linear_interpolation: Linear interpolation coefficient for the bigram language model
- By default : 0.2

CYK parameters :

--written_sentence: Parse a sentence provided through the argument --sent_to_parse. If not, parse a file of sentences provided through the argument --file_to_parse
- By default : False
--sent_to_parse: Written sentence to parse
- By default : "Le maire de Paris ."
--file_to_parse: Path to the file containing sentences to parse (one sentence per line, exactly one whitespace between each token)
- By default : "test_sentences.txt" (Last 10% of the SEQUOIA treebank)
--output_path: Output path for the sentences parses outputs.
- By default : "parse_results.txt"

Multiprocessing parameters:

--multiprocess: Use multiprocessing (This may not work on windows)
- By default : False
--n_cpus: Number of CPUs to use. If -1, use all the CPUs available.
- By default : -1

How to use

Here are some examples to use the system proposed :

To train the parser on the first 90% of the SEQUOIA treebank and parse the last 10% without multiprocessing and output results to "parse_results.txt":

bash run.sh

To train the parser on the first 90% of the SEQUOIA treebank and parse the last 10% with multiprocessing and output results to "parse_results.txt":

bash run.sh --multiprocess

To train the parser on all the SEQUOIA treebank and parse the sentences in the file "sentences_to_parse.txt" and write the result to the file "parsed_sentences.txt":

bash run.sh --train_size 1 --test_size 0 --file_to_parse "sentences_to_parse.txt" --output_path "parsed_sentences.txt"

To train the parser on all the SEQUOIA treebank and parse a written sentence "Le maire de Paris .":

bash run.sh --train_size 1 --test_size 0 --written_sentence --sent_to_parse "Le maire de Paris ."

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
README.md		README.md
Report.pdf		Report.pdf
evaluation_data.parser_output		evaluation_data.parser_output
main.py		main.py
polyglot-fr.pkl		polyglot-fr.pkl
run.sh		run.sh
sequoia-corpus+fct.mrg_strict		sequoia-corpus+fct.mrg_strict
test_sentences.txt		test_sentences.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

Report.pdf

Report.pdf

evaluation_data.parser_output

evaluation_data.parser_output

main.py

main.py

polyglot-fr.pkl

polyglot-fr.pkl

run.sh

run.sh

sequoia-corpus+fct.mrg_strict

sequoia-corpus+fct.mrg_strict

test_sentences.txt

test_sentences.txt

Repository files navigation

Probabilistic Parser for French

System arguments

How to use

About

Releases

Packages

Languages

boubnanm/Probabilistic-Context-Free-Grammar-Parser

Folders and files

Latest commit

History

Repository files navigation

Probabilistic Parser for French

System arguments

How to use

About

Topics

Resources

Stars

Watchers

Forks

Languages