RobGrimm/CogSci2017-MultiWordUnits


Code for obtaining results described in the following paper:

Grimm R., Cassani G., Gillis S. and Daelemans W. (2017). Evidence for a facilitatory effect of multi-word units on child word learning. Proceedings of the 39th annual conference of the cognitive science society.

OS and Dependencies

This project is written in Python (version 3.4.3) and R (version 3.3.3), both on Ubuntu 14.04. Most of the code is written in Python; a small part, for statistical analysis, is written in R. The Python component requires the following packages (the versions we used are given in parentheses):

numpy (1.12.1)
nltk (3.2.2)
scipy (0.19.0)

Since our pipeline uses the WordNet lemmatizer included with NLTK, you also need to download a copy of WordNet via nltk.download().
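
For example, from a Python shell:

```python
import nltk

# Fetch the WordNet data used by nltk.stem.WordNetLemmatizer.
nltk.download('wordnet')
```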

Get the Corpus Data

Prepare the CHILDES corpora

We use several corpora from the CHILDES database.

Get the North American corpora from the CHILDES database and unzip them to: Frontiers_MultiWordUnits/CHILDES/corpora/NA/

Then get the British English corpora and unzip them to: Frontiers_MultiWordUnits/CHILDES/corpora/BE/

A quick sanity check for the resulting layout follows the corpus lists below.

Download the following North American corpora:

Bates, Bernstein, Bliss, Bloom70, Bloom73, Bohannon, Braunwald, Brent, Brown, Carterette, Clark, Cornell, Demetras1, Demetras2, ErvinTripp, Evans, Feldman, Garvey, Gathercole, Gleason, HSLLD, Hall, Higginson, Kuczaj, MacWhinney, McCune, McMillan, Morisset, Nelson, NewEngland, Peters, Post, Providence, Rollins, Sachs, Snow, Soderstrom, Sprott, Suppes, Tardif, Valian, VanHouten, VanKleeck, Warren, Weist

And the following British English corpora:

Belfast, Fletcher, Manchester, Thomas, Tommerdahl, Wells, Forrester, Lara
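
Before moving on, you can verify the layout with a minimal check like the following, assuming it is run from the project root so that the paths above resolve as shown:

```python
import os

# Illustrative check: make sure both corpus directories exist and are
# non-empty after unzipping. Adjust the paths if your checkout differs.
for subdir in ('CHILDES/corpora/NA', 'CHILDES/corpora/BE'):
    if not os.path.isdir(subdir):
        raise SystemExit('missing directory: %s' % subdir)
    print('%s: %d entries' % (subdir, len(os.listdir(subdir))))
```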

Run the experiments

The project's root directory contains Python and R scripts, numbered 1 through 6, which must be run in order to carry out the experiments. An illustrative end-to-end runner is sketched after the step descriptions below.

1-pre_process_cds_corpora.py
Pre-process the CHILDES corpora.

2-induce_aofp.py
Collect age of first production (AoFP) values for words used by the children in the CHILDES corpora.
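Age of first production is, in essence, the earliest age at which a child is recorded producing a given word. The following is a minimal sketch of that idea, not the repository's implementation; the (age, tokens) input format is hypothetical:

```python
def induce_aofp(child_utterances):
    """Map each word to the earliest age (in months) at which it was
    produced. `child_utterances` is a hypothetical list of
    (age_in_months, tokens) pairs drawn from child speech."""
    aofp = {}
    for age, tokens in sorted(child_utterances, key=lambda u: u[0]):
        for word in tokens:
            if word not in aofp:
                aofp[word] = age
    return aofp

# 'dog' is first produced at 18 months, 'ball' at 20.
print(induce_aofp([(20, ['ball', 'dog']), (18, ['dog'])]))
```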

3-run_chunk_based_learner.py
Run the Chunk-Based Learner on the CHILDES corpora and save the extracted multi-word units to disk.
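The Chunk-Based Learner of McCauley and Christiansen groups adjacent words into a chunk when their backward transitional probability (BTP) is high, and places a chunk boundary when it drops. Below is a simplified, illustrative pass over a single utterance, assuming precomputed counts; it sketches the core idea and is not the code in this repository:

```python
def chunk_utterance(tokens, bigram_counts, unigram_counts, avg_btp):
    """Split one (non-empty) utterance into chunks: extend the current
    chunk while the backward transitional probability of each bigram is
    at or above the running average; otherwise start a new chunk."""
    chunks, current = [], [tokens[0]]
    for prev, word in zip(tokens, tokens[1:]):
        # BTP(prev -> word) = count(prev, word) / count(word)
        btp = bigram_counts.get((prev, word), 0) / max(unigram_counts.get(word, 0), 1)
        if btp >= avg_btp:
            current.append(word)    # strong attachment: same chunk
        else:
            chunks.append(current)  # weak attachment: boundary here
            current = [word]
    chunks.append(current)
    return chunks
```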

4-run_prediction_based_segmenter.py
Run the Prediction Based Segmenter on the CHILDES corpora and save the extracted multi-word units to disk. This step requires around 20 GB of RAM.
Code for the Prediction Based Segmenter was written by Julian Brooke and is taken from his homepage at the University of Toronto: http://www.cs.toronto.edu/~jbrooke/ngram_decomp_seg.py
We converted the program to Python 3 via 2to3 and made minor changes to integrate it into our pipeline.

5-results_to_csv.py
Compute statistics and write the results to a CSV file for statistical analysis in R. When run for the first time, this script builds a dictionary mapping words to their nearest phonological neighbors, which takes a few hours to complete.
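Nearest phonological neighbors are standardly defined as words one phoneme apart (a single substitution, insertion, or deletion), and finding them requires comparing every pair of words, which is why the first run is slow. Here is a sketch of that standard definition; it may differ in detail from the criterion the script actually uses:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def phonological_neighbors(lexicon):
    """Map each word to the words at phonemic edit distance 1. `lexicon`
    is a hypothetical dict of word -> phoneme list; the quadratic number
    of pairwise comparisons is what makes this step take hours."""
    return {w: [v for v in lexicon
                if v != w and edit_distance(lexicon[w], lexicon[v]) == 1]
            for w in lexicon}
```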

6-statistical_analysis.R
Perform statistical analysis in R.
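
Assuming the filenames above and that Rscript is on your PATH, the whole pipeline can be driven with a small helper like the one below (illustrative only; running the scripts by hand works just as well):

```python
import subprocess

# Run the numbered pipeline steps in order; stop at the first failure.
STEPS = [
    ['python3', '1-pre_process_cds_corpora.py'],
    ['python3', '2-induce_aofp.py'],
    ['python3', '3-run_chunk_based_learner.py'],
    ['python3', '4-run_prediction_based_segmenter.py'],
    ['python3', '5-results_to_csv.py'],
    ['Rscript', '6-statistical_analysis.R'],
]

for cmd in STEPS:
    print('running: %s' % ' '.join(cmd))
    subprocess.check_call(cmd)
```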

To get the results for various tables, check the scripts in: ./Tables
