We write a grammar and use a parser to parse the POS tag sequence of each sentence.
Input data: sentences with POS tags

The input is a TSV (tab-separated values) file like the sample:
id | label | sentence | pos |
---|---|---|---|
73 | 0 | Many thanks in advance for your cooperation . | JJ NNS IN NN IN PRP$ NN . |
The id column is the unique id for each sentence. The label column indicates whether a sentence contains grammar errors (1 means the sentence has errors, 0 means it is error-free). The sentence column contains the tokens of the sentence, separated by single spaces. The pos column contains the POS tag for each token in the sentence, also separated by single spaces.
The POS tags follow the Penn Treebank (PTB) tagging scheme, described here
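As a rough illustration of the input format described above, the following sketch reads such a file with Python's csv module. It assumes the file starts with a header row naming the four columns; the helper name `read_rows` and the inline sample string are hypothetical and only stand in for a real file.

```python
import csv
import io

# Inline sample in the format described above (assumption: the file has a
# header row with the column names id, label, sentence, pos).
SAMPLE_TSV = (
    "id\tlabel\tsentence\tpos\n"
    "73\t0\tMany thanks in advance for your cooperation ."
    "\tJJ NNS IN NN IN PRP$ NN .\n"
)


def read_rows(handle):
    """Yield (id, label, tokens, tags) tuples from a tab-separated file handle."""
    # QUOTE_NONE keeps quote characters in sentences and tags as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        tokens = row["sentence"].split(" ")
        tags = row["pos"].split(" ")
        yield row["id"], int(row["label"]), tokens, tags


rows = list(read_rows(io.StringIO(SAMPLE_TSV)))
for sentence_id, label, tokens, tags in rows:
    print(sentence_id, label, list(zip(tokens, tags)))
```

For a real file, the same function could be given a handle opened with `open(path, newline="", encoding="utf-8")`.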
- We wrote a toy CFG for English in NLTK’s .cfg format.
- We used the chart parser from NLTK to parse each of the POS sequences in the dataset with the toy grammar we wrote in task 1. We stored results in a TSV file with three columns:
Column name | Description |
---|---|
id | The id of the input sentence. |
prediction | 1 if the sentence has grammar errors, 0 if not. In other words, 1 if the POS sequence cannot be parsed with our grammar and parser, and 0 if it can. |
- We evaluated the performance of our grammar checker by calculating its precision and recall on the data available to us. To do that, we compared our system's prediction for each sentence with the corresponding label in the dataset.
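The parsing step above can be sketched as follows. This is a minimal example, not the grammar used in this project: the rules in `TOY_GRAMMAR` and the helper name `predict` are made up for illustration, and the POS tags are treated as the terminals of the grammar.

```python
import nltk

# Hypothetical toy grammar in NLTK's CFG notation. Terminals are PTB POS tags
# and must be quoted; this is far smaller than a usable English grammar.
TOY_GRAMMAR = """
S -> NP VP '.' | NP '.'
NP -> Nom | 'DT' Nom | 'PRP$' Nom | NP PP
Nom -> 'NN' | 'NNS' | 'JJ' Nom
PP -> 'IN' NP
VP -> 'VBZ' NP | 'VBD' NP | 'VBZ' | 'VBD'
"""

grammar = nltk.CFG.fromstring(TOY_GRAMMAR)
parser = nltk.ChartParser(grammar)


def predict(pos_sequence):
    """Return 0 if the tag sequence has at least one parse, otherwise 1."""
    tags = pos_sequence.split()
    try:
        # parse() yields trees lazily; one tree is enough to accept the input.
        has_parse = next(iter(parser.parse(tags)), None) is not None
    except ValueError:
        # NLTK raises ValueError when a tag is not covered by the grammar.
        has_parse = False
    return 0 if has_parse else 1


print(predict("JJ NNS IN NN IN PRP$ NN ."))
print(predict("DT NN VBZ DT ."))
```

A grammar stored in a .cfg file could be used the same way by reading the file's text and passing it to `nltk.CFG.fromstring`.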
Further details and results can be found here
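For reference, the precision and recall calculation mentioned above can be sketched as below, assuming that label 1 (the sentence has errors) is the positive class. The function name `precision_recall` and the sample lists are illustrative only and are not results from this project.

```python
def precision_recall(labels, predictions):
    """Compute precision and recall, treating 1 (has errors) as positive."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    # Guard against division by zero when a class never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels and predictions, for illustration only.
example_labels = [1, 1, 0, 0, 1]
example_predictions = [1, 0, 1, 0, 1]
precision, recall = precision_recall(example_labels, example_predictions)
print(f"precision={precision:.3f} recall={recall:.3f}")
```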
Leen Alzebdeh @Leen-Alzebdeh
Sukhnoor Khehra @Sukhnoor-K
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson Prentice Hall.
GitHub Copilot
This project uses the Python standard library modules csv and sys, plus the third-party library nltk.
- Ensure Python is installed, along with the Python Standard Library.
- Ensure the nltk library is installed. It can be installed using the following command:
  pip install --user -U nltk
- Ensure you have input data in the format outlined above and in a file 'data/train.tsv'
Example usage: run the following command from the project root directory.
python3 src/main.py data/train.tsv grammars/toy.cfg output/train.tsv