Data Science Capstone Data Processing Scripts (July to August 2015)

This repository contains the codes and scripts that are used to process the raw text files obtained from Coursera.

The objective of these scripts is to demostrate data engineering rather than building a smoothing ngram model. These script managed to process the entire corpus to generate the ngrams and their counts, which is hardly possible with R. These data can be then used for developing the predictive text algorithm.

Requirements

Hardware

At least 16 GB of RAM
Most of the scripts are single-threaded, therefore dual-core is sufficient

Software

Linux (tested on Ubuntu 14.04, should work on most Linux)
Python (version 2.7)
nltk (version 3.0)
numpy

Scripts

Run the following scripts in order:

1-get_src_data.sh

Download the raw corpus file

2-extract_data.sh

Unpackage the raw corpus file and remove the non-English data files.

3-data_cleaning.sh

Clean up the corpus by replacing unicode characters to ascii

4-generate_grams.sh

Using the python scripts (generate_grams.py and generate_unigrams.py), the ngrams (up to n = 3) and their respective counts for each corpus (therefore, unaggregated) are generated in csv file

5-post_process.sh

Post-process removes invalid ngrams and aggregate the count for common ngrams.

6-compress.sh

Compress the files with aggregated counts with bzip2 (optional)

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Quiz/1		Quiz/1
Reports		Reports
lib		lib
slides		slides
.gitignore		.gitignore
1-get_src_data.sh		1-get_src_data.sh
2-extract_data.sh		2-extract_data.sh
3-data_cleaning.sh		3-data_cleaning.sh
4-generate_grams.sh		4-generate_grams.sh
5-post_process.sh		5-post_process.sh
6-compress.sh		6-compress.sh
README.md		README.md
generate_grams.py		generate_grams.py
generate_unigrams.py		generate_unigrams.py
profanity.txt		profanity.txt

kylase/data-science-capstone

Folders and files

Latest commit

History

Repository files navigation

Data Science Capstone Data Processing Scripts (July to August 2015)

Requirements

Hardware

Software

Scripts

About

Topics

Resources

Stars

Watchers

Forks

Languages