NLP-Final-Projects

Collaborators:

Project Scope and Goals

Using data from the CAES institute, we want to make a native language inference/identification system that classifies a persons native language based on how they write a second language. We are using data from a Spanish Language test, with participants that spoke Chinese, Portuguese, Russian, French, English, and Arabic. We then use data from the TOEFL which contains a larger corpus of essays (written in english) with a wider diversity of native speakers. These languages include German, Turkish, French, Arabic, Korean, Chinese, Hindi, Spanish, Italian, Japanese, and Telugu. Information about each of these is outlined in the following sections.

Metadata and Associated Information

General Essay Information (CAES)

Essay Category	Associated Number
Respondents	3878
Essay Samples	3878
Total POS Tags	682172
Average POS tags per essay	175.9

Responses and Tags by Language (CAES)

Language	Total Essays	Average POS Tags per Essay
Arabic	1342	148.3
Chinese	373	169.2
English	615	204.7
French	371	189.5
Russian	176	140.8
Portuguese	1001	198.8

Challenges

Unfortunately, the data that we have is not very rich. We are working with essays that have been transcribed into part of speech tags meaning we are not utilizing the raw essays. This makes working with the text easier because it is standardized, but limits the amount of data that we can gather from the text as the phrases have been completely normalized.

Applications

One application of NLI is its use in forensic linguistics. With the rise of Russian troll farms, being able to accurately determine which texts were written by native language speakers versus which texts were written by native Russian speakers would turn the tide of misinformation and propaganda that has flooded the internet. In addition, a number of intelligence agencies have started to fund NLI projects in the hopes that it will give them more information about potential threats and who are responsible for them. NLI also has applications as it pertains to pedagogical (teaching) materials. By identifying L1-specific features, we can improve language transfer and author profiling.

Formal Report

A PDF file containing all relevant background and findings from this project (along with associated literature reviews) can be found here.

Other work

This work is based on research from a few shared tasks. Here are links to further reading:

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
Data		Data
DataManipulationPyFiles		DataManipulationPyFiles
Report Draft		Report Draft
TFLearn		TFLearn
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data

Data

DataManipulationPyFiles

DataManipulationPyFiles

Report Draft

Report Draft

TFLearn

TFLearn

LICENSE

LICENSE

README.md

README.md

Repository files navigation

NLP-Final-Projects

Project Scope and Goals

Metadata and Associated Information

Challenges

Applications

Formal Report

Other work

About

Releases

Packages

Contributors 3

Languages

License

ClintonTak/NLP-Final-Project

Folders and files

Latest commit

History

Repository files navigation

NLP-Final-Projects

Project Scope and Goals

Metadata and Associated Information

Challenges

Applications

Formal Report

Other work

About

Resources

License

Stars

Watchers

Forks

Languages