MASSIVEDATASET README

Introduction

This project focuses on processing and managing language data. It answers two main questions:

Python3 Development Environment Setup: Set up a Python 3 development environment, install the relevant dependencies, and build a project structure like the one PyCharm generates. Import the MASSIVE dataset and generate one Excel file per language pair (en-xx.xlsx) from the id, utt, and annot_utt fields. Recursive algorithms are avoided for performance reasons.
Working with Files: Generate separate JSONL files for the English (en), Swahili (sw), and German (de) datasets, covering the test, train, and dev partitions. Create a single large JSON file showing the translations from English to the other languages (xx) for the training partition.

Prerequisites

Before running the project, ensure you have the following prerequisites installed:

Python 3.x
pip (Python package manager)

Installation

Install the required third-party packages with pip:

pip install jsonlines

pip install pandas

The json, os, and sys modules are part of the Python standard library and do not need to be installed.
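To confirm the environment is ready before running the scripts, a small check along these lines can help. This is a minimal sketch, not part of the repository; the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Third-party packages this README installs with pip.
THIRD_PARTY = ["jsonlines", "pandas"]
# Modules that ship with Python itself and need no installation.
STDLIB = ["json", "os", "sys"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report anything that still has to be installed.
for name in missing_modules(THIRD_PARTY + STDLIB):
    print(f"missing: {name} (try: pip install {name})")
```

If nothing is printed, every listed module can be imported.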

Project Structure

The project structure should resemble the following:

project-root/
│
├── 1.1/data/
│       └── excel
│
├── main.py
├── q2.py
│
├── en-xx.xlsx (Generated)
├── en_train.jsonl (Generated)
├── sw_train.jsonl (Generated)
├── de_train.jsonl (Generated)
├── translations.json (Generated)
│
├── README.md
├── generate.sh
└── other_files...

Running the Project

Question 1

Place your dataset file (input_data.xlsx) inside the data/ directory.

Run the following command to execute Question 1:

./generate.sh

The script will generate language-specific Excel files (en-xx.xlsx) in the results/ directory.
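The pairing step behind an en-xx.xlsx file can be sketched as follows. This is an illustration under assumptions, not the code in main.py: it assumes the English and target-language records are already loaded as dictionaries carrying the id, utt, and annot_utt fields, and the output column names (`en_utt`, `xx_utt`, and so on) and the function names are hypothetical. The join uses a plain loop and a dictionary lookup, so no recursion is involved.

```python
def pair_rows(en_records, xx_records):
    """Join English and target-language records on their shared id."""
    # Index the target language by id for constant-time lookups.
    xx_by_id = {rec["id"]: rec for rec in xx_records}
    rows = []
    for en in en_records:
        xx = xx_by_id.get(en["id"])
        if xx is None:
            continue  # no counterpart in the target language
        rows.append({
            "id": en["id"],
            "en_utt": en["utt"],
            "en_annot_utt": en["annot_utt"],
            "xx_utt": xx["utt"],
            "xx_annot_utt": xx["annot_utt"],
        })
    return rows


def write_excel(rows, out_path):
    """Write the paired rows to an .xlsx file.

    Requires pandas plus an Excel engine such as openpyxl.
    """
    import pandas as pd  # imported lazily so pair_rows works without pandas

    pd.DataFrame(rows).to_excel(out_path, index=False)
```

For German, for example, `write_excel(pair_rows(en_records, de_records), "en-de.xlsx")` would produce one row per utterance that exists in both languages.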

Question 2

Make sure you have the English (en), Swahili (sw), and German (de) JSONL files (e.g., en-US.jsonl, de-DE.jsonl, sw-KE.jsonl) in the project directory.

Run the following shell command to execute Question 2:

python q2.py

The script will generate separate JSONL files for English, Swahili, and German in the results/ directory. Additionally, it will create a large JSON file (translations.json) showcasing translations from English to other languages.
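The two outputs described above can be sketched as follows. This is a hedged illustration rather than the contents of q2.py: it assumes each record carries a `partition` field with the values test, train, or dev, uses the standard-library json module in place of jsonlines to stay self-contained, and all function names and the shape of the translations entries are hypothetical.

```python
import json
from collections import defaultdict


def split_by_partition(records):
    """Group records by their partition value (assumed: test, train, dev)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["partition"]].append(rec)
    return dict(groups)


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for rec in records:
            handle.write(json.dumps(rec, ensure_ascii=False) + "\n")


def build_translations(en_train, others_train):
    """Pair each English training utterance with its translations.

    others_train maps a language code (e.g. "sw", "de") to that
    language's training records.
    """
    # Index every other language by id so each lookup is constant time.
    lookup = {
        lang: {rec["id"]: rec["utt"] for rec in recs}
        for lang, recs in others_train.items()
    }
    result = []
    for en in en_train:
        entry = {"id": en["id"], "en": en["utt"]}
        for lang, by_id in lookup.items():
            if en["id"] in by_id:
                entry[lang] = by_id[en["id"]]
        result.append(entry)
    return result
```

The training split of each language could then be written with `write_jsonl(groups["train"], "en_train.jsonl")`, and the combined list saved to translations.json with `json.dump(translations, handle, ensure_ascii=False, indent=2)`.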

Run the following command to upload the generated files to Google Drive:

python googledrive.py
