
This project develops an ETL data pipeline for streaming millions of Amharic and Swahili speech audio files and their transcription texts, collected from speakers through a web platform.


Stella-Mutacho/STT-Data-Collection

 
 


ETL pipelines for Amharic Speech to Text

Data pipeline

[Figure: data pipeline diagram]

Table of contents

STT-Data-Collection

Overview

This week, 10 Academy is your client. Large data sets are valuable for training speech-to-text models, many text corpora already exist for both languages (Amharic and Swahili), and complex data engineering skills are valuable to your profile for employers. This week's task is therefore simple: design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file. By the end of this project, you should produce a tool that can be deployed to post and receive text and audio files to and from a data lake, apply transformations in a distributed manner, and load the results into a warehouse in a format suitable for training a speech-to-text model.
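As an illustration of the transform step described above, the following is a minimal sketch of turning one collected recording into a warehouse-ready manifest row. The `Recording` fields, the `to_manifest_line` helper, and the duration limits are assumptions made for this example and are not taken from the repository.

```python
import json
import unicodedata
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Recording:
    """One sentence/audio pair (hypothetical schema for illustration)."""
    sentence_id: str
    text: str
    audio_path: str
    duration_s: float


def normalize_text(text: str) -> str:
    # Apply Unicode NFC normalisation and collapse runs of whitespace.
    return " ".join(unicodedata.normalize("NFC", text).split())


def to_manifest_line(rec: Recording,
                     min_s: float = 0.5,
                     max_s: float = 30.0) -> Optional[str]:
    """Return a JSON line for the warehouse, or None if the row is rejected.

    The duration bounds are assumed values, not project settings.
    """
    text = normalize_text(rec.text)
    if not text or not (min_s <= rec.duration_s <= max_s):
        return None
    row = asdict(rec)
    row["text"] = text
    # ensure_ascii=False keeps Amharic characters readable in the output.
    return json.dumps(row, ensure_ascii=False)
```

Because the function handles a single row and keeps no state, it could in principle be mapped over partitions by a distributed engine such as Spark, but that wiring is left out here.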

Data Source

The data for this project comes from the "Amharic news text classification dataset with baseline performance" dataset (https://github.com/IsraelAbebe/An-Amharic-News-Text-classification-Dataset).

Requirements

Install and run Kafka
Install Airflow
Install Spark

Installation Guide

To install and run this project:

        git clone https://github.com/STT-data-collection/STT-Data-Collection.git

        cd STT-Data-Collection

        pip install -r requirements.txt

Project Structure

dags

This folder holds the Python scripts that define the Airflow DAGs

data

This folder holds the project data (the data is stored on Google Drive using DVC)

API

This folder holds the Flask API backend

frontend

This folder holds the project's front end, built with React

kafka

This folder holds the Python scripts that define the producer, consumer, and topics, and that manage the Kafka cluster

logs

This folder holds the project's log data

models

This folder holds prediction models

notebooks

This folder holds demonstrations of the project

screenshots

This folder holds screenshots of parts of the project

scripts

This folder holds prediction model scripts

tests

This folder holds test files
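To make the role of the kafka folder more concrete, here is a minimal sketch of how a producer script might publish a sentence for speakers to read. It assumes the `kafka-python` package; the topic name `stt-sentences`, the broker address, and the helper names are hypothetical and are not taken from the repository.

```python
import json

TEXT_TOPIC = "stt-sentences"  # hypothetical topic name


def encode_sentence(sentence_id: str, text: str) -> bytes:
    # JSON-encode the message as UTF-8 so Amharic text survives intact.
    payload = {"id": sentence_id, "text": text}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def decode_sentence(raw: bytes) -> dict:
    # Inverse of encode_sentence, for use on the consumer side.
    return json.loads(raw.decode("utf-8"))


def connect_producer(bootstrap_servers: str = "localhost:9092"):
    # Imported lazily so the helpers above work without kafka-python.
    # The broker address is an assumed local default.
    from kafka import KafkaProducer
    return KafkaProducer(bootstrap_servers=bootstrap_servers)


def publish_sentence(producer, sentence_id: str, text: str,
                     topic: str = TEXT_TOPIC):
    """Send one sentence to the topic.

    `producer` is any object with a send(topic, key=..., value=...)
    method, such as a kafka-python KafkaProducer.
    """
    return producer.send(
        topic,
        key=sentence_id.encode("utf-8"),
        value=encode_sentence(sentence_id, text),
    )
```

Using the sentence id as the message key means all messages for one sentence go to the same partition, which preserves their order. After publishing against a real broker, call `producer.flush()` before exiting so buffered messages are delivered.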
