AidaLog/Plain-Swahili-Dataset

Swahili Sentences Dataset

Overview

This repository contains a dataset of 84249 Swahili sentences. It can be a valuable resource for various natural language processing (NLP) tasks. The dataset was sourced from a public repository.

Note: If you have information about the source or licensing details, please reach out or submit a pull request.

Dataset Details

  • Number of Sentences: 84249
  • Language: Swahili
  • File Format: CSV
  • Data Structure: One sentence per row
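The details above can be checked with a short script. The following is a minimal sketch using Python's standard csv module; it assumes the file has a header row with a single column named sentence (as the preview in the Usage section suggests), and the helper name count_sentences is hypothetical. A tiny in-memory sample stands in for the real file.

```python
import csv
import io

def count_sentences(handle, column="sentence"):
    """Count rows whose value in `column` is non-empty."""
    reader = csv.DictReader(handle)
    return sum(1 for row in reader if (row.get(column) or "").strip())

# In-memory stand-in for swahili_sentences.csv (assumed layout: header + one sentence per row)
sample = io.StringIO("sentence\nhabari ya asubuhi\nkaribu sana\n")
sample_count = count_sentences(sample)
print(sample_count)
```

To run it against the real file, open swahili_sentences.csv with newline='' and encoding='utf-8' and pass the file handle to count_sentences; if the layout assumption holds, the result should match the sentence count listed above.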

Potential Use Cases

This dataset can be used for a variety of applications, including:

  • Language detection
  • NLP research
  • Language model training
  • Content filtering and moderation
  • Cross-lingual research
  • Educational purposes
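For use cases such as language model training, the sentences usually need to be divided into training and validation sets first. The sketch below shows one simple way to do that with a seeded shuffle; the function name split_sentences, the 10% default validation fraction, and the placeholder sentences are all assumptions made for illustration, not part of this repository.

```python
import random

def split_sentences(sentences, val_fraction=0.1, seed=0):
    """Shuffle sentences reproducibly and return (train, validation) lists."""
    items = list(sentences)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(items)
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

# Placeholder data; in practice pass the sentence column of the loaded dataset
sample = [f"sentensi {i}" for i in range(10)]
train, val = split_sentences(sample, val_fraction=0.2)
print(len(train), len(val))
```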

Usage

import pandas as pd

# Load the dataset and preview the first five rows.
df = pd.read_csv('swahili_sentences.csv')
print(df.head())

Output:

  sentence
0 mkutano wa biashara je ungependa kupata mualiko maalum kuhudhuria kwenye ...
1 kadiri ya hesabu yake hao mwaka walikuwa milioni lakini watu wa nje ...
2 jina linatokana na neno la kilatini scio yaani najua kwa maana pana ...
3 historia ya scientology inaendana kabisa na maisha ya mwanzilishi ...
4 kisha kupata umaarufu wa muda mfupi akaelekea upande wa roho aliyoiona ...
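The preview suggests the sentences are lowercased and free of punctuation, so splitting on whitespace may be enough for a first look at the vocabulary. The following is a rough sketch under that assumption; word_frequencies is a hypothetical helper, and the two sample strings are short fragments taken from the preview rows above.

```python
from collections import Counter

def word_frequencies(sentences):
    """Count whitespace-separated tokens across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts

# Fragments from the preview; in practice pass df['sentence']
sample = ["mkutano wa biashara", "watu wa nje"]
freqs = word_frequencies(sample)
print(freqs.most_common(1))
```

If the full dataset contains punctuation or mixed case, a proper tokenizer would be a better choice than str.split.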

Citation

If you use this dataset in your research or applications, please consider citing this repository. A proper citation will be added here once available.

Contributing

If you have more data to add to the dataset, ideas on how to improve it, or any questions, please feel free to open an issue or submit a pull request. Any contributions you make are greatly appreciated.
