Words Bag Parallelized

Team

Mariana Luna Rocha, Data Science Bachelor student at ITAM.
Mauricio Vázquez Moran, Data Science and Actuarial Science Double Bachelor Program student at ITAM.

Problem Definition

Given a list of filenames containing texts, the name of a file containing the vocabulary, the vocabulary size, and the number of processes to be used (equal to the number of input files), the task is to implement a bag-of-words algorithm. The algorithm counts how many times each vocabulary word occurs in each text and writes the resulting Bag of Words matrix to an output file in CSV format.

Input:
- List of filenames where the texts to be analyzed are located (files are in the same location as the executable).
- Filename containing the vocabulary and its size.
- Number of processes to be used (equal to the number of input files).
Output:
- File containing the Bag of Words matrix in CSV format.

Algorithm Overview

Read the vocabulary file to create a dictionary where the keys are words, and the values are indices.
Initialize a Bag of Words matrix with rows for each text file and columns for each word in the vocabulary.
Assign a process to each text file to count the occurrences of words in parallel.
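
The first two steps can be sketched in Python as follows. This is a minimal illustration rather than the repository's code: it assumes the vocabulary is stored as comma-separated words, and the names `build_vocab_index` and `init_matrix` are hypothetical.

```python
def build_vocab_index(vocab_text):
    """Map each vocabulary word to a column index.

    Assumption: words are separated by commas and/or newlines.
    """
    words = [w.strip() for w in vocab_text.replace("\n", ",").split(",")]
    return {word: idx for idx, word in enumerate(w for w in words if w)}


def init_matrix(n_files, vocab_size):
    """Create an n_files x vocab_size matrix of zero counts (one row per text file)."""
    # Build each row separately so that rows do not share the same list object.
    return [[0] * vocab_size for _ in range(n_files)]


vocab = build_vocab_index("apple,banana,cherry")
matrix = init_matrix(2, len(vocab))
```

Keeping the word-to-index mapping in a dictionary means each token lookup during counting takes constant time on average, regardless of the vocabulary size.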

For each text file:
- Read the text file.
- Tokenize the text into words.
- Count the occurrences of each word in the vocabulary.
- Update the corresponding row in the Bag of Words matrix.
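
A minimal Python sketch of the per-file steps is shown here. The tokenization rule (lowercase the text, keep runs of letters and apostrophes) is an assumption made for illustration, and `count_words` is a hypothetical helper, not a function from the project.

```python
import re


def count_words(text, vocab_index):
    """Return one Bag of Words row: counts of each vocabulary word in `text`."""
    row = [0] * len(vocab_index)
    # Assumed tokenization: lowercase, then extract runs of letters/apostrophes.
    for token in re.findall(r"[a-z']+", text.lower()):
        idx = vocab_index.get(token)
        if idx is not None:  # words outside the vocabulary are ignored
            row[idx] += 1
    return row


vocab_index = {"cat": 0, "dog": 1, "fish": 2}
row = count_words("The cat saw a dog. The dog ran; the CAT slept.", vocab_index)
```

In this example `row` holds the counts for "cat", "dog", and "fish" in vocabulary order; words such as "the" that are not in the vocabulary do not affect the result.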

Write the Bag of Words matrix to a CSV file.
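
The final step could look like the following Python sketch, which uses the standard `csv` module. The layout (a header row of vocabulary words and a leading column with the file name) is an assumption, since the exact output format is not specified here; `write_bow_csv` is a hypothetical name.

```python
import csv
import io


def write_bow_csv(stream, vocab_words, file_names, matrix):
    """Write the matrix as CSV: one header row, then one row per text file."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["file"] + list(vocab_words))  # assumed header layout
    for name, row in zip(file_names, matrix):
        writer.writerow([name] + list(row))


# Write to an in-memory buffer so the example does not touch the disk.
buffer = io.StringIO()
write_bow_csv(buffer, ["cat", "dog"], ["a.txt", "b.txt"], [[2, 0], [1, 3]])
csv_text = buffer.getvalue()
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`.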

Implementation

Serial Version:
- The serial version of the code processes the input files one by one in a single thread.
- It reads the vocabulary from a CSV file, then reads each book's content, counts the occurrences of each word from the vocabulary, and finally writes the results to a CSV file.
Parallel Version (MPI):
- The parallel version of the code utilizes the Message Passing Interface (MPI) to distribute the workload across multiple processes.
- Each process reads a portion of the data, and the processes work together to process the vocabulary and count the occurrences of each word.
- By dividing the workload among multiple processes, the parallel version can significantly reduce the processing time, especially for large datasets.
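
The one-worker-per-file division of labor can be illustrated without MPI. The sketch below uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in for MPI processes, so it shows the structure of the computation only, not the project's actual implementation or its performance. The in-memory `texts` dictionary and the `count_row` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_row(text, vocab_index):
    """Count vocabulary words in one text (simple whitespace tokenization)."""
    row = [0] * len(vocab_index)
    for token in text.lower().split():
        idx = vocab_index.get(token)
        if idx is not None:
            row[idx] += 1
    return row


vocab_index = {"red": 0, "green": 1, "blue": 2}
# Hypothetical stand-in for the contents of three input files.
texts = {
    "book1.txt": "red red blue",
    "book2.txt": "green blue blue",
    "book3.txt": "yellow red green",
}

# Serial version: process the files one by one.
serial_matrix = [count_row(text, vocab_index) for text in texts.values()]

# Parallel version: one worker per file; map() returns rows in input order.
with ThreadPoolExecutor(max_workers=len(texts)) as pool:
    parallel_matrix = list(
        pool.map(lambda text: count_row(text, vocab_index), texts.values())
    )
```

Because each row depends on a single file, the workers need no communication while counting. In an MPI program the rows would typically be collected on one process afterwards (for example with a gather operation) before the CSV file is written.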

Instructions

To run the program, navigate to the "Words_Bag_Parallelized" folder and execute one of the following commands in your console.
NOTE: Before running the program, you must compile the version you intend to run (serial or parallelized).

Serial Execution Console Code

./BagOfWords_serial.exe ../DATA/files_names.txt ../DATA/vocab.txt 15164 results_serial.csv

Parallelized Execution Console Code

./BagOfWords_parallelized.exe ../DATA/files_names.txt ../DATA/vocab.txt 15164 results_parallel.csv
