Text analysis: find a pattern in the text by using MapReduce algorithm and thread tools in C#

OMER62/TextAnalysis_By_Using_MapReduce

Text analysis: By Using MapReduce & Threads

The following C# code finds a string pattern (which may include any characters) in a text file, using the MapReduce algorithm and threads.

The program takes the following inputs:

Text: the text file to analyze.
Keyword: the string pattern to search for in the text file.
ThreadsNumber: how many threads are used for the search.
Delta: the distance between consecutive letters of the search pattern.

The program produces the following output:
It prints the locations of the keyword in the text file, each in the form
[row number, location in the row], where a match takes the Delta parameter into account. Both indices are zero-based, as in the examples below.

NetaLavi.txt:
[Row 0]:Neta Lavi is the Best
[Row 1]:Football player in the world
[Row 2]:Neta Lavi6 is a legend


Example 1:

Input:

Text: NetaLavi.txt • Keyword: Neta • ThreadsNumber: 2 • Delta: 0

Output:

[0,0], [2,0]

Example 2:

Input:

Text: NetaLavi.txt • Keyword: Ni • ThreadsNumber: 4 • Delta: 7

Output:

[0,3], [1,0]

Example 3:

Input:

Text: NetaLavi.txt • Keyword: al • ThreadsNumber: 3 • Delta: 2

Output:

[0,3], [1,5], [2,3], [2,14]


My program is based on the MapReduce algorithm and on regular expressions.

Regular Expression: A regular expression is a sequence of characters that specifies a search pattern in the text. Usually, such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
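As a small illustration of the "find" use case, the sketch below collects the start index of every match of a pattern in one line of text with .NET's Regex class. The class name RegexDemo and the helper FindAll are hypothetical names chosen for this example; they are not taken from the repository.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Returns the start index of every non-overlapping match of `pattern` in `line`.
    public static List<int> FindAll(string line, string pattern)
    {
        var positions = new List<int>();
        foreach (Match m in Regex.Matches(line, pattern))
        {
            positions.Add(m.Index);
        }
        return positions;
    }

    public static void Main()
    {
        // "Neta" occurs once in this row, at column 0, so this prints 0.
        Console.WriteLine(string.Join(",", FindAll("Neta Lavi is the Best", "Neta")));
    }
}
```

Note that Regex.Matches reports non-overlapping matches only, which matters when a pattern can match at positions that overlap each other.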

The MapReduce algorithm: MapReduce is a programming model and an associated implementation for processing and generating big data sets in parallel. The algorithm consists of two main tasks, Map and Reduce. The Mapper class takes the input and maps it to key-value pairs. The Reducer class takes the Mapper's output as its input, searches for matching pairs, and reduces them.

Why use these two methods?

  1. MapReduce divides a task into small parts and assigns them to multiple systems; in technical terms, it sends the Map and Reduce tasks to the appropriate servers. In my program, the search runs on threads, which play the role of the multiple systems.
  2. Regular expressions match patterns in text, so the program can easily recognize and manipulate the contents of the text file. The search over the text must be fast, and regular expressions make that possible.
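To make the "threads as multiple systems" idea concrete, here is a minimal sketch of one way to divide the rows of a text among ThreadsNumber threads. The contiguous-range split and the names ThreadSplitDemo and Partition are assumptions made for illustration; the repository may divide the work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ThreadSplitDemo
{
    // Splits the row indices 0..lineCount-1 into at most `threadsNumber`
    // contiguous half-open ranges [Start, End).
    public static List<(int Start, int End)> Partition(int lineCount, int threadsNumber)
    {
        if (threadsNumber < 1) throw new ArgumentOutOfRangeException(nameof(threadsNumber));
        var ranges = new List<(int Start, int End)>();
        int chunk = (lineCount + threadsNumber - 1) / threadsNumber; // ceiling division
        for (int start = 0; start < lineCount; start += chunk)
        {
            ranges.Add((start, Math.Min(start + chunk, lineCount)));
        }
        return ranges;
    }

    public static void Main()
    {
        string[] lines = { "Neta Lavi is the Best", "Football player in the world", "Neta Lavi6 is a legend" };
        var ranges = Partition(lines.Length, 2);
        var owner = new int[lines.Length]; // owner[row] = index of the thread that handled the row
        var threads = new List<Thread>();
        for (int i = 0; i < ranges.Count; i++)
        {
            int id = i; // local copy so each lambda captures its own value
            var thread = new Thread(() =>
            {
                for (int row = ranges[id].Start; row < ranges[id].End; row++) owner[row] = id;
            });
            threads.Add(thread);
            thread.Start();
        }
        foreach (var thread in threads) thread.Join(); // wait for every worker before reading results
        for (int row = 0; row < lines.Length; row++)
            Console.WriteLine($"row {row} -> thread {owner[row]}");
    }
}
```

Each thread writes only to the slots of its own rows, so no locking is needed in this sketch.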

My algorithm has three main steps:

Step A: Data pre-processing

Step B: The Map phase

Step C: The Reduce phase

Step A: Data pre-processing

In this step, we compute the following four elements:

1. (Regex) rx_For_Pre_Processing: the regex that identifies candidate strings.

2. (Regex) rx_For_Match: the regex that identifies matching strings.

3. (int) CandidateStringLen: the length of a candidate string.

4. (string[]) Text Lines: the text file read into an array, where each cell holds one row of the file.
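The four elements above could be computed along the following lines. This is a sketch under stated assumptions, not the repository's code: it assumes Delta is the number of arbitrary characters between consecutive keyword letters (so Delta 0 is a plain substring search, as in Example 1), that a candidate is any position holding the first keyword character, and that the keyword is non-empty. The class and method names are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PreProcessing
{
    // Assumed length of a full match: the keyword letters plus `delta`
    // skipped characters between each pair of consecutive letters.
    public static int CandidateStringLen(string keyword, int delta)
        => keyword.Length + delta * (keyword.Length - 1);

    // Assumed candidate regex: matches the first keyword character only,
    // so every position where a match could start is reported.
    public static Regex BuildCandidateRegex(string keyword)
        => new Regex(Regex.Escape(keyword.Substring(0, 1)));

    // Assumed match regex: the keyword letters separated by exactly `delta`
    // arbitrary characters, anchored at the start of the candidate string.
    public static Regex BuildMatchRegex(string keyword, int delta)
    {
        var sb = new StringBuilder("^");
        for (int i = 0; i < keyword.Length; i++)
        {
            if (i > 0 && delta > 0) sb.Append(".{").Append(delta).Append('}');
            sb.Append(Regex.Escape(keyword[i].ToString())); // escape so any character can be searched
        }
        return new Regex(sb.ToString());
    }

    public static void Main()
    {
        // In the real program the rows would come from File.ReadAllLines(path).
        string[] textLines = { "Neta Lavi is the Best", "Football player in the world", "Neta Lavi6 is a legend" };
        Console.WriteLine(textLines.Length);             // prints 3
        Console.WriteLine(BuildMatchRegex("al", 2));     // prints ^a.{2}l
        Console.WriteLine(CandidateStringLen("al", 2));  // prints 4
    }
}
```

Escaping every keyword character with Regex.Escape is what allows the keyword to include characters that are otherwise regex metacharacters.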

Step B: The Map phase

The Map task is carried out by the Mapper class.

The Map phase takes the text input and emits the strings that are candidates for a match, in the following representation:

(<key, Val> : <Row_In_the_text_file, Location_In_the_row>).

The Map phase uses a regular expression (rx_For_Pre_Processing) to find the candidates.
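A minimal sketch of such a Map phase follows. It assumes the rows are dealt to the threads round-robin and that each thread collects its own <row, location> pairs, which are merged after all threads finish; the name MapPhase and this work split are illustrative assumptions rather than the repository's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;

public static class MapPhase
{
    // Emits one (Row, Col) pair for every candidate position found by `candidateRx`.
    public static List<(int Row, int Col)> Map(string[] lines, Regex candidateRx, int threadsNumber)
    {
        var perThread = new List<(int Row, int Col)>[threadsNumber];
        var threads = new Thread[threadsNumber];
        for (int t = 0; t < threadsNumber; t++)
        {
            int id = t; // local copy so each lambda captures its own value
            perThread[id] = new List<(int Row, int Col)>();
            threads[id] = new Thread(() =>
            {
                // Thread `id` handles rows id, id + threadsNumber, id + 2 * threadsNumber, ...
                for (int row = id; row < lines.Length; row += threadsNumber)
                    foreach (Match m in candidateRx.Matches(lines[row]))
                        perThread[id].Add((row, m.Index));
            });
            threads[id].Start();
        }
        foreach (var thread in threads) thread.Join();
        // Merge the per-thread lists and sort them into reading order.
        return perThread.SelectMany(p => p).OrderBy(p => p.Row).ThenBy(p => p.Col).ToList();
    }

    public static void Main()
    {
        string[] lines = { "Neta Lavi is the Best", "Football player in the world", "Neta Lavi6 is a legend" };
        var candidates = Map(lines, new Regex("N"), 2);
        // Prints [0,0], [2,0]: the only rows containing an upper-case N.
        Console.WriteLine(string.Join(", ", candidates.Select(c => $"[{c.Row},{c.Col}]")));
    }
}
```

Giving each thread its own list avoids shared mutable state, and a Regex instance can safely be shared between threads because its matching methods are thread-safe.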

Step C: The Reduce phase

The Reduce task is carried out by the Reducer class.

The Reduce phase (the searching step) accepts the output of the Map phase as key-value pairs of Row_In_the_text_file and Location_In_the_row.

The reducer checks every key-value pair and prints to the screen the strings that match the matching regular expression (rx_For_Match).
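The Reduce phase can be sketched as a filter over the candidate pairs. The version below assumes each candidate is checked by cutting a substring of CandidateStringLen characters at the candidate's location and testing it against an anchored match regex; the class name ReducePhase and the method signature are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ReducePhase
{
    // Keeps only the candidates whose substring of length `candidateLen` matches `matchRx`.
    public static List<(int Row, int Col)> Reduce(
        string[] lines, IEnumerable<(int Row, int Col)> candidates, Regex matchRx, int candidateLen)
    {
        var hits = new List<(int Row, int Col)>();
        foreach (var (row, col) in candidates)
        {
            string line = lines[row];
            if (col + candidateLen > line.Length) continue; // candidate runs past the end of the row
            if (matchRx.IsMatch(line.Substring(col, candidateLen))) hits.Add((row, col));
        }
        return hits;
    }

    public static void Main()
    {
        string[] lines = { "Neta Lavi is the Best", "Football player in the world", "Neta Lavi6 is a legend" };
        var candidates = new List<(int Row, int Col)> { (0, 0), (2, 0) }; // as a Map phase might emit for "N"
        var hits = Reduce(lines, candidates, new Regex("^Neta"), 4);
        // Prints [0,0], [2,0], which agrees with Example 1 (Keyword: Neta, Delta: 0).
        Console.WriteLine(string.Join(", ", hits.Select(h => $"[{h.Row},{h.Col}]")));
    }
}
```

Checking a fixed-length substring at each candidate, instead of running one regex over the whole row, also finds matches that overlap one another.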
