GitHub - ookraw/commom-substrings-filter: A filter is presented that reduces big ‘common substring problems' to a more manageable size. The approach uses ‘shingling’ and ‘fingerprinting’: In a first step the fingerprints of the reference string are scattered over a large hash map. Gathered in a second step, these hash values are matched against the values obtained from the test string.

Common Substrings Filter

Designed and coded by Felix Baessler, felix.baessler@gmail.com

Summary

This project is closely related to the file based version of the well-known common substring problem.
Given on auxiliary storage:

a test data set containing a very big string S, and
a reference data set of a much smaller string s,

find all cross-repetitions defined by those strings of lengths >= LP,
which are substrings of both the reference and the test string.
A variant is LCS, the longest common substring problem.

Reference and test string are composed of IID distributed byte sequences. Their lengths underlie certain restrictions:

ns [bytes]: the length of the reference string s,
is limited by the size of the available system memory (usually several giga bytes)
NS [bytes]: the length of the test string S, is orders of magnitudes bigger;
its limitation is implied by the filtration ratio.

The objective is to reduce big ‘common substring problems' to a more manageable size.

Based on the concepts:

Shingle: the smallest information unit consists of a substring of length L that overlaps on L-1 bytes with the neighbors, like roof shingles.
Fingerprint: the hash(b,m) value of a shingle, base b and modulus m are the parameters of the Rabin-Karp Rolling Hash (aka signature algorithm).
Diversification: to alleviate the locality problem of fingerprinting, the hashes are diversified as presented in On_Finding_Common_Substrings_between_two_Large_Files.

the general idea is to eliminate those test shingles that have no matching fingerprint among the reference shingles.

Implementation

The project consists of three C++ programs.

A) master
generates on disk a long enough IID distributed byte sequence of minimum ns + NS bytes. The master disk file comprises the

reference data set (ns bytes) concatenated with the
test data set (NS bytes)

Seamless shingling of the master file, suggests that the reference data overlaps with the first L-1 bytes of the test data, so that

the last reference shingle starts at position : ns - 1
the first test shingle starts at position : ns

Because of this overlap, the reference string appears lengthened by L-1 bytes in the present implementation.

B) scatter
reads the reference data (n= ns shingles), creates the fingerprint map and writes the result to the map file.
The map can be viewed as a minimalistic hash table reduced to m one-bit slots. Scatter will mark those slots that correspond to the hash value modulo m (fingerprint) of the reference shingles.

C) gather
loads the map file into RAM, reads the big test data set (N= NS-L+1 shingles) and filters the test shingles by means of the map.
It turns out that the most time consuming operation consists in reading the map, when the fingerprints of a large amount of test shingles are checked via map against the fingerprints of the reference shingles.
Run on an ordinary laptop, the throughput is of the order of 20 MB/s.
An output example is given in the Appendix of the long write-up: On_Finding_Common_Substrings_between_two_Large_Files.

Batchwise Processing
both scatter and gather distribute their workload on three threads:

thread 1: reads a batch of shingles into memory (RAM)
thread 2: from the shingle batch the thread produces a hash batch
thread 3: the hash batch is mapped:
- scatter (write to map) : slots are marked free -> occupied
- gather (read from map): slots are checked free / occupied

In the present implementation logical processor 3 is reserved for thread 3, which guaranties that mapping takes place within the same thread.

Using three containers (a,b,c), each containing a shingle and the corresponding hash batch, scatter and gather advances stage by stage as illustrated by the example:

workload processed in 5 batches on 7 stages
thread 1 2 3 4 5 6 7 -> stage (time)
   1       a b c a b
   2          a b c a b
   3             a b c a b
For example at stage 4 the threads (1,2,3) are simultaneously busy with the batches in containers (a,c,b).

Description

For a more detailed write-up see: On_Finding_Common_Substrings_between_two_Large_Files.

LICENSE

This project is released under CC-BY-NC 4.0.
The licensing TLDR is: You are free to use, copy, distribute and transmit this Software for personal, non-commercial purposes, as long as you give attribution and share any modifications under the same license. Commercial or for-profit use requires a license.
For more details see the LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
LICENSE		LICENSE
gather_v1.cpp		gather_v1.cpp
master.cpp		master.cpp
readme.md		readme.md
scatter_v1.cpp		scatter_v1.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LICENSE

LICENSE

gather_v1.cpp

gather_v1.cpp

master.cpp

master.cpp

readme.md

readme.md

scatter_v1.cpp

scatter_v1.cpp

Repository files navigation

Common Substrings Filter

Summary

Implementation

Description

LICENSE

About

Releases

Packages

Languages

License

ookraw/commom-substrings-filter

Folders and files

Latest commit

History

Repository files navigation

Common Substrings Filter

Summary

Implementation

Description

LICENSE

About

Topics

Resources

License

Stars

Watchers

Forks

Languages