Skip to content

A filter is presented that reduces big ‘common substring problems' to a more manageable size. The approach uses ‘shingling’ and ‘fingerprinting’: In a first step the fingerprints of the reference string are scattered over a large hash map. Gathered in a second step, these hash values are matched against the values obtained from the test string.

License

Notifications You must be signed in to change notification settings

ookraw/commom-substrings-filter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Common Substrings Filter

Designed and coded by Felix Baessler, felix.baessler@gmail.com

Summary

This project is closely related to the file based version of the well-known common substring problem.
Given on auxiliary storage:

  • a test data set containing a very big string S, and
  • a reference data set of a much smaller string s,

find all cross-repetitions defined by those strings of lengths >= LP,
which are substrings of both the reference and the test string.
A variant is LCS, the longest common substring problem.

Reference and test string are composed of IID distributed byte sequences. Their lengths underlie certain restrictions:

  • ns [bytes]: the length of the reference string s,
    is limited by the size of the available system memory (usually several giga bytes)
  • NS [bytes]: the length of the test string S, is orders of magnitudes bigger;
    its limitation is implied by the filtration ratio.

The objective is to reduce big  ‘common substring problems'  to a more manageable size.

Based on the concepts:

  • Shingle: the smallest information unit consists of a substring of length L that overlaps on L-1 bytes with the neighbors, like roof shingles.
  • Fingerprint: the hash(b,m) value of a shingle, base b and modulus m are the parameters of the Rabin-Karp Rolling Hash (aka signature algorithm).
  • Diversification: to alleviate the locality problem of fingerprinting, the hashes are diversified as presented in On_Finding_Common_Substrings_between_two_Large_Files.

the general idea is to eliminate those test shingles that have no matching fingerprint among the reference shingles.

Implementation

The project consists of three C++ programs.

A) master
generates on disk a long enough IID distributed byte sequence of minimum ns + NS bytes. The master disk file comprises the

  • reference data set (ns bytes) concatenated with the
  • test data set (NS bytes)

Seamless shingling of the master file, suggests that the reference data overlaps with the first L-1 bytes of the test data, so that

  • the last reference shingle starts at position : ns - 1
  • the first test shingle starts at position : ns

Because of this overlap, the reference string appears lengthened by L-1 bytes in the present implementation.

B) scatter
reads the reference data (n= ns shingles), creates the fingerprint map and writes the result to the map file.
The map can be viewed as a minimalistic hash table reduced to m one-bit slots. Scatter will mark those slots that correspond to the hash value modulo m (fingerprint) of the reference shingles.

C) gather
loads the map file into RAM, reads the big test data set (N= NS-L+1 shingles) and filters the test shingles by means of the map.
It turns out that the most time consuming operation consists in reading the map, when the fingerprints of a large amount of test shingles are checked via map against the fingerprints of the reference shingles.
Run on an ordinary laptop, the throughput is of the order of 20 MB/s.
An output example is given in the Appendix of the long write-up:   On_Finding_Common_Substrings_between_two_Large_Files.

Batchwise Processing
both scatter and gather distribute their workload on three threads:

  • thread 1: reads a batch of shingles into memory (RAM)
  • thread 2: from the shingle batch the thread produces a hash batch
  • thread 3: the hash batch is mapped:
      - scatter (write to map)   : slots are marked  free -> occupied
      - gather (read from map): slots are checked free / occupied

In the present implementation logical processor 3 is reserved for thread 3, which guaranties that mapping takes place within the same thread.

Using three containers (a,b,c), each containing a shingle and the corresponding hash batch, scatter and gather advances stage by stage as illustrated by the example:

workload processed in 5 batches on 7 stages
thread 1 2 3 4 5 6 7 -> stage (time)
   1       a b c a b
   2          a b c a b
   3             a b c a b
For example at stage 4 the threads (1,2,3) are simultaneously busy with the batches in containers (a,c,b).

Description

For a more detailed write-up see:   On_Finding_Common_Substrings_between_two_Large_Files.

LICENSE

This project is released under CC-BY-NC 4.0.
The licensing TLDR is: You are free to use, copy, distribute and transmit this Software for personal, non-commercial purposes, as long as you give attribution and share any modifications under the same license. Commercial or for-profit use requires a license.
For more details see the LICENSE

About

A filter is presented that reduces big ‘common substring problems' to a more manageable size. The approach uses ‘shingling’ and ‘fingerprinting’: In a first step the fingerprints of the reference string are scattered over a large hash map. Gathered in a second step, these hash values are matched against the values obtained from the test string.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages