Skip to content

Boyer-moore in pure python, search for unicode strings in large files quickly

License

Notifications You must be signed in to change notification settings

eriknyquist/boyermoore

Repository files navigation

tests_badge cov_badge codeclimate_badge license_badge version_badge

This is an implementation of the Boyer-Moore substring search algorithm in pure python.

It is a shameless copy-paste of the python reference code provided here, with modifications to support the following additional features:

  • Searching in files without reading the whole file into memory, allowing handling of large files
  • Full unicode support

See the API documentation for more details.

Install from pip.

pip install boyermoore
>>> from boyermoore import search_file
>>>
>>> offsets = search_file("pattern!", "file.txt")                 # Find all occurrences of "pattern!" in file "file.txt"
>>> offsets                                                       # Display found occurrences
[12, 456, 10422]                                                  # Pattern occurs at byte offsets 12, 456, and 104222
>>> from boyermoore import search_file
>>>
>>> offsets = search_file("pattern!", "file.txt", greedy=False)   # Find the first occurrence of "pattern!" in file "file.txt"
>>> offsets                                                       # Display found occurrences
[12]                                                              # First occurrence of pattern is at byte offset 12

The following section illustrates the average speed of the boyermoore.search_file function when searching for a unicode string in files of sizes ranging from 1MB to 2GB.

The test is implemented in the file scripts/speed_test.py if you want to inspect the code yourself.

The test was executed using Python 3.9.13 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz and 32 GB of RAM.

The test searches for all occurrences of a fixed unicode string in a series of test files. The unicode string is:

Hello नमस्ते Привет こんにちは

("Hello" in English, followed by the Hindi translation, followed by the Russian translation, followed by the Japanese translation)

Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0) and one at the very end (byte offset of [file_length - pattern_length]).

The following table shows the times taken to search for all occurences of the unicode string "Hello नमस्ते Привет こんにちは" inside test files of various sizes, and compares it to a linear search of the same data.

File size Boyer-moore time (seconds) Linear time (seconds)
1 MB 0.01 0.08
2 MB 0.02 0.17
32 MB 0.24 2.67
64 MB 0.47 5.24
128 MB 0.93 10.62
256 MB 1.88 21.44
512 MB 4.16 43.76
1 GB 7.44 85.46
2 GB 16.03 175.93

Contributions are welcome, please open a pull request at https://github.com/eriknyquist/boyermoore and ensure that:

  1. All existing unit tests pass (run tests via python setup.py test)
  2. New unit tests are added to cover any modified/new functionality (run python code_coverage.py to ensure that coverage is above 98%)

You will need to install packages required for development, these are listed in dev_requirements.txt:

pip install -r dev_requirements.txt

If you have any questions about / need help with contributions or unit tests, please contact Erik at eknyquist@gmail.com.