lucene-custom-analyzer

(Micro)Library to build Lucene analyzers in a data-driven fashion.

Why Would You Want to Use `lucene-custom-analyzer`?

Current Clojure Lucene libraries (e.g. jaju/lucene-clj, federkasten/clucie) don't provide a mechanism to build custom Lucene Analyzers.
Data-driven.
Extensible through the standard Lucene SPI, i.e. just put a JAR on the classpath.
Lets you specify a directory from which resources are loaded, e.g. synonym dictionaries.
Lucene 9+ supported.
Already includes the most commonly used Lucene analysis components.

Quickstart

Dependencies:

lt.jocas/lucene-custom-analyzer {:mvn/version "1.0.34"}

Code:

(require '[lucene.custom.analyzer :as custom-analyzer])

(custom-analyzer/create
  {:tokenizer              {:standard {:maxTokenLength 4}}
   :char-filters           [{:patternReplace {:pattern     "foo"
                                              :replacement "bar"}}]
   :token-filters          [{:uppercase nil}
                            {:reverseString nil}]
   :offset-gap             2
   :position-increment-gap 3
   :config-dir             "."})
;; =>
;; #object[org.apache.lucene.analysis.custom.CustomAnalyzer
;;         0x4686f87d
;;         "CustomAnalyzer(org.apache.lucene.analysis.pattern.PatternReplaceCharFilterFactory@2f1300,org.apache.lucene.analysis.standard.StandardTokenizerFactory@7e71a244,org.apache.lucene.analysis.core.UpperCaseFilterFactory@54e9f0d6,org.apache.lucene.analysis.reverse.ReverseStringFilterFactory@3e494ba7)"]

Short notation for analysis components:

(custom-analyzer/create
  {:tokenizer :standard
   :char-filters [:htmlStrip]
   :token-filters [:uppercase]})
;; =>
;; #object[org.apache.lucene.analysis.custom.CustomAnalyzer
;;        0x16716eb1
;;        "CustomAnalyzer(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterFactory@4c7f61fa,org.apache.lucene.analysis.standard.StandardTokenizerFactory@6fc69052,org.apache.lucene.analysis.core.UpperCaseFilterFactory@3944ccba)"]
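
The value returned by `create` is a regular `org.apache.lucene.analysis.Analyzer`, so its effect on text can be inspected through the standard Lucene `TokenStream` API. The sketch below is only an illustration: `analyze` is a hypothetical helper written for this example and is not part of the library, and the field name `"field"` is an arbitrary placeholder.

```clojure
(require '[lucene.custom.analyzer :as custom-analyzer])
(import '(org.apache.lucene.analysis Analyzer TokenStream)
        '(org.apache.lucene.analysis.tokenattributes CharTermAttribute))

(defn analyze
  "Hypothetical helper (not provided by lucene-custom-analyzer):
  runs `text` through `analyzer` and returns a vector of token strings."
  [^Analyzer analyzer ^String text]
  ;; with-open closes the TokenStream when we are done
  (with-open [^TokenStream ts (.tokenStream analyzer "field" text)]
    (let [^CharTermAttribute term (.addAttribute ts CharTermAttribute)]
      ;; Lucene requires reset before the first incrementToken call
      (.reset ts)
      (loop [tokens []]
        (if (.incrementToken ts)
          (recur (conj tokens (.toString term)))
          (do (.end ts) tokens))))))

(analyze (custom-analyzer/create {:tokenizer     :standard
                                  :token-filters [:uppercase]})
         "Hello World")
;; => ["HELLO" "WORLD"]
```

A helper like this is handy in the REPL for checking that a filter chain does what you expect before wiring the analyzer into an index.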

If no options are provided then an Analyzer with just the standard tokenizer is created:

(custom-analyzer/create {})
;; =>
;; #object[org.apache.lucene.analysis.custom.CustomAnalyzer
;;         0x456fe86
;;         "CustomAnalyzer(org.apache.lucene.analysis.standard.StandardTokenizerFactory@5703f5b3)"]

To check which analysis components are available, run:

(lucene.custom.analyzer/char-filter-factories)
(lucene.custom.analyzer/tokenizer-factories)
(lucene.custom.analyzer/token-filter-factories)

Design

Under the hood this library uses the factory classes TokenizerFactory, TokenFilterFactory, and CharFilterFactory. The actual factories are loaded with java.util.ServiceLoader. All the available classes are automatically discovered.

If you want to include additional factory classes, e.g. your own implementation of TokenFilterFactory, you need to add two things to the classpath:

1. The class that implements one of the factory classes.
2. A file under META-INF/services named after the factory class it extends, e.g. org.apache.lucene.analysis.TokenFilterFactory, that lists the classes from step 1. Create the file or extend it if it already exists.
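
As a rough illustration of the service registration, assume a hypothetical implementation class `com.example.analysis.MyTokenFilterFactory` (the name is invented for this example). The JAR would then contain a file `META-INF/services/org.apache.lucene.analysis.TokenFilterFactory` with one fully qualified class name per line:

```
# META-INF/services/org.apache.lucene.analysis.TokenFilterFactory
# Hypothetical class name, replace with your own implementation.
com.example.analysis.MyTokenFilterFactory
```

This is the standard `java.util.ServiceLoader` provider-configuration format: lines starting with `#` are comments, and every other non-blank line names one provider class.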

An example can be found here.

Future work

Conditional token filters

License

Distributed under The Apache License, Version 2.0.
