Lucene/Solr Synonym-Expanding EDisMax Parser

This is just a hard-cloned repo of https://github.com/healthonnet/hon-lucene-synonyms

Why?

We want to attach builds automatically to a release to be able to skip the compilation step when provisioning our application. Therefore, we need to set up Travis-CI.
Cloning the repo and adding it to Travis-CI failed because something is wrong with the original repository. See the error below:

git clone https://github.com/healthonnet/hon-lucene-synonyms && cd hon-lucene-synonyms
git add <SOMETHING>
git tag -a 'v2.0.0' -m 'Whatever'
cd ../
git clone --depth=50 --branch=v2.0.0 https://github.com/refinery-platform/hon-lucene-synonyms.git hon-lucene-synonyms-2
git checkout -qf <COMMIT-HASH-OF-LATEST-COMMIT>
fatal: reference is not a tree: 9043856f2a89f9ed75b195acbbaddc53fd145820

The command "git checkout -qf 9043856f2a89f9ed75b195acbbaddc53fd145820" failed and exited with 128 during .

Looking at .git/refs/tags revealed that no reference for tag v2.0.0 was available, even though the tag was visible on GitHub.

Lucene/Solr Synonym-Expanding EDisMax Parser

Current version: 2.0.0 (changelog)

Note: This project is not actively maintained anymore, but pull requests are welcome. :)

Maintainer

Nolan Lawson

Health On the Net Foundation

License

Apache 2.0.

Summary

Extension of the ExtendedDisMaxQueryParserPlugin that splits queries into a "normal" query and a "synonym" query. This enables proper query-time synonym expansion, with no reindexing required.

This also fixes lots of bugs with how Solr typically handles synonyms using the SynonymFilterFactory.

For more details, read my blog post on the subject.

Getting Started

The following tutorial will set up a working synonym-enabled Solr app using the example/ directory from Solr itself, running in Jetty.

Protip: The unit tests will do these steps automatically.

Step 1

Download the latest JAR file, if you have Solr 5.3.1:

hon-lucene-synonyms-2.0.0.jar

Or if you are using an older version of Solr, then you can use the last version of this plugin to support older Solr versions (1.3.5):

JAR	Solr
hon-lucene-synonyms-1.3.5-solr-3.x.jar	3.4.0, 3.5.0, and 3.6.x
hon-lucene-synonyms-1.3.5-solr-4.0.0.jar	4.0.0
hon-lucene-synonyms-1.3.5-solr-4.1.0.jar	4.1.0 and 4.2.x
hon-lucene-synonyms-1.3.5-solr-4.3.0.jar	4.3+
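
When provisioning is scripted, the table above can be expressed as a small lookup. The sketch below is only an illustration: the function name `jar_for_solr` is hypothetical, and matching on the major.minor prefix (for example, treating any 3.4.x like 3.4.0, and reading "4.3+" as later 4.x releases only) is an assumption that goes slightly beyond what the table states. Versions the table does not mention return `None`.

```python
from typing import Optional

PREFIX = "hon-lucene-synonyms-"


def jar_for_solr(version: str) -> Optional[str]:
    """Return the JAR listed for a Solr version, or None if it is not listed.

    Expects at least "major.minor", e.g. "4.2.1".
    """
    if version == "5.3.1":
        return PREFIX + "2.0.0.jar"
    major, minor = (int(part) for part in version.split(".")[:2])
    if major == 3 and minor in (4, 5, 6):
        return PREFIX + "1.3.5-solr-3.x.jar"
    if major == 4:
        if minor == 0:
            return PREFIX + "1.3.5-solr-4.0.0.jar"
        if minor in (1, 2):
            return PREFIX + "1.3.5-solr-4.1.0.jar"
        # Assumption: "4.3+" covers 4.3 and every later 4.x release.
        return PREFIX + "1.3.5-solr-4.3.0.jar"
    return None


print(jar_for_solr("3.6.2"))
```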

Step 2

Download Solr from the Solr home page. For this tutorial, we'll use Solr 3.6.2. You do not need the sources; the tgz or zip file will work fine.

Step 3

Extract the compressed file and cd to the example/ directory.

Step 4

Now, you need to bundle the hon-lucene-synonyms-*.jar file into webapps/solr.war. Below is a script that will work quite nicely on UNIX systems. Be sure to change the /path/to/my/hon-lucene-synonyms-*.jar part before running this script.

mkdir myjar
cd myjar
jar -xf ../webapps/solr.war
cp /path/to/my/hon-lucene-synonyms-*.jar WEB-INF/lib/
jar -cf ../webapps/solr.war *
cd ..

Note that this plugin will not work in any location other than the WEB-INF/lib/ directory of the solr.war itself, because of issues with the ClassLoader.

Update: We have also tested running with the JAR in $SOLR_HOME/lib, and it works (Jetty).
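
On systems without the `jar` tool, the bundling step can probably be done with Python's standard `zipfile` module, since a WAR file is an ordinary ZIP archive. The following is a minimal sketch, not part of this project: the function name `add_jar_to_war` and the `.bak` backup are assumptions made for illustration. Unlike the shell script above, it appends to the archive in place instead of unpacking and repacking it.

```python
import shutil
import zipfile
from pathlib import Path


def add_jar_to_war(war_path: str, jar_path: str) -> str:
    """Append the JAR to WEB-INF/lib/ inside the WAR and return the entry name."""
    war, jar = Path(war_path), Path(jar_path)
    # Keep a backup copy, because the archive is modified in place.
    shutil.copy2(war, str(war) + ".bak")
    entry = "WEB-INF/lib/" + jar.name
    with zipfile.ZipFile(war, "a") as archive:
        if entry in archive.namelist():
            raise ValueError(entry + " is already in " + war.name)
        archive.write(jar, entry)
    return entry
```

A call such as `add_jar_to_war("webapps/solr.war", "/path/to/my/hon-lucene-synonyms-2.0.0.jar")`, run from the example/ directory, would correspond to the script above.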

Step 5

Download example_synonym_file.txt and copy it to the solr/conf/ directory (solr/collection1/conf/ in Solr 4.x).

Step 6

Download example_config.xml and copy the contents into solr/conf/solrconfig.xml (solr/collection1/conf/solrconfig.xml in 4.x), just before the </config> tag at the end.

This defines the analyzer that will be used to generate synonyms.

Protip: You can customize this analyzer based on your synonym set. E.g. if your synonyms are all two words or less, you can safely set maxShingleSize to 2.

Solr 4.3+ Protip: For Solr 4.3 and up, we support loading Tokenizers and Token Filters by service name through the new SPI method. That means you may put synonym instead of solr.SynonymFilterFactory or shingle instead of solr.ShingleFilterFactory, if you'd like to make your configuration more succinct.

Step 7

Start up the app by running java -jar start.jar. Jetty may print a ClassNotFoundException, but it shouldn't matter.

Step 8

In your browser, navigate to

http://localhost:8983/solr/select/?q=dog&debugQuery=on&qf=text&defType=synonym_edismax&synonyms=true

You should see a response like this:

<response>
  ...
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">dog</str>
    <str name="querystring">dog</str>
    <str name="parsedquery">
        +(DisjunctionMaxQuery((text:dog)) (((DisjunctionMaxQuery((text:canis))
        DisjunctionMaxQuery((text:familiaris)))~2) DisjunctionMaxQuery((text:hound))
        ((DisjunctionMaxQuery((text:man's)) DisjunctionMaxQuery((text:best))
        DisjunctionMaxQuery((text:friend)))~3) DisjunctionMaxQuery((text:pooch))))
    </str>
    <str name="parsedquery_toString">
        +((text:dog) ((((text:canis) (text:familiaris))~2) (text:hound)
        (((text:man's) (text:best) (text:friend))~3) (text:pooch)))
    </str>
    <lst name="explain"/>
    <str name="QParser">SynonymExpandingExtendedDismaxQParser</str>
    ...
  </lst>
</response>

Note that the input query dog has been expanded into dog, hound, pooch, canis familiaris, and man's best friend.
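
To check the expansion from a script instead of reading the XML by eye, the debug section can be parsed with Python's standard library. This is a minimal sketch, assuming the response has the XML shape shown above; the helper name `debug_summary` is hypothetical, and `SAMPLE` is an abbreviated copy of that response.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response shown above.
SAMPLE = """<response>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">dog</str>
    <str name="parsedquery_toString">
        +((text:dog) ((((text:canis) (text:familiaris))~2) (text:hound)
        (((text:man's) (text:best) (text:friend))~3) (text:pooch)))
    </str>
    <str name="QParser">SynonymExpandingExtendedDismaxQParser</str>
  </lst>
</response>"""


def debug_summary(xml_text):
    """Extract the parser name and the parsed query from a debugQuery=on response."""
    debug = ET.fromstring(xml_text).find("lst[@name='debug']")
    if debug is None:
        raise ValueError("no debug section; was debugQuery=on set?")

    def text_of(name):
        node = debug.find("str[@name='%s']" % name)
        if node is None or node.text is None:
            return None
        return " ".join(node.text.split())  # collapse line breaks and indentation

    return {
        "QParser": text_of("QParser"),
        "parsedquery_toString": text_of("parsedquery_toString"),
    }


summary = debug_summary(SAMPLE)
print(summary["QParser"])
print(summary["parsedquery_toString"])
```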

Tweaking the results

Boost the non-synonym part to 1.2 and the synonym part to 1.1 by adding synonyms.originalBoost=1.2&synonyms.synonymBoost=1.1:

+((text:dog)^1.2 (((((text:canis) (text:familiaris))~2) (text:hound)
(((text:man's) (text:best) (text:friend))~3) (text:pooch))^1.1))

Apply a minimum "should" match of 75% by adding mm=75%25:

+((text:dog) ((((text:canis) (text:familiaris))~1) (text:hound)
(((text:man's) (text:best) (text:friend))~2) (text:pooch)))
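
Two details of this example are easy to miss: `%25` is simply the URL-encoded percent sign, and the `~1` and `~2` in the output follow from the percentage being rounded down, which is how Solr documents percentage values of `mm`. The sketch below illustrates both. The helper `min_should_match` is hypothetical and covers only the positive-percentage form of `mm`.

```python
from urllib.parse import urlencode


def min_should_match(optional_clauses: int, percent: int) -> int:
    """Clauses that must match for a positive percentage mm, rounded down."""
    return optional_clauses * percent // 100


print(urlencode({"mm": "75%"}))   # mm=75%25
print(min_should_match(2, 75))    # 1, as in ((text:canis) (text:familiaris))~1
print(min_should_match(3, 75))    # 2, as in ((text:man's) (text:best) (text:friend))~2
```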

Observe how phrase queries are properly handled by using q="dog" instead of q=dog:

+((text:dog) ((text:"canis familiaris") (text:hound) (text:"man's best friend") (text:pooch)))

Gotchas

Keep in mind that you must add defType=synonym_edismax and synonyms=true to enable the parser in the first place.

Also, you must either define qf in the query parameters or defaultSearchField in solr/conf/schema.xml, so that the parser knows which fields to use during synonym expansion.

If you enable debugging (with debugQuery=on), the plugin will output helpful information about how synonyms are being expanded.
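
One way to avoid forgetting the required parameters is to build the request URL in one place. The following is a minimal sketch using only the standard library. The function name `synonym_select_url` is hypothetical, and the default base URL is the one from Step 8, which may differ in your setup.

```python
from urllib.parse import urlencode


def synonym_select_url(query, qf=None, extra=None,
                       base="http://localhost:8983/solr/select/"):
    """Build a select URL that always enables the synonym parser."""
    params = {"q": query, "defType": "synonym_edismax", "synonyms": "true"}
    if qf is not None:
        # Omit qf only if defaultSearchField is defined in schema.xml.
        params["qf"] = qf
    params.update(extra or {})
    return base + "?" + urlencode(params)


print(synonym_select_url("dog", qf="text", extra={"debugQuery": "on"}))
# http://localhost:8983/solr/select/?q=dog&defType=synonym_edismax&synonyms=true&qf=text&debugQuery=on
```

Because `urlencode` escapes special characters, a phrase query can be passed as `synonym_select_url('"dog"', qf="text")` without quoting it by hand.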

Query parameters

The following are parameters that you can use to tweak the synonym expansion.

Param	Type	Default	Summary
synonyms	boolean	false	Enable or disable synonym expansion entirely. True if enabled.
synonyms.analyzer	String	null	Name of the analyzer defined in solrconfig.xml to use. (E.g. in the examples, it's myCoolAnalyzer). This must be non-null, if you define more than one analyzer (e.g. for more than one language).
synonyms.originalBoost	float	1.0	Boost value applied to the original (non-synonym) part of the query.
synonyms.synonymBoost	float	1.0	Boost value applied to the synonym part of the query.
synonyms.disablePhraseQueries	boolean	false	True if synonym expansion should be disabled when the user input contains a phrase query (i.e. a quoted query). This option is offered because expansion of phrase queries may be considered non-intuitive to users.
synonyms.constructPhrases	boolean	false	v1.2.2+: True if expanded synonyms should always be treated like phrases (i.e. wrapped in quotes). This option is offered in case your synonyms contain lots of phrases composed of common words (e.g. "man's best friend" for "dog"). Only affects the expanded synonyms; not the original query. See issue #5 for more discussion.
synonyms.ignoreQueryOperators	boolean	false	v1.3.2+: If you treat query operators (e.g. AND and OR) as ordinary words and want the synonyms to be added to the query anyway, set this option to true.
synonyms.bag	boolean	false	v1.3.2+: When false (default), this plugin generates additional synonym queries by using the original query string as a template: dog bite => dog bite, canis familiaris bite, dog chomp, canis familiaris chomp. When true, a simpler "bag of terms" query is created from the synonyms, i.e. dog bite => bite dog chomp canis familiaris. The simpler query performs better but loses positional information. Use with synonyms.constructPhrases to keep synonym phrases such as "canis familiaris".
synonyms.ignoreMM	boolean	false	v1.3.5+: When false (default), the mm param is applied to the original query and to the synonym queries. When true, mm is ignored for the synonym queries and applied only to the original query.
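
The difference between the two `synonyms.bag` modes can be illustrated with a toy model. The sketch below is not the plugin's implementation: it splits on whitespace and uses a hard-coded synonym dictionary (`SYNONYMS`, taken from the `dog bite` example in the table), whereas the plugin finds synonyms through the configured analyzer. Both function names are hypothetical.

```python
from itertools import product

# Toy synonym data taken from the "dog bite" example in the table above.
SYNONYMS = {"dog": ["canis familiaris"], "bite": ["chomp"]}


def template_queries(query):
    """synonyms.bag=false: one query per combination of original word or synonym."""
    choices = [[word] + SYNONYMS.get(word, []) for word in query.split()]
    return [" ".join(combo) for combo in product(*choices)]


def bag_of_terms(query):
    """synonyms.bag=true: the original terms plus every synonym term, positions lost."""
    terms = query.split()
    for word in query.split():
        for synonym in SYNONYMS.get(word, []):
            terms.extend(synonym.split())
    return terms


print(template_queries("dog bite"))
# ['dog bite', 'dog chomp', 'canis familiaris bite', 'canis familiaris chomp']
print(bag_of_terms("dog bite"))
# ['dog', 'bite', 'canis', 'familiaris', 'chomp']
```

The template mode grows multiplicatively with the number of synonyms per word, which is one way to see why the bag mode is described as the cheaper of the two.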

Compile it yourself

Download the code and run:

mvn install

Testing

Python-based unit tests are in the test/ directory. To run them, follow these steps.

First, install dependencies:

sudo pip install nose
sudo pip install solrpy

Then run the tests:

./test.sh

Alternatively, you can run two separate processes: one to run Solr, and the other to run the Python tests. This is better for debugging.

In one tab, run:

./run_solr_for_unit_tests.py

(This downloads, builds, and launches Solr on localhost:8983.)

Then in another tab, do:

nosetests test/

(This runs the Python tests against the live Solr.)

Changelog

v2.0.0
- BREAKING CHANGE: Updated to support Solr 5.3.1. Removed support for older versions of Solr.
- Note that as of Lucene 5.2.0, when synonyms are parsed, original terms are now correctly marked as type word instead of type synonym (LUCENE-6400).
v1.3.5
- Added synonyms.ignoreMM option
v1.3.4
- Fixed #41 thanks to @rpialum.
v1.3.3
- Fixed #33: synonyms are now weighted equally, regardless of how many there are per word.
- Fixed #31: synonyms are no longer given extra weight when using the params bq, bf, and boost.
- debugQuery=on now gives more helpful debug output.
- Fixed #9, #26, #32, and #34. Note that this is a documentation change, not a code change, so to get the benefits of this "fix," you'll need to manually perform Step 6 again.
v1.3.2
- Added synonyms.ignoreQueryOperators option (#28)
- Added synonyms.bag option (#30)
- The run_solr_for_unit_tests.py script now downloads the proper version of Solr.
v1.3.1
- Avoid luceneMatchVersion in config (#20)
v1.3.0
- Added support for Solr 4.3.0 (#19)
- New way of loading Tokenizers and TokenFilters
- New XML syntax for config in solrconfig.xml
v1.2.3
- Fixed #16
- Verified support for Solr 4.2.0 with the 4.1.0 branch (unit tests passed)
- Improved automation of unit tests
v1.2.2
- Added synonyms.constructPhrases option to fix #5
- Added proper handling for phrase slop settings
v1.2.1
- Added support for Solr 4.1.0 (#4)
v1.2
- Added support for Solr 4.0.0 (#3)
v1.1
- Added support for Solr 3.6.1 and 3.6.2 (#1)
- Added "Getting Started" instructions to clarify plugin usage (#2)
v1.0
- Initial release
