Alan Woodward edited this page Sep 23, 2016 · 2 revisions

So you’ve added a bunch of queries to your luwak Monitor, and you’ve tried matching some documents against them, and you’re not getting the exact answers you want. How can you find out what’s going wrong?

Check the errors in your response

Rather than abort a match run because of one bad query, luwak catches exceptions thrown during matching and reports them as part of the final Matches object. So if a query isn’t matching a document and you expect it to, look at the return value of Matches.getErrors() and see whether the query is throwing an Exception.
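To make the behaviour described above concrete, here is a minimal, self-contained sketch of the "collect errors instead of aborting" pattern. It is not luwak's code: the Result class, the matchAll method and the use of Predicate as a stand-in for a query are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ErrorCollectingMatchSketch {

    /** Hypothetical stand-in for a match result: matched query ids plus per-query errors. */
    static final class Result {
        final List<String> matchedIds = new ArrayList<>();
        final Map<String, Exception> errors = new LinkedHashMap<>();
    }

    /** Runs every query against the document; a failing query is recorded, not rethrown. */
    static Result matchAll(Map<String, Predicate<String>> queries, String document) {
        Result result = new Result();
        for (Map.Entry<String, Predicate<String>> e : queries.entrySet()) {
            try {
                if (e.getValue().test(document)) {
                    result.matchedIds.add(e.getKey());
                }
            } catch (RuntimeException ex) {
                // One bad query must not stop the remaining queries from running.
                result.errors.put(e.getKey(), ex);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> queries = new LinkedHashMap<>();
        queries.put("good", doc -> doc.contains("periwinkle"));
        queries.put("broken", doc -> { throw new IllegalStateException("bad query"); });

        Result result = matchAll(queries, "a periwinkle document");
        System.out.println("matched: " + result.matchedIds);
        // Inspect the errors before concluding that a query simply did not match.
        result.errors.forEach((id, ex) -> System.out.println(id + " failed: " + ex.getMessage()));
    }
}
```

In this sketch the "broken" query looks like a plain non-match unless the error map is inspected, which is the situation Matches.getErrors() is there to expose.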

Check that your DocumentBatch is correctly constructed

Quite often, matching errors are in fact down to problems with tokenization during document analysis. Running the query directly against the searcher for a particular batch bypasses the presearcher entirely, so it tells you whether the query matches the documents as they were actually analysed:

TopDocs td = batch.getSearcher().search(query, 10);

Check that the query is being selected by your Presearcher

Luwak speeds up matching by analysing queries as they are added to the Monitor and then, at match time, running only those queries that it considers likely to match a given document. This is fertile ground for bugs. To ensure that your query is actually being selected by the presearcher, you can do one of two things:

  • check the getPresearcherHits() values on your Matches response
  • run your Monitor with a MatchAllPresearcher to ensure that every query is selected for matching

If the presearcher isn’t selecting your query, and it should be, then you have a bug.
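The idea behind the two checks above can be sketched with a toy model. This is not luwak's implementation: selectByTerm and selectAll are hypothetical helpers that stand in for a term-filtering presearcher and a match-all presearcher, and each query is reduced to a single indexed term for simplicity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PresearcherSelectionSketch {

    /** Selects the ids of queries whose indexed term occurs in the document's tokens. */
    static Set<String> selectByTerm(Map<String, String> indexedTermByQueryId, Set<String> docTokens) {
        Set<String> selected = new TreeSet<>();
        for (Map.Entry<String, String> e : indexedTermByQueryId.entrySet()) {
            if (docTokens.contains(e.getValue())) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    /** Selects every query id, mimicking what a match-all presearcher does. */
    static Set<String> selectAll(Map<String, String> indexedTermByQueryId) {
        return new TreeSet<>(indexedTermByQueryId.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> indexed = new LinkedHashMap<>();
        indexed.put("q1", "periwinkle");
        // Deliberately wrong: indexed with different casing than the document tokens.
        indexed.put("q2", "Verbiage");

        Set<String> docTokens = new HashSet<>(Arrays.asList("periwinkle", "verbiage"));

        Set<String> filtered = selectByTerm(indexed, docTokens);
        Set<String> all = selectAll(indexed);

        // Queries present in 'all' but absent from 'filtered' were dropped by the filter.
        Set<String> dropped = new TreeSet<>(all);
        dropped.removeAll(filtered);
        System.out.println("selected by term filter: " + filtered);
        System.out.println("dropped by term filter:  " + dropped);
    }
}
```

Here q2 never gets a chance to run under the term filter but is selected by the match-all variant. A query that matches only when every query is selected points to the same kind of problem in a real Monitor.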

Debugging the TermFilteredPresearcher

The standard presearcher shipped with luwak is the TermFilteredPresearcher, which works by analysing queries as they are added to the Monitor and extracting combinations of terms that a document must have in order to match the query. Internally, a query is mapped to a tree-like structure called a QueryTree by a QueryTreeBuilder, and then terms are extracted from this tree using a TreeWeightor. The MultipassTermFilteredPresearcher does this several times, extracting different combinations of terms each time. Bugs can occur here if queries are not analysed correctly.

You can get an explanation of how a particular query is being analysed using TermFilteredPresearcher.showQueryTree(Query, PrintStream). This will write a schematic representation of the analysis to a print stream, showing the terms taken from a query, the weights assigned to those terms, and the subset of terms ultimately selected for indexing.

As an example, take the query +field:horsten field:thurston +(+(field:periwinkle field:flibbertigibbet) +field:verbiage). Calling showQueryTree with this query yields the following:

Conjunction[2] 3.8506389 [EXACT field:periwinkle, EXACT field:flibbertigibbet]
	Conjunction[2] 3.8506389 [EXACT field:periwinkle, EXACT field:flibbertigibbet]
		Disjunction[2] 3.8506389 { [EXACT field:periwinkle] [EXACT field:flibbertigibbet] }
			Node [EXACT field:periwinkle] 3.8506389
			Node [EXACT field:flibbertigibbet] 3.966673
		Node [EXACT field:verbiage] 3.7278461
	Node [EXACT field:horsten] 3.6326308

The top level is a conjunction node with two entries (the additional SHOULD clause field:thurston is discarded, because it cannot cause a match on its own: any matching document must also contain the required clauses). Stepping down through the hierarchy, we can see at each level which terms are selected by that node, with their weights. A conjunction selects whichever of its child nodes has the highest weight, while a disjunction selects all of its child nodes and assigns itself the lowest of their weights. If a query is not being broken up correctly, or its terms are somehow being mangled, you should be able to see it here.
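The selection rules described above can be checked by hand with a small model. The sketch below is not luwak's QueryTree code: the Node class and the leaf, conjunction and disjunction helpers are hypothetical. The weights are copied from the example output rather than computed, since in luwak they come from the TreeWeightor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryTreeSelectionSketch {

    /** Hypothetical node: the terms it selects and the weight it reports. */
    static final class Node {
        final List<String> terms;
        final double weight;

        Node(List<String> terms, double weight) {
            this.terms = terms;
            this.weight = weight;
        }
    }

    static Node leaf(String term, double weight) {
        return new Node(Arrays.asList(term), weight);
    }

    /** A conjunction selects the single child with the highest weight (needs at least one child). */
    static Node conjunction(Node... children) {
        Node best = children[0];
        for (Node child : children) {
            if (child.weight > best.weight) {
                best = child;
            }
        }
        return best;
    }

    /** A disjunction selects all of its children's terms and takes the lowest child weight. */
    static Node disjunction(Node... children) {
        List<String> terms = new ArrayList<>();
        double lowest = Double.POSITIVE_INFINITY;
        for (Node child : children) {
            terms.addAll(child.terms);
            lowest = Math.min(lowest, child.weight);
        }
        return new Node(terms, lowest);
    }

    public static void main(String[] args) {
        // Weights are taken verbatim from the showQueryTree output above.
        Node inner = disjunction(
                leaf("field:periwinkle", 3.8506389),
                leaf("field:flibbertigibbet", 3.966673));
        Node middle = conjunction(inner, leaf("field:verbiage", 3.7278461));
        Node top = conjunction(middle, leaf("field:horsten", 3.6326308));
        System.out.println(top.terms + " " + top.weight);
    }
}
```

Under these rules the top node ends up with field:periwinkle and field:flibbertigibbet at weight 3.8506389, which agrees with the first line of the example output. The disjunction outweighs field:verbiage, and the resulting inner conjunction outweighs field:horsten.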