Creates a TokenMetadataStore to return startPosition of tokens in results #79

cambridgemike · 2014-04-01T05:43:47Z

I took a stab at prototyping a solution for #25, and took your comments in #58 into consideration.

I think you're right that the long-term goal should be to refactor and replace existing tokens with a lunr.Token. Unfortunately this seems a bit overwhelming, as it presents a lot of complexities with previously serialized dataStores and complicates the TokenStore. As a step in the right direction, I propose a lunr.TokenMetadataStore, which maps doc refs and indexed tokens to a lunr.Token. Here, an "indexed token" is a present-day token (a string), and a lunr.Token is an object I introduce to encapsulate metadata about tokens (like startPosition). This MetadataStore lives on the sidelines: it is only added to when a document is indexed, and data is only retrieved from it when a document is surfaced in search results.

So now you'll get back a result set that looks like

// idx.add({ id: 2, 
//   body: "Some are born great, some achieve greatness, and some have greatness thrust upon them." 
// })
// idx.search("greatness")

[{
    "ref": 2,
    "score": 0.12345,
    "tokens": [
      {indexedAs: "great", raw: "great", startPos: 15, field: "body"},
      {indexedAs: "great", raw: "greatness", startPos: 35, field: "body"},
      {indexedAs: "great", raw: "greatness", startPos: 60, field: "body"}
    ]
}]
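
As an illustration of what a consumer might do with these positions, here is a minimal sketch that wraps each matched token in square brackets. The `highlight` helper is hypothetical and not part of lunr or of this patch. It assumes `startPos` is 1-based (which is what the example values above suggest) and that `raw` has the same length as the matched text in the document.

```javascript
// Hypothetical helper, not part of lunr: marks each matched token with [ ].
// Assumption: startPos is 1-based and raw.length equals the matched length.
function highlight(text, tokens) {
  // Process matches from last to first so earlier offsets stay valid
  // while brackets are being inserted.
  var sorted = tokens.slice().sort(function (a, b) {
    return b.startPos - a.startPos
  })

  return sorted.reduce(function (out, token) {
    var start = token.startPos - 1
    var end = start + token.raw.length
    return out.slice(0, start) + "[" + out.slice(start, end) + "]" + out.slice(end)
  }, text)
}

var body = "Some are born great, some achieve greatness, and some have greatness thrust upon them."
var tokens = [
  {indexedAs: "great", raw: "great", startPos: 15, field: "body"},
  {indexedAs: "great", raw: "greatness", startPos: 35, field: "body"},
  {indexedAs: "great", raw: "greatness", startPos: 60, field: "body"}
]

var highlighted = highlight(body, tokens)
```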

Overall, I created three new object types:

lunr.Token which is a container for metadata like startPosition
lunr.TokenList is essentially an array of lunr.Tokens, but has some helper methods to extract the indexed tokens and the raw tokens.
lunr.TokenMetadataStore is a dataStore as described above.
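
To make the relationship between the three types concrete, here is a rough sketch using stand-in constructors. The names, fields, and methods (`indexedTokens`, `rawTokens`, `add`, `get`) are assumptions chosen to mirror the description above, not the actual code in this pull request.

```javascript
// Illustrative stand-ins only; not the implementation from this patch.
function Token(raw, startPos, field) {
  this.raw = raw
  this.indexedAs = raw // assumed to be overwritten after the pipeline runs
  this.startPos = startPos
  this.field = field
}

function TokenList(tokens) {
  this.tokens = tokens || []
}

TokenList.prototype.indexedTokens = function () {
  return this.tokens.map(function (t) { return t.indexedAs })
}

TokenList.prototype.rawTokens = function () {
  return this.tokens.map(function (t) { return t.raw })
}

function TokenMetadataStore() {
  // docRef -> indexed token -> array of Token.
  // Null-prototype objects avoid clashes with keys such as "constructor".
  this.store = Object.create(null)
}

TokenMetadataStore.prototype.add = function (docRef, token) {
  var byToken = this.store[docRef] || (this.store[docRef] = Object.create(null))
  var list = byToken[token.indexedAs] || (byToken[token.indexedAs] = [])
  list.push(token)
}

TokenMetadataStore.prototype.get = function (docRef, indexedToken) {
  var byToken = this.store[docRef]
  return (byToken && byToken[indexedToken]) || []
}

// Usage: two surface forms of the same indexed token in document 2.
var t1 = new Token("great", 15, "body")
var t2 = new Token("greatness", 35, "body")
t2.indexedAs = "great"

var list = new TokenList([t1, t2])
var store = new TokenMetadataStore()
store.add(2, t1)
store.add(2, t2)
var hits = store.get(2, "great")
```

Keying the store by doc ref first keeps lookups cheap at search time, since metadata is only needed for the documents that actually appear in the results.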

The only changes I had to make to the existing codebase are outlined as follows:

pipeline.js#run

This is probably the most substantial change I made. Running the pipeline now returns a lunr.TokenList instead of an array of strings. This works with old tokenizers that return strings or new tokenizers that return lunr.Tokens. The pipeline passes string tokens to the stack, so backwards compatibility is preserved for 3rd party pipeline functions and 3rd party tokenizers.
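
The backwards-compatible behaviour can be sketched roughly as follows. This is a hypothetical `runPipeline` function, not the patched `lunr.Pipeline.prototype.run`: it accepts strings or token objects, hands only strings to the pipeline functions (using the `(token, index, tokens)` argument order that lunr pipeline functions receive), and returns plain metadata objects rather than a lunr.TokenList.

```javascript
// Hypothetical sketch, not the code from this patch.
// stack: array of functions (token, index, tokens) -> string | undefined
// tokens: array of strings or {raw, startPos, field} objects
function runPipeline(stack, tokens) {
  var out = []

  // Pipeline functions only ever see strings.
  var strings = tokens.map(function (t) {
    return typeof t === "string" ? t : t.raw
  })

  for (var i = 0; i < tokens.length; i++) {
    var value = strings[i]

    for (var j = 0; j < stack.length; j++) {
      value = stack[j](value, i, strings)
      if (value === undefined) break // token dropped, e.g. a stop word
    }

    if (value === undefined) continue

    var input = tokens[i]
    var meta = typeof input === "string" ? { raw: input } : input
    out.push({
      raw: meta.raw,
      indexedAs: value,
      startPos: meta.startPos,
      field: meta.field
    })
  }

  return out
}

// Toy pipeline functions, for illustration only.
var dropAnd = function (t) { if (t !== "and") return t }
var toyStem = function (t) { return t.replace(/ness$/, "") }

var processed = runPipeline([dropAnd, toyStem], [
  "great",
  { raw: "greatness", startPos: 35, field: "body" },
  "and"
])
```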

index.js#add

Updated this method to store the lunr.Tokens returned by the pipeline in the lunr.tokenMetadataStore. I added a configuration variable on the lunr.Index.prototype called useTokenMetadata which controls this behavior.

index.js#search

Once a list of documents is found, return a list of lunr.Tokens that are associated with the document and were in the query string.

tokenizer.js

This was a pretty big change, and I did a quick and dirty job. I tried to keep the runtime relatively sane, but didn't worry too much about code organization. This should definitely be cleaned up if/when a merge happens.

Notes:

Drawbacks: This will use up a lot of memory, since we create an individual token for every single string that gets indexed.
Nomenclature: I started using the term "index token" to describe a present-day "token", i.e. a string that has been run through the pipeline (and will eventually end up in the index). A "raw token" is a string with the value of how the token appeared in the original document (with the exception of being lowercased, since we currently lowercase in the tokenizer).
Tests: I updated a few things in the tests, but otherwise they all passed. I wrote a few tests just to smoke-test my additions. If you think this is headed in the right direction then I can shore it up with more tests.


olivernn · 2014-04-01T20:12:35Z

Hey, thanks for putting your time into this, I'll try and go through your code in more detail at some point but just wanted to say thanks for taking a stab at this!

I've actually been working on some changes that should make #25 and #58 possible. Actually they will be by-products of what I'm trying to achieve, which is to take token position into account when scoring matching documents, e.g. a document that contains search tokens closer together should score higher.
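
Purely as an illustration of the idea, and not a description of how lunr scores documents, position-aware scoring could start from the smallest distance between occurrences of two search tokens. The function names and the `1 / (1 + distance)` boost below are assumptions made for this sketch.

```javascript
// Hypothetical sketch: smallest gap between any occurrence of term A
// and any occurrence of term B, given their start positions.
function minDistance(positionsA, positionsB) {
  var numeric = function (x, y) { return x - y }
  var a = positionsA.slice().sort(numeric)
  var b = positionsB.slice().sort(numeric)
  var i = 0, j = 0, best = Infinity

  // Two-pointer sweep over both sorted lists.
  while (i < a.length && j < b.length) {
    best = Math.min(best, Math.abs(a[i] - b[j]))
    if (a[i] < b[j]) i++
    else j++
  }

  return best
}

// Assumed boost: closer occurrences give a value nearer to 1,
// and a missing term gives 0.
function proximityBoost(positionsA, positionsB) {
  return 1 / (1 + minDistance(positionsA, positionsB))
}

var near = proximityBoost([0, 40], [45, 100])
var far = proximityBoost([0], [100])
```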

A slightly related feature is to have better wildcard support. Currently a wildcard is automatically appended to every search term; this mostly works but has caused some issues (#74, #62, #33). What I want is for you to be able to enter a wildcard where you want/need it: at the beginning, at the end, or in the middle of a search term.

Both of these require a lunr.Token object, and as you have found this is not such a trivial change 😉 I've actually got an implementation already. It's still mostly focused on the wildcard stuff, but I have a feeling it will be an enabler for all sorts of niceness and features like this.
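
For illustration only, wildcard placement anywhere in a term can be modelled by translating the term into a regular expression. This `wildcardToRegExp` helper is an assumption made for the sketch; it says nothing about how lunr implements or will implement wildcards.

```javascript
// Hypothetical helper: turns "gr*t" into /^gr.*t$/, escaping every other
// character that has a special meaning in a regular expression.
function wildcardToRegExp(term) {
  var parts = term.split("*").map(function (part) {
    return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  })
  return new RegExp("^" + parts.join(".*") + "$")
}

var middle = wildcardToRegExp("gr*t")    // wildcard in the middle
var leading = wildcardToRegExp("*ness")  // wildcard at the beginning
var trailing = wildcardToRegExp("great*") // wildcard at the end
var exact = wildcardToRegExp("great")    // no wildcard, exact match only
```

A real index would more likely walk its token store than test every token against a regular expression, so this only pins down the matching semantics.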

I'm not overly concerned with backwards compatibility, yet. lunr isn't quite 1.0 so I feel I can still experiment a little with the public interfaces. Serialised indexes can be re-built and lunr currently warns you if you are using an index serialised with a different version.

My current work is still very 'work in progress' but I'll try and tidy it up a little and push the branch here so you, and others, can take a look at where I'm going. I'd really appreciate your input on what I've got and how to make sure it is compatible with what you're trying to achieve.

Thanks again for your help, I'll be sure to keep you updated.

cambridgemike · 2014-04-01T20:18:48Z

Thanks for the thoughtful response, I'd love to see what you've been working on. Having token position be involved in the search process is also important for exact searches (e.g. "dog food", where the words "dog" and "food" appear next to each other in a document).
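
A minimal sketch of such an exact-phrase check, assuming every token of a field is available as a `{raw, startPos}` record (including words that a stop-word filter would normally drop). The `containsPhrase` helper is hypothetical.

```javascript
// Hypothetical helper: true when the words occur as consecutive tokens,
// in order, within the token list of a single field.
function containsPhrase(tokens, words) {
  if (words.length === 0) return false

  var sorted = tokens.slice().sort(function (a, b) {
    return a.startPos - b.startPos
  })

  for (var i = 0; i + words.length <= sorted.length; i++) {
    var matched = true
    for (var k = 0; k < words.length; k++) {
      if (sorted[i + k].raw !== words[k]) {
        matched = false
        break
      }
    }
    if (matched) return true
  }

  return false
}

// "the dog food bowl"
var adjacent = [
  { raw: "the", startPos: 1 },
  { raw: "dog", startPos: 5 },
  { raw: "food", startPos: 9 },
  { raw: "bowl", startPos: 14 }
]

// "food for the dog"
var apart = [
  { raw: "food", startPos: 1 },
  { raw: "for", startPos: 6 },
  { raw: "the", startPos: 10 },
  { raw: "dog", startPos: 14 }
]
```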

Cheers,
Mike

hugovincent · 2014-05-09T13:17:22Z

Any update on this? I'd love to be able to use this feature.

olivernn · 2014-05-12T19:23:09Z

Thanks for your interest @hugovincent. I've been working on a couple of changes to the way lunr works to support this feature, as well as better wildcard searching and a pipeline for scoring documents. You can follow along with what's happening on the next branch.

Everything is still very alpha and might change at any point, but it should at least give you an idea as to where this feature is going.

aschuck · 2015-10-02T07:40:35Z

@olivernn Did anything become of the next branch, or of @cambridgemike's patch?

olivernn · 2015-10-05T19:46:19Z

@aschuck sadly no, it's all still available on GitHub, but I haven't had a chance to take these any further.

clns · 2016-03-02T08:01:49Z

I guess there are no updates with this, sadly. 😞

cambridgemike added 4 commits March 31, 2014 17:42

Initial functionality for a Token Metadata Store. Returns startPositi…

2bbaddc

…on with search results

Update tests

cc79d5e

Update Makefile

8623d7a

A few style changes

d2708a7
