Exporting index to XML

Note:

This page describes Thinlet Luke's functionality.

The current version of Luke (implemented in Swing) does not support index export.

There is an issue for this: https://github.com/DmitryKey/luke/issues/141

There are several reasons why you might want to export your Lucene / Solr index, or part of it, to an XML file for further processing.

One such goal is extracting the indexed tokens.

In this post we illustrate one particular Luke feature that allows you to dump an index into an XML file for external processing. The post has been adapted from here.

Task

Extract indexed tokens from a field to a file for further analysis outside luke.

Indexing data

In order to extract tokens, you need to index your field with term vectors configured. Usually this also means that you need to configure positions and offsets.
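To make "positions" and "offsets" concrete, here is a minimal Python sketch. It is not Lucene's analysis chain: the `simple_tokens` helper is a hypothetical toy tokenizer that splits on word characters and lowercases, and it only illustrates what the two numbers mean for a sample value such as CENTURY TEXT.

```python
import re

def simple_tokens(text):
    """Yield (term, position, start_offset, end_offset) for each word in text.

    Toy stand-in for an analyzer (an assumption for illustration only):
    the position is the token's index in the stream, the offsets are
    character indexes into the original text (end offset is exclusive).
    """
    for position, match in enumerate(re.finditer(r"\w+", text)):
        yield match.group().lower(), position, match.start(), match.end()

tokens = list(simple_tokens("CENTURY TEXT."))
for term, position, start, end in tokens:
    print(f"{term}: position={position} offsets={start}-{end}")
```

A real analyzer may additionally stem terms or emit several terms at the same position, so the terms stored in your index can differ from what this sketch prints.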

If you are indexing using Apache Solr, you would configure the following on your field:

<field name="Contents" type="text" indexed="true" stored="true" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true" />

With this line you make sure your field stores its contents, not only indexes them; it will also store the term vectors, i.e. each term together with its positions and offsets in the token stream.
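If you want to double-check a field definition before reindexing, a small script can do it. The following is only a sketch: the `missing_term_vector_options` helper and the sample field string are made up for illustration, and the check compares attribute names case-insensitively so that it does not depend on how the snippet was copied.

```python
import xml.etree.ElementTree as ET

# Options this page relies on, written in lowercase for comparison.
REQUIRED = ("termvectors", "termpositions", "termoffsets")

def missing_term_vector_options(field_xml):
    """Return the required options that are not set to "true" on a <field> element."""
    field = ET.fromstring(field_xml)
    # Normalize attribute names and values before comparing.
    attrs = {name.lower(): value.strip().lower() for name, value in field.attrib.items()}
    return [option for option in REQUIRED if attrs.get(option) != "true"]

# Hypothetical field definition with offsets left out on purpose.
sample = ('<field name="Contents" type="text" indexed="true" stored="true" '
          'termVectors="true" termPositions="true" />')
missing = missing_term_vector_options(sample)
print(missing)
```

An empty list means all three options are enabled on the field.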

Extracting index terms

One way to view the indexed tokens with Luke is to search / list documents, select the field with term vectors enabled and click the TV button (or right-click and choose "Field's Term Vector").

Luke's Term Vector

If you would like to extract this data into an external file, you can currently do so via the menu Tools->Export index to XML:

Luke's Export Index

In this case I have selected docid 94724, which is visible when viewing a particular document in Luke (note that this is Lucene's internal doc id, not the Solr application-level document id!). This dumps the document into the XML file, including the fields in the schema and each field's contents. In particular, it dumps the term vectors of a field (if present), in my case:

<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<tv>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</tv>
</field>
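For the "further analysis outside Luke" part, the exported XML can be read with any XML parser. Below is a minimal Python sketch that pulls the term vector entries of one field out of a fragment like the one above. The `term_vector` helper is hypothetical, and the code assumes only the element and attribute names visible in the fragment (`field`, `tv`, `t`, `text`, `freq`, `positions`, `offsets`); the `positions` and `offsets` values are kept as raw strings because their format for terms with a frequency above 1 is not shown here.

```python
import xml.etree.ElementTree as ET

def term_vector(xml_text, field_name):
    """Return (text, freq, positions, offsets) tuples from a field's <tv> element."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() also visits the root itself, so this works whether <field>
    # is the root element or nested deeper inside a larger export.
    for field in root.iter("field"):
        if field.get("name") != field_name:
            continue
        for t in field.iter("t"):
            rows.append((t.get("text"), int(t.get("freq")),
                         t.get("positions"), t.get("offsets")))
    return rows

# The fragment shown above, embedded so the sketch is self-contained.
sample = """<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<tv>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</tv>
</field>"""

rows = term_vector(sample, "Contents")
for text, freq, positions, offsets in rows:
    print(text, freq, positions, offsets)
```

To run this against a real export, you would read the file contents first, or swap `ET.fromstring` for `ET.parse(path).getroot()`; for very large exports a streaming parser such as `ET.iterparse` is likely a better fit.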
