
how to use Stork to handle a list of 200-500 documents? (lunrjs style) #355

Open
stargazer33 opened this issue Apr 6, 2023 · 1 comment


stargazer33 commented Apr 6, 2023

I have a collection of the following... let's say "data structures":

[
  {
    "id": "abcd123455",
    "title": "Some title",
    "body": "Contents of the blog post..."
  },
  {
    "id": "xyz986724",
    "title": "Another great title",
    "body": "another contents..."
  }
]

These "data structures" are in my database, so I can export them in any format (HTML, text, JSON, YML...)
There are about 200-500 "data structures" per search index. They all have an unique ID (and this ID is not URL). The "body" is about one or two screens big. On the backend I have complete control, so can generate what is necessary, run the stork build command etc...

At the moment the search functionality on my site is implemented with lunrjs (see lunrjs.com/guides/core_concepts.html). I am thinking about migrating to Stork.
But... reading the Stork documentation, I get the impression that Stork is designed to index... let's say 5-10 big (HTML?) pages.

So, the question is: how can I use Stork to handle a list (collection? array?) of 200-500 documents?
I mean: how do I use Stork in the "lunrjs scenario"?
(The first idea that comes to mind is to generate the *.toml config file with one [[input.files]] entry for each document/"data structure", and to put each document into a separate file (200-500 files!). That is probably overkill; I do not think Stork was designed for this. A rough sketch of the idea follows.)
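For concreteness, the kind of per-document config I have in mind would look something like this (a rough, untested sketch; the file names and base_directory value are made up, and I simply reuse the ID as the url field since Stork seems to want some URL-ish string per result):

[input]
base_directory = "documents/"

# one [[input.files]] entry, and one file on disk, per document
[[input.files]]
path = "abcd123455.txt"
url = "abcd123455"
title = "Some title"

[[input.files]]
path = "xyz986724.txt"
url = "xyz986724"
title = "Another great title"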

karlwilcox commented

I use Stork to index the text content of about 11,000 web pages, using your first idea of doing some "pre-processing" to create a .toml config that contains everything I want indexed. I have a PHP script that scans the relevant HTML files and produces something like this:

[input]
frontmatter_handling = "Omit"
stemming = "None"
minimum_indexed_substring_length = 4
files = [
    { url = "/gallery/000001", title = "Petre (from Boutell's Heraldry)", contents = "this shield was used by boutell as the primary [snip....]", filetype = "PlainText" },
    { url = "/gallery/000002", title = "Boyd Garrison", contents = "shield device of [snip..]", filetype = "PlainText" },
    { url = "/gallery/000003", title = "Example of Varying Edge Types", contents = "this example demonstrates [snip..].", filetype = "PlainText" },
    [snip 11,000 additional entries]
]

This has the advantage that I can also "pre-scan" the input and remove any terms that I don't want included in the index.
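For anyone starting from a database export rather than HTML files, the same idea as a minimal Python sketch (my actual script is PHP; the export.json filename and the record fields are assumptions based on the structures shown in the question, and I just put the ID in the url field):

import json

# Hypothetical JSON export of the database records shown in the question.
with open("export.json") as f:
    records = json.load(f)

def toml_escape(text):
    # Escape characters that are special inside a double-quoted TOML string.
    return (
        text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", " ")
            .replace("\r", " ")
            .replace("\t", " ")
    )

with open("stork-config.toml", "w") as out:
    out.write('[input]\n')
    out.write('frontmatter_handling = "Omit"\n')
    out.write('files = [\n')
    for r in records:
        out.write(
            '    {{ url = "{u}", title = "{t}", contents = "{c}", filetype = "PlainText" }},\n'
            .format(u=toml_escape(r["id"]),
                    t=toml_escape(r["title"]),
                    c=toml_escape(r["body"]))
        )
    out.write(']\n')

Running something like stork build --input stork-config.toml --output index.st then produces a single index file for the frontend to load, whether it covers 5 pages or 11,000. So 200-500 documents is well within what this approach handles.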
