
how to use Stork to handle a list of 200-500 documents? (lunrjs style) #355

Open
stargazer33 opened this issue Apr 6, 2023 · 1 comment


stargazer33 commented Apr 6, 2023

I have a collection of the following... let's say "data structures":

[
  {
    "id": "abcd123455",
    "title": "Some title",
    "body": "Contents of the blog post..."
  },
  {
    "id": "xyz986724",
    "title": "Another great title",
    "body": "another contents..."
  }
]

These "data structures" are in my database, so I can export them in any format (HTML, text, JSON, YML...)
There are about 200-500 "data structures" per search index. They all have an unique ID (and this ID is not URL). The "body" is about one or two screens big. On the backend I have complete control, so can generate what is necessary, run the stork build command etc...

At the moment the search functionality on my site is implemented with lunrjs (see lunrjs.com/guides/core_concepts.html). I am thinking about migrating to Stork.
But... reading the Stork documentation, I get the impression that Stork is designed to index... let's say 5-10 big (HTML?) pages.

So, the question is: how can I use Stork to handle a list (collection? array?) of 200-500 documents?
I mean: how do I use Stork in the "lunrjs scenario"?
(The first idea that comes to mind is to generate the *.toml config file with one [[input.files]] entry for each document/"data structure", and to put each document into a separate file (200-500 files!). That is probably overkill; I do not think Stork was designed for this. A rough sketch of the idea follows.)
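For concreteness, the kind of per-document config I have in mind would look something like this (a rough, untested sketch; the file names and base_directory value are made up, and I simply reuse the ID as the url field since Stork seems to want some URL-ish string per result):

[input]
base_directory = "documents/"

# one [[input.files]] entry, and one file on disk, per document
[[input.files]]
path = "abcd123455.txt"
url = "abcd123455"
title = "Some title"

[[input.files]]
path = "xyz986724.txt"
url = "xyz986724"
title = "Another great title"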

karlwilcox commented

I use Stork to index the text content of about 11,000 web pages, using your first idea of doing some "pre-processing" to create a .toml config that contains everything I want indexed. I have a PHP script that scans the relevant HTML files and produces something like this:

[input]
frontmatter_handling = "Omit"
stemming = "None"
minimum_indexed_substring_length = 4
files = [
    { url = "/gallery/000001", title = "Petre (from Boutell's Heraldry)", contents = "this shield was used by boutell as the primary [snip....]", filetype = "PlainText" },
    { url = "/gallery/000002", title = "Boyd Garrison", contents = "shield device of [snip..]", filetype = "PlainText" },
    { url = "/gallery/000003", title = "Example of Varying Edge Types", contents = "this example demonstrates [snip..].", filetype = "PlainText" },
    [snip 11,000 additional entries]
]

This has the advantage that I can also "pre-scan" the input and remove any terms that I don't want included in the index.
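For anyone starting from a database export rather than HTML files, the same idea as a minimal Python sketch (my actual script is PHP; the export.json filename and the record fields are assumptions based on the structures shown in the question, and I just put the ID in the url field):

import json

# Hypothetical JSON export of the database records shown in the question.
with open("export.json") as f:
    records = json.load(f)

def toml_escape(text):
    # Escape characters that are special inside a double-quoted TOML string.
    return (
        text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", " ")
            .replace("\r", " ")
            .replace("\t", " ")
    )

with open("stork-config.toml", "w") as out:
    out.write('[input]\n')
    out.write('frontmatter_handling = "Omit"\n')
    out.write('files = [\n')
    for r in records:
        out.write(
            '    {{ url = "{u}", title = "{t}", contents = "{c}", filetype = "PlainText" }},\n'
            .format(u=toml_escape(r["id"]),
                    t=toml_escape(r["title"]),
                    c=toml_escape(r["body"]))
        )
    out.write(']\n')

Running something like stork build --input stork-config.toml --output index.st then produces a single index file for the frontend to load, whether it covers 5 pages or 11,000. So 200-500 documents is well within what this approach handles.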
