Add support for Lunr pre-built indexes #1396

Open
mgroeber9110 opened this issue Nov 18, 2023 · 3 comments
Labels
enhancement status: ready to implement Issues that can be actively worked on, and need an implementation!

Comments

@mgroeber9110

mgroeber9110 commented Nov 18, 2023

I have used JTD to convert a relatively large collection of pre-existing Markdown docs into a site deployed onto gh-pages (after having previously used the jekyll-build-pages action). The conversion overall works fairly well, after installing a few additional plugins that jekyll-build-pages includes by default (for example, to support Markdown without Front Matter).

The resulting documentation can be found here: https://bluewaysw.github.io/pcgeos

The Markdown documentation is about 11 MB in size, leading to a search-data.json size for lunr full-text-search of about 6 MB. This causes noticeable freezes in the page after loading, while lunr generates its index.

The lunr docs describe a solution for pre-building indexes, which appears to be relatively straightforward to integrate into just-the-docs:

  • Add support for an optional file search-index.json that is loaded together with search-data.json. If loading the index file fails, the onload event generates the index locally; otherwise it just loads the index JSON.

  • For generating the index, a file build-index.js is put into the assets that can be called via node.js if desired:

    node build-index.js < _site\assets\js\search-data.json > _site\assets\js\search-index.json

This could for example happen in a github action after generating the site. The build-index.js could look something like this:

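// build-index.js: reads search-data.json on stdin and writes a
// serialized lunr index to stdout.
var lunr = require('./_site/assets/js/vendor/lunr.min.js'),
    stdin = process.stdin,
    stdout = process.stdout,
    buffer = []

stdin.resume()
stdin.setEncoding('utf8')

// Collect the complete input before parsing it.
stdin.on('data', function (data) {
  buffer.push(data)
})

stdin.on('end', function () {
  var docs = JSON.parse(buffer.join(''))

  // Same field setup as the index that is otherwise built in the browser.
  var idx = lunr(function () {
    this.ref('id')
    this.field('title', { boost: 200 })
    this.field('content', { boost: 2 })
    this.field('relUrl')
    this.metadataWhitelist = ['position']

    // docs is an object keyed by document id.
    for (var i in docs) {
      this.add({
        id: i,
        title: docs[i].title,
        content: docs[i].content,
        relUrl: docs[i].relUrl
      })
    }
  })

  // A lunr index serializes itself through JSON.stringify.
  stdout.write(JSON.stringify(idx))
})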

For our site, the index would be about 11 MB in size. I have not yet fully integrated this approach, but if this looks feasible, I could try making a PR for it.
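The client-side half of the proposal (the first bullet above) could be sketched roughly as follows. This is only an illustration under assumptions: the function names `buildIndexLocally`, `resolveIndex` and `loadSearch` are hypothetical and not part of just-the-docs, `lunr` is passed in as a parameter rather than assuming how the theme loads it, and the sketch uses the standard `fetch` API, which may differ from the request mechanism the theme uses.

```javascript
// Hypothetical helper: build the index in the browser, mirroring the
// field setup of build-index.js above.
function buildIndexLocally(lunr, docs) {
  return lunr(function () {
    this.ref('id');
    this.field('title', { boost: 200 });
    this.field('content', { boost: 2 });
    this.field('relUrl');
    this.metadataWhitelist = ['position'];
    for (const id of Object.keys(docs)) {
      this.add({
        id: id,
        title: docs[id].title,
        content: docs[id].content,
        relUrl: docs[id].relUrl
      });
    }
  });
}

// Hypothetical helper: use the serialized index when one is available,
// otherwise fall back to generating it locally.
function resolveIndex(lunr, docs, serializedIndex) {
  if (serializedIndex) {
    // lunr.Index.load restores an index produced by JSON.stringify(idx).
    return lunr.Index.load(serializedIndex);
  }
  return buildIndexLocally(lunr, docs);
}

// Hypothetical entry point: fetch both files. A missing or unreadable
// search-index.json is not treated as an error.
async function loadSearch(lunr, dataUrl, indexUrl) {
  const docs = await (await fetch(dataUrl)).json();
  let serializedIndex = null;
  try {
    const response = await fetch(indexUrl);
    if (response.ok) {
      serializedIndex = await response.json();
    }
  } catch (err) {
    // Network or parse failure: fall through to the local build.
  }
  return { docs: docs, index: resolveIndex(lunr, docs, serializedIndex) };
}
```

Keeping `resolveIndex` separate from the network code means the fallback decision can be exercised without a browser or a real lunr build.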

@mattxwang
Member

Thanks for submitting this issue @mgroeber9110, and apologies for the delay - it has been hard for me to find OSS time this quarter! I appreciate you writing up this issue. I haven't personally used a site with such a large search index, but I'm sure that yours isn't the only such use case (e.g. I'm aware of other client sites of that size, many of which are not open-source).

Something to be aware of is that many of our users use the github-pages gem, which means that (unfortunately) a feature that relies on GitHub Actions without a fallback is untenable. Here are two ways I think we could incorporate your solution:

  1. Like you mentioned, generate the pre-built index after the site build. This is easier to do and easily automated with CI/CD, but has a runtime downside: client sites without pre-built indices would make a fetch that is guaranteed to fail every time. This can be mitigated by making this an opt-in feature in the _config.yml. Depending on how tightly integrated this is as a feature, it may also necessitate requiring a node installation as part of the build step, which I would prefer not to do.
  2. I'm also interested in potentially generating this index as part of the build process itself, similar in spirit to the generation of search-data.json. This would require more looking at how this index is serialized. However, if we can write this as a Ruby plugin (or if this code already exists - which it might!), it would be easier to adopt as no build-time node dependency is necessary.

I will think on this some more. However, if you are interested in contributing a solution, I'd be happy to work with you on a PR! Let me know what your thoughts are.


An interesting related feature is #1068. It may provide prior art for how to name this feature, and we need to make sure that both play together nicely!

@mattxwang mattxwang added the status: ready to implement Issues that can be actively worked on, and need an implementation! label Dec 18, 2023
@mgroeber9110
Author

Not sure yet if/when I will get around to trying an implementation, but here are a few notes on what I have found out so far:

  • The closest I have found so far in terms of generating a lunr index natively in Ruby is middleman-lunr.
  • This is based on running the V8 JS interpreter from Ruby through a Gem called therubyracer and then executing the original JS code from lunr. This feels a bit heavy, but it might be an alternative to reimplementing the indexing code natively in Ruby and having to keep it compatible with potential updates to lunr.
  • It seems that therubyracer has now been deprecated in favor of mini_racer, but this would probably only have an impact on the details of the bindings. However, I am not an experienced Ruby developer (at least not for the last 15 years or so), so I am not certain if adding these dependencies would be acceptable for JTD at all.

My naive understanding was that Github Actions are required in any case to apply this theme (just from looking at the template), but this probably neglects its use in other CI environments. Of course, the easiest fallback might be to just not pre-build the index if not all dependencies are available and to only treat this as an optional speedup...
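The "optional speedup" fallback could be sketched on the build side roughly like this. Everything here is an assumption for illustration: `tryRequire` and `prebuildIfPossible` are hypothetical names, and the `buildIndex` callback stands in for the lunr setup from build-index.js in the original post. The idea is simply that a missing dependency or missing input leads to skipping the step rather than failing the build.

```javascript
const fs = require('fs');

// Hypothetical helper: return the first module that can be loaded,
// or null if none of the candidates is available.
function tryRequire(candidates) {
  for (const name of candidates) {
    try {
      return require(name);
    } catch (err) {
      // Not available under this name; try the next candidate.
    }
  }
  return null;
}

// Hypothetical helper: write a pre-built index only when lunr and the
// search data are both present. Returns true if an index was written.
function prebuildIfPossible(lunr, dataPath, indexPath, buildIndex) {
  if (!lunr || !fs.existsSync(dataPath)) {
    return false; // Skip quietly; the browser builds the index instead.
  }
  const docs = JSON.parse(fs.readFileSync(dataPath, 'utf8'));
  fs.writeFileSync(indexPath, JSON.stringify(buildIndex(lunr, docs)));
  return true;
}

// Possible usage (module names and paths are assumptions):
//   const lunr = tryRequire(['lunr', './_site/assets/js/vendor/lunr.min.js']);
//   prebuildIfPossible(lunr, '_site/assets/js/search-data.json',
//                      '_site/assets/js/search-index.json', myBuildIndex);
```

With this shape, a site built without Node or without lunr installed would behave exactly as it does today, and only environments that have the dependencies would produce search-index.json.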

@mattxwang
Member

Thanks for the quick response @mgroeber9110! I had a chance to do a bit more digging, and the default MkDocs search plugin actually behaves quite similarly to your proposal: it uses Node to pre-build the index and assumes that Node is a part of the user's build system. It's opt-in, and they expect users who opt in to know that they need to properly configure Node.

While this seems a bit brittle to me, this does provide prior art for your original issue. So, with that in mind, I think that could be a reasonable path forward - no need to implement a native lunr index in Ruby (which does sound like it's a hassle).

In other words,

"Of course, the easiest fallback might be to just not pre-build the index if not all dependencies are available and to only treat this as an optional speedup..."

Sounds like a great idea 😊

@pdmosses any thoughts on this matter? You've had great insights on things I've missed before!
