
Feature wish: Add a search function (this is a race ;-)) #146

Open
dertuxmalwieder opened this issue Sep 29, 2015 · 30 comments

@dertuxmalwieder

Hi,

(sorry, this one will be longish, but I want to make my points clear…)

having been in the process of transitioning from WordPress to some static (not just flat-file) blog for years now (I’m really lazy), I still haven’t settled on which system to use. Actually, I had found one which I thought would be perfect, but then I noticed that its solution for searching articles was not working as intended, especially since there was no way to use it without JavaScript. The most important feature of a blog system is a good search function, followed by a decent comment solution (but that’s a different topic).

So I’m back on track, looking for the perfect static blog solution. I already have a list of such systems which failed to work well for me (mostly theming- or feature-related issues), so I loosened my requirements a bit. I don’t even care which programming language is used anymore as long as it just works (as in it provides a good search function) and it’s a cool one (as in it’s not JavaScript).

Now here’s what I want:

The perfect static blog solution should, while generating the pages, keep some full-text index of the posts and provide a search function which could be accessed through the front-end (like the article listing but filtered by contents). In case this is already possible, please tell me how - I actually searched the docs and sources but I haven’t found such a functionality.

As I want you to actually consider this (I know) frequent feature wish within the near future, I posted it to several interesting generators’ issue trackers, including yours. I’ll probably use the Static Site Generator which comes up with sufficient search functionality first.

Thank you in advance.

@greghendershott
Owner

Nice avatar; great album. :)

I think the short answer is that I don't have time to do this myself, but would welcome a pull request.

The longer answer is, I completely understand the need, it's just not clear to me how much of this falls within the mission of Frog per se. I'd start by presuming people will:

  1. Edit page-template.html to add the front-end JS for already-invented-good-full-text-indexer.
  2. Build with raco frog -b && already-invented-good-full-text-indexer.

Maybe there's some missing 2% that Frog could/should provide (like, I don't know, helping the indexer know what files to search, and/or where to store its output). If so, I'd be happy to add (or accept a PR for) that.

@dertuxmalwieder
Author

Thank you. ;-)

Well, I don't really speak Racket yet, so I was hoping I could delegate this. But your idea seems reasonable too. Would Sphinx work?

@greghendershott
Copy link
Owner

I don't know much about full-text indexers. If you know (or could research) some good options, that would be a huge help, regardless of how well you know Racket.

I guess the criteria would be:

  • Decent JS front-end UI (in your opinion).
  • Frog supports all of Windows, OS X, and Linux, so the indexer should, too.

How is Sphinx in that regard?


Also: Google has some search-for-your-site thing, right? Why not use that? Even if that's inadequate, it would be nice to explain why (so people understand why it's worth the hassle of installing X to do this).
If the only reason is non-technical, e.g. "because Google is too big", that's valid, but then how about DuckDuckGo or some other provider?

@dertuxmalwieder
Author

Sphinx - the best solution I, personally, know - runs "everywhere", at least on Frog's target platforms. The front-end is to be designed by whoever feels like it, Sphinx "only" provides the server-side stuff.

As a free alternative (there are some commercial ones too), there's also Apache's Lucene project with the Sphinx-like software Solr.

re Google/DDG: I could use them as a workaround but they have serious drawbacks:

  1. Not reliable (uptime): If Google/DDG shuts down or decides to change their API or something, users can't search my site anymore. That's not a risk I want to take.
  2. Not reliable (quality): I don't know how Google (just to stick with that example) weighs its results, but users coming from Google repeatedly find rather ... weird results on my existing WordPress blog.
  3. Privacy: I'm proud to have everything on my blog stored on my server. Essential components like the search should, where possible, not be served from third-party servers.

@greghendershott
Owner

re Google/DDG I understand/empathize. Thanks for articulating why.

re Sphinx:

  1. Is there an open-source front-end for it?
  2. You said "server-side" but I'm hoping you meant just "back-end"? What I mean: For a static blog, we'd rather not run an index query server. Instead we want some indexer that produces a static file, that the front-end JS can read/use? (I should have spelled that out in the criteria.)

@dertuxmalwieder
Author

re Sphinx (too): Oh, that makes things harder. Generating the index needs some server-side logic (of course) as the indexer would have to actively search through the existing files. I'm not sure if Sphinx supports a "static" (schemaless) data output. Solr does.

Else, there's still Tipue Search, which is entirely client-side and written in jQuery. Maybe that's the last resort here?

@greghendershott
Owner

Well, I'm assuming that most people choosing to use a static blog generator, who don't want to run an HTTP server, would also not want to run some index query server. At least, I'd be in that camp.

I was hoping that one would run the indexer after rebuilding the blog, and it would produce some sort of "database" file(s), that the JS could read and use to do fast queries.

Again I know little about full-text search, and I only scanned the tipue docs very quickly. But it looks like its Static Mode wants JSON data in a tipuesearch_content.js file: http://www.tipue.com/search/docs/?d=1. What's missing is a command line tool to go through .md and .scrbl source files -- or possibly the output html files? -- and generate this data.

I don't know, what do you think?

@dertuxmalwieder
Author

Well, I'm assuming that most people choosing to use a static blog generator, who don't want to run an HTTP server

How do you want to deploy your generated HTML files without an HTTP server?

would also not want to run some index query server. At least, I'd be in that camp.

I see what you mean. One of the two major reasons why I want to drop WordPress is that its server components are known for nasty security holes. Still, most of the search functions I can imagine could be realized as optional plug-ins/scripts for those who could live with that.

I was hoping that one would run the indexer after rebuilding the blog, and it would produce some sort of "database" file(s), that the JS could read and use to do fast queries.

The closest you can have here -- at least if we're still talking about "real" search servers -- is running Solr in schemaless mode and storing its indexes locally, I guess.

But it looks like its Static Mode wants JSON data in a tipuesearch_content.js file: http://www.tipue.com/search/docs/?d=1. What's missing is a command line tool to go through .md and .scrbl source files -- or possibly the output html files? -- and generate this data.

The existing Pelican script looks like there's not much involved; yes, the HTML files are "scanned" and transformed into JSON.
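For reference, that kind of scan can be sketched in Python with just the standard library. This is my own illustration, not the actual Pelican script: the function names and the exact JSON fields ("title", "text", "url", wrapped in a "pages" object and assigned to a var tipuesearch) are assumptions based on the Tipue docs linked above.

```python
import json
import pathlib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text and the <title> of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.chunks.append(data)

def build_content(site_dir, base_url):
    """Scan every generated .html file and collect title/text/url records."""
    pages = []
    for path in sorted(pathlib.Path(site_dir).rglob("*.html")):
        parser = TextExtractor()
        parser.feed(path.read_text(encoding="utf-8"))
        pages.append({
            "title": parser.title.strip(),
            # Collapse runs of whitespace left over from markup.
            "text": " ".join(" ".join(parser.chunks).split()),
            "url": base_url + "/" + path.relative_to(site_dir).as_posix(),
        })
    return {"pages": pages}

def write_content_js(site_dir, base_url, out_path="tipuesearch_content.js"):
    """Emit the JSON wrapped as a JS variable, as Tipue's Static Mode expects."""
    js = "var tipuesearch = " + json.dumps(build_content(site_dir, base_url)) + ";"
    pathlib.Path(out_path).write_text(js, encoding="utf-8")
```

One would run this after `raco frog -b`, pointing it at the output directory, and serve the resulting file alongside the Tipue JS.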

@greghendershott
Owner

Many (most?) people using static blog generators use an HTTP server someone else is responsible for running 24/7. That's part of the appeal. They push to GitHub Pages, or copy to Amazon S3, or similar.

So, if you're OK with the tipue UI/UX, it looks like Pelican already supports what you want! :)

Generating that kind of JSON would be easy to do in Racket for Frog, as well. I can imagine adding that, either in Frog or simply as a stand-alone repo/tool. It would be handy. There are times I'd use it to find a post on my own blog more quickly, not to mention helping others.

Oh heck, I'll assign this to myself, and try to get to it in the next day or two. Thanks again for the suggestion and for helping talk through the options.

@dertuxmalwieder
Author

Ah, I misunderstood you there, yes.

Pelican is quite wordpressy indeed, even feature-wise. But I don't like Python too much after having used it for a while, and I'm not entirely happy with its theming, so I thought I'd push the alternatives a bit.

Thank you a lot!

@greghendershott
Owner

I'm still interested in this, but having looked at tipue more, I'm not so sure about it, specifically. Its content file JSON is just an array of page maps, and the text member of each map is just all the text from the page. This seems like it can't be great for speed (or space), especially when a site is big-ish?
(I do know one Frog user with 600 posts, and others with fewer, but longer, posts.) Although I'm naive about full-text search, I was expecting something more like an inverted index with position info.

Maybe I'm misunderstanding the JavaScript and it converts that raw data into something better, but I don't think so.

Hmmm....

@dertuxmalwieder
Author

Well, if you want to achieve client-side full-text search, you'll have to deploy that full-text index somehow. Someone tested Tipue's performance and seemed impressed, though.

Tipue basically works with JSON, yes. That's the main problem with JavaScript-based stuff: JavaScript has horrible data formats.

@greghendershott
Owner

In that issue you linked to (thanks!), this comment expresses another concern I have:

I hate this option because it forces 100% of the site to be sent to them. They may as well use wget -r and grep. If any word in any post changes (spelling correction) they get to download the entire site again.

That he followed through with a solution is awesome -- kudos.

That it requires running your own server, is not awesome (for me).

@dertuxmalwieder
Author

The "solution" is the aforementioned Sphinx (and I guess it could also work with Solr), yes.

Searching a static site basically leaves you with only those two options: simulate a database layer or keep a full-text index somewhere. But I guess Tipue's index is not really large; it only contains your pure text. Today's connections probably won't have much trouble with that...?

@tfeb
Contributor

tfeb commented Sep 30, 2015

I don't think you need to download the whole site, or the text of it, if you are willing to live with some limits. For instance, if you simply compute and store statically a table which maps from each unique word to a reference to the page it occurs in, then you can fetch that table and have some client-side code which then dynamically fetches that page, and searches linearly in it, on the fly. That means the initial fetch of stuff is not the whole site, but its unique words, which for large sites will be much smaller.

(Of course, in real life you'd want to be smarter than this (for instance, don't index 'the' and so on), but I think there should be non-pessimal solutions, where 'pessimal' is either 'running some fancy, and therefore vulnerable, server-side thing' or 'downloading the whole site each time'.)
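The word-to-page table described above can be illustrated in a few lines of Python. This is only a sketch of the idea, not an existing tool; the stopword list and the table layout are my own assumptions.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real build would use a much fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def build_word_table(pages):
    """pages: dict mapping URL -> plain text of that page.
    Returns a table mapping each non-stopword to the URLs it occurs in.
    The client fetches this table, then fetches only the matching pages
    and searches them linearly on the fly."""
    table = defaultdict(set)
    for url, text in pages.items():
        for word in set(re.findall(r"[a-z0-9']+", text.lower())):
            if word not in STOPWORDS:
                table[word].add(url)
    return {word: sorted(urls) for word, urls in sorted(table.items())}

pages = {
    "/2015/09/search.html": "Adding search to a static blog",
    "/2015/10/sphinx.html": "Sphinx is a full-text search server",
}
table = build_word_table(pages)
print(table["search"])  # both pages mention "search"
```

The table would be written out as one static JSON file at build time; for large sites it is much smaller than the full text, since each unique word appears once.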

@greghendershott
Owner

I don't think you need to download the whole site, or the text of it, if you are willing to live with some limits. For instance, if you simply compute and store statically a table which maps from each unique word to a reference to the page it occurs in, then you can fetch that table and have some client-side code which then dynamically fetches that page, and searches linearly in it, on the fly. That means the initial fetch of stuff is not the whole site, but its unique words, which for large sites will be much smaller.

Exactly. That's one of two things I did last night.

Against my better judgment I started reading more about IR. Assume a positional index like this. How does JS on a static web site avoid downloading the whole thing? Normally it queries a server, which someone has to maintain 24/7. I suppose that could be AWS Dynamo, but, not sure how to handle credentials. Also requires $.

Would it be crazy to store this with the "posting lists" sharded across objects on an AWS S3 bucket? It seems that using S3 as a key/value store like this could be reasonably performant. [Using anonymous access simplifies creds. As for $, there are some, but fewer of them than say Dynamo.]
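To make the sharding idea concrete, here is one way the build step might split posting lists across S3 objects. Everything here is an illustrative assumption (the shard count, the key scheme, and the posting-list layout); the client-side JS would compute the same shard key for a query term and fetch only that one object anonymously.

```python
import hashlib
import json

NUM_SHARDS = 64  # assumption: chosen purely for illustration

def shard_key(term):
    """Map a term to the S3 object key that holds its posting list.
    Hashing keeps shard sizes roughly balanced."""
    h = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return "index/shard-%02d.json" % (int(h, 16) % NUM_SHARDS)

def shard_index(postings):
    """postings: term -> posting list, e.g. [(url, [positions]), ...].
    Returns {s3_key: {term: posting_list}}, one dict per object to upload."""
    shards = {}
    for term, plist in postings.items():
        shards.setdefault(shard_key(term), {})[term] = plist
    return shards

postings = {
    "frog": [["/2015/09/search.html", [3, 17]]],
    "sphinx": [["/2015/10/sphinx.html", [0]]],
}
for key, obj in shard_index(postings).items():
    print(key, json.dumps(obj))
```

A query then costs one small GET per term rather than a download of the whole index, at the price of a rebuild-and-reupload of the affected shards whenever a post changes.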

So that could be an interesting project for someone to try (probably someone already has?).


OTOH the second thing I did last night? I configured Google Custom Search Engine for my blog. The search query/results can be embedded in one of my normal web pages -- it didn't feel like "leaving my site". It looked decent and worked really well.

I didn't push this to my site for real, yet. But.... Google CSE was ridiculously easy. I like easy. Does this make me a bad person?

@dertuxmalwieder
Author

Yes, it pretty much does. Google is The Evil!

Seriously, using a Static Site Generator usually means that you want to gain full control over your website, and adding third-party components from servers you don't own is not really a better and/or more secure idea than running a dynamic server daemon yourself, is it?

@greghendershott
Owner

People do static web sites for different reasons. Some do it for more control, like you. Whereas I'm with The Dead Kennedys, Give Me Convenience or Give Me Death. Seriously, if I get into the server-running business, my "users" won't be happy about availability and I won't be happy about fire drills. I do understand the trade-off and have misgivings; I feel slightly ashamed, if that makes you feel better.

So for example I use Disqus to add comments and I use Google Analytics to see if my tree falling in the woods makes any sound.

@greghendershott
Owner

Because I feel slightly ashamed I'd gladly use a search system that hosted its database intelligently on Amazon S3, for example. If that doesn't exist, I'd find that really fun to develop. I just don't have time to, now. I'm already close to over-extended on open source projects.

@dertuxmalwieder
Author

You could also run Sphinx on Amazon S3, would that validate my point then? ;)

-e- Oh, good timing.

@greghendershott
Owner

I'm sorry if I overlooked that option in the discussion. I'll try to find time to look at that. Thanks for pointing it out.

@dertuxmalwieder
Author

If there's anything I can help you with, I'm happy to do so. :)

@greghendershott
Owner

Honestly? I'd like a "dummies guide" to Sphinx, specifically how to make some JS front-end access the Sphinx database file(s) from a plain file server like FTP or S3. Whether that exists already, or you write it. It would be great if you could contribute the parts I don't know, and I can contribute what I know about how to integrate things smoothly into the Frog build process.

@dertuxmalwieder
Author

After reading a bit:

The database file(s) need to be generated first (see the "Indexes" part). Sphinx can even do real-time indexes without a database, according to the internet. However, having it expose a "human-readable" database seems to be undocumented without going through the Sphinx daemon. As with Solr, the API primarily exposes the server handle, it seems.

However (I like that word), it seems Sphinx can be instructed to generate index files. This is ... interesting.

@tfeb
Contributor

tfeb commented Oct 2, 2015

For what it's worth I'm in at least three camps here.

  • Static sites are inherently preferable as I can create them using tools and languages which don't make me ill, Frog being a good example of that (if JS or Python don't make you ill you are already ill and should seek treatment, which may involve a chainsaw), and deploy them essentially anywhere.
  • Static sites don't require me to run server-side software which will almost inevitably be written in a blub language, be vastly overcomplex, and be full of security holes. Even worse would be to rely on someone else to do this for me, as they will either not fix the holes, upgrade to incompatible versions every two days, or both. I deal with huge crappy software systems in my day job; I want not to have to do the same in my free time.
  • Static sites let me avoid Googlebook knowing more about me than I can avoid: in particular, even if anyone but me read my blog, I can avoid becoming yet another stream of personal information being squirted down Googlebook's throat. This matters less to other people, of course: I'm not claiming any morally-superior position here, I just don't want anything to do with them that I can avoid.

So, my point here, as far as I have one, is that I'm definitely in the extreme-static camp, and while search would be interesting I'm completely happy to not have it if it's not practical in a static system, and I'd like it to be the case that Frog continues to support that position. I'm not really worried that it won't, I just want to make sure it does.

Sorry for the rant!

@dertuxmalwieder
Author

Python is a very fast and quite reliable way to prototype things, although its syntax is... weird. But I don't understand your last point: how does Googlebook know more about you when the publicly visible HTML pages are generated dynamically? No one forces you to use Google/Facebook things with WordPress, for example.

@tfeb
Contributor

tfeb commented Oct 2, 2015

The point I didn't make clearly was that I don't want search to rely on google, or if it does I want it to be easy to disable.

(I have no problem with Python's syntax. I have a number of problems with its semantics which are just inexcusable: if it had been designed in the early 60s I could forgive it, but it wasn't. But this is very much not the forum to get into a fight about Python (apart from anything else writing Python is my day job and I don't want to think of it outside that).)

@MTecknology

@greghendershott Howdy!

In that issue you linked to (thanks!), this comment expresses another concern I have:

I hate this option because it forces 100% of the site to be sent to them. They may as well use wget -r and grep. If any word in any post changes (spelling correction) they get to download the entire site again.

That he followed through with a solution is awesome -- kudos.

That it requires running your own server, is not awesome (for me).

I smiled when I read that. :)

As a follow up to that thread you linked to, I'm considering turning what I built into a hosted service. One caveat is it means trusting my servers to do what I claim they do. My goal, however, is letting people hide the fact that my service is in the mix at all.

Would building this service be a solution to this issue or not so much?

@dertuxmalwieder
Author

I guess the whole point of a static website is that you don't have to trust a third-party server, if I got that correctly?

@greghendershott
Owner

@MTecknology Hi!

Would building this service be a solution to this issue or not so much?

Speaking only for myself, for my personal blog? I don't think so. If I wanted to add search as a service, I'd probably use an existing search provider. If such a provider served ads, I wouldn't love that but I'd understand that it needs to be paid for, somehow.

Speaking generally? There very well might be a market for such a service.

@greghendershott greghendershott removed their assignment May 21, 2022