Main thread panic in stork search "not a char boundary" #356

Open
karlwilcox opened this issue Apr 11, 2023 · 3 comments

@karlwilcox

With the following input file (named gallery.toml):

[output]
    displayed_results_count = 31
[input]
    url_prefix = "/gallery/"
    frontmatter_handling = "Omit"
    stemming = "None"
    minimum_indexed_substring_length = 4
files = [
{ url = "010695", title = "(Untitled)", contents = "caption saint–aubin–fosse–louvain  then  gules a chevron argent between 3 eagles or. caption  saint–berthevin  then  (1999) per chief per pale gules and argent; and sable  an eagle arg beaked and membered or in dexter side  taller a lion sable crowned  gu armed and langued gu in sinister side  shorter shorter a demi  lion arg in chief taller taller  lower. caption saint–berthevin–la–tanniere  then  (1999) or a chevron gules 2 eagles in chief azure  a tree eradicated vert in middle  %base higher. caption  saint–charles–la–foret  then  gules a carbuncle or. caption 'saint–denis–d'anjou'  ", filetype="PlainText" },
]

We run the command:

stork build --input gallery.toml --output gallery.st

And then try a command line search for a known hit, e.g.

stork search --format json --index gallery.st --query "azure"

We get the message:

thread 'main' panicked at 'byte index 540 is not a char boundary; it is inside '–' (bytes 539..542) of `caption saint–aubin–fosse–louvain  then  gules a chevron argent between 3 eagles or. caption  saint–berthevin  then  (1999) per chief per pale gules and argent; and sable  an eagle arg beaked and membered or in dexter side  taller a lion sable crow`[...]', stork-lib/src/index_v4/search/excerpt_grouping.rs:158:19
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/str/mod.rs:86:9
   4: stork_lib::index_v4::search::render_search_values
   5: stork::main

(Byte 540 is just before the final "gules" in the content string)
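
For context, this is the message Rust produces when a `&str` is sliced at a byte offset that falls inside a multi-byte UTF-8 character. Below is a minimal sketch of one common way to avoid that; it is not Stork's actual code, and `snap_to_char_boundary` and `safe_prefix` are hypothetical helper names used only for illustration.

```rust
/// Move `index` left until it lands on a UTF-8 character boundary.
/// Hypothetical helper for illustration; not part of Stork.
fn snap_to_char_boundary(s: &str, index: usize) -> usize {
    // Clamp first, because offsets past the end are never boundaries.
    let mut i = index.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Return the longest prefix of `s` that is at most `max_bytes` long
/// and does not split a character.
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    &s[..snap_to_char_boundary(s, max_bytes)]
}
```

In `"saint\u{2013}aubin"` the dash occupies bytes 5..8, so `&s[..6]` would panic in the same way as above, whereas `safe_prefix(s, 6)` returns `"saint"`. `str::get(..6)` is another option: it returns `None` instead of panicking.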

Other information:

karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ stork --version
Stork 2.0.0-beta.2
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ file /usr/local/bin/stork
/usr/local/bin/stork: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=2b673e767e82fea952b7220d047e5fe187d91b27, for GNU/Linux 3.2.0, with debug_info, not stripped
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ uname -a
Linux DESKTOP-9DUHI21 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

(So this is Ubuntu Linux running under WSL2, although I can also reproduce it on a native Ubuntu installation.)

Please let me know if you need anything else. Hope this is useful!

@karlwilcox
Author

Further investigation suggests that the characters that look like '-' are not ASCII. Removing them solves the problem, so this is likely related to character encoding.

@karlwilcox
Author

It is \u2013 that seems to cause the problem.
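
U+2013 is the EN DASH, which UTF-8 encodes as three bytes (E2 80 93); that matches the three-byte range 539..542 reported in the panic. As a rough diagnostic sketch, the hypothetical helper below (`non_ascii_spans` is my own name, not a Stork function) lists every non-ASCII character in a string together with the byte range it occupies, which shows which offsets are unsafe to slice at.

```rust
use std::ops::Range;

/// List each non-ASCII character in `s` with the byte range it occupies.
/// Hypothetical diagnostic helper; not part of Stork.
fn non_ascii_spans(s: &str) -> Vec<(char, Range<usize>)> {
    s.char_indices()
        // Keep only characters outside the ASCII range.
        .filter(|(_, c)| !c.is_ascii())
        // `len_utf8` gives the number of bytes the character takes in UTF-8.
        .map(|(start, c)| (c, start..start + c.len_utf8()))
        .collect()
}
```

Running this over the `contents` string from gallery.toml should report one three-byte span per dash; any byte offset strictly inside one of those spans is not a valid slice point.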

@karlwilcox
Author

Actually, every non-ASCII character in the input file seems to cause a problem with the command-line search hits. In PHP,

$content = iconv("UTF-8", "ASCII//TRANSLIT", $content);

fixes the problem.
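
For anyone generating the input outside PHP, a similar pre-processing step can be sketched in Rust. This is a much narrower stand-in for `iconv`'s `//TRANSLIT`, under my own assumptions: it maps a few common dash and quote characters to ASCII and replaces every other non-ASCII character with `?`. The name `to_ascii_lossy` is hypothetical.

```rust
/// Replace non-ASCII characters with rough ASCII equivalents.
/// Hypothetical helper; only a handful of characters are mapped,
/// everything else non-ASCII becomes '?'.
fn to_ascii_lossy(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            c if c.is_ascii() => c,
            // U+2010..U+2015: hyphens and dashes, including U+2013 EN DASH.
            '\u{2010}'..='\u{2015}' => '-',
            // Curly single quotes.
            '\u{2018}' | '\u{2019}' => '\'',
            // Curly double quotes.
            '\u{201C}' | '\u{201D}' => '"',
            _ => '?',
        })
        .collect()
}
```

Like the `iconv` call, this loses information (accented letters are not preserved), so it is a workaround for the input data rather than a fix for the panic itself.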

This may even be documented somewhere so I'll shut up now...
