Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-detect language per-line is guaranteed to produce poor results #392

Open
joshgoebel opened this issue Jun 4, 2021 · 6 comments
Open

Comments

@joshgoebel
Copy link

joshgoebel commented Jun 4, 2021

Hey, current maintainer of Highlight.js here. This just came to my attention via #391.

You're looping over lines and then calling highlightAuto on every line (when you don't have a known language). This is not recommended and guaranteed to produce poor results. Auto-detect is not intended to be useful with such little data and the noise will often (as reported in #391) be much higher than the signal - you're just as likely to get random languages than anything useful. There will be color, but often all wrong.

If you do wish to use auto-detect you should pass us the ENTIRE document (or at the very least all the available lines from the document/diff), then look at the language we determine it to be, then use that language for every single line.

You'll have to take this approach with version 11 anyways since you'll have to do the highlighting in a single pass (rather than per-line). So calling highlightAuto upfront for all available lines and letting it use the greater amount of content available for it's auto-detection... then splitting that result back out into the individual lines you need - already highlighted for you.

You'll have to do it twice of source, once each for the before and after streams.

@rtfpessoa
Copy link
Owner

Hi @joshgoebel,

Thanks for bringing this up.

I was under the assumption that the code was fine, and I even updated it based on your suggestion highlightjs/highlight.js#2277 (comment) leading to 2416b3d

If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line?

Does this also apply to code I already know the language?

The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job.

@joshgoebel
Copy link
Author

If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line?

Well, "correctly highlight" might be overstating it. The way IMHO to "correctly attempt" it is to give us all the content and let us decide what it is... that will at least have consistent results, even if not 100% correct. And it's not the "only" way... you could do a "pre-highlight" (of all the lines) first to get the language... then do your line by line approach... but again v11 will require a different implementation, so I don't think it's even worth pursuing that avenue.

Does this also apply to code I already know the language?

I opened this issue solely because of the random behavior with auto-detect when it's used line by line. But yes, with v11 how you are doing this will have to be rewritten because the engine does not expose it's internal parsing state to the outside anymore.

The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job.

But if the results are random and haphazard I'd suggest it'd be better to default to nothing.

@rtfpessoa
Copy link
Owner

@joshgoebel I believe I am not understanding exactly what you mean. So maybe if I try to reduce the problem space it will help me understand.
For now, let's ignore the fact that we might not know the language for some files, consider that I am already targeting version 11 and that I can never get all the code in the files since diffs by default have a limited context.

I would like to understand what would be the best way possible to do the highlight.

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11?
If yes, what would be the alternative? Would I have to pass all the lines I have for it to work?

@iHiD
Copy link

iHiD commented Jun 4, 2021

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

@rtfpessoa
Copy link
Owner

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

@iHiD thanks for the reference. Will definitely read it.

@joshgoebel
Copy link
Author

joshgoebel commented Jun 5, 2021

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11?

That's not really what you want to do as it will break on any scopes that persist past the end of a line boundary. What you'd really want to do:

  • Collect all sequential diff lines
  • Do this for both the "original" and the "changed" versions
  • So you'll have say 10 lines of code in two strings now, "before" and "after"
  • Highlight both blocks of code.
  • Split the lines (this will require some small amount of parsing and fixing up tags to end at newlines and re-open on the following line)... that can be done in 5-10 lines though as our HTML output is VERY clean easy easy to parse.

You'd really need to do this for each section of a diff (if they are non-sequential). So if a diff included 3 discrete changes, ~10 lines each then you'd be grouping each of those 3 changes into blocks and then highlighting all 3 blocks. Then splitting them apart again to get at the individual highlighted lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants