normalize fortran identifiers like gfortran #1957

gridaphobe · 2017-12-18T20:00:47Z

Fortran is a case-insensitive language, so it's important to have a
consistent normalization scheme in order to recognize cross-language
function calls, e.g. C-to-Fortran.

gfortran lowercases all Fortran identifiers and adds a trailing
underscore, so this is the convention I adopted.

One issue I have found along the way was that the results page for a
"symbol/refs" query stopped printing the matching lines. I was able to
trace this issue to the Context.getContext method, which for some
reason seems to re-lex the matching file for tokens. It uses a generic
lexer PlainLineTokenizer that isn't aware of my normalization scheme,
and so the symbols that the lexer extracts don't match the user's
query...

I did a quick and dirty fix to conditionally add the normalization logic
to the PlainLineTokenizer, but the whole thing seems a bit
strange. Why are we re-lexing the file instead of storing the relevant
data (filepath, line, etc) in the index, like we do for definitions?

tarzanek · 2018-02-03T13:00:09Z

this actually looks OK
wondering if we can isolate it even more just to Fortran analyser, but this is a good start

Could you resolve conflicts @gridaphobe ?
I think we could merge after that ...

tia
L

idodeclare · 2018-02-03T13:08:15Z

Sorry I should have chimed in sooner.

As @gridaphobe observed, this will be quite different with the modern Lucene UnifiedHighlighter where context presentation does not need to re-lex source.

@tarzanek, please can you hold off merging this for now? I have a sequence of serial PRs that right now is waiting for feature/LOC to get processed. feature/new_highlighter is about fifth in line after that.

tarzanek · 2018-02-05T10:55:30Z

hehe, OK, so let's hold on with this PR

I will try to have a look at LOC PR and talk to Vlad or Krystof to expedite it

gridaphobe · 2018-02-05T16:07:13Z

Should I go ahead and resolve the conflicts now, or do we expect the other PRs to cause their own conflicts? :)

idodeclare · 2018-02-05T16:11:10Z

The other PRs will change the approach that you would use for this.

gridaphobe · 2018-04-10T17:24:15Z

Are the other PRs you mentioned merged yet?

tulinkry · 2018-04-10T18:35:28Z

PR #2052

tarzanek · 2018-04-16T16:22:03Z

@gridaphobe fwiw we were able to merge the PRs that influence this change ...
I guess you can try to merge and ping here if any issues arise, I hope one of us and @idodeclare of course will be able to guide you ;-)
tia
L

gridaphobe · 2018-04-16T16:59:04Z

Thanks! I noticed, I'll try to update my branch this week.

gridaphobe · 2018-04-17T22:17:58Z

I've updated my branch. It passes the unit tests, but I'm having trouble running the indexer now, I get the following lucene error when I run ./OpenGrok index /path/to/code

18:16:45 SEVERE: Failed with unexpected RuntimeException
java.lang.IllegalArgumentException: Could not load codec 'Lucene62'.  Did you forget to add lucene-backward-codecs.jar?
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:420)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:359)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1077)
	at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:407)
	at org.opensolaris.opengrok.index.IndexDatabase$1.run(IndexDatabase.java:204)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene62' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene70]
	at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:116)
	at org.apache.lucene.codecs.Codec.forName(Codec.java:116)
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:416)
	... 10 more

Any ideas?

tarzanek · 2018-04-18T05:00:04Z

yes, please remove old index, current master is on updated lucene, so you need to index from scratch ...

vladak · 2018-04-18T08:49:20Z

The test case failure is innocuous (#2030).

vladak · 2018-04-18T08:50:24Z

The stale index failure should be better captured and appropriate message given to the user - might be worth filing a new issue for this.

gridaphobe · 2018-04-18T15:41:24Z

Thanks! That was indeed the issue.

It looks like the context rendering isn't quite working anymore, it prints the correct lines but does not bold the matched token. I remember having to do some fiddling with the PlainLineTokenizer (Analyzer?) to get that to work originally. Did that part of the code change significantly?

gridaphobe · 2018-04-18T18:42:21Z

Nevermind, I sorted it out.

Everything appears to be working now from my local tests. One thing that we might want to discuss is that fortran defs/refs must currently be searched for in their normalized form. For example, if I have a fortran function

SUBROUTINE FOO
...
END

I would have to search for defs=foo_ to find FOO.

I think it would be nice if the queries defs=FOO and defs=foo_ would resolve to FOO, but I'm not sure how to achieve that. Happy to implement if you can give me a pointer!

idodeclare · 2018-04-19T03:14:59Z

Hi, @gridaphobe ,

Thank you for your continued efforts in this area. OpenGrok on master currently only indexes a single token per character offset — thus you find with your normalization work that the literal value, FOO, is no longer indexed when you are instead storing foo_.

Lucene of course allows multiple tokens per offset in order to support synonyms. I have a completed patch from earlier this year for Objective-C support (as well as patches to support code-like synonyms for enhanced non-whitespace searching) that augments the symbol processing in JFlexTokenizer for Objective-C akin to what you have done for Fortran. I.e., it supports literal values as well as simultaneous normalized values.

The enhanced JFlexSymbolMatcher API is close to what you have also done:

    /**
     * Raises
     * {@link SymbolMatchedListener#symbolMatched(org.opensolaris.opengrok.analysis.SymbolMatchedEvent)}
     * for a subscribed listener.
     * @param literal the literal representation of the symbol
     * @param str the symbol string
     * @param start the symbol literal start position
     */
    protected void onSymbolMatched(String literal, String str, int start)
    ....

Likely you will be able to hook your normalization routines into these APIs so that the same literal code location (i.e., FOO in your example) will be indexed simultaneously by both foo_ as well as FOO. Would you be willing to see how your branch rebases against 5b0fdde (@idodeclare opengrok feature/objective_c)?

I'm also hoping you can revise your approach so as not have to modify anything in CtagsReader but instead to enhance FortranAnalyzer (and PlainAnalyzer) to allow language-specific post-processing of Definitions with normalization at the time of indexing.

(The caveat with this approach w.r.t. to Fortran will be that DEFS queries for foo_ will return all tags that normalize to foo_ while queries for FOO will only return tags that are literally FOO. Introducing normalization at query time is something we will have to discuss further.)

(Also, with the latest changes in master, revisions to PlainLineTokenizer and getContext() are not generally active but only when source code is modified without re-indexing — so you should either explicitly affirm that you have tested this functionality or else leave those files unmodified so that "legacy" highlighting is not Fortran-aware.)

tarzanek · 2018-04-19T06:40:26Z

hmm, getting objective C support would be cool, let's see if we can get it soon

gridaphobe · 2018-04-19T13:17:19Z

Likely you will be able to hook your normalization routines into these APIs so that the same literal code location (i.e., FOO in your example) will be indexed simultaneously by both foo_ as well as FOO. Would you be willing to see how your branch rebases against 5b0fdde (@idodeclare opengrok feature/objective_c)?

Aha! I had actually made the same change locally but wasn't sure how to deal with the Lucene side of things, hence my question :) I'll take a look at your branch.

I'm also hoping you can revise your approach so as not have to modify anything in CtagsReader but instead to enhance FortranAnalyzer (and PlainAnalyzer) to allow language-specific post-processing of Definitions with normalization at the time of indexing.

Yes, that definitely would be nicer, let me see if I can work that out.

(Also, with the latest changes in master, revisions to PlainLineTokenizer and getContext() are not generally active but only when source code is modified without re-indexing — so you should either explicitly affirm that you have tested this functionality or else leave those files unmodified so that "legacy" highlighting is not Fortran-aware.)

I see, I have not tested it in that mode (not since the rebase at least) and it seems like a corner case, so I'm happy to remove the support.

gridaphobe · 2018-04-20T20:27:37Z

@idodeclare I've rebased my branch against your feature/objective_c branch: https://github.com/gridaphobe/OpenGrok/tree/normalize-fortran (I didn't want to push it to this PR since it would clutter the diff with your changes).

There's one failing test, which seems to have something to do with reindexing, but it was present in your branch as well.

idodeclare · 2018-04-21T17:51:37Z

@gridaphobe ,

Glad to hear! I'll add a few comments directly to that revision 1bfa324.

Edit: oh no I guess the comments didn't take. Github seemed to accept them, but now I don't see them on that commit above. I guess it's only PRs that take reviews reliably.

gridaphobe · 2018-05-11T17:34:18Z

@idodeclare ack, sorry this fell off my radar. I actually addressed your comments a couple weeks ago, but it turns out I forgot to push them to GitHub :)

How do you want to proceed with this? Merge your objective-c branch and then I can open a new PR with my rebased branch?

idodeclare · 2018-05-17T02:54:59Z

@gridaphobe , the objective-c branch will be under review presently, after which you can rebase and PR.

gridaphobe force-pushed the master branch from f0b68ad to 00c47b8 Compare April 17, 2018 22:13

normalize fortran variables like gfortran

3232b2e

gridaphobe force-pushed the master branch from c073906 to 3232b2e Compare April 17, 2018 22:15

increase charCmpEndOffset to 1 to ignore fortran subroutine context tags

5bc375b

gridaphobe added 3 commits April 18, 2018 13:55

ctags instance is shared, set normalization per reader instead

c20f159

pass non-normalized symbol to bestIndexForTag

b2187d2

fix normalization of references

31882fe

gridaphobe added 2 commits April 18, 2018 15:48

fix tests

60ffb94

clean up

6aaceaa

vladak mentioned this pull request Mar 16, 2023

Improve handling of case insensitive languages #4223

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

normalize fortran identifiers like gfortran #1957

normalize fortran identifiers like gfortran #1957

gridaphobe commented Dec 18, 2017

tarzanek commented Feb 3, 2018 •

edited

idodeclare commented Feb 3, 2018

tarzanek commented Feb 5, 2018

gridaphobe commented Feb 5, 2018

idodeclare commented Feb 5, 2018

gridaphobe commented Apr 10, 2018

tulinkry commented Apr 10, 2018 •

edited

tarzanek commented Apr 16, 2018

gridaphobe commented Apr 16, 2018

gridaphobe commented Apr 17, 2018

tarzanek commented Apr 18, 2018

vladak commented Apr 18, 2018

vladak commented Apr 18, 2018

gridaphobe commented Apr 18, 2018 •

edited

gridaphobe commented Apr 18, 2018 •

edited

idodeclare commented Apr 19, 2018

tarzanek commented Apr 19, 2018 •

edited

gridaphobe commented Apr 19, 2018

gridaphobe commented Apr 20, 2018 •

edited

idodeclare commented Apr 21, 2018 •

edited

gridaphobe commented May 11, 2018

idodeclare commented May 17, 2018

normalize fortran identifiers like gfortran #1957

Are you sure you want to change the base?

normalize fortran identifiers like gfortran #1957

Conversation

gridaphobe commented Dec 18, 2017

tarzanek commented Feb 3, 2018 • edited

idodeclare commented Feb 3, 2018

tarzanek commented Feb 5, 2018

gridaphobe commented Feb 5, 2018

idodeclare commented Feb 5, 2018

gridaphobe commented Apr 10, 2018

tulinkry commented Apr 10, 2018 • edited

tarzanek commented Apr 16, 2018

gridaphobe commented Apr 16, 2018

gridaphobe commented Apr 17, 2018

tarzanek commented Apr 18, 2018

vladak commented Apr 18, 2018

vladak commented Apr 18, 2018

gridaphobe commented Apr 18, 2018 • edited

gridaphobe commented Apr 18, 2018 • edited

idodeclare commented Apr 19, 2018

tarzanek commented Apr 19, 2018 • edited

gridaphobe commented Apr 19, 2018

gridaphobe commented Apr 20, 2018 • edited

idodeclare commented Apr 21, 2018 • edited

gridaphobe commented May 11, 2018

idodeclare commented May 17, 2018

tarzanek commented Feb 3, 2018 •

edited

tulinkry commented Apr 10, 2018 •

edited

gridaphobe commented Apr 18, 2018 •

edited

gridaphobe commented Apr 18, 2018 •

edited

tarzanek commented Apr 19, 2018 •

edited

gridaphobe commented Apr 20, 2018 •

edited

idodeclare commented Apr 21, 2018 •

edited