Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize fortran identifiers like gfortran #1957

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

gridaphobe
Copy link

Fortran is a case-insensitive language, so it's important to have a
consistent normalization scheme in order to recognize cross-language
function calls, e.g. C-to-Fortran.

gfortran lowercases all Fortran identifiers and adds a trailing
underscore, so this is the convention I adopted.

One issue I have found along the way was that the results page for a
"symbol/refs" query stopped printing the matching lines. I was able to
trace this issue to the Context.getContext method, which for some
reason seems to re-lex the matching file for tokens. It uses a generic
lexer PlainLineTokenizer that isn't aware of my normalization scheme,
and so the symbols that the lexer extracts don't match the user's
query...

I did a quick and dirty fix to conditionally add the normalization logic
to the PlainLineTokenizer, but the whole thing seems a bit
strange. Why are we re-lexing the file instead of storing the relevant
data (filepath, line, etc) in the index, like we do for definitions?

@tarzanek
Copy link
Contributor

tarzanek commented Feb 3, 2018

this actually looks OK
wondering if we can isolate it even more just to Fortran analyser, but this is a good start

Could you resolve conflicts @gridaphobe ?
I think we could merge after that ...

tia
L

@idodeclare
Copy link
Contributor

Sorry I should have chimed in sooner.

As @gridaphobe observed, this will be quite different with the modern Lucene UnifiedHighlighter where context presentation does not need to re-lex source.

@tarzanek, please can you hold off merging this for now? I have a sequence of serial PRs that right now is waiting for feature/LOC to get processed. feature/new_highlighter is about fifth in line after that.

@tarzanek
Copy link
Contributor

tarzanek commented Feb 5, 2018

hehe, OK, so let's hold on with this PR

I will try to have a look at LOC PR and talk to Vlad or Krystof to expedite it

@gridaphobe
Copy link
Author

Should I go ahead and resolve the conflicts now, or do we expect the other PRs to cause their own conflicts? :)

@idodeclare
Copy link
Contributor

The other PRs will change the approach that you would use for this.

@gridaphobe
Copy link
Author

Are the other PRs you mentioned merged yet?

@tulinkry
Copy link
Contributor

tulinkry commented Apr 10, 2018

PR #2052

@tarzanek
Copy link
Contributor

@gridaphobe fwiw we were able to merge the PRs that influence this change ...
I guess you can try to merge and ping here if any issues arise, I hope one of us and @idodeclare of course will be able to guide you ;-)
tia
L

@gridaphobe
Copy link
Author

Thanks! I noticed, I'll try to update my branch this week.

@gridaphobe
Copy link
Author

I've updated my branch. It passes the unit tests, but I'm having trouble running the indexer now, I get the following lucene error when I run ./OpenGrok index /path/to/code

18:16:45 SEVERE: Failed with unexpected RuntimeException
java.lang.IllegalArgumentException: Could not load codec 'Lucene62'.  Did you forget to add lucene-backward-codecs.jar?
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:420)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:359)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1077)
	at org.opensolaris.opengrok.index.IndexDatabase.update(IndexDatabase.java:407)
	at org.opensolaris.opengrok.index.IndexDatabase$1.run(IndexDatabase.java:204)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene62' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene70]
	at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:116)
	at org.apache.lucene.codecs.Codec.forName(Codec.java:116)
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:416)
	... 10 more

Any ideas?

@tarzanek
Copy link
Contributor

yes, please remove old index, current master is on updated lucene, so you need to index from scratch ...

@vladak
Copy link
Member

vladak commented Apr 18, 2018

The test case failure is innocuous (#2030).

@vladak
Copy link
Member

vladak commented Apr 18, 2018

The stale index failure should be better captured and appropriate message given to the user - might be worth filing a new issue for this.

@gridaphobe
Copy link
Author

gridaphobe commented Apr 18, 2018

Thanks! That was indeed the issue.

It looks like the context rendering isn't quite working anymore, it prints the correct lines but does not bold the matched token. I remember having to do some fiddling with the PlainLineTokenizer (Analyzer?) to get that to work originally. Did that part of the code change significantly?

@gridaphobe
Copy link
Author

gridaphobe commented Apr 18, 2018

Nevermind, I sorted it out.

Everything appears to be working now from my local tests. One thing that we might want to discuss is that fortran defs/refs must currently be searched for in their normalized form. For example, if I have a fortran function

SUBROUTINE FOO
...
END

I would have to search for defs=foo_ to find FOO.

I think it would be nice if the queries defs=FOO and defs=foo_ would resolve to FOO, but I'm not sure how to achieve that. Happy to implement if you can give me a pointer!

@idodeclare
Copy link
Contributor

Hi, @gridaphobe ,

Thank you for your continued efforts in this area. OpenGrok on master currently only indexes a single token per character offset — thus you find with your normalization work that the literal value, FOO, is no longer indexed when you are instead storing foo_.

Lucene of course allows multiple tokens per offset in order to support synonyms. I have a completed patch from earlier this year for Objective-C support (as well as patches to support code-like synonyms for enhanced non-whitespace searching) that augments the symbol processing in JFlexTokenizer for Objective-C akin to what you have done for Fortran. I.e., it supports literal values as well as simultaneous normalized values.

The enhanced JFlexSymbolMatcher API is close to what you have also done:

    /**
     * Raises
     * {@link SymbolMatchedListener#symbolMatched(org.opensolaris.opengrok.analysis.SymbolMatchedEvent)}
     * for a subscribed listener.
     * @param literal the literal representation of the symbol
     * @param str the symbol string
     * @param start the symbol literal start position
     */
    protected void onSymbolMatched(String literal, String str, int start)
    ....

Likely you will be able to hook your normalization routines into these APIs so that the same literal code location (i.e., FOO in your example) will be indexed simultaneously by both foo_ as well as FOO. Would you be willing to see how your branch rebases against 5b0fdde (@idodeclare opengrok feature/objective_c)?

I'm also hoping you can revise your approach so as not have to modify anything in CtagsReader but instead to enhance FortranAnalyzer (and PlainAnalyzer) to allow language-specific post-processing of Definitions with normalization at the time of indexing.

(The caveat with this approach w.r.t. to Fortran will be that DEFS queries for foo_ will return all tags that normalize to foo_ while queries for FOO will only return tags that are literally FOO. Introducing normalization at query time is something we will have to discuss further.)

(Also, with the latest changes in master, revisions to PlainLineTokenizer and getContext() are not generally active but only when source code is modified without re-indexing — so you should either explicitly affirm that you have tested this functionality or else leave those files unmodified so that "legacy" highlighting is not Fortran-aware.)

@tarzanek
Copy link
Contributor

tarzanek commented Apr 19, 2018

hmm, getting objective C support would be cool, let's see if we can get it soon

@gridaphobe
Copy link
Author

Likely you will be able to hook your normalization routines into these APIs so that the same literal code location (i.e., FOO in your example) will be indexed simultaneously by both foo_ as well as FOO. Would you be willing to see how your branch rebases against 5b0fdde (@idodeclare opengrok feature/objective_c)?

Aha! I had actually made the same change locally but wasn't sure how to deal with the Lucene side of things, hence my question :) I'll take a look at your branch.

I'm also hoping you can revise your approach so as not have to modify anything in CtagsReader but instead to enhance FortranAnalyzer (and PlainAnalyzer) to allow language-specific post-processing of Definitions with normalization at the time of indexing.

Yes, that definitely would be nicer, let me see if I can work that out.

(Also, with the latest changes in master, revisions to PlainLineTokenizer and getContext() are not generally active but only when source code is modified without re-indexing — so you should either explicitly affirm that you have tested this functionality or else leave those files unmodified so that "legacy" highlighting is not Fortran-aware.)

I see, I have not tested it in that mode (not since the rebase at least) and it seems like a corner case, so I'm happy to remove the support.

@gridaphobe
Copy link
Author

gridaphobe commented Apr 20, 2018

@idodeclare I've rebased my branch against your feature/objective_c branch: https://github.com/gridaphobe/OpenGrok/tree/normalize-fortran (I didn't want to push it to this PR since it would clutter the diff with your changes).

There's one failing test, which seems to have something to do with reindexing, but it was present in your branch as well.

@idodeclare
Copy link
Contributor

idodeclare commented Apr 21, 2018

@gridaphobe ,

Glad to hear! I'll add a few comments directly to that revision 1bfa324.

Edit: oh no I guess the comments didn't take. Github seemed to accept them, but now I don't see them on that commit above. I guess it's only PRs that take reviews reliably.

@gridaphobe
Copy link
Author

@idodeclare ack, sorry this fell off my radar. I actually addressed your comments a couple weeks ago, but it turns out I forgot to push them to GitHub :)

How do you want to proceed with this? Merge your objective-c branch and then I can open a new PR with my rebased branch?

@idodeclare
Copy link
Contributor

@gridaphobe , the objective-c branch will be under review presently, after which you can rebase and PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants