Make code graph APIs SCIP-oriented #59470

varungandhi-src · 2024-01-10T13:38:09Z

Status Quo

At the moment, the way precise code graph APIs work is based on source ranges. For example, when you do Find references, you end up calling something like this: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/codeintel.codenav.graphql?L114-128

references(line: Int!, character: Int!, ...)

The consequences of this design centered around positions is that:

Ref panel URLs use source ranges. This means that the ref panel URLs that are not permalinks (i.e. not pinned a specific commit) can easily break if there were some unrelated changes earlier in the file causing the range for the occurrence to change.
Client-side code cannot distinguish the cases where multiple semantically different symbols are present at the same source range, from the case where only one unique symbol is present. Related issues:
- Go to definition arbitrarily jumps to one definition if multiple are present #57347
- Better handling for multiple precise occurrences for the same source range #59326

Proposed direction

From the start, the vision for SCIP has been to serve as part of the core vocabulary at Sourcegraph. We have already incorporated that to some extent with code navigation for locals, where the syntax highlighter can provide information about occurrences for locals and parameters for a growing set of languages, and the client-side code can retrieve a SCIP document¹: SCIP document (1, 2), without having to make further requests to the server.

Integrating SCIP into the precise code graph APIs would involve adding support for:

Getting the precise SCIP Document corresponding to a file.
Looking up defs/refs/impls etc based on SCIP symbol names (or suffixes)

With 1. available, when attempting to do Go to definition/Find references, the client-side code could surface a choice to the user when multiple symbols have occurrences for the same token. This would address #57347.

With 1. available, even when a source range only has an occurrence for a single symbol, the client-side code could use the SCIP symbol name (or a slightly cleaned up version of it), to form URLs. For example, a URL like:

https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/auth.ts?L27:13-27:30#tab=references

would become something like:

https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/auth.ts?L27:13-27:30?$symbol=@sourcegraph\/web/1.10.1/src\/auth.ts\/AuthenticatedUser\##tab=references

So if there were changes to previous lines, e.g. new imports were added, but if the code continued to have precise code graph data, the URL would still work, because the client-side code could fetch the SCIP document, and locate the nearest source range to L27:13-27:30 that has the matching symbol (and symbol names for top-level symbols themselves do not change based on source locations²), and (optionally) rewrite the URL. We should still probably maintain the source range in the URL so that the blob view can highlight the intended range in the source file.

With 2. available, the client-side code could only fetch precise data for a single symbol instead of presenting a union.

Accommodating new SCIP data sources

When the work on batch indexing is complete (tentatively planned for Q1 FY25), we'll have a new source of code graph data which is technically not precise, but it will be SCIP-oriented. We should be designing APIs and UI changes (e.g. for the ref panel and URLs) taking this into mind. For example:

The client should request data with a setting for a "data source" (precise indexer? tree-sitter? any?), rather than having a tailored API specifically for precise data. This would apply both to the new APIs for fetching SCIP documents, as well as the APIs for fetching defs/refs/impls etc.
Results that are returned should include (or be extensible to include) accompanying information about the data source (this matters for the any data source setting).
The data source should be implied by URLs, so that link sharing shows consistent results. Right now, the "Mix search-based and precise" setting is a user setting which means that it doesn't work if you're logged out 🙃, and if you're logged in, you may see different results compared to your colleague.

Miscellaneous suggestions

Instead of having specialized APIs for defs/refs/impls etc., it might make sense to have a single API where the desired kind of output is specified as a string (this only applies for the output where it is a list of occurrences, this wouldn't apply to potential features like call hierarchy where the result would be a graph of occurrences). There seems to be an unnecessary amount of client-side complexity and code duplication because of the shape of the API.
It would be helpful to document guarantees related to duplicates, ordering etc. to avoid unnecessary client-side code related to sorting/uniquing/merging.

Debugging

The field is unfortunately named lsif in the GraphQL API because SCIP was originally called 'LSIF Typed' before the public announcement. ↩
The only major exception is the symbol naming scheme for macros in C and C++ ↩

The text was updated successfully, but these errors were encountered:

mike-r-mclaughlin · 2024-01-12T16:41:13Z

Great write up @varungandhi-src! Adding a link to the customer discussion that prompted this so it's easier to find when they ask about it.

varungandhi-src · 2024-01-22T09:31:02Z

Recently, we hit an issue #59733 where the hover doesn't show up despite a source range having precise code intelligence. Right now, the client-side code hard-codes a list of SyntaxKinds as potentially "interactive" for UX reasons, and the cause for that bug is that the SyntaxKind for that range (as provided by the syntax highlighter) is not part of the interactive allowlist. https://sourcegraph.com/github.com/sourcegraph/sourcegraph@b52cc224c11a768fa6ce9a794458e32097d2453d/-/blob/client/web/src/repo/blob/codemirror/codeintel/api.ts?L128-151

If the client-side code had access to the precise SCIP document itself, it could directly check the source range without having to check a separate "interactive" allowlist, avoiding the need to coordinate between different precise indexers and client-side code to make sure that they agree on which tokens can potentially have hovers.

camdencheek · 2024-03-07T19:26:51Z

The client should request data with a setting for a "data source"

Thinking through this a little bit, and I had a couple of questions:

Will the client necessarily know what types of data sources are available? For the web client at least, I expect we'll always just want the most precise available.
Will there ever be "mixed" precision documents? Not really sure what this would look like, but maybe something like highlighting coming from treesitter, but symbol data coming from ctags.
Will there ever be a situation where just can't generate a SCIP document for a data source? Or is there an effective "zero value" for a SCIP document (e.g. a document that contains zero symbols)?

varungandhi-src · 2024-03-11T09:41:14Z

Will the client necessarily know what types of data sources are available? For the web client at least, I expect we'll always just want the most precise available.

Right, I think this would be part of the API contract via an enum in GraphQL.

Will there ever be "mixed" precision documents? Not really sure what this would look like, but maybe something like highlighting coming from treesitter, but symbol data coming from ctags.

Generally, it should be fine to mix highlighting data with navigation data from a single source. Today, we already have this - the syntax highlighter sends a single SCIP document with information about both highlighting and code nav for locals & params (when available) based on Tree-sitter.

However, for the general case, I think we want to have separate SCIP Documents for different data sources for code nav specifically, because otherwise, the client may not be able to identify which Occurrence came from which source (and we would not want to modify SCIP to track that on a per-occurrence basis)

Will there ever be a situation where just can't generate a SCIP document for a data source? Or is there an effective "zero value" for a SCIP document (e.g. a document that contains zero symbols)?

Yes, this can happen in a few ways:

The data source does not create SCIP documents on the fly. E.g. precise indexing.
The data source does not support creating a SCIP Document for a specific file. E.g. we may not support syntax highlighting for a specific language.
The data source ran into an error when creating the SCIP Document. E.g. there was a highlighting bug which triggered a panic.

I think we shouldn't have any special handling for zero values. For example, if a file just has comments, then a precise data source will not populate any Occurrences or SymbolInformations in that file. In such a case, it would be incorrect to specially treat the empty array as an error case, and fall back to search-based code nav.

If the new API silently converts errors into zero values, I would consider that as an API design bug.

varungandhi-src · 2024-03-11T10:58:47Z

Related feature request:

SCIP based Language Server & extension for VS Code #60959

varungandhi-src added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) graph/api GraphQL API parts owned by Team Graph labels Feb 26, 2024

varungandhi-src mentioned this issue Mar 12, 2024

Syntactic indexing with heuristics #58727

Open

20 tasks

varungandhi-src self-assigned this Mar 27, 2024

varungandhi-src mentioned this issue Apr 10, 2024

Offer choice for Go to definition when target definition is ambiguous #59026

Open

mmanela added the Migrated label May 6, 2024 — with Linear

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make code graph APIs SCIP-oriented #59470

Make code graph APIs SCIP-oriented #59470

varungandhi-src commented Jan 10, 2024 •

edited

mike-r-mclaughlin commented Jan 12, 2024

varungandhi-src commented Jan 22, 2024

camdencheek commented Mar 7, 2024

varungandhi-src commented Mar 11, 2024

varungandhi-src commented Mar 11, 2024

Make code graph APIs SCIP-oriented #59470

Make code graph APIs SCIP-oriented #59470

Comments

varungandhi-src commented Jan 10, 2024 • edited

Status Quo

Proposed direction

Accommodating new SCIP data sources

Miscellaneous suggestions

Debugging

Footnotes

mike-r-mclaughlin commented Jan 12, 2024

varungandhi-src commented Jan 22, 2024

camdencheek commented Mar 7, 2024

varungandhi-src commented Mar 11, 2024

varungandhi-src commented Mar 11, 2024

varungandhi-src commented Jan 10, 2024 •

edited