Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make code graph APIs SCIP-oriented #59470

Open
varungandhi-src opened this issue Jan 10, 2024 · 5 comments
Open

Make code graph APIs SCIP-oriented #59470

varungandhi-src opened this issue Jan 10, 2024 · 5 comments
Assignees
Labels
graph/api GraphQL API parts owned by Team Graph Migrated team/graph Graph Team (previously Code Intel/Language Tools/Language Platform)

Comments

@varungandhi-src
Copy link
Contributor

varungandhi-src commented Jan 10, 2024

Status Quo

At the moment, the way precise code graph APIs work is based on source ranges. For example, when you do Find references, you end up calling something like this: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/codeintel.codenav.graphql?L114-128

references(line: Int!, character: Int!, ...)

The consequences of this design centered around positions is that:

Proposed direction

From the start, the vision for SCIP has been to serve as part of the core vocabulary at Sourcegraph. We have already incorporated that to some extent with code navigation for locals, where the syntax highlighter can provide information about occurrences for locals and parameters for a growing set of languages, and the client-side code can retrieve a SCIP document1: SCIP document (1, 2), without having to make further requests to the server.

Integrating SCIP into the precise code graph APIs would involve adding support for:

  1. Getting the precise SCIP Document corresponding to a file.
  2. Looking up defs/refs/impls etc based on SCIP symbol names (or suffixes)

With 1. available, when attempting to do Go to definition/Find references, the client-side code could surface a choice to the user when multiple symbols have occurrences for the same token. This would address #57347.

With 1. available, even when a source range only has an occurrence for a single symbol, the client-side code could use the SCIP symbol name (or a slightly cleaned up version of it), to form URLs. For example, a URL like:

https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/auth.ts?L27:13-27:30#tab=references

would become something like:

https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/auth.ts?L27:13-27:30?$symbol=@sourcegraph\/web/1.10.1/src\/auth.ts\/AuthenticatedUser\##tab=references

So if there were changes to previous lines, e.g. new imports were added, but if the code continued to have precise code graph data, the URL would still work, because the client-side code could fetch the SCIP document, and locate the nearest source range to L27:13-27:30 that has the matching symbol (and symbol names for top-level symbols themselves do not change based on source locations2), and (optionally) rewrite the URL. We should still probably maintain the source range in the URL so that the blob view can highlight the intended range in the source file.

With 2. available, the client-side code could only fetch precise data for a single symbol instead of presenting a union.

Accommodating new SCIP data sources

When the work on batch indexing is complete (tentatively planned for Q1 FY25), we'll have a new source of code graph data which is technically not precise, but it will be SCIP-oriented. We should be designing APIs and UI changes (e.g. for the ref panel and URLs) taking this into mind. For example:

  • The client should request data with a setting for a "data source" (precise indexer? tree-sitter? any?), rather than having a tailored API specifically for precise data. This would apply both to the new APIs for fetching SCIP documents, as well as the APIs for fetching defs/refs/impls etc.
  • Results that are returned should include (or be extensible to include) accompanying information about the data source (this matters for the any data source setting).
  • The data source should be implied by URLs, so that link sharing shows consistent results. Right now, the "Mix search-based and precise" setting is a user setting which means that it doesn't work if you're logged out 🙃, and if you're logged in, you may see different results compared to your colleague.

Miscellaneous suggestions

  • Instead of having specialized APIs for defs/refs/impls etc., it might make sense to have a single API where the desired kind of output is specified as a string (this only applies for the output where it is a list of occurrences, this wouldn't apply to potential features like call hierarchy where the result would be a graph of occurrences). There seems to be an unnecessary amount of client-side complexity and code duplication because of the shape of the API.
  • It would be helpful to document guarantees related to duplicates, ordering etc. to avoid unnecessary client-side code related to sorting/uniquing/merging.

Debugging

Footnotes

  1. The field is unfortunately named lsif in the GraphQL API because SCIP was originally called 'LSIF Typed' before the public announcement.

  2. The only major exception is the symbol naming scheme for macros in C and C++

@mike-r-mclaughlin
Copy link
Contributor

Great write up @varungandhi-src! Adding a link to the customer discussion that prompted this so it's easier to find when they ask about it.

@varungandhi-src
Copy link
Contributor Author

Recently, we hit an issue #59733 where the hover doesn't show up despite a source range having precise code intelligence. Right now, the client-side code hard-codes a list of SyntaxKinds as potentially "interactive" for UX reasons, and the cause for that bug is that the SyntaxKind for that range (as provided by the syntax highlighter) is not part of the interactive allowlist. https://sourcegraph.com/github.com/sourcegraph/sourcegraph@b52cc224c11a768fa6ce9a794458e32097d2453d/-/blob/client/web/src/repo/blob/codemirror/codeintel/api.ts?L128-151

If the client-side code had access to the precise SCIP document itself, it could directly check the source range without having to check a separate "interactive" allowlist, avoiding the need to coordinate between different precise indexers and client-side code to make sure that they agree on which tokens can potentially have hovers.

@varungandhi-src varungandhi-src added team/graph Graph Team (previously Code Intel/Language Tools/Language Platform) graph/api GraphQL API parts owned by Team Graph labels Feb 26, 2024
@camdencheek
Copy link
Member

The client should request data with a setting for a "data source"

Thinking through this a little bit, and I had a couple of questions:

  1. Will the client necessarily know what types of data sources are available? For the web client at least, I expect we'll always just want the most precise available.
  2. Will there ever be "mixed" precision documents? Not really sure what this would look like, but maybe something like highlighting coming from treesitter, but symbol data coming from ctags.
  3. Will there ever be a situation where just can't generate a SCIP document for a data source? Or is there an effective "zero value" for a SCIP document (e.g. a document that contains zero symbols)?

@varungandhi-src
Copy link
Contributor Author

Will the client necessarily know what types of data sources are available? For the web client at least, I expect we'll always just want the most precise available.

Right, I think this would be part of the API contract via an enum in GraphQL.

Will there ever be "mixed" precision documents? Not really sure what this would look like, but maybe something like highlighting coming from treesitter, but symbol data coming from ctags.

Generally, it should be fine to mix highlighting data with navigation data from a single source. Today, we already have this - the syntax highlighter sends a single SCIP document with information about both highlighting and code nav for locals & params (when available) based on Tree-sitter.

However, for the general case, I think we want to have separate SCIP Documents for different data sources for code nav specifically, because otherwise, the client may not be able to identify which Occurrence came from which source (and we would not want to modify SCIP to track that on a per-occurrence basis)

Will there ever be a situation where just can't generate a SCIP document for a data source? Or is there an effective "zero value" for a SCIP document (e.g. a document that contains zero symbols)?

Yes, this can happen in a few ways:

  1. The data source does not create SCIP documents on the fly. E.g. precise indexing.
  2. The data source does not support creating a SCIP Document for a specific file. E.g. we may not support syntax highlighting for a specific language.
  3. The data source ran into an error when creating the SCIP Document. E.g. there was a highlighting bug which triggered a panic.

I think we shouldn't have any special handling for zero values. For example, if a file just has comments, then a precise data source will not populate any Occurrences or SymbolInformations in that file. In such a case, it would be incorrect to specially treat the empty array as an error case, and fall back to search-based code nav.

If the new API silently converts errors into zero values, I would consider that as an API design bug.

@varungandhi-src
Copy link
Contributor Author

Related feature request:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
graph/api GraphQL API parts owned by Team Graph Migrated team/graph Graph Team (previously Code Intel/Language Tools/Language Platform)
Projects
None yet
Development

No branches or pull requests

4 participants