
Optimization: cache semantic lookup in tokenizer, add semantic tokens benchmarks #4069

Draft
wants to merge 16 commits into base: master
Conversation

soulomoon
Collaborator

@soulomoon soulomoon commented Feb 12, 2024

  1. There is a high chance that an identifier appears multiple times in a file; caching the identifier's semantic type result in the tokenizer saves repeated computation.
  2. A new test case is added to ensure invisible tokens do not sabotage the cache lookup.
  3. Add a benchmark for semantic tokens.
  4. Fix a benchmark running issue.

This yields roughly a 5% performance improvement in the GetSemanticTokens rule.
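The idea in point 1 can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the type names and the stubbed `computeSemanticType` are hypothetical stand-ins for HLS's real semantic-token machinery.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the plugin's real types.
type Identifier = String
data SemanticTokenType = TVariable | TFunction | TTypeCon
  deriving (Eq, Show)

-- The (potentially expensive) per-identifier lookup we want to
-- avoid repeating; stubbed out here.
computeSemanticType :: Identifier -> SemanticTokenType
computeSemanticType _ = TVariable

-- Walk the identifiers once, threading a cache so each distinct
-- identifier's semantic type is computed at most once and reused
-- on every later occurrence in the file.
tokenizeFile :: [Identifier] -> [SemanticTokenType]
tokenizeFile = go Map.empty
  where
    go _ [] = []
    go cache (i : is) =
      case Map.lookup i cache of
        Just ty -> ty : go cache is
        Nothing ->
          let ty = computeSemanticType i
          in ty : go (Map.insert i ty cache) is
```

Since the cache is keyed purely by identifier, a token that should be invisible must not be inserted under the same key as a visible one — which is what the new test case in point 2 guards against.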

@soulomoon soulomoon marked this pull request as ready for review February 13, 2024 05:50
@soulomoon soulomoon changed the title Optimization: cache semantic lookup Optimization: cache semantic lookup in tokenizer Feb 13, 2024
@wz1000
Collaborator

wz1000 commented Feb 15, 2024

could we have some benchmarks on large files to see if this actually makes a difference?

@soulomoon
Collaborator Author

With

  - name: cabal
    package: Cabal
    version: 3.6.3.0
    modules:
        - src/Distribution/Simple/Configure.hs

cabal bench -j --benchmark-options="profiled-cabal"

| version | configuration | name | success | samples | startup | setup | userT | delayedT | 1stBuildT | avgPerRespT | totalT | rulesBuilt | rulesChanged | rulesVisited | rulesTotal | ruleEdges | ghcRebuilds | maxResidency | allocatedBytes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| upstream | All | semanticTokens | True | 50 | 2.09 | 0.00 | 9.37 | 0.03 | 3.10 | 0.13 | 9.40 | 5147 | 5146 | 5147 | 5908 | 31251 | 1 | 174MB | 13491MB |
| cache-semantic-lookup | All | semanticTokens | True | 50 | 1.87 | 0.00 | 8.36 | 0.10 | 2.77 | 0.11 | 8.46 | 5147 | 5146 | 5147 | 5908 | 31251 | 1 | 173MB | 13437MB |

@wz1000
Collaborator

wz1000 commented Feb 15, 2024

Looks like a decent improvement. Could you add this benchmark to the ghcide-bench experiments please?

@wz1000
Collaborator

wz1000 commented Feb 15, 2024

the ghcRebuilds number seems to indicate that we aren't editing the file in the benchmark passes. Could you ensure that each benchmark iteration also edits the file?

@soulomoon
Collaborator Author

> the ghcRebuilds number seems to indicate that we aren't editing the file in the benchmark passes. Could you ensure that each benchmark iteration also edits the file?

Sure, I will make some changes

@soulomoon
Collaborator Author

soulomoon commented Feb 15, 2024

I need to take some more time to come up with a better benchmark for this. Converting it to a draft for now.

@soulomoon soulomoon marked this pull request as draft February 15, 2024 17:51
@soulomoon
Collaborator Author

soulomoon commented Feb 16, 2024

The overall improvement is not very pronounced: slightly better memory usage and speed.

| version | configuration | name | success | samples | startup | setup | userT | delayedT | 1stBuildT | avgPerRespT | totalT | rulesBuilt | rulesChanged | rulesVisited | rulesTotal | ruleEdges | ghcRebuilds | maxResidency | allocatedBytes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| upstream | All | semanticTokens | True | 100 | 2.00 | 0.00 | 128.78 | 1.27 | 3.41 | 0.63 | 130.06 | 12 | 10 | 1364 | 5914 | 31271 | 102 | 296MB | 568821MB |
| cache-semantic-lookup | All | semanticTokens | True | 100 | 1.92 | 0.00 | 126.47 | 2.11 | 3.43 | 0.62 | 128.60 | 12 | 10 | 1364 | 5914 | 31271 | 102 | 290MB | 562533MB |

The GetSemanticTokens rule stably gives better results.
[screenshot: trace comparison]

Another result:
[screenshot: trace comparison]

@soulomoon
Collaborator Author

soulomoon commented Feb 16, 2024

After updating to master, it does seem to show a decent improvement.

| version | configuration | name | success | samples | startup | setup | userT | delayedT | 1stBuildT | avgPerRespT | totalT | rulesBuilt | rulesChanged | rulesVisited | rulesTotal | ruleEdges | ghcRebuilds | maxResidency | allocatedBytes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| upstream | All | semanticTokens | True | 100 | 1.99 | 0.00 | 148.55 | 2.80 | 3.75 | 0.73 | 151.36 | 12 | 10 | 1364 | 5914 | 31271 | 102 | 307MB | 594060MB |
| cache-semantic-lookup | All | semanticTokens | True | 100 | 1.93 | 0.00 | 134.02 | 1.15 | 6.84 | 0.64 | 135.18 | 12 | 10 | 1364 | 5914 | 31271 | 102 | 318MB | 562002MB |

@soulomoon soulomoon marked this pull request as ready for review February 16, 2024 14:37
@soulomoon
Collaborator Author

soulomoon commented Feb 18, 2024

Another run: the overall result is similar, but a detailed comparison of the traces does show some decent improvement.

| version | configuration | name | success | samples | startup | setup | userT | delayedT | 1stBuildT | avgPerRespT | totalT | rulesBuilt | rulesChanged | rulesVisited | rulesTotal | ruleEdges | ghcRebuilds | maxResidency | allocatedBytes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| upstream | All | semanticTokens | True | 50 | 1.83 | 0.00 | 64.00 | 1.18 | 3.50 | 0.62 | 65.18 | 12 | 10 | 1364 | 5914 | 31271 | 52 | 298MB | 287351MB |
| HEAD | All | semanticTokens | True | 50 | 1.91 | 0.00 | 64.61 | 0.70 | 3.60 | 0.62 | 65.32 | 12 | 10 | 1364 | 5914 | 31271 | 52 | 299MB | 285555MB |
[screenshots: trace comparison, 2024-02-18 12:50]

@soulomoon
Collaborator Author

soulomoon commented Feb 18, 2024

cc @wz1000, I have produced some more accurate results; they show about a 5% improvement in the GetSemanticTokens rule.
It should be more if we consider the tokenizer alone.

```diff
@@ -333,7 +333,7 @@ benchRules build MkBenchRules{..} = do
            ++ concat
               [[ "-h"
                , "-i" <> show i
-               , "-po" <> outHp
+               , "-po" <> dropExtension outHp
```
Collaborator Author

@soulomoon soulomoon Feb 18, 2024


On macOS, the original code produced a duplicated suffix (*.hp.hp), making the bench test fail.
dropExtension fixes it.
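To illustrate the fix: the RTS appends ".hp" to the `-po` prefix itself, so passing the full `*.hp` path yields `*.hp.hp`. A small sketch (the `heapProfileArgs` helper is hypothetical, for illustration only; `dropExtension` is from the `filepath` package used in the diff above):

```haskell
import System.FilePath (dropExtension)

-- The RTS appends ".hp" to whatever prefix -po is given, so we
-- must strip the ".hp" extension from the intended output path
-- before passing it, or the RTS writes to "bench.hp.hp".
heapProfileArgs :: FilePath -> [String]
heapProfileArgs outHp = ["-h", "-po" <> dropExtension outHp]
```

For example, `dropExtension "bench.hp"` is `"bench"`, so the RTS then produces `bench.hp` as intended.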

dependabot bot and others added 3 commits February 22, 2024 21:27
Bumps [pre-commit/action](https://github.com/pre-commit/action) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/pre-commit/action/releases)
- [Commits](pre-commit/action@v3.0.0...v3.0.1)

---
updated-dependencies:
- dependency-name: pre-commit/action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Michael Peyton Jones <me@michaelpj.com>
@soulomoon soulomoon changed the title Optimization: cache semantic lookup in tokenizer Optimization: cache semantic lookup in tokenizer, add semantic tokens benchmarks Feb 22, 2024
@michaelpj
Collaborator

> I have produced some more accurate result, seems about 5% improvement to GetSemanticTokens Rule.
> Should be more if we consider the tokenizer alone.

FWIW I'm unsure if that's a big enough difference to make this worth it. It might be better just to keep the simpler code (we should definitely add the benchmark, though). Is the improvement bigger on bigger files?

@soulomoon
Collaborator Author

soulomoon commented Feb 28, 2024

> I have produced some more accurate result, seems about 5% improvement to GetSemanticTokens Rule.
>
> Should be more if we consider the tokenizer alone.

> FWIW I'm unsure if that's a big enough difference to make this worth it. It might be better just to keep the simpler code (we should definitely add the benchmark, though). Is the improvement bigger on bigger files?

The benchmark runs on files with 2.5k and 1k lines of code from the Cabal package.

Agreed, not much difference, since the current computation is relatively simple; perhaps I should turn it into a draft and revisit this if/when more features are introduced that make the computation heavier. (BTW, the semantic tokens benchmark and the benchmark fix are already included in the hls-graph patch.)

@soulomoon soulomoon marked this pull request as draft February 28, 2024 13:44