Falcon7b prefill remaining fixes and cleanup after enabling optimized version #8349

Open
1 of 8 tasks
s-jovic opened this issue May 10, 2024 · 2 comments

Comments

s-jovic commented May 10, 2024

  • Segfault for 1k and 2k prefill on multi-device (single run) and single device (when run in a loop): Seg fault in falcon7b prefil optimised attention #8644
  • Async mode (not all ops used in optimized prefill support async mode)
  • Make model initialization agnostic to sequence length
  • GS path - decide whether to use the optimized version for GS and, if so, make it work
  • Use an appropriate memory configuration (L1 sharded?)
  • Resolve LM head e2e perf impact
  • Push PCC to 0.99: Falcon7b prefill PCC below 0.99 #8487
  • Reduce memory usage of e2e tests - the 2k test uses up to 80 GB of RAM for some reason; try running the PyTorch and TT models sequentially (see the sketch after this list)
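
As a starting point for the last item, here is a minimal sketch of running the PyTorch reference and the TT model one after another instead of side by side, so only one of them is resident in host memory at a time. The loader and comparison callables are hypothetical stand-ins for the existing test code, not functions from the repo.

```python
import gc

import torch


def run_prefill_e2e(load_reference_model, load_tt_model, compare_pcc, seq_len, inputs):
    # Stage 1: reference pass; keep only the detached outputs and free the
    # reference model before the TT model is ever constructed.
    reference_model = load_reference_model()
    with torch.no_grad():
        reference_logits = reference_model(inputs).detach()
    del reference_model
    gc.collect()

    # Stage 2: device pass; the TT model is built only after the reference
    # model has been released, so the two never coexist in host memory.
    tt_model = load_tt_model(seq_len)
    tt_logits = tt_model(inputs)

    # Check correctness; 0.99 is the PCC target from the item above.
    assert compare_pcc(reference_logits, tt_logits, 0.99)
```
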
@pavlepopovic

  • Check that all matmuls (attention, MLP, LM head) have optimal settings (subblock_h/w, in0_block_w) following the di/dt fixes (see the sketch after this list)
  • Check if any more sharding is possible throughout the model
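
For the matmul settings, a small helper like the one below could be used to sanity-check each program config. It assumes the usual tt-metal constraints that out_subblock_h * out_subblock_w fits in the destination registers (8 tiles, or 4 with fp32 dest accumulation) and that the subblocks evenly divide the per-core output block; both assumptions should be verified against the current op implementations.

```python
def pick_out_subblocks(per_core_m, per_core_n, fp32_dest_acc=False):
    """Return the largest (out_subblock_h, out_subblock_w) in tiles for a
    per-core output block of per_core_m x per_core_n tiles, under the
    assumed constraints described above."""
    max_dest_tiles = 4 if fp32_dest_acc else 8
    best = (1, 1)
    for h in range(1, per_core_m + 1):
        if per_core_m % h:
            continue
        for w in range(1, per_core_n + 1):
            if per_core_n % w or h * w > max_dest_tiles:
                continue
            if h * w > best[0] * best[1]:
                best = (h, w)
    return best


# Example: an 8x4-tile per-core block gives 2x4 subblocks (8 dest tiles).
print(pick_out_subblocks(per_core_m=8, per_core_n=4))
```
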

s-jovic commented May 24, 2024

  • Remove persistent kernel cache usage
  • Put optimized attention on CI
  • Unify paths for 128/1k/2k and other sequence lengths
  • Perf breakdown
  • Update perf targets in CI tests
  • Add multi-chip 128, 1k, 2k prefill tests (see the sketch after this list)
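
For the multi-chip tests, a placeholder pytest parametrization along these lines could cover all three sequence lengths in one test. The device counts are illustrative and the body is a stub, since the actual harness, perf targets, and device configurations live in the existing CI tests.

```python
import pytest


@pytest.mark.parametrize("seq_len", [128, 1024, 2048])
@pytest.mark.parametrize("num_devices", [4, 8])
def test_falcon7b_multichip_prefill(seq_len, num_devices):
    # Stub body: the real test would run prefill for `seq_len` across
    # `num_devices` chips and check PCC and e2e perf against the CI targets.
    pytest.skip(f"placeholder: prefill seq_len={seq_len} on {num_devices} devices")
```
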
