Is there anything we can do about the JavaScript overhead that comes from predictions made in a loop? #8212

Open
Vectorrent opened this issue Mar 16, 2024 · 0 comments


Vectorrent commented Mar 16, 2024

I have a simple stacked RNN that predicts text in a loop, character by character. Here is a simplified version of that code.
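
(The linked code is not reproduced here; the following is only a minimal sketch of the pattern, with placeholder sizes (vocabSize, seqLen, units) and a generic stacked LSTM standing in for the real model.)

```ts
import * as tf from '@tensorflow/tfjs-node-gpu';

const vocabSize = 96;       // placeholder character-vocabulary size
const seqLen = 64;          // placeholder context window fed to the model each step
const units = [256, 256];   // one entry per stacked recurrent layer

function buildModel(): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.embedding({inputDim: vocabSize, outputDim: 64, inputLength: seqLen}));
  units.forEach((u, i) => {
    // All but the last recurrent layer return full sequences so they can be stacked.
    model.add(tf.layers.lstm({units: u, returnSequences: i < units.length - 1}));
  });
  model.add(tf.layers.dense({units: vocabSize, activation: 'softmax'}));
  return model;
}

// One model.predict() call per generated character; the fixed per-call cost is
// what dominates once the layers themselves are small.
function generate(model: tf.LayersModel, seedIds: number[], length: number): number[] {
  const out = seedIds.slice();
  for (let i = 0; i < length; i++) {
    const next = tf.tidy(() => {
      const window = out.slice(-seqLen);
      while (window.length < seqLen) window.unshift(0);   // left-pad the context
      const input = tf.tensor2d([window], [1, seqLen], 'int32');
      const probs = model.predict(input) as tf.Tensor;
      return probs.argMax(-1).dataSync()[0];              // GPU -> CPU sync on every step
    });
    out.push(next);
  }
  return out.slice(seedIds.length);
}
```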

No matter the size of my model, there is a constant ~150 ms of latency per layer for each prediction. For reference:

Layers           Latency
[16]             150 ms/token
[256]            150 ms/token
[16, 16]         300 ms/token
[256, 256]       300 ms/token
[16, 16, 16]     450 ms/token
[256, 256, 256]  450 ms/token

Currently, I'm running this code in Node.js (on GPU), but I can confirm that the latency persists in WebGL as well.

Is there anything we can do to speed up predictions here? Text generation is unbearably slow, to the point where TFJS is barely even useful for my task. Conversely, training is fast, even with big batches and many layers! Clearly, the issue comes from repeated calls to .predict() and the overhead associated with each call. Is there any way to move this computation into the model and return an entire sequence from a single prediction, rather than token by token in a loop?
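
One direction that might help, sketched under the same placeholder assumptions as above: mark the recurrent layers as stateful (assuming TFJS stateful RNNs behave as in Keras), so each predict() call processes a single new character instead of re-running the whole context window, and feed the sampled index tensor straight back into the model so the loop never forces a GPU-to-CPU readback until the very end. This does not remove the fixed dispatch cost of each .predict() call, but it does remove the per-step synchronization and the per-step window recompute.

```ts
import * as tf from '@tensorflow/tfjs-node-gpu';

// Same stack as before, but stateful: with a fixed batch size of 1, each
// predict() call consumes exactly one new character and the LSTM layers carry
// their hidden state between calls. vocabSize and units are placeholders.
function buildStatefulModel(vocabSize: number, units: number[]): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.embedding({
    inputDim: vocabSize,
    outputDim: 64,
    batchInputShape: [1, 1],  // stateful RNNs need a fixed batch size
  }));
  units.forEach((u, i) => {
    model.add(tf.layers.lstm({
      units: u,
      stateful: true,
      returnSequences: i < units.length - 1,
    }));
  });
  model.add(tf.layers.dense({units: vocabSize, activation: 'softmax'}));
  return model;
}

async function generate(model: tf.LayersModel, seed: number, length: number): Promise<Int32Array> {
  // Clear any state carried over from a previous sequence.
  model.layers.forEach(layer => {
    const l = layer as unknown as {resetStates?: () => void};
    if (typeof l.resetStates === 'function') l.resetStates();
  });

  const seedT = tf.tensor2d([[seed]], [1, 1], 'int32');
  const steps: tf.Tensor[] = [];
  let current: tf.Tensor = seedT;
  for (let i = 0; i < length; i++) {
    // The argMax result is fed straight back in as the next input, so the CPU
    // never waits on the GPU inside the loop.
    const next = tf.tidy(() => {
      const probs = model.predict(current) as tf.Tensor;
      return probs.argMax(-1).reshape([1, 1]).cast('int32');
    });
    steps.push(next);
    current = next;
  }

  const stacked = tf.concat(steps, 0);
  const out = (await stacked.data()) as Int32Array;  // single GPU -> CPU readback
  tf.dispose([seedT, stacked, ...steps]);
  return out;
}
```

The buildStatefulModel and generate names here are hypothetical; the point is the stateful: true option plus deferring the readback to a single .data() call after the loop.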

For reference, a Brain.js model of comparable size is able to predict an entire sequence of any length nearly instantaneously - via CPU, no less! Is there any way we could integrate such optimizations here?

Any advice would be greatly appreciated.
