I have a simple, stacked RNN which predicts text in a loop, character by character. Here is a simplified version of that code.
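(The snippet below is an illustrative sketch of that loop rather than my exact code; the layer sizes, vocabulary size, and window length are placeholder assumptions.)

```js
const tf = require('@tensorflow/tfjs-node-gpu');

// Placeholder sizes; the real vocabulary and window length differ.
const VOCAB_SIZE = 64;
const SEQ_LEN = 40;

// Stacked character-level RNN; `units` controls the per-layer sizes
// reported in the table below (e.g. [16, 16] or [256, 256, 256]).
function buildModel(units) {
  const model = tf.sequential();
  units.forEach((n, i) => {
    model.add(tf.layers.lstm({
      units: n,
      returnSequences: i < units.length - 1,
      inputShape: i === 0 ? [null, VOCAB_SIZE] : undefined,
    }));
  });
  model.add(tf.layers.dense({ units: VOCAB_SIZE, activation: 'softmax' }));
  return model;
}

// Generation loop: one .predict() call per character, which is where
// the fixed per-call latency accumulates.
function generate(model, seedIndices, numChars) {
  const indices = seedIndices.slice();
  for (let i = 0; i < numChars; i++) {
    const next = tf.tidy(() => {
      const window = indices.slice(-SEQ_LEN);
      const x = tf.oneHot(tf.tensor1d(window, 'int32'), VOCAB_SIZE).expandDims(0);
      const probs = model.predict(x);                 // ~150ms per layer, per call
      return probs.squeeze().argMax().dataSync()[0];  // greedy sampling, for simplicity
    });
    indices.push(next);
  }
  return indices.slice(seedIndices.length);
}
```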
No matter the size of my model, there is a constant ~150ms of latency per layer associated with each prediction. For reference:

| layers | latency |
| --- | --- |
| `[16]` | 150ms/token |
| `[256]` | 150ms/token |
| `[16, 16]` | 300ms/token |
| `[256, 256]` | 300ms/token |
| `[16, 16, 16]` | 450ms/token |
| `[256, 256, 256]` | 450ms/token |
Currently, I'm running this code in Node.js (on GPU), but I can confirm that the latency persists in WebGL as well.
Is there anything we can do to speed up predictions here? Text generation is unbearably slow, to the point where TFJS is barely even useful for my task. By contrast, training is fast, even with big batches and many layers! Clearly, the issue comes from repeated calls to .predict() and the overhead associated with each call. Is there any way to move this computation into the model and return an entire sequence from a single prediction, rather than generating it token by token in a loop?
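Roughly, the kind of API I have in mind is sketched below; `generateSequence` is purely hypothetical (nothing like it exists in TFJS today), but it illustrates returning a whole sequence from a single call:

```js
// Purely hypothetical API, for illustration only (not part of TFJS).
// The token-by-token loop would run inside the backend, so the fixed
// per-.predict() overhead is paid once instead of once per character.
const generated = await generateSequence(model, {
  seed: seedIndices,  // starting character indices
  length: 500,        // number of characters to generate
  sampling: 'argmax'  // or temperature-based sampling
});
```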
For reference, a Brain.js model of comparable size is able to predict an entire sequence of any length nearly instantaneously, on the CPU no less! Is there any way we could integrate such optimizations here?
Any advice would be greatly appreciated.