Performances #8

Open
astariul opened this issue Dec 27, 2019 · 3 comments

Comments

@astariul

Thanks for open-sourcing the code!

This approach is very interesting, but I'm curious about the impact on performance (inference speed).

Is there any benchmark showing the performance impact with different parameters?

@dathath
Contributor

dathath commented Dec 27, 2019

Thanks for the question! We have not run a detailed analysis on the inference speed, but it is slower than normal inference because of the gradient-based updates to the activations. We are working on an extension that alleviates some of this, but it does get slower as the number of gradient updates increases.
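A rough way to see the cost is to time generation with and without extra gradient steps per token. The sketch below is not the PPLM code itself (PPLM perturbs the past key/value activations rather than the parameters); the extra forward+backward loop is just a stand-in for the per-token gradient updates, and it assumes a recent version of the transformers library with the small "gpt2" checkpoint as an example.

```python
# Timing sketch: each extra "gradient update" per token costs roughly one more
# forward + backward pass, so generation slows with num_grad_steps.
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()

input_ids = tokenizer("The food tastes", return_tensors="pt").input_ids.to(device)

def seconds_per_token(num_grad_steps, new_tokens=20):
    ids = input_ids.clone()
    start = time.time()
    for _ in range(new_tokens):
        # Plain forward pass for the next-token distribution.
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        # Stand-in for the per-token gradient updates: one extra
        # forward + backward pass per iteration.
        for _ in range(num_grad_steps):
            out = model(ids).logits[:, -1, :]
            out.sum().backward()
            model.zero_grad()
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return (time.time() - start) / new_tokens

print("baseline  :", seconds_per_token(0), "s/token")
print("3 updates :", seconds_per_token(3), "s/token")
```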

@erik-dunteman

(not an issue or resolution, just a note)

I'm also super grateful you've open-sourced this! It's a very creative approach to perturb the past and rerun iteratively.

I've productionized this, so I figured I'd share some learnings:

  • Because of the need to record and backpropagate gradients, it isn't possible (as far as I know) to serve inference from a web-based serving engine such as TensorFlow Serving: the .backward() call passes through the model itself, and gradient taping doesn't work across distributed system calls such as REST or gRPC.
  • It must be run directly as-is, loading the model into the application and calling it with model().
  • On a g4dn.2xlarge instance I have an XL GPT-2 model running, wrapped in a Flask server. With that setup and num_iterations=3, I'm averaging about 0.6s per word. This drops with fewer iterations, and drops dramatically when serving a smaller model.
  • Because of the model's size in memory, it isn't feasible to run multiple Flask workers: each worker loads a redundant copy of the model, so memory use grows in proportion to the number of workers.

In short, running this setup in production is tough; you can get decent speeds (5+ words per second with smaller GPT-2 models on a GPU), but concurrent calls will queue since the Flask server only has one worker.
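For what it's worth, a minimal sketch of that single-worker setup looks roughly like the following. The checkpoint ("gpt2-xl"), route name, and plain sampling call are just illustrative; the PPLM gradient updates would wrap the generation step.

```python
# Minimal single-worker Flask wrapper: the model is loaded once at startup,
# and every additional worker would duplicate it in memory.
import torch
from flask import Flask, jsonify, request
from transformers import GPT2LMHeadModel, GPT2Tokenizer

app = Flask(__name__)

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device)
model.eval()

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Plain sampling shown here; the iterative perturbation would go around this call.
    with torch.no_grad():
        out = model.generate(ids, max_length=ids.shape[1] + 30, do_sample=True)
    return jsonify({"text": tokenizer.decode(out[0], skip_special_tokens=True)})

if __name__ == "__main__":
    # Single process, single thread: concurrent requests queue behind each other.
    app.run(host="0.0.0.0", port=5000, threaded=False)
```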

@erik-dunteman

To directly answer the question: if I understand this code correctly, the performance cost is (1 + num_iterations) times that of simply calling the model as-is. That assumes, for simplicity, that the model's forward pass accounts for essentially all of the inference time.
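As a back-of-the-envelope check (the per-pass latency below is made up purely for illustration), that estimate is roughly consistent with the 0.6s/word figure above:

```python
# If one forward pass took ~0.15s (illustrative, not a measurement), then with
# num_iterations=3 the (1 + num_iterations) estimate gives ~0.6s per token.
base_forward = 0.15
num_iterations = 3
print(f"~{base_forward * (1 + num_iterations):.2f}s per token")
```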
