
Long Model Loading times in Multimodel Server #113

Open

AlexRaschl opened this issue Sep 26, 2022 · 0 comments
Describe the bug
According to the SageMaker Multi Model Server documentation, the server caches 'frequently' used models in memory (to my understanding, in RAM) in order to reduce response time by avoiding repeated model loads.
My first question is: what does 'frequently' mean here?

If I query the same model repeatedly with a delay of 30s between the invoke_endpoint calls, the server seems to reload the model into memory each time, leading to response times of ~3s instead of the usual ~0.5s observed when the calls are less than 30s apart.

To reproduce

  • Deploy a SageMaker multi-model endpoint using boto3 (a hedged deployment sketch follows after the snippet below)
  • Create a SageMaker runtime client using boto3 and execute the following code:
import time
import boto3

rt_client = boto3.client('sagemaker-runtime')

for i in range(20):
    start = time.time()
    response = rt_client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        ContentType='application/x-npy',
        TargetModel='model_store/custom_model_1.tar.gz',  # always the same model
        Body=payload,  # byte-encoded numpy array
    )
    end = time.time()
    response_time = end - start
    print(f'Request took {response_time}s.')
    time.sleep(30)
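
For reference, a minimal sketch of the deployment step (first bullet above), assuming a custom multi-model container image already pushed to ECR; the image URI, role ARN, S3 prefix and resource names below are placeholders, not values from this report:

import boto3

sm_client = boto3.client('sagemaker')

# Placeholder values -- substitute your own image URI, role ARN and S3 prefix.
image_uri = '<account>.dkr.ecr.<region>.amazonaws.com/custom-sklearn-mme:latest'
role_arn = 'arn:aws:iam::<account>:role/SageMakerExecutionRole'
model_data_prefix = 's3://<bucket>/model_store/'  # prefix holding custom_model_*.tar.gz archives

# A model in 'MultiModel' mode points at an S3 prefix rather than a single archive.
sm_client.create_model(
    ModelName='custom-mme-model',
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        'Image': image_uri,
        'Mode': 'MultiModel',
        'ModelDataUrl': model_data_prefix,
    },
)

sm_client.create_endpoint_config(
    EndpointConfigName='custom-mme-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'custom-mme-model',
        'InstanceType': 'ml.t2.medium',
        'InitialInstanceCount': 1,
    }],
)

sm_client.create_endpoint(
    EndpointName='custom-mme-endpoint',
    EndpointConfigName='custom-mme-config',
)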

Expected behavior
The first call is slow (about 3s) and the following 19 calls lie in the expected ~0.5s range, which is the time a call takes when the model is already loaded.

Once I set the time.sleep() argument lower than 30s, e.g. to 20s, the calls are as fast as expected most of the time.

Is there any way to influence the timing of the unloading behavior?
My understanding is that a model should stay in memory as long as that memory is not needed to load other, more frequently used models. However, this does not seem to be the case here, as each call takes the full ~3s.
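
As a client-side stopgap, a minimal keep-alive sketch, assuming (based only on the timings below, not on any documented behavior) that invoking the model more often than every ~30s keeps it loaded; the endpoint name, target model and payload are placeholders:

import threading
import boto3

rt_client = boto3.client('sagemaker-runtime')

def keep_model_warm(endpoint_name, target_model, payload, interval_s=20, stop_event=None):
    # Periodically invoke the model so that it is (hopefully) not unloaded.
    # Assumption: invocations spaced less than ~30s apart keep the model cached,
    # as the measurements in this issue suggest.
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        rt_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/x-npy',
            TargetModel=target_model,
            Body=payload,
        )
        stop_event.wait(interval_s)

# Usage (placeholder names):
# stop = threading.Event()
# threading.Thread(
#     target=keep_model_warm,
#     args=('custom-mme-endpoint', 'model_store/custom_model_1.tar.gz', payload),
#     kwargs={'stop_event': stop},
#     daemon=True,
# ).start()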

Screenshots or logs
With time.sleep(30):

	 Call: 0 of 20 with 4 samples took: 2.847299098968506s.
	 Call: 1 of 20 with 4 samples took: 3.017570734024048s.
	 Call: 2 of 20 with 4 samples took: 2.866020917892456s.
	 Call: 3 of 20 with 4 samples took: 2.888610363006592s.
	 Call: 4 of 20 with 4 samples took: 3.0125389099121094s.
	 Call: 5 of 20 with 4 samples took: 2.9569602012634277s.
	 Call: 6 of 20 with 4 samples took: 2.8126561641693115s.
	 Call: 7 of 20 with 4 samples took: 2.912917375564575s.
	 Call: 8 of 20 with 4 samples took: 2.866114854812622s.
	 Call: 9 of 20 with 4 samples took: 2.9781384468078613s.
	 Call: 10 of 20 with 4 samples took: 3.4418649673461914s.
	 Call: 11 of 20 with 4 samples took: 2.79472017288208s.
	 Call: 12 of 20 with 4 samples took: 2.992703437805176s.
	 Call: 13 of 20 with 4 samples took: 2.954014301300049s.
	 Call: 14 of 20 with 4 samples took: 2.9481523036956787s.
	 Call: 15 of 20 with 4 samples took: 2.928661346435547s.
	 Call: 16 of 20 with 4 samples took: 2.8345978260040283s.
	 Call: 17 of 20 with 4 samples took: 2.922405481338501s.
	 Call: 18 of 20 with 4 samples took: 2.982257843017578s.
	 Call: 19 of 20 with 4 samples took: 2.8227620124816895s.

With time.sleep(20):

	 Call: 0 of 20 with 4 samples took: 3.329136848449707s.
	 Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
	 Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
	 Call: 3 of 20 with 4 samples took: 0.5578911304473877s.
	 Call: 4 of 20 with 4 samples took: 0.5557725429534912s.
	 Call: 5 of 20 with 4 samples took: 0.5681345462799072s.
	 Call: 6 of 20 with 4 samples took: 0.5488979816436768s.
	 Call: 7 of 20 with 4 samples took: 0.5555169582366943s.
	 Call: 8 of 20 with 4 samples took: 0.5792186260223389s.
	 Call: 9 of 20 with 4 samples took: 0.9297688007354736s.
	 Call: 10 of 20 with 4 samples took: 0.6043572425842285s.
	 Call: 11 of 20 with 4 samples took: 0.572312593460083s.
	 Call: 12 of 20 with 4 samples took: 0.5600907802581787s.
	 Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
	 Call: 14 of 20 with 4 samples took: 0.5780775547027588s.
	 Call: 15 of 20 with 4 samples took: 0.5762953758239746s.
	 Call: 16 of 20 with 4 samples took: 0.5773897171020508s.
	 Call: 17 of 20 with 4 samples took: 0.5769815444946289s.
	 Call: 18 of 20 with 4 samples took: 0.5663411617279053s.
	 Call: 19 of 20 with 4 samples took: 0.579679012298584s.

System information

  • Custom Docker Image:
    • Inference Framework: SkLearn
    • Sagemaker Inference Toolkit: 1.6.1
    • Multimodel Server: 1.1.8
    • Python version: 3.9
    • Processing unit type: CPU (ml.t2.medium)