
Long Model Loading times in Multimodel Server #113

Open

AlexRaschl opened this issue Sep 26, 2022 · 0 comments
Describe the bug
According to the SageMaker Multi Model Server documentation, the server caches 'frequently' used models in memory (to my understanding, in RAM) in order to reduce response time by avoiding repeated model loads.
My first question is: what does 'frequently' mean here?

If I query the same model repeatedly with a delay of 30s between the invoke_endpoint calls, the server seems to reload the model into memory each time, leading to response times of ~3s instead of the usual ~0.5s observed when the calls are less than 30s apart.

To reproduce

  • Deploy a SageMaker multi-model endpoint using boto3 (a hedged deployment sketch follows after the snippet below)
  • Create a SageMaker runtime client using boto3 and execute the following code:
import time
import boto3

rt_client = boto3.client('sagemaker-runtime')

for i in range(20):
    start = time.time()
    response = rt_client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        ContentType='application/x-npy',
        TargetModel='model_store/custom_model_1.tar.gz',  # always the same model
        Body=payload,  # byte-encoded numpy array
    )
    end = time.time()
    response_time = end - start
    print(f'Request took {response_time}s.')
    time.sleep(30)
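
For reference, a minimal sketch of the deployment step (first bullet above), assuming a custom multi-model container image already pushed to ECR; the image URI, role ARN, S3 prefix and resource names below are placeholders, not values from this report:

import boto3

sm_client = boto3.client('sagemaker')

# Placeholder values -- substitute your own image URI, role ARN and S3 prefix.
image_uri = '<account>.dkr.ecr.<region>.amazonaws.com/custom-sklearn-mme:latest'
role_arn = 'arn:aws:iam::<account>:role/SageMakerExecutionRole'
model_data_prefix = 's3://<bucket>/model_store/'  # prefix holding custom_model_*.tar.gz archives

# A model in 'MultiModel' mode points at an S3 prefix rather than a single archive.
sm_client.create_model(
    ModelName='custom-mme-model',
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        'Image': image_uri,
        'Mode': 'MultiModel',
        'ModelDataUrl': model_data_prefix,
    },
)

sm_client.create_endpoint_config(
    EndpointConfigName='custom-mme-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'custom-mme-model',
        'InstanceType': 'ml.t2.medium',
        'InitialInstanceCount': 1,
    }],
)

sm_client.create_endpoint(
    EndpointName='custom-mme-endpoint',
    EndpointConfigName='custom-mme-config',
)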

Expected behavior
The first call is slow (about 3s) and the following 19 calls lie in the expected ~0.5s range, which is the time a call takes when the model is already loaded.

Once I set the time.sleep() argument lower than 30s, e.g. to 20s, the calls are as fast as expected most of the time.

Is there any way to influence the timing of the unloading behavior?
My understanding is that a model should stay in memory as long as that memory is not needed to load other, more frequently used models. However, this does not seem to be the case here, as each call takes the full ~3s.
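
As a client-side stopgap, a minimal keep-alive sketch, assuming (based only on the timings below, not on any documented behavior) that invoking the model more often than every ~30s keeps it loaded; the endpoint name, target model and payload are placeholders:

import threading
import boto3

rt_client = boto3.client('sagemaker-runtime')

def keep_model_warm(endpoint_name, target_model, payload, interval_s=20, stop_event=None):
    # Periodically invoke the model so that it is (hopefully) not unloaded.
    # Assumption: invocations spaced less than ~30s apart keep the model cached,
    # as the measurements in this issue suggest.
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        rt_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='application/x-npy',
            TargetModel=target_model,
            Body=payload,
        )
        stop_event.wait(interval_s)

# Usage (placeholder names):
# stop = threading.Event()
# threading.Thread(
#     target=keep_model_warm,
#     args=('custom-mme-endpoint', 'model_store/custom_model_1.tar.gz', payload),
#     kwargs={'stop_event': stop},
#     daemon=True,
# ).start()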

Screenshots or logs
With time.sleep(30):

	 Call: 0 of 20 with 4 samples took: 2.847299098968506s.
	 Call: 1 of 20 with 4 samples took: 3.017570734024048s.
	 Call: 2 of 20 with 4 samples took: 2.866020917892456s.
	 Call: 3 of 20 with 4 samples took: 2.888610363006592s.
	 Call: 4 of 20 with 4 samples took: 3.0125389099121094s.
	 Call: 5 of 20 with 4 samples took: 2.9569602012634277s.
	 Call: 6 of 20 with 4 samples took: 2.8126561641693115s.
	 Call: 7 of 20 with 4 samples took: 2.912917375564575s.
	 Call: 8 of 20 with 4 samples took: 2.866114854812622s.
	 Call: 9 of 20 with 4 samples took: 2.9781384468078613s.
	 Call: 10 of 20 with 4 samples took: 3.4418649673461914s.
	 Call: 11 of 20 with 4 samples took: 2.79472017288208s.
	 Call: 12 of 20 with 4 samples took: 2.992703437805176s.
	 Call: 13 of 20 with 4 samples took: 2.954014301300049s.
	 Call: 14 of 20 with 4 samples took: 2.9481523036956787s.
	 Call: 15 of 20 with 4 samples took: 2.928661346435547s.
	 Call: 16 of 20 with 4 samples took: 2.8345978260040283s.
	 Call: 17 of 20 with 4 samples took: 2.922405481338501s.
	 Call: 18 of 20 with 4 samples took: 2.982257843017578s.
	 Call: 19 of 20 with 4 samples took: 2.8227620124816895s.

With time.sleep(20):

	 Call: 0 of 20 with 4 samples took: 3.329136848449707s.
	 Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
	 Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
	 Call: 3 of 20 with 4 samples took: 0.5578911304473877s.
	 Call: 4 of 20 with 4 samples took: 0.5557725429534912s.
	 Call: 5 of 20 with 4 samples took: 0.5681345462799072s.
	 Call: 6 of 20 with 4 samples took: 0.5488979816436768s.
	 Call: 7 of 20 with 4 samples took: 0.5555169582366943s.
	 Call: 8 of 20 with 4 samples took: 0.5792186260223389s.
	 Call: 9 of 20 with 4 samples took: 0.9297688007354736s.
	 Call: 10 of 20 with 4 samples took: 0.6043572425842285s.
	 Call: 11 of 20 with 4 samples took: 0.572312593460083s.
	 Call: 12 of 20 with 4 samples took: 0.5600907802581787s.
	 Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
	 Call: 14 of 20 with 4 samples took: 0.5780775547027588s.
	 Call: 15 of 20 with 4 samples took: 0.5762953758239746s.
	 Call: 16 of 20 with 4 samples took: 0.5773897171020508s.
	 Call: 17 of 20 with 4 samples took: 0.5769815444946289s.
	 Call: 18 of 20 with 4 samples took: 0.5663411617279053s.
	 Call: 19 of 20 with 4 samples took: 0.579679012298584s.

System information

  • Custom Docker Image:
    • Inference Framework: SkLearn
    • Sagemaker Inference Toolkit: 1.6.1
    • Multimodel Server: 1.1.8
    • Python version: 3.9
    • Processing unit type: CPU (ml.t2.medium)