
Permission denied when downloading or serving any model. #112

Open
Anindyadeep opened this issue Jan 6, 2024 · 0 comments

Anindyadeep commented Jan 6, 2024

Hello,

Thanks for the awesome implementation. However, I am running into several problems and am not able to run a model successfully. Here is my full reproduction procedure and the corresponding issues I ran into.

Reproduction procedure

I started by cloning the repo. The first issue I ran into was with the docker command. The original command was:

docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:~/data anyscale/ray-llm:latest bash

Here is the error I got:

docker: Error response from daemon: invalid volume specification: '/home/paperspace/.cache:~/data': invalid mount config for type "bind": invalid mount path: '~/data' mount path must be absolute.
See 'docker run --help'.

Here is the quick fix I applied:

docker run -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir/data anyscale/ray-llm:latest bash
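
In hindsight, the root cause seems to be that Docker does not expand ~ in the container-side path of a -v mount, so the target must be absolute. Note that my quick fix above drops the colon, which makes Docker treat $cache_dir/data as a lone container path and create an anonymous volume there, so the host cache is not actually mounted at all. A cleaner alternative (untested; I am assuming /home/ray is the image user's home, which matches the shell prompt and the tracebacks below) would be:

# Untested sketch: bind-mount the host cache at an absolute container path
# and point HF_HOME at that same path.
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
  -e HF_HOME=/home/ray/data \
  -v $cache_dir:/home/ray/data \
  anyscale/ray-llm:latest bash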

This ran, and it dropped me into a bash shell inside the container. However, I could not find the serve_configs folder; these were the files present:

(base) ray@cc135fcd33a8:~$ ls
anaconda3            dist    pip-freeze.txt             requirements_compiled_py37.txt
configure-bashrc.sh  models  requirements_compiled.txt  run-vscode-server.sh

So, I cloned ray-llm inside the container, and then I tried to run the server with this command:

serve run serve_configs/amazon--LightGPT.yaml 

This produced the following error:

(ServeController pid=589) INFO 2024-01-06 09:35:39,656 controller 589 deployment_state.py:1679 - Adding 2 replicas to deployment Router in application 'ray-llm'.
(autoscaler +12s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +12s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster to resolve this issue.
(ServeController pid=589) WARNING 2024-01-06 09:36:09,717 controller 589 deployment_state.py:1987 - Deployment 'VLLMDeployment:amazon--LightGPT' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"CPU": 1.0, "accelerator_type_a10": 0.01}, {"GPU": 1.0, "CPU": 8.0, "accelerator_type_a10": 0.01}], total resources available: {}. Use `ray status` for more details.
(ServeReplica:ray-llm:Router pid=708) There was a problem when trying to write in your cache folder (/home/paperspace/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ServeReplica:ray-llm:Router pid=708) [WARNING 2024-01-06 09:35:42,961] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 2x across cluster]
(autoscaler +47s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 9.0, 'accelerator_type_a10': 0.02, 'GPU': 1.0}). Add suitable node types to this cluster to resolve this issue.
(ServeController pid=589) WARNING 2024-01-06 09:36:39,753 controller 589 deployment_state.py:1987 - Deployment 'VLLMDeployment:amazon--LightGPT' in application 'ray-llm' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"CPU": 1.0, "accelerator_type_a10": 0.01}, {"GPU": 1.0, "CPU": 8.0, "accelerator_type_a10": 0.01}], total resources available: {}. Use `ray status` for more details.

A quick Google search led me to issue #101, where I found a quick fix. I edited some parts based on my device configuration and then ran this command:

ray start --head --dashboard-host=0.0.0.0 --num-cpus 12 --num-gpus 1 --resources '{"accelerator_type_a10":1}'
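
Before retrying, ray status (which the autoscaler tip above also points to) is a quick way to confirm the head node now advertises the custom resource:

# The Resources section of the output should now list accelerator_type_a10
# alongside the 12 CPUs and 1 GPU registered above.
ray status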

After this, I ran the same serve command again:

(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:52,912] vllm_models.py: 218  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f04c6f0a850> PlacementGroupID(9e905cb0eba5cd4f2a25fea839a201000000). {'placement_group_id': '9e905cb0eba5cd4f2a25fea839a201000000', 'name': 'SERVE_REPLICA::ray-llm#VLLMDeployment:amazon--LightGPT#TSxkyG', 'bundles': {0: {'accelerator_type_a10': 0.01, 'CPU': 1.0}, 1: {'accelerator_type_a10': 0.01, 'GPU': 1.0, 'CPU': 8.0}}, 'bundles_to_node_id': {0: '5a6951a3f5dafbc1bb6ff8115ee842faf1177d5ca9a623a1708173f1', 1: '5a6951a3f5dafbc1bb6ff8115ee842faf1177d5ca9a623a1708173f1'}, 'strategy': 'STRICT_PACK', 'state': 'CREATED', 'stats': {'end_to_end_creation_latency_ms': 2.788, 'scheduling_latency_ms': 2.63, 'scheduling_attempt': 1, 'highest_retry_delay_ms': 0.0, 'scheduling_state': 'FINISHED'}}
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:52,913] vllm_models.py: 221  Using existing placement group <ray.util.placement_group.PlacementGroup object at 0x7f04c6f0a850>
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:52,913] vllm_node_initializer.py: 38  Starting initialize_node tasks on the workers and local node...
(pid=1133) There was a problem when trying to write in your cache folder (/home/paperspace/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory. [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ServeReplica:ray-llm:Router pid=1548) [WARNING 2024-01-06 09:38:53,032] api.py: 382  DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. [repeated 3x across cluster]
(ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT pid=1546) [INFO 2024-01-06 09:38:55,653] utils.py: 299  Did not receive s3_mirror_config or gcs_error_config. Not downloading model from AWS S3 or Google Cloud Storage.
(ServeController pid=1431) ERROR 2024-01-06 09:38:55,751 controller 1431 deployment_state.py:617 - Exception in replica 'ray-llm#VLLMDeployment:amazon--LightGPT#TSxkyG', the replica will be stopped.
(ServeController pid=1431) Traceback (most recent call last):
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_ready
(ServeController pid=1431)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
(ServeController pid=1431)     return fn(*args, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=1431)     return func(*args, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2547, in get
(ServeController pid=1431)     raise value.as_instanceof_cause()
(ServeController pid=1431) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT.initialize_and_get_metadata() (pid=1546, ip=172.17.0.2, actor_id=bf8c710a99c964b227d2ee6f01000000, repr=<ray.serve._private.replica.ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT object at 0x7f059a46a250>)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(ServeController pid=1431)     return self.__get_result()
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeController pid=1431)     raise self._exception
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
(ServeController pid=1431)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=1431) RuntimeError: Traceback (most recent call last):
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
(ServeController pid=1431)     await self._initialize_replica()
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
(ServeController pid=1431)     await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/server/vllm/vllm_deployment.py", line 37, in __init__
(ServeController pid=1431)     await self.engine.start()
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_engine.py", line 78, in start
(ServeController pid=1431)     pg, runtime_env = await self.node_initializer.initialize_node(self.llm_app)
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_node_initializer.py", line 52, in initialize_node
(ServeController pid=1431)     await self._initialize_local_node(engine_config)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
(ServeController pid=1431)     result = self.fn(*self.args, **self.kwargs)
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_node_initializer.py", line 72, in _initialize_local_node
(ServeController pid=1431)     _ = AutoTokenizer.from_pretrained(engine_config.actual_hf_model_id)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 701, in from_pretrained
(ServeController pid=1431)     tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 534, in get_tokenizer_config
(ServeController pid=1431)     resolved_config_file = cached_file(
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/hub.py", line 429, in cached_file
(ServeController pid=1431)     resolved_file = hf_hub_download(
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
(ServeController pid=1431)     return fn(*args, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1195, in hf_hub_download
(ServeController pid=1431)     os.makedirs(storage_folder, exist_ok=True)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=1431)     makedirs(head, exist_ok=exist_ok)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=1431)     makedirs(head, exist_ok=exist_ok)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 225, in makedirs
(ServeController pid=1431)     mkdir(name, mode)
(ServeController pid=1431) PermissionError: [Errno 13] Permission denied: '/home/paperspace/data'
(ServeController pid=1431) INFO 2024-01-06 09:38:55,861 controller 1431 deployment_state.py:2027 - Replica ray-llm#VLLMDeployment:amazon--LightGPT#TSxkyG is stopped.
(ServeController pid=1431) INFO 2024-01-06 09:38:55,861 controller 1431 deployment_state.py:1679 - Adding 1 replica to deployment VLLMDeployment:amazon--LightGPT in application 'ray-llm'.
(ServeController pid=1431) ERROR 2024-01-06 09:38:59,054 controller 1431 deployment_state.py:617 - Exception in replica 'ray-llm#VLLMDeployment:amazon--LightGPT#Oywlbj', the replica will be stopped.
(ServeController pid=1431) Traceback (most recent call last):
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_ready
(ServeController pid=1431)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
(ServeController pid=1431)     return fn(*args, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=1431)     return func(*args, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2547, in get
(ServeController pid=1431)     raise value.as_instanceof_cause()
(ServeController pid=1431) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT.initialize_and_get_metadata() (pid=1703, ip=172.17.0.2, actor_id=389a40046d7c102bd4b6d1c101000000, repr=<ray.serve._private.replica.ServeReplica:ray-llm:VLLMDeployment:amazon--LightGPT object at 0x7f290c1de250>)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(ServeController pid=1431)     return self.__get_result()
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeController pid=1431)     raise self._exception
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
(ServeController pid=1431)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=1431) RuntimeError: Traceback (most recent call last):
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
(ServeController pid=1431)     await self._initialize_replica()
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
(ServeController pid=1431)     await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/server/vllm/vllm_deployment.py", line 37, in __init__
(ServeController pid=1431)     await self.engine.start()
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_engine.py", line 78, in start
(ServeController pid=1431)     pg, runtime_env = await self.node_initializer.initialize_node(self.llm_app)
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_node_initializer.py", line 52, in initialize_node
(ServeController pid=1431)     await self._initialize_local_node(engine_config)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
(ServeController pid=1431)     result = self.fn(*self.args, **self.kwargs)
(ServeController pid=1431)   File "/home/ray/ray-llm/rayllm/backend/llm/vllm/vllm_node_initializer.py", line 72, in _initialize_local_node
(ServeController pid=1431)     _ = AutoTokenizer.from_pretrained(engine_config.actual_hf_model_id)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 701, in from_pretrained
(ServeController pid=1431)     tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 534, in get_tokenizer_config
(ServeController pid=1431)     resolved_config_file = cached_file(
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/utils/hub.py", line 429, in cached_file
(ServeController pid=1431)     resolved_file = hf_hub_download(
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
(ServeController pid=1431)     return fn(*args, **kwargs)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1195, in hf_hub_download
(ServeController pid=1431)     os.makedirs(storage_folder, exist_ok=True)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=1431)     makedirs(head, exist_ok=exist_ok)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=1431)     makedirs(head, exist_ok=exist_ok)
(ServeController pid=1431)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 225, in makedirs
(ServeController pid=1431)     mkdir(name, mode)
(ServeController pid=1431) PermissionError: [Errno 13] Permission denied: '/home/paperspace/data'

Specifically, the permission-denied error points at '/home/paperspace/data'. I tried changing the cache folder to a different one, but I got the same error.
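
My best guess at the root cause (unverified): the host shell tilde-expanded -e HF_HOME=~/data before docker ever saw it, so inside the container HF_HOME points to /home/paperspace/data, a path that does not exist in the image and that the container's ray user is not allowed to create under /home. If that is right, a workaround (untested) along the lines of the TRANSFORMERS_CACHE warning in the logs would be:

# Untested workaround, run inside the container: redirect the HF caches to a
# directory the ray user owns, then relaunch the server.
export HF_HOME=/home/ray/data
export TRANSFORMERS_CACHE=/home/ray/data/hub
mkdir -p "$TRANSFORMERS_CACHE"
serve run serve_configs/amazon--LightGPT.yaml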

Additionally, after the permission-denied error above was printed, I got further error logs very similar to those in issue #55.

alanwguo pushed a commit that referenced this issue on Jan 25, 2024:

Fixes #9

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>