
Rclone unable to access remote directory #1134

Open
willamloo3192 opened this issue Feb 28, 2024 · 24 comments

@willamloo3192
Command: rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P

Error Message:
2024-02-28 13:54:17 ERROR : : error reading source directory: directory not found
2024-02-28 13:54:17 ERROR : Attempt 1/3 failed with 1 errors and: directory not found
2024-02-28 13:54:18 ERROR : : error reading source directory: directory not found
2024-02-28 13:54:18 ERROR : Attempt 2/3 failed with 1 errors and: directory not found
2024-02-28 13:54:18 ERROR : : error reading source directory: directory not found
2024-02-28 13:54:18 ERROR : Attempt 3/3 failed with 1 errors and: directory not found
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors: 1 (retrying may help)
Elapsed time: 1.8s
2024/02/28 13:54:18 Failed to copy: directory not found

@gfursin
Contributor

gfursin commented Feb 28, 2024

Thank you for reporting, @willamloo3192. There are multiple issues with the MLCommons cloud at the moment. I believe we have an alternative way to download models. Let me sync with @arjunsuresh today.

@gfursin
Contributor

gfursin commented Feb 28, 2024

Hi again @willamloo3192 - actually, you can't use the rclone command like that: you need an rclone config that maps "mlc-inference" to a URL in the MLCommons cloud. CM generates such a config on the fly, but we still need to fix the previous problem. I hope to provide fixes today ... Once again, thank you for reporting!

Related: #1136
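
As a quick first check (an editorial note, not from the thread), rclone can list the remotes that are actually configured; "mlc-inference:" should appear once the config step has been run:

# Prints the names of all configured remotes
rclone listremotes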

@gfursin
Contributor

gfursin commented Feb 29, 2024

For rclone to work without CM, you first need to run this command to set up the rclone keys:
https://github.com/mlcommons/ck/blob/master/cm-mlops/script/get-ml-model-stable-diffusion/_cm.json#L159
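
For reference, the command behind that link (it also appears verbatim later in this thread) registers the remote:

# Registers the "mlc-inference" remote against the MLCommons Cloudflare R2 endpoint
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com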

@willamloo3192
Author

Hi @gfursin

I tried this before, but I am still unable to make it work.
user@host:~$ rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
Remote config

[mlc-inference]
provider=Cloudflare = access_key_id=f65ba5eef400db161ea49967de89f47b
secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b = endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

user@host:~$ rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P
2024-03-04 10:41:48 ERROR : : error reading source directory: directory not found
2024-03-04 10:41:48 ERROR : Attempt 1/3 failed with 1 errors and: directory not found
2024-03-04 10:41:48 ERROR : : error reading source directory: directory not found
2024-03-04 10:41:48 ERROR : Attempt 2/3 failed with 1 errors and: directory not found
2024-03-04 10:41:48 ERROR : : error reading source directory: directory not found
2024-03-04 10:41:48 ERROR : Attempt 3/3 failed with 1 errors and: directory not found
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors: 1 (retrying may help)
Elapsed time: 2.4s
2024/03/04 10:41:48 Failed to copy: directory not found

@gfursin
Contributor

gfursin commented Mar 4, 2024

I have a feeling that it is still related to your use of a PROXY, which rclone may not support (or may need explicit flags for). I am checking with @arjunsuresh; we will see whether we can either emulate an environment that reaches the internet through a proxy, or otherwise provide a few possible solutions.

It seems that each tool handles proxies differently. I see in the rclone docs that the following variable may need to be set to use a proxy:

set HTTP_PROXY=...

Do you always have env variables set in your environment that point to the proxy server? Is it a full URL with a port? Is it different for HTTP and HTTPS? I am just trying to see whether we can add something like --proxy=yes in CM and then internally map some of your environment variables to the flags of the tools wrapped by CM ...
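
For context, rclone is a Go program and follows the standard Go proxy environment variables, so on Linux the setup usually looks like the sketch below (the proxy host and port are hypothetical placeholders):

# Replace with your company's proxy URL and port
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
# Hosts that should bypass the proxy
export NO_PROXY=localhost,127.0.0.1

rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P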

@gfursin
Contributor

gfursin commented Mar 4, 2024

@arjunsuresh - let's sync there too ...

@willamloo3192
Author

> I have a feeling that it is still related to your use of a PROXY [...] Do you always have env variables set in your environment that point to the proxy server? Is it a full URL with a port? Is it different for HTTP and HTTPS?

To answer this: we have set the proxy variables both capitalized and lowercase. The outcome is still the same.

@arjunsuresh
Contributor

I believe the config should be as follows:

$ cat ~/.config/rclone/rclone.conf
[mlc-inference]
type = s3
provider = Cloudflare
access_key_id = f65ba5eef400db161ea49967de89f47b
secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b
endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
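
As a quick sanity check (a suggestion, not from the original thread), listing the bucket should succeed once the config is parsed correctly:

# Lists top-level directories in the bucket; stable_diffusion_fp32 should appear
rclone lsd mlc-inference:mlcommons-inference-wg-public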

@willamloo3192
Author

> I believe the config should be as follows:
>
> $ cat ~/.config/rclone/rclone.conf
> [mlc-inference]
> type = s3
> provider = Cloudflare
> access_key_id = f65ba5eef400db161ea49967de89f47b
> secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b
> endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

I had a quick check of my config file and found some divergence from what you provided:

[mlc-inference]
type = s3
`provider=Cloudflare` = access_key_id=f65ba5eef400db161ea49967de89f47b
`secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b` = endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

After applying your suggested config, the error changed to a TLS certificate failure:

2024/03/05 08:24:09 ERROR : S3 bucket mlcommons-inference-wg-public path stable_diffusion_fp16: error reading source root directory: RequestError: send request failed
caused by: Get "https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com/mlcommons-inference-wg-public?delimiter=%2F&encoding-type=url&list-type=2&max-keys=1000&prefix=stable_diffusion_fp16%2F": tls: failed to verify certificate: x509: certificate signed by unknown authority
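
This x509 failure usually means a corporate proxy is intercepting TLS with its own root certificate. Note that the --ftp-no-check-certificate flag suggested in the next comment only applies to rclone's FTP backend; for an S3 remote, the global options below apply (the CA file path is a hypothetical placeholder):

# Option 1: skip TLS verification entirely (insecure, but confirms the diagnosis)
rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P --no-check-certificate

# Option 2: trust the proxy's root CA explicitly
rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P --ca-cert /path/to/corporate-root-ca.pem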

@arjunsuresh
Contributor

Does adding this option help: --ftp-no-check-certificate?

@willamloo3192
Author

> --ftp-no-check-certificate

It doesn't help; the command was: rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P --ftp-no-check-certificate

@willamloo3192
Author

@arjunsuresh I tried with the --no-check-certificate flag and it seems to work, but I would like your help verifying whether the file size is correct.

user@host:~/CM/repos/local/cache/24889d8c0a934aec/inference$ rclone --no-check-certificate copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp16 ./stable_diffusion_fp16 -P
Transferred:       35.474k / 35.474 kBytes, 100%, 36.082 kBytes/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:         3.6s
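
One way to check whether the full model actually landed (an editorial suggestion, not from the thread) is to compare the remote and local sizes:

# Total size under the remote prefix (should be in the GBs for this model)
rclone size mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp16 --no-check-certificate
# Total size of the local download
rclone size ./stable_diffusion_fp16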

@arjunsuresh
Contributor

Unfortunately, I don't think so, as the files are supposed to be in the GBs. I believe this is the proxy issue. Does this link help?

@willamloo3192
Author

> Unfortunately, I don't think so, as the files are supposed to be in the GBs. I believe this is the proxy issue. Does this link help?

It doesn't help; I have already set HTTP_PROXY and HTTPS_PROXY as environment variables. I might need your assistance.

@willamloo3192
Author

willamloo3192 commented Mar 6, 2024

Anyhow, using the latest CM repo, I'm still unable to download the model.

Command: cm run script --tags=get,ml-model,sdxl,_fp32,_rclone -j

rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
[mlc-inference]
type = s3
provider = Cloudflare
access_key_id = f65ba5eef400db161ea49967de89f47b
secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b
endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com


rclone sync mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32 -P
Transferred:       35.474 KiB / 35.474 KiB, 100%, 0 B/s, ETA -
Transferred:            1 / 1, 100%
Elapsed time:         1.7s
           ! call "postprocess" from /home/user/CM/repos/mlcommons@ck/cm-mlops/script/download-file/customize.py
         ! call "postprocess" from /home/user/CM/repos/mlcommons@ck/cm-mlops/script/download-and-extract/customize.py
       ! call "postprocess" from /home/user/CM/repos/mlcommons@ck/cm-mlops/script/get-ml-model-stable-diffusion/customize.py

{
  "return": 0,
  "env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32"
  },
  "new_env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32"
  },
  "state": {},
  "new_state": {},
  "deps": [
    "download-and-extract,_rclone,_url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"
  ]
}

Stable diffusion checkpoint path: /home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32

The model on huggingface.co seems to have different filenames; I'm not sure whether my suspicion is correct.
(screenshot: file listing from huggingface.co)
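
To confirm the suspicion either way (a suggestion, not from the thread), rclone can compare the local copy against the bucket directly:

# Reports any files that are missing or differ between the bucket and the local directory
rclone check mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32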

@arjunsuresh
Contributor

> Transferred:       35.474 KiB / 35.474 KiB, 100%, 0 B/s, ETA -
> Transferred:            1 / 1, 100%

This means nothing really got downloaded. Since the rclone download is not working through your proxy outside of CM, it won't work via CM either. But we do see people using rclone behind a proxy without any special settings in some MLPerf submissions, so I'm not sure what the issue is at your end.

@willamloo3192
Author

On my side, we set the proxy via the HTTP_PROXY and HTTPS_PROXY environment variables and the apt config file; with that, we can download files via wget with the --no-check-certificate flag and install packages via apt install xxx.

For rclone, I'm kind of out of ideas.
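
When wget works behind the proxy but rclone does not, verbose logging usually shows where rclone's requests are going (a debugging sketch, not from the thread):

# -vv enables debug logging; --dump headers prints the HTTP headers rclone sends and receives,
# which reveals whether the proxy from HTTP_PROXY/HTTPS_PROXY is actually being used
rclone lsd mlc-inference:mlcommons-inference-wg-public -vv --dump headers --no-check-certificate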

@arjunsuresh
Contributor

Unfortunately, we are also not entirely sure, as we just wrap the rclone command, and we don't have an environment similar to yours to test further.

@willamloo3192
Author

Would you mind rerunning the same command and sharing the CM output with me? Thanks.

@arjunsuresh
Contributor

It is still ongoing...

[cmuser@e761b48fa277 ~]$ cm run script --tags=get,ml-model,sdxl,_fp32,_rclone -j

* cm run script "get ml-model sdxl _fp32 _rclone"
=================================================
WARNINGS:

  Required disk space: 13000 MB
=================================================

  * cm run script "download-and-extract _rclone _url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"

    * cm run script "download file _rclone _url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"

      * cm run script "detect os"
             ! cd /home/cmuser/CM/repos/local/cache/194c9e164d68412b
             ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/run.sh from tmp-run.sh
             ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/customize.py

      * cm run script "get rclone"

        * cm run script "detect os"
               ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
               ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/run.sh from tmp-run.sh
               ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/customize.py
        - Searching for versions:  == 1.65.2
                 ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
                 ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/run.sh from tmp-run.sh
/home/cmuser/.local/bin:/home/cmuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/cmuser/.local/bin
rclone was not detected
      Downloading https://downloads.rclone.org/v1.65.2/rclone-v1.65.2-linux-amd64.zip
             ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
             ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/install.sh from tmp-run.sh
--2024-03-06 03:04:18--  https://downloads.rclone.org/v1.65.2/rclone-v1.65.2-linux-amd64.zip
Resolving downloads.rclone.org (downloads.rclone.org)... 95.217.6.16, 2a01:4f9:c012:7154::1
Connecting to downloads.rclone.org (downloads.rclone.org)|95.217.6.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20348123 (19M) [application/zip]
Saving to: ‘rclone-v1.65.2-linux-amd64.zip’

rclone-v1.65.2-linux-amd64.zip                             100%[=======================================================================================================================================>]  19.41M  3.70MB/s    in 6.1s

2024-03-06 03:04:26 (3.17 MB/s) - ‘rclone-v1.65.2-linux-amd64.zip’ saved [20348123/20348123]

Archive:  rclone-v1.65.2-linux-amd64.zip
   creating: rclone-v1.65.2-linux-amd64/
  inflating: rclone-v1.65.2-linux-amd64/rclone.1
  inflating: rclone-v1.65.2-linux-amd64/README.txt
  inflating: rclone-v1.65.2-linux-amd64/README.html
  inflating: rclone-v1.65.2-linux-amd64/git-log.txt
  inflating: rclone-v1.65.2-linux-amd64/rclone
             ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
             ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/run.sh from tmp-run.sh
/home/cmuser/CM/repos/local/cache/8d72574a4a69426f:/home/cmuser/.local/bin:/home/cmuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/cmuser/.local/bin
             ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/customize.py
          Detected version: 1.65.2

Downloading from mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32
           ! cd /home/cmuser/CM/repos/local/cache/194c9e164d68412b
           ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/download-file/run.sh from tmp-run.sh

rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
2024/03/06 03:04:27 NOTICE: Config file "/home/cmuser/.config/rclone/rclone.conf" not found - using defaults
[mlc-inference]
type = s3
endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
provider = Cloudflare
access_key_id = f65ba5eef400db161ea49967de89f47b
secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b


rclone sync mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32 -P
Transferred:   	    1.708 GiB / 12.926 GiB, 13%, 72.797 MiB/s, ETA 2m37s
Transferred:           17 / 19, 89%
Elapsed time:        25.2s
Transferring:
 * checkpoint_pipe/unet/d…orch_model.safetensors:  6% /9.565Gi, 29.152Mi/s, 5m15s
 * checkpoint_pipe/text_e…er_2/model.safetensors: 13% /2.588Gi, 21.794Mi/s, 1m45s

@willamloo3192
Author

Awesome! Would you mind sharing your environment variables with me? I just want to make an apples-to-apples comparison.

@arjunsuresh
Contributor

I'm not using any proxy. It is actually a clean Docker container running RHEL 8.

@arjunsuresh
Contributor

The final output:

rclone sync mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32 -P
Transferred:   	   12.926 GiB / 12.926 GiB, 100%, 530.534 KiB/s, ETA 0s
Transferred:           19 / 19, 100%
Elapsed time:     11m33.4s
           ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/download-file/customize.py
         ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/download-and-extract/customize.py
       ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-ml-model-stable-diffusion/customize.py

{
  "return": 0,
  "env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32"
  },
  "new_env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32"
  },
  "state": {},
  "new_state": {},
  "deps": [
    "download-and-extract,_rclone,_url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"
  ]
}

Stable diffusion checkpoint path: /home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32

@willamloo3192
Author

I see. Okay, I will have to consult my company's IT department about how to unblock it.
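
In the meantime, a quick connectivity test against the R2 endpoint can isolate which layer is failing (a suggestion, not from the thread; curl honors the same style of proxy environment variables, and -k skips certificate verification much like rclone's --no-check-certificate):

# An S3-style XML error response here means the proxy path works; a TLS failure means the proxy CA is still the problem
curl -v https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
curl -vk https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com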
