[SKY-270] [bug] Leaked instances for Ctrl-C of transfers #885

Open
sarahwooders opened this issue Jun 23, 2023 · 0 comments

sarahwooders commented Jun 23, 2023

Describe the bug
One VM is consistently left running (leaked) after a transfer is cancelled with Ctrl-C.

To Reproduce
Run the transfer skyplane cp -r gs://skyplane-big-test-bucket/OPT-cloudflare/ s3://test-us-east-1-7711e4ae/ and, during dispatch, press Ctrl-C to exit it.

Transfer client log

Logging to: /tmp/skyplane/transfer_logs/20230623_145734-bd9ae325/client.log
Using Skyplane version 0.3.2
Will transfer objects from gcp:us-central1-a to aws:us-east-1
14:57:36 [WARN]  Quota limit file not found for aws:us-east-1. Try running `skyplane init --reinit-aws` to load the quota information
  VMs to provision: 1x aws:us-east-1, 1x gcp:us-central1-a
  Estimated egress cost: $0.12/GB
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-0.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-0.pt
(15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-1.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-1.pt
(15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-2.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-2.pt
(15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-3.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-3.pt
(15.34GB)
  gs://skyplane-big-test-bucket/OPT-cloudflare/reshard-model_part-4.pt => s3://test-us-east-1-7711e4ae/reshard-model_part-4.pt
(15.34GB)
  ...
Transfer starting
14:57:41 [WARN]  Quota limit file not found for aws:us-east-1. Try running `skyplane init --reinit-aws` to load the quota information
✓ Provisioning VMs (2/2) in 37.14s
⠼ Authorizing gateways with firewalls ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/2 0:00:0114:58:41 [WARN]  :us-east-1 Error adding IPs to security group, since it already exits: An error occurred (InvalidPermission.Duplicate)
when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 0.0.0.0/0, ALL, ALLOW" already exists
✓ Starting gateway container on VMs (2/2) in 28.52s
⠹ Transfer progressaws:us-east-1 ━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/122.7 GiB 482.5 MB/s 0:04:15^C
Transfer cancelled by user. Copying gateway logs and exiting.
⠇ Transfer progressaws:us-east-1 ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.1/122.7 GiB 473.2 MB/s 0:04:1415:00:00 [ERROR] Error running <lambda>, GCPServer(region_tag=gcp:us-central1-a, instance_name=skyplane-gcp-de24eada): 'NoneType'
object has no attribute 'open_session'
15:00:00 [ERROR] Error running <lambda>, AWSServer(region_tag=aws:us-east-1, instance_id=i-0861627e6ae3b80f1): 'NoneType' object has no
attribute 'open_session'
Exception in thread Thread-35:
Traceback (most recent call last):
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 181, in monitor_single_dst_helper
    self.monitor_transfer(dst_region)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/imports.py", line 33, in wrapped
    return fn(*modules_imported, *args, **kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 278, in monitor_transfer
    do_parallel(lambda i: i.run_command("echo 1"), self.dataplane.bound_nodes.values(), n=8)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 57, in do_parallel
    args, result = future.result()
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py",
line 451, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py",
line 403, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/thread.py",
line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 43, in wrapped_fn
    return args, func(args)
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 278, in <lambda>
    do_parallel(lambda i: i.run_command("echo 1"), self.dataplane.bound_nodes.values(), n=8)
  File "/Users/sarahwooders/repos/skyplane/skyplane/compute/server.py", line 241, in run_command
    _, stdout, stderr = client.exec_command(command)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.10/site-packages/paramiko/client.py", line 560, in exec_command
    chan = self._transport.open_session(timeout=timeout)
AttributeError: 'NoneType' object has no attribute 'open_session'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1016, in
_bootstrap_inner
    self.run()
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 216, in run
    raise e
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 214, in run
    results.append(future.result())
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py",
line 451, in result
    return self.__get_result()
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py",
line 403, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/thread.py",
line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/tracker.py", line 194, in monitor_single_dst_helper
    UsageClient.log_exception(
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/usage.py", line 147, in log_exception
    stats = client.make_error(
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/usage.py", line 304, in make_error
    dest_regions = [tag.split(":")[1] for tag in dest_region_tags]
  File "/Users/sarahwooders/repos/skyplane/skyplane/api/usage.py", line 304, in <listcomp>
    dest_regions = [tag.split(":")[1] for tag in dest_region_tags]
IndexError: list index out of range
⠇ Transfer progressaws:us-east-1 ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.1/122.7 GiB 473.2 MB/s 0:04:14%
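
Note on the second traceback: it ends in tag.split(":")[1] raising IndexError, which is what Python does whenever a tag contains no colon (an empty string, for example). The log does not show which tag value was actually passed, so the snippet below is only a minimal sketch of that failure mode and of one tolerant alternative. It is not the Skyplane code; region_from_tag and regions_from_tags are hypothetical helper names used for illustration.

```python
from typing import List, Optional


def region_from_tag(tag: str) -> Optional[str]:
    """Return the region part of a 'provider:region' tag, or None if malformed.

    Hypothetical helper, not part of Skyplane.
    """
    # partition never raises: a missing separator yields sep == "" and region == "".
    _provider, sep, region = tag.partition(":")
    return region if sep and region else None


def regions_from_tags(tags: List[str]) -> List[Optional[str]]:
    """Map every tag to its region, keeping None for tags that cannot be parsed."""
    return [region_from_tag(t) for t in tags]


# The pattern from the traceback fails as soon as one tag has no colon.
# The empty string here is an assumed example value, not taken from the log.
sample_tags = ["aws:us-east-1", ""]
try:
    [tag.split(":")[1] for tag in sample_tags]
    failure = None
except IndexError as exc:
    failure = str(exc)

print(failure)                         # list index out of range
print(regions_from_tags(sample_tags))  # ['us-east-1', None]
```

Whether a malformed tag should be skipped, reported as None, or treated as an error is a design choice for the maintainers. The point of the sketch is only that the error-reporting path itself can fail on such input, which may hide the original exception.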

Environment info (please complete the following information):

  • OS: Mac OS
  • Python version: 3.10
  • Skyplane version: 0.3.2

SKY-270

@sarahwooders sarahwooders added the bug Something isn't working label Jun 23, 2023
@sarahwooders sarahwooders changed the title [bug] Leaked instances for Ctrl-C of transfers [SKY-270] [bug] Leaked instances for Ctrl-C of transfers Jun 23, 2023