[BUG] Random medperf local server failures: os.getcwd(), FileNotFoundError: No such file or directory #501

Open
VukW opened this issue Nov 17, 2023 · 0 comments
Labels
project: Core type: bug Something isn't working

Issue description

I'm running the medperf tutorials in WSL and see strange behaviour: the client and the server start to fail randomly. Since I first thought this was an internal medperf issue, I'm documenting the details here.
While going through the tutorials (https://docs.medperf.org/getting_started/benchmark_owner_demo/ and others), I use a local medperf server. While running some (random) commands, usually heavy ones that require a lot of I/O, I get the following error:

Client side:

Traceback (most recent call last):
  File "/home/vukw/anaconda3/envs/env39_medperf/bin/mlcube", line 5, in <module>
    from mlcube.__main__ import cli
  File "/home/vukw/anaconda3/envs/env39_medperf/lib/python3.9/site-packages/mlcube/__main__.py", line 66, in <module>
    default=os.getcwd(),
FileNotFoundError: [Errno 2] No such file or directory

Interestingly, it affects not only the client side but also the server side (which is running in an independent bash terminal):

Traceback (most recent call last):
  File "/home/vukw/anaconda3/envs/env39_medperf/lib/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/home/vukw/anaconda3/envs/env39_medperf/lib/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/home/vukw/anaconda3/envs/env39_medperf/lib/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/home/vukw/anaconda3/envs/env39_medperf/lib/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/home/vukw/anaconda3/envs/env39_medperf/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 209, in get_new_connection
    conn = Database.connect(**conn_params)
sqlite3.OperationalError: unable to open database file

Restarting the server doesn't help either:

$ sh setup-dev-server.sh
realpath: cert.crt: No such file or directory
realpath: cert.key: No such file or directory


1
1

0
CERT FILE must not be empty

Moreover, not only medperf is broken, but pip as well:

$ pip list
The folder you are executing pip from can no longer be found.
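
The client traceback and the pip message both point at a working directory that the process can no longer resolve. A minimal sketch of that failure mode, assuming a Linux shell: it deletes the current directory as a stand-in for whatever happens to the WSL mount (the real trigger is unknown), and the helper name getcwd_after_removal is hypothetical.

```python
import os
import tempfile


def getcwd_after_removal():
    """Return the exception os.getcwd() raises once the cwd is deleted, else None."""
    original = os.getcwd()
    doomed = tempfile.mkdtemp()
    os.chdir(doomed)
    # Stand-in for the mount becoming unavailable: remove the directory
    # while the process is still "inside" it.
    os.rmdir(doomed)
    try:
        os.getcwd()
        return None
    except FileNotFoundError as exc:
        return exc
    finally:
        # Go back to a directory that still exists.
        os.chdir(original)


if __name__ == "__main__":
    print(repr(getcwd_after_removal()))
```

This only shows that a vanished working directory is enough to produce the same exception type; it does not prove that this is what happens on the mounted drive.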

Workarounds and solutions.

Workarounds

  1. Rerunning the server and the client in a new bash terminal fixes the issue, but only for a while. After a few commands the error is raised again.
  2. cd . also helps, like magic. It seems to reset the working directory path, but again only for a while.
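
In the spirit of the cd . workaround, a script could re-enter a known directory when its working directory stops resolving. This is a rough sketch, not something medperf or mlcube does: the function name ensure_cwd and its fallback parameter are hypothetical, and it has not been tried on the affected WSL setup.

```python
import os


def ensure_cwd(fallback):
    """Return a usable working directory, re-entering `fallback` if the current one is gone."""
    try:
        return os.getcwd()
    except FileNotFoundError:
        # Same idea as typing `cd <path>` again in the shell.
        os.chdir(fallback)
        return os.getcwd()
```

Passing the path the shell was started in (for example os.environ.get("PWD")) would be the closest analogue to cd ., assuming that path becomes reachable again. Like the manual workaround, this would likely only help for a while.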

Solution debugging

Together with @hasan7n we found that this behavior is sometimes seen on external encrypted storage: stackoverflow discussion. In my case I checked out the repo in the Windows environment, so all the files are located under /mnt/c/Users/vykuk/repos/mlc/medperf, which is actually an external, encrypted drive. We also found a WSL issue with similar behavior and a workaround, but with no mention of drive encryption. So it looks like the WSL-mounted drive is (in my case) the core of the problem: external drives can sometimes be locked and unlocked, and this breaks the working directory for all scripts running on that storage.

Solution

Thus, a reasonable solution (which also helped in my case) is to move the whole medperf repository from the Windows-host mounted drive /mnt/c/.... to the internal WSL filesystem. Moving the whole repo folder to /home/medperf removes the issue.
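
To check whether a checkout sits on a Windows-mounted drive before running the tutorials, one option is to look up its mount point in the mount table. A minimal sketch under these assumptions: the mount table has the /proc/mounts layout (device, mount point, filesystem type, ...), WSL reports Windows drives with type 9p or drvfs, and SAMPLE_MOUNTS is an illustrative table, not one captured from the affected machine. The function names are hypothetical.

```python
import os

# Illustrative mount table in /proc/mounts layout (assumption, not real output).
SAMPLE_MOUNTS = """\
/dev/sdc / ext4 rw,relatime 0 0
drvfs /mnt/c 9p rw,noatime,aname=drvfs 0 0
"""

# Filesystem types assumed to indicate a Windows drive mounted into WSL.
WINDOWS_MOUNT_TYPES = {"9p", "drvfs"}


def mount_for(path, mounts_text):
    """Return (mount_point, fs_type) of the longest mount point containing `path`."""
    path = os.path.normpath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best[0]):
            best = (mount_point, fs_type)
    return best


def on_windows_mount(path, mounts_text):
    """True if `path` lives on a mount whose type is in WINDOWS_MOUNT_TYPES."""
    _, fs_type = mount_for(path, mounts_text)
    return fs_type in WINDOWS_MOUNT_TYPES
```

On a real system the second argument would be the contents of /proc/mounts, for example open("/proc/mounts").read(). With the sample table, a path under /mnt/c is flagged and /home/medperf is not.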

Future explorations

I still don't know exactly why the mounted storage gets locked, which conditions lead to it, or who is responsible (the Windows host or Ubuntu itself). Also, I haven't met this issue with other projects located on the mounted drive; medperf is the first one to reproduce the behavior. Finally, the nature of the issue makes it extremely hard to reproduce reliably: the same commands can pass one time and fail with the error the next.

We can expect the same issue to arise on other systems and setups when the medperf repo is located on external storage.

Environment

  • Host system: Windows 11, 22H2, OS build 22623.891
  • WSL 1.2.5.0
  • WSL image ($ uname -r): 5.15.90.1-microsoft-standard-WSL2
  • Guest system: (lsb_release -a): Ubuntu 22.04.1 LTS
@VukW VukW added the type: bug Something isn't working label Nov 17, 2023