
ChiaDog is not recovering from a remote harvester being down #283

Open
Jacek-ghub opened this issue Aug 23, 2021 · 8 comments

Comments

@Jacek-ghub
Jacek-ghub commented Aug 23, 2021

Hi, I have ChiaDog running on a CentOS box. I mapped my harvesters to local folders. Works great.

However, when a harvester box is restarted, ChiaDog gets stuck and never sees that log file again until I restart ChiaDog for that harvester. Maybe when ChiaDog detects a harvester as down (no access to the file), it should periodically check whether file access has been restored?

Steps to reproduce:

  • Setup
    • One box for harvester, one for ChiaDog
  • Map harvester log folder to a local folder on ChiaDog box
  • Run ChiaDog
  • Unplug the network cable from the ChiaDog box
    • ChiaDog starts sending "Harvester Down" notifications
  • Reconnect network to ChiaDog
    • ChiaDog keeps sending "Harvester Down" notifications

Environment:

  • OS: CentOS (for ChiaDog box)
  • Python version: 3.9.6
  • ChiaDog version: presumably latest. Maybe the ChiaDog version should be included in those notifications, or logged on the first line at startup?
  • Harvester: remote, but its log folder is mapped locally, so it appears local to ChiaDog (maybe that is why ChiaDog does not check whether file access was restored: it assumes a catastrophic failure due to a reboot, rather than a recoverable network outage?)

Here is the exception generated when harvester went down:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/mnt/chia_logs/chiadog/ox/src/chia_log/log_consumer.py", line 75, in _consume_loop
    for log_line in Pygtail(self._expanded_log_path, read_from_end=True, offset_file=self._offset_path):
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/pygtail/core.py", line 89, in __init__
    if self._offset_file_inode != stat(self.filename).st_ino or \
OSError: [Errno 112] Host is down: '/mnt/chia_logs/ox/debug.log'
Exception ignored in: <function Pygtail.__del__ at 0x7f8f87633c10>
Traceback (most recent call last):
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/pygtail/core.py", line 97, in __del__
    if self._filehandle():
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/pygtail/core.py", line 179, in _filehandle
    self._fh = open(filename, "r", 1)
OSError: [Errno 112] Host is down: '/mnt/chia_logs/ox/debug.log'
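For what it's worth, the traceback above suggests a simple mitigation: once the OSError fires, poll the file until stat() succeeds again before resuming. A minimal sketch (hypothetical helper, not ChiaDog code), assuming the Samba/NFS mount eventually comes back:

```python
import os
import time


def wait_for_file(path: str, poll_seconds: float = 30.0) -> None:
    """Block until `path` is stat-able again, e.g. after a remote
    Samba/NFS mount recovers. Hypothetical helper, not ChiaDog code."""
    while True:
        try:
            # os.stat raises OSError (e.g. errno 112, "Host is down") while unreachable
            os.stat(path)
            return
        except OSError:
            time.sleep(poll_seconds)
```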

@sorenfriis

I see the same issue when using the network_log_consumer over SSH.
When the connection is lost (e.g. dropped WiFi), it is never restored, and I have to restart the ChiaDog instance to reestablish the connection and resume consuming the log files.

@Jacek-ghub
Author

@sorenfriis Is there any reason you would prefer to expose the whole box (via SSH) rather than just mapping the log folder locally with read-only privileges? You can map it over Samba or NFS, so any box/OS combination will work.

@sorenfriis
Copy link

@Jacek-ghub I am only letting ChiaDog in over SSH with a dedicated user who has read access to the log file only.

@Jacek-ghub
Author

I would also suggest sending just one notification per harvester-down event. I guess we all know what to do once notified, so the extra notifications are both redundant and (to me, at least) annoying.

That said, I would also like to see a notification when a bunch of plots is added (which would indicate a new drive with plots being connected, e.g. when moving HDs around). That notification would usually complement the one sent when plots disappear from the harvester (HD unplugged from the plotter). It would be a good confirmation that the harvester recognized the added drive, so we would not need to rely on the rather hopeless full node UI.
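To illustrate the idea, a tiny sketch of how such a check could classify plot-count changes between scans (the helper name and threshold are hypothetical, not ChiaDog code):

```python
from typing import Optional


def plot_count_change(previous: int, current: int, threshold: int = 1) -> Optional[str]:
    """Classify a change in the harvester's plot count so a notification
    can be sent for additions as well as losses. Hypothetical helper."""
    delta = current - previous
    if delta >= threshold:
        return f"Added {delta} plots"
    if delta <= -threshold:
        return f"Lost {-delta} plots"
    return None  # no significant change, no notification
```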

@martomi
Owner

martomi commented Aug 26, 2021

Like the suggestions & ideas! Happy to provide guidance if you or anyone else is interested to tackle them in code :-)

@Jacek-ghub
Author

Sorry, I don't know anything about Python, so my questions may be rather dumb. I did test changes to the daily status messages, but it was a pain sitting in the root folder and trying to grep stuff.

Which files are involved in opening those log files?

@martomi
Owner

martomi commented Aug 27, 2021

You can see a high-level architecture diagram here - it should make the file structure more intuitive. The log consumers are defined in log_consumer.py.

Since you have mapped the remote log file to the local filesystem, the most relevant part of the code is in the FileLogConsumer here (we use pygtail):

for log_line in Pygtail(self._expanded_log_path, read_from_end=True, offset_file=self._offset_path):

@ZwaZo22

ZwaZo22 commented Sep 24, 2021

> Hi, I have ChiaDog running on a CentOS box. I mapped my harvesters to local folders. Works great.
>
> However, when a harvester box is restarted, ChiaDog is stuck on not seeing that log file anymore, until I restart ChiaDog for that harvester. Maybe when ChiaDog is detecting harvester down (no access to the file), it should try to check whether the file access has been restored?

Got the same behaviour here. The only workaround I have found so far is to kill the ChiaDog process and restart it.

Labels: none yet
Projects: none yet
Development: no branches or pull requests
4 participants