Stabilizing the signals feature - $SA 4M bounty! #4727

Open
julian-molina opened this issue Feb 1, 2023 · 4 comments

@julian-molina
Member

We've been experiencing instability both in the broadcasting and the following of signals, with several reports in the Signalytic group that may point to multiple different issues.

There is a hefty bounty (see the title of the issue) in place for the team that takes us back to a stable state, as per the following criteria:

  1. Followers should enjoy at least three weeks of uninterrupted service for all the available Signalytic signals. We may ignore clear instances of outages caused by Force Majeure or those dependent on followers' actions/infrastructure.

  2. This is implied in point 1 but worth clarifying: the broadcasting of signals must also be uninterrupted, with the same caveats as to Force Majeure events.

  3. The team will follow the lead developer's advice and work on improving the infrastructure to make it more resilient and easier to debug -- not just fixing existing bugs.

  4. The team should self-organize and be vigilant on the reports that followers may bring forward on both the Signalytic group and the Superalgos Trading group. It is the team's responsibility to ask for clarification, logs, or whatever may help investigate the issues, as well as opening and tracking issues on GitHub.

  5. The three weeks of stability will be measured from the day all reported issues have been solved and after followers get to update to the latest version of the codebase (and workspaces if necessary).

Half of the bounty will be paid once the three-week stability threshold is achieved. The second half will be paid once we reach six weeks of continuous stability.

@julian-molina
Member Author

As with every other bounty, people are free to chip in and pledge more SA to the bounty. I will update the title accordingly.

@BastianMuc
Contributor

BastianMuc commented Feb 16, 2023

I have been able to log one occasion of the trading task on the receiver side dying as follows:

0AC4DF0D-FF0A-47FD-AD62-3A15083DD8E8

The message points to a problem writing the log file to disk, which caused the task to exit; the logging engine is therefore the suspect for this specific issue. The earlier error, a failure to download the signal package content, occurs more regularly and is handled cleanly by SA.

I checked the docs of our logging engine, Winston, and found a few parameters that prevent the logger from exiting when such errors occur. A logging configuration change that includes these parameters was submitted in PR #4753.
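
To illustrate the failure mode independently of Winston, here is a minimal sketch of a logger wrapper that never lets a failing write take down the calling task. `SafeLogger` and `LogSink` are hypothetical names made up for this example; they are not part of Superalgos or Winston, and the change in PR #4753 relies on Winston's own configuration parameters instead.

```typescript
// Hypothetical sketch: a logger wrapper that never lets a failing sink
// (for example a disk write error) propagate into the calling task.
type LogSink = (line: string) => void;

class SafeLogger {
  private readonly sink: LogSink;
  private dropped = 0;

  constructor(sink: LogSink) {
    this.sink = sink;
  }

  // Write one line. Sink errors are swallowed and counted instead of thrown,
  // so a logging problem cannot end the task that is logging.
  log(line: string): boolean {
    try {
      this.sink(line);
      return true;
    } catch {
      this.dropped += 1;
      return false;
    }
  }

  // Number of lines lost to sink failures so far.
  get droppedCount(): number {
    return this.dropped;
  }
}
```

The trade-off is that log lines can be lost silently, so a counter like `droppedCount` should be surfaced somewhere other than the log file itself.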

This most likely isn't the root cause of all the issues (sender-side interruptions appear to behave differently, per current reports). It may, though, be one piece of the overall puzzle.

@BastianMuc
Contributor

@BlaaSwe kindly provided a bunch of log files after the recent Trend Soaring outages. The log file status was identical for both outages I was able to analyze.

This is the Trading Process log of the process that died (the last line was written after restarting the dead task):
image

In successful process loops, the logging normally continues as follows:
image

Learning:
When hangups occur, the Trading Task seems to be waiting for an Event about a completed Data Process. That event, however, never arrives.

As a next step, I checked the logs of the Data Task the trading task was waiting for:
image

Learning:
The Data Task for which the Trading Task was waiting finished normally. The Execution Finished Event was correctly raised, but never arrived at the Trading Task.
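
One way to keep a task from hanging forever on an event that never arrives is to attach a deadline to the wait. The following is only a sketch under assumptions: `EventWait`, its methods, and the event name used in the usage note are hypothetical and do not come from the Superalgos codebase. The caller supplies the clock, so the task loop can poll it on every cycle.

```typescript
// Hypothetical sketch: track a pending wait for a named event and report
// when it has exceeded its deadline, instead of blocking forever.
type WaitStatus = "pending" | "received" | "timedOut";

class EventWait {
  private readonly eventName: string;
  private readonly deadlineMs: number;
  private received = false;

  // startedAtMs and timeoutMs are in milliseconds.
  constructor(eventName: string, startedAtMs: number, timeoutMs: number) {
    this.eventName = eventName;
    this.deadlineMs = startedAtMs + timeoutMs;
  }

  // Call from the event handler; events with other names are ignored.
  notify(name: string): void {
    if (name === this.eventName) {
      this.received = true;
    }
  }

  // Poll from the task loop. An event that already arrived wins over the deadline.
  status(nowMs: number): WaitStatus {
    if (this.received) {
      return "received";
    }
    return nowMs >= this.deadlineMs ? "timedOut" : "pending";
  }
}
```

A task could create `new EventWait("Execution Finished", Date.now(), 60000)` and, on `"timedOut"`, log the stall and re-subscribe or restart rather than wait silently.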

Possible Root Causes (ideation):

  • Event Server Outage (unlikely, as the other strategies on the same machine continued to work)
  • Sudden Death of the Trading Task without leaving any trace in the logs
  • Interrupted communication between Event Server Clients and Event Server (Websockets connections)

As WebSocket connections were already causing headaches on the signal distribution side, they are my main suspect for now (even though sender and recipient run on the same machine here).

I have submitted PR #4833 to prevent undetected connection losses between event servers and clients. Let's see if this helps.
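
For readers unfamiliar with the general idea, undetected connection losses are commonly caught with a heartbeat: each side records when it last heard from the other and treats prolonged silence as a dead connection. The sketch below shows only that bookkeeping. `HeartbeatMonitor` and the client IDs in the usage note are hypothetical, and this is not necessarily how PR #4833 implements it.

```typescript
// Hypothetical sketch: decide which client connections are stale based on
// when each one last showed a sign of life.
class HeartbeatMonitor {
  private readonly maxSilenceMs: number;
  private lastSeenMs: Record<string, number> = {};

  constructor(maxSilenceMs: number) {
    this.maxSilenceMs = maxSilenceMs;
  }

  // Record any sign of life (a heartbeat reply or a regular message).
  touch(clientId: string, nowMs: number): void {
    this.lastSeenMs[clientId] = nowMs;
  }

  // Return and forget every client that has been silent for longer than
  // maxSilenceMs, so the caller can close and re-establish those connections.
  reapStale(nowMs: number): string[] {
    const stale = Object.keys(this.lastSeenMs).filter(
      (id) => nowMs - this.lastSeenMs[id] > this.maxSilenceMs
    );
    for (const id of stale) {
      delete this.lastSeenMs[id];
    }
    return stale;
  }
}
```

In use, the server would call `touch("trading-task", Date.now())` whenever a client answers, and run `reapStale(Date.now())` on a timer to find connections that need to be reset.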

Further ideas based on above logging inputs are highly welcome!

@julian-molina changed the title from "Stabilizing the signals feature: 4M SA bounty!" to "Stabilizing the signals feature - $SA 4M bounty!" on Aug 9, 2023
@BastianMuc
Contributor

PR #5043 submitted to hopefully kill this issue (finally).
