DAG, stderr, stdout logs not being retrieved and displayed in Metaflow UI #81

Open
e-conway opened this issue Oct 21, 2022 · 5 comments

Comments

@e-conway

Description

We've got Metaflow and the Metaflow UI deployed on AWS (on a local IP, so not publicly accessible), but the logs aren't being retrieved. There was initially an issue with our ServiceInfoUI container not having enough memory, but this was upped. The RDS burst balance was also too low, but upping the storage to 1000 GiB removed this queue and changed the error message to a generic error, so I don't think this is the issue any more.

The RDS is accessible, and appears to be storing the logs. The logs are also available from the relevant S3 buckets, Step Functions and Batch.

I can't find exactly where the UI is trying to pull data from, so I'm not sure whether it's a permissions issue with access to the RDS, but the S3 bucket seems to be accessible. As far as I can see, the permissions/configuration are the same as in the Metaflow UI CF template, so I was interested to know if anyone else has had or is having this issue.

Steps to Reproduce

  1. Not exactly certain; after running a few flows for a while, view the Metaflow UI

Expected behavior:

The DAG, stderr and stdout views in the UI display the messages being logged in CloudWatch.

Actual behavior:

Error messages don't appear:
image

Reproduces how often:

Every time the UI is used. I've previously looked at a public example from Outerbounds, but can't view that at the moment. This one wasn't having the same issue a couple of months ago.

Versions

Application version: 1.1.4
Service version: 2.3.2
My machine: MacOS 12.6
Viewing on Safari: v16.0

@obgibson
Collaborator

Do you see the tasks for each run in the UI? When you go to the DAG tab, what do you see? Do you see stderr, stdout, or cards showing on the task view?

@e-conway
Author

Thanks for getting back to me. This is what I get on the DAG tab:
image

The timeline loads correctly, but as soon as I click on a task, it loads very slowly. If it eventually loads, it shows the error card from above; if not, it looks like this:
Screenshot 2022-10-24 at 08 27 36

Interestingly, the task details seem to be loading though:
image

@obgibson
Collaborator

Could you have a look at the JavaScript console? Do you see any errors? Also take a look at the Network tab in your developer tools. Can you see any errors for requests (e.g. to /dag)?

If you'd like, you can move this discussion across to the #ask-metaflow slack channel at https://outerbounds-community.slack.com/

@e-conway
Author

Yep, so I'm getting a 504, as below:
Screenshot 2022-10-25 at 08 33 52

but then if I leave it longer there's also a web socket connection error:

image

And then looking into it in CloudFront, I'm getting a lot of errors. As far as I can tell, the 5xx errors occur when I'm loading the page, and the 4xx errors seem to be continual. I looked into this further in the logs but couldn't tell what was wrong. I found some failed access attempts from IPs around the world, all of which appear to be in AWS datacenter locations (or show as datacenter on IP lookup), but I wasn't sure whether that played into it too.

image

On the network side, the requests for out, err and dag are timing out after 1.5 minutes.

@obgibson
Collaborator

Can you take a look at the metaflow-service logs from around the time that the request to runs/80332/dag gave a 504?

The logs look like this:

metaflow-service-ui_backend-1 | INFO:aiohttp.access:172.19.0.1 [25/Oct/2022:17:55:47 +0000] "GET /flows/TextToImages/runs/78/metadata?step_name=start HTTP/1.1" 200 3889 "http://localhost:3000/?_group_limit=30&_limit=30&_order=-ts_epoch&status=completed%2Cfailed%2Crunning&timerange_start=1664064000000" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"

Generally this type of error is because MFGUI can't talk to S3 due to permissions and/or auth. Hopefully we can see some evidence in the metaflow-service logs.
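To narrow the service logs down to the request in question, a small filter along these lines may help. It is only a sketch: the regular expression is written against the single sample log line above and assumes other access-log lines have the same shape, and the names `ACCESS_LINE`, `requests_matching` and `SAMPLE` are made up for illustration. A 504 returned by a load balancer or CloudFront may never show up as a 504 in the backend log, so the filter matches on the request path rather than on the status code.

```python
import re

# Pattern for the access-log shape seen in the sample line, e.g.
#   [25/Oct/2022:17:55:47 +0000] "GET /flows/... HTTP/1.1" 200 3889 ...
# Assumption: other lines emitted by aiohttp.access follow the same layout.
ACCESS_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3})\b'
)

# The sample line quoted in this thread, used here as test input.
SAMPLE = (
    'metaflow-service-ui_backend-1 | INFO:aiohttp.access:172.19.0.1 '
    '[25/Oct/2022:17:55:47 +0000] '
    '"GET /flows/TextToImages/runs/78/metadata?step_name=start HTTP/1.1" '
    '200 3889 "http://localhost:3000/" "Mozilla/5.0"'
)


def requests_matching(lines, fragment):
    """Return (time, method, path, status) for access-log lines whose
    request path contains `fragment`; lines that do not parse are skipped."""
    hits = []
    for line in lines:
        match = ACCESS_LINE.search(line)
        if match and fragment in match.group("path"):
            hits.append(
                (
                    match.group("time"),
                    match.group("method"),
                    match.group("path"),
                    int(match.group("status")),
                )
            )
    return hits
```

Calling `requests_matching(log_lines, "runs/80332/dag")` on the saved service output would list every logged attempt at the DAG request with its timestamp and status. If nothing comes back for the time of the 504, the request may not have completed in the backend at all, which would fit a timeout while it waits on S3.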
