exec stdout w/o newlines breaks health checks by logging on closed pipe #423

Open
ochanism opened this issue Jul 3, 2017 · 16 comments

ochanism commented Jul 3, 2017

Recently I updated ContainerPilot from 2.7.3 to 2.7.4 and then 2.7.5, and I started getting the error described below.

2017/07/03 11:28:35 us-east-1b-log-zookeeper.health exited with error: io: read/write on closed pipe

Actually, ZooKeeper itself was healthy and the health check script exited with code 0, but the ZooKeeper service was marked out-of-service in Consul. Everything looked fine when the service started; the problem appeared after a few days. I looked into the ContainerPilot code to figure out what caused it. The error came out from here and might have been caused by the logger being closed in here, but I couldn't figure out why the logger was closed or what closed it. When I sent SIGHUP to the ContainerPilot process, the logger pipe was recreated and the service health changed back to in-service.

Anyway, in this situation I think the service health should stay in-service even if the logger is closed.
I've attached some scripts below to help you debug this issue.

health check script

#!/bin/bash
echo ruok | nc ${HOSTNAME} ${ZOOKEEPER_PORT}

containerpilot configuration

{
  "consul": "localhost:8500",
  "preStart": "/zookeeper/{{.ZOOKEEPER_NAME}}/config/preStart.sh",
  "logging": {
    "level": "INFO",
    "format": "default",
    "output": "stdout"
  },
  "stopTimeout": 5,
  "preStop": "/zookeeper/{{.ZOOKEEPER_NAME}}/config/preStop.sh",
  "postStop": "/zookeeper/{{.ZOOKEEPER_NAME}}/config/postStop.sh",
  "services": [
    {
      "id": "{{.DC}}-{{.ZOOKEEPER_CONSUL_ID}}-{{.HOSTNAME}}",
      "name": "{{.DC}}-{{.ZOOKEEPER_CONSUL_NAME}}",
      "port": {{.ZOOKEEPER_PORT}},
      "health": "/zookeeper/{{.ZOOKEEPER_NAME}}/config/healthCheck.sh",
      "tags": [
          {{.ZOOKEEPER_TAGS}}
      ],
      "interfaces": [
          {{.CONTAINER_PILOT_INTERFACES}}
      ],
      "poll": 10,
      "timeout": "10s",
      "ttl": 30
    }
  ],
  "backends": [
  ],
  "tasks": [
  ],
  "coprocesses": [
    {
      "name": "setExporter",
      "command": ["/zookeeper/{{.ZOOKEEPER_NAME}}/config/setExporter.sh"],
      "restarts": "never"
    }
  ]
}

tgross commented Jul 3, 2017

Thanks for the detailed report @ochanism. I've dug into this a bit but don't have a conclusive answer yet.

At one point we had a problem where we were dropping logs (this was at the root of #267 #294 #295 #296). We fixed that in #297 by making it so that we didn't close the logger for the health check until we stop polling entirely in Service.PollStop. And that shouldn't happen unless ContainerPilot gets its configuration reloaded or terminated (and in that case it's racy but we can't really do much about that).

That being said, the error message definitely seems to be bubbling up from io.ErrClosedPipe, which, if you chase it down through the stdlib, is I think what we'd expect if the logger is closed when we try to write to it. I'm trying to write a repeatable repro now.


Actually, Zookeeper health was fine and health check script exited with code 0. But Zookeeper health on consul was out-of-service.

Did you have more than one failed health check with this error? With a 10-second poll and a 30-second TTL it should take more than a single failure.

tgross commented Jul 3, 2017

I've broken this down a bit to confirm the error message is as expected, at least.

package main

import (
	"os/exec"

	log "github.com/Sirupsen/logrus"
)

func main() {
	logger := log.StandardLogger().Writer()

	cmd := exec.Command("sh", "-c", "echo stdout; echo 1>&2 stderr")
	cmd.Stdout = logger
	cmd.Stderr = logger

	cmd.Start()
	//	logger.Close() // uncomment this line to get expected err

	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}

With the logger.Close() line commented-out we get:

$ go run logbug.go
INFO[0000] stdout
INFO[0000] stderr

Whereas with that line uncommented we get:

$ go run logbug.go
FATA[0000] io: read/write on closed pipe
exit status 1

So I think the problem now is figuring out how or why we're closing the logger. With it only happening after a few days, getting a reliable repro will be key.

tgross commented Jul 3, 2017

I think I've narrowed this down further with the following:

check.sh:

#!/bin/bash

while true; do
    echo 'check'
    sleep 1
done

logbug.go:

package main

import (
	"os/exec"
	"time"

	log "github.com/Sirupsen/logrus"
)

func main() {
	logger := log.StandardLogger().Writer()
	cmd := exec.Command("./check.sh")
	cmd.Stdout = logger
	cmd.Stderr = logger

	err := cmd.Start()
	if err != nil {
		log.Fatalf("failed to start: %v", err)
	}

	go func() {
		timer := time.NewTimer(time.Second * 2)
		<-timer.C
		logger.Close()
	}()

	if err := cmd.Wait(); err != nil {
		log.Fatalf("failed: %v", err)
	}
}

output:

$ go run logbug.go
INFO[0000] check
INFO[0001] check
FATA[0003] failed: signal: broken pipe
exit status 1

Which means that in a case where the exec is running and we've had a chance to properly set up the loggers, we won't get the io: read/write on closed pipe error even if the logger is closed prematurely (we get a different error entirely). So is the problem we're seeing that the logger is being closed before we even call Wait?

@ochanism do you have any other debug logs from the health check around this error? It might help to see if we got other messages.

tgross commented Jul 3, 2017

I spent a bit of time investigating whether it had to do with the timeout, which is the same as the poll interval, under the theory that subsequent runs were interfering with each other. I wrote a config that would exercise this area with a very tight interval and let it run for a good long while, but I still can't replicate the closing of the logger except artificially; there's no code path that hits that. I'm beginning to wonder whether the logger is a red herring, and whether it's another io.WriteCloser in the path that's getting closed when it shouldn't be.

tgross commented Jul 3, 2017

I think I've figured out a way to bypass the issue entirely, as well as solidify the fix in #297 and fix a regression in log formatting that I was already planning to fix for #393.

Instead of setting up individual loggers, I'm going to update Logrus to 1.0 and then use a context logger and attach the new logrus.Entry.Writer to the process stdout/stderr, rather than creating a new logger. I'll have a PR for v3 shortly (probably reviewed on Weds morning at this point, as it's a holiday here in the US tomorrow) and backport it to v2.
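
For illustration, the general shape of that change might look something like the sketch below. This is not the actual PR, just a hedged example: the context fields and script path are made up, and the point is attaching a context logger's Entry.Writer to the child process instead of constructing a standalone logger.

package main

import (
	"os/exec"

	log "github.com/Sirupsen/logrus"
)

func main() {
	// Hypothetical context fields, for illustration only.
	entry := log.WithFields(log.Fields{"job": "health"})

	// Entry.Writer returns an io.PipeWriter that logs each line it receives.
	logger := entry.Writer()
	defer logger.Close()

	cmd := exec.Command("./healthCheck.sh") // illustrative script path
	cmd.Stdout = logger
	cmd.Stderr = logger

	if err := cmd.Run(); err != nil {
		entry.Errorf("health check failed: %v", err)
	}
}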

tgross commented Jul 3, 2017

PR open for v3 #424
PR open for v2 #425

ochanism commented Jul 4, 2017

Did you have more than one failed health check with this error? With a 10 sec poll and a 30 sec TTL it should take more than a single case.

@tgross, yes, I got the error every 10 sec.

do you have any other debug logs from the health check around this error? It might help to see if we got other messages.

@tgross, I launched it with the debug log level yesterday and have been waiting for the error to recur.
I'll provide the debug logs as soon as I have them.

Thanks for your fast support.

ochanism commented Jul 4, 2017

I got the debug logs around the error. It occurred 26 hours after the service started.
According to the logs, the logger was not closed; it broke because of an oversized token.
Do you have any idea about this log?

2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run waiting for PID 73299:
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run complete
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:11 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run waiting for PID 73424:
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run complete
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:21 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.run waiting for PID 73622:
2017/07/04 13:24:31 Error while reading from Writer: bufio.Scanner: token too long
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe

2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:31 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.run waiting for PID 73772:
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:41 us-east-1a-log-api-proxy.health.run#gofunc received quit

2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.RunWithTimeout start
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.Cmd.Start
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.run waiting for PID 73905:
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health exited with error: io: read/write on closed pipe
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.RunWithTimeout end
2017/07/04 13:24:51 us-east-1a-log-api-proxy.health.run#gofunc received quit

ochanism commented Jul 4, 2017

I found that Logrus 1.0 also has the same token-size issue in here, so I don't think Logrus 1.0 is safe from this error either.

ochanism commented Jul 5, 2017

I think I've figured out what caused it. The output of the health check script did not end with a newline character, so the output kept accumulating in the buffer.

The default value of bufio.MaxScanTokenSize is 64KB. In my case, the output string was 7 bytes long, the health check interval was 10 seconds, and the service had been in-service for 26 hours. The total output over those 26 hours is ((26*3600)/10)*7 = 65520 bytes, which is very close to 64KB.

I added a newline character to the health check script's output to verify my hypothesis, and I've been keeping an eye on it.
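
A compressed way to see the same limit without waiting 26 hours, assuming the writer is logrus's pipe-backed Writer as in the examples above, is to push a single chunk larger than 64KB with no newline through it:

package main

import (
	"strings"

	log "github.com/Sirupsen/logrus"
)

func main() {
	logger := log.StandardLogger().Writer()
	defer logger.Close()

	// One write of more than bufio.MaxScanTokenSize (64KB) with no newline
	// stands in for ~26 hours of 7-byte health-check output accumulating in
	// the line scanner behind the writer.
	_, err := logger.Write([]byte(strings.Repeat("a", 64*1024+1)))

	// Expected: logrus reports "Error while reading from Writer: bufio.Scanner:
	// token too long" and the blocked write then fails on the closed pipe.
	log.Infof("write error: %v", err)
}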

ochanism commented Jul 5, 2017

I've confirmed that there is no error once the health check script emits a newline.
You can easily reproduce the error with the health check script below.

#!/bin/bash
printf aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

tgross commented Jul 5, 2017

That's some nice detective work there, @ochanism!

I'm glad you were able to find a workaround for now, but we should probably come up with something more robust against this failure. Next steps:

  • I can take a look at sirupsen/logrus#564 (Error with Entry.writerScanner: token too long) and see whether we can fix it in the library. Blindly increasing the buffer size seems like a bad idea, so handling ErrTooLong (or ErrBufferFull) gracefully, as suggested in that issue, seems like a better approach.
  • We can see whether it's feasible to inject a newline at the end of an exec run so that we ensure the log entry is closed, but do so without adding a spurious empty log line in the case where the health check's output does end with a newline.
  • Alternatively, we might want to look at capturing this error and handling it more gracefully in our wait, but we need to make sure we don't break all other logging in the process.
  • I'm going to open a new issue to track this work. Edit: actually I'm just going to keep this one open and rename it.
  • I'll open a PR w/ a failing test that we can have commented-out in master but will serve as a "definition of done" when it works.

@tgross tgross added v3.x and removed open PR labels Jul 5, 2017
@tgross tgross changed the title Failed to check service health due to logging on closed pipe exec stdout w/o newlines breaks health checks by logging on closed pipe Jul 5, 2017
tgross commented Jul 5, 2017

With respect to trying to handle the error in ContainerPilot, it looks like neither of the two approaches above is going to work (injecting a newline, or capturing the error and handling it gracefully).

From the docs of the underlying stdlib interface (https://golang.org/pkg/bufio/#Scanner):

Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token.

Once we've hit the error, the logger is already broken. If it were possible to inject a newline we would reduce the risk of hitting the error, but we'd still hit it whenever a single exec outputs more than 64KB without a newline. Handling this error within ContainerPilot would mean recreating the logger entirely, and even then we'd lose all the logs from any long-running jobs forever. This doesn't seem like a good approach, so we should try to get the fix done at the library level.
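
For reference, here is a rough sketch of what a library-level mitigation could look like. This is hypothetical code, not logrus's actual implementation: a pipe-backed line writer whose split function flushes an oversized chunk instead of letting bufio.Scanner stop with ErrTooLong and tear down the pipe. The newLineWriter helper is an illustrative name.

package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"time"
)

// newLineWriter returns a writer that logs each newline-terminated chunk it
// receives. The custom split function emits an oversized chunk rather than
// returning bufio.ErrTooLong, so one long line can't permanently break the
// writer.
func newLineWriter(logf func(string)) io.WriteCloser {
	pr, pw := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(pr)
		scanner.Split(func(data []byte, atEOF bool) (int, []byte, error) {
			if i := bytes.IndexByte(data, '\n'); i >= 0 {
				return i + 1, data[:i], nil // a complete line
			}
			if atEOF && len(data) > 0 {
				return len(data), data, nil // trailing data without a newline
			}
			if len(data) >= bufio.MaxScanTokenSize-1 {
				return len(data), data, nil // flush instead of erroring
			}
			return 0, nil, nil // need more data
		})
		for scanner.Scan() {
			logf(scanner.Text())
		}
		pr.CloseWithError(scanner.Err())
	}()
	return pw
}

func main() {
	w := newLineWriter(func(line string) { log.Println(line) })
	io.WriteString(w, "check one\n")
	io.WriteString(w, "output with no trailing newline")
	w.Close()                          // EOF flushes the unterminated chunk
	time.Sleep(100 * time.Millisecond) // let the scanner goroutine finish logging
}

Whether something along these lines is acceptable upstream is exactly the question in sirupsen/logrus#564.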

tgross commented Jul 5, 2017

I've pushed 3408f95 to update the documentation to warn about this issue. I'm going to cut a release for 2.7.6 and 3.1.2 that gives us the logrus 1.0 update and closes any possibility of a race around log closing.

And then I'll circle back to look at sirupsen/logrus#564, in particular what's happening there around the bufio.Scanner.Scan call.

tgross commented Jul 31, 2017

I don't have the bandwidth right now to submit a fix to sirupsen/logrus. Ultimately this is an end-user issue that we've documented a warning for, and fixing the upstream problem doesn't have a ton of urgency. I'm going to mark this as blocked and help-wanted for the time being.

tgross commented Aug 3, 2017

Another workaround for anyone running into this kind of problem is to use the passthru logging option described in #462 and #437.

larslevie added a commit to levieindustries/elementary that referenced this issue Nov 22, 2017
@tgross tgross removed their assignment Nov 29, 2021