Default 404 server with metrics #709

vbannai · 2019-04-02T01:10:59Z

Added a new 404 server with metrics that supports the following:

Rebuild it with newer Go
Supports graceful shutdown
Add metrics serving
- How many requests it is serving
- Serving latency
Add logging
- Respond with a 404 status code and relevant message to every request
- Configurable sampling requests to a max # of logs/sec [0.0 to 1.0]
- Periodically, if there is no traffic, log a message just to say “I am alive”
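
The log sampling item above could work roughly as follows. This is only a minimal sketch, assuming the setting is a fraction of requests in [0.0, 1.0], as the `log_percent_requests` flag help text later in the review suggests. The `shouldLog` helper and its signature are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldLog reports whether a request should be logged, given a sampling
// fraction in [0.0, 1.0] and a uniform random draw in [0.0, 1.0).
// Hypothetical helper: the name and signature are assumptions.
func shouldLog(fraction, draw float64) bool {
	return draw < fraction
}

func main() {
	const fraction = 0.1 // assumed default, mirroring log_percent_requests
	logged := 0
	for i := 0; i < 10000; i++ {
		if shouldLog(fraction, rand.Float64()) {
			logged++
		}
	}
	// The count varies from run to run but should be near 10% of requests.
	fmt.Printf("logged %d of 10000 requests\n", logged)
}
```

Passing the random draw in as an argument keeps the decision deterministic under test; a fraction of 0.0 logs nothing and 1.0 logs every request.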

Tested the setup on a local desktop with

model name : Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
with 12 processor cores
64GB RAM

Prometheus version 2.8.0

includes yml file for setting alerts and rates

Benchmark results

Tested with "ab" generating 20M packets over 2000 connections

k8s-ci-robot · 2019-04-02T01:11:04Z

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: helpdesk@rt.linuxfoundation.org

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot · 2019-04-02T01:11:09Z

Hi @vbannai. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vbannai · 2019-04-02T01:23:36Z

CLA signed. Please check.

vbannai · 2019-04-02T02:51:51Z

@BenTheElder do you mind looking at the bot that is blocking the pull request? I have already signed the CLA. Thanks.

BenTheElder · 2019-04-02T04:05:59Z

Sorry @vbannai, the cla/linuxfoundation status is failing. That check comes from the Linux Foundation / CNCF, not SIG-Testing, and it doesn't seem to think your account has signed the CLA.

I think you'll have to contact the help desk https://github.com/kubernetes/community/blob/master/CLA.md#troubleshooting kubernetes/kubernetes#27796 (comment)

rramkumar1 · 2019-04-02T16:31:35Z

/ok-to-test

vbannai · 2019-04-02T16:37:02Z

I signed it

BenTheElder · 2019-04-02T16:57:49Z

@rramkumar1 removing the label manually will not work.
@vbannai please check with the Linux Foundation help desk. Their bot still does not seem to think your account has signed the CLA.

vbannai · 2019-04-02T18:34:47Z

@BenTheElder : I think I was missing being a member of the Google corp. I have taken care of it now. Hopefully this should work now.

vbannai · 2019-04-02T21:09:34Z

I think I am now authorized to contribute to CNCF.
Can we restart the check for CLA?

BenTheElder · 2019-04-02T21:27:21Z

/check-cla

BenTheElder · 2019-04-02T21:28:18Z

CLA is green 🎉

there is a test failure remaining:

I0402 16:35:27.471] --- FAIL: TestBackendService (0.00s)
I0402 16:35:27.471] composite_test.go:41: BackendService should contain 36 fields. Got 29
I0402 16:35:27.472] FAIL
I0402 16:35:27.472] FAIL	k8s.io/ingress-gce/pkg/composite	0.150s

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/ingress-gce/709/pull-ingress-gce-test/1113116895336206340/

rramkumar1

First round of comments.

Makefile

rramkumar1 · 2019-04-03T00:18:49Z

cmd/404-server-with-metrics/README

@@ -0,0 +1,374 @@
+


Nit: Empty line

cmd/404-server-with-metrics/README

rramkumar1 · 2019-04-03T00:24:10Z

cmd/404-server-with-metrics/server-with-metrics.go

+limitations under the License.
+*/
+
+// A webserver that only serves a 404 page. Used as a default backend for ingress gce object for kubernetes cluster.


Nit: "A webserver that only serves a 404 page. Used as a default backend for ingress-gce"

rramkumar1 · 2019-04-03T00:24:16Z

cmd/404-server-with-metrics/server-with-metrics.go

@@ -0,0 +1,210 @@
+/*
+Copyright 2017 The Kubernetes Authors.


rramkumar1 · 2019-04-03T00:59:25Z

cmd/404-server-with-metrics/server-with-metrics.go

+	readHeaderTimeout = flag.Duration("read header timeout", 10*time.Second, "Time in seconds to read the request header before timing out.")
+	writeTimeout      = flag.Duration("write timeout", 10*time.Second, "Time in seconds to write response before timing out.")
+	idleTimeout       = flag.Duration("idle timeout", 10*time.Second, "Time in seconds to wait for the next request when keep-alives are enabled.")
+	maxJobs           = flag.Int("max workers", 100, "Number of parallel/concurrent jobs to run.")


Nit: Should this be maxWorkers? Also, I don't see it used anywhere.

I was originally planning to use maxJobs to restrict the number of simultaneous jobs, but it turns out that won't help much, as a goroutine is spun up for each connection in ListenAndServe().
I will remove it.

rramkumar1 · 2019-04-03T01:00:47Z

cmd/404-server-with-metrics/server-with-metrics.go

+	// command line flags/arguments
+	port              = flag.Int("port", 8080, "Port number to serve default backend 404 page.")
+	serverTimeout     = flag.Duration("timeout", 5*time.Second, "Time in seconds to wait before forcefully terminating the server.")
+	readTimeout       = flag.Duration("read timeout", 10*time.Second, "Time in seconds to read the entire request before timing out.")


I think all the flags here and below should have a "-" instead of the spaces, right?

I missed that. I have changed the flag names to use "_" rather than "-", as that is what is used in Google3.
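
For illustration, flag names without spaces can be passed from a shell without quoting. The sketch below uses a private `flag.FlagSet` so it does not touch the process-wide flags. The `read_timeout` name follows the underscore convention mentioned above but is an assumption here, and `parseTimeouts` is a hypothetical helper, not code from the PR.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// parseTimeouts parses two flags on a private FlagSet and returns their values.
// Hypothetical helper: only "port" is a flag name quoted in the review;
// "read_timeout" is an assumed underscore spelling.
func parseTimeouts(args []string) (port int, readTimeout time.Duration, err error) {
	fs := flag.NewFlagSet("404-server", flag.ContinueOnError)
	p := fs.Int("port", 8080, "Port number to serve default backend 404 page.")
	rt := fs.Duration("read_timeout", 10*time.Second, "Time to read the entire request before timing out.")
	if err := fs.Parse(args); err != nil {
		return 0, 0, err
	}
	return *p, *rt, nil
}

func main() {
	// Equivalent to running: ./server -port=9090 -read_timeout=3s
	port, rt, err := parseTimeouts([]string{"-port=9090", "-read_timeout=3s"})
	fmt.Println(port, rt, err)
}
```

Because the flags are `flag.Duration` values, they accept unit suffixes such as `3s` or `500ms`, so the "Time in seconds" help text is looser than what the parser allows.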

rramkumar1 · 2019-04-03T01:18:22Z

cmd/404-server-with-metrics/server-with-metrics.go

+				fmt.Fprintf(os.Stderr, "server shutting down or received shutdown: %v\n", err)
+				os.Exit(0)
+			case http.ErrHandlerTimeout:
+				fmt.Fprintf(os.Stderr, "handler timedout: %v\n", err)


Nit: timed out

rramkumar1 · 2019-04-03T01:21:25Z

cmd/404-server-with-metrics/server-with-metrics.go

+		path := r.URL.Path
+		w.WriteHeader(http.StatusNotFound)
+		// We log 1 out of 4 requests to the logs (make it configurable by a flag??)
+		fmt.Fprintf(w, "reached NotFound backend, service rules not setup correctly for %s \n", path)


Since this will be visible in customer clusters, I'm not sure we should log that "service rules not setup correctly". It's possible (but probably highly unlikely) that they are using this backend in a meaningful way.

Re-worded the response to be more meaningful.

rramkumar1 · 2019-04-03T01:21:47Z

cmd/404-server-with-metrics/server-with-metrics.go

+		path := r.URL.Path
+		w.WriteHeader(http.StatusNotFound)
+		// We log 1 out of 4 requests to the logs (make it configurable by a flag??)
+		fmt.Fprintf(w, "reached NotFound backend, service rules not setup correctly for %s \n", path)


Nit: reached 404 backend
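
The handler under discussion in the two threads above can be exercised in memory with `net/http/httptest`. This is a sketch under assumptions: the response text is a placeholder rather than the wording that was eventually merged, and `notFound` and `probe` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// notFound replies with a 404 status to every request.
// The message text is a placeholder, not the merged wording.
func notFound(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusNotFound)
	fmt.Fprintf(w, "response 404 (backend NotFound) for path %s\n", r.URL.Path)
}

// probe sends one in-memory GET to the handler and returns status and body.
func probe(path string) (int, string) {
	rec := httptest.NewRecorder()
	notFound(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe("/missing")
	fmt.Printf("%d %s", code, body)
}
```

Calling `WriteHeader` before writing the body matters: once any body bytes are written, the status defaults to 200 and can no longer be changed.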

rramkumar1

Some more comments but in general, LGTM.

Adding @bowei for final review.

rramkumar1 · 2019-04-04T19:03:57Z

cmd/404-server-with-metrics/server-with-metrics.go

+	idleTimeout       = flag.Duration("idle_timeout", 10*time.Second, "Time in seconds to wait for the next request when keep-alives are enabled.")
+	idleLogTimer      = flag.Duration("idle_log_timeout", 1*time.Hour, "Timer for keep alive logger.")
+	logSampleRequests = flag.Float64("log_percent_requests", 0.1, "Fraction of http requests to log [0.0 to 1.0].")
+	isProd            = flag.Bool("is_prod", true, "Indicates if the server is running in production.")


I don't know how useful this flag is going to be in relation to the shutdown handler. Is there a compelling use case for it?

This is useful for a graceful shutdown if you are trying to test it in a non-prod environment. Also, I need an extra handler in addition to the default handler for "/" to make sure that the counters were not incrementing.
I set the default value to be "true" so the shutdown handler will not be active unless you go out of your way to pass the flag to disable it.
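
One way to read this gating is sketched below. It assumes a hypothetical `/shutdown` path that is registered only when `isProd` is false and that triggers `http.Server.Shutdown` with a 5 second deadline (the default of the `timeout` flag quoted earlier). The names `newMux` and `shutdownStatus` are illustrative, not from the PR.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newMux registers the catch-all 404 handler and, only when isProd is false,
// a hypothetical /shutdown endpoint that signals on the done channel.
func newMux(isProd bool, done chan<- struct{}) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "404 backend", http.StatusNotFound)
	})
	if !isProd {
		mux.HandleFunc("/shutdown", func(w http.ResponseWriter, r *http.Request) {
			select {
			case done <- struct{}{}:
			default: // a shutdown is already pending
			}
			fmt.Fprintln(w, "shutting down")
		})
	}
	return mux
}

// shutdownStatus reports the status an in-memory GET /shutdown receives.
func shutdownStatus(isProd bool) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/shutdown", nil)
	newMux(isProd, make(chan struct{}, 1)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	done := make(chan struct{}, 1)
	ln, err := net.Listen("tcp", "127.0.0.1:0") // ephemeral port for the demo
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	srv := &http.Server{Handler: newMux(false, done)}
	go srv.Serve(ln) // returns http.ErrServerClosed once Shutdown is called

	if resp, err := http.Get("http://" + ln.Addr().String() + "/shutdown"); err == nil {
		resp.Body.Close()
	}
	select {
	case <-done:
	case <-time.After(time.Second):
	}
	// Give in-flight requests up to 5 seconds to finish before giving up.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println("shutdown error:", srv.Shutdown(ctx))
}
```

With `isProd` set to true, the `/shutdown` path is never registered, so a request to it falls through to the catch-all handler and gets a 404 like any other path.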

rramkumar1 · 2019-04-04T19:04:53Z

cmd/404-server-with-metrics/server-with-metrics.go

 	hostName, err := os.Hostname()
 	if err != nil {
-		fmt.Fprintf(os.Stderr, "could not get the hostname: %v\n", err)
+		klog.Errorf("could not get the hostname: %v\n", err)


I think this should be a Fatalf since we do not want to proceed further.

Okay, the behavior of the klog module is a little inconsistent with the standard golang logger, as log.Fatal() also exits by calling os.Exit(1),
whereas klog.Fatal() does not do that.
I will add os.Exit(1) after klog.Fatalf.

rramkumar1 · 2019-04-04T19:05:22Z

cmd/404-server-with-metrics/server-with-metrics.go

 				os.Exit(0)
 			case http.ErrHandlerTimeout:
-				fmt.Fprintf(os.Stderr, "handler timedout: %v\n", err)
+				klog.Warningf("handler timed out: %v\n", err)
 			default:
 				// Should we Fatal() ?


Remove since it's now a Fatalf

Yeah, but I need to add os.Exit(1), as klog does not exit, unlike the standard golang log package.
https://golang.org/src/log/log.go?s=9437:9465#L302
vs
https://sourcegraph.com/github.com/kubernetes/ingress-gce/-/blob/vendor/k8s.io/klog/klog.go#L1208

rramkumar1 · 2019-04-04T19:08:58Z

cmd/404-server-with-metrics/README.md

+Performance testing consisted of using Apache "ab" and "curl" to send lots of requests and monitor the metrics on prometheus UI
+
+### Testing iterations
+* **Testing with curl command**


Can you make each testing iteration a sub-heading of "Performance tests" rather than a bullet? So for example, it would look like:

Performance tests

blah

Testing with curl command

blah

Testing with ab

and so on...

Sure, sounds good.
I made the changes as suggested.

bowei · 2019-04-22T17:34:38Z

can you squash the commits and separate it into two commits?

one commit with the actual code changes
another that just has the vendor changes

easy way to do this:

git remote update
git rebase -i upstream/master

edit to squash everything together (change pick to squash except for the first line)

then

git reset HEAD^ # mixed reset (the default): keeps the changes in the working tree, unstaged
git add vendor/
git commit -m "update vendor"
git add
git commit

add the text

Added a new 404 server with metrics that supports the following:

Rebuild it with newer Go
Supports graceful shutdown
Add metrics serving
How many requests it is serving
Serving latency
Add logging
Respond with a 404 status code and relevant message to every request
Configurable sampling requests to a max # of logs/sec [0.0 to 1.0]
Periodically if no traffic, just to say “I am alive”
Tested the setup on a local desktop with

model name : Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
with 12 processor core
64GB RAM
Prometheus version 2.8.0

includes yml file for setting alerts and rates
Benchmark results

Tested with "ab" generating 20M packets over 2000 connections

to the commit

vbannai · 2019-04-25T03:40:19Z

Squashed the commits as per the request into two commits:

- Vendor related changes
- Default 404-server-with-metrics

PTAL. Thanks

rramkumar1 · 2019-04-25T12:57:58Z

LGTM, will leave to @bowei for final approval.

bowei · 2019-04-25T16:26:20Z

/lgtm
/approve

k8s-ci-robot · 2019-04-25T16:26:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, vbannai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bowei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 2, 2019

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 2, 2019

k8s-ci-robot requested review from stewart-yu and sttts April 2, 2019 01:11

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2019

rramkumar1 added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 2, 2019

rramkumar1 self-assigned this Apr 2, 2019

vbannai force-pushed the 404-server branch from ed1c52c to 180b6eb Compare April 2, 2019 22:08

rramkumar1 reviewed Apr 3, 2019

vbannai force-pushed the 404-server branch 4 times, most recently from 693f9be to 2dcee04 Compare April 4, 2019 16:24

rramkumar1 reviewed Apr 4, 2019

rramkumar1 assigned bowei Apr 4, 2019

vbannai force-pushed the 404-server branch from 2dcee04 to 88bcbc8 Compare April 4, 2019 20:28

kubernetes deleted a comment Apr 4, 2019

Updated the vendor directory to support prometheus 0.9

260000d

vbannai force-pushed the 404-server branch from 88bcbc8 to 68ca615 Compare April 25, 2019 03:28

vbannai force-pushed the 404-server branch from 68ca615 to dc46548 Compare April 25, 2019 03:34

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2019

k8s-ci-robot merged commit 6ec568e into kubernetes:master Apr 25, 2019
