Default 404 server with metrics #709

Merged: 2 commits, Apr 25, 2019

Contributor

@vbannai vbannai commented Apr 2, 2019

Added a new 404 server with metrics that supports the following:

  • Rebuild it with newer Go
  • Supports graceful shutdown
  • Add metrics serving
    • How many requests it is serving
    • Serving latency
  • Add logging
    • Respond with a 404 status code and relevant message to every request
    • Configurable sampling requests to a max # of logs/sec [0.0 to 1.0]
    • Periodically log “I am alive” when there is no traffic
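
For readers skimming the thread, the metrics and sampled-logging bullets above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `stats` type, `shouldLog`, and `run` are hypothetical names, plain atomic counters stand in for the Prometheus metrics, and requests are driven in memory with `httptest` rather than over the network.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// stats holds hand-rolled counters (hypothetical); a real deployment would
// more likely export these through a Prometheus client library.
type stats struct {
	requests     int64 // total requests served
	latencyNanos int64 // cumulative handler time in nanoseconds
	logged       int64 // requests selected for logging
}

// shouldLog reports whether a request is sampled, given a fraction in
// [0.0, 1.0] and a random draw in [0.0, 1.0).
func shouldLog(fraction, draw float64) bool {
	return draw < fraction
}

// handler answers every request with a 404, logs a sampled subset, and
// updates the counters.
func (s *stats) handler(fraction float64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		w.WriteHeader(http.StatusNotFound)
		fmt.Fprintf(w, "default backend - 404: %s\n", r.URL.Path)
		if shouldLog(fraction, rand.Float64()) {
			atomic.AddInt64(&s.logged, 1)
			fmt.Printf("sampled log: 404 for %s\n", r.URL.Path)
		}
		atomic.AddInt64(&s.requests, 1)
		atomic.AddInt64(&s.latencyNanos, int64(time.Since(start)))
	}
}

// run sends n in-memory requests through the handler and returns the stats.
func run(n int, fraction float64) *stats {
	s := &stats{}
	h := s.handler(fraction)
	for i := 0; i < n; i++ {
		h(httptest.NewRecorder(), httptest.NewRequest("GET", "/missing", nil))
	}
	return s
}

func main() {
	s := run(100, 0.1)
	avg := time.Duration(s.latencyNanos / s.requests)
	fmt.Printf("requests=%d logged=%d avg_latency=%v\n", s.requests, s.logged, avg)
}
```

A fraction of 0.0 logs nothing and 1.0 logs every request, which matches the [0.0 to 1.0] range in the description.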

Tested the setup on a local desktop with

  • model name : Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
  • 12 processor cores
  • 64GB RAM

Prometheus version 2.8.0

  • includes yml file for setting alerts and rates

Benchmark results

  • Tested with "ab" generating 20M packets over 2000 connections

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 2, 2019
@k8s-ci-robot
Contributor

Hi @vbannai. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 2, 2019
@vbannai
Contributor Author

vbannai commented Apr 2, 2019

CLA signed. Please check.

@vbannai
Contributor Author

vbannai commented Apr 2, 2019

@BenTheElder do you mind looking at the bot that is blocking the pull request? I have already signed the CLA. Thanks.

@BenTheElder
Member

Sorry @vbannai, the cla/linuxfoundation status is failing. That check comes from the Linux Foundation / CNCF, not SIG-Testing, and it doesn't seem to think your account has signed the CLA.

I think you'll have to contact the help desk: https://github.com/kubernetes/community/blob/master/CLA.md#troubleshooting (see also kubernetes/kubernetes#27796 (comment))

@rramkumar1
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2019
@vbannai
Contributor Author

vbannai commented Apr 2, 2019

I signed it

@rramkumar1 rramkumar1 added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 2, 2019
@BenTheElder
Member

@rramkumar1 removing the label manually will not work.
@vbannai please check with the Linux Foundation help desk. Their bot still does not seem to think your account has signed the CLA.

@vbannai
Contributor Author

vbannai commented Apr 2, 2019

@BenTheElder : I think I was missing being a member of the Google corp. I have taken care of it now. Hopefully this should work now.

@vbannai
Contributor Author

vbannai commented Apr 2, 2019

I think I am now authorized to contribute to CNCF.
Can we restart the check for CLA?

@rramkumar1 rramkumar1 self-assigned this Apr 2, 2019
@BenTheElder
Member

/check-cla

@BenTheElder
Member

CLA is green 🎉

there is a test failure remaining:

I0402 16:35:27.471] --- FAIL: TestBackendService (0.00s)
I0402 16:35:27.471] composite_test.go:41: BackendService should contain 36 fields. Got 29
I0402 16:35:27.472] FAIL
I0402 16:35:27.472] FAIL	k8s.io/ingress-gce/pkg/composite	0.150s

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/ingress-gce/709/pull-ingress-gce-test/1113116895336206340/

Contributor

@rramkumar1 rramkumar1 left a comment

First round of comments.

Makefile
@@ -0,0 +1,374 @@

Contributor

Nit: Empty line

cmd/404-server-with-metrics/README (outdated)
limitations under the License.
*/

// A webserver that only serves a 404 page. Used as a default backend for ingress gce object for kubernetes cluster.
Contributor

Nit: "A webserver that only serves a 404 page. Used as a default backend for ingress-gce"

Contributor Author

done

@@ -0,0 +1,210 @@
/*
Copyright 2017 The Kubernetes Authors.
Contributor

Nit: 2019

Contributor Author

done

readHeaderTimeout = flag.Duration("read header timeout", 10*time.Second, "Time in seconds to read the request header before timing out.")
writeTimeout = flag.Duration("write timeout", 10*time.Second, "Time in seconds to write response before timing out.")
idleTimeout = flag.Duration("idle timeout", 10*time.Second, "Time in seconds to wait for the next request when keep-alives are enabled.")
maxJobs = flag.Int("max workers", 100, "Number of parallel/concurrent jobs to run.")
Contributor

Nit: Should this be maxWorkers and also don't see it used anywhere?

Contributor Author

I was originally planning to use maxJobs to restrict the number of simultaneous connections, but it turns out that won't help much, as a goroutine is spun up for each connection in ListenAndServe().
I will remove it.
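
For context, a handler-level cap of the kind `maxJobs` hinted at could look roughly like the sketch below. It assumes a buffered channel used as a semaphore; `limit` and `overflowStatus` are hypothetical names and none of this is in the PR. It bounds concurrent handler work only: net/http still starts one goroutine per accepted connection, which is the limitation described above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// limit wraps next so that at most max handlers run at once; extra
// requests get a 503 immediately instead of waiting for a slot.
func limit(max int, next http.Handler) http.Handler {
	sem := make(chan struct{}, max)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "too busy", http.StatusServiceUnavailable)
		}
	})
}

// overflowStatus fills all max slots with blocked handlers, then returns
// the status code seen by one extra request.
func overflowStatus(max int) int {
	entered := make(chan struct{}, max)
	release := make(chan struct{})
	h := limit(max, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		entered <- struct{}{}
		<-release // hold the slot until the demo releases it
		w.WriteHeader(http.StatusNotFound)
	}))

	var wg sync.WaitGroup
	for i := 0; i < max; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
		}()
	}
	for i := 0; i < max; i++ {
		<-entered // wait until every slot is occupied
	}

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	close(release)
	wg.Wait()
	return rec.Code
}

func main() {
	fmt.Println("status for the extra request:", overflowStatus(3))
}
```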

// command line flags/arguments
port = flag.Int("port", 8080, "Port number to serve default backend 404 page.")
serverTimeout = flag.Duration("timeout", 5*time.Second, "Time in seconds to wait before forcefully terminating the server.")
readTimeout = flag.Duration("read timeout", 10*time.Second, "Time in seconds to read the entire request before timing out.")
Contributor

I think all the flags here and below should have a "-" instead of the spaces right?

Contributor Author

I missed that. I have changed the flag names to use "_" instead of "-", as that is what is used in Google3.
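
As a rough illustration of the renamed flags, the sketch below parses a few of them with a `flag.FlagSet`. The flag names and defaults are copied from hunks quoted in this review; the `config` struct, the `parse` helper, and the FlagSet name are hypothetical and only there to keep the example self-contained.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// config collects the parsed values (hypothetical struct for this sketch).
type config struct {
	port        int
	idleTimeout time.Duration
	logFraction float64
	isProd      bool
}

// parse registers underscore-separated flag names and parses args.
// It returns an error instead of exiting so callers can decide what to do.
func parse(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("404-server", flag.ContinueOnError)
	fs.IntVar(&c.port, "port", 8080, "Port number to serve default backend 404 page.")
	fs.DurationVar(&c.idleTimeout, "idle_timeout", 10*time.Second, "Time to wait for the next request when keep-alives are enabled.")
	fs.Float64Var(&c.logFraction, "log_percent_requests", 0.1, "Fraction of http requests to log [0.0 to 1.0].")
	fs.BoolVar(&c.isProd, "is_prod", true, "Indicates if the server is running in production.")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := parse([]string{"--idle_timeout=30s", "--is_prod=false"})
	fmt.Printf("%+v err=%v\n", c, err)
}
```

Because the names are registered with underscores, passing `--idle-timeout=30s` is reported as an undefined flag.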

fmt.Fprintf(os.Stderr, "server shutting down or received shutdown: %v\n", err)
os.Exit(0)
case http.ErrHandlerTimeout:
fmt.Fprintf(os.Stderr, "handler timedout: %v\n", err)
Contributor

Nit: timed out

Contributor Author

done

path := r.URL.Path
w.WriteHeader(http.StatusNotFound)
// We log 1 out of 4 requests to the logs (make it configurable by a flag??)
fmt.Fprintf(w, "reached NotFound backend, service rules not setup correctly for %s \n", path)
Contributor

Since this will be visible in customer clusters, I'm not sure we should log that "service rules not setup correctly". It's possible (but probably highly unlikely) that they are using this backend in a meaningful way.

Contributor Author

Re-worded the response to be more meaningful.

path := r.URL.Path
w.WriteHeader(http.StatusNotFound)
// We log 1 out of 4 requests to the logs (make it configurable by a flag??)
fmt.Fprintf(w, "reached NotFound backend, service rules not setup correctly for %s \n", path)
Contributor

Nit: reached 404 backend

Contributor Author

Done

@vbannai vbannai force-pushed the 404-server branch 4 times, most recently from 693f9be to 2dcee04 Compare April 4, 2019 16:24
Contributor

@rramkumar1 rramkumar1 left a comment

Some more comments but in general, LGTM.

Adding @bowei for final review.

idleTimeout = flag.Duration("idle_timeout", 10*time.Second, "Time in seconds to wait for the next request when keep-alives are enabled.")
idleLogTimer = flag.Duration("idle_log_timeout", 1*time.Hour, "Timer for keep alive logger.")
logSampleRequests = flag.Float64("log_percent_requests", 0.1, "Fraction of http requests to log [0.0 to 1.0].")
isProd = flag.Bool("is_prod", true, "Indicates if the server is running in production.")
Contributor

I don't know how useful this flag is going to be in relation to the shutdown handler. Is there a compelling use case for it?

Contributor Author

This is useful for a graceful shutdown if you are trying to test it in non-prod environment. Also, I need an extra handler in addition to the default handler for "/" to make sure that the counters were not incrementing.
I set the default value to be "true" so the shutdown handler will not be active unless you go out of your way to pass the flag to disable it.
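
A minimal sketch of how such a gate could work, under stated assumptions: the `/shutdown` path, `newMux`, and `demo` are hypothetical names, and the real server in this PR may be wired differently. The shutdown handler is only registered when `isProd` is false, and it signals the main goroutine rather than calling `Shutdown` itself, since `Shutdown` waits for in-flight handlers.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// newMux always serves the 404 default route; the (hypothetical) /shutdown
// route is registered only outside production.
func newMux(isProd bool, stop chan<- struct{}) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "default backend - 404", http.StatusNotFound)
	})
	if !isProd {
		mux.HandleFunc("/shutdown", func(w http.ResponseWriter, r *http.Request) {
			select {
			case stop <- struct{}{}:
			default: // a shutdown request is already pending
			}
			fmt.Fprintln(w, "shutting down")
		})
	}
	return mux
}

// demo serves on an ephemeral local port, requests /shutdown once, and then
// shuts down gracefully. It returns the status code of that request and
// whether the shutdown handler was reached.
func demo(isProd bool, timeout time.Duration) (int, bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, false, err
	}
	stop := make(chan struct{}, 1)
	srv := &http.Server{Handler: newMux(isProd, stop)}
	served := make(chan error, 1)
	go func() { served <- srv.Serve(ln) }()

	resp, err := http.Get("http://" + ln.Addr().String() + "/shutdown")
	if err != nil {
		srv.Close()
		return 0, false, err
	}
	resp.Body.Close()

	requested := false
	select {
	case <-stop:
		requested = true
	default:
	}

	// Graceful shutdown: stop accepting, drain connections, bounded by timeout.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return resp.StatusCode, requested, err
	}
	// Serve returns http.ErrServerClosed after a clean Shutdown.
	if err := <-served; err != http.ErrServerClosed {
		return resp.StatusCode, requested, err
	}
	return resp.StatusCode, requested, nil
}

func main() {
	fmt.Println(demo(false, 5*time.Second))
	fmt.Println(demo(true, 5*time.Second))
}
```

With `isProd` set to true the `/shutdown` request falls through to the default route and gets the same 404 as any other path. A long-running server would normally trigger the same `Shutdown` call from a termination signal rather than from a demo request.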

hostName, err := os.Hostname()
if err != nil {
fmt.Fprintf(os.Stderr, "could not get the hostname: %v\n", err)
klog.Errorf("could not get the hostname: %v\n", err)
Contributor

I think this should be a Fatalf since we do not want to proceed further.

Contributor Author

Okay, the behavior of the klog module is a little inconsistent with the standard Go logger, as log.Fatal() also exits by calling os.Exit(1).
Whereas klog.Fatal() does not do that.
I will add os.Exit(1) after klog.Fatalf.

os.Exit(0)
case http.ErrHandlerTimeout:
fmt.Fprintf(os.Stderr, "handler timedout: %v\n", err)
klog.Warningf("handler timed out: %v\n", err)
default:
// Should we Fatal() ?
Contributor

Remove since it's now a Fatalf

Contributor Author

yeah, but I need to add os.Exit(1), as klog does not exit, unlike the standard Go log package.
https://golang.org/src/log/log.go?s=9437:9465#L302
vs
https://sourcegraph.com/github.com/kubernetes/ingress-gce/-/blob/vendor/k8s.io/klog/klog.go#L1208

Performance testing consisted of using Apache "ab" and "curl" to send lots of requests and monitor the metrics on prometheus UI

### Testing iterations
* **Testing with curl command**
Contributor

Can you make each testing iteration a sub-heading of "Performance tests" rather than a bullet? For example, it would look like:

Performance tests

blah

Testing with curl command

blah

Testing with ab

and so on...

Contributor Author

Sure, sounds good.
I made the changes as suggested.

@bowei
Member

bowei commented Apr 22, 2019

can you squash the commits and separate it into two commits?

one commit with the actual code changes
another that just has the vendor changes

easy way to do this:

git remote update
git rebase -i upstream/master

edit to squash everything together (change pick to squash except for the first line)

then

git reset HEAD^ # mixed reset: keeps the changes in the working tree, unstaged
git add vendor/
git commit -m "update vendor"
git add -A
git commit

add the text

Added a new 404 server with metrics that supports the following:

Rebuild it with newer Go
Supports graceful shutdown
Add metrics serving
How many requests it is serving
Serving latency
Add logging
Respond with a 404 status code and relevant message to every request
Configurable sampling requests to a max # of logs/sec [0.0 to 1.0]
Periodically if no traffic, just to say “I am alive”
Tested the setup on a local desktop with

model name : Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
12 processor cores
64GB RAM
Prometheus version 2.8.0

includes yml file for setting alerts and rates
Benchmark results

Tested with "ab" generating 20M packets over 2000 connections

to the commit

@vbannai
Contributor Author

vbannai commented Apr 25, 2019

Squashed the commits as per the request into two commits:

  • Vendor related changes
  • Default 404-server-with-metrics

PTAL. Thanks

@rramkumar1
Contributor

LGTM, will leave to @bowei for final approval.

@bowei
Member

bowei commented Apr 25, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, vbannai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2019
@k8s-ci-robot k8s-ci-robot merged commit 6ec568e into kubernetes:master Apr 25, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
5 participants