EventStore on Azure Kubernetes (ACS)

Prep - set up az

Install the Azure CLI

  1. pip install azure-cli

Login and set subscription

  1. az login

  2. az account set -s <subid>

Set up K8S

create service principal

  1. Create service principal

    C:\Users\raghuramanr>az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/<masked>"
    Retrying role assignment creation: 1/36
    {
      "appId": "30521959-8e29-4855-9176-ede965cc8432",
      "displayName": "azure-cli-2017-11-14-04-56-25",
      "name": "http://azure-cli-2017-11-14-04-56-25",
      "password": "<masked>",
      "tenant": "7a33d93c-28b7-4b2f-af94-9c3c883b8c95"
    }

Create cluster

  1. Create cluster on Azure portal

  2. download the template for later - saved as d:\downloads\azure-k8s-cluster.zip

  3. Start cluster deployment

  4. Get coffee.

Download cluster creds

  1. Download cluster creds

    C:\Users\raghuramanr>az acs kubernetes get-credentials --resource-group=kube-cluster --name=kube-cluster
    Merged "k8smgmt" as current context in C:\Users\raghuramanr\.kube\config
  2. Verify cluster is working

    C:\Users\raghuramanr>kubectl get nodes
    NAME                       STATUS    ROLES     AGE       VERSION
    k8s-agentpool-31036649-0   Ready     agent     6m        v1.7.9
    k8s-agentpool-31036649-1   Ready     agent     6m        v1.7.9
    k8s-agentpool-31036649-2   Ready     agent     6m        v1.7.9
    k8s-master-31036649-0      Ready     master    6m        v1.7.9

Connect to K8S Web UI

  1. kubectl proxy

  2. Browse to http://localhost:8001/ui

Add Persistent Storage (Azure Files)

  • Reference link: https://docs.microsoft.com/en-us/azure/aks/azure-files

  • GitHub repo/branch - https://github.com/raghur/eventstore-kubernetes, branch aks-persistentdisk

    1. Create storage account

      C:\Users\raghuramanr>az storage account create -n persistentstorage11342 -g kube-cluster --sku Standard_LRS
    2. List keys

      C:\Users\raghuramanr>az storage account keys list -n persistentstorage11342 -g kube-cluster --output table
      KeyName    Permissions    Value
      ---------  -------------  ----------------------------------------------------------------------------------------
      key1       Full           <masked>
      key2       Full           <masked>
    3. base64 encode keys and account name

      Don’t do this on cmd.exe - do it on a real Linux box or WSL.
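
      For example, in bash (a sketch - echo -n keeps a trailing newline out of the encoded value; substitute your own key):

      echo -n 'persistentstorage11342' | base64
      echo -n '<key1 value from the previous step>' | base64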

    4. Create the fileshares - I just used the web portal - esdisk-1.. esdisk-N

    5. Create an azure secret as in ref link above - template is in fileshare/azure-secret.yml

    6. Add the azure secret to your k8s cluster - kubectl apply -f fileshare/azure-secret.yml
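
      The secret should look roughly like this, following the Azure Files doc linked above (the secret name and the exact contents of fileshare/azure-secret.yml may differ; the two data values are the base64 strings from step 3):

      apiVersion: v1
      kind: Secret
      metadata:
        name: azure-secret
      type: Opaque
      data:
        azurestorageaccountname: <base64-encoded account name>
        azurestorageaccountkey: <base64-encoded account key>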

    7. Run cd scripts && ./generate-deployment.sh 3 for a 3-node cluster.

    8. Create ES Pods: Run kubectl create -f on each generated .tmp/es_deployment_X.yaml file to create the ES nodes (a sketch of the volume setup follows this list)

    9. Create a service fronting the PODS: Run kubectl create -f services/eventstore.yaml

    10. Create a configmap for nginx: Run kubectl.exe create configmap nginx-es-pd-frontend-conf --from-file nginx/frontend.conf

    11. Create the Nginx front end proxy Pod: Run kubectl create -f deployments/frontend-es.yaml

    12. Create the Nginx service: Run kubectl create -f services/frontend-es.yaml
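
For reference, each generated ES deployment mounts one of the file shares through the azureFile volume type, which references the secret created in step 6. A minimal sketch of the relevant part of a pod spec (the container name, image, and mount path here are assumptions, not taken from the repo's manifests):

    spec:
      containers:
      - name: eventstore
        image: eventstore/eventstore
        volumeMounts:
        - name: esdata
          mountPath: /var/lib/eventstore
      volumes:
      - name: esdata
        azureFile:
          secretName: azure-secret
          shareName: esdisk-1
          readOnly: false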

Perf Bench

Prepping k8s on ACS for monitoring with Heapster, Grafana and InfluxDB

To collect pod utilization under load, we need Heapster, Grafana and InfluxDB working together (the Heapster repo has the setup instructions). They require some tweaks on ACS, though, because the default ACS deployment includes Heapster but not Grafana and InfluxDB - so the Heapster pod isn't given a sink (and is therefore ineffective). To fix:

  1. Clone the heapster repo - https://github.com/kubernetes/heapster

  2. Deploy the InfluxDB stack and the RBAC config as in the guide:

    $ kubectl create -f deploy/kube-config/influxdb/
    $ kubectl create -f deploy/kube-config/rbac/heapster-rbac.yaml
  3. Now fix up heapster

    1. Open heapster on kubernetes dashboard (http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy/#!/deployment/kube-system/heapster?namespace=kube-system)

    2. Click 'Edit'

    3. Find the container named heapster and add --sink=influxdb:http://monitoring-influxdb.kube-system.svc:8086 to its command

      Figure 1. Add the sink parameter to heapster
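
      If you'd rather make the same edit from the CLI, a sketch (keep the existing flags on the container and just append the sink):

      kubectl -n kube-system edit deployment heapster
      # in the heapster container's command list, append:
      #   - --sink=influxdb:http://monitoring-influxdb.kube-system.svc:8086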
  4. Now we need to make Grafana accessible from outside the cluster - change the monitoring-grafana service type to LoadBalancer (see the sketch after this list).

  5. Once k8s updates the service, you should see an external IP - and browsing to http://<externalip> should bring you to the Grafana dashboard.
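
One way to do step 4 from the CLI - a sketch, assuming the monitoring-grafana service name from the Heapster manifests:

    kubectl -n kube-system patch service monitoring-grafana -p '{"spec": {"type": "LoadBalancer"}}'
    # then wait for an EXTERNAL-IP to appear
    kubectl -n kube-system get service monitoring-grafana --watch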

Test Scenario

  • Each user creates a stream, adds 10 events, then reads the stream completely followed by reading each event individually.

  • A test run is a fixed number of concurrent users (10, 100, 500 or 1000, depending on the test) repeating this scenario for 5 mins from a single client node (my machine) - see the invocation sketch below.
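
The tests are driven with k6; a hypothetical invocation for the 10-user run (the script name is an assumption - the actual test script isn't shown in this README):

    k6 run --vus 10 --duration 5m es-bench.js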

Happy path - no node failures - 10 concurrent users

As expected, the pod version (local storage) serves about 25% more requests (264 req/s vs 211 req/s), though its CPU utilization is a little higher - presumably because IO happens locally.

Test Results - client summary

Table 1. A 5 minute test with 10 concurrent users
# PodVersion (local pod storage)
    ✓ is status 201
    ✓ is status 200

    checks................: 100.00%
    data_received.........: 13 MB (45 kB/s)
    data_sent.............: 1.9 MB (6.5 kB/s)
    http_req_blocked......: avg=169.85µs max=123.74ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=163.3µs max=123.74ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=37.31ms max=384.74ms med=25.25ms min=11.02ms p(90)=64.17ms p(95)=70.53ms
    http_req_receiving....: avg=135.21µs max=112.45ms med=0s min=0s p(90)=966.6µs p(95)=1ms
    http_req_sending......: avg=44.73µs max=18.04ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=37.13ms max=383.74ms med=25.09ms min=11.02ms p(90)=64.17ms p(95)=70.31ms
    http_reqs.............: 79237 (264.12333333333333/s)
    vus...................: 10
    vus_max...............: 10

# Persistent Disk (Azure file share)
    ✓ is status 201
    ✗ is status 200
          0.02% (6/33058)

    checks................: 99.99%
    data_received.........: 11 MB (36 kB/s)
    data_sent.............: 1.5 MB (5.1 kB/s)
    http_req_blocked......: avg=192.75µs max=1.01s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=188.42µs max=1.01s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=47.04ms max=4.57s med=30.07ms min=11.01ms p(90)=83.87ms p(95)=99.23ms
    http_req_receiving....: avg=120.67µs max=72.19ms med=0s min=0s p(90)=489µs p(95)=1ms
    http_req_sending......: avg=32.91µs max=2ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=46.88ms max=4.57s med=29.11ms min=10.99ms p(90)=83.39ms p(95)=98.46ms
    http_reqs.............: 63163 (210.54333333333332/s)
    vus...................: 10
    vus_max...............: 10

Test Results - CPU utilization

Table 2. A 5 minute test with 10 concurrent users
PodVersion (local pod storage): [CPU utilization graph]
Persistent Disk (Azure file share): [CPU utilization graph]

Happy path - no node failures - 100 concurrent users

Test Results - client summary

Table 3. A 5 minute test with 100 concurrent users
# PodVersion (local pod storage)
    ✓ is status 201
    ✓ is status 200

    checks................: 100.00%
    data_received.........: 43 MB (144 kB/s)
    data_sent.............: 6.3 MB (21 kB/s)
    http_req_blocked......: avg=885.64µs max=3.06s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=878.71µs max=3.06s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=117.77ms max=3.56s med=105.27ms min=13.03ms p(90)=158.42ms p(95)=184.48ms
    http_req_receiving....: avg=910µs max=1.82s med=0s min=0s p(90)=0s p(95)=1ms
    http_req_sending......: avg=23.87µs max=7.04ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=116.84ms max=3.56s med=104.29ms min=13.03ms p(90)=157.42ms p(95)=182.48ms
    http_reqs.............: 252400 (841.3333333333334/s)
    vus...................: 100
    vus_max...............: 100

# Persistent Disk (Azure file share)
    ✓ is status 201
    ✓ is status 200

    checks................: 100.00%
    data_received.........: 33 MB (109 kB/s)
    data_sent.............: 4.6 MB (15 kB/s)
    http_req_blocked......: avg=1.05ms max=9.1s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=1.04ms max=9.1s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=149.99ms max=7.14s med=121.29ms min=12.03ms p(90)=219.57ms p(95)=281.74ms
    http_req_receiving....: avg=1.21ms max=3.67s med=0s min=0s p(90)=0s p(95)=1ms
    http_req_sending......: avg=22.99µs max=10.02ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=148.75ms max=6.08s med=120.31ms min=12.03ms p(90)=218.58ms p(95)=279.74ms
    http_reqs.............: 198334 (661.1133333333333/s)
    vus...................: 100
    vus_max...............: 100

Test Results - CPU utilization

Table 4. A 5 minute test with 100 concurrent users
PodVersion (local pod storage): [CPU utilization graph]
Persistent Disk (Azure file share): [CPU utilization graph]

Happy path - no node failures - 1000 concurrent users

Now we start seeing a bunch of errors - however, these were client timeouts, so I'm not sure whether anything actually broke at the server end. The pattern continues, though - the POD version serves more reqs/s at a slightly higher CPU utilization.

I should probably drive traffic from a couple of client nodes to rule out the client as the bottleneck - but that means reading more k6.io documentation, which I'd rather not do at the moment.

# POD version
    ✗ is status 201
          0.44% (229/52608)
    ✗ is status 200
          0.65% (332/51389)

    checks................: 99.46%
    data_received.........: 32 MB (267 kB/s)
    data_sent.............: 5.7 MB (47 kB/s)
    http_req_blocked......: avg=50.02ms max=21.13s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=49.89ms max=21.09s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=868.85ms max=1m0s med=188.52ms min=106.3ms p(90)=1.42s p(95)=2.66s
    http_req_receiving....: avg=173.69ms max=59.56s med=0s min=0s p(90)=0s p(95)=69.15ms
    http_req_sending......: avg=34.97µs max=1.2s med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=695.13ms max=59.74s med=188.47ms min=106.3ms p(90)=1.22s p(95)=2.14s
    http_reqs.............: 103996 (866.6333333333333/s)
    vus...................: 1000
    vus_max...............: 1000
# persistentdisk version - more failures
    ✗ is status 200
          1.26% (573/45500)
    ✗ is status 201
          1.03% (482/46886)

    checks................: 98.86%
    data_received.........: 34 MB (282 kB/s)
    data_sent.............: 6.1 MB (51 kB/s)
    http_req_blocked......: avg=65.61ms max=21.03s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=65.33ms max=21.01s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=1.06s max=1m0s med=459.24ms min=95.22ms p(90)=1.73s p(95)=3.02s
    http_req_receiving....: avg=158.52ms max=59.74s med=0s min=0s p(90)=0s p(95)=1.02ms
    http_req_sending......: avg=740.5µs max=19.39s med=0s min=1ms p(90)=0s p(95)=0s
    http_req_waiting......: avg=907.79ms max=59.52s med=445.19ms min=95.22ms p(90)=1.58s p(95)=2.58s
    http_reqs.............: 92386 (769.8833333333333/s)
    vus...................: 1000
    vus_max...............: 1000
[Graph: PodVersion CPU at 1000 concurrent users]
[Graph: Persistent Disk CPU at 1000 concurrent users]

A real-world test with node failures

So for this, I think I'm going to run a 500-user test for 5 mins on each configuration and then randomly kill pods during the test.
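
Pods can be killed at random with something like the following (the app=eventstore label selector is an assumption about how the ES pods are labelled):

    # delete one randomly-chosen EventStore pod
    kubectl delete "$(kubectl get pods -l app=eventstore -o name | shuf -n 1)"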

The POD version will get a new pod, which will have to catch up to the cluster master since it starts off with empty storage.

The Persistent Disk version, OTOH, has its data intact - so the moment a replacement pod comes up, it should just carry on. IMO, in this test we should see the Persistent Disk version do better.

The results

Interesting, to say the least. The persistent disk version did not do a lot better, as I had expected it to (or, put the other way round, the pod version recovered pretty quickly from pod failures). There are slightly more failures on the pod version, but not a whole lot - we're talking about a 0.04% difference - and the pod version still came out ahead on throughput by a small margin (~20 req/s).

Caveat

In this case, the pod failures were probably spaced far enough apart not to matter - i.e., pod1 was deleted, and its replacement pod1' came online and caught up before pod2 was deleted. If two pods went offline in quick succession, then data loss is a real possibility.

  1. Pod es-223* was deleted at 1m

  2. Pod es-223* was deleted at 3m18s

# POD version
    ✗ is status 201
          0.15% (184/126755)
    ✗ is status 200
          0.19% (265/136609)

    checks................: 99.83%
    data_received.........: 51 MB (170 kB/s)
    data_sent.............: 7.9 MB (26 kB/s)
    http_req_blocked......: avg=8.44ms max=21s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=8.43ms max=21s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=539.67ms max=1m0s med=198.5ms min=147.39ms p(90)=952.55ms p(95)=1.55s
    http_req_receiving....: avg=57.89ms max=59.55s med=0s min=0s p(90)=0s p(95)=1.03ms
    http_req_sending......: avg=23.15µs max=3.5ms med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=481.75ms max=59.1s med=196.52ms min=147.39ms p(90)=891.36ms p(95)=1.47s
    http_reqs.............: 263364 (877.88/s)
    vus...................: 500
    vus_max...............: 500
  1. Pod es-1* was deleted at 1m

  2. Pod es-3* was deleted at 3m18s

# persistentdisk version
# deleted pods at 1m mark and 3m18s mark
    ✗ is status 201
          0.12% (148/123644)
    ✗ is status 200
          0.13% (173/133111)

    checks................: 99.87%
    data_received.........: 50 MB (167 kB/s)
    data_sent.............: 7.8 MB (26 kB/s)
    http_req_blocked......: avg=11.62ms max=21.02s med=0s min=0s p(90)=0s p(95)=0s
    http_req_connecting...: avg=11.56ms max=21.01s med=0s min=0s p(90)=0s p(95)=0s
    http_req_duration.....: avg=553.44ms max=1m0s med=272.74ms min=80.21ms p(90)=948.51ms p(95)=1.49s
    http_req_receiving....: avg=58.1ms max=59.76s med=0s min=0s p(90)=0s p(95)=1.02ms
    http_req_sending......: avg=562.74µs max=21.96s med=0s min=0s p(90)=0s p(95)=0s
    http_req_waiting......: avg=494.78ms max=59.37s med=267.69ms min=79.23ms p(90)=902.37ms p(95)=1.37s
    http_reqs.............: 256755 (855.85/s)
    vus...................: 500
    vus_max...............: 500
Figure 2. PodVersion CPU - this pod was not deleted
Figure 3. Persistent Disk CPU - this pod was not deleted
