Document expected locket related logs #770

Open
ameowlia opened this issue Jun 22, 2023 · 2 comments
Comments

@ameowlia
Member

ameowlia commented Jun 22, 2023

I am currently debugging an issue around locket and was trying to figure out which logs are part of normal operation. It would be great to document this better in the future. For now I am going to document it here.

Setup for testing "lock" locks

  • I have two VMs where BBS is deployed
    • control/9e859b27-ac41-4e39-ba59-bb87d520d987 👈 This VM started first. So the bbs on it has the lock.
    • control/fe89652e-4e59-4efc-b75c-88cc029ed816 👈 This VM started second. So the bbs on it does not have the lock.
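
One way to follow along is to tail the bbs logs on both control VMs. A minimal sketch, assuming a BOSH deployment named cf and the standard bosh job log path (adjust names for your environment):

# from a jumpbox with the BOSH CLI targeted at the environment
bosh -d cf ssh control/9e859b27-ac41-4e39-ba59-bb87d520d987 \
  -c 'sudo tail -f /var/vcap/sys/log/bbs/bbs.stdout.log'

# in a second terminal, tail the other control VM
bosh -d cf ssh control/fe89652e-4e59-4efc-b75c-88cc029ed816 \
  -c 'sudo tail -f /var/vcap/sys/log/bbs/bbs.stdout.log'

The active bbs logs steadily; the standby one goes quiet after its last failed-to-acquire-lock line.
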
# control/9e859b27-ac41-4e39-ba59-bb87d520d987
# bbs.stdout.log
# this bbs has the lock, so it is active. it is constantly logging about the work it is doing.
...
{"timestamp":"2023-06-22T19:13:05.362719089Z","level":"info","source":"bbs","message":"bbs.converge-lrps.db-converge-lrps.complete","data":{"session":"1875.2"}}
{"timestamp":"2023-06-22T19:13:05.366102972Z","level":"info","source":"bbs","message":"bbs.executing-convergence.converge-lrps-done","data":{"session":"1874"}}
...
# control/fe89652e-4e59-4efc-b75c-88cc029ed816
# bbs.stdout.log
# this bbs does not have the lock, so it is not active. its most recent log line is from when it attempted to grab the lock, but failed because a different bbs held the lock.
# at this point it is not actively logging because it is not doing any work.
# note that it tries to make a lock with owner 'fe89652e-4e59-4efc-b75c-88cc029ed816' (this VM), 
# but the lock already exists with owner '9e859b27-ac41-4e39-ba59-bb87d520d987' (the other VM with the bbs that has the lock).

{"timestamp":"2023-06-22T18:01:32.238648111Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock",
"data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1},
"lock-owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"7ca981d9-9375-45a8-7ddc-8940dc662643","session":"4","ttl_in_seconds":15}}
# locket.stdout.log
# There is a matching log in locket from when the bbs on control/fe89652e-4e59-4efc-b75c-88cc029ed816 attempted to grab the lock and failed
{"timestamp":"2023-06-22T18:01:32.238648111Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock",
"data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1},
"lock-owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"7ca981d9-9375-45a8-7ddc-8940dc662643","session":"4","ttl_in_seconds":15}}
# In the locket.locks db table there is one lock for bbs, and the owner is the guid of the VM where the bbs that holds the lock is running.
# The 'modified_index' is constantly incremented while the active bbs keeps refreshing the lock.

mysql> select * from locket.locks where path = "bbs";
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| path | owner                                | value | type | modified_index | modified_id                          | ttl  |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| bbs  | 9e859b27-ac41-4e39-ba59-bb87d520d987 |       | lock |           4352 | d6d8b4f2-8d1e-4ba0-71ca-fdbf6b0bad58 |   15 |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
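
A quick way to confirm that the active bbs is still refreshing the lock is to watch modified_index tick up. A sketch, assuming you can run the mysql client against the locket database (connection flags and credentials omitted):

# poll the lock row every 5 seconds; modified_index should keep increasing
while true; do
  mysql -e 'select owner, modified_index from locket.locks where path = "bbs"'
  sleep 5
done

If modified_index stops changing for longer than the ttl (15 seconds here), locket should expire the lock and the other bbs can claim it.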

Changing who has locks

If the bbs that currently has the lock restarts, then it will give up the lock and the other bbs will claim the lock. I did this by running monit restart bbs on control/9e859b27-ac41-4e39-ba59-bb87d520d987.
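
For reference, roughly the sequence of commands (a sketch; the log path assumes the standard bosh job layout):

# on control/9e859b27-ac41-4e39-ba59-bb87d520d987, the current lock holder
monit restart bbs

# on control/fe89652e-4e59-4efc-b75c-88cc029ed816, watch it claim the lock
grep locket-lock /var/vcap/sys/log/bbs/bbs.stdout.log | tail -3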

# locket.stdout.log
# The bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' gives up the lock
{"timestamp":"2023-06-22T19:16:05.620930708Z","level":"info","source":"locket","message":"locket.release.release-lock.released-lock","data":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","session":"1774728.1","type":"lock","type-code":1}}

# the bbs on VM 'fe89652e-4e59-4efc-b75c-88cc029ed816' successfully claims the lock.
{"timestamp":"2023-06-22T19:16:06.089322785Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","request-uuid":"7e2a16d0-56a7-40dc-5d04-ca7275364cc2","session":"1774735.1","type":"lock","type-code":1}}

# ~15 seconds (one ttl) after the release, locket's register-ttl handler for the old lock registration from '9e859b27-ac41-4e39-ba59-bb87d520d987' fires and fetches the lock to clean it up
{"timestamp":"2023-06-22T19:16:20.492246371Z","level":"info","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetched-lock","data":{"key":"bbs","modified-index":4431,"owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2.1","type":"lock","type-code":1}}

# the cleanup fails with an owner mismatch, because the lock is now owned by the bbs on VM 'fe89652e-4e59-4efc-b75c-88cc029ed816'. this error appears to be part of normal operation.
{"timestamp":"2023-06-22T19:16:20.493343559Z","level":"error","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetch-failed-owner-mismatch","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","fetched-owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","key":"bbs","modified-index":4431,"owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2.1","type":"lock","type-code":1}}
{"timestamp":"2023-06-22T19:16:20.495265778Z","level":"error","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","key":"bbs","modified-index":4431,"request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2","type":"lock"}}
# the lock in the db is now updated with the new owner.

mysql> select * from locket.locks where path = "bbs";
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| path | owner                                | value | type | modified_index | modified_id                          | ttl  |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| bbs  | fe89652e-4e59-4efc-b75c-88cc029ed816 |       | lock |             98 | cba003ea-608f-4cdf-539e-0304896dc313 |   15 |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
1 row in set (0.00 sec)
# control/9e859b27-ac41-4e39-ba59-bb87d520d987
# bbs.stdout.log
# this bbs has successfully restarted.
{"timestamp":"2023-06-22T19:16:07.128143147Z","level":"info","source":"bbs","message":"bbs.locket-lock.started","data":{"lock":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

# this bbs fails to claim the lock because the other bbs has the lock
{"timestamp":"2023-06-22T19:16:07.136452192Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","type":"lock","type_code":1},"lock-owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","request-uuid":"efb3d6c6-9585-4efb-6f74-126dcfccb99c","session":"4","ttl_in_seconds":15}}
# control/fe89652e-4e59-4efc-b75c-88cc029ed816
# bbs.stdout.log

# this bbs claims the lock as soon as it becomes available
{"timestamp":"2023-06-22T19:16:06.091053490Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

# once it grabs the lock it gets to work!
{"timestamp":"2023-06-22T19:16:06.092710053Z","level":"info","source":"bbs","message":"bbs.set-lock-held-metron-notifier.started","data":{"session":"5"}}
{"timestamp":"2023-06-22T19:16:06.093315811Z","level":"info","source":"bbs","message":"bbs.task-completion-workpool.starting","data":{"session":"1"}}
...
@ameowlia
Member Author

ameowlia commented Jun 22, 2023

Setup for testing "presence" locks.

Presence locks are registered by the rep on each diego cell. The rep puts information about its diego cell into the lock db so that the BBS knows which diego cells are available to run workloads.

I have two diego cells with GUIDs: 0e776c46-fd44-48f4-83ea-06b1dd13ca4b and 8ba76ac6-4809-4255-b59b-2e3878bf23d1.

# locket.locks db table
# there is one presence lock per diego cell, where the path is the GUID of the diego cell VM.

mysql> select path, type from locks;
+--------------------------------------+----------+
| path                                 | type     |
+--------------------------------------+----------+
| 0e776c46-fd44-48f4-83ea-06b1dd13ca4b | presence |
| 8ba76ac6-4809-4255-b59b-2e3878bf23d1 | presence |
| auctioneer                           | lock     |
| bbs                                  | lock     |
| cc-deployment-updater                | lock     |
| policy-server-asg-syncer             | lock     |
| routing_api_lock                     | lock     |
| tps_watcher                          | lock     |
+--------------------------------------+----------+
8 rows in set (0.00 sec)
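
As an alternative to querying the database, the registered cell presences can also be listed with cfdot (a sketch, assuming cfdot is installed and configured, e.g. on a diego cell or bbs VM):

# lists the presence data each rep has registered (cell_id, rep_address, zone, capacity, ...)
cfdot cells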

Changing the presence locks

  1. Run monit restart rep on diego cell 0e776c46-fd44-48f4-83ea-06b1dd13ca4b. This simulates an upgrade situation where the rep is restarted.
# rep.stdout.log

# rep exits
{"timestamp":"2023-06-22T20:19:38.920038544Z","level":"info","source":"rep","message":"rep.exited","data":{}}

# rep restarts 
{"timestamp":"2023-06-22T20:19:40.263986195Z","level":"info","source":"rep","message":"rep.wait-for-garden.ping-garden","data":{"initialTime:":"2023-06-22T20:19:40.263953792Z","session":"1","wait-time-ns:":30876}}

...
# rep gets the lock again
{"timestamp":"2023-06-22T20:19:42.926917888Z","level":"info","source":"rep","message":"rep.locket-lock.acquired-lock","data":{"lock":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","value":"{
\"cell_id\":\"0e776c46-fd44-48f4-83ea-06b1dd13ca4b\",\"rep_address\":\"http://10.0.4.8:1800\",\"zone\":\"us-central1-f\",\"capacity\":{\"memory_mb\":12977,\"disk_mb\":104349,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"prel
oaded\",\"properties\":[\"cflinuxfs3\",\"cflinuxfs4\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\",\"cflinuxfs4\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://0e776c46-fd44-48f4-83ea-06b1dd13ca4b.cell.service.cf.inte
rnal:1801\"}","type":"presence","type_code":2},"session":"12","ttl_in_seconds":15}}
# locket.stdout.log

# lock is released when rep is restarted
{"timestamp":"2023-06-22T20:19:38.914625536Z","level":"info","source":"locket","message":"locket.release.release-lock.released-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","session":"1815120.1","type":"presence","type-code":2}}

# lock is acquired again
{"timestamp":"2023-06-22T20:19:42.926185907Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","request-uuid":"3293a3eb-3c86-4965-457b-3c54cd3c4199","session":"1815164.1","type":"presence","type-code":2}}

# ~15 seconds (one ttl) after the restart, locket's register-ttl handler for the old rep process's lock registration (owner '78e5a54a-cc4f-4f67-5efc-6c2309bd5d28') fires and fetches the lock, but it is now owned by the new rep process ('c7ed4409-d33a-4b9b-644c-e5cdd873568f'), so the release fails with a lock-collision. This confusing error appears to be part of the normal process.
{"timestamp":"2023-06-22T20:19:53.862699986Z","level":"info","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetched-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2.1","type":"presence","type-code":2}}
{"timestamp":"2023-06-22T20:19:53.862867442Z","level":"error","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetch-failed-owner-mismatch","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","fetched-owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2.1","type":"presence","type-code":2}}
{"timestamp":"2023-06-22T20:19:53.863737078Z","level":"error","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2","type":"presence"}}

@winkingturtle-vmw
Contributor

@ameowlia Should we keep this issue open, close it, or move this into a markdown file?
