
Liquibase leaves dangling lock on startup #517

Open
mook-as opened this issue Aug 22, 2019 · 4 comments


mook-as commented Aug 22, 2019

Hi there!

We're getting an error on startup: autoscaler-metrics can't start because Liquibase attempts to acquire a change log lock, but its locking implementation leaves a chance of the lock dangling with no live owner:

consul agent is not needed
Starting Liquibase at Fri, 16 Aug 2019 07:06:19 UTC (version 3.6.3 built at 2019-01-29 11:34:48)
Unexpected error running Liquibase: Could not acquire change log lock.  Currently locked by autoscaler-metrics-1.autoscaler-metrics-set.scf.svc.cluster.local (10.244.2.10) since 8/16/19, 5:26 AM
liquibase.exception.LockException: Could not acquire change log lock.  Currently locked by autoscaler-metrics-1.autoscaler-metrics-set.scf.svc.cluster.local (10.244.2.10) since 8/16/19, 5:26 AM
	at liquibase.lockservice.StandardLockService.waitForLock(StandardLockService.java:230)
	at liquibase.Liquibase.update(Liquibase.java:184)
	at liquibase.Liquibase.update(Liquibase.java:179)
	at liquibase.integration.commandline.Main.doMigration(Main.java:1220)
	at liquibase.integration.commandline.Main.run(Main.java:199)
	at liquibase.integration.commandline.Main.main(Main.java:137)

Steps to reproduce:

(This was on SCF, but as far as I can tell this is an issue with liquibase, not how you run it.)

  1. Start a CF cluster, and have autoscaler installed.
  2. Restart the metricscollector job, and watch its output. (In my case, the autoscaler-metrics pod.)
  3. Once it shows Starting Liquibase, forcibly terminate the process (one way to do this is sketched below the list).
  4. Let it restart, and see that it gets stuck.
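
For concreteness, here is one way to do the forced termination on an SCF/Kubernetes deployment. This is only a sketch; the pod and namespace names are taken from the log above and may differ in your environment:

    # Kill the pod with no grace period while Liquibase is mid-startup, then
    # watch the replacement pod get stuck on the dangling lock.
    kubectl delete pod autoscaler-metrics-1 --namespace scf --grace-period=0 --force
    kubectl logs --follow autoscaler-metrics-1 --namespace scf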

Details

It appears that Liquibase implements locking by writing a row to a database table, instead of taking an actual database lock (row-level or table-level). See this stackoverflow question for a similar situation; recovery there required manual intervention in the DB. I think it might actually be trying to use transactions on top of that; in any case, just removing the "lock" row got it to recover.
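
For reference, the manual recovery amounts to resetting that row. A sketch, assuming a PostgreSQL backend and placeholder credentials; the table and column names are the ones Liquibase's lock service maintains:

    # Clear the dangling lock row so the next Liquibase run can proceed.
    # The connection string is a placeholder; point it at the autoscaler DB.
    psql "postgres://user:password@autoscaler-db:5432/autoscaler" -c \
      "UPDATE databasechangeloglock
          SET locked = FALSE, lockgranted = NULL, lockedby = NULL
        WHERE id = 1;"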

I'm not sure what you could do to fix this (other than trying to fix it upstream in Liquibase; unfortunately my Java is terrible). But I figured I should file it at a minimum.
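
For anyone else who hits this in the meantime: Liquibase also ships a releaseLocks command that clears a dangling lock without hand-editing the DB. A sketch with placeholder connection details (the exact flags depend on how Liquibase is invoked in your deployment):

    # Equivalent to resetting DATABASECHANGELOGLOCK by hand; the URL and
    # credentials below are placeholders.
    liquibase --url=jdbc:postgresql://autoscaler-db:5432/autoscaler \
              --username=user --password=password \
              releaseLocks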

Thanks!

cf-gitbot (Collaborator) commented

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/168061254

The labels on this github issue will be updated when the story is started.

cdlliuy (Contributor) commented Aug 23, 2019

@mook-as, thanks! Yes, we noticed this problem and have been tracking it through cloudfoundry/app-autoscaler-release#207.
PR cloudfoundry/app-autoscaler-release#209 has been raised to resolve this issue, but it is still under review.

cdlliuy (Contributor) commented Aug 23, 2019

Furthermore, @mook-as, we also used force termination to reproduce the issue, but did you happen to notice any other way to trigger the failure?
The Liquibase-related pre-start job has been around for quite a long time, but it never failed in a BOSH or SCF deployment before...
If "force termination" is the only trigger, I am quite curious which change in SCF would kill the pre-start job... Any clues from your side?

mook-as (Author) commented Aug 26, 2019

Yeah, I'm not sure why we're seeing forced termination (actually, I think it's just that the DB connection died); I suspect it's just us bumping cf-deployment, leading to a larger footprint in the rest of the deployment, overloading the system slightly and making the autoscaler bits slower?
