registry: remove units from etcd registry upon DestroyUnit() #1510

dongsupark · 2016-03-18T12:25:04Z

So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: #1456
Fixes: #1290
Reference: #1291

/cc @kayrus @wuqixuan

tixxdz · 2016-03-18T13:00:04Z

registry/job.go

@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error {
 	}

 	// TODO(jonboulle): add unit reference counting and actually destroying Units
+


@jonboulle could you please remember what was the use case that made you add this comment ? "unit reference counting" is there some optimization involved or something else ? thanks :-)

The problem is that the delete operation needs to be atomic (essentially a
compare and delete, "delete this unit file IFF no other units reference
it"), but that information is not stored in one place in etcd at the
moment. That's where a reference count would come in.

I haven't looked at the PR code yet but - how does it deal with this
scenario of a race between destroying a unit and creating a new unit with
an identical unit file hash?

On Fri, Mar 18, 2016 at 2:00 PM Djalal Harouni notifications@github.com
wrote:

In registry/job.go
#1510 (comment):

@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error {
}

// TODO(jonboulle): add unit reference counting and actually destroying Units

@jonboulle https://github.com/jonboulle could you please remember what
was the use case that made you add this comment ? "unit reference counting"
is there some optimization involved or something else ? thanks :-)

—
You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
https://github.com/coreos/fleet/pull/1510/files/0be1141333baa76d27b0560f96614a371350a663#r56651729

@jonboulle

The problem is that the delete operation needs to be atomic (essentially a compare and delete, "delete this unit file IFF no other units reference it"), but that information is not stored in one place in etcd at the moment. That's where a reference count would come in.
I haven't looked at the PR code yet but - how does it deal with this scenario of a race between destroying a unit and creating a new unit with an identical unit file hash?

Yeah, I see your point. This PR is basically not about changing the current design of fleet. I don't think I'm capable of doing it. Currently I don't have a better idea about how to do refcounting to prevent such a race between creating and destroying a unit.
But on the other hand, I think that the delete operation needs to be somehow implemented, from the perspective of usability. Thus I had to make a compromise, writing such a lengthy verification function before removing a unit. Of course it's not perfect, but that's probably one of doable things.

The question here is whether it's acceptable to introduce a race condition potentially leading to hard failure, only to handle something that is merely an annoyance... Doesn't sound like a good tradeoff to me.

@antrik

The question here is whether it's acceptable to introduce a race condition potentially leading to hard failure, only to handle something that is merely an annoyance... Doesn't sound like a good tradeoff to me.

If anyone could provide a better patch for this issue, I would be happy to close this PR in favor of the new one. Until then, I'd still think this PR is the only one I can come up with. OTOH given the fact that the 'TODO' comment was written 2 years ago, I don't expect anything to suddenly happen in the near future.

@dongsu, fair enough! thanks for the patches.

tixxdz · 2016-03-18T13:56:33Z

@dongsupark is a functional test possible for this case ?

dongsupark · 2016-03-18T14:14:37Z

@tixxdz First of all, I don't think it's possible to test such an extreme case that 2 arbitrary units end up having the same sha1 hash. Theoretically it would be possible, but how would it be practically look like...?
But it's probably possible to write a basic test for submitting & destroying multiple units, to verify that a correct unit has been actually removed. functional/unit_action_test.go doesn't seem to have such tests.

tixxdz · 2016-03-18T14:16:43Z

@dongsupark yes of course! yeh I was referencing "verify if the unit was actually removed" parts, thanks!

jonboulle · 2016-03-18T14:16:46Z

Creating units with the same hash is easy. echo blabla > foo.service; cp
foo.service bar.service; fleetctl submit foo.service bar.service

On Fri, Mar 18, 2016 at 3:14 PM Dongsu Park notifications@github.com
wrote:

@tixxdz https://github.com/tixxdz First of all, I don't think it's
possible to test such an extreme case that 2 arbitrary units end up having
the same sha1 hash. Theoretically it would be possible, but how would it be
practically look like...?
But it's probably possible to write a basic test for submitting &
destroying multiple units, to verify that a correct unit has been actually
removed. functional/unit_action_test.go doesn't seem to have such tests.

—
You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
#1510 (comment)

tixxdz · 2016-03-18T14:24:24Z

@jonboulle thx!

dongsupark · 2016-03-22T16:31:27Z

@tixxdz

is a functional test possible for this case ?

I added a new functional test TestUnitDestroyFromRegistry, which does not work correctly. On semaphore, etcdctl ls /_coreos.com/fleet/job/hello.service/object fails, because there's no such file. Funny enough, everything works fine during my local test. Not sure what's going on.

dongsupark · 2016-03-23T10:58:56Z

So finally I managed to get the new functional test TestUnitDestroyFromRegistry working. It has failed so far because of a wrong keyspace, which should have been /fleet_functional/smoke instead of the default /_coreos.com/fleet. Solution was to export cluster.Keyspace() to make it usable from functional tests correctly. (I also fixed other minor bugs, but I'll not get into the detail...)

Though the functional test still does not succeed completely. Other tests like TestNodeShutdown fail randomly, but I think that has nothing to do with my test. That's probably because of the on-going maintenance work from Semaphore.

antrik · 2016-03-23T13:32:15Z

functional/unit_action_test.go

+		t.Fatalf("Unable to submit fleet unit: %v", err)
+	}
+	var stdout string
+	stdout, _, err = cluster.Fleetctl(m, "list-units", "--no-legend")


Don't define the variable separately: use stdout, _, err := ... instead.

dongsupark · 2016-03-24T10:04:09Z

@antrik ok, I changed minor things as you suggested, and force-pushed.

BTW after rebasing onto the current master, fleet suddenly stopped to build, because commit 64ec3a4 removed HashFromHexString. So I had to resurrect the function in registry/job.go. Now it's working.

kayrus · 2016-03-31T17:27:43Z

@dongsupark function was restored. that was temporarily behavior. sorry for confusing.

So far each command "fleetctl destroy unit" has removed job entries from the etcd registry, under /_coreos.com/fleet/job. But it has not removed its unit file, under /_coreos.com/fleet/unit. As a result, fleet left lots of garbages in the etcd registry, so users had to manually clean them up. So this patch gets unit contents deleted actually from etcd registry when DestroyUnit() gets called. To avoid potential hash collisions, it first fetches a list of units from registry, to check there's any duplicated entry. Only if no duplicated unit is found, fleetd actually deletes the unit from registry. Fixes: coreos#1456 Fixes: coreos#1290 Reference: coreos#1291

Introduce new helpers for running etcdctl for functional tests. Also export a method Cluster.Keyspace() to get it called by functional tests.

A new test TestUnitDestroyFromRegistry() checks for a submitted unit being actually deleted from the etcd registry. To compare the old unit body with the one registered in the etcd registry, we need to go through several steps for queries to etcd.

tixxdz reviewed Mar 18, 2016
View reviewed changes

dongsupark force-pushed the dongsu/etcd-destroy-unit branch 6 times, most recently from 110eb5f to 0ef648c Compare March 22, 2016 16:08

dongsupark force-pushed the dongsu/etcd-destroy-unit branch 3 times, most recently from 628ee52 to c254991 Compare March 23, 2016 10:46

antrik reviewed Mar 23, 2016
View reviewed changes

dongsupark force-pushed the dongsu/etcd-destroy-unit branch 2 times, most recently from 5a0c768 to 8948883 Compare March 24, 2016 09:47

tixxdz mentioned this pull request Mar 29, 2016

fleet: add replace unit support #1509

Merged

dongsupark force-pushed the dongsu/etcd-destroy-unit branch from 8948883 to 2193e1d Compare April 4, 2016 09:54

dongsupark force-pushed the master branch from 150d30c to 4002bf5 Compare August 16, 2016 08:54

dongsupark force-pushed the master branch 7 times, most recently from 3b60e93 to 875d938 Compare August 23, 2016 11:00

dongsupark force-pushed the master branch 2 times, most recently from bdc94e8 to fa5aa3a Compare August 30, 2016 12:26

Dongsu Park added 3 commits August 31, 2016 16:24

functional: support running etcdctl, as well as using Cluster.Keyspace

bc26df1

Introduce new helpers for running etcdctl for functional tests. Also export a method Cluster.Keyspace() to get it called by functional tests.

dongsupark force-pushed the dongsu/etcd-destroy-unit branch from 9621433 to 60c9c80 Compare August 31, 2016 14:25

dongsupark force-pushed the master branch 3 times, most recently from 20a3e96 to 3aaa1ab Compare November 10, 2016 15:24

dongsupark force-pushed the master branch 2 times, most recently from eb6872f to 365565e Compare November 24, 2016 15:35

dongsupark force-pushed the master branch 6 times, most recently from 39a99ba to 44591b0 Compare December 15, 2016 19:48

jonboulle added priority/Pmaybe and removed priority/Pmaybe labels Dec 22, 2016

dongsupark force-pushed the master branch 3 times, most recently from 0132632 to 6974811 Compare February 8, 2017 10:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

registry: remove units from etcd registry upon DestroyUnit() #1510

registry: remove units from etcd registry upon DestroyUnit() #1510

dongsupark commented Mar 18, 2016

tixxdz Mar 18, 2016

jonboulle Mar 18, 2016

dongsupark Mar 18, 2016

antrik Mar 23, 2016

dongsupark Mar 24, 2016

tixxdz Mar 29, 2016

tixxdz commented Mar 18, 2016

dongsupark commented Mar 18, 2016

tixxdz commented Mar 18, 2016

jonboulle commented Mar 18, 2016

tixxdz commented Mar 18, 2016

dongsupark commented Mar 22, 2016

dongsupark commented Mar 23, 2016

antrik Mar 23, 2016

dongsupark commented Mar 24, 2016

kayrus commented Mar 31, 2016

		@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error {
		}

		// TODO(jonboulle): add unit reference counting and actually destroying Units

registry: remove units from etcd registry upon DestroyUnit() #1510

Are you sure you want to change the base?

registry: remove units from etcd registry upon DestroyUnit() #1510

Conversation

dongsupark commented Mar 18, 2016

tixxdz Mar 18, 2016

Choose a reason for hiding this comment

jonboulle Mar 18, 2016

Choose a reason for hiding this comment

dongsupark Mar 18, 2016

Choose a reason for hiding this comment

antrik Mar 23, 2016

Choose a reason for hiding this comment

dongsupark Mar 24, 2016

Choose a reason for hiding this comment

tixxdz Mar 29, 2016

Choose a reason for hiding this comment

tixxdz commented Mar 18, 2016

dongsupark commented Mar 18, 2016

tixxdz commented Mar 18, 2016

jonboulle commented Mar 18, 2016

tixxdz commented Mar 18, 2016

dongsupark commented Mar 22, 2016

dongsupark commented Mar 23, 2016

antrik Mar 23, 2016

Choose a reason for hiding this comment

dongsupark commented Mar 24, 2016

kayrus commented Mar 31, 2016