Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

registry: remove units from etcd registry upon DestroyUnit() #1510

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

dongsupark
Copy link
Contributor

So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: #1456
Fixes: #1290
Reference: #1291

/cc @kayrus @wuqixuan

@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error {
}

// TODO(jonboulle): add unit reference counting and actually destroying Units

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonboulle could you please remember what was the use case that made you add this comment ? "unit reference counting" is there some optimization involved or something else ? thanks :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the delete operation needs to be atomic (essentially a
compare and delete, "delete this unit file IFF no other units reference
it"), but that information is not stored in one place in etcd at the
moment. That's where a reference count would come in.

I haven't looked at the PR code yet but - how does it deal with this
scenario of a race between destroying a unit and creating a new unit with
an identical unit file hash?

On Fri, Mar 18, 2016 at 2:00 PM Djalal Harouni notifications@github.com
wrote:

In registry/job.go
#1510 (comment):

@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error {
}

// TODO(jonboulle): add unit reference counting and actually destroying Units

@jonboulle https://github.com/jonboulle could you please remember what
was the use case that made you add this comment ? "unit reference counting"
is there some optimization involved or something else ? thanks :-)


You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
https://github.com/coreos/fleet/pull/1510/files/0be1141333baa76d27b0560f96614a371350a663#r56651729

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonboulle

The problem is that the delete operation needs to be atomic (essentially a compare and delete, "delete this unit file IFF no other units reference it"), but that information is not stored in one place in etcd at the moment. That's where a reference count would come in.
I haven't looked at the PR code yet but - how does it deal with this scenario of a race between destroying a unit and creating a new unit with an identical unit file hash?

Yeah, I see your point. This PR is basically not about changing the current design of fleet. I don't think I'm capable of doing it. Currently I don't have a better idea about how to do refcounting to prevent such a race between creating and destroying a unit.
But on the other hand, I think that the delete operation needs to be somehow implemented, from the perspective of usability. Thus I had to make a compromise, writing such a lengthy verification function before removing a unit. Of course it's not perfect, but that's probably one of doable things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question here is whether it's acceptable to introduce a race condition potentially leading to hard failure, only to handle something that is merely an annoyance... Doesn't sound like a good tradeoff to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antrik

The question here is whether it's acceptable to introduce a race condition potentially leading to hard failure, only to handle something that is merely an annoyance... Doesn't sound like a good tradeoff to me.

If anyone could provide a better patch for this issue, I would be happy to close this PR in favor of the new one. Until then, I'd still think this PR is the only one I can come up with. OTOH given the fact that the 'TODO' comment was written 2 years ago, I don't expect anything to suddenly happen in the near future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongsu, fair enough! thanks for the patches.

@tixxdz
Copy link
Contributor

tixxdz commented Mar 18, 2016

@dongsupark is a functional test possible for this case ?

@dongsupark
Copy link
Contributor Author

@tixxdz First of all, I don't think it's possible to test such an extreme case that 2 arbitrary units end up having the same sha1 hash. Theoretically it would be possible, but how would it be practically look like...?
But it's probably possible to write a basic test for submitting & destroying multiple units, to verify that a correct unit has been actually removed. functional/unit_action_test.go doesn't seem to have such tests.

@tixxdz
Copy link
Contributor

tixxdz commented Mar 18, 2016

@dongsupark yes of course! yeh I was referencing "verify if the unit was actually removed" parts, thanks!

@jonboulle
Copy link
Contributor

Creating units with the same hash is easy. echo blabla > foo.service; cp
foo.service bar.service; fleetctl submit foo.service bar.service

On Fri, Mar 18, 2016 at 3:14 PM Dongsu Park notifications@github.com
wrote:

@tixxdz https://github.com/tixxdz First of all, I don't think it's
possible to test such an extreme case that 2 arbitrary units end up having
the same sha1 hash. Theoretically it would be possible, but how would it be
practically look like...?
But it's probably possible to write a basic test for submitting &
destroying multiple units, to verify that a correct unit has been actually
removed. functional/unit_action_test.go doesn't seem to have such tests.


You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
#1510 (comment)

@tixxdz
Copy link
Contributor

tixxdz commented Mar 18, 2016

@jonboulle thx!

@dongsupark dongsupark force-pushed the dongsu/etcd-destroy-unit branch 6 times, most recently from 110eb5f to 0ef648c Compare March 22, 2016 16:08
@dongsupark
Copy link
Contributor Author

@tixxdz

is a functional test possible for this case ?

I added a new functional test TestUnitDestroyFromRegistry, which does not work correctly. On semaphore, etcdctl ls /_coreos.com/fleet/job/hello.service/object fails, because there's no such file. Funny enough, everything works fine during my local test. Not sure what's going on.

@dongsupark dongsupark force-pushed the dongsu/etcd-destroy-unit branch 3 times, most recently from 628ee52 to c254991 Compare March 23, 2016 10:46
@dongsupark
Copy link
Contributor Author

So finally I managed to get the new functional test TestUnitDestroyFromRegistry working. It has failed so far because of a wrong keyspace, which should have been /fleet_functional/smoke instead of the default /_coreos.com/fleet. Solution was to export cluster.Keyspace() to make it usable from functional tests correctly. (I also fixed other minor bugs, but I'll not get into the detail...)

Though the functional test still does not succeed completely. Other tests like TestNodeShutdown fail randomly, but I think that has nothing to do with my test. That's probably because of the on-going maintenance work from Semaphore.

t.Fatalf("Unable to submit fleet unit: %v", err)
}
var stdout string
stdout, _, err = cluster.Fleetctl(m, "list-units", "--no-legend")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't define the variable separately: use stdout, _, err := ... instead.

@dongsupark dongsupark force-pushed the dongsu/etcd-destroy-unit branch 2 times, most recently from 5a0c768 to 8948883 Compare March 24, 2016 09:47
@dongsupark
Copy link
Contributor Author

@antrik ok, I changed minor things as you suggested, and force-pushed.

BTW after rebasing onto the current master, fleet suddenly stopped to build, because commit 64ec3a4 removed HashFromHexString. So I had to resurrect the function in registry/job.go. Now it's working.

@kayrus
Copy link
Contributor

kayrus commented Mar 31, 2016

@dongsupark function was restored. that was temporarily behavior. sorry for confusing.

Dongsu Park added 3 commits August 31, 2016 16:24
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
Introduce new helpers for running etcdctl for functional tests.
Also export a method Cluster.Keyspace() to get it called by functional
tests.
A new test TestUnitDestroyFromRegistry() checks for a submitted unit
being actually deleted from the etcd registry. To compare the old unit
body with the one registered in the etcd registry, we need to go
through several steps for queries to etcd.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants