LambdaCD does not start when persistent state is corrupted #192

Open
abendt opened this issue Aug 27, 2018 · 5 comments

@abendt

abendt commented Aug 27, 2018

We use LambdaCD with file-based persistence.
Sometimes the persisted state seems to get corrupted during shutdown. Afterwards LambdaCD no longer starts:

Aug 27 12:29:58 tyr-ci-01 java[1878]: Exception in thread "main" java.lang.NumberFormatException: null
Aug 27 12:29:58 tyr-ci-01 java[1878]: at java.lang.Integer.parseInt(Integer.java:542)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at java.lang.Integer.parseInt(Integer.java:615)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.util.internal.sugar$parse_int.invokeStatic(sugar.clj:5)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.util.internal.sugar$parse_int.invoke(sugar.clj:4)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$build_number_from_path.invokeStatic(default_pipeline_state_persistence.clj:45)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$build_number_from_path.invoke(default_pipeline_state_persistence.clj:44)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_pipeline_structure_edn.invokeStatic(default_pipeline_state_persistence.clj:96)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_pipeline_structure_edn.invoke(default_pipeline_state_persistence.clj:95)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$map$fn__5587.invoke(core.clj:2747)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.LazySeq.sval(LazySeq.java:40)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.LazySeq.seq(LazySeq.java:49)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.Cons.next(Cons.java:39)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.RT.next(RT.java:706)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$next__5108.invokeStatic(core.clj:64)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7852.invokeStatic(protocols.clj:169)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7852.invoke(protocols.clj:124)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7807$G__7802__7816.invoke(protocols.clj:19)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7835.invokeStatic(protocols.clj:75)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7835.invoke(protocols.clj:75)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core.protocols$fn__7781$G__7776__7794.invoke(protocols.clj:13)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$reduce.invokeStatic(core.clj:6748)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$into.invokeStatic(core.clj:6815)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.core$into.invoke(core.clj:6807)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_build_datas.invokeStatic(default_pipeline_state_persistence.clj:105)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state_persistence$read_build_datas.invoke(default_pipeline_state_persistence.clj:101)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state$new_default_pipeline_state.invokeStatic(default_pipeline_state.clj:76)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.internal.default_pipeline_state$new_default_pipeline_state.doInvoke(default_pipeline_state.clj:73)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at clojure.lang.RestFn.invoke(RestFn.java:410)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.core$assemble_pipeline.invokeStatic(core.clj:42)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd.core$assemble_pipeline.invoke(core.clj:37)
Aug 27 12:29:58 tyr-ci-01 java[1878]: at lambdacd_pipeline.wishlistui.wishlistui$wishlistui_pipeline.invokeStatic(wishlistui.clj:56)

We fixed the problem by deleting the workspace. Maybe this could be improved within LambdaCD, e.g. by ignoring a previous build when its file cannot be read?

@flosell
Owner

flosell commented Aug 28, 2018

It's definitely possible to do, even though I usually prefer to fail fast as it's more explicit to the user: "oh, my state is corrupted" vs "hmm, somehow one of my builds disappeared".

That aside, I'd like to understand how we got into this state in the first place. From looking at the stack trace and the code it looks like there were directories like build-something-thats-not-a-number in the home-directory and I'm wondering how they got there.

If this happens again, can you have a look into the home directory and post an ls?
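
For readers following the trace, here is a minimal Java sketch of the failure mode it suggests. This is not LambdaCD's code (which is Clojure); the class, the method names and the `build-(\d+)` pattern are assumptions made for illustration. On Java 8, the message `null` is what `Integer.parseInt` reports when it is handed a null string. That would be consistent with the build number not being found in a directory name and the missing value being parsed anyway.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; names and pattern are assumptions, not LambdaCD's API.
class BuildNumberFailure {
    // Assumed naming scheme: a build directory is called "build-<digits>".
    private static final Pattern BUILD_DIR = Pattern.compile("build-(\\d+)");

    // Returns the digits after "build-", or null when the name does not match.
    static String buildNumberFromName(String dirName) {
        Matcher m = BUILD_DIR.matcher(dirName);
        return m.matches() ? m.group(1) : null;
    }

    // Parses without guarding against null, so a non-matching name throws.
    static int parseUnchecked(String dirName) {
        return Integer.parseInt(buildNumberFromName(dirName));
    }

    public static void main(String[] args) {
        System.out.println(parseUnchecked("build-31"));
        try {
            parseUnchecked("build-something-thats-not-a-number");
        } catch (NumberFormatException e) {
            // An uncaught exception at this point would abort startup.
            System.out.println("parse failed: " + e);
        }
    }
}
```

The exact exception message differs between JDK versions, so the sketch prints it rather than relying on it.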

flosell added a commit that referenced this issue Sep 1, 2018
…he directory name afterwards to prevent startup failures due to invalid build directory names (#192)
flosell added a commit that referenced this issue Sep 1, 2018
@flosell
Owner

flosell commented Sep 1, 2018

I just looked at the code a bit more and found it to be inconsistent: it looked at all directories starting with build- but then expected a build number after the dash.
That should be fixed now and is released in 0.14.2, so I'm closing this issue for now. If the problem reappears, feel free to re-open.

I'd still be curious how such directories ended up there so if you find out, please drop a note, maybe there's another bug hiding somewhere.
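
As a rough illustration of the idea (again in Java, not the actual Clojure change in LambdaCD), a scan can accept only names that match the full `build-<digits>` form and skip everything else with a message instead of failing. All names here, including `TolerantBuildScan` and the example directory names, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not LambdaCD's implementation.
class TolerantBuildScan {
    // Assumed naming scheme: the whole name must be "build-<digits>".
    private static final Pattern BUILD_DIR = Pattern.compile("build-(\\d+)");

    // Returns the build number, or empty when the name is not a valid build directory.
    static OptionalInt buildNumber(String dirName) {
        Matcher m = BUILD_DIR.matcher(dirName);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // The digits do not fit into an int; treat the name as invalid.
            return OptionalInt.empty();
        }
    }

    // Collects valid build numbers and reports each skipped name on stderr.
    static List<Integer> validBuildNumbers(List<String> dirNames) {
        List<Integer> result = new ArrayList<>();
        for (String name : dirNames) {
            OptionalInt n = buildNumber(name);
            if (n.isPresent()) {
                result.add(n.getAsInt());
            } else {
                System.err.println("skipping " + name + ": no valid build number");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("build-1", "build-tmp", "build-", "other-dir", "build-2");
        System.out.println(validBuildNumbers(names));
    }
}
```

Reporting each skipped name keeps some of the explicitness that failing fast would give, so a missing build does not disappear silently.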

@flosell flosell closed this as completed Sep 1, 2018
@abendt
Author

abendt commented Sep 5, 2018

@flosell we just upgraded to 0.14.2. However it does not seem to resolve the issue.
Directory listing:

build-31 build-32 build-33 build-34 build-35 build-36 build-37 build-38 build-39 build-40 lambdacd730621214759903273 lambdacd-artifacts

flosell added a commit that referenced this issue Sep 22, 2018
… be more robust against paths with invalid build numbers (related to #192)
@flosell
Owner

flosell commented Sep 22, 2018

Hi @abendt, sorry for the late reply, was busy with a few other things lately.

I looked into your problem again but couldn't find a way to reproduce this problem or understand why it's happening. However, I refactored the code to make it easier to reason about and possibly more robust.
I'll release this as 0.14.3; please check whether it fixes the problem. If it does, could you set logging to DEBUG level and post any messages that contain "doesn't seem to contain a valid build number"? Maybe we'll find out this way which files are responsible.

@flosell flosell reopened this Sep 22, 2018
@abendt
Author

abendt commented Sep 22, 2018

Will do. Thank you!
