
Save failing cases between test runs #20

Open · tmbb opened this issue Aug 13, 2017 · 16 comments
@tmbb

tmbb commented Aug 13, 2017

Hypothesis (the inspiration for all my feature requests) saves the failing test cases in a database so that they can be retried in other test runs. That way you can be confident that a generated example that failed before will be tested in the next builds. Link: http://hypothesis.readthedocs.io/en/latest/database.html

I think we should do this too. The database is not very complex; it's just some files in hidden directories (I'm not familiar with the implementation details). Just to be clear, I'm not talking about repeating the whole test suite for the failing tests, just repeating the failing examples. This probably requires storing the random seed just before running a new test. Can StreamData do it?
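As a rough illustration of what "storing the random seed just before running a new test" could mean at the Erlang level, the `:rand` module can export and restore the process generator state. This is plain `:rand` usage for illustration, not anything StreamData does today:

```elixir
# Seed the process generator, then export its state so it can be restored later.
_ = :rand.uniform()
saved_state = :rand.export_seed()

first_run = for _ <- 1..3, do: :rand.uniform(100)

# Restoring the exported state replays the exact same random sequence.
:rand.seed(saved_state)
second_run = for _ <- 1..3, do: :rand.uniform(100)

first_run == second_run
#=> true
```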

@whatyouhide changed the title from "Saving failing cases between test runs" to "Save failing cases between test runs" on Aug 15, 2017
@jeffkreeftmeijer
Contributor

jeffkreeftmeijer commented Aug 27, 2017

Although I think storing failing test cases would be a great fit for StreamData (making sure you don’t “lose” failing cases), I don’t think this project is the place to add such a feature.

Since ExUnit’s seeds are used to generate the test data, we wouldn’t have to store whole test cases. Instead, as @tmbb already mentioned, we should be fine just storing ExUnit's seed as a first step. However, since that is a fairly specific (and possibly confusing) feature, this might be better as a separate library for now.

PS: Since I love a good challenge, I tinkered with this idea for a bit and came up with bad_seed. It's the simplest thing that works; it stores the last failing seed in test/.bad_seed and keeps using it until it's green.
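For context, the general shape of that approach (reading a stored seed back into ExUnit from a test helper) might look roughly like the sketch below; the module and helper names are illustrative, not bad_seed's actual code:

```elixir
# Hypothetical helper, e.g. invoked from test/test_helper.exs.
defmodule BadSeedSketch do
  @seed_file "test/.bad_seed"

  # If a previous run recorded a failing seed, keep using it until tests pass.
  def configure do
    case File.read(@seed_file) do
      {:ok, contents} -> ExUnit.configure(seed: contents |> String.trim() |> String.to_integer())
      {:error, :enoent} -> :ok
    end
  end

  # Called when the suite fails; clear/0 when it goes green again.
  def record_failure(seed), do: File.write!(@seed_file, Integer.to_string(seed))
  def clear, do: File.rm(@seed_file)
end
```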

@fishcakez
Collaborator

fishcakez commented Aug 27, 2017 via email

@tmbb
Author

tmbb commented Aug 27, 2017

> as @tmbb already mentioned, we should be fine just storing ExUnit's seed

This is NOT what I said! Saving the random seed is useful, but it limits you to rerunning the whole test suite. When I talked about saving the seed, I had in mind taking the state of the random number generator at the time the failing example was tested and saving that. Later, you'd rerun only that test case and then try some new data.

Your library is interesting, but it's not the same thing.
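To make the distinction concrete, here is a hedged sketch of replaying a single failing example from a saved generator state; `replay_failure/2` is an imaginary hook, not StreamData API:

```elixir
defmodule ReplaySketch do
  # `saved_state` is assumed to come from :rand.export_seed/0, captured right
  # before the failing example was generated.
  def replay_failure(property_fun, saved_state) when is_function(property_fun, 0) do
    :rand.seed(saved_state)
    # Same generator code + same PRNG state => the same failing input,
    # without rerunning the rest of the suite.
    property_fun.()
  end
end
```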

@tmbb
Author

tmbb commented Aug 27, 2017

@fishcakez That's true, but if the generators change, then you actually have very few guarantees... You'd have to serialize the values themselves and not the seed, and I don't know if that's possible for all values. Saving the seed (as I've described above) seems like a good compromise.

@josevalim
Contributor

bad_seed is definitely interesting, and in case we can't figure out exactly how to do the saving and loading, it is the minimum we can do to get started.

@jeffkreeftmeijer you could save the bad seed in "_build/test" or somewhere similar.
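A minimal sketch of that suggestion, assuming ExUnit has already been started so its seed is available (the file name is made up for illustration):

```elixir
# Store the failing seed under _build/<env> instead of test/.
seed_file = Path.join(Mix.Project.build_path(), "stream_data.bad_seed")
File.write!(seed_file, Integer.to_string(ExUnit.configuration()[:seed]))
```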

@jeffkreeftmeijer
Contributor

@tmbb I agree that storing the seeds per example is definitely a nicer solution, and I wasn’t suggesting my library does the same thing. It's just a quick stab at the problem.

@josevalim better idea indeed. Will look into that.

@fishcakez
Collaborator

fishcakez commented Aug 27, 2017

@tmbb my main interest is in state machine testing, which I think only appears in the Erlang QuickCheck libraries. In those situations it is likely that generators would change over time, but the input would be tested for validity as part of a precondition check when running the test. In this situation saving the seed is actually unhelpful because it doesn't provide what it's intended to provide, and the desired regression test won't be run.

I think this comes into play generally: while interesting, it has edge cases that mean the same test isn't going to be run unless you can guarantee that all the test code, or at least the generation, is the same.
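A generic sketch of the replay-with-precondition idea described above, not tied to any particular library (the module and callback names are illustrative):

```elixir
defmodule ReplayCommandsSketch do
  # Replays a stored command list, re-checking each command against the current
  # model. If the model (and its generators) changed since the commands were
  # recorded, the stale regression case is flagged instead of silently testing
  # something else.
  def replay(stored_commands, model, initial_state) do
    Enum.reduce_while(stored_commands, initial_state, fn command, state ->
      if model.precondition(state, command) do
        {:cont, model.next_state(state, command, model.run(command))}
      else
        {:halt, {:invalid_regression, command}}
      end
    end)
  end
end
```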

@tmbb
Author

tmbb commented Aug 27, 2017

@jeffkreeftmeijer Ok, I just wanted to make sure I was communicating my idea correctly :)

@josevalim If I'm reading the code correctly, this function is probably a good place to add some code that saves the seed, test config, size, etc. (whatever you need to make the test reproducible; I'm not familiar enough with the code to know exactly what information is needed).

EDIT: You still need some extra code to run that specific example.
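For illustration only, the kind of record being described might look like this; the field names are guesses, not StreamData internals:

```elixir
failure_record = %{
  property: "property: encoding round-trips",  # identifier of the failing property (placeholder)
  seed: ExUnit.configuration()[:seed],         # suite seed (assumes ExUnit has been started)
  size: 42,                                    # generation size at failure time
  rand_state: :rand.export_seed()              # PRNG state just before the failing example
}
```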

@tmbb
Author

tmbb commented Aug 27, 2017

> In those situations it is likely that generators would change over time

@fishcakez I'm not familiar with state machine testing, but wouldn't you be able to generate the state transitions from the random seed?

@fishcakez
Collaborator

> but wouldn't you be able to generate the state transitions from the random seed?

Unfortunately not, if the generators change. The tests work by generating a list of commands to run against the state machine. Before running the test, the list of commands is checked against a model to make sure it is a valid set of commands. If the commands are valid, they are run against the system. It's likely that the generators will change over time, but you will want to keep running the same regression tests of commands that used to fail.

For example, when a feature is added you would likely need to extend your model, and so the generator for the list of commands would change. Therefore the seed would generate a different list of commands from the one you intended to test. When replaying a previous list of commands the precondition check is still carried out, so it's known whether the regression test is still valid in the model. If the generator doesn't change, then it will still produce the same list of commands with the same seed.

I was trying to give a real-life example where storing the seed would not work. However, I think in simpler cases the same thing occurs as soon as the generator changes. Therefore we would only want to keep the seeds around if the generator does not change; but then, if the generator does change, you still want to be able to run the regression test. If we keep the seed we end up testing the wrong thing, and if we delete the seed then we lose the regression test.

I think this means that we would need to always store the generated term and not the seed.
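A minimal sketch of "store the generated term, not the seed": any pure Erlang term can be written to disk and read back byte-for-byte. The command list, module, and paths below are placeholders:

```elixir
# A generated command list from an imaginary stateful test.
commands = [{:call, MyQueue, :push, [1]}, {:call, MyQueue, :pop, []}]

File.mkdir_p!("test/regressions")
File.write!("test/regressions/queue_commands.bin", :erlang.term_to_binary(commands))

# Later runs load the exact same term, regardless of how the generators evolved.
^commands =
  "test/regressions/queue_commands.bin"
  |> File.read!()
  |> :erlang.binary_to_term()
```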

@tmbb
Author

tmbb commented Aug 27, 2017

@fishcakez Yes, I think you're right. In that case you'll be rewriting the generator. I wonder if the Hypothesis bytestring approach (from Python) could help you here... Probably not, unless you're careful when writing your generators (I can think of some possibilities). But their advice is that if the regression is important or hard to find, you should save the term manually. Translated into StreamData, you should gather the interesting examples in a normal ExUnit test case.
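A sketch of that manual approach: once an interesting failing input is found, pin it in a plain ExUnit test next to the property (`MyApp.encode/1` and `MyApp.decode/1` are placeholder functions):

```elixir
defmodule MyApp.RegressionTest do
  use ExUnit.Case, async: true

  # Input originally discovered by a failing property run, saved by hand.
  @known_bad_input <<0, 255, 0>>

  test "previously failing example still round-trips" do
    assert @known_bad_input |> MyApp.encode() |> MyApp.decode() == @known_bad_input
  end
end
```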

@fishcakez
Collaborator

@tmbb we are able to serialize any pure data structure to disk, and the impure ones too, but it won't reproduce the old side effects, if that makes a difference.

I am not just concerned about the long-term important ones but about short-term testing too. If we store the seed, the user can easily end up testing the wrong input unless we know the generators didn't change. If we can't provide this guarantee then I don't think we should provide the feature (storing the seed), as users will find it doesn't work as intended - or even worse, they won't discover it when it occurs!

@fishcakez
Collaborator

Just to be clear, I was referring to StreamData providing this feature, not another library that isn't specifically targeting StreamData.

@fishcakez
Collaborator

After speaking to @josevalim, we could get the best of both by reusing test seeds when running mix test --stale. This would mean people can run tests with the same seed until they pass, and then they may get a new seed when the test comes back to being run again. If the generator changes then it's fine, because it will still generate new values. This works nicely because the seed will not last through different builds but will be reused when you really want it to be.

Given the speed of property tests, most users would want to use --stale anyway, if they aren't using it in general.

@alfert
Contributor

alfert commented Sep 21, 2017

I just saw this issue today, so my comments are perhaps a little bit late.

I just implemented the same feature for prop_check, which uses PropEr as a backend. In PropEr, you get the counterexample as a result from a failing test, and there is a function for executing the property again with this counterexample. So my solution is to write the counterexample together with the property identifier to disk and re-execute the property later with that particular counterexample.

To execute a counterexample within StreamData, the generators of the property must not be executed. This could be communicated to check as a new parameter, with a branch within check that deals with executing the counterexample. In property, the management of counterexamples needs to be added. For me, it is unclear how the variables/arguments of the property are set when re-applying a counterexample. Is this handled anywhere? I would assume that it should happen somewhere in compile or compile_clauses, but I am unsure whether this is already implemented.

And just for the record: @fishcakez PropEr and prop_check support state machine testing as well.
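Expressed in plain Elixir, the workflow described above could be sketched as follows; the storage format and module are illustrative, not PropCheck's or StreamData's code:

```elixir
defmodule CounterexampleStoreSketch do
  @file "test/.counterexamples"

  # Persist the counterexample keyed by a property identifier so a later run
  # can re-execute just that input, skipping generation entirely.
  def save(property_id, counterexample) do
    table = Map.put(load_all(), property_id, counterexample)
    File.write!(@file, :erlang.term_to_binary(table))
  end

  def fetch(property_id), do: Map.fetch(load_all(), property_id)

  defp load_all do
    case File.read(@file) do
      {:ok, binary} -> :erlang.binary_to_term(binary)
      {:error, _} -> %{}
    end
  end
end
```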

@tmbb
Author

tmbb commented Oct 21, 2017

Just to go back to this issue of storing bad examples: in practice, Erlang terms are often serializable (even functions). I now think that if this is ever going to be implemented, both the seed and the value should be saved when possible.
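A small sketch of saving both pieces of information, assuming the entry layout below (invented for illustration). Note that `term_to_binary/1` also handles anonymous functions, with the caveat that a deserialized fun is only callable while the defining module is loaded at the same version:

```elixir
entry = %{
  seed: 422_113,                    # example ExUnit seed
  rand_state: :rand.export_seed(),  # PRNG state (or :undefined if the process is unseeded)
  value: ["the", :failing, {:input, 1.5}]
}

binary = :erlang.term_to_binary(entry)
^entry = :erlang.binary_to_term(binary)

# Functions round-trip too, as long as the defining code is still loaded.
fun_binary = :erlang.term_to_binary(fn x -> x + 1 end)
restored_fun = :erlang.binary_to_term(fun_binary)
3 = restored_fun.(2)
```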
