
re: Memory leak problem (tensorflow backend) #343

Closed
Adnane017 opened this issue Jul 9, 2019 · 26 comments
Labels
topic: performance something to do with system performance


@Adnane017

Hi,

Thanks for your quick answer to issue #342. Unfortunately, the option "clear_tf_session = True" doesn't help; I had already tried it without success. I also tried putting gc.collect() at the start of the function where the model is built.

Finally, I played with a toy example (a dataset with 6000 training examples), and I noticed that after the iterations finish, the memory used by Python is very large in comparison with the memory usage at the start, even if I remove all the variables in the workspace and call gc.collect().

I've attached my code for reference. It is a simple problem where I try to predict the verb tense in a sentence (7 possible tenses) from the surrounding words (limited to the 10000 most frequent words).
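Roughly, the placement looks like this (a simplified sketch, not the attached code; SentenceGenerator and build_network are placeholders for my own classes/functions):

import gc

def tense_model(x_train, y_train, x_val, y_val, params):
    # attempt to release leftover memory before building the next model
    gc.collect()

    # generators are created here because they need params['batch_size']
    train_gen = SentenceGenerator(x_train, y_train, batch_size=params['batch_size'])
    valid_gen = SentenceGenerator(x_val, y_val, batch_size=params['batch_size'])

    model = build_network(params)        # placeholder for the actual FNN definition
    out = model.fit_generator(generator=train_gen,
                              validation_data=valid_gen,
                              epochs=params['epochs'],
                              verbose=0)
    return out, model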

Cheers,

FNN_tense1Verb_paramsearch.zip

@Adnane017
Author

Any suggestions?

@mikkokotila mikkokotila added the investigation gathering information label Jul 9, 2019
@mikkokotila
Contributor

One thing that immediately sticks out from your model is that you create the generator in each iteration. That might have something to do with your problem.

I suggest looking at #11 for examples on how you might want to do it instead.
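In other words, roughly like this (a sketch reusing the placeholder names from your snippet above; the point is only that the generators are instantiated once, outside the function Talos calls on every permutation):

train_gen = SentenceGenerator(x_train, y_train, batch_size=50)   # built once, before Scan()
valid_gen = SentenceGenerator(x_val, y_val, batch_size=50)

def tense_model(x_train, y_train, x_val, y_val, params):
    model = build_network(params)        # placeholder for the model definition
    out = model.fit_generator(generator=train_gen,
                              validation_data=valid_gen,
                              epochs=params['epochs'],
                              verbose=0)
    return out, model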

@Adnane017
Author

Thanks for pointing out this potential problem. I've put the generator inside the function that builds the model because it uses one of the grid parameters (batch_size). I'll see if your suggestion helps by fixing the batch_size for now.

@Adnane017
Author

Hi,

Sorry I could not work on this earlier. There is definitely a memory leak, and your suggestion to call the generator once outside the Talos function helps reduce its impact. It seems that each time I call the generator class, the memory load increases more than it did on the previous call. However, if I remove the generator call from the Talos function, I can no longer tune the batch size. One thing that helped a little is to use only one worker:

out = model.fit_generator(generator=train_gen,
                          validation_data=valid_gen,
                          epochs=params['epochs'],
                          use_multiprocessing=True,
                          verbose=0,
                          workers=1)

By doing so, I was able to run a few more iterations with Talos, although the iterations were slower. But at a certain point, the program just stopped producing output. Any thoughts?

@mikkokotila
Contributor

Have you profiled your code in a loop without Talos? Just run it as a standalone Keras model with everything else the same, but run it in a for loop and see what happens.
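Something like this, as a rough sketch (a dummy model and random data just to illustrate the loop; substitute your own model and generators):

import gc
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

x = np.random.rand(1000, 100)
y = np.eye(7)[np.random.randint(0, 7, 1000)]   # dummy 7-class targets

for i in range(24):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=100))
    model.add(Dense(7, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(x, y, epochs=1, batch_size=50, verbose=0)

    # manual cleanup between iterations
    del model
    K.clear_session()
    gc.collect()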

@Adnane017
Author

Not yet, but I'll do it and let you know. I am also thinking that maybe this has to do with Keras rather than Talos, but let's see.

Thanks

@mikkokotila
Contributor

Thanks. I will be very interested to know what you learn. I know there are a lot of issues people have reported regarding the use of generators in Keras models, which is consistent with my own experience.

@mikkokotila mikkokotila added the topic: keras relates with keras backend label Jul 24, 2019
@mikkokotila
Contributor

@Adnane017 any updates on this?

@Adnane017
Author

Hi @mikkokotila. Really sorry for the delay in sending an update on this (and that you have been pushing for a solution instead of me doing it). I was practically unavailable for the last two weeks (away at a conference). I have also noticed that many Keras users have reported memory leaks. Someone told me he experienced the same problem with TensorFlow, so it might even have to do with TensorFlow rather than Keras or Talos.

I will run the test you suggested and let you know later this week.

Many thanks again.

@Jair-Ai

Jair-Ai commented Aug 3, 2019

I have the same problem. The first iteration runs very fast, but the video memory is not released, so the second iteration starts with the GPU memory already full. Then the nightmare begins, with the hyperparameter search taking up to 3 days. I have already searched the internet and haven't found a solution.

@mikkokotila
Contributor

@TraderJair sorry to hear that you have joined the sizable group of people frustrated with using generators in Keras on the TensorFlow backend.

In the new Talos docs there is an example you can try. Try it first with the example as it is, and if everything goes fine (it works very well even on a low-end laptop), then replace it with your own data (which I assume is much bigger).

@Adnane017, could you try the same?

@Adnane017
Author

Back to this:

Have you profiled your code in a loop without Talos? Just run it as a standalone Keras model with everything else the same, but run it in a for loop and see what happens.

I have now run some tests with a toy example using (1) Keras, (2) Keras with manual memory release (del model + clear_session()) and (3) Talos with the clear_tf_session option on. My model is a basic LSTM run on 1000 sentences 24 times. To profile memory usage, I used the memory_profiler package. Here are the results as graphs:

(1) Keras without memory release:
[figure: Figure_profiling_keras_noreset]

(2) Keras with memory release:
[figure: Figure_profiling_keras_withclear]

(3) Talos with clear_tf_session:
[figure: Figure_profiling_talos_withreset]

It is clear that manually releasing memory when working with Keras helps. With Talos, there is definitely a memory leak (at least with the way I am using it), but the leak gets smaller after a few iterations. Of course, all this depends on the model parameters in each iteration, which in turn determine the size of the model Python is working with. I suspect that what causes the problem at large scale is those abrupt increases in memory usage after each new iteration.
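For reference, the profiling can be driven along these lines with memory_profiler (a sketch; run_all_iterations and the profiling_test.py filename are placeholders for my actual script):

from memory_profiler import profile

@profile
def run_all_iterations():
    # placeholder for the loop that builds, trains and releases the model 24 times
    ...

if __name__ == '__main__':
    run_all_iterations()

# line-by-line text report:
#   python -m memory_profiler profiling_test.py
# memory-over-time graphs (like the ones above):
#   mprof run profiling_test.py
#   mprof plot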

I have yet to try your new suggestion. It seems different from what I used in my example. I'll post the results once done.

@Adnane017
Author

Unfortunately, following the same structure as in the example did not help in my case (see the graph). For info, before running the new simulations, I updated Talos to the daily GitHub version (I noticed that you no longer offer the option to clear the TF session).
[figure: Figure_profiling_talos_newgenerator_andtalosversion]

@mikkokotila
Contributor

@Adnane017 This is really wonderful work. I love it :)

The name has been changed to clear_session now, so that we support all Keras backends and not just TensorFlow.

Can you share the text output from cProfile or similar, so we can see exactly what is eating the memory? That way there is no doubt. Ideally, given that you have already put time into this, you could create a repo with the code you have and the profiler text outputs, and I will then repeat the exact same experiment on two different systems.

About your experiment, did I understand correctly that you have done the following:

  • you have configured a model which is the same for Talos and Keras
  • that model is configured exactly the same for both cases
  • you run it in a loop (and in Talos by just changing a vanity hyperparameter) n times

i.e. the three graphs above are "apples-to-apples".

Performance is definitely a key concern/focus, and we shall get to the bottom of this! :)

Thanks again so far. Great work.

@mikkokotila
Contributor

I updated Talos to the daily Github version (I noticed that you no longer offer the option to clear the TF session).

Can you try with clear_session=True and see how that looks?

@mikkokotila
Contributor

mikkokotila commented Aug 6, 2019

...having looked at both of the graphs you provided, neither of them implies a memory leak. The characteristic of a memory leak is continuously building memory pressure, as with the Keras example without manual release. That does not seem to be the case with either of the Talos graphs. I kind of "know" there is no memory leak because I have thousands of hours of use myself, zero reported issues on it, and my background is in high-performance industry systems that handle hundreds of billions of events per day, where considerable attention has gone into this topic, including extensive profiling of memory use.

What alarms me here is that there is an added memory cost in comparison to Keras, which should be marginal at worst (which is not the case in your experiment). So if your experiment is indeed apples-to-apples, and it is not that Talos changes the parameters while the Keras version does not, then this will definitely become a top priority in terms of fixes.

@mikkokotila
Contributor

mikkokotila commented Aug 7, 2019

@Adnane017 I'm working on this currently, and indeed, with a SequenceGenerator there is a slight memory buildup in Talos vs. Keras when using exactly the same release method in both. This was a 5-minute test with the MNIST dataset and a CNN. I will repeat it once with everything in the same run, and then increase the time to 30 minutes and see what happens. This test was done as objectively as possible given the limitations of a non-performance-lab setting.

This might take a couple of days, as I also want to repeat the test on another OS (I'm on Mac OSX) as well as without a data generator. After that I should be able to publish conclusive results, hopefully together with a fix for whatever is causing this.

Regarding the added memory cost your plots imply, can you share the params dictionary you are using? I can't replicate that; instead I see that Keras with memory release and Talos with clear_session both consume the same base level of memory, on top of which there is a slight gradual build-up in Talos.

@mikkokotila mikkokotila added priority: HIGH highest priority and removed topic: keras relates with keras backend labels Aug 7, 2019
@mikkokotila
Contributor

mikkokotila commented Aug 7, 2019

Turns out I was wrong. Thanks for your persistence in this matter @Adnane017 :)

Experiment

Here I've used the MNIST dataset with a simple CNN, and batch_size 256 for 2 epochs. The cycle is repeated 10 times without changing anything else. "Keras release" means exactly the process that is used in Talos when Scan(...clear_session=True...).

Results

Talos with generator: [figure]

Keras with generator: [figure]

Talos without generator: [figure]

Keras without generator: [figure]

This is very clear: there is indeed a memory leak, albeit a small one. The issue, of course, is that it's not uncommon for a researcher to have a very large dataset and an experiment that runs for several days at a time.

Next Steps

  • Try simple means to identify what is causing the buildup
  • If simple means fail, repeat the above experiment with Talos <0.4.6, i.e. before the move to ParamSpace-based parameter management

@Adnane017
Author

@mikkokotila to respond to these 3 questions (and to some of your subsequent comments):

About your experiment, did I understand correct that you have done the following:

  • you have configured a model which is the same for Talos and Keras
  • that model is configured exactly the same for both cases
  • you run it in a loop (and in Talos by just changing a vanity hyperparameter) n times
    i.e. the three graphs above are "apples-to-apples".

I used the same model configuration for both Keras and Talos (an LSTM with the same input and output dimensions), but did not use exactly the same hyperparameters. For Talos, I allowed the parameters to vary - probably the most relevant parameters for the comparison are 'number of epochs' ([1, 5, 10]) and 'batch size' ([10, 25, 50, 100, 200]). For Keras, I used a fixed combination of parameters (epochs = 10 and batch_size = 50). So I think Talos had an advantage here, at least from a run-time perspective, since for many simulations it ran for only 1 or 5 epochs. Looking at the text generated by memory_profiler (unfortunately I did not save it, so I cannot share it with you), I noticed that the memory use/leak is higher for large batches, which I think points to a problem with the generator.
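As a rough illustration, the relevant part of the setup looked like this (only 'epochs' and 'batch_size' are the values quoted above; the other entries of the dictionary are omitted here):

p = {'epochs': [1, 5, 10],
     'batch_size': [10, 25, 50, 100, 200]}
     # ...plus the other hyperparameters that were allowed to vary

# fixed combination used for the plain Keras runs:
keras_params = {'epochs': 10, 'batch_size': 50}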

I will soon run some simulations at large scale with a basic program that does grid search with Keras directly to see if it can run without problems. That will tell us for sure whether there is something in the current version of Talos that is causing the memory leak or whether it is purely due to Keras.

@Adnane017
Author

@mikkokotila Really great stuff! You've got the answer to what I wanted to find out. But that is weird, because you only need two lines, which you already have: 'clear_session()' and 'del model', right? Looking forward to knowing what is causing the problem.

@mikkokotila
Contributor

Mystery solved

Inside the scan_object (resulting from Scan()) there is something called saved_weights, which, as its name suggests, holds the saved weights for that model. Basically it's a really many-dimensional numpy array. The bigger / more complex the network, the more dimensions and the more weights. For the case I have used above, the following is true:

Increase in size of scan_object.saved_weights after 1 round
Recursive approach #1 : 11083416
Recursive approach #2 : 11083808

I've used two different approaches to compute the recursive memory size of the stored object, and it matches exactly the increase in memory between each round. So it is in fact by-design functionality of Talos: we store the weights so we can evaluate and pick the winning model once the experiment is over, and deploy it if needed.
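For illustration, the growth can be approximated along these lines (a sketch, not the exact helpers I used; it assumes saved_weights holds one list of numpy weight arrays per round):

def saved_weights_nbytes(saved_weights):
    # sum the raw buffer sizes of all stored weight arrays
    return sum(arr.nbytes
               for round_weights in saved_weights
               for arr in round_weights)

# print(saved_weights_nbytes(scan_object.saved_weights))
# dividing by the number of completed rounds gives the per-round cost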

In the example case, with the MNIST dataset and the simple CNN, the increase is roughly 11 MB per round. In other words, about 1 GB for every 100 rounds. That is the cost of storing those weights.

I think bigger models generally run on bigger systems and end up running for a smaller number of rounds, so this would generally be a non-issue, especially when the standard behaviour of Keras is as we witnessed it to be in the tests.

I guess there are a couple of things that could be done anyhow, as I can see how this would be an invisible issue, particularly for those less savvy with system performance who might never look at the performance indicators of their systems, i.e. we can't simply leave it be.

The most obvious thing is to add an argument Scan(...save_weights=True), and if it's set to False then that's it: Talos will have exactly the same memory profile as Keras with manual release/cleanup.

Another, more nuanced, approach is to have Scan(...save_weights_if...) that accepts a lambda function with a value and a string indicating a metric or loss.
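For the first option, usage would look roughly like this (a sketch; the other Scan() arguments shown are just the usual ones, and the variable names are illustrative):

import talos

scan_object = talos.Scan(x=x, y=y,
                         params=p,
                         model=tense_model,
                         clear_session=True,
                         save_weights=False)   # skip storing per-round weights in scan_object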

@mikkokotila mikkokotila self-assigned this Aug 7, 2019
mikkokotila added a commit that referenced this issue Aug 7, 2019
- Related to #343: it's now possible to avoid saving model weights in `scan_object`, which might be desirable for very long runs with very large networks, due to the memory cost of keeping the weights throughout the experiment.
- fixed a small bug in `AutoParams` where choosing `network=False` resulted in 'dense' being split into characters
- `max_param_values` is now optional in `AutoScan`; issue #367 was created to handle the underlying problem properly
- fixed all the tests according to the change in `AutoScan` arguments
@mikkokotila mikkokotila added fixed-daily-dev and removed priority: HIGH highest priority investigation gathering information labels Aug 7, 2019
@mikkokotila mikkokotila added the topic: performance something to do with system performance label Aug 7, 2019
@mikkokotila
Contributor

mikkokotila commented Aug 8, 2019

Talos with Scan(...save_weights=False...)

In short summary, when you have save_weights=False the memory profile looks exactly the same as Keras with manual memory release, i.e. the best imaginable scenario.

@Adnane017 thanks again for your help with this and for pointing it out. Saving the weights in the scan_object is a relatively recent addition, and this is a very good reminder that memory profiling and system performance benchmarking have to be baked into local testing. Related to this, I've opened #371.

Have a great day as well!

@Adnane017
Author

Happy ending then :-).

Can't you save the weight matrices to the hard drive instead of keeping them in RAM? For example, I know they do a similar thing in this package: https://pypi.org/project/pyndl/. For at least one function, they store temporary files on the hard drive while the model is being trained (see the function pyndl.ndl.ndl - https://github.com/quantling/pyndl/blob/master/pyndl/ndl.py).

Also, when will you release a stable version on PyPI that includes the fix? The software engineering team at our university is very strict about the software they will install on the high-performance computing server (they will only install packages that are stable and do not conflict with existing packages).

Many thanks for fixing the problem!

@mikkokotila
Contributor

@Adnane017 thanks for the pointer. I guess we could do that with HDF5 in the Keras way, and then load the weights back once the experiment has ended. That does make sense. I'll create a ticket for it, but it might take time as we have now taken the pressure off this issue.
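The idea would be roughly this (a sketch only, not current Talos behaviour; build_network, round_id, best_params and best_round are placeholders):

# during the experiment, write each round's weights to disk instead of keeping them in RAM
model.save_weights('weights_round_%d.h5' % round_id)

# after the experiment, rebuild the winning architecture and load its weights back
best_model = build_network(best_params)
best_model.load_weights('weights_round_%d.h5' % best_round)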

Regarding stability, I would say Talos is generally very stable in the grander scheme of things. If you go through the tickets you will see that there have been very few actual bug reports for many months already, and we make a good effort to keep things that way. I think that's quite rare generally, but especially in this field. Specifically with PyPI, I think v0.6.3 will be the first v0.6 release to make it to PyPI, which is probably a week or two away. Later this week it will become master, and the usual practice is to have a buffer of at least a week between master and production (PyPI).

@Adnane017
Author

No worries, I'll wait for the next release then. What I meant is that they won't agree to install a version from GitHub, and they will check that it doesn't conflict with other packages. For the time I have been using the package, I agree that I haven't encountered any problems other than the memory leak.
