This repository has been archived by the owner on Sep 1, 2023. It is now read-only.

Saving and loading a model repeatedly causes it to break #3820

Closed
melon3r opened this issue Mar 21, 2018 · 10 comments · Fixed by #3826

Comments

@melon3r

melon3r commented Mar 21, 2018

Hi!

I'm feeding data to a model in small batches, saving the model to disk at the end of each batch, and loading it again for the next one. After a few batches, the model stops working and throws the following error when calling model.run(input):

Traceback (most recent call last):
  File "./anomalies.py", line 63, in <module>
    result = model.run(input)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/frameworks/opf/htm_prediction_model.py", line 448, in run
    inferences = self._anomalyCompute()
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/frameworks/opf/htm_prediction_model.py", line 696, in _anomalyCompute
    self._getAnomalyClassifier().compute()
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/engine/__init__.py", line 433, in compute
    return self._region.compute()
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/bindings/engine_internal.py", line 1499, in compute
    return _engine_internal.Region_compute(self)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/bindings/regions/PyRegion.py", line 184, in guardedCompute
    return self.compute(inputs, DictReadOnlyWrapper(outputs))
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/regions/knn_anomaly_classifier_region.py", line 326, in compute
    self._classifyState(record)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/regions/knn_anomaly_classifier_region.py", line 405, in _classifyState
    self._addRecordToKNN(state)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/regions/knn_anomaly_classifier_region.py", line 490, in _addRecordToKNN
    knn.learn(pattern, category, rowID=rowID)
  File "/home/dani/.local/lib/python2.7/site-packages/nupic/algorithms/knn_classifier.py", line 537, in learn
    inputPattern = numpy.dot(self._vt, inputPattern - self._mean)
ValueError: operands could not be broadcast together with shapes (65536,) (0,)
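For readers unfamiliar with the final line of the traceback: NumPy can only combine two arrays element-wise when, comparing their shapes from the last dimension backwards, each pair of sizes is equal or one of them is 1. The plain-Python sketch below re-implements that rule for illustration; `can_broadcast` is a hypothetical helper, not part of NumPy or NuPIC. It suggests that one operand of `inputPattern - self._mean` was an empty array, though the traceback alone does not say which one.

```python
def can_broadcast(shape_a, shape_b):
    """Return True if two array shapes are compatible under NumPy's broadcasting rule."""
    # Walk both shapes from the trailing dimension; a missing leading
    # dimension is always compatible, so zip() stopping early is fine.
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


# Shapes taken from the error message above.
print(can_broadcast((65536,), (0,)))      # False: 65536 vs 0, neither is 1
print(can_broadcast((65536,), (65536,)))  # True: sizes match
```

An array of length 0 is therefore incompatible with any array whose length is not 0 or 1, which matches the error reported here.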

Here's the code used to load and store the model:

with open(model_file, 'r') as f:
    model = HTMPredictionModel.readFromFile(f)
with open(model_file, 'w') as f:
    model.writeToFile(f)

I've tried using a model generated from a previous batch and skipping some batches of data, to find out if it was the data that was somehow generating a bad model, but after the same number of batches, no matter their contents, I get to a broken model again. Thus, I suspect a bug is being triggered at readFromFile or writeToFile (or maybe I'm just doing it wrong).

This is with Python 2.7.9, and nupic 1.0.3 from pypi.

@rhyolight
Member

Hey @lscheinkman and @scottpurdy, this might be another report similar to #3783.

@melon3r Can you perhaps attach some code we can run to replicate this?

@ghost

ghost commented Mar 22, 2018

@melon3r Can you try this...it's working fine for our project. We also found that you can compress the binary data here quite a bit...

from nupic.frameworks.opf.htm_prediction_model import HTMPredictionModel

def serialize_htm(htm_model):
    # Build a Cap'n Proto message from the model's schema and write the model into it.
    proto = HTMPredictionModel.getSchema()
    builder = proto.new_message()
    htm_model.write(builder)
    return builder.to_bytes_packed()  # returns binary data of htm_model

def deserialize_htm(htm_buffer):
    # Parse the packed bytes into a reader and rebuild the model from it.
    proto = HTMPredictionModel.getSchema()
    reader = proto.from_bytes_packed(htm_buffer)
    return HTMPredictionModel.read(reader)  # returns htm_model from binary data

Also, there is currently a minor bug in NuPIC (#3805): if you serialize and then deserialize a model without processing any samples in between, it will error out.

@melon3r
Author

melon3r commented Mar 22, 2018

Hey @kyle-sorensen, thank you for the tip, but it didn't work out for me. The model breaks at the exact same point.

@melon3r Can you perhaps attach some code we can run to replicate this?

@rhyolight I'll try to build a small script to reproduce it and share it ;)

@rhyolight
Member

Thanks @melon3r. Numenta engineer @lscheinkman is working on updating our regression test suite so that we serialize our models in the middle of running the NAB data set, then continue after de-serialization. We hope to see this test fail so we can fix the issue and update the source code. Your script might still be helpful, so please continue with it if you can.

@melon3r
Author

melon3r commented Mar 26, 2018

I found the "issue". 🤦‍♂️

Trying to replicate it I found it was always failing at the same record, the 2184th, with this config in the model parameters: 'autoDetectWaitRecords': 2184

I just copied it from the HotGym example, so I don't even understand it... Can you help?

@rhyolight
Member

@melon3r Can you try either removing it from the configuration or (if that doesn't work) making it extremely large? Then try again? If it works at least we know what to fix.

@rhyolight rhyolight reopened this Mar 26, 2018
@melon3r
Author

melon3r commented Mar 28, 2018

Hi @rhyolight,

Removing it from the configuration gave it a default value of 4000. I could configure it to be very high, but I don't think that's how it's supposed to be run in production? Are models not supposed to run indefinitely?

What is this setting actually doing? While debugging the error I found that once this number of records has been processed, the flow changes and the model starts doing something with a KNN anomaly classification region, which it didn't before. What's the difference between the process before and after this threshold is reached?

@rhyolight
Member

rhyolight commented Mar 28, 2018

It has to do with something unrelated to HTM. It is a legacy setting that is just causing trouble, and we should remove it. It is not affecting how the HTM runs, it's just expressing a bug. Set it to 999999999.
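As a rough sketch of that workaround, the helper below returns a copy of a model-parameters dictionary with every `autoDetectWaitRecords` entry raised to 999999999. The function name `with_wait_records` and the nesting used in the example dictionary are assumptions made for illustration; check where the key actually sits in your own parameters. The helper searches recursively, so it does not depend on that nesting.

```python
import copy


def with_wait_records(params, wait_records=999999999):
    """Return a deep copy of params with each 'autoDetectWaitRecords' value replaced."""
    updated = copy.deepcopy(params)  # leave the caller's dictionary untouched
    stack = [updated]
    while stack:
        node = stack.pop()
        for key, value in list(node.items()):
            if key == "autoDetectWaitRecords":
                node[key] = wait_records
            elif isinstance(value, dict):
                stack.append(value)  # descend into nested sections
    return updated


# Hypothetical layout; only the key name and the value 2184 come from this thread.
example_params = {"modelParams": {"anomalyParams": {"autoDetectWaitRecords": 2184}}}
patched = with_wait_records(example_params)
print(patched["modelParams"]["anomalyParams"]["autoDetectWaitRecords"])
```

The patched dictionary would then be passed to whatever code creates the model in place of the original parameters.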

@melon3r
Author

melon3r commented Apr 4, 2018

Alright, thanks. 999999999 makes for about 1900 years of records at one record per minute, so I guess it'll be good :)
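The estimate can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 365-day years; the function name `years_until_threshold` is made up for the example.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # ignores leap years


def years_until_threshold(wait_records, records_per_minute=1.0):
    """Years of input needed before wait_records records have been processed."""
    return wait_records / (records_per_minute * MINUTES_PER_YEAR)


# Roughly 1900 years at one record per minute.
print(years_until_threshold(999999999))
```

At higher input rates the threshold is reached proportionally sooner, so the same check is worth repeating for the actual record frequency.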

@melon3r melon3r closed this as completed Apr 4, 2018
@rhyolight
Member

@lscheinkman found that this was still happening when he started writing more tests for #3808.
