
[Java API] IllegalStateException happens while running a model loaded from a SavedModel, and the graph instance can't close itself #51648

Closed
SennriSyunnga opened this issue Aug 24, 2021 · 4 comments
Assignees
Labels
comp:ops OPs related issues TF 1.15 for issues seen on TF 1.15 type:bug Bug

Comments

@SennriSyunnga

SennriSyunnga commented Aug 24, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, I will attach it below.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 && Windows 10 1909
  • TensorFlow installed from (source or binary): Java Maven
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>libtensorflow</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>libtensorflow_jni_gpu</artifactId>
            <version>1.15.0</version>
        </dependency>
  • TensorFlow version (use command below): 1.15.0

Describe the current behavior
I try to load a SavedModel exported from Keras. Everything works well until I try to run the session, and then a strange thing occurs: the code just can't continue, yet it throws no exception.
When I use the 'try/catch/finally' style instead of 'try-with-resources', I finally get the error message below:

java.lang.IllegalStateException: Error while reading resource variable dense_2/kernel from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/dense_2/kernel)
	 [[{{node dense_2/MatMul/ReadVariableOp}}]]

What's more, even though I got the error message, the graph instance can't close itself:
when I paused the test code in IDEA, I found execution stopped inside the Object.wait() method,
which means graph.refcount kept the value 1 the whole time
and the code never escaped from the graph.close() method.
To verify the correctness of the SavedModel, I tried loading the model in Python like below:

import tensorflow as tf
import numpy as np

export_path = "./test/"

input = np.random.random((1, 30))

with tf.Session(graph=tf.Graph()) as sess:
    loaded = tf.saved_model.loader.load(sess, ["serve"], export_path)
    graph = tf.get_default_graph()
    # print(graph.get_operations())
    x = sess.graph.get_tensor_by_name('rp_input:0')
    y = sess.graph.get_tensor_by_name('dense_2/Sigmoid:0')
    scores = sess.run(y, feed_dict={x: input})
    print("predict: %d" % (np.argmax(scores, 1)))

It works well and prints the prediction, so I think the problem may not lie in the model itself. (Maybe?)
I tried hard to find a solution or workaround on Stack Overflow and in the issues here.
I saw several problems similar to mine, but they all occur in Python, such as:
#28287
and
#22362
The second issue seems the most alike, but the model export method is different.
Standalone code to reproduce the issue
Here is my model:
model.zip
Here is the test code. Because it fails every time, I omit the code that closes the resources.

    public void test_09_justTestAPI() {
        float[] a = new float[]{1.53672f, 2.047399f, 1.42194f, 1.494959f, -0.69123f, -0.39482f, 0.236573f, 0.733827f, -0.531855f, -0.973978f, 1.704854f, 2.085134f, 1.615931f, 1.723842f, 0.102458f, -0.017833f, 0.693043f, 1.263669f, -0.217664f, -1.058611f, 1.300499f, 2.260938f, 1.156857f, 1.291565f, -0.42401f, -0.069758f, 0.252202f, 0.808431f, -0.189161f, -0.490556f};
        long[] shape = new long[]{1, 30};
        try {
            SavedModelBundle savedModelBundle = SavedModelBundle.load(".", "serve");
            Graph graph = savedModelBundle.graph();
            Tensor<Float> data = Tensor.create(shape, FloatBuffer.wrap(a));
            Session session = new Session(graph);
            Session.Runner runner = session.runner()
                    .feed("rp_input", data)
                    .fetch("dense_2/Sigmoid");
            float[][] res = new float[1][1];
            Tensor<?> out = runner.run().get(0);
            out.copyTo(res);
            BigDecimal pro = BigDecimal.valueOf(res[0][0]);
        } catch (Exception e) {
            throw e;
        }
    }

Other info / logs
The model is produced by WeBank's federated learning project (FATE).
In their code, the model is built from JSON using the Keras API:

def _load_model(nn_struct_json):
    return tf.keras.models.model_from_json(nn_struct_json, custom_objects={})

The JSON content is defined like this:

      "nn_define": {
        "class_name": "Sequential",
        "config": {
          "name": "sequential",
          "layers": [
            {
              "class_name": "RepeatVector",
              "config": {
                "name":"rp",
                "n":1
              }
            },
            {
              "class_name": "LSTM",
              "config": {
                "name":"lstm",
                "units":32
              }
            },
            {
              "class_name": "Dense",
              "config": {
                "name": "dense",
                "trainable": true,
                "dtype": "float32",
                "units": 64,
                "activation": "relu"
              }
            },
            {
              "class_name": "Dense",
              "config": {
                "name": "dense_2",
                "trainable": true,
                "dtype": "float32",
                "units": 1,
                "activation": "sigmoid"
              }
            }
          ]
        },
        "keras_version": "2.2.4-tf",
        "backend": "tensorflow"
      }

The model is saved by the code below:

    def export_model(self):
        with tempfile.TemporaryDirectory() as tmp_path:
            # try:
            #     tf.keras.models.save_model(self._model, filepath=tmp_path, save_format="tf")
            # except NotImplementedError:
            #     import warnings
            #     warnings.warn('Saving the model as SavedModel is still in experimental stages. '
            #                   'trying tf.keras.experimental.export_saved_model...')
            tf.keras.experimental.export_saved_model(self._model, saved_model_path=tmp_path)

            model_bytes = _zip_dir_as_bytes(tmp_path)

        return model_bytes

You can check the code at this link:
https://github.com/FederatedAI/FATE/blob/master/python/federatedml/nn/backend/tf_keras/nn_model.py
For reference, here is the log from my test code:

2021-08-24 15:11:00.368680: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: .
2021-08-24 15:11:00.377471: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2021-08-24 15:11:00.382175: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-08-24 15:11:00.390552: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2021-08-24 15:11:00.409095: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: .
2021-08-24 15:11:00.416363: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 47658 microseconds.

I'm really stuck on this problem.
I would appreciate it if someone could help me out. Many thanks!

@mohantym mohantym added TF 1.15 for issues seen on TF 1.15 comp:ops OPs related issues labels Aug 24, 2021
@mohantym
Contributor

Hi @SennriSyunnga! We see that you are using an old version of TensorFlow which is officially considered end of life. We recommend that you upgrade to version 2.6 and let us know if the issue still persists in newer versions. Please open a new issue in case you face any errors and we will get you the right help. Thanks!

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Aug 24, 2021
@SennriSyunnga
Author

Hi @SennriSyunnga! We see that you are using an old version of TensorFlow which is officially considered end of life. We recommend that you upgrade to version 2.6 and let us know if the issue still persists in newer versions. Please open a new issue in case you face any errors and we will get you the right help. Thanks!

Thank you for your help!
I kept the Java Maven TensorFlow version equal to that of the FATE project,
and I didn't realize that the Maven repository has artifacts newer than 1.15.0 until you told me.
I read https://github.com/tensorflow/tensorflow/tree/master/tensorflow/java first, and the Maven 1.15.0 link in the 'Quickstart' part really misled me.
I would appreciate it if someone could update the information on that page.
I'll try the 2.0+ artifact and give you the result as soon as I can.

@mohantym mohantym removed the stat:awaiting response Status - Awaiting response from author label Aug 25, 2021
@SennriSyunnga
Author

SennriSyunnga commented Aug 25, 2021


After switching to a newer version of the API, I fixed this problem by learning from this issue: https://github.com/tensorflow/java/issues/365

        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-core-api</artifactId>
            <version>0.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-core-api</artifactId>
            <version>0.3.1</version>
            <classifier>linux-x86_64</classifier>
        </dependency>

With this version of the artifact, I could finally run the init op via session.run("init") in Java, like the others who solved this problem in Python.
Then I corrected my code like this: (WATCH OUT! PLEASE SEE THE UPDATE PART BELOW!)

public void test_10_justTestAPI() {
        float[] a = new float[]{1.53672f, 2.047399f, 1.42194f, 1.494959f, -0.69123f, -0.39482f, 0.236573f, 0.733827f, -0.531855f, -0.973978f, 1.704854f, 2.085134f, 1.615931f, 1.723842f, 0.102458f, -0.017833f, 0.693043f, 1.263669f, -0.217664f, -1.058611f, 1.300499f, 2.260938f, 1.156857f, 1.291565f, -0.42401f, -0.069758f, 0.252202f, 0.808431f, -0.189161f, -0.490556f};
        try (SavedModelBundle savedModelBundle = SavedModelBundle.load(".", "serve")) {
            FloatNdArray m = StdArrays.ndCopyOf(new float[][]{a});
            try (TFloat32 data = TFloat32.tensorOf(m)) {
                try (Session session = savedModelBundle.session()) {
                    SignatureDef modelInfo = savedModelBundle.metaGraphDef().getSignatureDefMap().get("serving_default");
                    Map<String, TensorInfo> inputs = modelInfo.getInputsMap();
                    String inputName = null;
                    for (Map.Entry<String, TensorInfo> input : inputs.entrySet()) {
                        TensorInfo ti = input.getValue();
                        inputName = ti.getName();
                        break;
                    }
                    String outputName = null;
                    Map<String, TensorInfo> outputs = modelInfo.getOutputsMap();
                    for (Map.Entry<String, TensorInfo> output : outputs.entrySet()) {
                        outputName = output.getValue().getName();
                        break;
                    }
                    session.run("init");
                    Session.Runner runner = session.runner();
                    runner.feed(inputName, data)
                            .fetch(outputName);
                    try (TFloat32 out = (TFloat32) runner.run().get(0)) {
                        FloatNdArray matrix = StdArrays.ndCopyOf(new float[1][1]);
                        out.copyTo(matrix);
                        FloatDataBuffer floatDataBuffer = DataBuffers.ofFloats(1);
                        matrix.read(floatDataBuffer);
                        float[] res = new float[1];
                        floatDataBuffer.read(res);
                        BigDecimal pro = BigDecimal.valueOf(res[0]);
                    }
                }
            }
        } catch (Exception e) {
            logger.error(e);
            throw e;
        }
    }

Soon after that, I found out why the graph instance failed to close:
I had used the graph from savedModelBundle to construct a new session:

            SavedModelBundle savedModelBundle = SavedModelBundle.load(".", "serve");
            Graph graph = savedModelBundle.graph();
            Tensor<Float> data = Tensor.create(shape, FloatBuffer.wrap(a));
            Session session = new Session(graph);// ← this line

The correct way is to use savedModelBundle.session() directly.
I don't know whether this behavior is normal or not, but I hope my experience can help others.
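To make the close() hang easier to reason about, here is a toy, plain-Python sketch (my own illustration, not TensorFlow's actual implementation) of the reference counting that keeps a graph pinned while a session built on it is still open:

```python
class RefCountedGraph:
    """Toy model of a graph whose close() must wait for zero references."""

    def __init__(self):
        self.refcount = 0

    def new_session(self):
        self.refcount += 1  # each live session holds a reference to the graph
        return self

    def release(self):      # what closing the session would do
        self.refcount -= 1

    def can_close(self):
        # close() blocks (Object.wait() in the Java stack trace) until
        # the reference count drops back to zero
        return self.refcount == 0


graph = RefCountedGraph()
session = graph.new_session()   # like `new Session(graph)` on the bundle's graph
print(graph.can_close())        # False: graph.close() would hang here
session.release()               # release the session's reference first
print(graph.can_close())        # True: now graph.close() can return
```

This is consistent with the refcount staying at 1: the extra Session constructed on the bundle's graph held a reference that was never released before graph.close() was called.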

Update
I found that session.run("init") is unnecessary and may be harmful.
Once you run("init"), you get a totally different prediction, presumably because the init op re-initializes the variables that the restore op had just loaded.
So it is not good practice. Don't follow it.
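As a toy, plain-Python illustration (not TensorFlow code; the variable name matches the error message, but the 0.73 weight value is made up) of why the extra init run changes the result:

```python
# The restore op loads the trained weights; a later init op overwrites
# them with fresh initial values, discarding the training.
variables = {}

def run_restore():
    variables["dense_2/kernel"] = 0.73  # trained value from the checkpoint

def run_init():
    variables["dense_2/kernel"] = 0.0   # re-initialized, training lost

run_restore()                           # done by SavedModelBundle.load(...)
print(variables["dense_2/kernel"])      # 0.73 -> correct predictions
run_init()                              # extra session.run("init")
print(variables["dense_2/kernel"])      # 0.0  -> totally different result
```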
