
[Java API] IllegalStateException happens while running a model loaded from a SavedModel, and the graph instance can't close itself #51648

Closed
SennriSyunnga opened this issue Aug 24, 2021 · 4 comments
Assignees
Labels
comp:ops OPs related issues TF 1.15 for issues seen on TF 1.15 type:bug Bug

Comments

@SennriSyunnga

SennriSyunnga commented Aug 24, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, I will attach it below.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 && Windows 10 1909
  • TensorFlow installed from (source or binary): Java Maven
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>libtensorflow</artifactId>
            <version>1.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>libtensorflow_jni_gpu</artifactId>
            <version>1.15.0</version>
        </dependency>
  • TensorFlow version (use command below): 1.15.0

Describe the current behavior
I try to load a SavedModel exported from Keras. Everything works well until I try to run the session, and then a strange thing occurs: the code just can't continue, yet it throws no exception.
When I use the 'try/catch/finally' style instead of 'try-with-resources', I finally get the error message below:

java.lang.IllegalStateException: Error while reading resource variable dense_2/kernel from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/dense_2/kernel)
	 [[{{node dense_2/MatMul/ReadVariableOp}}]]

What's more, even though I got the error message, the graph instance can't close itself:
when I paused the test code in IDEA, I found execution stopped inside the Object.wait() method,
which means graph.refcount kept the value 1 the whole time
and the code never escaped from the graph.close() method.
To verify the correctness of the SavedModel, I tried loading the model in Python like below:

import tensorflow as tf
import numpy as np

export_path = "./test/"

input = np.random.random((1, 30))

with tf.Session(graph=tf.Graph()) as sess:
    loaded = tf.saved_model.loader.load(sess, ["serve"], export_path)
    graph = tf.get_default_graph()
    # print(graph.get_operations())
    x = sess.graph.get_tensor_by_name('rp_input:0')
    y = sess.graph.get_tensor_by_name('dense_2/Sigmoid:0')
    scores = sess.run(y, feed_dict={x: input})
    print("predict: %d" % (np.argmax(scores, 1)))

It works well and prints the prediction, so I think the problem may not lie in the model itself. (Maybe?)
I tried hard to find a solution or workaround on Stack Overflow and in the issues here.
I saw several problems similar to mine, but they all occur in Python, such as:
#28287
and
#22362
The second issue seems the most alike, but the model export method is different.
Standalone code to reproduce the issue
Here is my model:
model.zip
Here is the test code. Because it fails every time, I omit the code that closes the resources.

    public void test_09_justTestAPI() {
        float[] a = new float[]{1.53672f, 2.047399f, 1.42194f, 1.494959f, -0.69123f, -0.39482f, 0.236573f, 0.733827f, -0.531855f, -0.973978f, 1.704854f, 2.085134f, 1.615931f, 1.723842f, 0.102458f, -0.017833f, 0.693043f, 1.263669f, -0.217664f, -1.058611f, 1.300499f, 2.260938f, 1.156857f, 1.291565f, -0.42401f, -0.069758f, 0.252202f, 0.808431f, -0.189161f, -0.490556f};
        long[] shape = new long[]{1, 30};
        try {
            SavedModelBundle savedModelBundle = SavedModelBundle.load(".", "serve");
            Graph graph = savedModelBundle.graph();
            Tensor<Float> data = Tensor.create(shape, FloatBuffer.wrap(a));
            Session session = new Session(graph);
            Session.Runner runner = session.runner()
                    .feed("rp_input", data)
                    .fetch("dense_2/Sigmoid");
            float[][] res = new float[1][1];
            Tensor<?> out = runner.run().get(0);
            out.copyTo(res);
            BigDecimal pro = BigDecimal.valueOf(res[0][0]);
        } catch (Exception e) {
            throw e;
        }
    }

Other info / logs
The model is produced by WeBank's federated learning project (FATE).
In their code, the model is built from JSON using the Keras API:

def _load_model(nn_struct_json):
    return tf.keras.models.model_from_json(nn_struct_json, custom_objects={})

The JSON content is defined like this:

      "nn_define": {
        "class_name": "Sequential",
        "config": {
          "name": "sequential",
          "layers": [
            {
              "class_name": "RepeatVector",
              "config": {
                "name":"rp",
                "n":1
              }
            },
            {
              "class_name": "LSTM",
              "config": {
                "name":"lstm",
                "units":32
              }
            },
            {
              "class_name": "Dense",
              "config": {
                "name": "dense",
                "trainable": true,
                "dtype": "float32",
                "units": 64,
                "activation": "relu"
              }
            },
            {
              "class_name": "Dense",
              "config": {
                "name": "dense_2",
                "trainable": true,
                "dtype": "float32",
                "units": 1,
                "activation": "sigmoid"
              }
            }
          ]
        },
        "keras_version": "2.2.4-tf",
        "backend": "tensorflow"
      }

The model is saved by the code below:

    def export_model(self):
        with tempfile.TemporaryDirectory() as tmp_path:
            # try:
            #     tf.keras.models.save_model(self._model, filepath=tmp_path, save_format="tf")
            # except NotImplementedError:
            #     import warnings
            #     warnings.warn('Saving the model as SavedModel is still in experimental stages. '
            #                   'trying tf.keras.experimental.export_saved_model...')
            tf.keras.experimental.export_saved_model(self._model, saved_model_path=tmp_path)

            model_bytes = _zip_dir_as_bytes(tmp_path)

        return model_bytes

You can check the code at this link:
https://github.com/FederatedAI/FATE/blob/master/python/federatedml/nn/backend/tf_keras/nn_model.py
For reference, here is the log from my test code:

2021-08-24 15:11:00.368680: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: .
2021-08-24 15:11:00.377471: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2021-08-24 15:11:00.382175: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-08-24 15:11:00.390552: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2021-08-24 15:11:00.409095: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: .
2021-08-24 15:11:00.416363: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 47658 microseconds.

I'm really stuck on this problem.
I would appreciate it if someone could help me out. Many thanks!

@mohantym mohantym added TF 1.15 for issues seen on TF 1.15 comp:ops OPs related issues labels Aug 24, 2021
@mohantym
Contributor

Hi @SennriSyunnga! We see that you are using an old version of TensorFlow which is officially considered end of life. We recommend that you upgrade to version 2.6 and let us know if the issue still persists in newer versions. Please open a new issue in case you face any errors and we will get you the right help. Thanks!

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Aug 24, 2021
@SennriSyunnga
Author

Hi @SennriSyunnga! We see that you are using an old version of TensorFlow which is officially considered end of life. We recommend that you upgrade to version 2.6 and let us know if the issue still persists in newer versions. Please open a new issue in case you face any errors and we will get you the right help. Thanks!

Thank you for your help!
I kept the Java Maven TensorFlow version equal to that of the FATE project,
and I didn't realize that the Maven repository has artifacts newer than 1.15.0 until you told me.
I read https://github.com/tensorflow/tensorflow/tree/master/tensorflow/java first, and the Maven 1.15.0 link in the 'Quickstart' part really misled me.
I would appreciate it if someone could update the information on that page.
I'll try the 2.0+ artifact and give you the result as soon as I can.

@mohantym mohantym removed the stat:awaiting response Status - Awaiting response from author label Aug 25, 2021
@SennriSyunnga
Author

SennriSyunnga commented Aug 25, 2021


After switching to a newer version of the API, I fixed this problem by learning from this issue: https://github.com/tensorflow/java/issues/365

        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-core-api</artifactId>
            <version>0.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-core-api</artifactId>
            <version>0.3.1</version>
            <classifier>linux-x86_64</classifier>
        </dependency>

With this version of the artifact, I could finally run the init op via session.run("init") in Java, like the others who solved this problem in Python.
Then I corrected my code like this: (WATCH OUT! PLEASE SEE THE UPDATE PART BELOW!)

public void test_10_justTestAPI() {
        float[] a = new float[]{1.53672f, 2.047399f, 1.42194f, 1.494959f, -0.69123f, -0.39482f, 0.236573f, 0.733827f, -0.531855f, -0.973978f, 1.704854f, 2.085134f, 1.615931f, 1.723842f, 0.102458f, -0.017833f, 0.693043f, 1.263669f, -0.217664f, -1.058611f, 1.300499f, 2.260938f, 1.156857f, 1.291565f, -0.42401f, -0.069758f, 0.252202f, 0.808431f, -0.189161f, -0.490556f};
        try (SavedModelBundle savedModelBundle = SavedModelBundle.load(".", "serve")) {
            FloatNdArray m = StdArrays.ndCopyOf(new float[][]{a});
            try (TFloat32 data = TFloat32.tensorOf(m)) {
                try (Session session = savedModelBundle.session()) {
                    SignatureDef modelInfo = savedModelBundle.metaGraphDef().getSignatureDefMap().get("serving_default");
                    Map<String, TensorInfo> inputs = modelInfo.getInputsMap();
                    String inputName = null;
                    for (Map.Entry<String, TensorInfo> input : inputs.entrySet()) {
                        TensorInfo ti = input.getValue();
                        inputName = ti.getName();
                        break;
                    }
                    String outputName = null;
                    Map<String, TensorInfo> outputs = modelInfo.getOutputsMap();
                    for (Map.Entry<String, TensorInfo> output : outputs.entrySet()) {
                        outputName = output.getValue().getName();
                        break;
                    }
                    session.run("init");
                    Session.Runner runner = session.runner();
                    runner.feed(inputName, data)
                            .fetch(outputName);
                    try (TFloat32 out = (TFloat32) runner.run().get(0)) {
                        FloatNdArray matrix = StdArrays.ndCopyOf(new float[1][1]);
                        out.copyTo(matrix);
                        FloatDataBuffer floatDataBuffer = DataBuffers.ofFloats(1);
                        matrix.read(floatDataBuffer);
                        float[] res = new float[1];
                        floatDataBuffer.read(res);
                        BigDecimal pro = BigDecimal.valueOf(res[0]);
                    }
                }
            }
        } catch (Exception e) {
            logger.error(e);
            throw e;
        }
    }

Soon after that, I found out why the graph instance failed to close:
I had used the graph from savedModelBundle to construct a new session:

            SavedModelBundle savedModelBundle = SavedModelBundle.load(".", "serve");
            Graph graph = savedModelBundle.graph();
            Tensor<Float> data = Tensor.create(shape, FloatBuffer.wrap(a));
            Session session = new Session(graph);// ← this line

The correct way is to use savedModelBundle.session() directly.
I don't know whether this behavior is normal or not, but I hope my experience can help others.
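To make the close() hang easier to reason about, here is a toy, plain-Python sketch (my own illustration, not TensorFlow's actual implementation) of the reference counting that keeps a graph pinned while a session built on it is still open:

```python
class RefCountedGraph:
    """Toy model of a graph whose close() must wait for zero references."""

    def __init__(self):
        self.refcount = 0

    def new_session(self):
        self.refcount += 1  # each live session holds a reference to the graph
        return self

    def release(self):      # what closing the session would do
        self.refcount -= 1

    def can_close(self):
        # close() blocks (Object.wait() in the Java stack trace) until
        # the reference count drops back to zero
        return self.refcount == 0


graph = RefCountedGraph()
session = graph.new_session()   # like `new Session(graph)` on the bundle's graph
print(graph.can_close())        # False: graph.close() would hang here
session.release()               # release the session's reference first
print(graph.can_close())        # True: now graph.close() can return
```

This is consistent with the refcount staying at 1: the extra Session constructed on the bundle's graph held a reference that was never released before graph.close() was called.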

Update
I found that session.run("init") is unnecessary and may be harmful.
Once you run("init"), you get a totally different prediction, presumably because the init op re-initializes the variables that the restore op had just loaded.
So it is not good practice. Don't follow it.
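As a toy, plain-Python illustration (not TensorFlow code; the variable name matches the error message, but the 0.73 weight value is made up) of why the extra init run changes the result:

```python
# The restore op loads the trained weights; a later init op overwrites
# them with fresh initial values, discarding the training.
variables = {}

def run_restore():
    variables["dense_2/kernel"] = 0.73  # trained value from the checkpoint

def run_init():
    variables["dense_2/kernel"] = 0.0   # re-initialized, training lost

run_restore()                           # done by SavedModelBundle.load(...)
print(variables["dense_2/kernel"])      # 0.73 -> correct predictions
run_init()                              # extra session.run("init")
print(variables["dense_2/kernel"])      # 0.0  -> totally different result
```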
