Problem with Python AlphaZero using Keras 3 #1206

lanctot · 2024-04-13T10:28:33Z

On Ubuntu 24.04 with Python 3.12, model_test.py fails under Keras 3.1.1:

(venv) lanctot@nitro-exp:~/open_spiel/open_spiel/python/algorithms/alpha_zero$ python model_test.py 
             .
             .
             .
[       OK ] ModelTest.test_model_learns_simple0 ('mlp')
[ RUN      ] ModelTest.test_model_learns_simple1 ('conv2d')
[  FAILED  ] ModelTest.test_model_learns_simple1 ('conv2d')
[ RUN      ] ModelTest.test_model_learns_simple2 ('resnet')
[  FAILED  ] ModelTest.test_model_learns_simple2 ('resnet')
======================================================================
ERROR: test_model_learns_optimal1 ('conv2d') (__main__.ModelTest)
ModelTest.test_model_learns_optimal1 ('conv2d')
test_model_learns_optimal('conv2d')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 101, in test_model_learns_optimal
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 235, in _define_graph
    torso = cascade(torso, [
            ^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

======================================================================
ERROR: test_model_learns_optimal2 ('resnet') (__main__.ModelTest)
ModelTest.test_model_learns_optimal2 ('resnet')
test_model_learns_optimal('resnet')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 101, in test_model_learns_optimal
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

======================================================================
ERROR: test_model_learns_simple1 ('conv2d') (__main__.ModelTest)
ModelTest.test_model_learns_simple1 ('conv2d')
test_model_learns_simple('conv2d')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 60, in test_model_learns_simple
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 235, in _define_graph
    torso = cascade(torso, [
            ^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

======================================================================
ERROR: test_model_learns_simple2 ('resnet') (__main__.ModelTest)
ModelTest.test_model_learns_simple2 ('resnet')
test_model_learns_simple('resnet')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 60, in test_model_learns_simple
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

----------------------------------------------------------------------
Ran 6 tests in 3.435s

FAILED (errors=4)
(venv) lanctot@nitro-exp:~/open_spiel/open_spiel/python/algorithms/alpha_zero$


lanctot · 2024-04-13T10:29:39Z

Tagging original author @tewalds, but we could use some help to port these algorithms to the new Keras version.

DaanS8 · 2024-05-12T17:00:22Z

Had the same issue, solved it by replacing the function batch_norm in open_spiel/python/algorithms/alpha_zero/model.py with:

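def batch_norm(training, updates, name):
    # `training` and `updates` are kept so existing call sites still work,
    # but this version ignores both.
    def batch_norm_layer(x):
        bn = tfkl.BatchNormalization(name=name, trainable=True)
        return bn(x)  # no explicit training flag is passed to the layer
    return batch_norm_layer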

I am far from an expert on this topic, and this code was suggested by GenAI.
Use with caution. Hopefully this helps with solving the issue.

lanctot · 2024-05-12T17:05:40Z

Interesting... Thanks! I will check this with the experts on my side 😅

tacertain · 2024-05-12T23:07:21Z

I've been trying to puzzle through this, as it's also blocking me. I believe the error above occurs because the value of the training variable is a TF placeholder (a Tensor), whereas Keras expects a Python bool. The replacement code above avoids the problem by not passing it. However, unless there's some magic that handles training versus inference mode, I don't think you're going to get the same results. Note the warning in the comments of batch_norm:

    # This emits a warning that training is a placeholder instead of a concrete
    # bool, but seems to work anyway.

I think the warning is now an error.
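
To make the training-versus-inference concern concrete, here is a minimal pure-Python sketch of what a batch-norm layer does in each mode. It is not the Keras implementation or the OpenSpiel code: the class name TinyBatchNorm, the momentum and eps values, and the sample batch are all assumptions made for illustration, and the learnable scale and offset are left out.

```python
import math


class TinyBatchNorm:
    """Hypothetical 1-D batch norm, for illustration only (no scale/offset)."""

    def __init__(self, momentum=0.9, eps=1e-3):
        # momentum and eps are illustrative values, not taken from model.py.
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, xs, training):
        if training:
            # Training mode: normalize with the statistics of this batch
            # and fold them into the moving averages.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference mode: normalize with the stored moving averages.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]


bn = TinyBatchNorm()
batch = [10.0, 12.0, 14.0]
train_out = bn(batch, training=True)    # centered on the batch mean
infer_out = bn(batch, training=False)   # centered on the moving mean
print(train_out)
print(infer_out)
```

The same batch gives different outputs in the two modes, so a replacement that never tells the layer which mode it is in may not reproduce the original behavior.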

Secondarily, there's no updates method on layers any more. I can't even find where that used to be a valid method. Unfortunately, "updates" is not a very precise term for searching.

All I know about Keras I've learned in the past few hours, but my fear at this point is that the current code is based on TF v1 and Keras v1 and there are major changes that will need to be made to bring it up-to-date (and the AI-suggested code is valid code but doesn't capture all the functionality needed for alpha_zero). Hopefully I'm wrong, but I don't think I'm going to try to make any more progress without some more insight.

tacertain · 2024-05-21T20:01:54Z

@lanctot I have been working on converting alpha_zero.py to Keras 3 code while keeping it as backend-agnostic as possible (i.e. using as few TF-specific calls as possible). I think I'm close, but I'm running into the Python greedy-import problem described in #1122. How worthwhile is this work? I think I've gotten all the learning I'm going to get out of the coding part; I just want something that runs AlphaZero now. Should I just use the C++ version? I can keep trying to get the Python version to work in Keras 3, but does anybody but me really care? Also, if I do want to get it done for v1.5, what's the deadline?

lanctot · 2024-05-22T00:56:07Z

@tacertain A working Python implementation of AlphaZero is something we will certainly find a way to keep in the long run. It's definitely not going away: we would remove it only as a true last resort. It serves as a very good, user-friendly first look at AlphaZero and is highly valuable for that reason.

Right now, Keras 3 is still fairly new, and most people are not even using Python 3.12 yet, so I don't think Keras 3 is default-installed anywhere (or if so, it's very recent), so this is on the low end of the current priority list. If Keras specifically is causing problems, we could simply move to PyTorch/JAX. I'd prefer to stay with Keras if possible; it's quite neat that it acts as a wrapper around multiple ML frameworks. The problem is we don't have much expertise with it internally.

That said, this Python AZ is really intended for a first-time user or tinkerer and does not scale as well as the LibTorch-based C++ implementation (or the distributed ones found in other frameworks like RLlib or muzero-general, which still support OpenSpiel games). So, to your question of which one to use: if you're comfortable with C++ and PyTorch, I'd say use that, as it's more scalable. But it is also single-machine, so that will be a limitation for larger games.

On v1.5, that depends on what my schedule looks like. I was hoping to release before mid-June, and most of what I wanted to get in is there already; there are just a few last-minute larger PRs that have been in the queue for some time (including your Quoridor fixes). I'm not expecting any of these known issues with Keras 3 to get fixed by then: it's too soon, and Python 3.12 / Keras 3 are not yet widely used enough to warrant all the effort right now. I put warnings in those files, with pointers to these threads, to tell users that there are problems. I did that with the longer-term goal of fixing (or removing) them by the next release. But of course, if they get fixed before then, great, we'll include them 👍 but I don't want to rush them in either.

Edit: Also, if we get it working shortly after v1.5, we can issue a minor fix release (e.g. v1.5.1) so that PyPI / Colab get a fixed AlphaZero that can be used with Keras 3. The releases now require so little effort that I'd be happy to do that.

lanctot added the "help wanted" label Apr 13, 2024

lanctot changed the title ~~Problem with Python AlphaZero using Keras 3.1.1~~ Problem with Python AlphaZero using Keras 3 Apr 13, 2024

lanctot mentioned this issue Apr 24, 2024

Preparation for OpenSpiel 1.5: Add warnings to algorithms with known issues. #1213

Merged

tacertain mentioned this issue May 10, 2024

Failure in alpha_zero.py #1225

Closed
