Problem with Python AlphaZero using Keras 3 #1206

Open
lanctot opened this issue Apr 13, 2024 · 6 comments
Labels
help wanted Extra attention is needed

Comments

@lanctot
Collaborator

lanctot commented Apr 13, 2024

On Ubuntu 24.04 with Python 3.12, model_test.py fails under Keras 3.1.1:

(venv) lanctot@nitro-exp:~/open_spiel/open_spiel/python/algorithms/alpha_zero$ python model_test.py 
             .
             .
             .
[       OK ] ModelTest.test_model_learns_simple0 ('mlp')
[ RUN      ] ModelTest.test_model_learns_simple1 ('conv2d')
[  FAILED  ] ModelTest.test_model_learns_simple1 ('conv2d')
[ RUN      ] ModelTest.test_model_learns_simple2 ('resnet')
[  FAILED  ] ModelTest.test_model_learns_simple2 ('resnet')
======================================================================
ERROR: test_model_learns_optimal1 ('conv2d') (__main__.ModelTest)
ModelTest.test_model_learns_optimal1 ('conv2d')
test_model_learns_optimal('conv2d')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 101, in test_model_learns_optimal
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 235, in _define_graph
    torso = cascade(torso, [
            ^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

======================================================================
ERROR: test_model_learns_optimal2 ('resnet') (__main__.ModelTest)
ModelTest.test_model_learns_optimal2 ('resnet')
test_model_learns_optimal('resnet')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 101, in test_model_learns_optimal
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

======================================================================
ERROR: test_model_learns_simple1 ('conv2d') (__main__.ModelTest)
ModelTest.test_model_learns_simple1 ('conv2d')
test_model_learns_simple('conv2d')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 60, in test_model_learns_simple
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 235, in _define_graph
    torso = cascade(torso, [
            ^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

======================================================================
ERROR: test_model_learns_simple2 ('resnet') (__main__.ModelTest)
ModelTest.test_model_learns_simple2 ('resnet')
test_model_learns_simple('resnet')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lanctot/venv/lib/python3.12/site-packages/absl/testing/parameterized.py", line 322, in bound_param_test
    return test_method(self, testcase_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 60, in test_model_learns_simple
    model = build_model(game, model_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model_test.py", line 50, in build_model
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/lanctot/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/lanctot/venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lanctot/venv/lib/python3.12/site-packages/tensorflow/python/framework/tensor_shape.py", line 1440, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

----------------------------------------------------------------------
Ran 6 tests in 3.435s

FAILED (errors=4)
(venv) lanctot@nitro-exp:~/open_spiel/open_spiel/python/algorithms/alpha_zero$ 
@lanctot added the "help wanted" label on Apr 13, 2024
@lanctot
Collaborator Author

lanctot commented Apr 13, 2024

Tagging original author @tewalds, but we could use some help to port these algorithms to the new Keras version.

@lanctot changed the title from "Problem with Python AlphaZero using Keras 3.1.1" to "Problem with Python AlphaZero using Keras 3" on Apr 13, 2024
@DaanS8

DaanS8 commented May 12, 2024

I had the same issue and solved it by replacing the function batch_norm in open_spiel/python/algorithms/alpha_zero/model.py with:

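def batch_norm(training, updates, name):
    # `training` and `updates` are kept only so existing call sites still work;
    # this replacement does not use either of them.
    def batch_norm_layer(x):
        bn = tfkl.BatchNormalization(name=name, trainable=True)
        return bn(x)
    return batch_norm_layer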

I am far from an expert on this topic, and this code was suggested by GenAI.
Use with caution. Hopefully this helps with solving the issue.

@lanctot
Collaborator Author

lanctot commented May 12, 2024

Interesting... Thanks! I will check this with the experts on my side 😅

@tacertain
Contributor

I've been trying to puzzle through this, as it's also blocking me. I believe the error referenced above occurs because the value of the training variable is a TF placeholder (a Tensor), while Keras expects a Python bool. The replacement code above avoids the problem by not passing it in. However, unless there's magic going on to handle whether the layer is in training or inference mode, I don't think you're going to get the same results. Note the warning in the comments of batch_norm:

    # This emits a warning that training is a placeholder instead of a concrete
    # bool, but seems to work anyway.

I think the warning is now an error.

Secondarily, there's no updates method on layers any more. I can't even find where that used to be a valid method. Unfortunately, "updates" is not a very precise term for searching.
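
To make the training/inference concern above concrete, here is a minimal pure-Python sketch of what a batch-norm layer does in each mode. It is not the OpenSpiel or Keras implementation: `TinyBatchNorm` is a hypothetical name, it handles a single feature only, and the momentum and epsilon values are assumptions chosen to resemble common defaults. The moving-average update in the training branch appears to be the kind of side effect that older tf.keras collected in a layer's `updates`, though that is an interpretation rather than something confirmed in this thread.

```python
import math


class TinyBatchNorm:
    """Single-feature batch norm, written out by hand for illustration only."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        # Assumed values, picked to resemble typical batch-norm defaults.
        self.momentum = momentum
        self.epsilon = epsilon
        # Moving statistics start at mean 0 and variance 1.
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, xs, training):
        if training:
            # Training mode: normalize with the statistics of this batch.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # Side effect: nudge the moving averages towards the batch statistics.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference mode: reuse the stored moving statistics, no update.
            mean, var = self.moving_mean, self.moving_var
        scale = math.sqrt(var + self.epsilon)
        return [(x - mean) / scale for x in xs]


bn = TinyBatchNorm()
batch = [10.0, 12.0, 14.0]

train_out = bn(batch, training=True)   # centred on zero
infer_out = bn(batch, training=False)  # still close to the raw inputs
```

In this sketch the same batch produces very different outputs in the two modes, and only the training call changes the stored statistics. A replacement that never receives a concrete `training` value therefore has to get that information some other way.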

All I know about Keras I've learned in the past few hours, but my fear at this point is that the current code is based on TF v1 and Keras v1 and there are major changes that will need to be made to bring it up-to-date (and the AI-suggested code is valid code but doesn't capture all the functionality needed for alpha_zero). Hopefully I'm wrong, but I don't think I'm going to try to make any more progress without some more insight.

@tacertain
Contributor

@lanctot I have been working on changing alpha_zero.py into Keras 3 code, and leaving it as backend-agnostic as possible (i.e. using as few TF-specific calls as possible). I think that I'm close, but I am running into the python-greedy-import problem as reflected in #1122. How worthwhile is this work? I think I've gotten all the learning I'm going to get out of the coding part - I just want something that runs AlphaZero now. Should I just use the C++ version? I can keep trying to get the Python version to work in Keras 3, but does anybody but me really care? Also, if I do want to get it done for v1.5, what's the deadline?

@lanctot
Collaborator Author

lanctot commented May 22, 2024

@tacertain A working Python implementation of AlphaZero is something we will certainly figure out a way to keep in the long run. It's definitely not going away: we almost surely won't remove it, and would do so only as a really last resort. It serves as a very good user-friendly first look at AlphaZero and is highly valuable for that reason.

Right now, Keras 3 is still kinda new, and most people are not even using Python 3.12 yet, so I don't think Keras 3 is default-installed anywhere (or if so, it's very recent...), so this is on the low end of the current priority list. If Keras specifically is causing problems, we could simply move to PyTorch/JAX. I'd prefer to stay with Keras if possible -- it's quite neat that it acts as a wrapper to all the ML frameworks. The problem is we don't have much expertise with it internally.
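
As a rough illustration of the "wrapper" point: Keras 3 chooses its backend from the `KERAS_BACKEND` environment variable, which has to be set before Keras is first imported. The sketch below only shows that ordering. The helper names `select_backend` and `load_keras` are hypothetical, and the tuple of backends is an assumption that covers the three main ones rather than an exhaustive list.

```python
import os

# Assumed list of the main Keras 3 backends; not exhaustive.
KNOWN_BACKENDS = ("tensorflow", "jax", "torch")


def select_backend(name):
    """Set KERAS_BACKEND; this must happen before the first `import keras`."""
    if name not in KNOWN_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {KNOWN_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    return name


def load_keras(name):
    """Hypothetical helper: choose a backend, then import Keras lazily."""
    select_backend(name)
    import keras  # imported here so the environment variable is already set

    return keras
```

Calling `load_keras("jax")` in a fresh process would then give a Keras module running on JAX, provided both packages are installed. Code that sticks to backend-neutral Keras calls should not need to change when the name passed in changes.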

That said, this Python AZ is really intended for a first-time user or tinkerer and does not scale as well as the LibTorch-based C++ implementation (or the distributed ones found in other frameworks like RLlib or muzero-general -- which still support OpenSpiel games). So to your question of which one to use -- if you're comfortable with C++ and PyTorch, I'd say use the C++ one as it's more scalable. But it is also single-machine, so that will be a limitation for larger games.

On v1.5, that depends on what my schedule looks like. I was hoping to release before mid-June, and most of what I wanted to get in is there already, with just a few last-minute larger PRs that have been in the queue for some time (including your Quoridor fixes). I'm not expecting any of these known issues with Keras 3 to get fixed by then: it's too soon, and there's not enough wide use of Python 3.12 / Keras 3 yet to warrant all the effort right now. I put warnings in those files, with pointers to these threads, to tell users that there are problems. I did that with the longer-term goal of fixing (or removing) them by the next release. But of course, if they manage to get fixed before then, great, we'll include them 👍 but I don't want to rush them in either.

Edit: Also, if we get it working shortly after v1.5, we can issue a minor release fix (e.g. v1.5.1) so that PyPI / Colab then get the fixed AlphaZero, which could be used with Keras 3. The releases now require so little effort that I'd be happy to do that.
