
Greedy and EpsilonGreedy strategies, using multi-armed bandit algorithms #1438

Status: Open. Wants to merge 13 commits into base: dev.

Conversation


@bing-j commented Mar 17, 2024

Hello! I wrote some strategies that use multi-armed bandit algorithms. Originally I only wanted to implement the epsilon-greedy strategy, but I now plan on extending this effort and implementing all of the algorithms mentioned in the multi-armed bandit chapter of Sutton and Barto's Reinforcement Learning: An Introduction (I added the reference to the bibliography). So the branch name is no longer very representative; this branch adds both Greedy and EpsilonGreedy.

Greedy:
Always chooses the action with the highest average/expected "reward" (score), calculated from its own previous turns. The reward estimate is updated incrementally and is optionally recency weighted, and the initial expected reward for each action defaults to zero unless overridden through a parameter.
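(For illustration only, not the code in this PR: the incremental, optionally recency-weighted update described above can be sketched as below; the function name and signature are made up for this example.)

    def update_estimate(q, reward, count, recency_weight=None):
        """Incrementally update the expected reward q after observing a new reward.

        Without a recency weight, the step size 1/count gives a plain running
        average; a constant recency_weight in (0, 1] discounts older rewards.
        """
        step = recency_weight if recency_weight is not None else 1.0 / count
        return q + step * (reward - q)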

EpsilonGreedy:
Behaves like Greedy with probability 1 - epsilon, and otherwise (with probability epsilon) picks an action at random.
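(Again purely illustrative, not the PR's implementation: epsilon-greedy action selection amounts to something like the following, where expected_rewards maps each action to its current estimate.)

    import random

    def epsilon_greedy_choice(expected_rewards, epsilon):
        """With probability epsilon explore (random action); otherwise exploit."""
        if random.random() < epsilon:
            return random.choice(list(expected_rewards))
        return max(expected_rewards, key=expected_rewards.get)

    # e.g. epsilon_greedy_choice({"C": 2.5, "D": 3.0}, epsilon=0.1) usually returns "D"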

These strategies are described in detail in the textbook mentioned above as well.

As I've mentioned on Gitter, I was unable to find any existing strategies that implement these algorithms, although I did find some similar ones. For example, Adaptive() works similarly to Greedy() without weights, but it has a hard-coded initial sequence and uses the raw sum of scores, rather than the average score, to choose the optimal play. (Side note: the comments in Adaptive().strategy() indicate that it was intended to use the highest average; this may be an error in the code!) If similar strategies already exist, and/or there are any modifications I need to make in the code, please let me know!

Cheers :)

@marcharper (Member) left a comment


Thanks for the contribution, looks interesting! Most of the feedback is just on matching style and improving the comments.

Review comment on axelrod/strategies/_strategies.py (outdated, resolved)
"manipulates_state": False,
}

UNIFORM = np.inf # constant that replaces weight when rewards aren't weighted
Member:

Is there another conceivable value?

Author (@bing-j):

I've changed this to -1.0, and changed other places in the code to refer to this constant for consistency. This does mean that if a user passes recency_weight=-1.0 at creation time, it will be treated as not recency weighted (instead of being clamped to 0.0 as an out-of-range value, as in the previous implementation).
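(A rough sketch of how such a sentinel could behave, based only on the description above; the helper name and the clamping of other out-of-range values are assumptions, not the PR's actual code.)

    UNIFORM = -1.0  # sentinel: use the unweighted running average

    def step_size(recency_weight, count):
        # Assumed behaviour: the sentinel selects the 1/n sample-average step,
        # while other out-of-range weights are clamped into [0.0, 1.0].
        if recency_weight == UNIFORM:
            return 1.0 / count
        return min(max(recency_weight, 0.0), 1.0)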

Two review comments on axelrod/strategies/armed_bandits.py (outdated, resolved)

class EpsilonGreedy(Greedy):
"""
Has a 1 - epsilon probability of behaving like Greedy(), and plays randomly otherwise.
Member:

Greedy() --> Greedy

Can you elaborate more on "plays randomly otherwise"?

Author (@bing-j):

Changed to "Has a 1 - epsilon probability of behaving like Greedy; otherwise, randomly choose to cooperate or defect."

Review comment on axelrod/tests/strategies/test_armed_bandits.py (outdated, resolved)
@marcharper (Member) commented:
Looks like we broke the test invocation with some recent commits; I'll try to fix it. You'll also need to update one of the doctests to reflect that two new strategies have been added.
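(If it helps, the doctest in question is presumably one that counts the available strategies, so the update would be along these lines; the number shown is a placeholder, not the real count.)

    >>> import axelrod as axl
    >>> len(axl.strategies)
    239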

@bing-j requested a review from marcharper on April 22, 2024
@marcharper (Member) commented:
Thanks for the updates. The test that's failing is:

======================================================================
FAIL: test_strategy (axelrod.tests.strategies.test_meta.TestNMWEDeterministic.test_strategy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/Axelrod/Axelrod/axelrod/tests/strategies/test_meta.py", line 636, in test_strategy
    self.versus_test(
  File "/home/runner/work/Axelrod/Axelrod/axelrod/tests/strategies/test_player.py", line 580, in versus_test
    test_match.versus_test(
  File "/home/runner/work/Axelrod/Axelrod/axelrod/tests/strategies/test_player.py", line 665, in versus_test
    self.assertEqual((i, play), (i, expected_play))
AssertionError: Tuples differ: (2, D) != (2, C)

First differing element 1:
D
C

- (2, D)
?     ^ 

+ (2, C)
?     ^

This is happening because there are some ensemble (meta) strategies, and the behavior of one of them has changed with the addition of these new strategies. You can run these tests with something like:

python -m unittest axelrod.tests.strategies.test_meta

I think in this case you just need to update the expected output, which has now changed.
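(Concretely, that usually means editing the expected_actions passed to versus_test in test_meta.py; the values below are purely illustrative, not the actual fix.)

    import axelrod as axl
    from axelrod.action import Action

    C, D = Action.C, Action.D

    # Illustrative only: flip the expected play on the turn that now differs,
    # e.g. inside TestNMWEDeterministic.test_strategy:
    actions = [(C, C), (C, D), (C, C)]  # the third pair was previously (D, C)
    # self.versus_test(opponent=axl.Alternator(), expected_actions=actions, seed=1)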
