Gridworld

Java JDK 8, JavaFX

Pseudocode

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s,a
Repeat (for each episode):
	Initialize s,a
	Repeat (for each step of episode):
		Take action a, observe r,s'
		Choose a' from s' using policy derived from Q (e.g., ϵ-greedy)
		δ ← r + γ Q(s',a') - Q(s,a)
		e(s,a) ← e(s,a) + 1
		For all s,a:
			Q(s,a) ← Q(s,a) + α δ e(s,a)
			e(s,a) ← γ λ e(s,a)
		s ← s' ; a ← a'
	until s is terminal
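
As a rough illustration, here is the inner-loop update from the pseudocode sketched in Java, assuming the grid's weights and eligibility traces are stored in 3D arrays indexed by row, column, and direction. The names and signature below are illustrative, not taken from this repository's source.

// One SARSA(lambda) step for a tabular gridworld (sketch only).
// q and e have shape [rows][cols][4]; directions are UP, DOWN, LEFT, RIGHT.
static void sarsaLambdaStep(double[][][] q, double[][][] e,
                            int r, int c, int a,        // current state s and action a
                            double reward,
                            int r2, int c2, int a2,     // next state s' and action a'
                            double alpha, double gamma, double lambda) {
    // delta <- r + gamma * Q(s',a') - Q(s,a)
    double delta = reward + gamma * q[r2][c2][a2] - q[r][c][a];
    // e(s,a) <- e(s,a) + 1 (accumulating trace)
    e[r][c][a] += 1.0;
    // For all s,a: update every weight, then decay every eligibility trace
    for (int i = 0; i < q.length; i++) {
        for (int j = 0; j < q[i].length; j++) {
            for (int dir = 0; dir < 4; dir++) {
                q[i][j][dir] += alpha * delta * e[i][j][dir];
                e[i][j][dir] *= gamma * lambda;
            }
        }
    }
}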

Description

Given infinite possibilities, an AI agent learns the optimal path toward its goal from any starting point using SARSA(λ). The algorithm initializes a 2D grid in which each square holds four arbitrarily small weights, one for each direction: UP, DOWN, LEFT, RIGHT. The gamma discount factor starts at 0.90 and lambda is fixed at 0.00000005. In each state Q(s,a) we decide whether to explore or exploit by drawing a random number against decaying odds: (tmp < gamma) ? explore() : exploit().

Within each episode, until we reach a terminal state, we take the action a chosen by the explore/exploit rule and observe the reward and the next state s'. We then choose the next action a' from s' using the policy derived from Q; when exploiting, this means picking the direction with the maximum weight in s'. Next we compute a delta: the reward (gathered from s') plus the gamma discount factor multiplied by Q(s',a'), minus Q(s,a). We also increment the eligibility value e(s,a) for the weight we just used.

Now we update the entire gridworld table. First, each weight Q(s,a) is increased by the learning rate alpha multiplied by the delta calculated earlier multiplied by its corresponding eligibility value e(s,a). Second, every eligibility value is decayed by multiplying it by gamma times lambda. The last step is to move the agent to the next state s' and adopt a' as its current action. Keep in mind that a' was determined earlier using either exploration or exploitation.
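
Below is a minimal sketch of the explore/exploit decision described above, using the same hypothetical q table as the previous example; the chooseAction, explore, and exploit helper names mirror the description rather than the repository's actual code.

private final java.util.Random rng = new java.util.Random();

// Decide stochastically: with probability gamma, explore; otherwise exploit.
int chooseAction(double[][][] q, int r, int c, double gamma) {
    double tmp = rng.nextDouble();
    return (tmp < gamma) ? explore() : exploit(q, r, c);
}

// Explore: pick UP, DOWN, LEFT, or RIGHT uniformly at random.
int explore() {
    return rng.nextInt(4);
}

// Exploit: pick the direction with the maximum weight in the current square.
int exploit(double[][][] q, int r, int c) {
    int best = 0;
    for (int dir = 1; dir < 4; dir++) {
        if (q[r][c][dir] > q[r][c][best]) best = dir;
    }
    return best;
}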

We keep performing the above algorithm until the current state is terminal, meaning it is the goal location, out of bounds, or an obstacle. This check happens after we move s ← s' ; a ← a'. When we hit a terminal state, we update gamma by subtracting lambda from it (gamma -= lambda). This decay lets the agent explore heavily early on and gather data about its environment; as training progresses and the agent learns more, its actions increasingly exploit what it has learned (the newly updated weights).
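
A short sketch of the episode-end handling described above; isGoal, isOutOfBounds, and isObstacle are hypothetical helpers standing in for the repository's own checks, and gamma and lambda are assumed to be fields of the agent.

// A state is terminal if it is the goal, out of bounds, or an obstacle.
boolean isTerminal(int r, int c) {
    return isGoal(r, c) || isOutOfBounds(r, c) || isObstacle(r, c);
}

// At the end of each episode, shrink the exploration odds slightly so
// later episodes exploit the learned weights more often.
void onEpisodeEnd() {
    gamma -= lambda;   // lambda = 0.00000005, so gamma decays very slowly from 0.90
}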
