
Optimization-and-Search

Implementation and visualization of optimization algorithms.
Please add the MathJax Plugin for GitHub to your browser so the equations render properly.

Table of Contents

1. Numerical Optimization
2. Stochastic Search
3. Classical Search
4. Dynamic Programming
5. Monte Carlo Policy Evaluation
6. Deep Q learning
7. Monte Carlo Tree Search
8. DPLL

1. Numerical Optimization

Here we solve a simple quadratic problem:
$$\arg\min_{x \in \mathbb{R}^n} \ \frac{1}{2}x^TQx + b^Tx, \quad Q \succeq 0$$

The animation on the left is for $n=10$, the one on the right for $n=2$.

Gradient Descent

Uses the first-order gradient with a fixed learning rate $\alpha$:
$$x^{k+1} := x^k - \alpha\nabla f(x^k)$$
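A minimal NumPy sketch of this update on the quadratic objective above (the function and variable names are illustrative, not the repository's code):

```python
import numpy as np

def gradient_descent(Q, b, x0, lr, iters=200):
    """Minimize 0.5*x^T Q x + b^T x with a fixed learning rate."""
    x = x0.astype(float)
    for _ in range(iters):
        grad = Q @ x + b      # first-order gradient of the quadratic
        x = x - lr * grad     # gradient step
    return x

# usage: a random positive-definite Q in n = 2 dimensions
n = 2
A = np.random.randn(n, n)
Q = A @ A.T + n * np.eye(n)
b = np.random.randn(n)
lr = 1.0 / np.linalg.eigvalsh(Q).max()   # step size below 1/L guarantees convergence
x_star = gradient_descent(Q, b, np.zeros(n), lr)
```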

Conjugate Descent

$$x^{k+1} := x^k + \alpha_k d^k$$
The step size $\alpha_k$ is obtained by exact line search:
$$\alpha_k = -\frac{\nabla f(x^k)^Td^k}{(d^k)^TQd^k}$$
The next conjugate direction is
$$d^{k+1} = -\nabla f(x^{k+1}) + \beta_kd^k$$
$$\beta_k = \frac{\nabla f(x^{k+1})^T\nabla f(x^{k+1})}{\nabla f(x^k)^T\nabla f(x^k)} \text{ (Fletcher–Reeves)}$$
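A sketch of the Fletcher–Reeves conjugate gradient iteration for the same quadratic (illustrative NumPy code, not the repository's implementation):

```python
import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-10):
    """Minimize 0.5*x^T Q x + b^T x using Fletcher–Reeves conjugate directions."""
    x = x0.astype(float)
    g = Q @ x + b                 # gradient at x
    d = -g                        # first direction: steepest descent
    for _ in range(len(b)):       # converges in at most n steps on a quadratic
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ d) / (d @ Q @ d)      # exact line search step size
        x = x + alpha * d
        g_new = Q @ x + b
        beta = (g_new @ g_new) / (g @ g)    # Fletcher–Reeves coefficient
        d = -g_new + beta * d               # next conjugate direction
        g = g_new
    return x
```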

Newton Method

Uses second-order (Hessian) information:
$$x^{k+1} := x^k + d^k$$
$$d^k = -[\nabla^2 f(x^k)]^{-1}\nabla f(x^k)$$
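A sketch of the Newton step; for this quadratic the Hessian is simply $Q$, so a single step already reaches the minimizer (illustrative code):

```python
import numpy as np

def newton_method(Q, b, x0, iters=5):
    """Newton's method for 0.5*x^T Q x + b^T x; the Hessian is the constant Q."""
    x = x0.astype(float)
    for _ in range(iters):
        grad = Q @ x + b
        d = -np.linalg.solve(Q, grad)   # Newton direction, solving instead of inverting
        x = x + d
    return x
```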

2. Stochastic Search

Here we use stochastic search to solve the traveling salesman problem (TSP).

Simulated Annealing
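A minimal sketch of simulated annealing on random city coordinates, assuming a 2-opt segment-reversal neighbor and geometric cooling (these choices are assumptions and may differ from the repository's implementation):

```python
import numpy as np

def tour_length(cities, tour):
    """Length of the closed tour visiting `cities` in the order given by `tour`."""
    ordered = cities[tour]
    return np.sum(np.linalg.norm(ordered - np.roll(ordered, -1, axis=0), axis=1))

def simulated_annealing(cities, T0=10.0, cooling=0.999, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(cities)
    tour = rng.permutation(n)
    best, best_len = tour.copy(), tour_length(cities, tour)
    T = T0
    for _ in range(iters):
        i, j = sorted(rng.choice(n, size=2, replace=False))
        cand = tour.copy()
        cand[i:j + 1] = cand[i:j + 1][::-1]        # 2-opt: reverse one segment
        cur_len, cand_len = tour_length(cities, tour), tour_length(cities, cand)
        # Metropolis rule: always accept improvements, sometimes accept worse tours
        if cand_len < cur_len or rng.random() < np.exp((cur_len - cand_len) / T):
            tour = cand
            if cand_len < best_len:
                best, best_len = cand.copy(), cand_len
        T *= cooling                               # geometric cooling schedule
    return best, best_len

cities = np.random.default_rng(1).random((30, 2))
tour, length = simulated_annealing(cities)
```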

Cross-Entropy Methods

The cross-entropy method needs fewer steps to converge.
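A sketch of the cross-entropy method for TSP, parameterizing the sampling distribution as a city-to-city transition matrix and refitting it to the elite tours each iteration (the parameterization and hyperparameters here are assumptions, not necessarily the repository's setup):

```python
import numpy as np

def tour_length(cities, tour):                      # same helper as in the SA sketch
    ordered = cities[tour]
    return np.sum(np.linalg.norm(ordered - np.roll(ordered, -1, axis=0), axis=1))

def sample_tour(P, rng):
    """Sample a tour city by city from the transition matrix P (start at city 0)."""
    n = len(P)
    tour, remaining = [0], list(range(1, n))
    while remaining:
        probs = P[tour[-1], remaining]
        nxt = int(rng.choice(remaining, p=probs / probs.sum()))
        tour.append(nxt)
        remaining.remove(nxt)
    return np.array(tour)

def cross_entropy_tsp(cities, n_samples=200, elite_frac=0.1, iters=100, smooth=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n = len(cities)
    P = np.ones((n, n)) / (n - 1)                   # initial uniform transition matrix
    np.fill_diagonal(P, 0.0)
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_len = None, np.inf
    for _ in range(iters):
        tours = [sample_tour(P, rng) for _ in range(n_samples)]
        lengths = np.array([tour_length(cities, t) for t in tours])
        if lengths.min() < best_len:
            best_len, best = lengths.min(), tours[int(np.argmin(lengths))]
        elite = [tours[i] for i in np.argsort(lengths)[:n_elite]]
        counts = np.zeros((n, n))                   # edge frequencies in the elite tours
        for t in elite:
            for a, b in zip(t, np.roll(t, -1)):
                counts[a, b] += 1
        counts /= counts.sum(axis=1, keepdims=True)
        P = smooth * counts + (1 - smooth) * P      # smoothed distribution update
    return best, best_len
```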

Search Gradient

Here the goal is to find the lowest position of a function. For TSP I could not derive $\nabla_\theta \log p_\theta (z_i)$; if you have any ideas, please feel free to comment. Thanks!
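For the continuous case, a sketch of the search gradient with a diagonal Gaussian search distribution, using the log-likelihood gradients with respect to $\mu$ and $\sigma$ (the test function and hyperparameters are illustrative):

```python
import numpy as np

def f(x):
    """Illustrative objective to minimize."""
    return np.sum(x ** 2)

def search_gradient(dim=2, iters=300, pop=50, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=dim) * 3.0          # mean of the Gaussian search distribution
    sigma = np.ones(dim)                     # per-dimension standard deviation
    for _ in range(iters):
        eps = rng.normal(size=(pop, dim))
        z = mu + sigma * eps                 # samples z_i ~ N(mu, diag(sigma^2))
        fitness = np.array([f(zi) for zi in z])
        # we minimize, so use negative fitness; standardizing stabilizes the update
        adv = -(fitness - fitness.mean()) / (fitness.std() + 1e-8)
        # grad_theta log p_theta(z) for a diagonal Gaussian:
        #   d/d mu = eps / sigma,   d/d sigma = (eps^2 - 1) / sigma
        grad_mu = (adv[:, None] * eps / sigma).mean(axis=0)
        grad_sigma = (adv[:, None] * (eps ** 2 - 1) / sigma).mean(axis=0)
        mu = mu + lr * grad_mu               # ascend the expected (negative) cost
        sigma = np.maximum(sigma + lr * grad_sigma, 1e-3)
    return mu

print(search_gradient())                     # should move toward the minimizer at the origin
```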

3. Classical Search

A* search algorithm (A star search algorithm)

The code is written following the slides of CSE 257 at UCSD.
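A generic sketch of A* on a 4-connected grid with a Manhattan-distance heuristic (the grid setup is illustrative, not the slides' exact example):

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """A* search: neighbors(n) yields (next_node, step_cost); heuristic must not overestimate."""
    frontier = [(heuristic(start, goal), 0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in neighbors(node):
            g_new = g + cost
            if g_new < best_g.get(nxt, float("inf")):
                best_g[nxt] = g_new
                heapq.heappush(frontier, (g_new + heuristic(nxt, goal), g_new, nxt, path + [nxt]))
    return None, float("inf")

# usage on a 10x10 grid with unit step costs
def grid_neighbors(node, size=10, blocked=frozenset()):
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in blocked:
            yield (nx, ny), 1

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
path, cost = a_star((0, 0), (9, 9), grid_neighbors, manhattan)
```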

minimax search

4. Dynamic Programming

Value iteration

In this implementation, the reward for every ordinary tile is -0.04, a credit tile gives +4, and an obstacle tile gives -5. Press the space key to toggle between the [credit] and [obstacle] options, and left-click to add or remove a credit/obstacle tile.

We can see that we obtain the optimal policy before the utility function converges.
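A simplified sketch of value iteration on such a grid, assuming deterministic moves and treating the credit and obstacle tiles as terminal (the interactive demo's dynamics may differ):

```python
import numpy as np

def value_iteration(rewards, terminal, gamma=0.9, theta=1e-6):
    """Value iteration on a deterministic gridworld."""
    rows, cols = rewards.shape
    V = np.zeros((rows, cols))
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
    while True:
        delta = 0.0
        for r in range(rows):
            for c in range(cols):
                if (r, c) in terminal:
                    V_new = rewards[r, c]
                else:
                    best = -np.inf
                    for dr, dc in actions:
                        nr = min(max(r + dr, 0), rows - 1)   # bumping into a wall keeps you in place
                        nc = min(max(c + dc, 0), cols - 1)
                        best = max(best, rewards[r, c] + gamma * V[nr, nc])
                    V_new = best
                delta = max(delta, abs(V_new - V[r, c]))
                V[r, c] = V_new
        if delta < theta:
            return V

# usage: -0.04 step reward, a +4 credit tile, a -5 obstacle tile (as in the demo)
R = np.full((5, 5), -0.04)
R[0, 4], R[2, 2] = 4.0, -5.0
V = value_iteration(R, terminal={(0, 4), (2, 2)})
```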

Policy iteration

5. Monte Carlo Policy Evaluation

  • Policy: $\pi (s_i) = a_i$
  • Trajectories: $s_0, s_1, s_2, \dots, s_T$
  • Rewards: $R(s_0) + \gamma R(s_1) + \dots + \gamma^T R(s_T)$

Follow the policy to collect many sample trajectories, and estimate the expectation $$V^{\pi}(s_0) = \mathbb{E}_{\pi}\Big[\sum_{i=0}^T\gamma^iR(s_i)\Big]$$
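A sketch of (every-visit) Monte Carlo policy evaluation; `env_reset`, `env_step`, and `policy` are hypothetical interfaces, not the repository's API:

```python
import numpy as np
from collections import defaultdict

def mc_policy_evaluation(env_reset, env_step, policy, episodes=1000, gamma=0.9):
    """Estimate V^pi by averaging discounted returns over sampled trajectories.

    env_reset() -> state; env_step(state, action) -> (next_state, reward, done).
    """
    returns = defaultdict(list)
    for _ in range(episodes):
        state, done, trajectory = env_reset(), False, []
        while not done:                                    # roll out one full trajectory
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            trajectory.append((state, reward))
            state = next_state
        G = 0.0
        for s, r in reversed(trajectory):                  # accumulate returns backwards
            G = r + gamma * G
            returns[s].append(G)                           # every-visit estimate
    return {s: float(np.mean(g)) for s, g in returns.items()}
```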

Temporal Difference Policy Evaluation

The difference between the TD and MC methods is that TD can update at every step and does not need to wait for a whole trajectory to be generated.

Note that the learning rate $\alpha$ should decrease with the number of visits to a state; in that case the method is guaranteed to converge to the true state values by the law of large numbers.
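A sketch of TD(0) evaluation with a learning rate that decays as one over the number of visits, matching the note above (same hypothetical `env_reset`/`env_step`/`policy` interfaces as in the MC sketch):

```python
from collections import defaultdict

def td_policy_evaluation(env_reset, env_step, policy, episodes=1000, gamma=0.9):
    """TD(0): update V after every step instead of waiting for a full trajectory."""
    V = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            visits[state] += 1
            alpha = 1.0 / visits[state]                   # decaying learning rate
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])       # TD(0) update
            state = next_state
    return dict(V)
```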

Tabular Q learning

Off-policy method. The optimal value backup is
$$V(s) = \max_a \mathbb{E}\big[R(s)+\gamma V(s') \mid s, a\big]$$
and the corresponding tabular Q-learning update is
$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]$$
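A sketch of tabular Q-learning with an epsilon-greedy behavior policy; the environment interface and hyperparameters are assumptions:

```python
import numpy as np
from collections import defaultdict

def tabular_q_learning(env_reset, env_step, actions, episodes=2000,
                       gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    """Off-policy Q-learning; env_reset/env_step are the same hypothetical interfaces as above."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(len(actions)))
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            # epsilon-greedy action selection on the current Q estimates
            if rng.random() < epsilon:
                a = int(rng.integers(len(actions)))
            else:
                a = int(np.argmax(Q[state]))
            next_state, reward, done = env_step(state, actions[a])
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][a] += alpha * (target - Q[state][a])   # off-policy max backup
            state = next_state
    return Q
```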

6. Deep Q learning

In this case I used a Deep Q-Network to train a self-driving car. The environment is created by NeuralNine (GitHub), though he used the NEAT package to train the car.
The demo is below: you can modify the network and train your own car!
Random case:

DQL case:
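A generic PyTorch sketch of a Q-network and one DQN update step; the layer sizes, the replay buffer of `(state, action, reward, next_state, done)` tuples, and the target network are standard DQN ingredients, not necessarily the exact setup used for the car demo:

```python
import random
import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network: state features -> one Q-value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, replay, batch_size=64, gamma=0.99):
    """One gradient step on a minibatch of (s, a, r, s', done) transitions from the replay buffer."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = zip(*batch)
    s = torch.tensor(np.asarray(s), dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s2 = torch.tensor(np.asarray(s2), dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a) for the taken actions
    with torch.no_grad():                                    # frozen target network
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In use, `replay` would typically be a `collections.deque(maxlen=...)` filled with transitions from an epsilon-greedy policy, with `target_net` copied from `q_net` every few hundred steps.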

7. Monte Carlo Tree Search

8. DPLL