### So, firstly, what is Q-Learning?

If we go for a shopping, goods list maybe is been prepared before. And we really know the reason why we add them to the list. Why we know that?

Haven't thought that? Yeah, that's why you think it's hard to learn Deep Reinforce Learning. That is we want to make the 'clever' program knows things like us. We can get the reason by one's tasty cooking, so we buy the food again in the next time.

So, Q-Learning is a kind of algorithms to help machine learn from environment (usually value feedback) by a Q-table. Q-table is machine's experience by many times attempts.

Actually we know there's a table (called R-table) in environment and machine have the similar Q-table. The machine receives the feedback and update its table.

### How it works

Actually we need to split it as several process, it's under specific state and can make action. Sure we still don't know what will happen when making a strategic decision but we can get the value in current environment.

List these all as follows:

$R_{t+1}$ : A value feedback which machine don't know but will get in the next state after action

$S_{t}$ : Current state

$A_t$ : What we will do in current state

$Q_t(S_t,A_t)$ : How I feel in current state and action

$Q_{t+1}(S_{t+1},a)$ : How I feel in next state and unknown action

So, we have to make a decision to get the feedback from the environment to update our Q-value in current state. Hardly to understand?

It's just to do something bravely and get some experience to help we do decision in the next similar state. We need $R_{t+1}$ to update $Q_t(S_t,A_t)$.

And as know, if we still have no decision in something, we may check if there's some useful data in our experience. We need $Q_{t+1}(S_{t+1},a)$ to update $Q_t(S_t,A_t)$. And don't forget it's just as a reference. So you need a factor ${\gamma}$ to decrease the effect.

Now we need to back to one person, utility is better to help you or the absolute value? So we need a difference between the $Q_t(S_t,A_t)$ and $\gamma Q_{t+1}(S_{t+1},a)$.

We assume $\gamma Q_{t+1}(S_{t+1},a)$ is bigger than $Q_t(S_t,A_t)$, there is:

$Q_t(S_t,A_t) = Q_t(S_t,A_t) + R_{t+1} + \gamma Q_{t+1}(S_{t+1},a) - Q_t(S_t,A_t)$

But, really enough? Maybe not. Not the each strategy decision is right, we need a factor ${\alpha}$ to decrease the probability of mistake. Like...this:

$Q_t(S_t,A_t) = Q_t(S_t,A_t) + \alpha ( R_{t+1} + \gamma Q_{t+1}(S_{t+1},a) - Q_t(S_t,A_t))$

That seems perfect, but what have been forgotten? Yeah, that's "a" in $Q_{t+1}(S_{t+1},a)$, to get the experience rapidly, maybe we can get the max value for each possible action "a". The final really perfect Q-Learning Algorithm core theory:

$Q_t(S_t,A_t) = Q_t(S_t,A_t) + \alpha ( R_{t+1} + \gamma max_aQ_{t+1}(S_{t+1},a) - Q_t(S_t,A_t))$

### How to make it

Two examples given by teacher, let's see the first:

It looks like a chessboard. But its name is "Maze Navigation". As the name say, it has terminal location and start. We need to find a way to not avoid in gray grid.

That's the logical code which is designed for solve this kind of problems:

And I do the verification by python:

There are some rules in the map:

1. Gray grid: can't be reached, but get -10 (utility)
2. White grid: can be reached, and get 0 (utility)
3. Green grid: can be reached, and it's terminal, get 100 (utility)

It's funny to see the robot find the way...

And here's the second sample:

It's more complex than the previous map, and there are four kinds of grids:

1. Gray grid: can't be reached, but get -10 (utility), the same as previous
2. White grid: can be reached, and get 0 (utility)
3. Green grid: can be reached, and it's terminal, get 300 (utility)
4. Red grid: can't be reached, but get -100(utility), it's a big trap...

I do the program to run for it, it's a picture with ".gif" format. Maybe you can see the animation.

### Thanks

OK, that's all I do. And there's code file in the zip. Maybe you need to read "Readme.txt" before you try to run it.