Q-Learning: learning notes & sample demos

A very simple and not very rigorous explanation, so please don't let me lead you astray~

So, firstly, what is Q-Learning?

When we go shopping, we may have prepared a shopping list beforehand, and we know exactly why each item was added to it. But why do we know that?

Never thought about it? That is exactly why Deep Reinforcement Learning feels hard to learn: we want to make a 'clever' program know things the way we do. For example, someone's tasty cooking is the reason we buy the same food again next time.

So, Q-Learning is an algorithm that helps a machine learn from the environment (usually through value feedback) by means of a Q-table. The Q-table is the machine's experience, accumulated over many attempts.

In fact, the environment holds a reward table (called the R-table), and the machine keeps a similar Q-table of its own. The machine receives feedback from the environment and updates its table accordingly.
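As a minimal sketch (my own illustration, not the teacher's code), a Q-table can simply be a 2-D array indexed by state and action; the sizes below are assumptions just for the example:

```python
import numpy as np

n_states = 6     # assumed number of states, purely for illustration
n_actions = 4    # e.g. up / down / left / right

# Q[s, a] stores how good action a currently looks in state s;
# it starts at zero and gets filled in through many attempts.
Q = np.zeros((n_states, n_actions))
```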

How it works

To see how it works, we need to break it into several steps: at each step the machine is in a specific state and can take an action. We still don't know what will happen when we make a decision, but we can observe the value returned by the current environment.

Let's list all the quantities involved:

$R_{t+1}$ : the value feedback the machine does not know yet, but will receive in the next state after taking the action

$S_t$ : the current state

$A_t$ : the action we take in the current state

$Q_t(S_t,A_t)$ : how I feel about the current state and action

$Q_{t+1}(S_{t+1},a)$ : how I feel about the next state and a not-yet-chosen action $a$

So, we have to make a decision in order to get feedback from the environment, and then use that feedback to update the Q-value of the current state. Hard to understand?

It simply means acting bravely and gaining experience that helps us decide in the next similar state. We need $R_{t+1}$ to update $Q_t(S_t,A_t)$.

And as we know, if we still cannot decide on something, we may check whether there is useful data in our experience. So we also need $Q_{t+1}(S_{t+1},a)$ to update $Q_t(S_t,A_t)$. But don't forget it is only a reference, so we need a discount factor $\gamma$ to weaken its effect.

Now think of it from a person's point of view: which helps you more, the relative gain (utility) or the absolute value? So what we actually need is the difference between $Q_t(S_t,A_t)$ and $\gamma Q_{t+1}(S_{t+1},a)$.

Assuming $\gamma Q_{t+1}(S_{t+1},a)$ is bigger than $Q_t(S_t,A_t)$, we get:

$$Q_t(S_t,A_t) = Q_t(S_t,A_t) + R_{t+1} + \gamma Q_{t+1}(S_{t+1},a) - Q_t(S_t,A_t)$$

But is that really enough? Maybe not. Not every decision is right, so we need a learning-rate factor $\alpha$ to reduce the impact of mistakes. Like... this:

$$Q_t(S_t,A_t) = Q_t(S_t,A_t) + \alpha \left( R_{t+1} + \gamma Q_{t+1}(S_{t+1},a) - Q_t(S_t,A_t) \right)$$

That seems perfect, but what has been forgotten? Right, the action $a$ in $Q_{t+1}(S_{t+1},a)$: to accumulate experience quickly, we can take the maximum value over all possible actions $a$. The final core update rule of the Q-Learning algorithm is:

$$Q_t(S_t,A_t) = Q_t(S_t,A_t) + \alpha \left( R_{t+1} + \gamma \max_a Q_{t+1}(S_{t+1},a) - Q_t(S_t,A_t) \right)$$
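As a quick sanity check, here is a minimal sketch of this update in Python. The table size, state/action indices, and reward value are my own assumptions for illustration, not taken from the teacher's samples:

```python
import numpy as np

alpha, gamma = 0.1, 0.9          # learning rate and discount factor
Q = np.zeros((6, 4))             # 6 states x 4 actions, an assumed toy size

def q_update(Q, s, a, r, s_next):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # best the next state promises
    td_error = td_target - Q[s, a]              # difference from current belief
    Q[s, a] += alpha * td_error
    return Q

# Example: in state 2 we took action 1, received reward 0, and landed in state 3.
Q = q_update(Q, s=2, a=1, r=0, s_next=3)
```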

How to make it

The teacher gave two examples; let's look at the first:

(Figure: Maze 1)

It looks like a chessboard, but the problem is called "Maze Navigation". As the name says, it has a start location and a terminal location, and we need to find a path between them while avoiding the gray grids.

Here is the pseudocode designed to solve this kind of problem:

(Screenshots: pseudocode of the algorithm)
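In case the screenshots don't render, the following is my own sketch of such a training loop. The epsilon-greedy exploration, the `env.reset()`/`env.step()` interface, and all parameter values are assumptions, not the teacher's exact code:

```python
import numpy as np

def train(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning loop; env is assumed to expose reset() and step()."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                          # start state of the maze
        done = False
        while not done:
            # epsilon-greedy: mostly exploit experience, sometimes explore
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # feedback from the environment
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```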

And I verified it with Python:

(Screenshot: Python verification run)

There are some rules on the map (sketched as a small reward lookup after the list):

  1. Gray grid: cannot be entered, but trying to gives -10 (utility)
  2. White grid: can be entered, and gives 0 (utility)
  3. Green grid: can be entered; it is the terminal and gives 100 (utility)
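As a minimal sketch (the grid coordinates are made up for illustration), the rules above can be encoded as a reward lookup:

```python
# Utility for each grid type, per the rules above.
REWARDS = {"white": 0, "gray": -10, "green": 100}

# Hypothetical layout: map each cell of the maze to a grid type.
GRID = {
    (0, 0): "white", (0, 1): "gray",
    (1, 0): "white", (1, 1): "green",   # green = terminal
}

def reward(cell):
    return REWARDS[GRID[cell]]
```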

It's fun to watch the robot find its way...

And here's the second sample:

(Figure: Maze 2)

It's more complex than the previous map, and there are four kinds of grids:

  1. Gray grid: cannot be entered, but trying to gives -10 (utility), the same as before
  2. White grid: can be entered, and gives 0 (utility)
  3. Green grid: can be entered; it is the terminal and gives 300 (utility)
  4. Red grid: cannot be entered, but trying to gives -100 (utility); it's a big trap (see the reward sketch below)...
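Following the same sketch as before, only the reward table needs to change for this map (values per the list above):

```python
# Utilities for the second maze; the terminal value and the new red trap differ.
REWARDS_2 = {"white": 0, "gray": -10, "green": 300, "red": -100}
```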

I wrote a program to run it; the output is a ".gif" picture, so you may be able to see the animation below.

(Animation: the robot navigating Maze 2)

Thanks

OK, that's all I did. The code files are in the zip; you may want to read "Readme.txt" before trying to run it.
