## Solving the example using Value Iteration

VI should make even more sense once we complete an example problem, so let's get back to our golf MDP. We have already formalised golf as an MDP, but the agent doesn't yet know the best strategy for playing, so let's solve the golf MDP using VI.

We’ll start by defining our model parameters using fairly standard values:

```
γ = 0.9   // discount factor
θ = 0.01  // convergence threshold
```

We will then initialise our value table to 0 for all states in **S**:

```
// value table
V(s0) = 0
V(s1) = 0
V(s2) = 0
```
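Before walking through the maths, it may help to see this setup in code. Below is a minimal Python sketch; the dictionary layout is my own choice, but the state and action names, probabilities, and rewards are the ones used in the calculations that follow:

```python
# Golf MDP model: T[s][a] is a list of (next_state, probability, reward)
# triples, using the transition probabilities and rewards from this example.
T = {
    "s0": {"hit to green":   [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {"hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
           "hit in hole":    [("s1", 0.1, 0), ("s2", 0.9, 10)]},
    "s2": {},  # terminal state: no actions available
}

gamma = 0.9   # discount factor γ
theta = 0.01  # convergence threshold θ

# Initialise the value table to 0 for every state in S
V = {s: 0.0 for s in T}
```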

We can now start in the outer loop:

`Δ = 0`

And three passes of the inner loop, one for each state in **S**:

```
// Bellman update rule
// V(s) ← maxₐ Σₛ′,ᵣ p(s′, r | s, a) * [r + γV(s′)]

//******************* state s0 *******************//
v = 0

// we have only looked at one action here, as only one is available from s0;
// we know the others are not possible and would therefore sum to 0
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
            T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

V(s0) = max[0.1 * (0 + 0.9 * 0) +
            0.9 * (0 + 0.9 * 0)]

V(s0) = max[0] = 0

// Delta update rule
// Δ ← max(Δ, |v - V(s)|)
Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 0|] = 0
```

```
//******************* state s1 *******************//
v = 0

// there are 2 available actions here
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
            T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
            T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
            T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

V(s1) = max[0.9 * (0 + 0.9 * 0) +
            0.1 * (0 + 0.9 * 0),
            0.1 * (0 + 0.9 * 0) +
            0.9 * (10 + 0.9 * 0)]

V(s1) = max[0, 9] = 9

Δ = max[Δ, |v - V(s1)|] = max[0, |0 - 9|] = 9

//******************* state s2 *******************//
// terminal state with no actions
```

This gives us the following update to our value table:

```
V(s0) = 0
V(s1) = 9
V(s2) = 0
```
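This first sweep of the inner loop can be reproduced in a few lines of Python (a sketch, with the model written as a dictionary of `(next_state, probability, reward)` triples; states are updated in place, in order, exactly as in the hand calculation above):

```python
T = {
    "s0": {"hit to green":   [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {"hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
           "hit in hole":    [("s1", 0.1, 0), ("s2", 0.9, 10)]},
    "s2": {},  # terminal state
}
gamma = 0.9
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}

delta = 0.0
for s in ("s0", "s1", "s2"):
    if not T[s]:  # skip the terminal state
        continue
    v = V[s]
    # Bellman update: V(s) <- max_a sum_s' T(s'|s,a) * (r + gamma * V(s'))
    V[s] = max(sum(p * (r + gamma * V[ns]) for ns, p, r in outcomes)
               for outcomes in T[s].values())
    # Delta update: delta <- max(delta, |v - V(s)|)
    delta = max(delta, abs(v - V[s]))

print(V, delta)  # {'s0': 0.0, 's1': 9.0, 's2': 0.0} 9.0
```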

*We don’t need to worry about **s2** as this is a terminal state, meaning no actions are possible here.*

We now break out of the inner loop and continue the outer loop, performing a convergence check:

`Δ < θ → 9 < 0.01 → `**False**

Since we have not converged, we do a second iteration of the outer loop:

`Δ = 0`

And another three passes of the inner loop, using the updated value table:

```
//******************* state s0 *******************//
v = 0

V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
            T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

V(s0) = max[0.1 * (0 + 0.9 * 0) +
            0.9 * (0 + 0.9 * 9)]

V(s0) = max[7.29] = 7.29

Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 7.29|] = 7.29
```

```
//******************* state s1 *******************//
v = 9

V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
            T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
            T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
            T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

V(s1) = max[0.9 * (0 + 0.9 * 7.29) +
            0.1 * (0 + 0.9 * 9),
            0.1 * (0 + 0.9 * 9) +
            0.9 * (10 + 0.9 * 0)]

V(s1) = max[6.7149, 9.81] = 9.81

Δ = max[Δ, |v - V(s1)|] = max[7.29, |9 - 9.81|] = 7.29

//******************* state s2 *******************//
// terminal state with no actions
```

At the end of the second iteration, our values are:

```
V(s0) = 7.29
V(s1) = 9.81
V(s2) = 0
```

Check convergence once again:

`Δ < θ → 7.29 < 0.01 → `**False**

Still no convergence, so we repeat the same process until Δ < θ. I won't show all the calculations; the two iterations above are enough to understand the process.
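The remaining iterations can be delegated to code. This sketch runs the full outer loop until Δ < θ, using the same model and the same in-place, state-by-state update order as the hand calculations (the iteration counter exists only so we can track how many sweeps convergence takes):

```python
T = {
    "s0": {"hit to green":   [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {"hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
           "hit in hole":    [("s1", 0.1, 0), ("s2", 0.9, 10)]},
    "s2": {},  # terminal state
}
gamma, theta = 0.9, 0.01
V = {s: 0.0 for s in T}

iteration = 0
while True:  # outer loop
    iteration += 1
    delta = 0.0
    for s in T:  # inner loop over states in S
        if not T[s]:  # skip the terminal state
            continue
        v = V[s]
        # Bellman update
        V[s] = max(sum(p * (r + gamma * V[ns]) for ns, p, r in outcomes)
                   for outcomes in T[s].values())
        delta = max(delta, abs(v - V[s]))
    if delta < theta:  # convergence check
        break

print(iteration, round(V["s0"], 6), round(V["s1"], 6))  # 6 8.802996 9.890105
```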

After 6 iterations, our values converge. Here are the values and Δ as they change over each iteration:

| Iteration | V(s0) | V(s1) | V(s2) | Δ | Converged |
|-----------|-------|-------|-------|---|-----------|
| 1 | 0 | 9 | 0 | 9 | False |
| 2 | 7.29 | 9.81 | 0 | 7.29 | False |
| 3 | 8.6022 | 9.8829 | 0 | 1.3122 | False |
| 4 | 8.779347 | 9.889461 | 0 | 0.177147 | False |
| 5 | 8.80060464 | 9.89005149 | 0 | 0.02125764 | False |
| 6 | 8.8029961245 | 9.8901046341 | 0 | 0.0023914845 | True |

Now we can extract our policy:

```
// Policy extraction rule
// π(s) = argmaxₐ Σₛ′,ᵣ p(s′, r | s, a) * [r + γV(s′)]

//******************* state s0 *******************//
// we know there is only one possible action from s0, but let's do it anyway
π(s0) = argmax[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
               T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

π(s0) = argmax[0.1 * (0 + 0.9 * 8.8029961245) +
               0.9 * (0 + 0.9 * 9.8901046341)]

π(s0) = argmax[8.8032544048]

π(s0) = hit to green
```

```
//******************* state s1 *******************//
π(s1) = argmax[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
               T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
               T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
               T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

π(s1) = argmax[0.9 * (0 + 0.9 * 8.8029961245) +
               0.1 * (0 + 0.9 * 9.8901046341),
               0.1 * (0 + 0.9 * 9.8901046341) +
               0.9 * (10 + 0.9 * 0)]

π(s1) = argmax[8.0205362779, 9.8901094171]

π(s1) = hit in hole
```
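The extraction step can also be sketched in code: run value iteration to convergence as before, then take the argmax over the available actions in each non-terminal state (same model dictionary and assumed names as in the earlier sketches):

```python
T = {
    "s0": {"hit to green":   [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {"hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
           "hit in hole":    [("s1", 0.1, 0), ("s2", 0.9, 10)]},
    "s2": {},  # terminal state
}
gamma, theta = 0.9, 0.01

def q(s, a, V):
    """One-step lookahead: sum_s' T(s'|s,a) * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * V[ns]) for ns, p, r in T[s][a])

# Run value iteration to convergence, as in the walkthrough above
V = {s: 0.0 for s in T}
delta = theta
while delta >= theta:
    delta = 0.0
    for s in T:
        if T[s]:
            v = V[s]
            V[s] = max(q(s, a, V) for a in T[s])
            delta = max(delta, abs(v - V[s]))

# Policy extraction: pi(s) = argmax_a of the one-step lookahead
pi = {s: max(T[s], key=lambda a: q(s, a, V)) for s in T if T[s]}
print(pi)  # {'s0': 'hit to green', 's1': 'hit in hole'}
```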

Our final policy is:

```
π(s0) = hit to green
π(s1) = hit in hole
π(s2) = terminal state (no action)
```

So, when our agent is in the **Ball on fairway** state (s0), the best action is to *hit to green*. This seems pretty obvious since that is the only available action. However, in *s1*, where there are two possible actions, our policy has learned to *hit in hole*. We can now give this learned policy to other agents who want to play golf!

And there you have it! We have just solved a very simple RL problem using Value Iteration.