In the Off-policy Monte Carlo Control section of Reinforcement Learning: An Introduction, 2nd Edition (page 112), the author leaves us with an interesting exercise: use the weighted importance-sampling off-policy Monte Carlo method to find the fastest way to drive on both tracks. This exercise is comprehensive: it asks us to consider and build almost every component of a reinforcement learning task, such as the environment, the agent, rewards, actions, termination conditions, and the algorithm itself. Solving it is fun and helps us build a solid understanding of the interaction between algorithm and environment, the importance of a correct episodic task definition, and how value initialization affects the training outcome. Through this post, I hope to share my understanding of and solution to this exercise with everyone interested in reinforcement learning.
As mentioned above, this exercise asks us to find a policy that makes a race car drive from the starting line to the finishing line as fast as possible without running into gravel or off the track. After carefully reading the exercise description, I listed some key points that are essential to completing this task:
- Map representation: maps in this context are 2D matrices with (row_index, column_index) as coordinates. The value of each cell represents the state of that cell; for instance, we can use 0 for gravel, 1 for the track surface, 0.4 for the starting region, and 0.8 for the finishing line. Any row or column index outside the matrix is considered out of bounds.
- Car representation: we can directly use the matrix’s coordinates to represent the car’s position;
- Velocity and control: the velocity space is discrete and consists of horizontal and vertical speeds, represented as a tuple (row_speed, col_speed). The speed on each axis is limited to the open interval (-5, 5), i.e., between -4 and 4, and is changed by +1, 0, or -1 on each axis at every step; therefore, there are 9 possible actions per step. The two speed components cannot both be zero except at the starting line, and the vertical velocity, or row speed, can never point back toward the starting line (in our matrix coordinates the car drives upward toward smaller row indices, so the row speed is never positive). These conventions are captured in the short sketch after this list.
- Reward and episode: the reward for each step before crossing the finishing line is -1. When the car runs off the track, it is reset to one of the starting cells. The episode ends ONLY when the car successfully crosses the finishing line.
- Starting states: we randomly choose a starting cell for the car from the starting line; the car's initial speed is (0, 0), according to the exercise's description.
- Zero-acceleration challenge: the author proposes a small zero-acceleration challenge: at each time step, with probability 0.1, the action does not take effect and the car keeps its previous speed. We can implement this challenge in the training loop instead of adding the feature to the environment.
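To make these conventions concrete, here is a small sketch of the constants used throughout this post; the names (CELL_GRAVEL, ACTIONS, and so on) are my own and purely illustrative:

```python
# Cell states encoded in the track matrices
CELL_GRAVEL = 0.0    # off-track gravel
CELL_TRACK = 1.0     # drivable track surface
CELL_START = 0.4     # starting line
CELL_FINISH = 0.8    # finishing line

# Speed bounds: the car drives "upward" (toward smaller row indices),
# so the row speed is never positive; the column speed may go either way.
ROW_SPEED_RANGE = (-4, 0)
COL_SPEED_RANGE = (-4, 4)

# The 9 actions: every combination of {-1, 0, +1} increments on the two axes.
ACTIONS = {i: (d_row, d_col)
           for i, (d_row, d_col) in enumerate(
               (r, c) for r in (-1, 0, 1) for c in (-1, 0, 1))}
```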
The solution to the exercise is separated into two posts; in this post, we’ll focus on building a racetrack environment. The file structure of this exercise is as follows:
|-- race_track_env
| |-- maps
| | |-- build_tracks.py // this file is used to generate track maps
| | |-- track_a.npy // track a data
| | |-- track_b.npy // track b data
| |-- race_track.py // race track environment
|-- exercise_5_12_racetrack.py // the solution to this exercise
And the libraries used in this implementation are as follows:
python==3.9.16
numpy==1.24.3
matplotlib==3.7.1
pygame==2.5.0
We can represent the track maps as 2D matrices whose values indicate the track states. I want to stay loyal to the exercise, so I tried to build the same maps shown in the book by assigning matrix values manually. The maps are saved as separate .npy files so that the environment can read them during training instead of generating them at runtime.
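A minimal sketch of what build_tracks.py does, assuming a simplified rectangular layout rather than the book's exact geometry (the real script assigns cell values row by row by hand):

```python
import numpy as np

def build_track_a(save_path="track_a.npy"):
    """Build a simplified track matrix and save it as an .npy file.

    0 = gravel, 1 = track surface, 0.4 = starting line, 0.8 = finishing line.
    The real build_tracks.py hand-codes the exact shapes from the book.
    """
    track = np.zeros((32, 17), dtype=np.float32)   # everything starts as gravel
    track[:, 3:9] = 1.0                            # a vertical strip of track
    track[:6, 9:] = 1.0                            # a horizontal strip near the top
    track[-1, 3:9] = 0.4                           # starting line on the bottom row
    track[:6, -1] = 0.8                            # finishing line on the right edge
    np.save(save_path, track)
    return track

if __name__ == "__main__":
    build_track_a()
```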
The two maps look as follows; the light cells are gravel, the dark cells are track surfaces, the green-ish cells represent the finishing line, and the light-blue-ish cells represent the starting line.
With the track maps ready, we now proceed to create a gym-like environment with which the algorithm can interact. Gymnasium, previously OpenAI Gym and now maintained by the Farama Foundation, provides a simple interface for testing RL algorithms; we will use the gymnasium package as a reference when creating the racetrack environment. Our environment will include the following components/features:
- env.nS: the shape of the observation space, which is (num_rows, num_cols, row_speed, col_speed) in this example. The number of rows and columns varies between maps, but the speed space is consistent across tracks. For the row speed, as we don't want the car to go back toward the starting line, the observations consist of [-4, -3, -2, -1, 0] (negative values because we expect the car to go upward in the map), five values in total; the column speed has no such limit, so its observations range from -4 to 4, nine values in total. Therefore, the shape of the observation space in the racetrack example is (num_rows, num_cols, 5, 9).
- env.nA: the number of possible actions. There are 9 possible actions in our implementation; we also create a dictionary in the environment class that maps each integer action to a (row_speed, col_speed) increment used to control the agent.
- env.reset(): the function that takes the car back to one of the starting cells when the episode finishes or the car runs off the track;
- env.step(action): the step function enables the algorithm to interact with the environment by taking an action and returning a tuple of (next_state, reward, is_terminated, is_truncated).
- State-checking functions: there are two private functions that check whether the car has left the track or crossed the finishing line;
- Rendering functions: a rendering function is, from my point of view, essential for a custom environment; I always check whether the environment has been built properly by rendering the game space and the agent's behavior, which is more straightforward than staring at logging information.
With these features in mind, we can start building the racetrack environment.
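Below is a condensed sketch of what such an environment class might look like; the class name RaceTrack, the simplified collision and finish checks, and the omitted pygame renderer are my own shortcuts, and the full class in the repository differs in detail:

```python
import numpy as np

class RaceTrack:
    """A gym-like racetrack environment (a simplified sketch, not the full class)."""

    def __init__(self, track_path: str):
        self.track = np.load(track_path)          # 2D map: 0 gravel, 1 track, 0.4 start, 0.8 finish
        self.nS = (*self.track.shape, 5, 9)       # (num_rows, num_cols, row_speed, col_speed)
        self.nA = 9
        # Map integer actions to (row_speed, col_speed) increments.
        self.actions = {i: (r, c) for i, (r, c) in enumerate(
            (r, c) for r in (-1, 0, 1) for c in (-1, 0, 1))}
        self.start_cells = list(zip(*np.where(self.track == 0.4)))
        self.reset()

    def reset(self):
        row, col = self.start_cells[np.random.randint(len(self.start_cells))]
        self.state = (row, col, 0, 0)             # position plus (row_speed, col_speed)
        return self.state

    def _is_finished(self, row, col) -> bool:
        in_map = 0 <= row < self.track.shape[0] and 0 <= col < self.track.shape[1]
        return in_map and self.track[row, col] == 0.8

    def _is_off_track(self, row, col) -> bool:
        in_map = 0 <= row < self.track.shape[0] and 0 <= col < self.track.shape[1]
        return not in_map or self.track[row, col] == 0.0

    def step(self, action: int):
        row, col, row_speed, col_speed = self.state
        d_row, d_col = self.actions[action]
        # Clip speeds: the car only moves upward (row_speed <= 0).
        row_speed = int(np.clip(row_speed + d_row, -4, 0))
        col_speed = int(np.clip(col_speed + d_col, -4, 4))
        if row_speed == 0 and col_speed == 0:     # simplification: keep the car moving
            row_speed = -1
        row, col = row + row_speed, col + col_speed
        if self._is_finished(row, col):
            return (row, col, row_speed, col_speed), -1, True, False
        if self._is_off_track(row, col):          # restart from the starting line
            return self.reset(), -1, False, False
        self.state = (row, col, row_speed, col_speed)
        return self.state, -1, False, False
```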
So far, with everything needed for the racetrack environment ready, we can test/verify our environment setup.
First, we render the gameplay to check that every component is working smoothly:
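A sketch of a quick sanity check with a random agent; it assumes the environment exposes a render() method (the exact rendering API in the repository may differ):

```python
import numpy as np

def random_rollout(env, render=True, max_steps=100_000):
    """Run one episode with uniformly random actions and report the outcome."""
    state = env.reset()
    total_reward, terminated = 0, False
    for _ in range(max_steps):
        action = np.random.randint(env.nA)
        state, reward, terminated, truncated = env.step(action)
        total_reward += reward
        if render:
            env.render()          # assumes the environment implements a pygame renderer
        if terminated or truncated:
            break
    print(state, reward, terminated, total_reward)

# env = RaceTrack("race_track_env/maps/track_a.npy")
# random_rollout(env, render=True)
```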
Then we turn off the render option to let the environment run in the background and check that the trajectory terminates properly:
Track a:

| Observation | Reward | Terminated | Total reward |
| --- | --- | --- | --- |
| (1, 16, 0, 3) | -1 | True | -4984 |
| (2, 16, 0, 2) | -1 | True | -23376 |
| (3, 16, 0, 3) | -1 | True | -14101 |
| (1, 16, 0, 3) | -1 | True | -46467 |

Track b:

| Observation | Reward | Terminated | Total reward |
| --- | --- | --- | --- |
| (1, 31, -2, 2) | -1 | True | -3567 |
| (0, 31, -4, 4) | -1 | True | -682 |
| (2, 31, -2, 1) | -1 | True | -1387 |
| (3, 31, -1, 3) | -1 | True | -2336 |
With the environment ready, we can move on to implementing the off-policy MC control algorithm with weighted importance sampling, following the algorithm box in the book's Off-policy Monte Carlo Control section.
Monte Carlo methods solve the reinforcement learning problem by averaging sample returns. During training, the method generates a trajectory under a given policy and learns from the tail of each episode. The difference between on-policy and off-policy learning is that on-policy methods use a single policy both to make decisions and to improve it, whereas off-policy methods use separate policies for generating data and for policy improvement. According to the author of the book, almost all off-policy methods utilize importance sampling to estimate expected values under one distribution given samples from another. [2]
The following sections explain the soft-policy and weighted importance-sampling tricks that appear in the algorithm, and how essential a proper value initialization is to this task.
This off-policy algorithm uses two policies:
- Target policy: the policy being learned about. In this algorithm, the target policy is greedy and deterministic, which means the greedy action A at time t has probability 1.0 and all other actions have probability 0.0. The target policy follows Generalized Policy Iteration (GPI), improving after every step.
- Behavior policy: the policy used to generate behavior. The behavior policy in this algorithm is a soft policy, meaning that every action in a given state has a non-zero probability. We'll use an ε-greedy policy as the behavior policy here.
For a soft (ε-greedy) policy, the probability of an action is:
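Writing it out, with $\mathcal{A}(s)$ the set of actions available in state $s$ (9 actions here) and $A^* = \arg\max_a Q(s, a)$ the greedy action:

$$
b(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = A^* \\[4pt]
\dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise}
\end{cases}
$$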
We should also return this probability when generating actions, since it is needed for the importance-sampling computation. The code for the behavior policy looks as follows:
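A sketch of such an ε-greedy behavior policy, assuming Q is a numpy array indexed by (row, col, row_speed, col_speed, action); the function name and signature are illustrative:

```python
import numpy as np

def behavior_policy(state, Q, nA=9, epsilon=0.1):
    """Sample an action ε-greedily and return it together with its probability b(a|s)."""
    greedy_action = int(np.argmax(Q[state]))
    probs = np.full(nA, epsilon / nA)        # every action keeps probability ε/|A|
    probs[greedy_action] += 1.0 - epsilon    # the greedy action gets the remaining mass
    action = int(np.random.choice(nA, p=probs))
    return action, probs[action]
```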
We estimate the relative probability of the trajectory under the target policy compared to the behavior policy, and the equation for this probability is:
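Following the book's notation, with $\pi$ the target policy, $b$ the behavior policy, and $T$ the terminal time step of the episode:

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$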
The value estimate under weighted importance sampling is:
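Here $\mathcal{T}(s)$ denotes the set of time steps at which state $s$ is visited and $T(t)$ the termination time of the episode containing step $t$:

$$
V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} \, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}
$$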
We can rewrite this estimate in an incremental form that fits the episodic task:
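The book derives the incremental rule for $V_n$; the control algorithm applies the same update to the action values $Q$. With $W_n$ the importance-sampling weight of the $n$-th return and $C_n$ the cumulative sum of weights ($C_0 = 0$):

$$
Q_{n+1} = Q_n + \frac{W_n}{C_n} \left[ G_n - Q_n \right], \qquad C_{n+1} = C_n + W_{n+1}
$$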
There are plenty of excellent walkthroughs of how to derive the equation above, so we won't spend time on the derivation here. But I do want to explain the interesting trick in the last several lines of the algorithm:
The trick relies on the target policy being deterministic. If the action taken at a time step differs from the one the target policy would choose, then the probability of that action under the target policy is 0.0; since the update loop walks backward through the episode, all earlier steps of the trajectory would receive zero weight, so they no longer contribute updates to the state-action value function and we can exit the loop. Conversely, if the action at that time step matches the target policy, its probability is 1.0 and the action-value update continues.
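A sketch of this backward update as a standalone function; the names are illustrative, and Q and C are assumed to be indexed first by state and then by action:

```python
import numpy as np

def update_from_episode(trajectory, Q, C, gamma=1.0):
    """Backward weighted-importance-sampling update from one episode.

    trajectory: list of (state, action, reward, prob_b) tuples, where prob_b = b(a|s).
    Q, C: action values and cumulative weights, indexed first by state, then by action.
    """
    G, W = 0.0, 1.0
    for state, action, reward, prob_b in reversed(trajectory):
        G = gamma * G + reward                      # return of the trajectory tail
        C[state][action] += W
        Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
        if action != int(np.argmax(Q[state])):      # π(A_t|S_t) = 0: earlier steps get zero weight
            break                                   # so we can stop the loop here
        W *= 1.0 / prob_b                           # π(A_t|S_t) = 1 for the greedy action
```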
Proper value initialization plays an important role in successfully solving this exercise. Take another look at the total rewards of the random policy on both tracks in the tables above.
The reward at each time step before crossing the finishing line is -1, so at the early stage of training the returns are large negative numbers. If we initialize the state-action values to 0, or to random values drawn from a normal distribution with, say, a mean of 0 and variance of 1, those values are far too optimistic, and the program could take a very long time to find an optimal policy.
After several tests, I found that a normal distribution with a mean of -500 and a variance of 1 provides effective initial values for faster convergence; you are encouraged to try other values and may well find a better initialization than this.
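As a concrete sketch, assuming env is a RaceTrack instance as sketched earlier (the array shapes follow the env.nS and env.nA conventions):

```python
import numpy as np

# Pessimistic initialization: action values start far below any plausible return,
# and the cumulative importance-sampling weights start at zero.
Q = np.random.normal(loc=-500.0, scale=1.0, size=(*env.nS, env.nA))
C = np.zeros((*env.nS, env.nA))
```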
With all of the above in mind, we can write the off-policy Monte Carlo control function and use it to solve the racetrack exercise. We'll also add the zero-acceleration challenge mentioned in the exercise description.
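A condensed sketch of the training loop, reusing the behavior_policy and update_from_episode helpers sketched above; the function name, the way the acceleration failure is handled, and the default hyperparameters are my own choices, and the full version in the repository differs in detail:

```python
import numpy as np

def off_policy_mc_control(env, Q, C, num_episodes=1_000_000, epsilon=0.1,
                          gamma=1.0, zero_acceleration=False):
    """Off-policy MC control with weighted importance sampling (a sketch).

    Q and C are arrays of shape (*env.nS, env.nA), e.g. initialized as in the
    snippet above; negative speed values simply index these arrays from the end
    of the axis, which stays consistent because every speed maps to a unique slot.
    """
    episode_rewards = []
    for _ in range(num_episodes):
        # Generate an episode with the ε-greedy behavior policy.
        trajectory, state, terminated = [], env.reset(), False
        total_reward = 0
        while not terminated:
            action, prob_b = behavior_policy(state, Q, env.nA, epsilon)
            executed = action
            # Zero-acceleration challenge, treated as environment noise: with
            # probability 0.1 the chosen increment is replaced by (0, 0), which is
            # action index 4 in the mapping above; the recorded action and its
            # probability stay those the behavior policy actually chose.
            if zero_acceleration and np.random.random() < 0.1:
                executed = 4
            next_state, reward, terminated, _ = env.step(executed)
            trajectory.append((state, action, reward, prob_b))
            total_reward += reward
            state = next_state
        episode_rewards.append(total_reward)

        # Backward weighted-importance-sampling update (see update_from_episode above).
        update_from_episode(trajectory, Q, C, gamma)

    return np.argmax(Q, axis=-1), episode_rewards   # greedy target policy and training record
```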
Then we train the policies for 1,000,000 episodes, with and without the zero-acceleration challenge, on both tracks, and evaluate them with the following code:
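A sketch of what that training and evaluation step could look like; the file names and the smoothing window are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

window = 100                                       # moving-average window for the reward curve
for track_name in ("track_a", "track_b"):
    env = RaceTrack(f"race_track_env/maps/{track_name}.npy")
    for noise in (False, True):
        Q = np.random.normal(-500.0, 1.0, size=(*env.nS, env.nA))
        C = np.zeros((*env.nS, env.nA))
        policy, rewards = off_policy_mc_control(env, Q, C, num_episodes=1_000_000,
                                                zero_acceleration=noise)
        np.save(f"policy_{track_name}_noise_{noise}.npy", policy)
        smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(smoothed, label=f"{track_name}, zero acceleration: {noise}")

plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.legend()
plt.show()
```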
By plotting the training record, we find that the policy converges around the 10,000th episode on both tracks when the zero-acceleration challenge is ignored; adding zero acceleration makes the convergence slower and less stable before the 100,000th episode, but the policy still converges afterward.
Finally, we can let the agents play the game with the trained policies and plot the resulting trajectories to further check the correctness of the policies.
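A minimal sketch of replaying a trained greedy policy and collecting the visited cells; the plotting itself is left out, and the function name is illustrative:

```python
def play_with_policy(env, policy, max_steps=1_000):
    """Follow the deterministic greedy policy and return the visited (row, col) cells."""
    state, terminated, path = env.reset(), False, []
    for _ in range(max_steps):
        path.append(state[:2])               # record the car's position
        action = int(policy[state])          # greedy action for the current state
        state, _, terminated, _ = env.step(action)
        if terminated:
            break
    return path
```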
Sample trajectories for track a:
Sample trajectories for track b:
So far, we've solved the racetrack exercise. This implementation could still have some problems, and you're very welcome to point them out and discuss a better solution in the comments. Thanks for reading!
[1] Midjourney Terms of Service: https://docs.midjourney.com/docs/terms-of-service
[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
My GitHub repo of this exercise: [Link]; please check the “exercise section”.