Build a taxi driving agent in a post-apocalyptic world using Reinforcement Learning | Machine Learning from Scratch (Part VIII)

TL;DR Build a simple MDP for self-driving taxi problem. Pick up passengers, avoid danger and drop them off at a specified location. Build an agent and solve the problem using Q-learning.

You wake up. It is a sunny day. Your partner is still asleep next to you. You take a minute to admire the beauty and even crack a smile.

Your stomach is rumbling, so you jump on your feet and look around. Jackpot! You’re looking at the leftovers of the anniversary dinner you two had last night. You take a minute to scratch some private spot of yours. Yep, what an amazing morning! You have 1/3 of a bean can, expired only 6 months ago. It is delicious!

Although you feel bad about not leaving any food for the one sleeping in your bed, you dress up and prepare for work. “What are you going to do? Not eat and work!?” - those thoughts don’t seem to help anymore. Time to hop in the car!

You try to reach Johny via the radio but no luck. Oh well, Jimmy is missing since last week, his twin brother might be too. Strange, you can’t remember the last time your eyes were feeling wet.

A lot has changed since the event. The streets are getting more dangerous every week, but people still need transportation. You receive a pickup request and have a look at the anomaly map. You accept hesitantly. Your map was updated 2 months ago.

You got the job done and got credit for a half can. That couple was really generous! That leaves time for tinkering with the idea of making self-driving taxis. You even got a name - SafeCab. You read a lot of about this crazy guy called Elon Nusk that was trying to build those fully autonomous vehicles right before the event occurred.

You start scratching your head. Is this possible or just a fantasy? After all, if you get this done, you might afford to eat every other day. Enter Reinforcement Learning.

Complete source code in Google Colaboratory Notebook

Reinforcement Learning

Reinforcement Learning (RL) is concerned with producing algorithms (agents) that try to achieve some predefined goal. The achievement of this objective is dependent on choosing a set of actions - receiving rewards for the good ones and punishment for the bad ones - the reinforcement bit comes from this. The agent act, in environments that have a state, gives rewards, and a set of actions.

Deep Reinforcement Learning (using Deep Neural Networks for choosing actions) achieved some great things lately:

Markov Decision Processes

Markov Decision Process (MDP) is a mathematical formulation of the RL problem. MDPs satisfy the Markov property:

Markov property - the current state completely represents the state of the environment (world). That is, the future depends only on the present.

An MDP can be defined by $(S, A, R, P, \gamma)$ where:

• $S$ - set of possible states
• $A$ - set of possible actions
• $R$ - probability distribution of reward given (state, action) pair
• $P$ - probability distribution over how likely any of the states is to be the new states, given (state, action) pair. Also known as transition probability.
• $\gamma$ - reward discount factor

The discount factor $\gamma$ allows us to inject the heuristic that 100 bucks now are more valuable than 100 bucks in 30 days. A discount factor of 1 would make future rewards worth just as much as immediate rewards.

Another reason to discount future rewards:

In the long run we are all dead - J. M. Keynes

Learning

Here is how learning happens in RL context:

for time step $t=0$ until done:

1. The environment gives your agent a state
2. Your agent chooses an action (from a set of possible ones)
3. The environment gives a reward along with a new state
4. Continue until the goal or other condition is met

What is the objective of all this? Find a function $\pi^*$, known as optimal policy, that maximizes the cumulative discounted reward:

$\sum_{t \geq 0}\gamma^t r_t$

where $r_t$ is the reward received at step $t$ and $\gamma^t$ is the discount factor at step $t$.

A policy $\pi$ is a function that maps state $s$ to action $a$, that our agent believes is the best given that state.

Value function

The value function gives you the maximum expected future reward the agent will get, starting from some state $s$ and following some policy $\pi$.

$V_{\pi}(s) = \mathop{\mathbb{E}}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} | S_t = s]$

There exists an optimal value function that has the highest value for all states. Defined as:

$V^*(s) = \max_{\pi}V^{\pi}(s) \quad \forall s \in \mathbb{S}$

$Q$-function

Similarly, let’s define another function known as $Q$-function (state-action value function) that gives the expected return starting from state $s$, taking action $a$, and following policy $\pi$.

$Q_{\pi}(s, a) = \mathop{\mathbb{E}}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} | S_t = s, A_t = a]$

You can think of the $Q$-function as the “quality” of a certain action given a state. We can define the optimal $Q$-function as:

$Q^*(s, a) = \max_{\pi}Q^{\pi}(s, a) \quad \forall s \in \mathbb{S}$

There is a relationship between the two optimal functions $V^*$ and $Q^*$. It is given by:

$V^*(s) = \max_aQ^*(s, a) \quad \forall s \in \mathbb{S}$

That is, the maximum expected total reward when starting at $s$ is the maximum of $Q^*(s, a)$ over all possible actions.

We can find the optimal policy $\pi^*$ by choosing the action $a$ that gives maximum reward $Q^*(s, a)$ for state $s$:

$\pi^*(s) = \text{arg}\max_{a} Q^* (s, a) \quad \forall s \in \mathbb{S}$

There seems to be a synergy between all functions we defined so far. More importantly, we can now build an optimal agent for a given environment. Can we do it in practice?

$Q$-learning

Q-Learning is an off-policy algorithm for Temporal Difference (TD) learning. It is proven that with enough training, it converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy. It learns the optimal policy even when actions are selected using exploratory (some randomness) policy (off-policy).

Given a state $s$ and action $a$, we can express $Q(s, a)$ in terms of itself (recursively):

$Q(s, a) = r + \gamma \max_{a'}Q(s', a')$

This is the Bellman equation. It defines the maximum future reward as the reward $r$ the agent received, for being at state $s$, plus the maximum future reward for state $s'$ for every possible action.

We can define $Q$-learning as an iterative approximation of $Q^*$ using the Bellman equation:

$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha(r_{t+1} + \gamma \max_{a}Q_t(s_{t + 1}, a) - Q_t(s_t, a_t))$

where $\alpha$ is the learning rate that controls how much the difference between previous and new Q value is considered.

Here’s the general algorithm we’re going to implement:

1. Initialize a $Q$-values table
2. Observe initial state $s$
3. Choose action $a$ and act
4. Observe reward $r$ and a new state $s'$
5. Update the $Q$ table using $r$ and the maximum possible reward from $s'$
6. Set the current state to the new state and repeat from step 2 until a terminal state

Note that $Q$-learning is just one possible algorithm to solve the RL problem. Here is a comparison between some more of them.

Exploration vs exploitation

You might’ve noticed that we glanced over the strategy on choosing an action. Any strategy you implement will have to choose how often to try something new or use something it already knows. This is known as the exploration/exploitation tradeoff.

• Exploration - finding new information about the environment
• Exploitation - using existing information to maximize the reward

Remember, the goal of our RL agent is to maximize the expected cumulative reward. What does this mean for our self-driving taxi?

Initially, our driving agent knows pretty much nothing about the best set of driving directions for picking up and dropping off passengers. It should learn to avoid anomalies too since they are bad for business (passengers tend to disappear or worse in those)! During this time, we expect to make a lot of exploration.

After obtaining some experience, the agent can use it to choose an action more and more often. Eventually, all choices will be based on what is learned.

Driving in a post-apocalyptic world

Your partner sketched this map. Each block represents a small region of the city. Anomalies are marked in bright circles. The four letters are “safe zones”, where pickups and drop-offs happen.

Let’s assume your car is the only vehicle in the city (not much of a stretch). We can break it up into a 5x5 grid, which gives us 25 possible taxi locations.

The current taxi location is (3, 1) in (row, col) coordinates. The pickup and drop off locations are: [(0,0), (0,4), (4,0), (4,3)]. Anomalies are at [(0, 2), (2, 1), (2, 2), (4, 2)].

Environment

We’re going to encode our city map into an environment for the self-driving agent using OpenAI’s Gym. What is this Gym thing anyways?

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.

The most important entity in Gym is the Environment. It provides a unified and easy to use interface. Here are the most important methods:

• reset - resets the environment and returns a random initial state

• step(action) - take action and advance one timestep. It returns:

observation - the new state of the environment

reward - reward received from taking the action

done - most environments are divided into well-defined episodes and if done is True it indicates the episode has been completed

info - additional information about the environment (might be useful for debugging)

• render - renders one frame of the environment (useful for visualization)

The complete source code of the environment is in the notebook. Here, we’ll take a look at the registration to Gym:

register(
id='SafeCab-v0',
entry_point=f"{__name__}:SafeCab",
timestep_limit=100,
)

Note that we set a timestep_limit which limits the number of steps in an episode.

Action Space

Our agent encounters one of the 500 states (5 rows x 5 columns x 5 passanger locations x 4 destinations) and it chooses an action. Here are the possible actions:

1. south
2. north
3. east
4. west
5. pickup
6. dropoff

You might notice that in the illustration above, the taxi cannot perform certain actions when near the boundaries of the city. We will penalize this with -1 and won’t move the taxi if such situation occurs.

Getting a taste of the environment

Let’s have a look at our encoded environment:

env.reset()
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)

Here is the action to number mapping:

• 0 = south
• 1 = north
• 2 = east
• 3 = west
• 4 = pickup
• 5 = dropoff

Building an agent

Our agent has a simple interface for interacting with the environment. Here it is:

class Agent:

def __init__(
self,
n_states,
n_actions,
decay_rate=0.0001,
learning_rate=0.7,
gamma=0.618
):
pass

def choose_action(self, explore=True):
pass

def learn(
self,
state,
action,
reward,
next_state,
done,
episode
):
pass

We’re going to use the choose_action method when we want our agent to make a decision and act. Then, after the reward and new state from the environment are observed, our agent will learn from its actions using the learn method.

Let’s take a look at the implementations:

def __init__(
self,
n_states,
n_actions,
decay_rate=0.0001,
learning_rate=0.7,
gamma=0.618
):
self.n_actions = n_actions
self.q_table = np.zeros((n_states, n_actions))
self.max_epsilon = 1.0
self.min_epsilon = 0.01
self.epsilon = self.max_epsilon
self.decay_rate = decay_rate
self.learning_rate = learning_rate
self.gamma = gamma # discount rate
self.epsilons_ = []

The interesting part of the __init__ is the initialization of our $Q$-table. Initially, it is all zeros. Can we initialize it in some other way?

def choose_action(self, explore=True):

if explore and exploration_tradeoff < self.epsilon:
# exploration
return np.random.randint(self.n_actions)
else:
# exploitation (taking the biggest Q value for this state)
return np.argmax(self.q_table[state, :])

Our strategy is rather simple. We draw a random number from a uniform distribution between $0$ and $1$. If this number is smaller than epsilon and we want to explore, we take a random action. Otherwise, we take the best action based on our current knowledge.

def learn(
self,
state,
action,
reward,
next_state,
done,
episode
):
# Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
self.q_table[state, action] = self.q_table[state, action] + \
self.learning_rate * (reward + self.gamma * \
np.max(self.q_table[new_state, :]) - self.q_table[state, action])

if done:
# Reduce epsilon to decrease the exploration over time
self.epsilon = self.min_epsilon + (self.max_epsilon - self.min_epsilon) * \
np.exp(-self.decay_rate * episode)
self.epsilons_.append(self.epsilon)

Learning involves the update of the $Q$-table using the $Q$-learning equation and reducing the exploration rate $\epsilon$ if the episode is complete.

Training

Now that our agent is ready for action, we can train it on the environment we’ve created. Let’s train our agent for 50k episodes and record episode rewards over time:

total_episodes = 60000
total_test_episodes = 10

agent = Agent(env.observation_space.n, env.action_space.n)

Let’s have a look at our agent driving before learning anything about the environment:

rewards = []

for episode in range(total_episodes):
state = env.reset()
episode_rewards = []

while True:

action = agent.choose_action()

# Take the action (a) and observe the outcome state(s') and reward (r)
new_state, reward, done, info = env.step(action)

agent.learn(
state,
action,
reward,
new_state,
done,
episode
)

state = new_state

episode_rewards.append(reward)

if done == True:
break

rewards.append(np.mean(episode_rewards))

Recall that we set timestep_limit when we registered the environment so our agent won’t stay in an episode infinitely.

Evaluation

Let’s have a look at reward changes as the training progresses:

Note that the learning curve is smoothened using savgol_filter savgol_filter(rewards, window_length=1001, polyorder=2)

Recall that our exploration rate should decrease as our agent is learning. Have a look:

Here’s how we’re going to test our agent and record the progress:

frames = []

rewards = []

for episode in range(total_test_episodes):
state = env.reset()
episode_rewards = []

step = 1

while True:
action = agent.choose_action(explore=False)

new_state, reward, done, info = env.step(action)

frames.append({
'frame': env.render(mode='ansi'),
'state': state,
'episode': episode + 1,
'step': step,
'reward': reward
})

episode_rewards.append(reward)

if done:
step = 0
break
state = new_state
step += 1

rewards.append(np.mean(episode_rewards))

env.close()

Note that we want our agent to use only the experience it has so we set explore=False. Here’s what the total reward for each episode looks like:

I know that this chart might not give you a good idea of what the agent is capable of. Here is a video of it driving in the city:

Pretty good, right? It looks like that it learned to avoid the anomalies, pick up and drop off passengers.

Conclusion

Congratulations on building a a self-driving taxi agent. You’ve learned how to

• Build your own environment based on one provided by OpenAI’s Gym
• Implement and apply $Q$-learning
• Build an agent that learns to pickup, drop off passengers and avoid danger areas

Complete source code in Google Colaboratory Notebook

Can you increase the size of the city? Does the agent learns well still? Tell me how it went in the comments below!