Curiousily

Build a taxi driving agent in a post-apocalyptic world using Reinforcement Learning | Machine Learning from Scratch (Part VIII)

28.06.2019 — Machine Learning, Reinforcement Learning, MDP, Q-Learning, Self-driving — 8 min read

TL;DR Build a simple MDP for self-driving taxi problem. Pick up passengers, avoid danger and drop them off at a specified location. Build an agent and solve the problem using Q-learning.

You wake up. It is a sunny day. Your partner is still asleep next to you. You take a minute to admire the beauty and even crack a smile.

Your stomach is rumbling, so you jump on your feet and look around. Jackpot! You’re looking at the leftovers of the anniversary dinner you two had last night. You take a minute to scratch some private spot of yours. Yep, what an amazing morning! You have 1/3 of a bean can, expired only 6 months ago. It is delicious!

Although you feel bad about not leaving any food for the one sleeping in your bed, you dress up and prepare for work. “What are you going to do? Not eat and work!?” - those thoughts don’t seem to help anymore. Time to hop in the car!

You try to reach Johny via the radio but no luck. Oh well, Jimmy is missing since last week, his twin brother might be too. Strange, you can’t remember the last time your eyes were feeling wet.

A lot has changed since the event. The streets are getting more dangerous every week, but people still need transportation. You receive a pickup request and have a look at the anomaly map. You accept hesitantly. Your map was updated 2 months ago.

You got the job done and got credit for a half can. That couple was really generous! That leaves time for tinkering with the idea of making self-driving taxis. You even got a name - SafeCab. You read a lot of about this crazy guy called Elon Nusk that was trying to build those fully autonomous vehicles right before the event occurred.

You start scratching your head. Is this possible or just a fantasy? After all, if you get this done, you might afford to eat every other day. Enter Reinforcement Learning.

Complete source code in Google Colaboratory Notebook

Reinforcement Learning

Reinforcement Learning (RL) is concerned with producing algorithms (agents) that try to achieve some predefined goal. The achievement of this objective is dependent on choosing a set of actions - receiving rewards for the good ones and punishment for the bad ones - the reinforcement bit comes from this. The agent act, in environments that have a state, gives rewards, and a set of actions.

source: https://phrasee.co/

Deep Reinforcement Learning (using Deep Neural Networks for choosing actions) achieved some great things lately:

Markov Decision Processes

Markov Decision Process (MDP) is a mathematical formulation of the RL problem. MDPs satisfy the Markov property:

Markov property - the current state completely represents the state of the environment (world). That is, the future depends only on the present.

An MDP can be defined by $(S, A, R, P, \gamma)$ where:

$S$ - set of possible states
$A$ - set of possible actions
$R$ - probability distribution of reward given (state, action) pair
$P$ - probability distribution over how likely any of the states is to be the new states, given (state, action) pair. Also known as transition probability.
$\gamma$ - reward discount factor

The discount factor $\gamma$ allows us to inject the heuristic that 100 bucks now are more valuable than 100 bucks in 30 days. A discount factor of 1 would make future rewards worth just as much as immediate rewards.

Another reason to discount future rewards:

In the long run we are all dead - J. M. Keynes

Learning

Here is how learning happens in RL context:

for time step $t=0$ until done:

The environment gives your agent a state
Your agent chooses an action (from a set of possible ones)
The environment gives a reward along with a new state
Continue until the goal or other condition is met

What is the objective of all this? Find a function $\pi^*$ , known as optimal policy, that maximizes the cumulative discounted reward:

\sum_{t \geq 0}\gamma^t r_t

where $r_t$ is the reward received at step $t$ and $\gamma^t$ is the discount factor at step $t$ .

A policy $\pi$ is a function that maps state $s$ to action $a$ , that our agent believes is the best given that state.

Value function

The value function gives you the maximum expected future reward the agent will get, starting from some state $s$ and following some policy $\pi$ .

V_{\pi}(s) = \mathop{\mathbb{E}}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} | S_t = s]

There exists an optimal value function that has the highest value for all states. Defined as:

V^*(s) = \max_{\pi}V^{\pi}(s) \quad \forall s \in \mathbb{S}

$Q$ -function

Similarly, let’s define another function known as $Q$ -function (state-action value function) that gives the expected return starting from state $s$ , taking action $a$ , and following policy $\pi$ .

Q_{\pi}(s, a) = \mathop{\mathbb{E}}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} | S_t = s, A_t = a]

You can think of the $Q$ -function as the “quality” of a certain action given a state. We can define the optimal $Q$ -function as:

Q^*(s, a) = \max_{\pi}Q^{\pi}(s, a) \quad \forall s \in \mathbb{S}

There is a relationship between the two optimal functions $V^*$ and $Q^*$ . It is given by:

V^*(s) = \max_aQ^*(s, a) \quad \forall s \in \mathbb{S}

That is, the maximum expected total reward when starting at $s$ is the maximum of $Q^*(s, a)$ over all possible actions.

We can find the optimal policy $\pi^*$ by choosing the action $a$ that gives maximum reward $Q^*(s, a)$ for state $s$ :

\pi^*(s) = \text{arg}\max_{a} Q^* (s, a) \quad \forall s \in \mathbb{S}

There seems to be a synergy between all functions we defined so far. More importantly, we can now build an optimal agent for a given environment. Can we do it in practice?

$Q$ -learning

Q-Learning is an off-policy algorithm for Temporal Difference (TD) learning. It is proven that with enough training, it converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy. It learns the optimal policy even when actions are selected using exploratory (some randomness) policy (off-policy).

Given a state $s$ and action $a$ , we can express $Q(s, a)$ in terms of itself (recursively):

Q(s, a) = r + \gamma \max_{a'}Q(s', a')

This is the Bellman equation. It defines the maximum future reward as the reward $r$ the agent received, for being at state $s$ , plus the maximum future reward for state $s'$ for every possible action.

We can define $Q$ -learning as an iterative approximation of $Q^*$ using the Bellman equation:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha(r_{t+1} + \gamma \max_{a}Q_t(s_{t + 1}, a) - Q_t(s_t, a_t))

where $\alpha$ is the learning rate that controls how much the difference between previous and new Q value is considered.

Here’s the general algorithm we’re going to implement:

Initialize a $Q$ -values table
Observe initial state $s$
Choose action $a$ and act
Observe reward $r$ and a new state $s'$
Update the $Q$ table using $r$ and the maximum possible reward from $s'$
Set the current state to the new state and repeat from step 2 until a terminal state

Note that $Q$ -learning is just one possible algorithm to solve the RL problem. Here is a comparison between some more of them.

Exploration vs exploitation

You might’ve noticed that we glanced over the strategy on choosing an action. Any strategy you implement will have to choose how often to try something new or use something it already knows. This is known as the exploration/exploitation tradeoff.

Exploration - finding new information about the environment
Exploitation - using existing information to maximize the reward

Remember, the goal of our RL agent is to maximize the expected cumulative reward. What does this mean for our self-driving taxi?

Initially, our driving agent knows pretty much nothing about the best set of driving directions for picking up and dropping off passengers. It should learn to avoid anomalies too since they are bad for business (passengers tend to disappear or worse in those)! During this time, we expect to make a lot of exploration.

After obtaining some experience, the agent can use it to choose an action more and more often. Eventually, all choices will be based on what is learned.

Driving in a post-apocalyptic world

Your partner sketched this map. Each block represents a small region of the city. Anomalies are marked in bright circles. The four letters are “safe zones”, where pickups and drop-offs happen.

Let’s assume your car is the only vehicle in the city (not much of a stretch). We can break it up into a 5x5 grid, which gives us 25 possible taxi locations.

The current taxi location is (3, 1) in (row, col) coordinates. The pickup and drop off locations are: [(0,0), (0,4), (4,0), (4,3)]. Anomalies are at [(0, 2), (2, 1), (2, 2), (4, 2)].

Environment

We’re going to encode our city map into an environment for the self-driving agent using OpenAI’s Gym. What is this Gym thing anyways?

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.

The most important entity in Gym is the Environment. It provides a unified and easy to use interface. Here are the most important methods:

reset - resets the environment and returns a random initial state
step(action) - take action and advance one timestep. It returns:
observation - the new state of the environment
reward - reward received from taking the action
done - most environments are divided into well-defined episodes and if done is True it indicates the episode has been completed
info - additional information about the environment (might be useful for debugging)
render - renders one frame of the environment (useful for visualization)

The complete source code of the environment is in the notebook. Here, we’ll take a look at the registration to Gym:

1register(
2    id='SafeCab-v0',
3    entry_point=f"{__name__}:SafeCab",
4    timestep_limit=100,
5)

Note that we set a timestep_limit which limits the number of steps in an episode.

Action Space

Our agent encounters one of the 500 states (5 rows x 5 columns x 5 passanger locations x 4 destinations) and it chooses an action. Here are the possible actions:

south
north
east
west
pickup
dropoff

You might notice that in the illustration above, the taxi cannot perform certain actions when near the boundaries of the city. We will penalize this with -1 and won’t move the taxi if such situation occurs.

Getting a taste of the environment

Let’s have a look at our encoded environment:

1env.reset()
2env.render()
3
4print("Action Space {}".format(env.action_space))
5print("State Space {}".format(env.observation_space))

1Action Space Discrete(6)
2State Space Discrete(500)

Here is the action to number mapping:

0 = south
1 = north
2 = east
3 = west
4 = pickup
5 = dropoff

Building an agent

Our agent has a simple interface for interacting with the environment. Here it is:

1class Agent:
2
3  def __init__(
4    self,
5    n_states,
6    n_actions,
7    decay_rate=0.0001,
8    learning_rate=0.7,
9    gamma=0.618
10  ):
11    pass
12
13  def choose_action(self, explore=True):
14    pass
15
16  def learn(
17    self,
18    state,
19    action,
20    reward,
21    next_state,
22    done,
23    episode
24  ):
25    pass

We’re going to use the choose_action method when we want our agent to make a decision and act. Then, after the reward and new state from the environment are observed, our agent will learn from its actions using the learn method.

Let’s take a look at the implementations:

1def __init__(
2  self,
3  n_states,
4  n_actions,
5  decay_rate=0.0001,
6  learning_rate=0.7,
7  gamma=0.618
8):
9  self.n_actions = n_actions
10  self.q_table = np.zeros((n_states, n_actions))
11  self.max_epsilon = 1.0
12  self.min_epsilon = 0.01
13  self.epsilon = self.max_epsilon
14  self.decay_rate = decay_rate
15  self.learning_rate = learning_rate
16  self.gamma = gamma # discount rate
17  self.epsilons_ = []

The interesting part of the __init__ is the initialization of our $Q$ -table. Initially, it is all zeros. Can we initialize it in some other way?

1def choose_action(self, explore=True):
2  exploration_tradeoff = np.random.uniform(0, 1)
3
4  if explore and exploration_tradeoff < self.epsilon:
5    # exploration
6    return np.random.randint(self.n_actions)
7  else:
8    # exploitation (taking the biggest Q value for this state)
9    return np.argmax(self.q_table[state, :])

Our strategy is rather simple. We draw a random number from a uniform distribution between $0$ and $1$ . If this number is smaller than epsilon and we want to explore, we take a random action. Otherwise, we take the best action based on our current knowledge.

1def learn(
2  self,
3  state,
4  action,
5  reward,
6  next_state,
7  done,
8  episode
9):
10  # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
11  self.q_table[state, action] = self.q_table[state, action] + \
12    self.learning_rate * (reward + self.gamma * \
13      np.max(self.q_table[new_state, :]) - self.q_table[state, action])
14
15  if done:
16    # Reduce epsilon to decrease the exploration over time
17    self.epsilon = self.min_epsilon + (self.max_epsilon - self.min_epsilon) * \
18      np.exp(-self.decay_rate * episode)
19    self.epsilons_.append(self.epsilon)

Learning involves the update of the $Q$ -table using the $Q$ -learning equation and reducing the exploration rate $\epsilon$ if the episode is complete.

Training

Now that our agent is ready for action, we can train it on the environment we’ve created. Let’s train our agent for 50k episodes and record episode rewards over time:

1total_episodes = 60000
2total_test_episodes = 10
3
4agent = Agent(env.observation_space.n, env.action_space.n)

Let’s have a look at our agent driving before learning anything about the environment:

1rewards = []
2
3for episode in range(total_episodes):
4    state = env.reset()
5    episode_rewards = []
6
7    while True:
8
9        action = agent.choose_action()
10
11        # Take the action (a) and observe the outcome state(s') and reward (r)
12        new_state, reward, done, info = env.step(action)
13
14        agent.learn(
15          state,
16          action,
17          reward,
18          new_state,
19          done,
20          episode
21        )
22
23        state = new_state
24
25        episode_rewards.append(reward)
26
27        if done == True:
28          break
29
30    rewards.append(np.mean(episode_rewards))

Recall that we set timestep_limit when we registered the environment so our agent won’t stay in an episode infinitely.

Evaluation

Let’s have a look at reward changes as the training progresses:

Note that the learning curve is smoothened using savgol_filter savgol_filter(rewards, window_length=1001, polyorder=2)

Recall that our exploration rate should decrease as our agent is learning. Have a look:

Here’s how we’re going to test our agent and record the progress:

1frames = []
2
3rewards = []
4
5for episode in range(total_test_episodes):
6    state = env.reset()
7    episode_rewards = []
8
9    step = 1
10
11    while True:
12        action = agent.choose_action(explore=False)
13
14        new_state, reward, done, info = env.step(action)
15
16        frames.append({
17          'frame': env.render(mode='ansi'),
18          'state': state,
19          'episode': episode + 1,
20          'step': step,
21          'reward': reward
22        })
23
24        episode_rewards.append(reward)
25
26        if done:
27            step = 0
28            break
29        state = new_state
30        step += 1
31
32    rewards.append(np.mean(episode_rewards))
33
34env.close()

Note that we want our agent to use only the experience it has so we set explore=False. Here’s what the total reward for each episode looks like:

I know that this chart might not give you a good idea of what the agent is capable of. Here is a video of it driving in the city:

Pretty good, right? It looks like that it learned to avoid the anomalies, pick up and drop off passengers.

Conclusion

Congratulations on building a a self-driving taxi agent. You’ve learned how to

Build your own environment based on one provided by OpenAI’s Gym
Implement and apply $Q$ -learning
Build an agent that learns to pickup, drop off passengers and avoid danger areas

Complete source code in Google Colaboratory Notebook

Can you increase the size of the city? Does the agent learns well still? Tell me how it went in the comments below!

Want to be a Machine Learning expert?

Join the weekly newsletter on Data Science, Deep Learning and Machine Learning in your inbox, curated by me! Chosen by 10,000+ Machine Learning practitioners. (There might be some exclusive content, too!)

You'll never get spam from me

Hacker's Guide to Neural Networks in JavaScript

Build Machine Learning models (especially Deep Neural Networks) that you can easily integrate with existing or new web apps. Think of your ReactJs, Vue, or Angular app enhanced with the power of Machine Learning models.

Get SH*T Done with PyTorch

Learn how to solve real-world problems with Deep Learning models (NLP, Computer Vision, and Time Series). Go from prototyping to deployment with PyTorch and Python!

Hacker's Guide to Machine Learning with Python

This book brings the fundamentals of Machine Learning to you, using tools and techniques used to solve real-world problems in Computer Vision, Natural Language Processing, and Time Series analysis. The skills taught in this book will lay the foundation for you to advance your journey to Machine Learning Mastery!

Hands-On Machine Learning from Scratch

This book will guide you on your journey to deeper Machine Learning understanding by developing algorithms in Python from scratch! Learn why and when Machine learning is the right tool for the job and how to improve low performing models!