Guide on Reinforcement Learning

Reinforcement Learning has emerged as a promising field within Artificial Intelligence. This Machine Learning subfield lets AI models improve their performance through experience, and its algorithms play a pivotal role in Large Language Models (LLMs). But what makes Reinforcement Learning special? Let's dive deeper into that.

What is Reinforcement Learning?

Reinforcement Learning (RL) is a subtype of Machine Learning in which an artificial "agent" learns how to reach desired outcomes through trial and error. This approach lets AI models improve their behavior over time on their own, getting better through experience rather than explicit instructions.

This trial-and-error approach makes Reinforcement Learning quite popular in gaming. A well-known instance of RL in action is DeepMind's AlphaGo, the first AI program to defeat a Go world champion. Another impressive example is OpenAI Five, which defeated the reigning world champions in Dota 2. These victories show how powerful Reinforcement Learning can be in decision-making and solving complex tasks.

Reinforcement Learning has also been widely used in robotics, where engineers have used it to train general-purpose robots to perform tasks like inspection, delivery, and maintenance. One of the best examples is Google's MT-Opt.

What is Deep Reinforcement Learning?

Within Reinforcement Learning, Deep Reinforcement Learning (Deep RL) blends RL with Neural Networks (NNs), a key component of Deep Learning, to solve complex problems. The "Deep" refers to the many layers of these Artificial Neural Networks, which make it possible to handle much larger and higher-dimensional inputs than traditional RL methods can.

Deep RL has made great contributions to Natural Language Processing (NLP), Computer Vision, and Medical Diagnosis. Like traditional Reinforcement Learning, Deep RL is extremely popular in fields like gaming, robotics, and image processing. It has also become widely used in state-of-the-art products such as self-driving cars.

How Does Reinforcement Learning Work?

There are three core aspects to keep in mind to understand how the dynamic environment of Reinforcement Learning works: agents, goals, and reward functions. The RL agent works as the main player, navigating the learning process, while the goal gives the agent direction, shaping its actions. Lastly, reward functions are the key to its trial-and-error approach. As the agent interacts with its simulated environment, it will perform tasks until it gets to the defined goal. Every time it reaches the goal, it gets a reward.

But what is a reward function? It is the rule that turns the agent's behavior into a numeric signal. For humans, a common analogy would be getting a high-five or a cookie: the agent receives a positive signal to let it know it did what it was supposed to. In an adventure video game, the rewarded outcome could be getting to the end of a maze, and in a driving video game, it could be reaching a destination.

The RL agent's objective is to maximize the total reward it collects over time, also known as the cumulative reward. Yet, the agent won't know if it's taking the right actions until it gets (or fails to get) the reward, so it will surely make quite a few mistakes during the process.
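As a rough illustration of the cumulative reward, the sketch below adds up a sequence of rewards. It assumes a discount factor `gamma`, which is not discussed in this guide but is a common way to make later rewards count slightly less than immediate ones. The function name and the sample rewards are made up for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards, weighting a reward received t steps ahead by gamma**t."""
    total = 0.0
    # Work backwards so each step applies: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then the goal is reached.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 0.25
print(discounted_return(rewards, gamma=1.0))  # 1.0 (plain, undiscounted sum)
```

With `gamma=1.0` this is simply the sum of all rewards. Smaller values push the agent to prefer reaching the goal sooner.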

There's one more key concept in the agent's environment: the state. The state describes the agent's current situation, which is itself the result of previous actions, and it informs the agent's next action. One of the most difficult parts of Reinforcement Learning is teaching the system which actions led to the desired result. Known as credit assignment, this aspect is key for future actions to get the maximum reward.
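To make the interaction between agent, state, action, and reward more concrete, here is a minimal sketch of the loop. The `Corridor` environment, its `reset`/`step` methods, and the two policies are hypothetical, invented only for this illustration. Real RL libraries offer similar but richer interfaces.

```python
import random


class Corridor:
    """Hypothetical 1-D environment: cells 0..4, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0   # reward only when the goal is reached
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the agent act until it reaches the goal or runs out of steps."""
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                  # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


def always_right(state):
    return +1


def random_policy(state):
    return random.choice([-1, +1])


print(run_episode(Corridor(), always_right))  # (1.0, 4)
random.seed(0)
print(run_episode(Corridor(), random_policy))  # usually takes more steps
```

The untrained random policy typically wanders for a while before stumbling on the goal, which is exactly the trial-and-error phase described above.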

What are the Challenges of Reinforcement Learning?

The main challenges of RL revolve around "credit assignment." The goal is always to get the agent to take better and better actions. As a result, it must be able to complete tasks more quickly and maximize its cumulative reward over time. To do that, engineers must be able to filter out the wrong decisions the agent made on the way to its goal.

It's just as important to highlight the decisions that led the agent to achieve its goal, which is why this is often called the "credit assignment problem." Some methods use penalties in addition to rewards every time an undesired outcome occurs, since negative outcomes can result from bad actions taken by the agent. However, getting the agent to understand that only specific actions were negative is extremely complex.

How does that look in practice? Think of a game where a car is moving steadily toward a destination, but just before reaching it, the car takes a wrong turn and crashes. Credit assignment should preserve all of the good movements the car made and filter out only the ones that led to the crash, yet this process is not as automatic as you may think.

On the one hand, engineers can use value-based methods, which assign a value to every possible move the agent can make based on the risks and rewards related to that move. On the other hand, engineers can use policy-based methods, which shape the decision process directly and help the agent take actions based on risk-reward relations.

Main Reinforcement Learning Algorithms

RL Policy Iteration

Policy Iteration (PI) is a refinement algorithm that helps find the optimal policy for the agent. First, it assesses how good the current policy is by computing its value function, which indicates how much reward the agent can expect from each state (policy evaluation). Then, the algorithm updates the policy based on those values (policy improvement). The two steps repeat until the policy stops changing. It's all about refining the plan or rules the agent follows so it can get better results in every scenario (state).
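The evaluate-then-improve cycle can be sketched on a tiny problem. Everything below is an assumption made for illustration: a four-cell corridor whose last cell is the goal, a deterministic `step` model, a discount factor `GAMMA`, and a deliberately bad starting policy ("always go left").

```python
GAMMA = 0.9
STATES = [0, 1, 2, 3]   # state 3 is the terminal goal
ACTIONS = [-1, +1]      # left, right


def step(state, action):
    """Deterministic model of the toy corridor: (next_state, reward)."""
    nxt = min(max(state + action, 0), 3)
    return nxt, (1.0 if nxt == 3 else 0.0)


def evaluate(policy, tol=1e-8):
    """Policy evaluation: value of each state when following `policy`."""
    values = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES[:-1]:               # the terminal state keeps value 0
            nxt, reward = step(s, policy[s])
            new_value = reward + GAMMA * values[nxt]
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values


def improve(values):
    """Policy improvement: pick the action with the best one-step lookahead."""
    def lookahead(s, a):
        nxt, reward = step(s, a)
        return reward + GAMMA * values[nxt]
    return {s: max(ACTIONS, key=lambda a: lookahead(s, a)) for s in STATES[:-1]}


def policy_iteration():
    policy = {s: -1 for s in STATES[:-1]}   # start with "always left"
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:            # stable policy: stop
            return policy, values
        policy = new_policy


policy, values = policy_iteration()
print(policy)   # every non-terminal state ends up choosing +1 (right)
```

On this toy problem the policy is corrected one state at a time, starting from the cell next to the goal and working backwards.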

RL Value Iteration

Similar to Policy Iteration, Value Iteration (VI) aims to find the optimal policy. It uses principles of dynamic programming to maximize the cumulative reward, breaking a complex problem into smaller problems. Value Iteration also evaluates the current value function and improves it, yet each iteration can be cheaper than in Policy Iteration since it merges both tasks into a single update. VI is known for using the Bellman optimality equation to update the value function.
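A minimal sketch of that single combined update is shown below, again on an assumed toy problem: a four-cell corridor with the goal in the last cell, a deterministic `step` model, and a discount factor `GAMMA`. None of these details come from a specific library.

```python
GAMMA = 0.9
STATES = [0, 1, 2, 3]   # state 3 is the terminal goal
ACTIONS = [-1, +1]      # left, right


def step(state, action):
    """Deterministic model of the toy corridor: (next_state, reward)."""
    nxt = min(max(state + action, 0), 3)
    return nxt, (1.0 if nxt == 3 else 0.0)


def lookahead(values, state, action):
    """Reward now plus discounted value of where the action leads."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def value_iteration(tol=1e-8):
    values = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES[:-1]:
            # Bellman optimality backup: keep the best action's lookahead
            best = max(lookahead(values, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Read the policy off the converged values
    policy = {s: max(ACTIONS, key=lambda a: lookahead(values, s, a))
              for s in STATES[:-1]}
    return policy, values


policy, values = value_iteration()
print(policy)
print({s: round(v, 2) for s, v in values.items()})
```

Note that no explicit policy exists during the loop. It is only extracted at the end, once the values have stopped changing.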

RL Q-Learning

The Q-Learning algorithm guides the agent by focusing on Q-values (short for quality values), which represent the expected future reward of taking each action in a given state. Mathematically, the algorithm takes a state and an action as input and updates the corresponding Q-value based on the observed reward. Think of the actions as the decisions available to the agent in a given situation (state).

The goal is to get the agent to learn the best actions it can take to maximize the cumulative reward. That's why Q-learning is very useful when agents have to explore a large space with no prior knowledge. Picture a robot that has to learn optimal actions by interacting with an unfamiliar environment. If a model of the environment is already known and the state space is manageable, Policy or Value Iteration may be more suitable.
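Here is a minimal sketch of tabular Q-learning. The five-cell corridor, the epsilon-greedy exploration scheme, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions chosen for illustration. The agent only ever sees the results of `step`, never the rules behind it.

```python
import random

GOAL = 4
ACTIONS = [-1, +1]   # left, right


def step(state, action):
    """Toy corridor the agent interacts with: (next_state, reward, done)."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2,
          max_steps=1000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}

    def greedy(state):
        # Best known action, breaking ties at random
        best = max(q[(state, a)] for a in ACTIONS)
        return rng.choice([a for a in ACTIONS if q[(state, a)] == best])

    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            # Explore with probability epsilon, otherwise exploit
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(state)
            nxt, reward, done = step(state, action)
            # Target: observed reward plus discounted best next Q-value
            future = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q


q = train()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

After training, the greedy policy read from the table should point toward the goal in every cell, even though the agent was never told where the goal is.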

RL Deep Q-Network

Deep Q-Network (DQN) can be seen as an extension of Q-Learning that uses Neural Networks to approximate the Q-values. Standard Q-learning stores Q-values in a table with one entry for each action the agent can take in each state. You can picture a robot moving across cells in a grid, with the table holding one value per state-action pair.

Yet, this procedure may not be ideal if the space is too large and complex. So, a Deep Q-Network replaces that Q-table with a Neural Network so it can handle high-dimensional state spaces. Deep Q-Network is great for more advanced tasks, such as complex gaming environments, and a classic example would be the Atari 2600 game Breakout. A Q-table for this game would need an entry for every possible combination of paddle, ball, and brick positions, and that's simply impractical!
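A quick back-of-the-envelope calculation shows how impractical such a table gets. The sketch below assumes the agent observes raw screens rather than object positions, and it assumes a 160 x 210 pixel frame with 128 possible colors and 4 available actions. These figures are illustrative assumptions, and the result is only a loose upper bound, since most pixel combinations never appear in a real game.

```python
import math

# Assumed screen format and action count (not stated in this guide)
WIDTH, HEIGHT, COLORS = 160, 210, 128
NUM_ACTIONS = 4


def log10_table_entries(width, height, colors, actions):
    """log10 of (distinct possible screens * actions), computed in log space
    because the number itself is far too large to write out."""
    return width * height * math.log10(colors) + math.log10(actions)


digits = math.floor(log10_table_entries(WIDTH, HEIGHT, COLORS, NUM_ACTIONS)) + 1
print(f"Upper bound on Q-table entries: a number with {digits} digits")
```

A Neural Network sidesteps this by learning a compact function from screen to Q-values instead of storing one number per entry.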

RL Policy Gradients

All of the algorithms we've mentioned so far are value-based: they estimate a value function and then derive the policy from it. Policy Gradients (PG), however, directly optimize the policy without explicitly estimating value functions, which can simplify algorithm design.

Like Deep Q-Network, Policy Gradients use Neural Networks to handle high-dimensional spaces effectively. Hence, they're ideal for complex scenarios where the agent has many available options, which is why they work great in complex games like Chess and Go.
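To show the core idea without a Neural Network, the sketch below applies a basic policy gradient update (often called REINFORCE) to a hypothetical two-armed bandit. The payout probabilities, learning rate, and softmax parameterization are all assumptions made for the example. A real application would replace the two preference numbers with network weights.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    win_prob = [0.2, 0.8]   # hypothetical bandit: arm 1 pays off more often
    prefs = [0.0, 0.0]      # policy parameters, one preference per action
    for _ in range(steps):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < win_prob[action] else 0.0
        # For a softmax policy, d log pi(action) / d prefs[a]
        # equals (1 if a == action else 0) - probs[a]
        for a in range(len(prefs)):
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * reward * grad_log
    return softmax(prefs)


probs = train()
print([round(p, 3) for p in probs])   # most probability should sit on arm 1
```

Notice that no value table or value function appears anywhere: rewarded actions simply become more likely. This bare version has no baseline, so its updates are noisy, which is one reason practical methods add refinements on top.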

Why is Reinforcement Learning Important?

Reinforcement Learning is a powerful tool for optimizing decision-making for individuals and businesses alike. As mentioned, it’s used to build AI agents in popular fields like robotics and video games, yet its benefits go far beyond! 

RL has also gained popularity in Fintech, Healthcare, Autonomous Vehicles, and Smart Cities. Commonly cited success cases include the Atlas Robot from Boston Dynamics, OpenAI's robot hand that can solve a Rubik's Cube, IBM Watson, and Tesla's self-driving cars.


Reinforcement Learning is a trending and promising AI concept due to its many applications in modern fields, used for building state-of-the-art tools. Some of the largest companies in the world are already using it to deliver disruptive products, and we can't wait to see what new improvements it will bring us in the future.