RL — A Gentle Introduction

Ritik Jain
8 min read · Jul 18, 2023



Reinforcement Learning (RL) is a fascinating area of study that focuses on teaching an intelligent agent to learn and make decisions through trial and error. It is often called the science of decision-making. With applications ranging from game playing to robotics and beyond, RL holds immense potential for solving complex problems and has already achieved some remarkable results across many fields.

In this article, we’ll take a gentle dive into the world of RL, demystifying its key concepts and shedding light on how it works.

Reinforcement Learning vs Supervised Learning

Reinforcement learning is different from supervised learning. In supervised learning, an external supervisor provides a training set of labeled examples: each example describes a situation together with a label, the correct action the system should take in that situation. Often the task is to identify the category to which a new situation belongs. Supervised learning suffices in many settings, but in interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act.

Reinforcement Learning vs Unsupervised Learning

Reinforcement learning is also different from unsupervised learning, which is typically about finding hidden structure in collections of unlabeled examples. It is tempting to think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, but RL tries to maximize a reward signal rather than to find hidden structure. Unsupervised learning does not address the problem of maximizing a reward signal.

Characteristics of Reinforcement Learning

  • No supervisor, only a reward signal
  • Sequential decision making
  • Time plays a crucial role in reinforcement learning problems
  • Feedback is often delayed, not instantaneous
  • The agent’s actions determine the subsequent data it receives

Components of Reinforcement Learning

  • Agent: The entity that performs actions in an environment in order to earn rewards.
  • Environment (e): The world with which the agent interacts.
  • Reward (R): An immediate or delayed return given to the agent when it performs a specific action or task in the environment.
  • State (s): The state is the information used to determine the next action.
  • Policy (π): The strategy the agent applies to decide the next action based on the current state. The agent learns the policy from experience with the environment. Policies can be deterministic or stochastic.
  • Value (V): The expected long-term (discounted) return, as opposed to the immediate reward.
  • Value Function: A prediction of expected future rewards, used to evaluate how good or bad states are and hence to select the best possible state.
  • Environment Model: A model that mimics the behavior of the environment. It lets the agent make inferences about how the environment will behave.
  • Model-based methods: Methods for solving reinforcement learning problems that use such a model of the environment for planning.
  • Q-value or action value (Q): Q-value is very similar to value; the only difference is that it takes the current action as an additional parameter.
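The value-related components above can be made concrete with a small sketch. The snippet below (an illustration, not code from any particular library) computes the discounted return that value and Q-value functions estimate in expectation:

def discounted_return(rewards, gamma=0.99):
    # Compute G = r1 + gamma*r2 + gamma^2*r3 + ... for a sequence of rewards.
    g = 0.0
    # Work backwards so each earlier reward ends up discounted one step less.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three steps of reward collected by an agent.
print(discounted_return([1.0, 0.0, 2.0]))   # 1.0 + 0.99*0.0 + 0.99**2 * 2.0 = 2.9602

The state value V(s) is the expected value of this return when starting from state s and following the policy π, while the action value Q(s, a) is the same expectation with the first action fixed to a.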

Before looking at how the reinforcement learning engine works, let's take a closer look at the different kinds of state present in reinforcement learning.

States in Reinforcement Learning

The state is a crucial component of reinforcement learning. In RL, the terms state and observation are often used interchangeably. The state provides the information from the environment that is used to determine the next action.

Formally, the state is a function of the history. There are three types of state to distinguish:

  1. Environment State
    The environment state is the environment’s private representation, i.e. the internal data the environment uses to produce the next observation and reward.
    Usually, this data is not visible to the agent, and even if it were, most of it would be irrelevant to the agent.
  2. Agent State
    The agent state is the agent’s internal representation. It provides the information to the agent to pick the next best action. It is the data used by the algorithm.
  3. Information State
    The information state (a.k.a. Markov state) contains useful information from history.
    By definition, a state is Markov if the probability of the next state depends only on the current state and not on the rest of the history. Once the current state is known, the history may be thrown away.

The future is independent of the past given the present.
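In standard notation (the formula below is the usual statement of the Markov property, added here for completeness), a state S_t is Markov if and only if

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]

that is, the current state captures all the information from the history that matters for predicting what happens next.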

How RL works

Let's walk step by step through how the RL engine works; a minimal code sketch of the interaction loop follows the steps below.

Typical RL Scenario
  1. Initialization
    - Set up the environment, defining the states, actions, and reward structure.
    - Initialize the policy, value function, and other necessary components.
  2. Interactive Loop
    - The RL engine enters a loop of interactions between the agent and the environment.
    - The agent observes the current state of the environment and selects an action using its policy based on the observed state.
    - The selected action is applied to the environment. The environment transitions to a new state based on its dynamics and the action taken.
    - The agent receives a reward from the environment, indicating the desirability of the action taken or the new state reached.
  3. Learning and Update
    - The agent uses the observed state, action, reward, and new state to update its knowledge and improve its decision-making.
    - Depending on the RL algorithm, the agent may update its value function, policy, or other components.
    - The update process is guided by the goal of maximizing cumulative rewards over time.
  4. Exploration and Exploitation
    - The agent decides whether to explore new actions or exploit its current knowledge.
    - Exploration involves selecting actions to gather information and discover potentially better policies.
    - Exploitation focuses on selecting actions that the agent believes will maximize rewards based on its current knowledge.
  5. Termination Condition
    - The interaction loop continues until a termination condition is met.
    - The termination condition can be a fixed number of time steps, reaching a specific state, or achieving a performance threshold.
  6. Convergence and Optimal Policy
    - The agent aims to converge to an optimal policy through repeated interactions and updates.
    - The optimal policy maximizes the expected cumulative reward over time.
    - The RL algorithm works towards refining the agent’s policy to approach this optimal policy.
  7. Evaluation and Improvement
    - Once convergence is reached, the agent’s learned policy can be evaluated against desired performance metrics.
    - The performance evaluation helps identify areas for improvement or refinement of the RL algorithm or agent’s components.
    - The RL process can be repeated with modifications or enhancements to further enhance the agent’s decision-making capabilities.
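The loop described in steps 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration that assumes the Gymnasium library and its CartPole-v1 environment are available; the random action stands in for a learned policy, and the update step is left as a placeholder comment because it depends on the algorithm you choose:

import gymnasium as gym

env = gym.make("CartPole-v1")               # 1. Initialization: set up the environment
obs, info = env.reset(seed=0)

for t in range(200):                        # 2. Interactive loop
    action = env.action_space.sample()      #    pick an action (random policy as a stand-in)
    next_obs, reward, terminated, truncated, info = env.step(action)
    # 3. Learning and update would happen here, using (obs, action, reward, next_obs)
    obs = next_obs
    if terminated or truncated:             # 5. Termination condition for the episode
        obs, info = env.reset()

env.close()

Replacing the random action with a learned policy (step 4, the exploration/exploitation trade-off) and filling in the update step is exactly where the algorithms in the next section differ.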

RL Algorithms

RL algorithms fall into two major categories:

  1. Model-free Algorithms
  2. Model-based Algorithms

Model-Free Algorithms

Model-free algorithms are a class of reinforcement learning (RL) algorithms that do not require explicit knowledge or assumptions about the underlying dynamics of the environment. Instead, they learn directly from experience by interacting with the environment and observing the resulting rewards. They focus on estimating value functions or optimizing policies without relying on a model of the environment. They are further divided into three categories.

  1. Value-based Methods
    Value-based methods aim to estimate the value of different states or state-action pairs. They focus on learning the optimal value function, which represents the expected cumulative reward from a particular state or state-action pair. Key value-based algorithms include Q-learning, SARSA, and Deep Q-Networks (DQN); a minimal Q-learning update is sketched right after this list.
  2. Policy-Based Methods
    Policy-based methods directly optimize the agent’s policy, which is a mapping from states to actions. Instead of estimating value functions, these algorithms aim to find the policy that maximizes the expected cumulative reward. Popular policy-based algorithms include REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO).
  3. Actor-Critic Methods
    Actor-critic methods combine value-based and policy-based approaches by using both an actor (policy) and a critic (value function). The actor selects actions based on the current policy, while the critic evaluates the policy’s performance by estimating the value function. Examples of actor-critic algorithms include Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Asynchronous Advantage Actor-Critic (A3C).
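To make the value-based family concrete, here is a minimal sketch of tabular Q-learning (the hyperparameter values are illustrative, and the state and action types are whatever your environment provides):

import random
from collections import defaultdict

Q = defaultdict(float)   # Q-table; unseen (state, action) pairs default to 0.0

def q_learning_step(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    # Move Q(s, a) toward the one-step target r + gamma * max_a' Q(s', a').
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Exploration vs. exploitation: random action with probability epsilon, otherwise greedy.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

Plugging these two functions into the interaction loop shown earlier gives a complete, if basic, model-free learner.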

Model-Based Algorithms

Model-based algorithms in RL involve building a model of the environment. These algorithms aim to learn the dynamics of the environment, including transition probabilities and reward structures. The learned model is then used for planning, exploration, and policy optimization.
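To illustrate the model learning plus planning idea, here is a minimal sketch in the spirit of Dyna-Q (mentioned below); the deterministic dictionary model and the fixed number of planning steps are simplifying assumptions made for illustration:

import random

model = {}   # learned model: (state, action) -> (reward, next_state)

def learn_model(state, action, reward, next_state):
    # Record each observed transition; a deterministic environment is assumed for simplicity.
    model[(state, action)] = (reward, next_state)

def planning(Q, actions, q_update, n_steps=10):
    # Replay simulated transitions from the learned model to refine Q
    # without any extra interaction with the real environment.
    for _ in range(n_steps):
        state, action = random.choice(list(model.keys()))
        reward, next_state = model[(state, action)]
        q_update(Q, state, action, reward, next_state, actions)

Here q_update could be the q_learning_step function from the earlier sketch; after every real step the agent records the transition with learn_model and then runs a few planning steps, which is what makes model-based methods so sample-efficient.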

  1. Model-Based Reinforcement Learning
    Model-based RL algorithms learn a model of the environment and use it to plan and make decisions. They build an internal representation of the environmental dynamics and use this model to simulate possible future states and rewards. Examples of model-based algorithms include Monte Carlo Tree Search (MCTS), Model Predictive Control (MPC), and Dyna-Q.
  2. Model-Based Policy Optimization
    Model-based policy optimization algorithms combine model learning with policy optimization. They learn a model of the environment and use it to generate synthetic experiences for policy updates. Examples include Model-Based Policy Optimization (MBPO) and Model-Ensemble Trust Region Policy Optimization (ME-TRPO).

Model-based algorithms have the advantage of being sample-efficient, as they can use the learned model to plan and make decisions without requiring extensive interaction with the environment. However, they rely on accurate models and can be sensitive to model errors. In contrast, model-free algorithms are more robust but require more exploration and may be less sample-efficient. The choice between model-free and model-based approaches depends on the specific problem and available resources.

RL Applications

RL has demonstrated its effectiveness in a wide range of applications, including:

  1. Robotics: Training robots to perform complex tasks and navigate dynamic environments.
  2. Game Playing: Achieving superhuman performance in games like Go, chess, and Atari games.
  3. Healthcare: Optimizing treatment plans, personalized medicine, and resource allocation.
  4. Finance: Portfolio management, algorithmic trading, and risk assessment.
  5. Supply Chain Management: Optimizing inventory management and logistics.
  6. Autonomous Systems: Training self-driving cars, drones, and industrial automation.

Conclusion

Reinforcement Learning offers a powerful framework for building intelligent systems that learn from experience. By understanding the key components of RL, its working principles, and its wide-ranging applications, professionals can explore how RL can be leveraged to solve complex problems and drive innovation in their respective domains. As RL continues to advance, it is poised to revolutionize industries and open up new possibilities for intelligent decision-making.
