Temporal Difference (TD) learning is a class of model-free reinforcement learning methods that sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods. Unlike Monte Carlo methods, which adjust their estimates only once the final outcome is known, TD methods adjust predictions before the final outcome is available, moving earlier predictions toward later, more accurate predictions (bootstrapping).
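In the standard tabular TD(0) prediction setting, for example, the value estimate of the state just visited is nudged toward the reward actually received plus the discounted estimate of the next state, with step size \(\alpha\) and discount factor \(\gamma\):

\[ V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr] \]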
Consider a simple example of TD learning in a Markov decision process (MDP). An agent takes actions in an environment to maximize a reward signal, maintaining an estimate of the expected future reward for its current state and action. Each time the agent acts, it receives a reward and observes the next state, and it updates its estimate in proportion to the TD error: the difference between its current estimate and a better target formed from the observed reward plus the discounted estimate for the following step. The agent then uses the updated estimates to make better decisions in the future.
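The sketch below illustrates this loop with a one-step TD (SARSA-style) update on a tiny corridor MDP. The environment, the epsilon-greedy action choice, and all parameter values (ALPHA, GAMMA, EPSILON) are illustrative assumptions, not details taken from the example above.

```python
import random

# Hypothetical 1-D corridor: states 0..4, state 4 is terminal with reward +1.
N_STATES = 5
ACTIONS = [-1, +1]      # move left or right
ALPHA = 0.1             # step size
GAMMA = 0.9             # discount factor
EPSILON = 0.1           # exploration rate

# Estimate of expected future reward for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def choose_action(state):
    """Epsilon-greedy action selection from the current estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):
    state = 0
    action = choose_action(state)
    done = False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = choose_action(next_state)
        # TD error: observed reward plus discounted estimate for the next
        # step, minus the current estimate for the pair just visited.
        target = reward + (0.0 if done else GAMMA * Q[(next_state, next_action)])
        td_error = target - Q[(state, action)]
        # Move the estimate a small step toward the better target.
        Q[(state, action)] += ALPHA * td_error
        state, action = next_state, next_action

print(Q)
```

After enough episodes the estimates for the "move right" action grow toward the discounted value of the terminal reward, so the greedy policy walks straight down the corridor.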