A heuristic algorithm for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. At each step, it draws a random sample from the current posterior belief about each action's reward and selects the action that maximizes the expected reward under that sampled belief.
For instance, in an online advertising platform, Thompson sampling can be used to determine which ads to show to users. The algorithm maintains a posterior distribution over each ad's expected reward (for example, its click-through rate) and updates that distribution with the reward observed each time an ad is shown. Because an ad is chosen according to a random draw from these distributions, ads with uncertain estimates still get shown occasionally (exploration), while ads with consistently high observed rewards are shown most often (exploitation).
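
The advertising example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: rewards are assumed to be binary (click or no click), each ad starts with a uniform Beta(1, 1) prior, and the user responses are simulated from made-up rates. The function name `thompson_sampling` and the parameter `true_rates` are hypothetical names chosen for this sketch.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling.

    true_rates: hypothetical click probabilities, used only to simulate users.
    Returns per-ad counts of observed clicks and non-clicks.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms  # observed clicks per ad
    failures = [0] * n_arms   # observed non-clicks per ad

    for _ in range(n_rounds):
        # Draw one plausible click rate per ad from its Beta posterior.
        # Beta(successes + 1, failures + 1) corresponds to a uniform prior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Act greedily with respect to the sampled beliefs.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Simulated user response (assumption: Bernoulli reward).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures


if __name__ == "__main__":
    wins, losses = thompson_sampling([0.1, 0.5, 0.9], n_rounds=2000)
    pulls = [w + l for w, l in zip(wins, losses)]
    print("times each ad was shown:", pulls)
```

Early on, the posteriors are wide, so every ad has a real chance of producing the largest sample and being shown. As evidence accumulates, the posteriors narrow and the ad with the highest estimated rate is selected in most rounds.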