The practice of systematically challenging and testing large language models (LLMs) to uncover vulnerabilities that could lead to undesirable behaviors, adapted from cybersecurity red teams.
A researcher creates a prompt that may cause a language model to generate hate speech and tests it to see how the model responds. If the model generates offensive content, the researcher can identify potential vulnerabilities and take steps to address them.