You cannot trust AI responses and here’s why.
AI output is much less organic than people believe. Since the underlying models are perfectly willing to praise Hitler or tell you how to commit suicide, something had to be done to fix them. One of the most popular methods is called reinforcement learning from human feedback (RLHF).
RLHF retrains the underlying model using human feedback, including hand-crafted human responses. For example, if a given query results in a harmful response that praises Hitler, a human will hand-craft a corrected response that is diverse, safe, and harm-free. This query-response pair is then used in a process that updates the internal connections (the weights) of the model until it generates something like the desired output. OpenAI, Google DeepMind (for Gemini), and many other companies use this technique. It's hard to overstate the scale of this correction effort.
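To make the "updates the internal connections" step concrete, here is a minimal toy sketch in Python. It is not any company's actual pipeline. It assumes a made-up "model" that is just one adjustable score per candidate response to a single query, and it only illustrates the supervised part of the process, where training nudges the model toward the human-written response. The names `CANDIDATES`, `fine_tune`, and `most_likely`, the starting scores, and the learning rate are all hypothetical.

```python
import math

# Toy "model": one adjustable score (logit) per candidate response to one query.
# Real models have billions of weights; this only illustrates the update loop.
CANDIDATES = ["harmful reply", "evasive reply", "human-written safe reply"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tune(logits, target_index, learning_rate=0.5, steps=50):
    """Gradient descent on cross-entropy loss toward the target response."""
    logits = list(logits)  # copy so the caller's starting scores are untouched
    for _ in range(steps):
        probs = softmax(logits)
        for i in range(len(logits)):
            # d(-log probs[target]) / d(logit i) = probs[i] - 1 if i is the target, else probs[i]
            grad = probs[i] - (1.0 if i == target_index else 0.0)
            logits[i] -= learning_rate * grad
    return logits


def most_likely(logits):
    """Return the candidate response with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return CANDIDATES[best]


# Hypothetical starting point: the untuned model prefers the harmful reply.
initial_logits = [2.0, 0.5, 0.0]
target = CANDIDATES.index("human-written safe reply")
tuned_logits = fine_tune(initial_logits, target)

print("before:", most_likely(initial_logits))
print("after: ", most_likely(tuned_logits))
```

The point of the sketch is that nothing "organic" happens here: a person picks the target, and the loop keeps adjusting the scores until the model's top answer matches it.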
Companies like dataannotation.tech and Amazon have had hundreds of thousands of people working for over a decade on RLHF and other human-assisted training and evaluation methods. They work on safety as described above, but also write or label test data such as short fiction stories, computer programming problems, comment sentiment labels, medical content, image tags, self-driving data, sportsball video tags, role-play, and many other subjects (examples).