How prompt evaluation stopped a pipeline from killing the wrong ads
A worked example across three prompt versions: a baseline that missed edge cases, a structured version that got most cases right but recommended killing a fatigued ad instead of refreshing it, and a few-shot version that fixed that mistake. The eval framework caught the failure before it reached production.
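To make the comparison concrete, here is a minimal sketch of what such an eval could look like, not the actual framework from this write-up. Everything in it is an assumption made for illustration: the case names, the `ctr_drop` and `lifetime_roas` fields, the thresholds, and the three stub functions, which stand in for real model calls made with each prompt version.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    ad: Dict[str, float]  # hypothetical metrics, not taken from a real pipeline
    expected: str         # "keep", "refresh", or "kill"


# Hypothetical test cases; the fatigued ad is the edge case that matters here.
CASES: List[EvalCase] = [
    EvalCase("healthy_ad", {"ctr_drop": 0.05, "lifetime_roas": 3.0}, "keep"),
    EvalCase("fatigued_ad", {"ctr_drop": 0.60, "lifetime_roas": 3.2}, "refresh"),
    EvalCase("unprofitable_ad", {"ctr_drop": 0.50, "lifetime_roas": 0.4}, "kill"),
]


# Stubs standing in for model calls made with each prompt version.
def baseline(ad: Dict[str, float]) -> str:
    # Ignores the metrics entirely, so it misses every edge case.
    return "keep"


def structured(ad: Dict[str, float]) -> str:
    # Treats any large CTR drop as a reason to kill the ad.
    return "kill" if ad["ctr_drop"] > 0.4 else "keep"


def few_shot(ad: Dict[str, float]) -> str:
    # Separates fatigue (still profitable) from a genuinely bad ad.
    if ad["ctr_drop"] > 0.4:
        return "refresh" if ad["lifetime_roas"] >= 1.0 else "kill"
    return "keep"


VERSIONS: Dict[str, Callable[[Dict[str, float]], str]] = {
    "baseline": baseline,
    "structured": structured,
    "few_shot": few_shot,
}


def run_eval(predict: Callable[[Dict[str, float]], str],
             cases: List[EvalCase]) -> List[str]:
    """Return the names of the cases where the prediction is wrong."""
    return [c.name for c in cases if predict(c.ad) != c.expected]


def evaluate_all() -> Dict[str, List[str]]:
    """Run every prompt version against every case."""
    return {name: run_eval(fn, CASES) for name, fn in VERSIONS.items()}


if __name__ == "__main__":
    for version, failures in evaluate_all().items():
        print(version, "failures:", failures)
```

The design choice worth noting is that the eval reports which cases failed rather than only a pass rate. A single wrong recommendation on a fatigued ad is easy to lose inside an aggregate score, but it stands out when failures are listed by name.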