AI Failure States & Designing for Evaluations
The difference between AI and non-AI system errors, and how to design for errors
April 24th, 2024
🧠 Understanding AI Failure States and Designing for Better Evaluations

🛠️ Traditional (Non-AI) Systems are
- Predictable: Given the same input, they always produce the same output.
- Rule-based: Behavior is governed by explicitly coded logic.
- Transparent: Errors are usually due to code bugs or misconfigurations and are easier to trace and fix.
Example: "Error: Interest calculation failed due to missing rate parameter. Please contact support."
🤖 AI Systems are
- Probabilistic: Outputs can vary even with the same input.
- Data-driven: Behavior is shaped by training data.
- Opaque: Errors can be subtle, context-dependent, and harder to debug (e.g., hallucinations, bias).
Every AI system will eventually make bad predictions. Plan for it.
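To make the contrast concrete, here is a minimal Python sketch. Both functions are hypothetical: `monthly_interest` stands in for a rule-based system that fails loudly, and `toy_fraud_flag` stands in for a model by sampling its answer from a made-up score. It is an illustration, not how a real model works internally.

```python
import random

def monthly_interest(balance, annual_rate=None):
    """Rule-based: same input, same output, and failures are explicit."""
    if annual_rate is None:
        raise ValueError("Interest calculation failed due to missing rate parameter.")
    return balance * annual_rate / 12

def toy_fraud_flag(amount, rng):
    """Probabilistic stand-in for a model: the flag is sampled, not derived from a rule."""
    p_fraud = min(amount / 10_000, 0.95)  # made-up score, for illustration only
    return rng.random() < p_fraud

# Deterministic: repeated calls with the same input agree.
assert monthly_interest(1200.0, 0.06) == monthly_interest(1200.0, 0.06)

# Probabilistic: repeated calls with the same input can disagree.
rng = random.Random(0)
flags = [toy_fraud_flag(5_000, rng) for _ in range(20)]
print(sum(flags), "of", len(flags), "identical requests were flagged")
```

The rule-based failure is easy to trace to one missing parameter. The sampled flag has no single line of code to blame, which is why the rest of this post focuses on measuring and designing around errors.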
A Confusion Matrix helps visualize model performance with:
- True Positives: Correct positive predictions
- False Positives: Incorrectly flagged positives
- True Negatives: Correctly ignored negatives
- False Negatives: Missed positives
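The four cells can be counted in a few lines. This is a minimal sketch with made-up labels. Precision and recall are not discussed above; they are included only as two common summaries derived from the same counts.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (True = positive)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif not predicted and not actual:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

# Hypothetical data: was each transaction really fraud, and did the model flag it?
actual    = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]

c = confusion_counts(actual, predicted)
precision = c["TP"] / (c["TP"] + c["FP"])  # of flagged items, how many were right
recall = c["TP"] / (c["TP"] + c["FN"])     # of real positives, how many were caught
print(dict(c), f"precision={precision:.2f}", f"recall={recall:.2f}")
```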
⚠️ Three Common AI Error Scenarios
1. System Failure (Wrong Output)
False positives or false negatives occur due to poor data, biases, or model hallucinations.
-> Example: "Unusual transaction. Your card is blocked. If it was you, please verify your identity."
2. System Limitations (No Output)
True negatives occur when the system meets a use case it was not trained for or a gap in its knowledge.
-> Example: "Sorry, we don't have enough information. Please try a different query!"
3. Contextual Errors (Misunderstood Output)
True positives that confuse users because of poor explanations or conflicts with user expectations.
-> Example: User logs in from a new device, gets locked out.
AI responds: "Your login attempt was flagged for suspicious activity."
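One way to design for all three scenarios is to pick the user-facing message from the model's confidence and to say why a flag was raised. The sketch below is an assumption-heavy illustration: the `risk_score` input, the `low` and `high` thresholds, the `reasons` list, and the message wording are all hypothetical.

```python
def login_message(risk_score, reasons, low=0.3, high=0.8):
    """Map a hypothetical risk score in [0, 1] to a user-facing response.

    `low`, `high`, and `reasons` are illustrative assumptions, not tuned values.
    """
    if risk_score is None:
        # System limitation: no usable prediction, so say so and offer a next step.
        return "We couldn't verify this sign-in automatically. Please confirm with a one-time code."
    if risk_score >= high:
        # Even a correct flag needs an explanation and a recovery path.
        why = ", ".join(reasons) if reasons else "unusual activity"
        return f"We paused this sign-in because of: {why}. If it was you, verify your identity to continue."
    if risk_score >= low:
        # Uncertain zone: ask instead of blocking, which limits the cost of a false positive.
        return "Is this you? Confirm this sign-in to continue."
    return "Signed in."

print(login_message(0.9, ["new device"]))
```

Compared with "Your login attempt was flagged for suspicious activity", the high-risk message names the trigger and tells the user what to do next.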
🧪 The Importance of Designing for Evaluations
Designing for evaluations is fundamental to AI system development: it guides improvement, ensures safety, and delivers systems that truly meet user needs.
1. LLM-Based Evals (LLM-as-a-Judge)
A separate language model serves as an automated judge to:
- Grade responses
- Provide reasoning
- Assign labels (helpful/harmful, correct/incorrect)
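A rough sketch of the judge loop, assuming the judge is reachable through some function that takes a prompt string and returns text. `call_judge_model`, `fake_judge`, and the prompt wording are hypothetical placeholders, not any vendor's API.

```python
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Reply with JSON: {{"label": "correct" or "incorrect", "reasoning": "<one sentence>"}}"""

def judge_response(question, answer, call_judge_model):
    """Grade one answer. `call_judge_model` is any function str -> str (hypothetical)."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Judges fail too: keep unparseable verdicts visible instead of guessing.
        return {"label": "unparseable", "reasoning": raw}
    if not isinstance(verdict, dict) or verdict.get("label") not in {"correct", "incorrect"}:
        return {"label": "unparseable", "reasoning": raw}
    return {"label": verdict["label"], "reasoning": verdict.get("reasoning", "")}

def fake_judge(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"label": "correct", "reasoning": "The answer matches the question."}'

print(judge_response("What is 2 + 2?", "4", fake_judge))
```

Asking for the reasoning alongside the label makes the judge's own mistakes easier to spot when a human audits a sample of verdicts.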
2. Human Feedback & Evaluation
Critical for real-world alignment, collected through in-product feedback:
- Reaction buttons 👍👎
- Comment boxes 💬
- Clarification prompts 🧭
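In-product signals only help if they are logged and aggregated. Here is a minimal sketch, assuming each reaction is stored as a small record. The `Feedback` fields and the feature names are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Feedback:
    response_id: str
    feature: str       # which AI feature produced the response
    thumbs_up: bool
    comment: str = ""  # optional free text from the comment box

def approval_by_feature(events):
    """Return {feature: share of thumbs-up} from a list of Feedback events."""
    ups, totals = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e.feature] += 1
        ups[e.feature] += int(e.thumbs_up)
    return {f: ups[f] / totals[f] for f in totals}

events = [
    Feedback("r1", "fraud_alert", True),
    Feedback("r2", "fraud_alert", False, "It was me, I was travelling."),
    Feedback("r3", "search", True),
]
print(approval_by_feature(events))
```

Comments attached to thumbs-down events are a natural place to look for false positives and contextual errors.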
3. Code-Based Evals
Automated, deterministic tests for assessing output correctness, particularly effective for structured tasks.
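For a structured task, a code-based eval can be as simple as checking that the output parses and carries the expected fields. A minimal sketch, assuming the model is asked to return JSON. The `SCHEMA` contract and its field names are hypothetical.

```python
import json

def eval_structured_output(raw_output, required_fields):
    """Return a list of failures; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    for name, expected_type in required_fields.items():
        if name not in data:
            failures.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            failures.append(f"wrong type for {name}")
    return failures

SCHEMA = {"decision": str, "risk_score": float}  # hypothetical output contract

print(eval_structured_output('{"decision": "block", "risk_score": 0.91}', SCHEMA))
print(eval_structured_output('{"decision": "block"}', SCHEMA))
```

Because the check is deterministic, it can run on every model or prompt change like any other unit test.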
Sources
- PM's complete guide to evals
lennysnewsletter[.]com/p/beyond-vibe-checks-a-pms-complete
- Designing for Errors
pair.withgoogle[.]com/guidebook/