
AI Failure States & Designing for Evaluations

Differences between AI and non-AI system errors, and how to design for them

April 24th, 2024

🧠 Understanding AI Failure States and Designing for Better Evaluations




🛠️ Traditional (Non-AI) Systems are

- Predictable: Given the same input, they always produce the same output.

- Rule-based: Behavior is governed by explicitly coded logic.

- Transparent: Errors are usually due to code bugs or misconfigurations and are easier to trace and fix.
Example: "Error: Interest calculation failed due to missing rate parameter. Please contact support."

🤖 AI Systems are
- Probabilistic: Outputs can vary even with the same input.
- Data-driven: Behavior is shaped by training data.
- Opaque: Errors can be subtle, context-dependent, and harder to debug (e.g., hallucinations, bias).
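
To make the contrast concrete, here is a minimal Python sketch. Both functions are hypothetical illustrations, not code from any real system: `monthly_interest` stands in for a rule-based system that either returns the same result every time or fails with an explicit, traceable error, while `sample_label` mimics a model that draws its answer from a probability distribution (the probabilities are assumed values).

```python
import random


def monthly_interest(balance, annual_rate=None):
    """Rule-based: the same input always gives the same output, or the same explicit error."""
    if annual_rate is None:
        # The failure is loud and points at its own cause.
        raise ValueError("Interest calculation failed due to missing rate parameter.")
    return round(balance * annual_rate / 12, 2)


def sample_label(probabilities, rng):
    """Probabilistic: draws one label according to model-like probabilities."""
    labels = list(probabilities)
    weights = [probabilities[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(monthly_interest(1200, 0.05))  # identical on every run

    rng = random.Random()  # unseeded, so repeated runs can differ
    scores = {"fraud": 0.3, "legit": 0.7}  # assumed model output for one transaction
    print([sample_label(scores, rng) for _ in range(5)])
```

The first function can be debugged by reading its code. The second can return a different label for the same transaction, which is why it needs evaluation over many samples rather than a single check.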

Every AI system will eventually make bad predictions, so plan for it.

A Confusion Matrix helps visualize model performance with:
✅ True Positives – Correct positive predictions
❌ False Positives – Incorrectly flagged positives
✅ True Negatives – Correctly ignored negatives
❌ False Negatives – Missed positives
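
As a rough illustration, the four cells can be counted directly from paired labels. The sketch below assumes binary labels (1 = positive, 0 = negative). The sample data and the function names `confusion_counts` and `precision_recall` are made up for this example.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["tp"] += 1  # correct positive prediction
        elif predicted and not actual:
            counts["fp"] += 1  # incorrectly flagged positive
        elif not predicted and not actual:
            counts["tn"] += 1  # correctly ignored negative
        else:
            counts["fn"] += 1  # missed positive
    return {cell: counts[cell] for cell in ("tp", "fp", "tn", "fn")}


def precision_recall(c):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN). Returns 0.0 when undefined."""
    flagged = c["tp"] + c["fp"]
    positives = c["tp"] + c["fn"]
    precision = c["tp"] / flagged if flagged else 0.0
    recall = c["tp"] / positives if positives else 0.0
    return precision, recall


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 0, 1]  # illustrative ground truth
    y_pred = [1, 0, 0, 1, 1, 0, 0, 0]  # illustrative model predictions
    cells = confusion_counts(y_true, y_pred)
    print(cells)  # {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 2}
    print(precision_recall(cells))
```

Which cell matters most depends on the product: for fraud blocking, a false positive locks out a legitimate customer, while a false negative lets fraud through.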

⚠️ Three Common AI Error Scenarios
1. System Failure (Wrong Output)
False positives or false negatives caused by poor data, bias, or model hallucinations.
-> Example: "Unusual transaction. Your card is blocked. If it was you, please verify your identity."

2. System Limitations (No Output)
True negatives: the system has no answer to give because of untrained use cases or gaps in knowledge.
-> Example: "Sorry, we don't have enough information. Please try a different query!"

3. Contextual Errors (Misunderstood Output)
True positives that confuse users due to poor explanations or conflicts with user expectations.
-> Example: User logs in from a new device, gets locked out.
AI responds: "Your login attempt was flagged for suspicious activity."


🧪 The Importance of Designing for Evaluations
Designing for evaluations is fundamental to AI system development: it guides improvement, supports safety, and helps deliver systems that meet user needs.

1. LLM-Based Evals (LLM-as-a-Judge)
A separate language model serves as an automated judge to:
- Grade responses
- Provide reasoning
- Assign labels (helpful/harmful, correct/incorrect)
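
A minimal sketch of the pattern follows, assuming the judge is reachable through some function that takes a prompt and returns text. Everything here is hypothetical: `JUDGE_PROMPT`, `judge_response`, and `stub_judge` are illustrative names, and the stub stands in for a real model call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical grading prompt; a real one would be tuned for the task.
JUDGE_PROMPT = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Reply with JSON: {{"label": "correct" or "incorrect", "reasoning": "..."}}'
)


@dataclass
class Verdict:
    label: str       # "correct", "incorrect", or "unparseable"
    reasoning: str   # the judge's explanation, or the raw reply on failure


def judge_response(question: str, answer: str, call_judge: Callable[[str], str]) -> Verdict:
    """Ask the judge for a label plus reasoning, and parse its reply defensively."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        data = json.loads(raw)
        label = data["label"]
        reasoning = data.get("reasoning", "")
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # The judge is itself a model and can produce malformed output.
        return Verdict("unparseable", raw)
    if label not in ("correct", "incorrect"):
        return Verdict("unparseable", raw)
    return Verdict(label, str(reasoning))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge model: a fixed rule, used only to run the sketch."""
    ok = "Answer: Paris" in prompt
    return json.dumps({
        "label": "correct" if ok else "incorrect",
        "reasoning": "Stubbed rule, not a real model.",
    })


if __name__ == "__main__":
    print(judge_response("What is the capital of France?", "Paris", stub_judge))
    print(judge_response("What is the capital of France?", "Lyon", stub_judge))
```

Treating an unparseable reply as its own label keeps judge failures visible instead of silently counting them as incorrect answers.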

2. Human Feedback & Evaluation
Critical for real-world alignment through in-product feedback:
- Reaction buttons 👍👎
- Comment boxes 💬
- Clarification prompts 🧭
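
One possible way to turn these signals into something measurable is to log each reaction and aggregate it per response. The `Feedback` record and `summarize` function below are hypothetical, and the field names are assumptions rather than a real product schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    response_id: str    # which AI response the user reacted to
    thumbs_up: bool     # reaction button: True = thumbs up, False = thumbs down
    comment: str = ""   # optional free-text comment


def summarize(feedback):
    """Aggregate reactions per response: vote counts, approval rate, and comments."""
    buckets = defaultdict(lambda: {"up": 0, "down": 0, "comments": []})
    for item in feedback:
        bucket = buckets[item.response_id]
        bucket["up" if item.thumbs_up else "down"] += 1
        if item.comment.strip():
            bucket["comments"].append(item.comment.strip())
    # Every bucket holds at least one vote, so the division is safe.
    return {
        rid: {**b, "approval": b["up"] / (b["up"] + b["down"])}
        for rid, b in buckets.items()
    }


if __name__ == "__main__":
    events = [
        Feedback("r1", True),
        Feedback("r1", False, "Blocked my card with no explanation"),
        Feedback("r1", True),
        Feedback("r2", False),
    ]
    for rid, stats in summarize(events).items():
        print(rid, stats)
```

Responses with low approval and recurring comments are natural candidates for the labeled examples that the automated evals run against.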

3. Code-Based Evals
Automated, deterministic tests for assessing output correctness, particularly effective for structured tasks.
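
As a sketch of what such a test might look like, suppose the model is asked to return a JSON object with a `decision` and a `risk_score`. This output schema, the allowed values, and the function names are assumptions invented for the example.

```python
import json

ALLOWED_DECISIONS = {"allow", "block", "review"}  # assumed label set


def check_output(raw):
    """Return a list of failure messages; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        failures.append("decision must be one of: allow, block, review")
    score = data.get("risk_score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0 <= score <= 1:
        failures.append("risk_score must be a number in [0, 1]")
    return failures


def pass_rate(outputs):
    """Fraction of outputs that pass every check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(1 for raw in outputs if not check_output(raw)) / len(outputs)


if __name__ == "__main__":
    samples = [
        '{"decision": "block", "risk_score": 0.92}',
        '{"decision": "maybe", "risk_score": 2}',
        "Sorry, I cannot help with that.",
    ]
    for raw in samples:
        print(check_output(raw))
    print(pass_rate(samples))
```

Because the checks are deterministic, the same outputs always produce the same verdict, which makes this style cheap to run on every model or prompt change.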

Sources
- PM's complete guide to evals
lennysnewsletter[.]com/p/beyond-vibe-checks-a-pms-complete
- Designing for Errors
pair.withgoogle[.]com/guidebook/