AI failure states and designing for evaluations
April 24th, 2024

What is the difference between traditional non-AI systems and AI systems?
Traditional (Non-AI) Systems are
Predictable: Given the same input, they always produce the same output.
Rule-based: Behavior is governed by explicitly coded logic.
Transparent: Errors are usually due to code bugs or misconfigurations and are easier to trace and fix.
Example: "Error: Interest calculation failed due to missing rate parameter. Please contact support."
AI Systems are
Probabilistic: Outputs can vary even with the same input.
Data-driven: Behavior is shaped by training data.
Opaque: Errors can be subtle, context-dependent, and harder to debug (e.g., hallucinations, bias).
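To make the contrast concrete, here is a minimal sketch; the function names and canned replies are illustrative, not from any real system:

```python
import random

def rule_based_interest(principal: float, rate: float) -> float:
    # Traditional system: explicit logic, same input -> same output.
    return principal * rate

def sampled_reply(prompt: str) -> str:
    # Stand-in for an LLM: sampling means repeated calls with the
    # same prompt can return different outputs.
    candidates = [
        "Here's a summary of your statement...",
        "Sure! Your statement shows...",
        "Let me break down your statement...",
    ]
    return random.choice(candidates)

# Deterministic: this assertion always holds.
assert rule_based_interest(1000.0, 0.05) == rule_based_interest(1000.0, 0.05)

# Probabilistic: the reply may differ each time you run this.
print(sampled_reply("Summarize my bank statement"))
```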
Every AI system will eventually make bad predictions, so plan for it. How?
A Confusion Matrix helps visualize model performance with four outcomes (see the counting sketch after this list):
- True Positives – Correct positive predictions
- False Positives – Incorrectly flagged positives
- True Negatives – Correctly ignored negatives
- False Negatives – Missed positives
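A minimal counting sketch of these four outcomes in plain Python; the fraud-detection labels are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Example: fraud detection (1 = fraud). One missed fraud, one false alarm.
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```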
Three common AI error scenarios
1. System Failure (Wrong Output)
False positives or false negatives occur due to poor data, bias, or model hallucinations.
Example: "Unusual transaction. Your card is blocked. If it was you, please verify your identity."
2. System Limitations (No Output)
True negatives occur due to untrained use cases or gaps in the model's knowledge.
Example: "Sorry, we don't have enough information. Please try a different query!"
3. Contextual Errors (Misunderstood Output)
True positives that confuse users due to poor explanations or conflicts with user expectations.
Example: User logs in from a new device, gets locked out.
AI responds: "Your login attempt was flagged for suspicious activity."
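One way to plan for these failure states is to map each one to an explicit, user-friendly fallback with a recoverable next step, rather than surfacing a raw error. A minimal sketch, with illustrative state names and messages drawn from the examples above:

```python
from enum import Enum

class FailureState(Enum):
    WRONG_OUTPUT = "system_failure"      # false positive / false negative
    NO_OUTPUT = "system_limitation"      # query outside the model's knowledge
    MISUNDERSTOOD = "contextual_error"   # correct output, poorly explained

FALLBACK_MESSAGES = {
    FailureState.WRONG_OUTPUT: (
        "We flagged this transaction as unusual. If it was you, "
        "verify your identity to unblock your card."
    ),
    FailureState.NO_OUTPUT: (
        "Sorry, we don't have enough information to answer that. "
        "Try rephrasing your query."
    ),
    FailureState.MISUNDERSTOOD: (
        "We noticed a login from a new device, so we paused access "
        "as a precaution. Confirm it was you to continue."
    ),
}

def respond(state: FailureState) -> str:
    # Always give the user a recoverable next step, never a bare error.
    return FALLBACK_MESSAGES[state]

print(respond(FailureState.MISUNDERSTOOD))
```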
Designing for evaluations
Designing for evaluations is fundamental to AI system development: it guides improvement, ensures safety, and delivers systems that truly meet user needs.
There are three key evaluation methods to improve ML systems.
LLM-based evaluations (LLM-as-a-judge): A separate language model acts as an automated judge. It can grade responses, explain its reasoning, and assign labels like helpful/harmful or correct/incorrect.
E.g., Amazon Bedrock uses the LLM-as-a-Judge approach to evaluate AI model outputs. A separate trusted LLM, like Claude 3 or Amazon Titan, automatically reviews and rates responses based on helpfulness, accuracy, relevance, and safety. For instance, two AI-generated replies to the same prompt are compared, and the judge model selects the better one. This automation reduces evaluation costs by up to 98% and speeds up model selection without relying on slow, expensive human reviews.
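A minimal sketch of the LLM-as-a-judge pattern, assuming a hypothetical call_llm helper you would wire to your own trusted judge model (this is not the Bedrock API):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical helper: connect this to your trusted judge model's API.
    # Canned response here so the sketch runs end to end.
    return '{"winner": "A", "reason": "More accurate and complete."}'

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user prompt below on helpfulness, accuracy, relevance, and safety.
Reply as JSON: {{"winner": "A" or "B", "reason": "..."}}.

Prompt: {prompt}
Response A: {a}
Response B: {b}"""

def judge_pair(prompt: str, response_a: str, response_b: str) -> dict:
    # Ask the judge model to pick the better response and explain why.
    raw = call_llm(JUDGE_PROMPT.format(prompt=prompt, a=response_a, b=response_b))
    return json.loads(raw)

print(judge_pair("Explain APR", "APR is the annual rate...", "It's a fee."))
```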
Enable code-based evaluations: For structured tasks, use test suites or known outputs to validate model performance, especially for data processing, generation, or retrieval.
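A minimal sketch of such a test suite in pytest style; extract_invoice_total is a hypothetical name for the model-backed function under test, stubbed here with a regex so the file runs:

```python
# test_extraction.py -- run with pytest
import re

def extract_invoice_total(text: str) -> float:
    # Stand-in for a model-backed extractor; a real system would call
    # the model here. Regex stub so the golden tests below execute.
    match = re.search(r"\$([\d,]+\.\d{2})", text)
    return float(match.group(1).replace(",", ""))

def test_known_invoice_totals():
    # Golden examples with known correct outputs.
    cases = [
        ("Invoice #42: total due $1,250.00", 1250.00),
        ("Amount payable: $99.95 (incl. tax)", 99.95),
    ]
    for text, expected in cases:
        assert extract_invoice_total(text) == expected

def test_output_is_well_formed():
    # Structural check: type and range, independent of exact values.
    result = extract_invoice_total("Total: $10.00")
    assert isinstance(result, float) and result >= 0
```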
Capture human evaluation: Integrate real-time UI mechanisms for users to label outputs as helpful, harmful, incorrect, or unclear. Read more about it in pattern 19, Design to capture user feedback.
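A minimal sketch of the feedback record such a UI could capture; the field and label names are illustrative:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Labels matching the categories above.
FEEDBACK_LABELS = {"helpful", "harmful", "incorrect", "unclear"}

@dataclass
class FeedbackEvent:
    response_id: str   # which model output the user is rating
    label: str         # one of FEEDBACK_LABELS
    comment: str = ""  # optional free-text detail
    timestamp: str = ""

def record_feedback(event: FeedbackEvent, store: list) -> None:
    # Validate the label, stamp the event, and persist it.
    if event.label not in FEEDBACK_LABELS:
        raise ValueError(f"unknown label: {event.label}")
    event.timestamp = datetime.now(timezone.utc).isoformat()
    store.append(asdict(event))  # swap in your real datastore here

log: list = []
record_feedback(FeedbackEvent("resp-123", "helpful"), log)
print(log)
```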
A hybrid approach of LLM-as-a-judge and human evaluation drastically boosts accuracy, to 99%.
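One possible shape for that hybrid: let the LLM judge handle verdicts it is confident about and escalate the rest to human reviewers. The confidence threshold and function signature here are assumptions, not a prescribed design:

```python
def hybrid_evaluate(item, judge_fn, human_queue, threshold: float = 0.8):
    """Use the LLM judge when it is confident; escalate the rest to humans.
    judge_fn returns (label, confidence); threshold is an assumed tunable."""
    label, confidence = judge_fn(item)
    if confidence >= threshold:
        return label, "llm_judge"
    human_queue.append(item)  # reviewed later by a person
    return None, "pending_human_review"

queue = []
label, source = hybrid_evaluate(
    {"prompt": "q", "answer": "a"},
    judge_fn=lambda item: ("helpful", 0.65),  # stub judge: low confidence
    human_queue=queue,
)
print(label, source, len(queue))  # None pending_human_review 1
```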