
AI failure states and designing for evaluations

April 24th, 2024



What is the difference between traditional non-AI systems and AI systems?


๐—ง๐—ฟ๐—ฎ๐—ฑ๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ฎ๐—น (๐—ก๐—ผ๐—ป-๐—”๐—œ) ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ๐˜€ ๐—ฎ๐—ฟ๐—ฒ

  • Predictable: Given the same input, they always produce the same output.

  • Rule-based: Behavior is governed by explicitly coded logic.

  • Transparent: Errors are usually due to code bugs or misconfigurations and are easier to trace and fix.


Example: "Error: Interest calculation failed due to missing rate parameter. Please contact support."


AI Systems are

  • Probabilistic: Outputs can vary even with the same input.

  • Data-driven: Behavior is shaped by training data.

  • Opaque: Errors can be subtle, context-dependent, and harder to debug (e.g., hallucinations, bias).

  • Every AI system will eventually make a bad prediction, so plan for it. How?
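The contrast can be made concrete with a toy sketch (the functions and candidate replies below are hypothetical, not from any real system): a rule-based function is fully reproducible, while a sampled reply can vary run to run for the same input.

```python
import random

def rule_based_interest(principal, rate):
    """Traditional system: the same input always yields the same output."""
    if rate is None:
        raise ValueError("Interest calculation failed due to missing rate parameter.")
    return principal * rate

def probabilistic_reply(prompt, temperature=0.8, seed=None):
    """Toy stand-in for an AI system: output can vary even with the same input."""
    rng = random.Random(seed)
    candidates = [
        "Your balance looks fine.",
        "We noticed unusual activity.",
        "Please verify your identity.",
    ]
    # Higher temperature -> more likely to pick a non-default candidate.
    if rng.random() < temperature:
        return rng.choice(candidates)
    return candidates[0]

# The rule-based path is reproducible; the sampled path only is when seeded.
assert rule_based_interest(1000, 0.05) == rule_based_interest(1000, 0.05)
```

Pinning the seed is how such behavior is made testable, which is exactly why unseeded production AI systems need the evaluation strategies discussed below.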


A ๐—–๐—ผ๐—ป๐—ณ๐˜‚๐˜€๐—ถ๐—ผ๐—ป ๐— ๐—ฎ๐˜๐—ฟ๐—ถ๐˜… helps visualize model performance with

✅ True Positives – Correct positive predictions

❌ False Positives – Incorrectly flagged positives

✅ True Negatives – Correctly ignored negatives

❌ False Negatives – Missed positives


Three common AI error scenarios

1. System Failure (Wrong Output)

False positives or false negatives occur due to poor data, bias, or model hallucinations.

Example: "Unusual transaction. Your card is blocked. If it was you, please verify your identity."


2. System Limitations (No Output)

True negatives occur due to untrained use cases or gaps in the model's knowledge.

Example: "Sorry, we don't have enough information. Please try a different query!"


3. Contextual Errors (Misunderstood Output)

True positives that confuse users due to poor explanations or conflicts with user expectations.

Example: User logs in from a new device, gets locked out.

AI responds: "Your login attempt was flagged for suspicious activity."



Designing for evaluations

Designing for evaluations is fundamental to AI system developmentโ€”it guides improvement, ensures safety, and delivers systems that truly meet user needs.


There are three key evaluation methods to improve ML systems.

  1. LLM-based evaluations (LLM-as-a-judge): A separate language model acts as an automated judge. It can grade responses, explain its reasoning, and assign labels like helpful/harmful or correct/incorrect.


    E.g., Amazon Bedrock uses the LLM-as-a-Judge approach to evaluate AI model outputs. A separate trusted LLM, like Claude 3 or Amazon Titan, automatically reviews and rates responses based on helpfulness, accuracy, relevance, and safety. For instance, two AI-generated replies to the same prompt are compared, and the judge model selects the better one. This automation reduces evaluation costs by up to 98% and speeds up model selection without relying on slow, expensive human reviews.
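A pairwise judge can be sketched in a few lines. The prompt format, the `call_model` client, and the always-prefers-B stub below are all hypothetical; a real deployment would wire `call_model` to an actual judge LLM and parse its verdict more defensively.

```python
def build_judge_prompt(question, response_a, response_b):
    """Assemble a pairwise-comparison prompt for a judge model.
    (Hypothetical format; real services define their own templates.)"""
    return (
        "You are an impartial judge. Compare the two responses below for "
        "helpfulness, accuracy, relevance, and safety.\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Answer with exactly 'A' or 'B'."
    )

def judge(call_model, question, response_a, response_b):
    """call_model is whatever client invokes the judge LLM and returns text."""
    verdict = call_model(build_judge_prompt(question, response_a, response_b)).strip()
    return response_a if verdict == "A" else response_b

# With a stubbed judge that always answers "B":
winner = judge(lambda prompt: "B", "What is 2+2?", "5", "4")  # -> "4"
```

Keeping prompt construction separate from the model call makes the judge itself testable with stubs, as shown.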

  2. Enable code-based evaluations: For structured tasks, use test suites or known outputs to validate model performance, especially for data processing, generation, or retrieval.
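A code-based evaluation is just a golden set of inputs with expected outputs, scored automatically. The invoice-extraction task and cases below are invented for illustration; in practice `extract_total` would wrap a model call.

```python
def extract_total(invoice_text):
    """Structured task under test: pull the total amount from invoice text.
    (Simple stand-in for a model-backed extractor.)"""
    for line in invoice_text.splitlines():
        if line.lower().startswith("total:"):
            return float(line.split(":")[1].strip().lstrip("$"))
    return None

# Golden set: known inputs paired with expected outputs.
golden_cases = [
    ("Item: pen\nTotal: $4.50", 4.50),
    ("Total: 12", 12.0),
    ("No total here", None),
]

def run_eval(fn, cases):
    """Return accuracy of fn over the golden set."""
    passed = sum(1 for text, expected in cases if fn(text) == expected)
    return passed / len(cases)
```

Running `run_eval(extract_total, golden_cases)` in CI turns every model change into a pass/fail signal instead of a manual spot check.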

  3. Capture human evaluation: Integrate real-time UI mechanisms for users to label outputs as helpful, harmful, incorrect, or unclear. Read more about it in pattern 19, Design to capture user feedback.
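The backend for such a feedback UI can start as small as a validated counter per output and label. This in-memory sketch (class and label names are assumptions, not from the article) would be replaced by a database in production.

```python
from dataclasses import dataclass, field
from collections import Counter

# The four labels named in the article's human-evaluation pattern.
LABELS = {"helpful", "harmful", "incorrect", "unclear"}

@dataclass
class FeedbackStore:
    """In-memory sketch; a real system would persist to a database."""
    counts: Counter = field(default_factory=Counter)

    def record(self, output_id: str, label: str):
        """Store one user label for one model output, rejecting unknown labels."""
        if label not in LABELS:
            raise ValueError(f"Unknown label: {label}")
        self.counts[(output_id, label)] += 1

store = FeedbackStore()
store.record("resp-123", "helpful")
store.record("resp-123", "helpful")
store.record("resp-123", "incorrect")
```

Aggregated per-output counts like these are what feed the hybrid human-plus-judge loop described next.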

A hybrid approach combining LLM-as-a-judge with human evaluation can boost accuracy further, to as high as 99%.

