AI failure states and designing for evaluations
April 24th, 2024

What is the difference between traditional non-AI systems and AI systems?
Traditional (Non-AI) Systems are
Predictable: Given the same input, they always produce the same output.
Rule-based: Behavior is governed by explicitly coded logic.
Transparent: Errors are usually due to code bugs or misconfigurations and are easier to trace and fix.
Example: "Error: Interest calculation failed due to missing rate parameter. Please contact support."
AI Systems are
Probabilistic: Outputs can vary even with the same input.
Data-driven: Behavior is shaped by training data.
Opaque: Errors can be subtle, context-dependent, and harder to debug (e.g., hallucinations, bias).
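To make the contrast concrete, here is a minimal sketch; the function names and canned replies are illustrative, not from any real system:

```python
import random

def rule_based_interest(principal: float, rate: float) -> float:
    # Traditional system: explicit logic, same input -> same output.
    return principal * rate

def sampled_reply(prompt: str) -> str:
    # Stand-in for an LLM: sampling means repeated calls with the
    # same prompt can return different outputs.
    candidates = [
        "Here's a summary of your statement...",
        "Sure! Your statement shows...",
        "Let me break down your statement...",
    ]
    return random.choice(candidates)

# Deterministic: this assertion always holds.
assert rule_based_interest(1000.0, 0.05) == rule_based_interest(1000.0, 0.05)

# Probabilistic: the reply may differ each time you run this.
print(sampled_reply("Summarize my bank statement"))
```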
Every AI system will eventually make bad predictions, so plan for it. How?
A Confusion Matrix helps visualize model performance with four outcomes (see the counting sketch after this list):
- True Positives – Correct positive predictions
- False Positives – Incorrectly flagged positives
- True Negatives – Correctly ignored negatives
- False Negatives – Missed positives
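A minimal counting sketch of these four outcomes in plain Python; the fraud-detection labels are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Example: fraud detection (1 = fraud). One missed fraud, one false alarm.
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```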
Three common AI error scenarios
1. System Failure (Wrong Output)
False positives or false negatives occur due to poor data, bias, or model hallucinations.
Example: "Unusual transaction. Your card is blocked. If it was you, please verify your identity."
2. System Limitations (No Output)
True negatives occur due to untrained use cases or gaps in the model's knowledge.
Example: "Sorry, we don't have enough information. Please try a different query!"
3. Contextual Errors (Misunderstood Output)
True positives that confuse users due to poor explanations or conflicts with user expectations.
Example: User logs in from a new device, gets locked out.
AI responds: "Your login attempt was flagged for suspicious activity."
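One way to plan for these failure states is to map each one to an explicit, user-friendly fallback with a recoverable next step, rather than surfacing a raw error. A minimal sketch, with illustrative state names and messages drawn from the examples above:

```python
from enum import Enum

class FailureState(Enum):
    WRONG_OUTPUT = "system_failure"      # false positive / false negative
    NO_OUTPUT = "system_limitation"      # query outside the model's knowledge
    MISUNDERSTOOD = "contextual_error"   # correct output, poorly explained

FALLBACK_MESSAGES = {
    FailureState.WRONG_OUTPUT: (
        "We flagged this transaction as unusual. If it was you, "
        "verify your identity to unblock your card."
    ),
    FailureState.NO_OUTPUT: (
        "Sorry, we don't have enough information to answer that. "
        "Try rephrasing your query."
    ),
    FailureState.MISUNDERSTOOD: (
        "We noticed a login from a new device, so we paused access "
        "as a precaution. Confirm it was you to continue."
    ),
}

def respond(state: FailureState) -> str:
    # Always give the user a recoverable next step, never a bare error.
    return FALLBACK_MESSAGES[state]

print(respond(FailureState.MISUNDERSTOOD))
```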
Designing for evaluations
Designing for evaluations is fundamental to AI system development: it guides improvement, ensures safety, and delivers systems that truly meet user needs.
There are three key evaluation methods to improve ML systems.
LLM-based evaluations (LLM-as-a-judge): A separate language model acts as an automated judge. It can grade responses, explain its reasoning, and assign labels like helpful/harmful or correct/incorrect.
E.g., Amazon Bedrock uses the LLM-as-a-Judge approach to evaluate AI model outputs. A separate trusted LLM, like Claude 3 or Amazon Titan, automatically reviews and rates responses based on helpfulness, accuracy, relevance, and safety. For instance, two AI-generated replies to the same prompt are compared, and the judge model selects the better one. This automation reduces evaluation costs by up to 98% and speeds up model selection without relying on slow, expensive human reviews.
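A minimal sketch of the LLM-as-a-judge pattern, assuming a hypothetical call_llm helper you would wire to your own trusted judge model (this is not the Bedrock API):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical helper: connect this to your trusted judge model's API.
    # Canned response here so the sketch runs end to end.
    return '{"winner": "A", "reason": "More accurate and complete."}'

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user prompt below on helpfulness, accuracy, relevance, and safety.
Reply as JSON: {{"winner": "A" or "B", "reason": "..."}}.

Prompt: {prompt}
Response A: {a}
Response B: {b}"""

def judge_pair(prompt: str, response_a: str, response_b: str) -> dict:
    # Ask the judge model to pick the better response and explain why.
    raw = call_llm(JUDGE_PROMPT.format(prompt=prompt, a=response_a, b=response_b))
    return json.loads(raw)

print(judge_pair("Explain APR", "APR is the annual rate...", "It's a fee."))
```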
Enable code-based evaluations: For structured tasks, use test suites or known outputs to validate model performance, especially for data processing, generation, or retrieval.
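A minimal sketch of such a test suite in pytest style; extract_invoice_total is a hypothetical name for the model-backed function under test, stubbed here with a regex so the file runs:

```python
# test_extraction.py -- run with pytest
import re

def extract_invoice_total(text: str) -> float:
    # Stand-in for a model-backed extractor; a real system would call
    # the model here. Regex stub so the golden tests below execute.
    match = re.search(r"\$([\d,]+\.\d{2})", text)
    return float(match.group(1).replace(",", ""))

def test_known_invoice_totals():
    # Golden examples with known correct outputs.
    cases = [
        ("Invoice #42: total due $1,250.00", 1250.00),
        ("Amount payable: $99.95 (incl. tax)", 99.95),
    ]
    for text, expected in cases:
        assert extract_invoice_total(text) == expected

def test_output_is_well_formed():
    # Structural check: type and range, independent of exact values.
    result = extract_invoice_total("Total: $10.00")
    assert isinstance(result, float) and result >= 0
```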
Capture human evaluation: Integrate real-time UI mechanisms for users to label outputs as helpful, harmful, incorrect, or unclear. Read more about it in pattern 19, Design to capture user feedback.
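A minimal sketch of the feedback record such a UI could capture; the field and label names are illustrative:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Labels matching the categories above.
FEEDBACK_LABELS = {"helpful", "harmful", "incorrect", "unclear"}

@dataclass
class FeedbackEvent:
    response_id: str   # which model output the user is rating
    label: str         # one of FEEDBACK_LABELS
    comment: str = ""  # optional free-text detail
    timestamp: str = ""

def record_feedback(event: FeedbackEvent, store: list) -> None:
    # Validate the label, stamp the event, and persist it.
    if event.label not in FEEDBACK_LABELS:
        raise ValueError(f"unknown label: {event.label}")
    event.timestamp = datetime.now(timezone.utc).isoformat()
    store.append(asdict(event))  # swap in your real datastore here

log: list = []
record_feedback(FeedbackEvent("resp-123", "helpful"), log)
print(log)
```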
A hybrid approach of LLM-as-a-judge and human evaluation drastically boosts accuracy, to 99%.
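One possible shape for that hybrid: let the LLM judge handle verdicts it is confident about and escalate the rest to human reviewers. The confidence threshold and function signature here are assumptions, not a prescribed design:

```python
def hybrid_evaluate(item, judge_fn, human_queue, threshold: float = 0.8):
    """Use the LLM judge when it is confident; escalate the rest to humans.
    judge_fn returns (label, confidence); threshold is an assumed tunable."""
    label, confidence = judge_fn(item)
    if confidence >= threshold:
        return label, "llm_judge"
    human_queue.append(item)  # reviewed later by a person
    return None, "pending_human_review"

queue = []
label, source = hybrid_evaluate(
    {"prompt": "q", "answer": "a"},
    judge_fn=lambda item: ("helpful", 0.65),  # stub judge: low confidence
    human_queue=queue,
)
print(label, source, len(queue))  # None pending_human_review 1
```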