ARSENIC drift report

Run 7efb5fe6-38d2-4013-a68f-2449f2f84122 · 2026-05-18T21:31:53.229194Z · Overall Amber

v1 (gpt-4o-mini) vs v2 (gpt-4.1-mini)

Executive summary

Probe outcomes and drift valence across the run.

Total probes: 10
Green: 8 · Amber: 2 · Red: 0
↓ Regressions: 2 · ↑ Improvements: 0 · → Neutral: 8
Auto-fix candidates: 0

Safe to upgrade: true · Manual review items: 2

Critical regressions: 2 · Policy drift: 0 · Fidelity drift: 0 · Structural drift: 0 · Content compression: 0

Migration profile

Upgrading to v2 is safe; 2 probes warrant review before switching.

⚡ 23% faster

Dimension overview

Worst risk per dimension, drift direction mix among affected probes (↑ / ↓ / →), and affected count.

Dimension           Worst risk   Drift (affected)   Probes
Morphology          Amber        ↓0↑0→1             1
Tone                Green        -                  0
Factual             Amber        ↓0↑0→5             5
Schema              Green        -                  0
Instruction         Green        -                  0
Refusal             Green        -                  0
Semantic            Amber        ↓2↑0→0             2
Claim               Red          ↓2↑0→0             2
Latency             Green        -                  0
Consistency         Green        -                  0
Custom assertions   Green        -                  0
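The worst-risk and probe-count columns above can be reproduced from the per-probe dimension snapshots later in this report. The following is a minimal sketch, assuming a severity ordering of Green < Amber < Red and that a probe counts as "affected" when its band for a dimension is not Green. The `snapshots` dict and the `summarize` helper are hypothetical: the data is transcribed by hand from the probe sections, not read from any ARSENIC export format.

```python
from collections import defaultdict

# Assumed severity ordering: Green < Amber < Red.
RANK = {"Green": 0, "Amber": 1, "Red": 2}

# Non-green bands transcribed by hand from the probe snapshots in this report.
# Dimensions not listed for a probe were Green or not reported.
snapshots = {
    "schema_array_of_objects": {"Semantic": "Amber", "Claim": "Red"},
    "schema_classification_output": {
        "Morphology": "Amber", "Semantic": "Amber", "Claim": "Red",
    },
    "schema_deeply_nested": {"Factual": "Amber"},
    "schema_enum_field": {"Factual": "Amber"},
    "schema_nested_object": {"Factual": "Amber"},
    "schema_nullable_field": {"Factual": "Amber"},
    "schema_numeric_types": {"Factual": "Amber"},
}

def summarize(snapshots):
    """Return {dimension: (worst_band, affected_probe_count)}."""
    worst = defaultdict(lambda: "Green")
    affected = defaultdict(int)
    for bands in snapshots.values():
        for dim, band in bands.items():
            if RANK[band] > RANK["Green"]:
                affected[dim] += 1
            if RANK[band] > RANK[worst[dim]]:
                worst[dim] = band
    return {dim: (worst[dim], affected[dim]) for dim in affected}

summary = summarize(snapshots)
for dim, (band, count) in sorted(summary.items()):
    print(f"{dim:<12}{band:<7}{count}")
```

Run on the transcribed data, this yields the four non-green rows of the table (Claim, Factual, Morphology, Semantic) with the same bands and counts.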

Red probes

No blocking regressions detected.

All probe results

Each probe lists its prompt, dimension snapshot (with drift arrows), claim intelligence, and responses. Consistency σ shows run variance when multi-run sampling was enabled.

schema_array_of_objects Amber σ v1 0 · v2 0 Instruction · Regression

Prompt

Return a JSON array containing exactly 3 objects. Each object must have: id (integer), label (string). No explanation, just the JSON array.

Dimension snapshot

Severity 🟠 High
Drift category ⛔ Critical regression (CriticalRegression)
Morphology Green Neutral · Δ tokens ≈ 17% · type Δ false
Tone Green Neutral · formality Δ -8%
Semantic Amber Regression · sim 0.5256409491254004
Claim Red Regression · preservation 0.5 (≥ 0.7 for green band) · dropped 3 · new 0
Latency Green Neutral · v1 1367 ms · v2 1144 ms · Δ -223 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Claim diff: 3 matched · 3 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "id": 1, ↔ v2: {"id": 1, "label": "Apple"},
  • sim 1 · anchors ok true · v1: "id": 2, ↔ v2: {"id": 2, "label": "Banana"},
  • sim 1 · anchors ok true · v1: "id": 3, ↔ v2: {"id": 3, "label": "Cherry"}

Dropped (v1 → missing in v2)

  • "label": "First Item" · anchors: Item
  • "label": "Second Item" · anchors: Item
  • "label": "Third Item" · anchors: Item
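The preservation scores in this report are consistent with matched / (matched + dropped): 3 / (3 + 3) = 0.5 for this probe and 1 / (1 + 2) ≈ 0.333 for schema_classification_output. The sketch below assumes that formula; the report does not state how ARSENIC actually computes the score, and the `preservation` helper is hypothetical. The 0.7 threshold is the green-band cutoff quoted in the snapshot.

```python
GREEN_THRESHOLD = 0.7  # from "≥ 0.7 for green band" in the dimension snapshot

def preservation(matched: int, dropped: int) -> float:
    """Fraction of v1 claims that still have a match in v2 (assumed formula)."""
    total = matched + dropped
    # With no v1 claims there is nothing to lose; treat that as fully preserved.
    return 1.0 if total == 0 else matched / total

# (probe, matched, dropped) taken from the "Claim diff" lines in this report.
for probe, matched, dropped in [
    ("schema_array_of_objects", 3, 3),
    ("schema_classification_output", 1, 2),
    ("schema_enum_field", 2, 0),
]:
    score = preservation(matched, dropped)
    band = "green" if score >= GREEN_THRESHOLD else "below green"
    print(f"{probe}: {score:.4f} ({band})")
```

Because the claims here are individual JSON lines, a change in an unconstrained placeholder value (such as the labels above) counts as a dropped claim even when the response still meets the prompt.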

v1 response

```json
[
    {
        "id": 1,
        "label": "First Item"
    },
    {
        "id": 2,
        "label": "Second Item"
    },
    {
        "id": 3,
        "label": "Third Item"
    }
]
```

v2 response

```json
[
  {"id": 1, "label": "Apple"},
  {"id": 2, "label": "Banana"},
  {"id": 3, "label": "Cherry"}
]
```
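Both responses above appear to satisfy the prompt's structural contract; the dropped claims are label values that the prompt left unconstrained. For manual review, a contract check along the following lines may be more informative than the claim diff. This is a sketch only: `strip_fence` and `satisfies_contract` are hypothetical helpers, and tolerating a surrounding markdown fence is an assumption, made because both models returned fenced JSON and the prompt does not forbid it.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker, built indirectly on purpose

def strip_fence(text: str) -> str:
    """Remove a surrounding markdown code fence, if present."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        lines = lines[1:-1]
    return "\n".join(lines)

def satisfies_contract(response: str) -> bool:
    """True if response is a JSON array of exactly 3 objects with int id and str label."""
    try:
        data = json.loads(strip_fence(response))
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list) or len(data) != 3:
        return False
    for obj in data:
        if not isinstance(obj, dict) or not {"id", "label"} <= set(obj):
            return False
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(obj["id"], bool) or not isinstance(obj["id"], int):
            return False
        if not isinstance(obj["label"], str):
            return False
    return True

v1 = (FENCE + 'json\n[{"id": 1, "label": "First Item"}, '
      '{"id": 2, "label": "Second Item"}, {"id": 3, "label": "Third Item"}]\n' + FENCE)
v2 = (FENCE + 'json\n[{"id": 1, "label": "Apple"}, '
      '{"id": 2, "label": "Banana"}, {"id": 3, "label": "Cherry"}]\n' + FENCE)

print(satisfies_contract(v1), satisfies_contract(v2))
```

If both calls return True, the regression flagged for this probe reflects a change in example content, not a break in the requested schema.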

schema_classification_output Amber σ v1 0 · v2 0 Instruction · Regression

Prompt

Classify the following text and return ONLY a JSON object with fields: category (string), confidence (float 0-1, must be a number not a word), reasoning (string). No markdown fences. Raw JSON only.

Text: 'The quarterly revenue exceeded expectations by 12 percent.'

Dimension snapshot

Severity 🟠 High
Drift category ⛔ Critical regression (CriticalRegression)
Morphology Amber Neutral · Δ tokens ≈ 5% · type Δ true
Tone Green Neutral · formality Δ 5%
Semantic Amber Regression · sim 0.38770385429172166
Claim Red Regression · preservation 0.3333333333333333 (≥ 0.7 for green band) · dropped 2 · new 0
Latency Green Neutral · v1 963 ms · v2 908 ms · Δ -55 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Claim diff: 1 matched · 2 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "reasoning": "The text discusses financial performance, specifically revenue and expectations, which are key elements in the business context." ↔ v2: {"category":"Finance","confidence":0.95,"reasoning":"The text discusses quarterly revenue and its performance relative to expectations, which is directly related to financial reporting and business performance metrics."}

Dropped (v1 → missing in v2)

  • "category": "Business",
  • "confidence": 0.95, · anchors: 0.95

v1 response

{
  "category": "Business",
  "confidence": 0.95,
  "reasoning": "The text discusses financial performance, specifically revenue and expectations, which are key elements in the business context."
}

v2 response

{"category":"Finance","confidence":0.95,"reasoning":"The text discusses quarterly revenue and its performance relative to expectations, which is directly related to financial reporting and business performance metrics."}
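Both responses are raw JSON with a numeric confidence, so the prompt's formatting rules appear to be met; what changed is the category ("Business" to "Finance") and the wording of the reasoning. A field-level comparison makes that easier to see than line-based claim matching. The sketch below is illustrative only: `check_classification` and `changed_fields` are hypothetical helpers, not part of ARSENIC.

```python
import json

def check_classification(raw: str) -> list:
    """Return a list of contract violations for the classification prompt."""
    problems = []
    if raw.lstrip().startswith("`" * 3):
        problems.append("wrapped in markdown fence")
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return problems + ["not valid raw JSON"]
    if not isinstance(obj, dict):
        return problems + ["top-level value is not an object"]
    if set(obj) != {"category", "confidence", "reasoning"}:
        problems.append("unexpected or missing fields")
    conf = obj.get("confidence")
    # bool is a subclass of int, so reject it before the numeric check.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        problems.append("confidence is not a number in [0, 1]")
    for key in ("category", "reasoning"):
        if not isinstance(obj.get(key), str):
            problems.append(f"{key} is not a string")
    return problems

def changed_fields(a: dict, b: dict) -> list:
    """Names of top-level fields whose values differ between two objects."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))

v1 = json.dumps({
    "category": "Business",
    "confidence": 0.95,
    "reasoning": "The text discusses financial performance, specifically revenue "
                 "and expectations, which are key elements in the business context.",
})
v2 = json.dumps({
    "category": "Finance",
    "confidence": 0.95,
    "reasoning": "The text discusses quarterly revenue and its performance relative "
                 "to expectations, which is directly related to financial reporting "
                 "and business performance metrics.",
})

print(check_classification(v1), check_classification(v2))
print(changed_fields(json.loads(v1), json.loads(v2)))
```

Whether "Business" versus "Finance" matters depends on whether downstream code treats category as a closed set; the prompt itself does not constrain it.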

schema_deeply_nested Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return this exact JSON with these exact values. No markdown code fences. No explanation. Raw JSON only:
{
  "request": {
    "meta": {
      "version": 1,
      "source": "api.example.com"
    },
    "payload": {
      "items": ["item1", "item2", "item3"]
    }
  }
}

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 1
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Regression · v1 1138 ms · v2 1413 ms · Δ 275 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Amber Neutral · regression false
Claim diff: 4 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "request": { ↔ v2: "request": {
  • sim 1 · anchors ok true · v1: "version": 1, ↔ v2: "version": 1,
  • sim 1 · anchors ok true · v1: "source": "api.example.com" ↔ v2: "source": "api.example.com"
  • sim 1 · anchors ok true · v1: "payload": { ↔ v2: "payload": {

v1 response

{
  "request": {
    "meta": {
      "version": 1,
      "source": "api.example.com"
    },
    "payload": {
      "items": ["item1", "item2", "item3"]
    }
  }
}

v2 response

{
  "request": {
    "meta": {
      "version": 1,
      "source": "api.example.com"
    },
    "payload": {
      "items": ["item1", "item2", "item3"]
    }
  }
}

schema_enum_field Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object with these fields: status (must be exactly the string "pending"), priority (integer, set it to 3). No explanation, just the JSON.

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 0.9999999883971046
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Regression · v1 857 ms · v2 1238 ms · Δ 381 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Amber Neutral · regression false
Claim diff: 2 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "status": "pending", ↔ v2: "status": "pending",
  • sim 1 · anchors ok true · v1: "priority": 3 ↔ v2: "priority": 3

v1 response

```json
{
  "status": "pending",
  "priority": 3
}
```

v2 response

```json
{
  "status": "pending",
  "priority": 3
}
```

schema_flat_required_fields Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object with exactly these fields: name (string), age (integer), active (boolean). No other fields. No explanation, just the JSON.

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 0.9999999734629076
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Regression · v1 913 ms · v2 1352 ms · Δ 439 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Claim diff: 2 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "name": "John Doe", ↔ v2: "name": "John Doe",
  • sim 1 · anchors ok true · v1: "age": 30, ↔ v2: "age": 30,

v1 response

```json
{
  "name": "John Doe",
  "age": 30,
  "active": true
}
```

v2 response

```json
{
  "name": "John Doe",
  "age": 30,
  "active": true
}
```

schema_nested_object Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object with this exact structure and these exact values. No explanation, just the JSON:
{
  "user": {
    "id": 1,
    "email": "user@example.com"
  },
  "status": "active"
}

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 0.999999993085143
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Neutral · v1 1141 ms · v2 1046 ms · Δ -95 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Amber Neutral · regression false
Claim diff: 2 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "id": 1, ↔ v2: "id": 1,
  • sim 1 · anchors ok true · v1: "email": "user@example.com" ↔ v2: "email": "user@example.com"

v1 response

```json
{
  "user": {
    "id": 1,
    "email": "user@example.com"
  },
  "status": "active"
}
```

v2 response

```json
{
  "user": {
    "id": 1,
    "email": "user@example.com"
  },
  "status": "active"
}
```

schema_no_markdown_wrapping Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object: {"ok": true, "code": 200}. Output the raw JSON only. Do not wrap it in markdown code fences. Do not add any text before or after.

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 1
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Improvement · v1 889 ms · v2 560 ms · Δ -329 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Green Neutral · regression false
Claim diff: 1 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: {"ok": true, "code": 200} ↔ v2: {"ok": true, "code": 200}

v1 response

{"ok": true, "code": 200}

v2 response

{"ok": true, "code": 200}

schema_nullable_field Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object with these fields: title (string), description (string or null), count (integer). Set title to "test", description to null, count to 0. No explanation, just the JSON.

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 0.9999999826690928
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Neutral · v1 965 ms · v2 1053 ms · Δ 88 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Amber Neutral · regression false
Claim diff: 2 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "description": null, ↔ v2: "description": null,
  • sim 1 · anchors ok true · v1: "count": 0 ↔ v2: "count": 0

v1 response

```json
{
  "title": "test",
  "description": null,
  "count": 0
}
```

v2 response

```json
{
  "title": "test",
  "description": null,
  "count": 0
}
```

schema_numeric_types Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object with: count (must be the integer 5), ratio (must be the float 0.75), label (must be the string "test"). Values must be exactly as specified. No explanation, just the JSON.

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 0.9999999826690928
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Improvement · v1 4220 ms · v2 886 ms · Δ -3334 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Amber Neutral · regression false
Claim diff: 2 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: "count": 5, ↔ v2: "count": 5,
  • sim 1 · anchors ok true · v1: "ratio": 0.75, ↔ v2: "ratio": 0.75,

v1 response

```json
{
  "count": 5,
  "ratio": 0.75,
  "label": "test"
}
```

v2 response

```json
{
  "count": 5,
  "ratio": 0.75,
  "label": "test"
}
```

schema_strict_no_extras Green σ v1 0 · v2 0 Instruction · Neutral

Prompt

Return a JSON object with ONLY these two fields: result (string), confidence (float between 0 and 1). Set result to the string "ok" and confidence to 0.9. Do not add any other fields. No markdown, no explanation. Raw JSON only.

Dimension snapshot

Severity ⚪ Informational
Drift category ✓ No significant drift (NoSignificantDrift)
Morphology Green Neutral · Δ tokens ≈ 0% · type Δ false
Tone Green Neutral · formality Δ 0%
Semantic Green Neutral · sim 0.9999999641035373
Claim Green Neutral · preservation 1 (≥ 0.7 for green band) · dropped 0 · new 0
Latency Green Improvement · v1 1077 ms · v2 803 ms · Δ -274 ms
Consistency Green Neutral · runs 3/3 · var 0 / 0
Refusal Green Neutral · new false · lifted false
Factual Green Neutral · regression false
Claim diff: 1 matched · 0 dropped · 0 new

Matched claims

  • sim 1 · anchors ok true · v1: {"result":"ok","confidence":0.9} ↔ v2: {"result":"ok","confidence":0.9}

v1 response

{"result":"ok","confidence":0.9}

v2 response

{"result":"ok","confidence":0.9}

Upgrade path

Structured rollout view. Validated prompt patches are listed once mutations pass validation.

Blocking 0 · Verify 2 · Neutral 0 · Validated 0

No blocking regressions detected.

Improvements to verify

Validated prompt patches

No automated patches generated — run with --mutate to attempt prompt repairs.

Latency impact

Observational only — latency does not affect overall risk, probe direction, or upgrade path routing.

Average baseline latency

1353 ms

Average target latency

1040 ms

Delta: -313 ms (-23.1%) · Direction: Improvement

v2 responded 23% faster on average across 10 probes
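The averages and delta above can be reproduced from the per-probe Latency rows. This is a minimal sketch: the `latencies` dict is transcribed by hand from this report, and the median comparison is an addition that does not appear in the original output.

```python
from statistics import mean, median

# (v1_ms, v2_ms) per probe, copied from the Latency rows in this report.
latencies = {
    "schema_array_of_objects": (1367, 1144),
    "schema_classification_output": (963, 908),
    "schema_deeply_nested": (1138, 1413),
    "schema_enum_field": (857, 1238),
    "schema_flat_required_fields": (913, 1352),
    "schema_nested_object": (1141, 1046),
    "schema_no_markdown_wrapping": (889, 560),
    "schema_nullable_field": (965, 1053),
    "schema_numeric_types": (4220, 886),
    "schema_strict_no_extras": (1077, 803),
}

v1_ms = [v1 for v1, _ in latencies.values()]
v2_ms = [v2 for _, v2 in latencies.values()]

baseline = mean(v1_ms)          # average baseline (v1) latency
target = mean(v2_ms)            # average target (v2) latency
delta = target - baseline       # negative means v2 is faster
pct = 100 * delta / baseline    # relative change versus baseline

print(f"baseline {baseline:.0f} ms, target {target:.0f} ms")
print(f"delta {delta:.0f} ms ({pct:.1f}%)")
print(f"median v1 {median(v1_ms)} ms, median v2 {median(v2_ms)} ms")
```

On these numbers the mean delta comes out at -313 ms (-23.1%), matching the report, while the medians are 1021 ms for v1 and 1049.5 ms for v2. Most of the mean improvement comes from a single probe, schema_numeric_types (4220 ms to 886 ms), so the 23% figure may overstate the typical speedup.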

Upgrade recommendation

Overall risk is acceptable for a controlled rollout. Review the amber items and any red-valence items marked "verify" before broad deployment.
