Scanning Engines

AllyProof uses three engines in parallel to maximize detection coverage. The primary engine (axe-core) has a zero false-positive policy. Combined coverage: ~70% of automatable WCAG issues as deterministic violations, plus a dedicated advisory track for WCAG criteria that can only be partially automated and need human review.

Engine Pipeline

Four-layer accessibility scan architecture diagram.
Four engines run per page — deterministic findings drive the score; advisory, preview, and AI layers surface separately.

Accessibility Analysis Pipeline: a layered scan architecture that produces deterministic, advisory, preview, and AI-assisted outputs. Engine 1, axe-core, produces deterministic violations with zero false positives — this is the primary engine. Engine 2, HTML_CodeSniffer, produces additional violations from errors and warnings. Results are deduped by element fingerprint combined with WCAG criterion, and a static technique-overlap map drops techniques like H37, G18, and H44 where axe-core already flagged the same defect. Engine 2b, HTMLCS notices, are advisory manual-review items with is_advisory set to true. They're surfaced with a Manual review label and are never counted in the score, VPAT, certificates, or public reports. Engine 3, APCA, is a WCAG 3.0 Preview contrast check — informational only, shown alongside WCAG 2.x but never used to gate compliance. Engine 4, AI analysis, runs post-scan, is tier-gated to Agency and Enterprise, and fires asynchronously on Layer 4. It targets criteria that need human judgement: WCAG 1.4.1 colour-only, 3.3.2 format instructions, and 3.2.3 cross-page navigation consistency. It writes advisory items with rule_id prefixed ai-, each confidence-labeled. Deterministic findings from Engines 1 and 2 feed the core score. Advisory and preview layers are surfaced separately.

Read pipeline as structured text
  1. Engine 1axe-coreDeterministic

    Deterministic violations with a zero-false-positive guarantee.

  2. Engine 2HTML_CodeSnifferDeterministic

    Errors and warnings surface additional violations. Results are deduped by element fingerprint × WCAG criterion, and a static technique-overlap map drops techniques like H37, G18, and H44 where axe-core already flagged the same defect.

  3. Engine 2bHTMLCS noticesAdvisory

    Items with is_advisory = true, surfaced with a “Manual review” label on the issue list. Never counted in score, VPAT, certificates, or public reports.

  4. Engine 3APCAInformational

    WCAG 3.0 Preview contrast. Shown alongside WCAG 2.x, never used to gate compliance.

  5. Engine 4AI analysis (Layer 4)Advisory

    Post-scan, tier-gated (Agency and Enterprise). Targeted LLM checks on WCAG 1.4.1 (colour alone), 3.3.2 (format instructions), and 3.2.3 (cross-page nav consistency). Writes advisory items with rule_id = ai-* and a confidence label.

Why four tracks, not one combined list

  • Deterministic violations (axe + HTMLCS errors/warnings) feed scores, VPATs, certificates, and public shared reports. Zero false positives is the guarantee — anything surfaced here can be defended in a compliance conversation.
  • Advisory items (HTMLCS notices) are manual-review prompts for WCAG criteria neither engine can fully automate — 1.4.1 (color alone), 3.3.2 (form instructions), 2.1.1 / 2.4.7 (focus behavior on modals, carousels), 1.2.x (media alternatives). These used to be discarded; keeping them recovers most of the manual-review gap at no runtime cost.
  • APCA is an informational preview of the WCAG 3.0 contrast algorithm — shown alongside WCAG 2.x results, never used to gate compliance.
  • AI analysis (Layer 4) runs three targeted LLM checks on criteria where judgment matters more than rule matching. Agency and Enterprise tiers only. Findings are advisory, confidence-labeled, and never affect scores or legal reports.

Engine 1: axe-core (Primary)

PropertyValue
SourceDeque Systems (open source)
Rules91 active + 5 experimental
StandardsWCAG 2.0/2.1/2.2 A+AA, best-practice
False positivesZero (strict policy)
Coverage~57% of automatable WCAG issues

axe-core is the industry standard. It has a zero false-positive policy, meaning it stays silent rather than risk reporting something that isn't a real violation. This makes it the trusted baseline for all AllyProof scans.

Enabled experimental rules

  • css-orientation-lock — WCAG 1.3.4 Orientation
  • label-content-name-mismatch — WCAG 2.5.3 Label in Name
  • p-as-heading — WCAG 1.3.1 Info and Relationships
  • table-fake-caption — WCAG 1.3.1 Info and Relationships
  • td-has-header — WCAG 1.3.1 Info and Relationships

Engine 2: HTML_CodeSniffer (Secondary)

PropertyValue
SourceSquiz Labs (open source)
Rules~200 rules
StandardsWCAG 2.1 A, AA, AAA
IntegrationBrowser script injection
False positivesLow (errors), Medium (warnings)

HTML_CodeSniffer uses different detection algorithms than axe-core, catching issues axe's strict zero-false-positive policy causes it to skip. Rule IDs are prefixed with htmlcs- for source identification.

How deduplication works

Dedup happens in two layers so we never double-count, and never silently drop a legitimate finding when the two engines disagree about the detail but agree about the element.

Layer 1 — per-(element, criterion) fingerprint. Each element's outerHTML is normalized (lowercase, collapse whitespace, strip comments, truncate) into a stable fingerprint. The dedup key is {fingerprint}|{wcag_criterion}. An HTMLCS finding is dropped only if axe reported the same element violating the same success criterion. If axe flags button-name on a button and HTMLCS flags contrast on the same button, both are kept — they are different issues.

Layer 2 — static technique-overlap map. A hard-coded set of HTMLCS technique codes (H37, G18, H44, F77, …) lists the cases where axe has a dedicated rule that covers the entire surface. Those HTMLCS findings are dropped regardless of fingerprint — a safety net against formatting differences. Conservative by design: when HTMLCS catches edge cases axe doesn't, the code stays out of the overlap list.

Message types → buckets

HTMLCS typeBucketFlagCounted?
Error (1)Deterministic violationimpact=seriousYes
Warning (2)Deterministic violationimpact=moderateYes
Notice (3)Advisory / manual reviewis_advisory=trueNo — surfaced inline with a Manual review pill

Advisory items (HTMLCS notices)

Notices used to be discarded as “too verbose.” That threw away exactly the signal needed for criteria neither engine can fully automate. Notices now persist as advisory items with is_advisory=true:

PropertyAdvisory behavior
Counted in violation totalsNo
Affects accessibility scoreNo
Appears in VPAT / certificates / public reportsNo
Feeds AI fix suggestionsNo
Surfaced in UIYes — inline on the issue list with a Manual review pill in place of the severity pill. Detail page opens to an amber callout explaining what manual review means.
Persisted across scansYes, with bucket-segregated resolve tracking
Counted separatelyYes — advisory_count on scan jobs and page scans

Notices whose technique is in the axe-overlap map are dropped (no point in “please manually review alt text” when axe already checked every image). Everything else is kept.

Engine 3: APCA Contrast (Preview)

PropertyValue
SourceMyndex / W3C WCAG 3.0 draft
TypePerceptual contrast calculator
StandardWCAG 3.0 (draft)
False positivesNone (mathematical)

WCAG 3.0 replaces the WCAG 2.x contrast ratio formula with APCA (Advanced Perceptual Contrast Algorithm). APCA accounts for font size, weight, and perceptual uniformity, producing more accurate readability predictions.

APCA Lc thresholds

Lc valueUse case
90+Preferred for body text
75Minimum for body text (16px regular)
60Minimum for large/bold text
45Minimum for non-text UI elements
30Absolute minimum for any text

Engine 4: AI Analysis (Layer 4 — Agency + Enterprise)

PropertyValue
RunsAsync after the deterministic scan completes
Tier gateAgency and Enterprise only
ModelPer-tier selection by superadmin (Anthropic / Google / OpenAI)
Cost ceiling~40K tokens per scan, hard-capped
OutputAdvisory items (is_advisory=true, rule_id prefix ai-)
Counts in scores / VPATs / certificatesNo

After Layers 1–3, roughly 30% of WCAG AA criteria are still inaccessible to static analysis. Most of those need judgment— "is the meaning of that red dot conveyed elsewhere?", "does this label tell the user to enter the date as MM/DD?", "is the nav the same on every page?" An LLM can reason over DOM + context in a way no rule engine can.

Layer 4 is deliberately narrow. It runs three targeted checks where the AI has the highest signal-to-noise ratio, not every possible WCAG criterion:

CheckCriterionWhat the AI decides
Color-only indicator1.4.1 Use of ColorIs color the sole cue, or is text / icon / shape also conveying meaning?
Missing format instructions3.3.2 Labels or InstructionsDoes a format-requiring input (date, phone, code) communicate the expected format?
Nav consistency3.2.3 Consistent NavigationDoes the primary nav match across pages? (Only AI can do cross-page comparison.)

How it works

  1. Snippet capture during the browser pass for each page: up to 3 navs, 20 form inputs (with labels + 300 chars of surrounding context + pattern/describedby flags), and 15 status-colored elements (filtered by computed color hue). Each region is size-capped and stored on page_scans.ai_snippets.
  2. Tier gate + model lookup.The orchestrator checks the org's plan against aiScanAnalysis and reads the superadmin-configured model for the scan-analysis workload.
  3. Page sampling. Up to 3 pages per scan — homepage always included when present, plus the pages with the most interesting snippet mass. Deterministic so reruns are cache-friendly.
  4. Batched LLM calls. Each check sends compact JSON payloads with a conservative system prompt. Findings are parsed defensively — malformed output is dropped rather than crashing the whole check.
  5. Upsert as advisory. Valid findings map into violations with is_advisory=true, rule_id prefix ai-, and a confidence label. They appear inline on the issue list with the same Manual review pill as HTMLCS notices; the detail page opens to an amber callout explaining AI-sourced findings.

Conservative-output policy

  • Color-only: must cite the absenceof a specific redundant cue; err toward "unclear".
  • Format instructions:free-form fields default to "no_format_required" — only confident misses are flagged.
  • Nav consistency: low-confidence findings are discarded outright — cross-page diffs have high false-positive risk.

Model selection

Superadmin chooses the model per plan tier in the AI Scan Analysis panel, mirroring the AI Fix Suggestions panel. Typical setup: a cheaper Haiku-class model for scan analysis (runs frequently), a stronger model for fix suggestions (runs less often, deeper analysis). Available providers: Anthropic (Claude), Google (Gemini), OpenAI (GPT).

Coverage Comparison

Engine / layerRulesTierCoverage contribution
axe-core91All57% deterministic (baseline)
HTML_CodeSniffer (errors & warnings)~200All+10–15% deterministic
HTML_CodeSniffer (notices → advisory)same 200 rule-setAllCloses ~60% of the manual-review gap
APCA1 (contrast)AllWCAG 3.0 preview
AI scan analysis (Layer 4)3 focused checksAgency + EnterpriseCovers 1.4.1 / 3.3.2 / 3.2.3 — the highest-impact criteria AI can judge reliably
Combined~290 + 3 AITier-dependent~70% automatable + AI-flagged + HTMLCS-flagged manual-review for the rest

Security wall around AI

The AI layer consumes HTML extracted from third-party sites. That HTML is untrusted — a malicious site could embed hidden text instructing the model to exfiltrate data, misclassify findings, or emit a response containing XSS. The LLM itself never has direct access to our environment variables or database; the real attack surface is (1) a poisoned response that misleads users, (2) stored XSS in rendered AI output, (3) reconnaissance probing for tools/prompts, (4) resource exhaustion via inflated output, and (5) hostile URLs in AI content.

Physical isolation — the ai-worker service

Every LLM call in production leaves the main app over HTTP to an isolated sibling service at services/ai-worker/. That service holdsonly the LLM provider keys and an HMAC shared secret. It has no Supabase client, no Paddle keys, no Resend keys, no R2 credentials, no file-system writes (container runs with --read-only and --cap-drop=ALL), no access to other services. It's bound to the internal Docker network and never exposed to the internet. Requests are HMAC-SHA256-signed with a 60-second replay window and a 1 MB body cap.

The main app calls the worker through a single chokepoint (src/lib/ai/worker-client.ts). Provider SDK imports are dynamic and gated by a dev-only fallback that is hard-blocked in production. A compromise of the LLM path — prompt injection, supply-chain, or anything else — therefore burns tokens but exposes no user data, no tenant secrets, no billing credentials.

Prompt-layer wall (defense in depth, inside both processes)

LayerWhereWhat it does
Input sanitizerMain appStrips <script>, <style>,<iframe>, event handlers, javascript:/vbscript:/data:text/html URIs, zero-width and bidi characters, HTML comments, data-* attributes, and known prompt-injection markers before any DOM snippet touches a prompt. Visually-hidden elements are emptied.
Prompt wrapperMain appAll untrusted content sits between UNTRUSTED_START /UNTRUSTED_END markers with an explicit guardrail paragraph and a repeated reminder after the block. System prompts start with a shared safety preamble that separates trust levels and forbids the model from revealing its system prompt, tools, or environment.
Worker scrubai-workerLast chance before a response crosses the boundary. Strips script/ style/iframe/event-handler/dangerous-URI content from prose (preserves fenced code blocks), strips invisibles, caps response at 16 KB.
Output validatorMain appEvery response is parsed against a strict Zod schema. Free-form text is scrubbed (HTML tags removed, invisible chars stripped, length capped). URLs must resolve to http(s). Anything that fails validation is silently dropped.
UI escapeMain appAI output is rendered via React text nodes and custom code highlighters — no dangerouslySetInnerHTML. Even if a scrub missed something, the browser receives escaped text.

No single layer is a security boundary on its own. The value comes from combining a physical isolation perimeter with multiple prompt-layer trip wires. A poisoned response that survives the worker's scrub still has to pass the main app's Zod validation — and even if it did, it would only reach a process that doesn't hold the keys an attacker would want.maxTokens is capped per call and Layer 4 enforces a 40K token-per-scan budget that exits early when exhausted.

What stays manual even with Layer 4

Even with advisory items and AI analysis, these criteria remain out of reach of any static or async scanner — they need human testing or runtime probes:

  • 2.3.1 Photosensitive flashing — requires frame-rate analysis (PEAT)
  • 3.3.1 / 3.3.3 Error announcement timing — runtime only, requires assistive tech
  • 1.2.3 / 1.2.5 Caption / audio description quality — presence yes, correctness no
  • 2.1.1 / 2.4.3 Modal focus trap / return-focus — runtime behavior, needs Playwright interaction probes (future work, scriptable without LLM)