How We Made AI Testing 200x Faster with Screen HTML
Why sending screenshots to AI models is the wrong approach, and how converting iOS screens to semantic HTML changed everything.
The Problem: Screenshots Are Slow and Dumb
Every AI testing tool follows the same pattern: take a screenshot, send it to a vision model, ask "what's on screen?" The AI squints at pixels and guesses.
This approach has three fatal problems:
- It's slow. Taking a screenshot, encoding it, sending it to the API, waiting for vision processing — 20-30 seconds per screen read.
- It's expensive. Vision API calls cost 5-10x more tokens than text.
- It's inaccurate. The AI can't reliably distinguish a button from a label, can't read small text, and has no concept of tap coordinates.
When you're running an autonomous QA agent that needs to read the screen 10-20 times per test, 20-30 seconds per read means a simple test takes 5+ minutes. Most of that time is wasted staring at screenshots.
The Insight: LLMs Already Understand HTML
Here's what we realized: LLMs are trained on billions of web pages. They parse HTML natively. They know that `<button>` is tappable, `<input>` takes text, and `<p>` is content.
So instead of sending a screenshot and asking "what do you see?", what if we converted the iOS screen into HTML and sent that?
```html
<!-- What the AI receives -->
<screen name="Booking Confirmation">
  <header>Confirm Your Ride</header>
  <p data-center="195,120">Mumbai → Pune</p>
  <p data-center="195,160">Seats: 2</p>
  <p data-center="195,200" class="price">Total: ₹249</p>
  <button data-center="195,450" id="pay-btn">Pay Now</button>
  <button data-center="195,510" id="cancel">Cancel</button>
</screen>
```
The AI instantly knows: there are two buttons, the price is ₹249, and to tap "Pay Now" it should target coordinates (195, 450). No vision processing. No guessing. No ambiguity.
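To make this concrete, here is a minimal sketch of how an agent-side parser could pull tap targets out of Screen HTML like the example above, using only Python's standard library. The tag names and the `data-center` attribute follow the snippet; the `tap_target` helper and everything else here are illustrative, not part of NoobQA.

```python
from html.parser import HTMLParser

class ScreenHTMLParser(HTMLParser):
    """Collects elements and their tap coordinates from Screen HTML."""
    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # element currently accumulating text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        center = attrs.get("data-center")
        if tag in ("button", "input", "p") and center:
            x, y = (int(v) for v in center.split(","))
            self._open = {"tag": tag, "id": attrs.get("id"),
                          "center": (x, y), "text": ""}

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self.elements.append(self._open)
            self._open = None

def tap_target(html: str, label: str):
    """Return the (x, y) center of the button whose text matches `label`."""
    parser = ScreenHTMLParser()
    parser.feed(html)
    for el in parser.elements:
        if el["tag"] == "button" and el["text"] == label:
            return el["center"]
    return None

screen = """
<screen name="Booking Confirmation">
<button data-center="195,450" id="pay-btn">Pay Now</button>
<button data-center="195,510" id="cancel">Cancel</button>
</screen>
"""
print(tap_target(screen, "Pay Now"))  # (195, 450)
```

The point is that "find the Pay Now button" becomes a lookup, not an inference.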
The Results
| Metric | Screenshot | Screen HTML | Improvement |
|---|---|---|---|
| Screen read time | 20-30 seconds | ~100ms | 200x faster |
| Token cost per read | ~2,000 tokens (image) | ~200 tokens (text) | 10x cheaper |
| Tap accuracy | ~70% (guessing coordinates) | ~95% (exact coordinates) | Far more reliable |
| Element identification | Often wrong | Always correct | No ambiguity |
A test that took 3-5 minutes now takes 30-60 seconds. Not because the AI got faster — because we stopped wasting time on screenshots.
How It Works Under the Hood
NoobQA uses the Noober SDK — a lightweight iOS debugging library that runs inside your app. When the AI agent calls `noober_screen_html`, here's what happens:
- Noober walks the UIKit/SwiftUI view hierarchy
- For each view, it extracts: type (button, label, input, image), text content, frame coordinates, accessibility traits
- It converts this into semantic HTML with `data-center` attributes containing tap coordinates
- The HTML is sent to the AI agent as plain text — no image encoding, no vision API
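The steps above can be modeled outside of iOS. The sketch below is a Python stand-in for the pipeline (the real SDK walks live UIKit/SwiftUI views in Swift): walk a tree of views, map each view type to an HTML tag, derive the tap center from the frame, and emit text. The node shape and the tag map are my assumptions, chosen to reproduce the coordinates from the earlier example.

```python
# Illustrative model of the Screen HTML pipeline; node shape and tag map are assumptions.
TAG_MAP = {"button": "button", "label": "p", "input": "input", "header": "header"}

def to_html(node, depth=0):
    """Recursively convert a view-tree node into semantic HTML with data-center attrs."""
    pad = "  " * depth
    tag = TAG_MAP.get(node["type"], "div")
    x, y, w, h = node.get("frame", (0, 0, 0, 0))
    cx, cy = x + w // 2, y + h // 2          # tap center = midpoint of the frame
    attrs = f' data-center="{cx},{cy}"' if "frame" in node else ""
    children = node.get("children", [])
    if children:
        inner = "\n".join(to_html(c, depth + 1) for c in children)
        return f"{pad}<{tag}{attrs}>\n{inner}\n{pad}</{tag}>"
    return f"{pad}<{tag}{attrs}>{node.get('text', '')}</{tag}>"

tree = {
    "type": "view",
    "children": [
        {"type": "label", "text": "Total: ₹249", "frame": (20, 180, 350, 40)},
        {"type": "button", "text": "Pay Now", "frame": (20, 430, 350, 40)},
    ],
}
print(to_html(tree))
```

A 350-point-wide button at x=20 gets center x=195, which is exactly the coordinate the AI later taps.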
The key insight is that iOS view hierarchies and HTML DOM trees are structurally identical. A `UIButton` maps directly to `<button>`. A `UITextField` maps to `<input>`. The mapping is natural, not forced.
Why Not Just Use the Accessibility Tree?
Good question. iOS has a built-in accessibility API that exposes the view hierarchy. Most testing tools use it (via XCUITest or similar). But:
- It's slow. Enumerating the accessibility tree takes 20-30 seconds on complex screens (the same problem as screenshots).
- Labels are often missing. SwiftUI views don't automatically expose accessibility labels unless you add them manually.
- No semantic context. The accessibility tree tells you "there's a button at (195, 450)" but not that it's a payment button inside a confirmation flow.
Our Screen HTML approach reads the actual view hierarchy (not the accessibility layer), runs in-process (no IPC overhead), and adds semantic context from the view type and content.
What This Enables
Fast screen reading unlocks capabilities that are impossible with screenshots:
- Recorded flow replay. The AI can execute 10 taps in one message because coordinates are known. No need to re-read the screen between taps.
- Failed tap recovery. If a tap doesn't work, the AI re-reads the screen in 100ms, gets fresh coordinates, and retries — all within the same turn.
- Batch verification. Check 5 values on screen in a single call instead of 5 separate screenshots.
- Real-time testing. Tests feel responsive because the AI isn't waiting for screenshots to process.
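Batch verification in particular becomes almost trivial once the screen is text: one read, several checks. A hedged sketch (the `verify_screen` helper and check names are mine, not the NoobQA API):

```python
def verify_screen(screen_html: str, expectations: dict) -> list:
    """Check several expected on-screen values in one read; return names of failed checks.
    `expectations` maps a human-readable check name to text that must appear on screen."""
    return [name for name, expected in expectations.items()
            if expected not in screen_html]

screen = """
<screen name="Booking Confirmation">
<p data-center="195,120">Mumbai → Pune</p>
<p data-center="195,160">Seats: 2</p>
<p data-center="195,200" class="price">Total: ₹249</p>
</screen>
"""

failures = verify_screen(screen, {
    "route shown": "Mumbai → Pune",
    "seat count": "Seats: 2",
    "total price": "Total: ₹249",
})
print(failures)  # [] — all checks pass from a single screen read
```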
The Bigger Picture
Screenshots are a crutch. They persist because most testing tools have no access to the app's internal state. If you can see inside the app — its view hierarchy, its network requests, its logs — you don't need to squint at pixels.
This is the core idea behind NoobQA: don't test from the outside looking in. Test from the inside looking out.
Screen HTML is just one example. Noober also gives the AI access to network requests (`noober_assert_request`), analytics events (`noober_check_event`), and app logs (`noober_get_app_logs`) — all things that are invisible to screenshot-based tools.
The result is QA testing that's not just faster, but fundamentally deeper. You're not just checking "does the button look right?" You're checking "did the API return the right data, did the analytics event fire, and is the UI showing the correct calculation?"
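That deeper check can be expressed as one assertion over three sources. The sketch below is hypothetical: the record structures and the `deep_check` helper are mine, standing in for the data the Noober tools would surface, to show the shape of a cross-layer assertion.

```python
def deep_check(screen_html: str, request: dict, events: list) -> list:
    """Cross-check UI, API response, and analytics for one booking; return problems found.
    Hypothetical data shapes, not the Noober API."""
    problems = []
    total = request["response"]["fare"] * request["response"]["seats"]
    if f"Total: ₹{total:g}" not in screen_html:   # does the UI show fare x seats?
        problems.append("UI total does not match API fare x seats")
    if "booking_confirmed" not in events:          # did the analytics event fire?
        problems.append("booking_confirmed event missing")
    return problems

screen = '<p class="price">Total: ₹249</p>'
request = {"response": {"fare": 124.5, "seats": 2}}
events = ["screen_view", "booking_confirmed"]
print(deep_check(screen, request, events))  # []
```

A screenshot-based tool can only attempt the first of these three checks, and only by reading pixels.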
That's what we built. And it runs in under 60 seconds.