Why Computer-Use Agents Should Think Less

Posted by Surya Dantuluri · San Francisco

Archon demo animation

Over the weekend, I placed 3rd at OpenAI's GPT-5 Hackathon with Archon, a copilot for your computer. It pairs a mini vision model for speed with GPT-5's variable reasoning for planning. I took some time to write about how it works: our approach to building a self-driving computer, the inference math behind it, and the tradeoffs we made.

Archon is a small bar that sits at the bottom of your Mac/Windows screen where you type what you want your computer to do in natural language. It takes screenshots to see what's on screen, uses GPT-5's reasoning to plan, then a custom fine-tuned model executes the clicks and keystrokes. In a racing game demo, given the single instruction to 'start playing', it recognized the view, used WASD, and navigated the track. Although it didn't win this time due to latency, its instruction-following was clearly superior to prior models. The goal is a copilot that makes computers self-driving. Archon is a lightweight client demonstrating that GPT-5's powerful reasoning, combined with tiny fine-tuned models, can control any interface through natural language.

Full demo video sped up 2x

GPT-5: Why it worked for us

Archon was built entirely on GPT-5's advanced reasoning capabilities. We leveraged probably every aspect of GPT-5, from initial development to debugging to training. Codex CLI with GPT-5 on high reasoning effort enabled us to build the entire app, and GPT-5 with vision enabled us to see and perceive the screen. GPT-5's reasoning ability was crucial for instruction following and planning. Having all of this in one model quite simply wasn't possible with any other model.

What makes GPT-5 particularly suited for computer control is its ability to reason through complex multi-step processes while maintaining context across long interactions. Unlike previous models that might hallucinate or lose track of the current state, GPT-5's chain-of-thought reasoning allows it to break down "start playing this game" into discrete, executable steps while adapting to unexpected UI changes.

We strategically calibrated how much compute to use, trading off accuracy against latency. For complex workflows, high reasoning effort mapped out interaction sequences with error handling. GPT-5-mini with function-calling preambles let us show the user what we were thinking while simultaneously calling our grounding model. This adaptive approach keeps the user in mind: whether they're a consumer, an enterprise, navigating complex and changing UIs, or just trying to get something done, we can trade reasoning for latency and vice versa.

How it actually works

+-----------------+      +----------------------+      +------------------+      +----------------+
|  User intent    | ---> |  Planner (GPT-5)     | ---> |  Archon-Mini     | ---> |  Executor      |
|  ("book a ...") |      |  plan: {click "..."} |      |  ground -> (x,y) |      |  mouse/keyboard|
+-----------------+      +----------------------+      +------------------+      +----------------+
                                                                 ^        |
                                                                 | verify |  (screen diff predicate)
                                                                 +--------+

Archon uses a hierarchical approach: a large reasoning model (GPT-5/o-series) decides what to do, and archon‑mini figures out exactly where to click. This split matters because reasoning and grounding are fundamentally different problems with different computational requirements.

The reasoning model sees the screen and your request, then outputs a semantic action: "click the blue Submit button at the bottom." Descriptions enable reasoning to be done in natural language. archon‑mini takes that description plus the screenshot and outputs exact pixel coordinates: (523, 412). One model for the "what," another for the "where."

# The loop:
screen = capture_screen()
instruction = "book a flight for tomorrow"

# Step 1: Reasoning (200-500ms)
plan = gpt5.reason(screen, instruction)
# → {"action": "click", "target": "Flights tab"}

# Step 2: Grounding (≈50ms)
coords = archon_mini.ground(screen, plan.target)
# → (567, 234)

# Step 3: Execute
mouse.click(coords.x, coords.y)

# Repeat until done

archon‑mini is a 7B Qwen‑2.5‑VL–based executor (dynamic‑res ViT) fine‑tuned with GRPO for GUI grounding. In the future, it will be trained with a combination of trajectory‑boosted human demos and synthetic teacher rollouts. It outputs direct (x, y) screen coordinates and structured tool calls.

Why vision tokens are expensive (and how to fix it)

Most of what happens on a screen is static. If we can avoid sending unchanged data, we save money and get a step-function speedup. It's similar to Cursor's Fast Apply: instead of sending your whole codebase every time you make a change, you just send the diff.

The problem: Full-frame encoding
OpenAI charges vision tokens based on image tiles. A 1080p screen gets split into 6 tiles (3×2), with a base cost of 65 tokens plus 129 tokens per tile:

Full-frame cost: 65 + (129 × 6) = 839 tokens/frame
Over 20 steps: 839 × 20 = 16,780 tokens
At $0.15/1M tokens: ~$0.0025 per task

The solution: Selective patch downsampling
I found that we could get away with making specific patches lower resolution. In the future, we'll use a saliency model to pick the most important patches. Illustrative math:
• Split into a 16×16 grid (256 patches total)
• Pick top-20 most important patches using a tiny saliency model
• Cache unchanged patches (over 70% hit rate because most of the screen is static)
• Only encode ~6 new patches per frame

Selective patch cost: 20 patches × 32 tokens × 30% new = 192 tokens/frame
Over 20 steps: 192 × 20 = 3,840 tokens
Savings: 4.4× fewer tokens (16,780 → 3,840)
Cost reduction: $0.0025 → $0.0006 per task
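
Here is the same back-of-envelope math as a quick script, a sanity check using only the figures quoted above (the 65-token base, 129 tokens per tile, 32 tokens per patch, and 30% new-patch rate are this post's illustrative numbers, not measurements):

# Back-of-envelope token math using the figures above (illustrative, not measured).
BASE, PER_TILE = 65, 129            # vision pricing as quoted above
PER_PATCH, TOP_K, NEW_FRAC = 32, 20, 0.30
PRICE_PER_M = 0.15                  # $ per 1M vision tokens
STEPS = 20

full_frame = BASE + PER_TILE * 6                  # 839 tokens (1080p ≈ 3×2 tiles)
selective  = int(TOP_K * PER_PATCH * NEW_FRAC)    # 192 tokens per frame
for name, per_frame in [("full-frame", full_frame), ("selective", selective)]:
    total = per_frame * STEPS
    print(f"{name}: {total} tokens, ${total * PRICE_PER_M / 1e6:.4f} per task")
# full-frame: 16780 tokens, $0.0025 per task
# selective: 3840 tokens, $0.0006 per task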

Why patches beat tiles for precision
A 512×512 tile covers 262K pixels. A 32×32 patch covers just 1K pixels—256× more precise for locating UI elements. For a button at pixel (523, 412):
• Tile method: ±256px accuracy
• Patch method: ±16px accuracy

[ 1920×1080 frame ]  ──►  16×16 saliency heatmap (3 MB model, ~5 ms)
                         │
                         ├─ top-K patches (~20) → crop & upscale to 32×32 → encode → ~640 tokens raw
                         │                             ▲
                         │                             └─ with 70% cache hits → ~192 new tokens/frame
                         │
                         └─ low-saliency regions → skip or cheap downsample
        + cache unchanged patches across frames
Illustration of the saliency heatmap over a screenshot.
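
Here's a minimal sketch of how the selective-patch loop could look, assuming the saliency heatmap comes from the tiny model above; the hashing-based cache, grid math, and names are illustrative, not our production code:

import hashlib
import numpy as np

PATCH = 32            # patch size in pixels (illustrative)
TOP_K = 20            # patches kept per frame
patch_cache = {}      # hash(patch bytes) -> True once encoded (stand-in for a real cache)

def patches_to_encode(frame: np.ndarray, saliency: np.ndarray):
    """Pick the top-K most salient patches, skipping ones seen in earlier frames.

    frame is HxWx3; saliency is the 16x16 heatmap aligned to the frame.
    """
    h, w = frame.shape[:2]
    top = np.argsort(saliency.ravel())[::-1][:TOP_K]   # indices of the top-K grid cells
    new_patches = []
    for idx in top:
        gy, gx = divmod(int(idx), saliency.shape[1])
        y0, x0 = gy * h // saliency.shape[0], gx * w // saliency.shape[1]
        patch = frame[y0:y0 + PATCH, x0:x0 + PATCH]
        key = hashlib.md5(patch.tobytes()).hexdigest()
        if key in patch_cache:
            continue                        # unchanged since a previous frame: cache hit
        patch_cache[key] = True
        new_patches.append(((x0, y0), patch))
    return new_patches                      # typically ~6 of 20 once the cache is warm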

Total cost per action
For the fast path (grounding only, ~10-50 text tokens output):

Cost breakdown per step:
• Vision tokens: ~192 × $0.15/1M = $0.000029
• Text tokens: ~25 × $0.60/1M = $0.000015
• GPU inference: ~$0.0001
Total: ~$0.00014 per action

When a reasoning step is involved, add the planner's token costs on top.

If you enjoy this kind of systems work (pricing math, selective encoding, cache design, test-time routing), we're hiring: prava.co/careers

Training: GRPO and synthetic data generation

Continuing on why we use patches: we trained archon-mini with GRPO using an "anywhere inside the element" reward, 1 if the click lands inside the element and 0 otherwise. Patches are small enough that clicking anywhere inside the element earns the reward. To further improve the grounding model, we found trajectory augmentation to be a good idea: starting from human demonstrations, you can generate a bunch of related trajectories, "boosting" the grounding model.

GRPO (Group Relative Policy Optimization)

for each (screenshot, target):
  sample N=8 clicks →  •  •  ○  •  ○  •  ○  •
                       hit=•  miss=○   (inside element = success)
  z-normalize rewards → advantages
  update policy: ↑prob(hits), ↓prob(misses)

It's fine if you don't click the center of the element; any pixel inside the boundary counts.
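
Here's a minimal sketch of that reward and the group-relative advantage step, leaving out the sampling and policy-update machinery (the bounding box and sampled clicks are made up for illustration):

import numpy as np

def click_reward(click_xy, element_bbox) -> float:
    """Binary reward: 1 if the click lands anywhere inside the target element."""
    x, y = click_xy
    x0, y0, x1, y1 = element_bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-normalize rewards within the N-sample group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 sampled clicks on one (screenshot, target) pair.
bbox = (500, 400, 560, 430)
clicks = [(523, 412), (510, 405), (700, 100), (555, 428),
          (520, 600), (501, 401), (559, 429), (10, 10)]
rewards = [click_reward(c, bbox) for c in clicks]          # 5 hits, 3 misses
advantages = group_relative_advantages(rewards)            # hits > 0, misses < 0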
Trajectory boosting

30 human demos
   │
   ├─ Thought completion (per-step rationale)
   ├─ Action diversification (Enter, Tab+Space, alternate pixels)
   ▼
~1k enriched steps → train with GRPO (8 samples/step) → robust grounding
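
As a rough sketch of that augmentation step (the step format and helper are hypothetical; the specific variants mirror the list above):

import random

def diversify_step(step):
    """Expand one human demo step into several equivalent training variants.

    Assumes a step like {"action": "click", "bbox": (x0, y0, x1, y1), "thought": "..."}.
    """
    variants = []
    x0, y0, x1, y1 = step["bbox"]
    # Alternate pixels: any point inside the element is an equally valid click.
    for _ in range(4):
        xy = (random.uniform(x0, x1), random.uniform(y0, y1))
        variants.append({**step, "click": xy})
    # Action diversification: confirm via Enter or Tab+Space instead of clicking.
    if step.get("confirmable"):
        variants.append({**step, "action": "key", "keys": ["Enter"]})
        variants.append({**step, "action": "key", "keys": ["Tab", "Space"]})
    return variants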

While testing, archon-mini was surprisingly bad at clicking bright red buttons, even as it reliably hit tiny blue ones. We suspect this is because the bright red buttons are more likely to be at the center of the element, and the tiny blue buttons are more likely to be at the edge of the element. More work is needed to make the model more robust and for us to interpret all its capabilities.

Speed: adaptive compute that feels instant

Test-time compute is getting extremely hyped these days, particularly off the success of the o-series models. I personally get a lot of use out of GPT-5 Pro, and previously o3-pro, because much of my day-to-day work is "knowledge work." The good thing for archon-mini is that its job is mostly "grounding work," not "knowledge work." You can get a lot of mileage out of a 7B model if you vary the reasoning and pipeline the tasks properly. In the future I intend to run archon-mini with aggressive simplicity and caching under a simple policy:

observe → ground() → execute() → verify → repeat

On this path, archon-mini runs alone (no planner call), hitting ~50 ms per action on an A100. The router only escalates when signals are uncertain: high saliency entropy, too many candidate targets, recent misclicks, or ambiguous copy (e.g., multiple "Submit" buttons). When it trips, we pipeline one step ahead: Step N (simple) executes now while the reasoner prepares a short plan for Step N+1. The router is a simple policy that looks at these signals and decides whether to escalate.

Pipelined control (perceived continuity)

time ►  [ ground ] [ ground ] [ ground ] …
          ▲ while planner prepares the next complex step ▲
Routing policy

+-------------------------------+
| signals: H(saliency), #cands, |
| ocr_density, recent_fails     |
+-------------------------------+
     /           |           \
  simple     ambiguous     complex
     |           |            |
 ground()  quick_reason   deep_plan
  ~50ms       ~200ms      500–1000ms
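
In code, the router is not much more than a few threshold checks; this is a toy version with made-up thresholds, not our tuned policy:

from dataclasses import dataclass

@dataclass
class Signals:
    saliency_entropy: float   # H(saliency): how spread out attention is
    num_candidates: int       # plausible targets matching the description
    ocr_density: float        # text-heavy screens tend to be more ambiguous
    recent_fails: int         # misclicks in the last few steps

def route(s: Signals) -> str:
    """Toy routing policy; thresholds are illustrative, not tuned values."""
    if s.recent_fails >= 2 or s.num_candidates > 3:
        return "deep_plan"      # ~500-1000 ms: full planner call
    if s.saliency_entropy > 2.5 or s.ocr_density > 0.4 or s.num_candidates > 1:
        return "quick_reason"   # ~200 ms: short clarification pass
    return "ground"             # ~50 ms: archon-mini alone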

Consumer workloads are bursty (batch=1). Enterprise is steady (batch 8–64, optimize for throughput). We want the router to be different for each.

For the typical consumer, we think it's better to bias toward the fast path (planner stays cold unless ambiguity is detected). In enterprise, we enable continuous batching for planner calls, short aggregation windows, and aggressive prefix caching; archon‑mini stays on‑GPU so grounding still feels immediate.

After ~1 hour of use we typically see a pretty high patch‑cache hit‑rate where similar patches (imagine a screenshot of a button) are cached and reused. Verifying is cheap (single screenshot + state predicate), so we keep iterating quickly without silent drift.

The net effect, compared to today's computer-use models: many steps finish in under 100 ms end-to-end, and a 20-step flow can land in a few seconds without the "stop-and-think" feel.

What's next: streaming control and unifying the stack

Screenshot loop:  capture ─ process ─ act ─ capture ─ process ─ act
Streaming input:  █████ continuous frames █████ → act → act → act

In the future we hope to run a streaming capture pipeline similar to Gemma 3: consuming frames at 20–30 fps, emitting actions at 5–10 Hz, and verifying state on each commit. This closes the perception→action loop for drag/hover/scroll and makes motion feel natural. The planner would hook into the same stream, but only for escalations.
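
A minimal sketch of what that loop could look like, with capture, policy, execute, and verify as placeholders for the real components (rates and window size are illustrative):

import time

FRAME_HZ = 25                         # consume frames at ~20-30 fps
ACTION_HZ = 8                         # emit actions at ~5-10 Hz
FRAMES_PER_ACTION = FRAME_HZ // ACTION_HZ

def streaming_loop(capture, policy, execute, verify):
    """Consume a continuous frame stream, act every few frames, verify on each commit."""
    buffer, i = [], 0
    while True:
        frame = capture()                   # latest screen frame
        buffer = (buffer + [frame])[-4:]    # short temporal context window
        i += 1
        if i % FRAMES_PER_ACTION == 0:
            action = policy(buffer)         # archon-mini over the recent frames
            if action is not None:
                execute(action)
                verify(capture())           # cheap state predicate on the new frame
        time.sleep(1.0 / FRAME_HZ)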

We also plan to compile solved steps into micro-policies. If you're running something like an RPA task or a workflow you've run before, you can simply run the execution locally (with archon-mini running locally) and skip the planning. Over time, the planner becomes a background teacher, not a crutch. We also found that recording screens is a great way to collect enough data for RL training, which materially boosts the model's performance for the specific use cases in each vertical or profession.
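
A sketch of what compiling a solved flow into a micro-policy might look like; the store, keying, and interfaces here are assumptions for illustration, not the shipped design:

# Hypothetical micro-policy store: once the planner has solved a flow, replay it
# locally with archon-mini and skip the planner on repeat runs.
micro_policies: dict[str, list[dict]] = {}   # task signature -> recorded semantic steps

def run(task: str, capture, planner, grounder, execute):
    steps = micro_policies.get(task)
    if steps is None:
        steps = planner.plan(task, capture())   # expensive path, first time only
        micro_policies[task] = steps            # "compile" the solved flow
    for step in steps:
        screen = capture()                                # fresh screenshot per step
        coords = grounder.ground(screen, step["target"])  # re-ground every run: layouts
        execute(step["action"], coords)                   # shift, semantic targets don't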

We will distill those plans into the local model so more steps stay on the fast path. The path forward is to adopt an end-to-end approach to the problem. For Tesla that's camera, steering, acceleration. For us it's screen, mouse, keyboard.

Current (hierarchical, streaming):
pixels → planner(plan) → archon‑mini(ground) → execute → verify

Evolving (more unified):
pixels → archon‑mini(policy) → execute → verify
           ↑ distilled from planner traces

Eventually we'll get rid of all the brittle policies and controls and have a model that can reason at a second order about how much compute a task requires. Today we keep a planner in the loop for rare edge cases and safety; as the executor absorbs those patterns (via streaming, macros, distillation), the system becomes simpler and more end-to-end.

With FSD V12, Tesla showed that they could replace 300K lines of driving code with end-to-end neural nets. I think we'll see a similar thing happen with the self-driving computer in the next few years.

Related Work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.

We're hiring

Our mission is to diffuse AGI into the economy. If you're excited about training models or applying AI to real-world problems, reach out.

@sdand · prava.co/careers