Why Computer-Use Agents Should Think Less

Posted by Surya Dantuluri · San Francisco

Archon demo animation

Over the weekend, I placed 3rd at OpenAI's GPT-5 Hackathon with Archon, a copilot for your computer. It pairs a mini vision model for speed with GPT-5's variable reasoning for planning. I took some time to write about how it works: our approach to building a self-driving computer, the inference math behind it, and the tradeoffs we made.

Archon is a small bar that sits at the bottom of your Mac/Windows screen where you type what you want your computer to do in natural language. It takes screenshots to see what's on screen, uses GPT-5's reasoning to plan, then a custom fine-tuned model executes the clicks and keystrokes. In a racing game demo, given the single instruction to 'start playing', it recognized the view, used WASD, and navigated the track. Although it didn't win this time due to latency, its instruction-following was clearly superior to prior models. The goal is a copilot that makes computers self-driving. Archon is a lightweight client demonstrating that GPT-5's powerful reasoning, combined with tiny fine-tuned models, can control any interface through natural language.

Full demo video sped up 2x

GPT-5: Why it worked for us

Archon was built entirely on GPT-5's advanced reasoning capabilities. We leveraged probably every aspect of GPT-5, from initial development to debugging to training. Codex CLI with GPT-5 on high reasoning effort enabled us to build the entire app, and GPT-5 with vision enabled us to see and perceive the screen. GPT-5's reasoning ability was crucial for instruction following and planning. Having all of this in one model quite simply wasn't possible with any other model.

What makes GPT-5 particularly suited for computer control is its ability to reason through complex multi-step processes while maintaining context across long interactions. Unlike previous models that might hallucinate or lose track of the current state, GPT-5's chain-of-thought reasoning allows it to break down "start playing this game" into discrete, executable steps while adapting to unexpected UI changes.

We strategically calibrated how much compute to use, trading off accuracy against latency. For complex workflows, high reasoning effort mapped out interaction sequences with error handling. GPT-5-mini with function-calling preambles let us show the user what we were thinking while simultaneously calling our grounding model. This adaptive approach keeps the user in mind: whether they're a consumer, an enterprise, navigating complex and changing UIs, or just trying to get something done, we can trade reasoning for latency and vice versa.

How it actually works

+-----------------+      +----------------------+      +------------------+      +----------------+
|  User intent    | ---> |  Planner (GPT-5)     | ---> |  Archon-Mini     | ---> |  Executor      |
|  ("book a ...") |      |  plan: {click "..."} |      |  ground -> (x,y) |      |  mouse/keyboard|
+-----------------+      +----------------------+      +------------------+      +----------------+
                                                                 ^        |
                                                                 | verify |  (screen diff predicate)
                                                                 +--------+

Archon uses a hierarchical approach: a large reasoning model (GPT-5/o-series) decides what to do, and archon‑mini figures out exactly where to click. This split matters because reasoning and grounding are fundamentally different problems with different computational requirements.

The reasoning model sees the screen and your request, then outputs a semantic action: "click the blue Submit button at the bottom." Descriptions enable reasoning to be done in natural language. archon‑mini takes that description plus the screenshot and outputs exact pixel coordinates: (523, 412). One model for the "what," another for the "where."

# The loop:
screen = capture_screen()
instruction = "book a flight for tomorrow"

# Step 1: Reasoning (200-500ms)
plan = gpt5.reason(screen, instruction)
# → {"action": "click", "target": "Flights tab"}

# Step 2: Grounding (≈50ms)
coords = archon_mini.ground(screen, plan.target)
# → (567, 234)

# Step 3: Execute
mouse.click(coords.x, coords.y)

# Repeat until done

archon‑mini is a 7B Qwen‑2.5‑VL–based executor (dynamic‑res ViT) fine‑tuned with GRPO for GUI grounding. In the future, it will be trained with a combination of trajectory‑boosted human demos and synthetic teacher rollouts. It outputs direct (x, y) screen coordinates and structured tool calls.

Why vision tokens are expensive (and how to fix it)

Most of what happens on a screen is static. If we can avoid sending unchanged data, we save money and get a step-function speedup. It's similar to Cursor's Fast Apply: instead of sending your whole codebase every time you make a change, you just send the diff.

The problem: Full-frame encoding
OpenAI charges vision tokens based on image tiles. A 1080p screen gets split into 6 tiles (3×2), with a base cost of 65 tokens plus 129 tokens per tile:

Full-frame cost: 65 + (129 × 6) = 839 tokens/frame
Over 20 steps: 839 × 20 = 16,780 tokens
At $0.15/1M tokens: ~$0.0025 per task

The solution: Selective patch downsampling
I found that we could get away with making specific patches lower resolution. In the future, we'll use a saliency model to pick the most important patches. Illustrative math:
• Split into a 16×16 grid (256 patches total)
• Pick top-20 most important patches using a tiny saliency model
• Cache unchanged patches (over 70% hit rate because most of the screen is static)
• Only encode ~6 new patches per frame

Selective patch cost: 20 patches × 32 tokens × 30% new = 192 tokens/frame
Over 20 steps: 192 × 20 = 3,840 tokens
Savings: 4.4× fewer tokens (16,780 → 3,840)
Cost reduction: $0.0025 → $0.0006 per task
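
Here is the same back-of-envelope math as a quick script, a sanity check using only the figures quoted above (the 65-token base, 129 tokens per tile, 32 tokens per patch, and 30% new-patch rate are this post's illustrative numbers, not measurements):

# Back-of-envelope token math using the figures above (illustrative, not measured).
BASE, PER_TILE = 65, 129            # vision pricing as quoted above
PER_PATCH, TOP_K, NEW_FRAC = 32, 20, 0.30
PRICE_PER_M = 0.15                  # $ per 1M vision tokens
STEPS = 20

full_frame = BASE + PER_TILE * 6                  # 839 tokens (1080p ≈ 3×2 tiles)
selective  = int(TOP_K * PER_PATCH * NEW_FRAC)    # 192 tokens per frame
for name, per_frame in [("full-frame", full_frame), ("selective", selective)]:
    total = per_frame * STEPS
    print(f"{name}: {total} tokens, ${total * PRICE_PER_M / 1e6:.4f} per task")
# full-frame: 16780 tokens, $0.0025 per task
# selective: 3840 tokens, $0.0006 per task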

Why patches beat tiles for precision
A 512×512 tile covers 262K pixels. A 32×32 patch covers just 1K pixels—256× more precise for locating UI elements. For a button at pixel (523, 412):
• Tile method: ±256px accuracy
• Patch method: ±16px accuracy

[ 1920×1080 frame ]  ──►  16×16 saliency heatmap (3 MB model, ~5 ms)
                         │
                         ├─ top-K patches (~20) → crop & upscale to 32×32 → encode → ~640 tokens raw
                         │                             ▲
                         │                             └─ with 70% cache hits → ~192 new tokens/frame
                         │
                         └─ low-saliency regions → skip or cheap downsample
        + cache unchanged patches across frames
Illustration of the saliency heatmap over a screenshot.
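
Here's a minimal sketch of how the selective-patch loop could look, assuming the saliency heatmap comes from the tiny model above; the hashing-based cache, grid math, and names are illustrative, not our production code:

import hashlib
import numpy as np

PATCH = 32            # patch size in pixels (illustrative)
TOP_K = 20            # patches kept per frame
patch_cache = {}      # hash(patch bytes) -> True once encoded (stand-in for a real cache)

def patches_to_encode(frame: np.ndarray, saliency: np.ndarray):
    """Pick the top-K most salient patches, skipping ones seen in earlier frames.

    frame is HxWx3; saliency is the 16x16 heatmap aligned to the frame.
    """
    h, w = frame.shape[:2]
    top = np.argsort(saliency.ravel())[::-1][:TOP_K]   # indices of the top-K grid cells
    new_patches = []
    for idx in top:
        gy, gx = divmod(int(idx), saliency.shape[1])
        y0, x0 = gy * h // saliency.shape[0], gx * w // saliency.shape[1]
        patch = frame[y0:y0 + PATCH, x0:x0 + PATCH]
        key = hashlib.md5(patch.tobytes()).hexdigest()
        if key in patch_cache:
            continue                        # unchanged since a previous frame: cache hit
        patch_cache[key] = True
        new_patches.append(((x0, y0), patch))
    return new_patches                      # typically ~6 of 20 once the cache is warm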

Total cost per action
For the fast path (grounding only, ~10-50 text tokens output):

Cost breakdown per step:
• Vision tokens: ~192 × $0.15/1M = $0.000029
• Text tokens: ~25 × $0.60/1M = $0.000015
• GPU inference: ~$0.0001
Total: ~$0.00014 per action

When a reasoning step is involved, add the planner's token costs on top.

If you enjoy this kind of systems work (pricing math, selective encoding, cache design, test-time routing), we're hiring: prava.co/careers

Training: GRPO and synthetic data generation

Continuing on why we use patches: we trained archon-mini with GRPO using an "anywhere inside the element" reward, 1 if the click lands inside the element and 0 otherwise. Patches are small enough that clicking anywhere inside the element earns the reward. To further improve the grounding model, we found trajectory augmentation to be a good idea: starting from human demonstrations, you can generate a bunch of related trajectories, "boosting" the grounding model.

GRPO (Group Relative Policy Optimization)

for each (screenshot, target):
  sample N=8 clicks →  •  •  ○  •  ○  •  ○  •
                       hit=•  miss=○   (inside element = success)
  z-normalize rewards → advantages
  update policy: ↑prob(hits), ↓prob(misses)

It's fine if you don't click the center of the element; any pixel inside the boundary counts.
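
Here's a minimal sketch of that reward and the group-relative advantage step, leaving out the sampling and policy-update machinery (the bounding box and sampled clicks are made up for illustration):

import numpy as np

def click_reward(click_xy, element_bbox) -> float:
    """Binary reward: 1 if the click lands anywhere inside the target element."""
    x, y = click_xy
    x0, y0, x1, y1 = element_bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-normalize rewards within the N-sample group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 sampled clicks on one (screenshot, target) pair.
bbox = (500, 400, 560, 430)
clicks = [(523, 412), (510, 405), (700, 100), (555, 428),
          (520, 600), (501, 401), (559, 429), (10, 10)]
rewards = [click_reward(c, bbox) for c in clicks]          # 5 hits, 3 misses
advantages = group_relative_advantages(rewards)            # hits > 0, misses < 0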
Trajectory boosting

30 human demos
   │
   ├─ Thought completion (per-step rationale)
   ├─ Action diversification (Enter, Tab+Space, alternate pixels)
   ▼
~1k enriched steps → train with GRPO (8 samples/step) → robust grounding
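
As a rough sketch of that augmentation step (the step format and helper are hypothetical; the specific variants mirror the list above):

import random

def diversify_step(step):
    """Expand one human demo step into several equivalent training variants.

    Assumes a step like {"action": "click", "bbox": (x0, y0, x1, y1), "thought": "..."}.
    """
    variants = []
    x0, y0, x1, y1 = step["bbox"]
    # Alternate pixels: any point inside the element is an equally valid click.
    for _ in range(4):
        xy = (random.uniform(x0, x1), random.uniform(y0, y1))
        variants.append({**step, "click": xy})
    # Action diversification: confirm via Enter or Tab+Space instead of clicking.
    if step.get("confirmable"):
        variants.append({**step, "action": "key", "keys": ["Enter"]})
        variants.append({**step, "action": "key", "keys": ["Tab", "Space"]})
    return variants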

While testing, archon-mini was surprisingly bad at clicking bright red buttons, even as it reliably hit tiny blue ones. We suspect this is because the bright red buttons are more likely to be at the center of the element, and the tiny blue buttons are more likely to be at the edge of the element. More work is needed to make the model more robust and for us to interpret all its capabilities.

Speed: adaptive compute that feels instant

Test-time compute is getting extremely hyped these days, particularly off the success of the o-series models. I personally get a lot of use out of GPT-5 Pro, and previously o3-pro, because much of my day-to-day work is "knowledge work." The good thing for archon-mini is that its job is mostly "grounding work," not "knowledge work." You can get a lot of mileage out of a 7B model if you vary the reasoning and pipeline the tasks properly. In the future I intend to run archon-mini with aggressive simplicity and caching under a simple policy:

observe → ground() → execute() → verify → repeat

On this path, archon-mini runs alone (no planner call), hitting ~50 ms per action on an A100. The router only escalates when signals are uncertain: high saliency entropy, too many candidate targets, recent misclicks, or ambiguous copy (e.g., multiple "Submit" buttons). When it trips, we pipeline one step ahead: Step N (simple) executes now while the reasoner prepares a short plan for Step N+1. The router is a simple policy that looks at these signals and decides whether to escalate.

Pipelined control (perceived continuity)

time ►  [ ground ] [ ground ] [ ground ] …
          ▲ while planner prepares the next complex step ▲
Routing policy

+-------------------------------+
| signals: H(saliency), #cands, |
| ocr_density, recent_fails     |
+-------------------------------+
     /           |           \
  simple     ambiguous     complex
     |           |            |
 ground()  quick_reason   deep_plan
  ~50ms       ~200ms      500–1000ms
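
In code, the router is not much more than a few threshold checks; this is a toy version with made-up thresholds, not our tuned policy:

from dataclasses import dataclass

@dataclass
class Signals:
    saliency_entropy: float   # H(saliency): how spread out attention is
    num_candidates: int       # plausible targets matching the description
    ocr_density: float        # text-heavy screens tend to be more ambiguous
    recent_fails: int         # misclicks in the last few steps

def route(s: Signals) -> str:
    """Toy routing policy; thresholds are illustrative, not tuned values."""
    if s.recent_fails >= 2 or s.num_candidates > 3:
        return "deep_plan"      # ~500-1000 ms: full planner call
    if s.saliency_entropy > 2.5 or s.ocr_density > 0.4 or s.num_candidates > 1:
        return "quick_reason"   # ~200 ms: short clarification pass
    return "ground"             # ~50 ms: archon-mini alone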

Consumer workloads are bursty (batch=1). Enterprise is steady (batch 8–64, optimize for throughput). We want the router to be different for each.

For the typical consumer, we think it's better to bias toward the fast path (planner stays cold unless ambiguity is detected). In enterprise, we enable continuous batching for planner calls, short aggregation windows, and aggressive prefix caching; archon‑mini stays on‑GPU so grounding still feels immediate.

After ~1 hour of use we typically see a pretty high patch‑cache hit‑rate where similar patches (imagine a screenshot of a button) are cached and reused. Verifying is cheap (single screenshot + state predicate), so we keep iterating quickly without silent drift.

The net effect, compared to today's computer-use models: many steps finish in under 100 ms end-to-end, and a 20-step flow can land in a few seconds without the "stop-and-think" feel.

What's next: streaming control and unifying the stack

Screenshot loop:  capture ─ process ─ act ─ capture ─ process ─ act
Streaming input:  █████ continuous frames █████ → act → act → act

In the future we hope to run a streaming capture pipeline similar to Gemma 3: consuming frames at 20–30 fps, emitting actions at 5–10 Hz, and verifying state on each commit. This closes the perception→action loop for drag/hover/scroll and makes motion feel natural. The planner would hook into the same stream, but only for escalations.
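
A minimal sketch of what that loop could look like, with capture, policy, execute, and verify as placeholders for the real components (rates and window size are illustrative):

import time

FRAME_HZ = 25                         # consume frames at ~20-30 fps
ACTION_HZ = 8                         # emit actions at ~5-10 Hz
FRAMES_PER_ACTION = FRAME_HZ // ACTION_HZ

def streaming_loop(capture, policy, execute, verify):
    """Consume a continuous frame stream, act every few frames, verify on each commit."""
    buffer, i = [], 0
    while True:
        frame = capture()                   # latest screen frame
        buffer = (buffer + [frame])[-4:]    # short temporal context window
        i += 1
        if i % FRAMES_PER_ACTION == 0:
            action = policy(buffer)         # archon-mini over the recent frames
            if action is not None:
                execute(action)
                verify(capture())           # cheap state predicate on the new frame
        time.sleep(1.0 / FRAME_HZ)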

We also plan to compile solved steps into micro-policies. If you're running something like an RPA task or a workflow you've run before, you can simply run the execution locally (with archon-mini running locally) and skip the planning. Over time, the planner becomes a background teacher, not a crutch. We also found that recording screens is a great way to collect enough data for RL training, which materially boosts the model's performance for the specific use cases in each vertical or profession.
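
A sketch of what compiling a solved flow into a micro-policy might look like; the store, keying, and interfaces here are assumptions for illustration, not the shipped design:

# Hypothetical micro-policy store: once the planner has solved a flow, replay it
# locally with archon-mini and skip the planner on repeat runs.
micro_policies: dict[str, list[dict]] = {}   # task signature -> recorded semantic steps

def run(task: str, capture, planner, grounder, execute):
    steps = micro_policies.get(task)
    if steps is None:
        steps = planner.plan(task, capture())   # expensive path, first time only
        micro_policies[task] = steps            # "compile" the solved flow
    for step in steps:
        screen = capture()                                # fresh screenshot per step
        coords = grounder.ground(screen, step["target"])  # re-ground every run: layouts
        execute(step["action"], coords)                   # shift, semantic targets don't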

We will distill those plans into the local model so more steps stay on the fast path. The path forward is to adopt an end-to-end approach to the problem. For Tesla that's camera, steering, acceleration. For us it's screen, mouse, keyboard.

Current (hierarchical, streaming):
pixels → planner(plan) → archon‑mini(ground) → execute → verify

Evolving (more unified):
pixels → archon‑mini(policy) → execute → verify
           ↑ distilled from planner traces

Eventually we'll get rid of all the brittle policies and controls and have a model that can reason at a second order about how much compute a task requires. Today we keep a planner in the loop for rare edge cases and safety; as the executor absorbs those patterns (via streaming, macros, distillation), the system becomes simpler and more end-to-end.

With FSD V12, Tesla showed that they could replace 300K lines of driving code with end-to-end neural nets. I think we'll see a similar thing happen with the self-driving computer in the next few years.

Related Work

He, Y., Jin, J., & Liu, P. (2025). Efficient Agent Training for Computer Use. arXiv preprint arXiv:2505.13909.

Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., Xu, R., Pan, L., Xiong, C., & Li, J. (2025). GTA1: GUI Test-time Scaling Agent. arXiv preprint arXiv:2507.05791.

We're hiring

Our mission is to diffuse AGI into the economy. If you're excited about training models or applying AI to real-world problems, reach out.

@sdand · prava.co/careers