
Karpathy's Roadmap: How AI Output Will Evolve From Text to Neural Video

by needhelp
Tags: ai, karpathy, llm, ui, future, hci

Andrej Karpathy, former Director of AI at Tesla and a founding member of OpenAI, published a thread this week (13K+ likes) that lays out a surprisingly concrete roadmap for the evolution of AI-human interaction. It’s not about bigger models or longer contexts; it’s about how AI communicates back to us.

His thesis: audio is the human-preferred input to AI, but vision is the preferred output. Around a third of our brain’s cortex is dedicated to visual processing — a “10-lane superhighway of information into the brain.” The output side is where the real transformation is happening.


The Five (or Six, or N) Stages of AI Output

Karpathy maps out a progression from primitive to futuristic:

Stage 1: Raw Text

“Hard and effortful to read.” This is GPT-2 era output — walls of unformatted text that require cognitive labor to parse.

Stage 2: Markdown ← Current Default

Bold, italic, headings, tables. “A bit easier on the eyes.” This is where most LLM interfaces sit today — ChatGPT, Claude, Gemini all output rich Markdown. It’s better than plain text, but still procedural (code-generated) rather than truly visual.
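
Under the hood, most chat UIs implement Stage 2 by piping model text through a Markdown renderer. A minimal sketch with the Python-Markdown package (the “model output” here is a hard-coded stand-in, not a real API response):

```python
import markdown  # pip install markdown

# Stand-in for raw Stage 2 model output: Markdown with structure baked in.
llm_output = (
    "## Results\n\n"
    "| Metric   | Value  |\n"
    "|----------|--------|\n"
    "| Latency  | 120 ms |\n"
    "| Accuracy | 94%    |\n"
)

# What chat UIs do behind the scenes: convert Markdown to HTML for display.
html = markdown.markdown(llm_output, extensions=["tables"])
print(html)
```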

Stage 3: HTML ← Early but Forming

“Still procedural, but a lot more flexibility on graphics, layout, even interactivity.” Karpathy recommends asking your LLM to “structure your response as HTML” and viewing it in a browser. He’s had success asking for slideshows, interactive dashboards, and more.

Important detail: We’re in this stage right now. Anyone can try it today with a simple prompt addition. Karpathy considers this the emerging new default.
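
A minimal sketch of the technique using the OpenAI Python SDK; the model name and prompt wording are illustrative, and any chat-capable model and SDK would work the same way:

```python
import webbrowser
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any capable chat model works
    messages=[
        # Karpathy's suggestion, phrased as a system-level instruction.
        {
            "role": "system",
            "content": "Structure your response as a single self-contained "
                       "HTML page (inline CSS/JS, no external resources).",
        },
        {
            "role": "user",
            "content": "Explain TCP congestion control as an interactive slideshow.",
        },
    ],
)

# Save the HTML and open it in the default browser: the Stage 3 viewport.
out = Path("response.html")
out.write_text(response.choices[0].message.content, encoding="utf-8")
webbrowser.open(out.resolve().as_uri())
```

The same pattern covers the outputs Karpathy mentions trying, from slideshows to interactive dashboards; the only thing that changes is the format instruction.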

Stage 4: Interactive Neural Videos

This is where Karpathy’s prediction gets ambitious. He envisions “interactive videos generated directly by a diffusion neural net” — not a video file, but a real-time rendered visual experience that responds to user interaction. It would blend “Software 1.0” artifacts (procedural logic, simulations) with neural artifacts (diffusion grids, generative visuals).
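
Nothing like this ships today, but the shape of the hybrid is easy to sketch. Everything below is hypothetical: DiffusionRenderer stands in for a real-time neural video model that does not yet exist, while the state update is ordinary “Software 1.0” code:

```python
import time

class DiffusionRenderer:
    """Hypothetical stand-in for a real-time diffusion video model."""
    def render(self, state: dict, user_input: str | None) -> bytes:
        # A real system would denoise a frame conditioned on state + input.
        return f"<frame for {state}>".encode()

def simulation_step(state: dict) -> dict:
    # "Software 1.0" side: deterministic, procedural logic.
    return {**state, "t": state["t"] + 1}

def run(frames: int = 3) -> None:
    state, renderer = {"t": 0}, DiffusionRenderer()
    for _ in range(frames):
        state = simulation_step(state)        # procedural update
        frame = renderer.render(state, None)  # neural rendering
        print(frame)                          # display stand-in
        time.sleep(1 / 30)                    # ~30 fps frame budget

run()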

Stage N: Full Mind-Meld

Karpathy acknowledges this endpoint exists but is far off — the Neuralink-style brain-computer interface. His key point is that there’s enormous progress to be made before we need to drill into skulls. The I/O surface between humans and AI is still primitive, and improving it is a tractable engineering problem.


Why This Matters: The Input/Output Asymmetry

Karpathy’s framework reveals an asymmetry most people miss:

Input side (human → AI): We’ve made massive progress: voice dictation, multimodal upload (images, screenshots, PDFs). But Karpathy notes even this isn’t solved; he still feels the need to point at things on screen, gesture, and sketch. Voice alone is too narrow a channel.

Output side (AI → human): We’re mostly stuck in Markdown. The reason this matters is that information density is gated by output format. A table communicates more than a paragraph. A chart more than a table. An interactive visualization more than a static chart. A simulation more than all of them.

Karpathy references a recently viral interactive neural rendering demo as evidence that Stage 4 technology is already germinating.


What This Means for Product Builders

1. The HTML Output Prompt is a Product

The simple instruction “structure your response as HTML” is a surprisingly effective prompt-engineering technique. Karpathy’s endorsement, coming from someone who helped build the foundations of modern AI, suggests it should be treated as a product primitive, not a hack.

2. The Browser Is Becoming the AI Viewport

If Stage 3 (HTML output) becomes the default, the browser transforms from a document viewer into an AI rendering surface. This has implications for every company building LLM-powered products — the competition won’t be about model benchmarks, but about who can render the richest output experience.
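
One concrete consequence: model-generated HTML is untrusted content, so products that render it will likely wrap it rather than inject it into their own page. A sketch of one common isolation approach, a sandboxed iframe via srcdoc (the sandbox attribute is standard HTML; the wrapper itself is illustrative):

```python
import html

def wrap_llm_html(llm_html: str) -> str:
    """Embed model-generated HTML in a sandboxed iframe so its scripts
    run in an isolated origin and can't touch the host page."""
    return (
        "<!doctype html><html><body>"
        '<iframe sandbox="allow-scripts" '
        f'srcdoc="{html.escape(llm_html, quote=True)}" '
        'style="width:100%;height:90vh;border:0"></iframe>'
        "</body></html>"
    )

print(wrap_llm_html("<h1>Model output</h1><script>alert(1)</script>"))
```

Because the sandbox omits allow-same-origin, the embedded document gets a unique opaque origin: its scripts can run, but they cannot reach the host page’s DOM, cookies, or storage.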

3. The Future of “Reading” AI Output Is “Watching” It

If Karpathy is right about Stage 4, we’re heading toward a world where AI doesn’t tell you things; it shows you. The roughly one-third of cortex dedicated to visual processing is an underutilized channel in current AI interfaces. Whoever connects to that channel first wins.


Skepticism Check

Karpathy’s vision is compelling, but it’s worth noting what’s missing:

  • Latency: Neural video generation is computationally expensive. Stage 4 requires real-time diffusion, which is not yet practical even on high-end hardware.
  • Determinism: Interactive simulations require procedural consistency that diffusion models struggle with. The “Software 1.0 + Neural” hybrid Karpathy envisions is a hard engineering problem.
  • Trust: Visual output is inherently harder to fact-check than text. A hallucinated chart looks just as convincing as a real one.

These aren’t arguments against the direction — they’re the reasons it’s a roadmap rather than a product release.


The Bottom Line

Karpathy’s thread is a rare artifact: a roadmap drawn by someone with the technical depth to make it credible and the product instinct to make it actionable. The key takeaway for anyone building AI products: the output surface is the next frontier. As models converge on capability, differentiation will come from how AI presents information — and the browser, not the chat window, is where that battle will be fought.
