Mausam.

Machine learning engineer & researcher based in Nepal. Exploring computer vision and medical imaging in fundus images.

© 2026 Mausam Gurung. All rights reserved.

Deep Learning · RL

World Models: The AI That Thinks Before It Acts

How machines learn to simulate reality — and why that changes everything

May 2026 · 6 min read · Deep Learning, Reinforcement Learning, World Models, AI

Close your eyes and imagine walking from your front door to the nearest coffee shop.

You just ran a world model.

You didn't physically walk there. You simulated the route in your head — the turns, the landmarks, the approximate time — and decided whether it was worth the trip. That internal simulation is exactly what researchers mean when they talk about world models in AI.


The core idea

A world model is a learned, internal simulation of an environment. An agent that has one can ask:

"If I take this action, what will happen next?"

— and answer that question without actually doing anything.

This sounds deceptively simple. But it's one of the most powerful ideas in modern AI: lookahead of this kind is at the heart of systems that beat world champions at board games, and it is an active line of work in areas such as molecular design and autonomous driving.


Why most AI doesn't have one

Most neural networks are reactive: they take an input, produce an output, and stop. They have no concept of time, consequences, or futures. Every decision is made in the moment, with no internal simulation of "what happens if I do this."

Reinforcement learning agents are slightly better: they learn from rewards over time. But standard RL is famously sample-inefficient. To learn a video game that a human picks up in a few hours, an agent might need millions of trials. The agent keeps bumping into the world, one step at a time, with no ability to think ahead.

A world model changes this. Instead of running a million real experiments, the agent can run millions of imagined experiments inside its own head, at no cost to the real environment.

This is the idea behind model-based RL, as opposed to model-free RL, which learns directly from experience with no internal simulation. The "model" in model-based RL is the world model.


The foundation: three components

The latent world models discussed here, from the simple to the state of the art, typically rest on three foundational pieces.

1. An encoder

Raw reality is overwhelming: millions of pixels, thousands of sensor readings, complex language. The encoder compresses this into a small, abstract latent state — a compact representation that captures what matters.

Think of it as the agent forming a mental sketch of the current situation, rather than memorizing every pixel.
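
To make the "mental sketch" concrete, here is a minimal illustration of what an encoder does to the shape of the data. It is not a learned encoder: the `encode` function below is a hypothetical stand-in that simply average-pools an 8x8 grid, whereas real systems train a neural network for this job. The point is only that many numbers go in and a few come out.

```python
def encode(image, block=4):
    """Average-pool a square grayscale image into a short latent vector.

    Hand-written stand-in for a learned encoder (an assumption for
    illustration): each block x block patch becomes one latent value.
    """
    size = len(image)
    latent = []
    for r in range(0, size, block):
        for c in range(0, size, block):
            cells = [image[i][j]
                     for i in range(r, r + block)
                     for j in range(c, c + block)]
            latent.append(sum(cells) / len(cells))
    return latent


# An 8x8 "observation": a bright object in the top-left corner, dark elsewhere.
image = [[1.0 if (i < 4 and j < 4) else 0.0 for j in range(8)] for i in range(8)]

z = encode(image)
print(len(image) * len(image[0]), "pixels ->", len(z), "latent values:", z)
```

The 64 pixel values collapse to four numbers that still say where the object is, which is all a downstream model may need.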

2. A transition model

This is the world model proper. Given the current latent state and a candidate action, it predicts the next latent state.

z′ = f(z, a)

It has learned the rules of the world — not by being told them, but by watching what happens when actions are taken.
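
As a rough sketch of "learning the rules by watching", the snippet below fits a transition model from observed (z, a, z′) triples. Everything in it is an illustrative assumption: the latent state is a single number, the hidden dynamics in `true_step` are linear, and the model is fit by ordinary least squares rather than by training a neural network.

```python
import random


def true_step(z, a):
    """The environment's real dynamics. The agent never sees this formula."""
    return 0.9 * z + 0.5 * a


def fit_transition(transitions):
    """Least-squares fit of z' ~ w_z * z + w_a * a from (z, a, z') triples.

    Solves the 2x2 normal equations with Cramer's rule.
    """
    szz = sum(z * z for z, a, z2 in transitions)
    saa = sum(a * a for z, a, z2 in transitions)
    sza = sum(z * a for z, a, z2 in transitions)
    szy = sum(z * z2 for z, a, z2 in transitions)
    say = sum(a * z2 for z, a, z2 in transitions)
    det = szz * saa - sza * sza
    w_z = (szy * saa - sza * say) / det
    w_a = (szz * say - sza * szy) / det
    return w_z, w_a


# Collect experience: random states and actions, plus what actually happened next.
rng = random.Random(0)
transitions = []
for _ in range(200):
    z, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
    transitions.append((z, a, true_step(z, a)))

w_z, w_a = fit_transition(transitions)


def f(z, a):
    """Learned transition model: predicts the next latent state z'."""
    return w_z * z + w_a * a


print("learned weights:", round(w_z, 3), round(w_a, 3))
```

Because the toy data is noise-free, the fitted weights should land very close to the hidden 0.9 and 0.5. With real, noisy, nonlinear environments the fit is only approximate.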

3. A reward / value model

Given a latent state, this component predicts how good that state is. Combined with the transition model, it lets the agent score imagined futures without ever experiencing them.
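
A small sketch of scoring an imagined future might look like the following. The transition function `f`, the `reward` function, the goal at 3.0, and the discount factor are all hypothetical choices made for illustration; in a real agent both models would be learned.

```python
def f(z, a):
    """Toy transition model (assumed, not learned here): the action nudges the state."""
    return z + a


def reward(z, goal=3.0):
    """Toy reward model: states closer to the goal score higher."""
    return -abs(z - goal)


def imagined_return(z, actions, discount=0.9):
    """Roll the model forward and sum discounted predicted rewards.

    No real environment steps are taken; everything happens inside the model.
    """
    total = 0.0
    for t, a in enumerate(actions):
        z = f(z, a)
        total += (discount ** t) * reward(z)
    return total


toward = imagined_return(0.0, [1.0, 1.0, 1.0])     # walk toward the goal
away = imagined_return(0.0, [-1.0, -1.0, -1.0])    # walk away from it
print("toward goal:", toward, "| away from goal:", away)
```

The plan that moves toward the goal gets the higher score, and the agent learns that without taking a single real step.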


Planning in latent space

The magic happens when these three components work together. The agent:

  1. Observes the current state → encodes it to z
  2. Considers a candidate action a
  3. Asks the transition model: what is z′?
  4. Asks the value model: how good is z′?
  5. Repeats for many candidate actions and planning horizons
  6. Picks the action that leads to the best imagined future

This is called planning in latent space, and it is typically far cheaper than planning in raw pixel or sensor space, because each imagined step operates on a small vector rather than a full image.
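
The six steps can be sketched as a tiny planner. This is a minimal illustration under loud assumptions: the `encode`, `transition`, and `value` functions are hand-written toys rather than learned networks, the action set is three discrete moves, and the planner exhaustively tries every action sequence up to the horizon (practical systems sample or search instead).

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def encode(observation):
    """Toy encoder (assumption): the observation is already a single number."""
    return float(observation)


def transition(z, a):
    """Toy transition model: predicts the next latent state."""
    return z + a


def value(z, goal=2.0):
    """Toy value model: states closer to the goal are better."""
    return -abs(z - goal)


def plan(observation, horizon=3):
    """Return the first action of the best imagined action sequence."""
    z0 = encode(observation)                        # step 1: observe and encode
    best_score, best_first = float("-inf"), None
    for seq in product(ACTIONS, repeat=horizon):    # steps 2 and 5: candidates
        z, score = z0, 0.0
        for a in seq:
            z = transition(z, a)                    # step 3: what is z'?
            score += value(z)                       # step 4: how good is z'?
        if score > best_score:
            best_score, best_first = score, seq[0]
    return best_first                               # step 6: pick the best


for start in (0.0, 2.0, 5.0):
    print("start", start, "-> first action", plan(start))
```

Returning only the first action and re-planning after each real step is one common design choice, since it limits how far the agent commits to an imperfect model.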

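The Dreamer line of agents (most recently DreamerV3) is a landmark demonstration of this idea: trained from raw image pixels, the agent learns a latent world model and then trains its behavior on trajectories imagined entirely inside that compressed space. Strictly speaking, Dreamer learns a policy from imagined rollouts rather than searching over actions at decision time, and DreamerV3 reported strong results across a wide range of benchmark tasks with a single set of hyperparameters.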


Real systems using this idea

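MuZero learns a model of the game's dynamics and runs Monte Carlo Tree Search inside it, imagining many future positions before choosing a move. Its predecessor AlphaZero performs the same kind of lookahead, but with the known rules of the game as a perfect simulator rather than a learned model.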

Tesla has described Autopilot as building a 4D (space plus time) model of the scene from camera input, which functions as a kind of world model for anticipating where other vehicles will be in the next few seconds.

Google's Genie (2024) learns an interactive world model purely from unlabeled internet videos — then lets a user control an agent inside those imagined worlds.

Language models are a soft form of world model: trained to predict the next token, they implicitly encode facts about physics, causality, and human behavior. They just can't act in real environments directly.

The hardest open problem: compounding errors. Each prediction step is slightly wrong, and those errors multiply across a long planning horizon until the imagined world no longer resembles the real one. Keeping world models accurate over long rollouts is an active research frontier.
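
A toy rollout shows how quickly a small per-step error grows. The dynamics and the size of the model error below are illustrative assumptions, not measurements from any real system: the learned model overestimates the state's growth by 2% at every step.

```python
def true_step(z):
    """The real environment: the state drifts upward by a fixed amount."""
    return 1.00 * z + 0.1


def model_step(z):
    """Learned model with a small per-step error (the 1.02 is an assumption)."""
    return 1.02 * z + 0.1


def rollout_gap(z0, horizon):
    """Distance between the imagined and the real state after `horizon` steps."""
    z_true = z_model = z0
    for _ in range(horizon):
        z_true = true_step(z_true)
        z_model = model_step(z_model)   # the model feeds on its own predictions
    return abs(z_model - z_true)


gaps = {h: rollout_gap(1.0, h) for h in (1, 10, 50)}
for h, gap in gaps.items():
    print(f"horizon {h:>2}: gap {gap:.3f}")
```

The gap is tiny after one step and large after fifty, because each prediction is built on the previous, already slightly wrong, prediction. This is one reason planners often keep horizons short and re-plan frequently.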


Why it matters now

A growing part of the field holds a simple belief: general intelligence requires world models.

An agent that can only react will always be brittle — surprised by novel situations, unable to plan, requiring massive amounts of real-world experience. An agent with a rich world model can generalize: it has internalized the rules of the environment, not just memorized responses.

Yann LeCun has argued this is the central missing piece in current AI — that large language models alone won't reach human-level reasoning precisely because they lack a grounded, simulatable world model.

Whether he's right or not, the direction is clear: AI systems that can think before they act, simulate consequences, and plan across time are fundamentally more capable — and understanding world models is the first step to building them.


The coffee shop example wasn't trivial. You compressed your knowledge of the neighborhood into a mental map, ran a simulation, predicted the outcome, and made a decision. You did this in seconds, effortlessly.

Teaching a machine to do the same — reliably, across arbitrary environments — is one of the defining problems of modern AI.

And the world model is the foundation.
