The Takeaway: The real leap isn’t bigger models — it’s models that can learn from the world, keep memory, and build their own scaffolding.
- World models matter because they may unlock a deeper kind of understanding than text-only scaling ever will.
- The next agent breakthrough may come less from clever prompts and more from models deciding when to reason, delegate, or even write their own sub-agents.
- Continual learning looks most practical as file-system-style memory, not as constant rewriting of model weights.
Oriol Vinyals, co-lead of Gemini at Google, comes to AI from a long deep-learning lineage, and his philosophy is basically: keep pushing generality until the system itself becomes the product. On world models, he draws a sharp line between today’s strong multimodal systems and the bigger prize: extracting structure from video and images without leaning so hard on language. The GPT moment for video and images is something “I’m not sure we quite have seen” yet, he says, because the field still relies on text to bridge concepts like gravity, motion, and cause and effect.
That same bias toward generality shows up in agents. Vinyals thinks the important shift is not just building a better scaffold around a model, but eventually letting the model generate the scaffold itself. The point isn’t endless token-spending; it’s deciding “should you reason, for how long should you reason” based on task complexity. That’s a very Google-ish answer: make the system broad, then let intelligence specialize it on demand.
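To make the "should you reason, for how long" idea concrete, here is a toy sketch of a router that picks between answering directly, reasoning with a token budget, or delegating. Everything in it is an assumption for illustration: the `Plan` type, the thresholds, the budgets, and especially `estimate_complexity`, a crude word-count heuristic standing in for a judgment that Vinyals suggests the model itself would make.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    mode: str              # "answer", "reason", or "delegate" (hypothetical labels)
    reasoning_tokens: int  # budget for intermediate reasoning


# Words that hint at a multi-step task (an arbitrary, illustrative list).
STEP_WORDS = {"then", "and", "after", "compare"}


def estimate_complexity(task: str) -> float:
    """Toy stand-in for the model's own judgment: a score in [0, 1]."""
    words = [w.lower().strip(",.?!") for w in task.split()]
    multi_step = sum(w in STEP_WORDS for w in words)
    return min(1.0, len(words) / 40 + 0.2 * multi_step)


def choose_plan(task: str) -> Plan:
    """Map the complexity score to a mode and a reasoning budget."""
    c = estimate_complexity(task)
    if c < 0.3:
        return Plan("answer", 0)                # simple: no extra reasoning
    if c < 0.7:
        return Plan("reason", int(2000 * c))    # budget grows with complexity
    return Plan("delegate", 4000)               # hard: hand off to a sub-agent


print(choose_plan("What is 2 + 2?").mode)
print(choose_plan(
    "Compare the three vendor contracts, then summarize the differences "
    "and draft an email after checking the budget"
).mode)
```

The interesting part is not the heuristic, which is deliberately naive, but the shape: spending on reasoning is a decision made per task rather than a fixed setting.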
On memory, he’s even more concrete. Working memory is already strong; the missing piece is durable, retrievable knowledge. His preferred path is nonparametric, meaning the knowledge lives outside the model’s weights: a personal knowledge base, files, folders, retrieval. That is easier to serve than custom weights for every user. In other words: the future may look less like one giant brain and more like a model with a very good hard drive.
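A minimal sketch of what file-system-style memory could look like, assuming one folder per user and one plain-text file per note. The `FileMemory` class, its method names, and the word-overlap scoring are all hypothetical and chosen for brevity. A real system would likely use a proper retrieval index, and nothing here reflects how Gemini actually does it.

```python
import re
import tempfile
from pathlib import Path


class FileMemory:
    """Hypothetical per-user memory: one text file per note, in one folder."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def remember(self, name: str, text: str) -> None:
        """Persist a note; the model's weights are never touched."""
        (self.root / f"{name}.txt").write_text(text, encoding="utf-8")

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return names of up to k notes sharing the most words with the query."""
        terms = set(re.findall(r"\w+", query.lower()))
        scored = []
        for path in sorted(self.root.glob("*.txt")):
            text = path.read_text(encoding="utf-8").lower()
            overlap = len(terms & set(re.findall(r"\w+", text)))
            if overlap:
                scored.append((overlap, path.stem))
        scored.sort(key=lambda pair: -pair[0])  # stable sort: ties stay alphabetical
        return [name for _, name in scored[:k]]


# Usage: the folder name "user_42" is a placeholder.
memory = FileMemory(Path(tempfile.mkdtemp()) / "user_42")
memory.remember("travel", "Prefers aisle seats and morning flights.")
memory.remember("diet", "Vegetarian; allergic to peanuts.")
print(memory.recall("book morning flights"))  # prints ['travel']
```

The serving argument shows up in the structure: every user shares the same model, and personalization is just a directory that can be read, edited, or deleted.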