AlphaZero demonstrates superhuman performance playing Go, chess, and shogi. Models like R2D2 do the same playing classic Atari titles. A new approach to deep reinforcement learning is the first to achieve state-of-the-art results playing both board and video games.
What’s new: DeepMind researchers Julian Schrittwieser, Ioannis Antonoglou, and Thomas Hubert adapted techniques from AlphaZero to develop MuZero. While AlphaZero requires knowledge of game rules, MuZero does not.
Key insight: Board games like Go or chess have two players, and the only reward comes at the end of the game: a win, a loss, or in some games a draw. Video games may have only one player and offer immediate rewards. MuZero mastered these diverse conditions by learning a world model and employing AlphaZero-style search.
How it works: At each step in the game, MuZero considers the immediate outcome of a given move and the probability of winning if it is made. It analyzes potential consequences using three learned submodels and a search that combines them.
- A state-representation submodel extracts information about the current game state and uses it to form a simplified description of that state.
- Based on the simplified state description, the value-and-policy submodel predicts the optimal move to make and the expected reward for making it.
- Similarly, the dynamics-and-reward submodel predicts the next game state and the immediate reward for taking a particular action.
- At each timestep, the value-and-policy submodel guides a search of potential outcomes multiple steps ahead, and the dynamics-and-reward submodel produces many future samples. Then MuZero performs the action likely to yield the best overall rewards and value.
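The way the three submodels fit together can be sketched in a few lines of Python. This is a toy illustration, not MuZero's implementation: the functions `represent`, `dynamics`, and `predict` are hypothetical hand-written stand-ins for the learned networks, the counting game is invented for the example, and a small exhaustive lookahead takes the place of the tree search MuZero actually uses.

```python
from typing import List, Tuple

ACTIONS = [1, 2]  # hypothetical toy game: add 1 or 2 to a counter


def represent(observation: List[int]) -> int:
    """Stand-in for the state-representation submodel:
    compress the raw observation into a simplified state."""
    return sum(observation)


def dynamics(state: int, action: int) -> Tuple[int, float]:
    """Stand-in for the dynamics-and-reward submodel:
    return the next state and the immediate reward.
    Toy rule (an assumption): landing on a multiple of 3 pays 1.0."""
    next_state = state + action
    reward = 1.0 if next_state % 3 == 0 else 0.0
    return next_state, reward


def predict(state: int) -> Tuple[List[float], float]:
    """Stand-in for the value-and-policy submodel:
    return move preferences and an estimate of future reward.
    A learned policy would steer the search toward promising moves;
    this sketch uses a uniform policy and ignores it."""
    policy = [1.0 / len(ACTIONS)] * len(ACTIONS)
    value = 0.0
    return policy, value


def lookahead(state: int, depth: int, discount: float) -> float:
    """Best discounted return reachable from `state` within `depth` steps,
    using only the stand-in submodels (never the real game rules)."""
    if depth == 0:
        _, value = predict(state)  # estimate what lies beyond the horizon
        return value
    best = float("-inf")
    for action in ACTIONS:
        next_state, reward = dynamics(state, action)
        best = max(best, reward + discount * lookahead(next_state, depth - 1, discount))
    return best


def choose_action(observation: List[int], depth: int = 3, discount: float = 0.9) -> int:
    """Pick the action with the best combined immediate reward and future value."""
    root = represent(observation)
    scores = {}
    for action in ACTIONS:
        next_state, reward = dynamics(root, action)
        scores[action] = reward + discount * lookahead(next_state, depth - 1, discount)
    return max(scores, key=scores.get)


print(choose_action([1, 0]))
print(choose_action([2, 0]))
```

Planning here runs entirely inside the simplified state: `choose_action` calls the game's real rules nowhere, which is the property that lets this style of approach work without being told the rules. In MuZero the three functions are neural networks trained jointly, and the exhaustive loop is replaced by a guided tree search.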
Results: MuZero matched AlphaZero’s performance in chess, shogi, and Go with slightly less computation at each timestep. In Atari games, MuZero beat the previous state-of-the-art median score across 57 titles by 5 percent in one-tenth of the training time.
Why it matters: Previous models either perform precise planning (best for board games) or learn complicated dynamics (best for video games). MuZero shows that a single model can do both.
We’re thinking: Stellar performance in games attracts lots of attention, but making the translation to significant impact on real-world tasks has been a challenge. MuZero addresses some of the weaknesses of previous algorithms — a step toward making a difference beyond games.