Every AI Researcher Should Rethink MDP

Introduction: Why Are We Still Studying MDP in 2026?

From 2025 to 2026, MDP, a concept born in the last century, experienced an unexpected revival, and in a peculiar way: it was revived under siege. DeepSeek-R1 used GRPO to elicit long-chain reasoning from LLMs, and subsequent papers pointed out that GRPO makes degenerate assumptions about the MDP, rendering it essentially equivalent to filtered iterative supervised learning. Almost simultaneously, Ben Recht at UC Berkeley ignited a more fundamental debate in the community. He bluntly declared MDP and dynamic programming to be "red herrings," arguing that RL does not need this formalism: sample, score, update, and those three steps suffice. Critics quickly countered: once the objective is to maximize expected cumulative discounted reward, the Bellman structure emerges naturally in the derivation; it is not something you can discard at will.

These two events appear unrelated (one concerns LLM training techniques, the other a philosophical debate about RL), but they press upon the same question: when someone claims a method is "RL," on what grounds do you evaluate that claim? The answer does not lie in the related work section of some paper; it lies in your understanding of the underlying structure of MDP. Without that understanding, you can only parrot others' conclusions; with it, you can deconstruct the foundational assumptions of any algorithm yourself.

This article, however, takes a different approach from typical MDP tutorials. We will not linger on the surface of the five-tuple. Instead, we will subject MDP to two extreme degenerations and observe what it becomes when it loses certain elements.
In brief: if the state space is reduced to a single state, MDP degenerates into the multi-armed bandit, exposing the most fundamental exploration-exploitation dilemma of reinforcement learning; if we assume the agent has complete knowledge of the environment, MDP degenerates into an optimal control problem, revealing the recursive structure of optimal decision-making. The full MDP stands at the intersection of the two: the world changes in response to your decisions, and your understanding of the world is incomplete. Once you grasp these two degenerations, you have seized the two core dimensions of reinforcement learning algorithm design.

Let us begin with the simplest possible question. You stand at the entrance of a maze, facing three branching paths. You do not know which one leads to the exit, nor how long each path is. You can only take one step at a time, and at each new fork you make your next choice. Your goal is to find the exit as quickly as possible.

This problem appears simple, yet it contains nearly all the core elements of reinforcement learning: you need to explore the unknown (which path to take?), you need to exploit what you know (this path looks promising, so keep going?), and every choice you make affects not only your immediate progress but also where you will stand in the future. Reinforcement learning is about solving precisely this kind of problem: learning optimal decisions through trial and error in an unknown environment.

The mathematical language that reinforcement learning uses to describe such problems is called the Markov Decision Process (MDP). MDP is the cornerstone of reinforcement learning not merely because it describes how the world "really" works, but because it provides a precise mathematical language for stating the problem: what constitutes good behavior (maximizing return), how to measure good behavior (value functions), and what structure good behavior should satisfy (the Bellman equation).
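These three ingredients can be written out explicitly. The following is only a sketch in standard textbook notation, which is an assumption here because the text above does not fix any symbols: the five-tuple is taken to be $(S, A, P, R, \gamma)$, where $R(s, a)$ is the expected immediate reward, $P(s' \mid s, a)$ is the transition probability, and $\gamma \in [0, 1)$ is the discount factor.

```latex
% Return: the discounted sum of future rewards (what "good behavior" maximizes)
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

% Value function: the expected return when following policy \pi from state s
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]

% Bellman optimality equation: the recursive structure the optimal value satisfies
V^{*}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V^{*}(s') \right]
```

Under these assumptions, the single-state case makes the transition term trivial: the last equation collapses to $V^{*} = \max_{a} R(a) + \gamma V^{*}$, so $V^{*} = \max_{a} R(a) / (1 - \gamma)$, and the only thing left to decide is which arm to pull.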
With this language, we can then discuss "how to approximate this structure when the model is unknown," and that, in its entirety, is what reinforcement learning is about. ...
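As a minimal sketch of the two degenerations described above, the following Python code contrasts the known-model case (value iteration over a fully specified MDP) with the single-state case (a multi-armed bandit that must learn arm values by sampling). All names here (`value_iteration`, `epsilon_greedy_bandit`, the toy arm means) are hypothetical illustrations, and epsilon-greedy is simply one easy exploration strategy chosen for brevity, not a method the article prescribes.

```python
import random


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Known-model case: solve the Bellman optimality equation by iteration.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    expected immediate reward; both are assumed to be fully known.
    """
    V = [0.0] * len(P)
    while True:
        # One Bellman backup per state: best action under the current estimate.
        new_V = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(len(P))
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Single-state case: arm means are hidden and must be learned by sampling.

    Rewards are assumed Gaussian with unit variance (an illustrative choice).
    Returns the estimated mean and the pull count for each arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick a random arm
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample average for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]
    # The same problem posed as a one-state MDP: every arm returns to state 0.
    P = [[[(1.0, 0)] for _ in arm_means]]
    R = [arm_means]
    print("planning with a known model:", value_iteration(P, R))
    print("learning from samples:", epsilon_greedy_bandit(arm_means))
```

When the model is known, no sampling is needed: for this one-state example value iteration converges to approximately 0.8 / (1 - 0.9) = 8.0, the discounted value of always pulling the best arm. The bandit learner faces the same rewards but has to spend some pulls on exploration to discover which arm that is.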

June 14, 2026 · 30 min · 6341 words · Frank Tian