POMDP Research in the Seventies Notes for presentation at the AAAI Symposium October 24, 1998 Loren Platzman loren_platzman@mindspring.com I wrote a doctoral dissertation on the subject of POMDP's in the seventies, and briefly pursued research in the area, before moving on to other activities, and eventually leaving academics altogether. So I'm pretty much unfamiliar with what's been going on over the past twenty years. The aim of this talk is to tell you something about what we were thinking back then. The thing we *weren't* thinking about is computational complexity. It was enough to be able to define the problem, formulate a solution methodology that might involve an indeterminate number of iterations (but eventually terminate, assuming perfect real arithmetic) and demonstrate its efficacy by solving the smallest conceivable problem instance (often by hand). Computers were so memory-constrained that other limitations had not yet been observed. For example, if a larger problem exhibited difficulties with numerical accuracy, it was supposed that double-precision would fix it, but one couldn't verify this, because there wasn't enough memory to store the problem parameters in double precision format. A classic example of this viewpoint is solving linear programming problems by the simplex algorithm. Computer scientists were already thinking about these issues, of course, but the original POMDP researchers weren't computer scientists. We came from backgrounds in Operations Research, Statistics, and Automatic Control, all of which were based on traditional methodologies predating the emergence of modern computing. The origins of POMDP research can be traced back to Bellman's introduction of Dynamic Programming (DP), in the fifties. Bellman considered decision-making in an uncertain environment, and asserted such problems could be solved by DP, provided the decision-maker could observe the 'state' at any given time. In keeping with the conventions of the time, this meant it was adequate to consider systems with only a handful of discrete states, and thus the MDP was born. Howard's policy-iteration algorithm solved problems having an indeterminate number of stages, thus eliminated the most obvious run-time memory difficulty. The POMDP was introduced by Al Drake, a doctoral student of Ron Howard, and later, my own Ph.D. advisor. I recall a course I took from him in 1974, where Al joked that more papers had been written about the MDP than real problems had ever been solved using it. A decade later, Chip White and others undertook a heroic search for 'real' MDP applications, and found ... well, they validated Al's claim, so far as I'm concerned. Al's point was that the assumption that the decision-maker knows the state is too convenient to be taken seriously in applications. (Why not assume the decision-maker knows the right action to take? That would be even more convenient.) So he tried looking at problems where the decision-maker does not know the state. And things got ugly real fast, in a way that could be appreciated even back then. Suppose the underlying system has only two states (as simple as it gets) but the decision-maker does not know for sure which is the current state. The current information state is single real number in the range 0 to 1, so the optimal policy is a function having this number as its domain. There was no obvious data structure to express such a function, and this simplest problem couldn't be solved manually. Al managed to solve a sample problem approximately by rounding the information state to the closest of a given set of values (called the grid). But this wasn't an exact solution of a trivial problem - it was an approximate solution - and so the problem remained 'open'. Ed Sondik resolved the problem of strategy expression by showing that the information space can be partitioned into a finite set of regions, and then solved exactly, using the regions as surrogate states. Unfortunately, the cardinality of the partition grows quickly with the number of stages, so the stationary virtues of Howard's method were lost. In my own work, I considered the possibility that the partition cardinality might not grow without limit. If so, there would be a finite number of information regions, where the exact value of the information state within the region was irrelevant, and the resulting problem could be solved as and MDP. I took Sondik's sample problem as my sample problem, and solved it. This validated the method enough to earn me a degree. This was inspired in part by wishful thinking (what if the difficulty Sondik fears might happen doesn't actually happen?), but also by conventional thinking in control theory. In aerospace problems, the state is a vector of positions, velocities and orientations, the action is a vector of flap positions and throttle levels, and measurements are corrupted by white noise. Under assumptions that really do hold in first-order approximations, the optimal policy can be shown to be a linear system, defined by a small matrix. In other words, the original problem, which is too difficult to solve, is replaced by an approximation of the problem, which is easy to solve, and the exact solution to the approximate solution is used as an approximate solution to the exact problem. In these terms, I added the artificial requirement that the POMDP must be controlled by a finite-state machine, and then solved the problem of finding the best such constrained controller. I was also inspired by then-current work in statistics. There was much interest, at the time, in a problem called the Two-Armed Bandit problem. A compulsive gambler feeds coins into one or the other of two machines forever, and attempts to maximize the return (minimize the damage) by directing the coins to the machine with more favorable odds. This is an abstraction of problems of drilling for oil, or corporate R&D. In the finite horizon, the optimal solution is to sample both machines for a while (exploration) and then feed only the machine that appears best (harvest). This shows that simple problems often have simple solutions, if you look at it the right way. It also shows that the undiscounted infinite-horizon, wielded carelessly, can lead to disturbing paradoxes. To make the stationary model work, one must introduce an element of forgetfulness into the system, an expression of second-law-of-thermodynamics, to ensure that information observed long ago will become irrelevant. Equivalently, discounting can accomplish this by discouraging one from banking on payoffs in the far distant future. These can also be viewed as ways of making the problem less sensitive to fine details, and consequently more solvable by crude methods. After I gave up working on POMDPs, I spent a decade working on material handling methods in manufacturing and distribution. All such problems can be modeled as POMDPs, where the state is a network of queues. But the POMDP model doesn't really help to understand and solve these problems, both in the sense that the POMDP formulation can't be solved, and also, if it could be solved, it wouldn't be implementable. The decision makers in these problems were primitive robotic devices or people, both incapable of implementing complex computational strategies. The real problem called for a viable rule of thumb, not an optimal solution. And this, I still believe, is where our best hopes for POMDP progress lie.