Markov decision processes (MDPs) are widely used for modeling decision-making problems in robotics, automated control, and economics. Traditional MDPs assume that the decision maker (DM) knows all states and actions. However, this may not be true in many situations of interest. We define a new framework, MDPs with unawareness (MDPUs), which allows for the possibility that a DM may not be aware of all possible actions. We provide a complete characterization of when a DM can learn to play near-optimally in an MDPU, and give an algorithm that learns to play near-optimally when it is possible to do so, as efficiently as possible. In particular, we characterize when a near-optimal solution can be found in polynomial time.
We formalize decision-making problems in robotics and automated control using continuous MDPs and actions that take place over continuous time intervals. We then approximate the continuous MDP using finer and finer discretizations. Doing this results in a family of systems, each of which has an extremely large action space, although only a few actions are “interesting”. This can be modeled using MDPUs, where the action space is much smaller. As we show, MDPUs can be used as a general framework for learning tasks in robotic problems. We prove results on the difficulty of learning a near-optimal policy in an MDPU for a continuous task. We apply these ideas to the problem of having a humanoid robot learn on its own how to walk.
Finally, we consider the scenario in which the DM has a limited budget for solving the problem on top of unawareness. In order to deal with such problems, we define a model called MDPUs with a prior and a budget (MDPUBs) that considers both unawareness and a limited budget. We also consider the problem of learning to play approximately optimally in a subclass of MDPUBs called budgeted learning problems with unawareness (BLPUs), which are multiarmed bandit problems in which there may be arms that the DM is unaware of. We provide a policy that is 0:25c-optimal for BLPUs, where c is a constant determined by the probability of discovering a new arm and the probability of there being undiscovered arms, that can be computed in polynomial time.
|School Location:||United States -- New York|
|Source:||DAI-B 78/04(E), Dissertation Abstracts International|
|Keywords:||Budgeted learning, Learning, Markov decision process, Multi-armed bandit problem, Robotics, Unawareness|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be