Sequential decision-making abounds in real-world problems, from robots that must interact with humans to companies aiming to provide reasonable services to their customers. Applications are as diverse as self-driving cars, health care, agriculture, robotics, manufacturing, drug discovery, and aerospace. Reinforcement Learning (RL), the study of sequential decision-making under uncertainty, captures a core challenge in these real-world applications.
Since most practical applications of interest in RL are high dimensional, we study RL problems from theory to practice in high-dimensional, structured, and partially observable settings. We show how to develop statistically efficient RL algorithms for a variety of problems, from recommendation systems to robotics and games. We study these problems theoretically from first principles to provide RL agents that interact efficiently with their surrounding environment and learn the desired behavior while minimizing their regret.
We study linear bandit problems and propose Projected Stochastic Linear Bandit (PSLB), an upper-confidence-bound-based algorithm for linear bandits that exploits the intrinsic structure of the decision-making problem to significantly enhance the performance of RL agents.
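The details of PSLB are in the dissertation itself; as a hedged illustration of the generic upper-confidence-bound linear bandit template it builds on, here is a minimal LinUCB-style sketch. All names, parameters, and the toy reward model are illustrative assumptions, not the dissertation's algorithm.

```python
import numpy as np

def linucb(arms, reward_fn, horizon=1000, alpha=1.0, lam=1.0):
    """Minimal UCB-style linear bandit loop (illustrative sketch, not PSLB).

    arms: (K, d) array of fixed arm feature vectors.
    reward_fn: maps a chosen feature vector to a noisy scalar reward.
    """
    K, d = arms.shape
    A = lam * np.eye(d)          # regularized Gram matrix of played arms
    b = np.zeros(d)              # running sum of x_t * r_t
    total_reward = 0.0
    for _ in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b    # ridge estimate of the unknown parameter
        # optimism: estimated mean plus a confidence width for each arm
        widths = np.sqrt(np.einsum("kd,de,ke->k", arms, A_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + alpha * widths))]
        r = reward_fn(x)
        A += np.outer(x, x)      # update statistics with the played arm
        b += r * x
        total_reward += r
    return theta_hat, total_reward
```

On a synthetic problem with a hidden parameter vector, the loop concentrates its plays on high-reward arms, so its average reward beats uniform-random arm selection.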
We study RL in Markov Decision Processes (MDPs) and propose the first sample-efficient model-free algorithm for general continuous state and action space MDPs. We further investigate the safe RL setting and introduce a safe RL algorithm that avoids catastrophic mistakes an RL agent might otherwise make. We extensively study tree-based methods, a popular family of RL techniques that is also at the core of AlphaGo, the system that beat master players of board games such as Go.
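The generic template behind such tree-based methods can be sketched as Monte Carlo tree search with UCT selection. The toy full-binary-tree domain, function names, and constants below are illustrative assumptions, not the dissertation's method or AlphaGo's implementation.

```python
import math
import random

def uct_search(leaf_rewards, depth, iters=2000, c=1.4, seed=0):
    """Minimal UCT sketch on a full binary tree of the given depth.

    leaf_rewards: reward of each leaf, indexed by the path read as binary.
    Returns the first move (0 or 1) with the most visits.
    """
    rng = random.Random(seed)
    N = {(): 0}      # visit counts, keyed by path (tuple of 0/1 moves)
    W = {(): 0.0}    # total reward backed up through each node
    def rollout(path):
        # random playout down to a leaf, then read off its reward
        path = list(path)
        while len(path) < depth:
            path.append(rng.randrange(2))
        return leaf_rewards[int("".join(map(str, path)), 2)]
    for _ in range(iters):
        path = ()
        # selection: descend by UCB while both children are expanded
        while len(path) < depth and all((path + (a,)) in N for a in (0, 1)):
            parent = path
            path = max((parent + (a,) for a in (0, 1)),
                       key=lambda ch: W[ch] / N[ch]
                       + c * math.sqrt(math.log(N[parent]) / N[ch]))
        # expansion: add one unexpanded child (unless at a leaf)
        if len(path) < depth:
            path = next(path + (a,) for a in (0, 1) if (path + (a,)) not in N)
            N[path], W[path] = 0, 0.0
        # simulation and backpropagation along the selected path
        r = rollout(path)
        for i in range(len(path) + 1):
            N[path[:i]] += 1
            W[path[:i]] += r
    return max((0, 1), key=lambda a: N.get((a,), 0))
```

With enough iterations the visit counts concentrate on the subtree containing the best leaf, so the returned first move points toward it.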
We extend our study to partially observable environments, such as partially observable Markov decision processes (POMDPs), for which we propose the first regret analysis for the class of memoryless policies. We continue this study with a class of problems known as rich observation Markov decision processes (ROMDPs) and propose the first regret bound with no dependence on the ambient dimension in the dominant terms.
We empirically evaluate all of these theoretically grounded methods and demonstrate their value in practice.
|Committee:||Levorato, Marco, Anandkumar, Anima|
|School:||University of California, Irvine|
|Department:||Electrical and Computer Engineering - Ph.D.|
|School Location:||United States -- California|
|Source:||DAI-B 81/2(E), Dissertation Abstracts International|
|Subjects:||Computer science, Artificial intelligence|
|Keywords:||Decision making, Machine learning, Reinforcement learning|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved