We derive competitive tests and estimators for several properties of discrete distributions, based on their i.i.d. sequences. We focus on symmetric properties that depend only on the multiset of probability values in the distributions and not on specific symbols of the alphabet that assume these values. Many applications of probability estimation, statistics and machine learning involve such properties.
Our method of probability estimation, called profile maximum likelihood (PML), involves maximizing the likelihood of observing the profile of the given sequences, i.e., the multiset of symbol counts in the sequences. It has been used successfully for universal compression of large alphabet data sources, and has been shown empirically to perform well for other probability estimation problems like classification and distribution multiset estimation. We provide competitive estimation guarantees for the PML method for several such problems.
For testing closeness of distributions, i.e., finding whether two given i.i.d. sequences of length n are generated by the same distribution or by two different ones, our schemes have an error probability of at most [special characters omitted] whenever the best possible error probability is δ ≤ [special characters omitted]. The running time of our scheme is [special characters omitted](n). We do not make any assumptions on the distributions, including on their support size. In terms of sample complexity, this implies that if there is a closeness test which takes sequences of length n and has error probability at most δ, our tests have the same error guarantee on sequences of length n' = [special characters omitted]. Similar results are implied for the related problem of classification.
For finding the probability multiset of a discrete distribution using a length-n i.i.d. sequence drawn from it, we show the following guarantee for the PML-based estimator. For any class of distributions and any distance metric on their probability multisets, if there is an estimator that approximates all distributions in this class to within a distance of ε > 0 with error probability at most δ ≤ [special characters omitted], then the PML estimator is within a distance of 2ε with error probability at most δ · [special characters omitted]. Equivalently, the PML estimator approximates distributions to within a distance of 2ε with error probability delta using sequences of length n' = [special characters omitted]. Thus, this estimator is competitive with other estimators, including the one by Valiant et al. that approximates distributions of superlinear support size k = [special characters omitted](ε2.1n log(n)) to within a relative earthmover distance of ε and whose error probability can be shown to be at most [special characters omitted]. However, unlike the case of closeness testing, we do not yet have efficient schemes for computing the PML distribution.
We extend the results for PML for distribution multiset estimation to two related problems of estimating the parameter multiset of multiple distributions or processes. These include the problems of estimating the multiset of success probabilities of Bernoulli processes, and the multiset of means of Poisson distributions.
|Commitee:||Arias-Castro, Ery, Dasgupta, Sanjoy, Kim, Young-Han, Siegel, Paul|
|School:||University of California, San Diego|
|Department:||Electrical Engineering (Communication Theory and Systems)|
|School Location:||United States -- California|
|Source:||DAI-B 73/08(E), Dissertation Abstracts International|
|Subjects:||Applied Mathematics, Electrical engineering, Computer science|
|Keywords:||Classification, Discrete distributions, Maximum likelihoood, Probability estimation, Symmetric properties|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be