Whole-book recognition is a document image analysis strategy that operates on the complete set of a book’s page images, using automatic adaptation to improve accuracy. We describe an algorithm that expects to be initialized with approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally imperfect) dictionaries, and then, guided entirely by evidence internal to the test set, corrects the models, which in turn yields higher recognition accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier; the linguistic model describes word-occurrence probabilities. Our algorithm detects “disagreements” between these two models by measuring cross entropy between (1) the posterior probability distribution of character classes (the recognition results from image classification alone), and (2) the posterior probability distribution of word classes (the recognition results from image classification combined with linguistic constraints). We show how disagreements can identify candidates for model corrections at both the character and word levels. Some model corrections will reduce the error rate over the whole book, and these can be identified by comparing model disagreements, summed across the whole book, before and after the correction is applied. Experiments on passages up to 180 pages long show that when a candidate model adaptation reduces whole-book disagreement, it is also likely to correct recognition errors. Moreover, the longer the passage operated on by the algorithm, the more reliable this adaptation policy becomes, and the lower the error rate achieved. The best results occur when the iconic and linguistic models correct one another. We have observed recognition error rates driven down by nearly an order of magnitude, fully automatically and without supervision (or indeed any user intervention or interaction).
Improvement is nearly monotonic, and asymptotic accuracy is stable, even over long runs. If implemented naively, the algorithm runs in time quadratic in the length of the book; but random subsampling and caching techniques speed it up by two orders of magnitude with negligible loss of accuracy. Prior knowledge of word frequency information can be employed to further drive down the error rates. These results hold when tested on standard English corpora. We also propose a strategy to coordinate different models’ adaptations for joint improvements when more than two models are employed in the algorithm. Whole-book recognition has potential applications in digital libraries as a safe unsupervised anytime algorithm.
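The disagreement measure and the accept-if-disagreement-drops policy described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The names (`cross_entropy`, `whole_book_disagreement`, `subsampled_disagreement`, `should_accept`), the dictionary representation of posteriors, the direction of the cross entropy, and the simple uniform subsampling scheme are all assumptions made for illustration.

```python
import math
import random
from typing import Dict, Sequence

# Hypothetical representation: a posterior maps class labels to probabilities.
Dist = Dict[str, float]


def cross_entropy(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """H(p, q) = -sum_c p(c) * log q(c); eps guards against log(0).

    Assumption: p is the posterior from image classification alone and
    q is the posterior after linguistic constraints are applied.
    """
    return -sum(
        pc * math.log(max(q.get(c, 0.0), eps))
        for c, pc in p.items()
        if pc > 0.0
    )


def whole_book_disagreement(iconic: Sequence[Dist], combined: Sequence[Dist]) -> float:
    """Sum per-position disagreement over every position in the book."""
    return sum(cross_entropy(p, q) for p, q in zip(iconic, combined))


def subsampled_disagreement(iconic: Sequence[Dist], combined: Sequence[Dist],
                            k: int, seed: int = 0) -> float:
    """Estimate the whole-book sum from k uniformly sampled positions.

    Assumption: uniform sampling scaled by n / k; the dissertation's
    actual subsampling and caching scheme is not given in the abstract.
    """
    n = min(len(iconic), len(combined))
    k = min(k, n)
    if k == 0:
        return 0.0
    idx = random.Random(seed).sample(range(n), k)
    return (n / k) * sum(cross_entropy(iconic[i], combined[i]) for i in idx)


def should_accept(before: float, after: float) -> bool:
    """Keep a candidate model correction only if disagreement decreases."""
    return after < before


# Toy data: two character positions, before and after a candidate correction.
combined = [{"e": 0.9, "c": 0.1}, {"t": 0.8, "l": 0.2}]
iconic_before = [{"e": 0.6, "c": 0.4}, {"t": 0.8, "l": 0.2}]
iconic_after = [{"e": 0.9, "c": 0.1}, {"t": 0.8, "l": 0.2}]

d_before = whole_book_disagreement(iconic_before, combined)
d_after = whole_book_disagreement(iconic_after, combined)
accepted = should_accept(d_before, d_after)
```

In this toy case only the classifier-side posteriors change. In a real system the combined posteriors would also be recomputed after each candidate correction before the two sums are compared.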
|Advisor:||Baird, Henry S.|
|Committee:||Davison, Brian D., Korth, Henry F., Lee, Dar-Shyang, Lopresti, Daniel P.|
|School Location:||United States -- Pennsylvania|
|Source:||DAI-B 72/04, Dissertation Abstracts International|
|Keywords:||Anytime algorithm, Book recognition, Optical character recognition, Unsupervised learning|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved