Better Good-Turing Probability Estimation

Speaker: Dr. Pramod Viswanath

We consider hypothesis testing using training data in the large alphabet regime (motivated by classification of natural language data such as emails). The key feature of this scenario is that the length of the observed string is of the same order as the size of the underlying alphabet (the words of the natural language). The traditional maximum-likelihood (ML) estimate is far off the mark, since many elements are unseen in the training data. Universal methods are also inapplicable, since the data is not universally compressible (the redundancy of compression in this regime is nonzero).
The current heuristic practice is to use the Good-Turing formula (developed during World War II) to estimate the underlying distribution of the training data and then perform a likelihood test. We introduce a natural scaling formulation to study this problem and use it to show that this Good-Turing based test is not statistically consistent. This exercise allows us to prove a consistency result for the classical Good-Turing formula on a different problem (the first such result in the literature). More importantly, we introduce a novel probability estimator that yields strongly consistent hypothesis testing in the large alphabet regime.
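For readers unfamiliar with the formula mentioned above, the following is a minimal sketch of the classical Good-Turing estimator (not the novel estimator introduced in the talk): a symbol seen r times is assigned the adjusted count r* = (r+1) N_{r+1} / N_r, where N_r is the number of distinct symbols seen exactly r times, and the total mass N_1 / n is reserved for unseen symbols. The fallback to the raw count when N_{r+1} = 0 is a simplification for illustration; practical versions smooth the N_r sequence instead.

```python
from collections import Counter

def good_turing(sample):
    """Classical Good-Turing estimates from a list of observed symbols.

    Returns (p_unseen, probs): the total probability mass assigned to
    symbols never seen in the sample, and a dict mapping each seen
    symbol to its discounted probability.
    """
    n = len(sample)
    counts = Counter(sample)                 # r(x): how often each symbol appears
    freq_of_freq = Counter(counts.values())  # N_r: number of symbols seen exactly r times

    # Mass reserved for unseen symbols: N_1 / n
    p_unseen = freq_of_freq.get(1, 0) / n

    probs = {}
    for x, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        if n_r1 > 0:
            r_star = (r + 1) * n_r1 / n_r    # adjusted count r* = (r+1) N_{r+1} / N_r
        else:
            r_star = r                       # crude fallback when N_{r+1} = 0
    # (real implementations smooth N_r, e.g. Gale & Sampson's simple Good-Turing)
        probs[x] = r_star / n
    return p_unseen, probs
```

For example, on the sample ["a", "a", "b", "c"], the singletons "b" and "c" give p_unseen = N_1/n = 2/4 = 0.5, and each singleton's count is discounted from 1 to r* = 2 · N_2/N_1 = 1, i.e. probability 0.25.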


Pramod Viswanath received the PhD degree in EECS from the University of California at Berkeley in 2000. He was a member of technical staff at Flarion Technologies until August 2001, before joining the ECE department at the University of Illinois at Urbana-Champaign. He is a recipient of the Eliahu Jury Award from the EECS department of UC Berkeley (2000), the Bernard Friedman Award from the Mathematics department of UC Berkeley (2000), and the NSF CAREER Award (2003). He is an associate editor of the IEEE Transactions on Information Theory for the period 2006-2008.

Presented On: April 12th, 2007