Very Large Corpora and Zipf's Law

Data Skeptic

Episode | Podcast

Date: Fri, 18 Jan 2019 16:00:00 +0000

<p>The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming could cut down the number of unique tokens, researchers always had to face a high-dimensional problem. The Naive Bayes algorithm was celebrated in NLP applications because of its ability to process high-dimensional data efficiently.</p> <p>Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses on different NLP problems, an early paper titled <a href="">Scaling to Very Very Large Corpora for Natural Language Disambiguation</a> popularized one somewhat surprising idea. For many NLP tasks, simply providing a larger corpus of examples improved accuracy, and some algorithms continued to yield more improvement than others as the corpus grew very, very large.</p> <p>Although not explicitly about NLP, the noteworthy paper <a href="">The Unreasonable Effectiveness of Data</a> emphasizes this point further while paying homage to the classic treatise <a href="">The Unreasonable Effectiveness of Mathematics in the Natural Sciences</a>.</p> <p>In this episode, Kyle shares a few thoughts along these lines with Linh Da.</p> <p>The discussion winds up with a brief introduction to Zipf's law. When applied to natural language, Zipf's law states that the frequency of any given word in a corpus (regardless of language) will be approximately inversely proportional to its rank in the frequency table. For example, the second most common word appears roughly half as often as the most common one.</p>
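<p>As a rough illustration, Zipf's law is usually stated as word frequency being approximately inversely proportional to rank, so rank multiplied by frequency stays roughly constant. The sketch below is not from the episode: the function names <code>rank_frequency</code> and <code>zipf_expected</code>, the exponent <code>s = 1</code>, and the toy corpus are all assumptions made for illustration. The corpus is constructed to fit the idealized law exactly; real text only follows it approximately.</p>

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_expected(top_count, rank, s=1.0):
    """Frequency predicted by the idealized law f(r) = f(1) / r**s.

    The exponent s = 1 is an assumption; fitted values vary by corpus.
    """
    return top_count / rank ** s


# Hypothetical toy corpus, built so the word at rank r appears 120 // r times.
vocab = ["the", "of", "and", "to", "in", "is"]
tokens = [word
          for rank, word in enumerate(vocab, start=1)
          for _ in range(120 // rank)]

table = rank_frequency(tokens)
top_count = table[0][2]

for rank, word, count in table:
    # Under the idealized law, rank * count is the same for every word.
    print(rank, word, count, rank * count, zipf_expected(top_count, rank))
```

<p>To try this on real text, replace <code>tokens</code> with the lowercased words of any large document. The product of rank and count will drift rather than stay fixed, but for the most frequent words it tends to remain within the same order of magnitude.</p>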