Predictive Models on Random Data

Data Skeptic

Episode | Podcast

Date: Fri, 22 Jul 2016 15:00:00 +0000

<p class="p1"><span class="s1">This week is an insightful discussion with <a href="https://twitter.com/claudia_perlich"><span class="s2">Claudia Perlich</span></a> about some situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear to be highly predictive despite being trained on random data. Our discussion covers some novel observations about ROC and AUC, as well as an informative discussion of leakage.</span></p> <p class="p2"><span class="s3">Much of our discussion is inspired by two excellent papers Claudia authored: <a href="http://dstillery.com/wp-content/uploads/2014/05/Leakage-in-Data-Mining-Formulation-Detection-and-Avoidance.pdf"> <span class="s4">Leakage in Data Mining: Formulation, Detection, and Avoidance</span></a> and <a href="http://www.kdd.org/exploration_files/v12-02-4-UR-Perlich.pdf"><span class="s4"> On Cross Validation and Stacking: Building Seemingly Predictive Models on Random Data</span></a>. Both are highly recommended reading!</span></p>