Podevcast

Building the HowTo100M Video Corpus

Data Skeptic


Date: Mon, 19 Aug 2019 20:12:43 +0000

Video annotation is an expensive and time-consuming process. As a consequence, the available video datasets are useful but small. The availability of machine-transcribed explainer videos offers a unique opportunity to rapidly develop a useful, if dirty, corpus of videos that are "self-annotating", as hosts explain the actions they are taking on screen. This episode is a discussion of the <a href="https://www.di.ens.fr/willow/research/howto100m/">HowTo100M</a> dataset, a project that has assembled a video corpus of 136M captioned video clips covering 23k activities.

<h3>Related Links</h3>

The paper will be presented at <a href="http://iccv2019.thecvf.com/">ICCV 2019</a>.

<a href="https://twitter.com/antoine77340">@antoine77340</a>
<a href="https://github.com/antoine77340">Antoine on GitHub</a>
<a href="https://www.di.ens.fr/~miech/">Antoine's homepage</a>