Bonobo: Lightweight ETL Toolkit for Python 3 with Romain Dorgueil

The Python Podcast.__init__

Date: Sat, 06 Jan 2018 22:00:00 -0500

<h3>Summary</h3> <p>A majority of the work that we do as programmers involves data manipulation in some manner. This can range from large scale collection, aggregation, and statistical analysis across distributed systems to something as simple as making a graph in a spreadsheet. In the middle of that range is the general task of ETL (Extract, Transform, and Load) which has its own range of scale. In this episode Romain Dorgueil discusses his experiences building ETL systems and the problems that he routinely encountered that led him to create Bonobo, a lightweight, easy to use toolkit for data processing in Python 3. He also explains how the system works under the hood, how you can use it for your projects, and what he has planned for the future.</p> <h3>Preface</h3> <ul> <li>Hello and welcome to Podcast.&#95;&#95;init&#95;&#95;, the podcast about Python and the people who make it great.</li> <li>I would like to thank everyone who supports us on <a href="https://www.pythonpodcast.com/podcastinit?utm_source=rss&amp;utm_medium=rss">Patreon</a>. Your contributions help to make the show sustainable.</li> <li>When you&#8217;re ready to launch your next project you&#8217;ll need somewhere to deploy it. Check out Linode at <a href="https://www.pythonpodcast.com/linode?utm_source=rss&amp;utm_medium=rss">podcastinit.com/linode</a> and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 GBit network in all of their datacenters.</li> <li>If you&#8217;re tired of cobbling together your deployment pipeline then it&#8217;s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location.
To download it now go to <a href="https://www.pythonpodcast.com/gocd?utm_source=rss&amp;utm_medium=rss">podcastinit.com/gocd</a>. Professional support and enterprise plugins are available for added peace of mind.</li> <li>Visit the <a href="https://www.pythonpodcast.com?utm_source=rss&amp;utm_medium=rss">site</a> to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at <a href="https://twitter.com/podcastinit?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">@Podcast&#95;&#95;init&#95;&#95;</a> or email <a href="mailto:hosts@podcastinit.com">hosts@podcastinit.com</a>.</li> <li>To help other people find the show please leave a review on <a href="https://itunes.apple.com/us/podcast/podcast.-init/id981834425?mt=2&amp;uo=6&amp;at=&amp;ct=&amp;utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">iTunes</a> or <a href="https://play.google.com/music/m/I7ogju4xv6adasgqz6545jndgsy?t=Podcastinit_-_Python_and_the_people_who_make_it_great&amp;utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Google Play Music</a>, tell your friends and co-workers, and share it on social media.</li> <li>Your host as usual is Tobias Macey and today I&#8217;m interviewing Romain Dorgueil about Bonobo, a data processing toolkit for modern Python.</li> </ul> <h3>Interview</h3> <ul> <li>Introductions</li> <li>How did you get introduced to Python?</li> <li>What is Bonobo and what was your motivation for creating it? 
<ul> <li>What is the story behind the name?</li> </ul> </li> <li>How does Bonobo differ from projects such as Luigi or Airflow?<br /> [RD] After I explain why those are totally different things, a good follow-up might be to ask about differences from other data streaming solutions, like Apache Beam or Spark.</li> <li>How is Bonobo implemented and how has its architecture evolved since you began working on it?</li> <li>What have been some of the most challenging aspects of building and maintaining Bonobo?</li> <li>What are some extensions that you would like to have but don&#8217;t have the time to implement?</li> <li>What are some of the most interesting or creative uses of Bonobo that you are aware of?</li> <li>What do you have planned for the future of Bonobo?</li> </ul> <h3>Keep In Touch</h3> <ul> <li>Bonobo Project <ul> <li><a href="https://www.bonobo-project.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Bonobo ETL</a></li> <li><a href="https://bonobo-slack.herokuapp.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Slack</a></li> <li><a href="https://github.com/python-bonobo/bonobo?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">GitHub</a></li> </ul> </li> <li>Romain <ul> <li><a href="https://romain.dorgueil.net/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Website</a></li> <li><a href="https://twitter.com/rdorgueil?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">@rdorgueil</a> on Twitter</li> <li><a href="https://github.com/hartym?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">hartym</a> on GitHub</li> </ul> </li> </ul> <h3>Picks</h3> <ul> <li>Tobias <ul> <li><a href="https://dataskeptic.com/blog/episodes/2017/quantum-computing?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Skeptic: Quantum Computing</a></li> </ul> </li> <li>Romain <ul> <li><a href="http://medikit.rdc.li/?utm_source=rss&amp;utm_medium=rss" 
rel="noopener" target="_blank">Medikit</a>, or how to manage hundreds of projects at the same time while still being able to sleep at night.</li> <li><a href="https://github.com/grammarly/rocker?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Rocker</a>, a better builder for Docker images.</li> </ul> </li> </ul> <h3>Links</h3> <ul> <li><a href="https://www.bonobo-project.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Bonobo</a></li> <li><a href="https://www.redhat.com/en?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">RedHat</a></li> <li><a href="https://en.wikipedia.org/wiki/Anaconda_(installer)?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Anaconda Installer</a></li> <li><a href="https://en.wikipedia.org/wiki/Extract,_transform,_load?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">ETL</a></li> <li><a href="http://www.pentaho.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Pentaho</a></li> <li><a href="http://etl.rdc.li/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">RDC.ETL</a></li> <li><a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">DAG (Directed Acyclic Graph)</a></li> <li><a href="http://luigi.readthedocs.io/en/stable/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Luigi</a></li> <li><a href="http://airflow.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Airflow</a></li> <li><a href="https://docs.python.org/3.6/library/collections.html#collections.namedtuple?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">NamedTuple</a></li> <li><a href="http://jupyter.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Jupyter</a></li> <li><a href="https://oauth.net/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">OAuth</a></li> <li><a 
href="https://graphviz.gitlab.io/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Graphviz</a></li> <li><a href="https://dask.pydata.org/en/latest/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Dask</a></li> <li><a href="https://www.dataengineeringpodcast.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast</a></li> <li><a href="https://www.dataengineeringpodcast.com/episode-2-dask-with-matthew-rocklin/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Dask Interview</a></li> <li><a href="http://www.seleniumhq.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Selenium</a></li> <li><a href="https://zapier.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Zapier</a></li> <li><a href="https://ifttt.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">IFTTT (If This Then That)</a></li> <li><a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">FPGA</a></li> </ul> <p>The intro and outro music is from Requiem for a Fish <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">CC BY-SA</a></p>