Podevcast

Go From Notebook To Pipeline For Your Data Science Projects With Orchest

The Python Podcast.__init__

Date: Mon, 01 Mar 2021 21:00:00 -0500

<div class="wp-block-jetpack-markdown"><h3>Summary</h3> <p>Jupyter notebooks are a dominant tool for data scientists, but they lack a number of conveniences for building reusable and maintainable systems. For machine learning projects in particular, there is a need to pivot from exploring a particular dataset or problem to integrating that solution into a larger workflow. Rick Lamers and Yannick Perrenet were tired of struggling with one-off solutions when they created the Orchest platform. In this episode, they explain how Orchest allows you to turn your notebooks into executable components that are integrated into a graph of execution for running end-to-end machine learning workflows.</p> <h3>Announcements</h3> <ul> <li>Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.</li> <li>When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it’s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to <a href="https://www.pythonpodcast.com/linode?utm_source=rss&utm_medium=rss">pythonpodcast.com/linode</a> and get a $100 credit to try out a Kubernetes cluster of your own. 
And don’t forget to thank them for their continued support of this show!</li> <li>Your host as usual is Tobias Macey and today I’m interviewing Rick Lamers and Yannick Perrenet about Orchest, a development environment designed for building data science pipelines from notebooks and scripts.</li> </ul> <h3>Interview</h3> <ul> <li>Introductions</li> <li>How did you get introduced to Python?</li> <li>Can you start by giving an overview of what Orchest is and the story behind it?</li> <li>Who are the users that you are building Orchest for and what are their biggest challenges? <ul> <li>What are some examples of the types of tools or workflows that they are using now?</li> </ul> </li> <li>What are some of the other tools or strategies in the data science ecosystem that Orchest might replace? (e.g. MLFlow, Metaflow, etc.)</li> <li>What problems does Orchest solve?</li> <li>Can you describe how Orchest is implemented? <ul> <li>How have the design and goals of the project changed since you first started working on it?</li> </ul> </li> <li>What is the workflow for someone who is using Orchest?</li> <li>What are some of the sharp edges that they might run into?</li> <li>What is the deployable unit once a pipeline has been created? <ul> <li>How do you handle verification and promotion of pipelines across staging and production environments?</li> </ul> </li> <li>What are the interfaces available for integrating with or extending Orchest? 
<ul> <li>How might an organization incorporate a pipeline defined in Orchest with the rest of their data orchestration workflows?</li> </ul> </li> <li>How are you approaching governance and sustainability of the Orchest project?</li> <li>What are the most interesting, innovative, or unexpected ways that you have seen Orchest used?</li> <li>What are the most interesting, unexpected, or challenging lessons that you have learned while building Orchest?</li> <li>When is Orchest the wrong choice?</li> <li>What do you have planned for the future of the project and company?</li> </ul> <h3>Keep In Touch</h3> <ul> <li>Rick <ul> <li><a href="https://github.com/ricklamers?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">ricklamers</a> on GitHub</li> <li><a href="https://www.linkedin.com/in/lamersrick/?originalSubdomain=nl&utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">LinkedIn</a></li> <li><a href="https://twitter.com/RickLamers?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">@RickLamers</a> on Twitter</li> </ul> </li> <li>Yannick <ul> <li><a href="https://github.com/yannickperrenet?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">yannickperrenet</a> on GitHub</li> <li><a href="https://www.linkedin.com/in/yannickperrenet/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">LinkedIn</a></li> </ul> </li> </ul> <h3>Picks</h3> <ul> <li>Tobias <ul> <li><a href="https://www.google.com/search?q=fresh+bagels+near+me&utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Fresh Bagels</a></li> </ul> </li> <li>Rick <ul> <li><a href="https://github.com/vaexio/vaex?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Vaex</a></li> </ul> </li> <li>Yannick <ul> <li><a href="https://cookiecutter.readthedocs.io/en/latest/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Cookiecutter</a></li> <li><a href="https://github.com/pyenv/pyenv?utm_source=rss&utm_medium=rss" rel="noopener" 
target="_blank">Pyenv</a></li> </ul> </li> </ul> <h3>Links</h3> <ul> <li><a href="https://www.orchest.io/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Orchest</a></li> <li><a href="https://www.cs.toronto.edu/~hinton/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Geoffrey Hinton</a></li> <li><a href="http://yann.lecun.com/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Yann LeCun</a></li> <li><a href="https://coffeescript.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">CoffeeScript</a></li> <li><a href="https://www.vim.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Vim</a></li> <li><a href="https://en.wikipedia.org/wiki/Generative_adversarial_network?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">GAN == Generative Adversarial Network</a></li> <li><a href="http://git-scm.com/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Git</a></li> <li><a href="https://en.wikipedia.org/wiki/SQL?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">SQL</a></li> <li><a href="https://cloud.google.com/bigquery?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">BigQuery</a></li> <li><a href="https://software-carpentry.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Software Carpentry</a> <ul> <li><a href="https://www.pythonpodcast.com/episode-33-maneesha-sane-on-software-and-data-carpentry/?utm_source=rss&utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a href="https://colab.research.google.com/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Google Colab</a></li> <li><a href="https://airflow.apache.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Airflow</a> <ul> <li><a href="https://www.pythonpodcast.com/episode-44-airflow-with-maxime-beauchemin/?utm_source=rss&utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a 
href="https://kedro.readthedocs.io/en/stable/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Kedro</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/kedro-data-pipeline-episode-100/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://github.com/fastai/nbdev/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">nbdev</a> <ul> <li><a href="https://www.pythonpodcast.com/nbdev-literate-programming-episode-300/?utm_source=rss&utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a href="https://papermill.readthedocs.io/en/latest/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Papermill</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/using-notebooks-as-the-unifying-layer-for-data-roles-at-netflix-with-matthew-seal-episode-54/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://mlflow.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">MLFlow</a></li> <li><a href="https://metaflow.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Metaflow</a> <ul> <li><a href="https://www.pythonpodcast.com/metaflow-machine-learning-operations-episode-274/?utm_source=rss&utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a href="https://dvc.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">DVC</a> <ul> <li><a href="https://www.pythonpodcast.com/data-version-control-episode-206/?utm_source=rss&utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a href="https://www.andrewng.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Andrew Ng</a></li> <li><a href="https://www.kubeflow.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Kubeflow</a></li> <li><a href="http://www.lua.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Lua</a></li> <li><a 
href="https://caddyserver.com/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Caddy</a></li> <li><a href="https://traefik.io/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Traefik</a></li> <li><a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">DAG == Directed Acyclic Graph</a></li> <li><a href="https://jupyter-enterprise-gateway.readthedocs.io/en/latest/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Jupyter Enterprise Gateway</a></li> <li><a href="https://www.streamlit.io/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Streamlit</a></li> <li><a href="https://kubernetes.io/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Kubernetes</a></li> <li><a href="https://dagster.io/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Dagster</a> <ul> <li><a href="https://www.pythonpodcast.com/dagster-data-orchestration-episode-279/?utm_source=rss&utm_medium=rss">Podcast.__init__ Episode</a></li> <li><a href="https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://www.getdbt.com/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">DBT</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://gitlab.com/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">GitLab</a></li> <li><a href="https://spark.apache.org/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">Spark</a></li> <li><a href="https://en.wikipedia.org/wiki/Extract,_transform,_load?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">ETL</a></li> </ul> <p>The intro and outro music is from Requiem for 
a Fish <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/?utm_source=rss&utm_medium=rss" rel="noopener" target="_blank">CC BY-SA</a></p> </div>
