Combining Python And SQL To Build A PyData Warehouse

The Python Podcast.__init__

Episode | Podcast

Date: Mon, 02 Sep 2019 12:00:00 -0400

<div class="wp-block-jetpack-markdown"><h3>Summary</h3> <p>The ecosystem of tools and libraries in Python for data manipulation and analytics is truly impressive, and continues to grow. There are, however, gaps in their utility that can be filled by the capabilities of a data warehouse. In this episode Robert Hodges discusses how the PyData suite of tools can be paired with a data warehouse for an analytics pipeline that is more robust than either can provide on their own. This is a great introduction to what differentiates a data warehouse from a relational database and ways that you can think differently about running your analytical workloads for larger volumes of data.</p> <h3>Announcements</h3> <ul> <li>Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.</li> <li>When you&#8217;re ready to launch your next app or want to try a project you hear about on the show, you&#8217;ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you&#8217;ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to <a href="https://www.pythonpodcast.com/linode?utm_source=rss&amp;utm_medium=rss">pythonpodcast.com/linode</a> to get a $20 credit and launch a new server in under a minute. And don&#8217;t forget to thank them for their continued support of this show!</li> <li>Taking a look at recent trends in the data science and analytics landscape, it’s becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL Intersection in Analytics at <a href="https://www.pythonpodcast.com/mode?utm_source=rss&amp;utm_medium=rss">mode.com/init</a>. Specifically, we&#8217;re going to be focusing on their similarities, rather than their differences.</li> <li>You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don&#8217;t want to miss out on this year&#8217;s conference season. We have partnered with organizations such as O&#8217;Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O&#8217;Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to <a href="https://www.pythonpodcast.com/conferences?utm_source=rss&amp;utm_medium=rss">pythonpodcast.com/conferences</a> to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.</li> <li>Your host as usual is Tobias Macey and today I&#8217;m interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses</li> </ul> <h3>Interview</h3> <ul> <li>Introductions</li> <li>How did you get introduced to Python?</li> <li>To start with, can you give a quick overview of what a data warehouse is and how it differs from a &quot;regular&quot; database for anyone who isn&#8217;t familiar with them? <ul> <li>What are the cases where a data warehouse would be preferable and when are they the wrong choice?</li> </ul> </li> <li>What capabilities does a data warehouse add to the PyData ecosystem?</li> <li>For someone who doesn&#8217;t yet have a warehouse, what are some of the differentiating factors among the systems that are available?</li> <li>Once you have a data warehouse deployed, how does it get populated and how does Python fit into that workflow?</li> <li>For an analyst or data scientist, how might they interact with the data warehouse and what tools would they use to do so?</li> <li>What are some potential bottlenecks when dealing with the volumes of data that can be contained in a warehouse within Python? <ul> <li>What are some ways that you have found to scale beyond those bottlenecks?</li> </ul> </li> <li>How does the data warehouse fit into the workflow for a machine learning or artificial intelligence project?</li> <li>What are some of the limitations of data warehouses in the context of the Python ecosystem?</li> <li>What are some of the trends that you see going forward for the integration of the PyData stack with data warehouses? <ul> <li>What are some challenges that you anticipate the industry running into in the process?</li> </ul> </li> <li>What are some useful references that you would recommend for anyone who wants to dig deeper into this topic?</li> </ul> <h3>Keep In Touch</h3> <ul> <li><a href="https://www.linkedin.com/in/berkeleybob2105/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">LinkedIn</a></li> <li><a href="https://github.com/hodgesrm?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">hodgesrm</a> on GitHub</li> </ul> <h3>Picks</h3> <ul> <li>Tobias <ul> <li><a href="http://shop.oreilly.com/product/0636920161417.do?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Foundations Of Architecting Data Solutions: Managing Successful Data Projects</a> by Ted Malaska &amp; Jonathan Seidman</li> </ul> </li> <li>Robert <ul> <li>Reading old academic papers such as <a href="http://db.csail.mit.edu/projects/cstore/vldb.pdf?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">CStore</a></li> <li><a href="https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning-second-edition?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Python Machine Learning</a> by Sebastian Raschka</li> </ul> </li> </ul> <h3>Closing Announcements</h3> <ul> <li>Thank you for listening! Don&#8217;t forget to check out our other show, the <a href="https://www.dataengineeringpodcast.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast</a> for the latest on modern data management.</li> <li>Visit the <a href="https://www.pythonpodcast.com?utm_source=rss&amp;utm_medium=rss">site</a> to subscribe to the show, sign up for the mailing list, and read the show notes.</li> <li>If you&#8217;ve learned something or tried out a project from the show then tell us about it! Email <a href="mailto:hosts@podcastinit.com">hosts@podcastinit.com</a>) with your story.</li> <li>To help other people find the show please leave a review on <a href="https://itunes.apple.com/us/podcast/podcast.-init/id981834425?mt=2&amp;uo=6&amp;at=&amp;ct=&amp;utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">iTunes</a> and tell your friends and co-workers</li> <li>Join the community in the new Zulip chat workspace at <a href="https://www.pythonpodcast.com/chat?utm_source=rss&amp;utm_medium=rss">pythonpodcast.com/chat</a></li> </ul> <h3>Links</h3> <ul> <li><a href="https://www.altinity.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Altinity</a></li> <li><a href="https://clickhouse.yandex?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Clickhouse</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Interview</a></li> </ul> </li> <li><a href="https://mysql.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">MySQL</a></li> <li><a href="https://en.wikipedia.org/wiki/Data_warehouse?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Warehouse</a></li> <li><a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Column Oriented Database</a></li> <li><a href="https://en.wikipedia.org/wiki/SIMD?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">SIMD == Single Instruction Multiple Data</a></li> <li><a href="https://www.postgresql.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">PostgreSQL</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/postgresql-with-jonathan-katz-episode-42/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://www.microsoft.com/en-us/sql-server/default.aspx?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Microsoft SQL Server</a></li> <li><a href="https://pandas.pydata.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Pandas</a></li> <li><a href="https://numpy.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">NumPy</a></li> <li><a href="https://www.tensorflow.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Tensorflow</a></li> <li><a href="https://jupyter.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Jupyter</a></li> <li><a href="https://towardsdatascience.com/sampling-techniques-a4e34111d808?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Sampling</a></li> <li><a href="https://dask.readthedocs.io/en/latest/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Dask</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/episode-2-dask-with-matthew-rocklin/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast</a></li> </ul> </li> <li><a href="https://ray.readthedocs.io/en/latest/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Ray</a></li> <li><a href="https://en.wikipedia.org/wiki/MapReduce?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Map/Reduce</a></li> <li><a href="https://www.vertica.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Vertica</a></li> <li><a href="https://en.wikipedia.org/wiki/Shard_(database_architecture)?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Sharding</a></li> <li><a href="http://hadoop.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Hadoop</a></li> <li><a href="https://www.snowflake.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">SnowflakeDB</a></li> <li><a href="https://delta.io?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Delta Lake</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://cloud.google.com/bigquery/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">BigQuery</a></li> <li><a href="https://aws.amazon.com/redshift/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">RedShift</a></li> <li><a href="https://docs.snowflake.net/manuals/user-guide/data-sharing-intro.html?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Snowflake Data Sharing</a></li> <li><a href="https://en.wikipedia.org/wiki/Oracle_Database?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">OracleDB</a></li> <li><a href="https://kubernetes.io?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Kubernetes</a></li> <li><a href="https://www.getdbt.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">DBT</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://en.wikipedia.org/wiki/Comma-separated_values?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">CSV</a></li> <li><a href="http://parquet.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Parquet</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/data-serialization-with-doug-cutting-and-julien-le-dem-episode-8/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="http://kafka.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Kafka</a></li> <li><a href="https://www.ucdavis.edu?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">UC Davis</a></li> <li><a href="https://en.wikipedia.org/wiki/Web_scraping?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Web Scraping</a></li> <li><a href="https://clickhouse-driver.readthedocs.io/en/latest/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Clickhouse Python Driver</a></li> <li><a href="https://www.sqlalchemy.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">SQLAlchemy</a> <ul> <li><a href="https://www.altinity.com/blog/2019/2/25/clickhouse-and-python-jupyter-notebooks?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Altinity Blog Post</a></li> </ul> </li> <li><a href="https://en.wikipedia.org/wiki/Materialized_view?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Materialized View</a></li> <li><a href="https://pytorch.org?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">PyTorch</a> <ul> <li><a href="https://www.pythonpodcast.com/pytorch-deep-learning-epsiode-202/?utm_source=rss&amp;utm_medium=rss">Podcast Interview</a></li> </ul> </li> <li><a href="https://scikit-learn.org/stable/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">scikit-learn</a></li> <li><a href="http://spark.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Spark</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/putting-apache-spark-into-action-with-jean-georges-perrin-episode-60/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Interview</a></li> </ul> </li> <li><a href="https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">BigQuery ML</a></li> <li><a href="http://arrow.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Apache Arrow</a></li> <li><a href="https://wesmckinney.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Wes McKinney</a> <ul> <li><a href="https://www.pythonpodcast.com/wes-mckinney-python-for-data-analysis-episode-203/?utm_source=rss&amp;utm_medium=rss">Podcast Interview</a></li> </ul> </li> <li><a href="https://en.wikipedia.org/wiki/User-defined_function?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">User Defined Function</a></li> <li><a href="https://en.wikipedia.org/wiki/Kdb%2B?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">KDB</a></li> <li><a href="http://db.csail.mit.edu/projects/cstore/vldb.pdf?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">CStore Paper</a> by Dr. Michael Stonebraker, et al</li> <li><a href="https://www.kinetica.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Kinetica</a></li> <li><a href="https://www.omnisci.com?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">MapD/OmniSci</a></li> </ul> <p>The intro and outro music is from Requiem for a Fish <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">CC BY-SA</a></p> </div> <img alt="" height="0" src="https://analytics.boundlessnotions.com/piwik.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fwww.pythonpodcast.com%2Fpydata-warehouse-episode-227%2F&amp;action_name=Combining+Python+And+SQL+To+Build+A+PyData+Warehouse+-+Episode+227&amp;urlref=https%3A%2F%2Fwww.pythonpodcast.com%2Ffeed%2F&amp;utm_source=rss&amp;utm_medium=rss" style="border: 0; width: 0; height: 0;" width="0" />