Stream Processing In Real Time And At Scale In Pure Python With Bytewax

The Python Podcast.__init__

Episode | Podcast

Date: Sun, 10 Jul 2022 19:00:00 -0400

<div class="wp-block-jetpack-markdown"><h2>Summary</h2> <p>Analysis of streaming data in real time has long been the domain of big data frameworks, predominantly written in Java. Taking advantage of those capabilities from Python requires using client libraries that suffer from impedance mismatches, which make the work harder than necessary. Bytewax is a new open source platform for writing stream processing applications in pure Python that don&#8217;t have to be translated into foreign idioms. In this episode, Bytewax founder Zander Matheson explains how the system works and how to get started with it today.</p> <h2>Announcements</h2> <ul> <li>Hello and welcome to Podcast.__init__, the podcast about Python&#8217;s role in data and science.</li> <li>When you&#8217;re ready to launch your next app or want to try a project you hear about on the show, you&#8217;ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it&#8217;s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to <a href="https://www.pythonpodcast.com/linode?utm_source=rss&amp;utm_medium=rss">pythonpodcast.com/linode</a> and get a $100 credit to try out a Kubernetes cluster of your own. And don&#8217;t forget to thank them for their continued support of this show!</li> <li>The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. 
Select Star&#8217;s data discovery platform solves that out of the box, with a fully automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you&#8217;re using and Select Star will set everything up in just a few hours. Go to <a href="https://www.pythonpodcast.com/selectstar?utm_source=rss&amp;utm_medium=rss">pythonpodcast.com/selectstar</a> today to double the length of your free trial and get a swag package when you convert to a paid plan.</li> <li>Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share Python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error-handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open-source, low-code templates to help you quickly build solutions with the tools you already use. 
Go to <a href="https://www.dataengineeringpodcast.com/shipyard?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">dataengineeringpodcast.com/shipyard</a> to get started automating with a free developer plan today!</li> <li>Your host as usual is Tobias Macey and today I&#8217;m interviewing Zander Matheson about Bytewax, an open source Python framework for building highly scalable dataflows to process ANY data stream.</li> </ul> <h2>Interview</h2> <ul> <li>Introductions</li> <li>How did you get introduced to Python?</li> <li>Can you describe what Bytewax is and the story behind it?</li> <li>Who are the target users for Bytewax?</li> <li>What is the problem that you are trying to solve with Bytewax?</li> <li>What are the alternative systems/architectures that you might replace with Bytewax?</li> <li>Can you describe how Bytewax is implemented? <ul> <li>What are the benefits of Timely Dataflow as a core building block for a system like Bytewax?</li> <li>How have the design and goals of the project changed/evolved since you first started working on it?</li> </ul> </li> <li>What are the axes available for scaling Bytewax execution?</li> <li>How have you approached the design of the Bytewax API to make it accessible to a broader audience?</li> <li>Can you describe what is involved in building a project with Bytewax? <ul> <li>What are some of the stream processing concepts that engineers are likely to run up against as they are experimenting and designing their code?</li> </ul> </li> <li>What is your motivation for providing the core technology of your business as an open source engine? 
<ul> <li>How are you approaching the balance of project governance and sustainability with opportunities for commercialization?</li> </ul> </li> <li>What are the most interesting, innovative, or unexpected ways that you have seen Bytewax used?</li> <li>What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bytewax?</li> <li>When is Bytewax the wrong choice?</li> <li>What do you have planned for the future of Bytewax?</li> </ul> <h2>Keep In Touch</h2> <ul> <li><a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-vkos2f6r-_SeT9pF2~n9ArOaeI3ND2w?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Slack</a></li> <li><a href="https://twitter.com/MathesonZander?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Twitter</a></li> <li><a href="https://www.linkedin.com/in/alexandermatheson/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">LinkedIn</a></li> </ul> <h2>Picks</h2> <ul> <li>Tobias <ul> <li><a href="https://www.altaracks.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Alta Racks</a></li> </ul> </li> <li>Zander <ul> <li><a href="https://www.athertonbikes.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Atherton Bikes</a></li> </ul> </li> </ul> <h2>Links</h2> <ul> <li><a href="https://www.bytewax.io/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Bytewax</a> <ul> <li><a href="https://github.com/bytewax/bytewax?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">GitHub</a></li> </ul> </li> <li><a href="https://flink.apache.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Flink</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a 
href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Spark Streaming</a></li> <li><a href="https://docs.confluent.io/platform/current/connect/index.html?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Kafka Connect</a></li> <li><a href="https://faust.readthedocs.io/en/latest/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Faust</a> <ul> <li><a href="https://www.pythonpodcast.com/fast-stream-processing-in-python-using-faust-with-ask-solem-episode-176/?utm_source=rss&amp;utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a href="https://ray.readthedocs.io/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Ray</a> <ul> <li><a href="https://www.pythonpodcast.com/ray-distributed-computing-episode-258/?utm_source=rss&amp;utm_medium=rss">Podcast Episode</a></li> </ul> </li> <li><a href="https://dask.org/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Dask</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/episode-2-dask-with-matthew-rocklin/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://github.com/TimelyDataflow/timely-dataflow?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Timely Dataflow</a></li> <li><a href="https://github.com/PyO3/pyo3?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">PyO3</a></li> <li><a href="https://materialize.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Materialize</a> <ul> <li><a href="https://www.dataengineeringpodcast.com/materialize-streaming-analytics-episode-112/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Data Engineering Podcast Episode</a></li> </ul> </li> <li><a href="https://en.wikipedia.org/wiki/HyperLogLog?utm_source=rss&amp;utm_medium=rss" rel="noopener" 
target="_blank">HyperLogLog</a></li> <li><a href="https://riverml.xyz/0.11.1/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Python River Library</a></li> <li><a href="https://www.omnicalculator.com/statistics/shannon-entropy#how-to-calculate-entropy-entropy-formula?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Shannon Entropy Calculation</a></li> <li><a href="https://www.bytewax.io/blog/cyberthreats/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">The blog post using incremental shannon entropy</a></li> <li><a href="https://nats.io/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">NATS</a></li> <li><a href="https://github.com/bytewax/bytewax/blob/main/docs/articles/deployment/waxctl.md?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">waxctl</a></li> <li><a href="https://prometheus.io/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Prometheus</a></li> <li><a href="https://grafana.com/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Grafana</a></li> <li><a href="https://github.com/python-streamz/streamz?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">Streamz</a></li> </ul> <p>The intro and outro music is from Requiem for a Fish <a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">The Freak Fandango Orchestra</a> / <a href="http://creativecommons.org/licenses/by-sa/3.0/?utm_source=rss&amp;utm_medium=rss" rel="noopener" target="_blank">CC BY-SA</a></p> </div>