Apache Beam with Kenneth Knowles and Pablo Estrada

Google Cloud Platform Podcast

Episode | Podcast

Date: Wed, 06 Apr 2022 16:30:00 +0000

<p><span style="font-weight: 400;">On the podcast this week, your hosts</span> <a href="https://twitter.com/stephr_wong"><span style="font-weight: 400;">Stephanie Wong</span></a> <span style="font-weight: 400;">and</span> <a href="https://twitter.com/markmirch"><span style="font-weight: 400;">Mark Mirchandani</span></a> <span style="font-weight: 400;">talk about the data processing tool Apache Beam with guests</span> <a href="https://twitter.com/polecito"><span style="font-weight: 400;">Pablo Estrada</span></a> <span style="font-weight: 400;">and</span> <a href="https://twitter.com/KennKnowles"><span style="font-weight: 400;">Kenneth Knowles</span></a><span style="font-weight: 400;">.</span></p> <p><span style="font-weight: 400;">Kenn starts us off with an overview of how Apache Beam began and how Cloud Dataflow was involved. The unique batch and stream method and emphasis on correctness garnered support from developers early on and continues to attract users. Pablo helps us understand why Beam is a better option for certain projects looking to process large amounts of data. Our guests describe how Beam may be a better fit than microservices that could become obsolete as company needs change.</span></p> <p><span style="font-weight: 400;">Next, we step back and take a look at why batch and stream is the gold standard of data processing because of its balance between low latency and ease of “being done” with data collection. Beam’s focus on the correctness of data and correctness in processing that data is a core component. With good data, processing becomes easier, more reliable, and cheaper. Kenn gives examples of how things can go wrong with bad data processing. Beam strives for the perfect combination of low latency, correct data, and affordability. Users can choose where to run Beam pipelines, from other Apache software offerings to Dataflow, which means excellent flexibility. 
Our guests talk about the pros and cons of some of these options, and we hear examples of how companies are using Beam along with supporting software to solve data processing challenges.</span></p> <p><span style="font-weight: 400;">To get started with Beam, check out Beam College or attend Beam Summit 2022.</span></p> <h5><strong>Kenneth Knowles</strong></h5> <p><a href="https://twitter.com/KennKnowles"><span style="font-weight: 400;">Kenn Knowles</span></a> <span style="font-weight: 400;">is chair of the Apache Beam Project Management Committee. Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Kenn holds a PhD in programming languages from the University of California, Santa Cruz.</span></p> <h5><strong>Pablo Estrada</strong></h5> <p><a href="https://twitter.com/polecito"><span style="font-weight: 400;">Pablo</span></a> <span style="font-weight: 400;">is a Software Engineer at Google and a management committee member for Apache Beam. Pablo is enthusiastic about open source work and has worked all across the Apache Beam stack.</span></p> <h5><strong>Cool things of the week</strong></h5> <ul> <li style="font-weight: 400;"><span style="font-weight: 400;">Under the sea: Building the world’s fiber optic internet</span> <a href="https://www.youtube.com/watch?v=N0ng8R0_Tis"><span style="font-weight: 400;">video</span></a></li> <li style="display: inline;"> <ul> <li style="font-weight: 400;"><span style="font-weight: 400;">Discovering Data Centers</span> <a href="https://www.youtube.com/watch?v=2R-UVdw6thI&amp;list=PLIivdWyY5sqI7lzvVHfp4zbwp3Xaub2jm"> <span style="font-weight: 400;">videos</span></a></li> </ul> </li> <li style="font-weight: 400;"><span style="font-weight: 400;">Google Data Cloud Summit</span> <a href="https://cloudonair.withgoogle.com/events/summit-data-cloud-2022"><span style="font-weight: 400;"> site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">It’s official—Google Distributed 
Cloud Edge is generally available</span> <a href="https://cloud.google.com/blog/products/infrastructure-modernization/google-distributed-cloud-edge-is-ga"> <span style="font-weight: 400;">blog</span></a></li> <li style="display: inline;"> <ul> <li style="font-weight: 400;"><span style="font-weight: 400;">GCP Podcast Episode 228: Fastly with Tyler McMullen</span> <a href="https://www.gcppodcast.com/post/episode-228-fastly-with-tyler-mcmullen/"> <span style="font-weight: 400;">podcast</span></a></li> </ul> </li> <li style="font-weight: 400;"><span style="font-weight: 400;">Save big by temporarily suspending unneeded Compute Engine VMs—now GA</span> <a href="https://cloud.google.com/blog/products/compute/save-by-suspending-vms-on-google-compute-engine"> <span style="font-weight: 400;">blog</span></a></li> </ul> <h5><strong>Interview</strong></h5> <ul> <li style="font-weight: 400;"><span style="font-weight: 400;">Apache Beam</span> <a href="https://beam.apache.org/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Apache Beam Documentation</span> <a href="https://beam.apache.org/documentation/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Dataflow</span> <a href="https://cloud.google.com/dataflow"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Apache Flink</span> <a href="https://flink.apache.org/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Apache Spark</span> <a href="https://spark.apache.org/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Apache Samza</span> <a href="https://samza.apache.org/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Apache Nemo</span> <a 
href="https://nemo.apache.org/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Spanner</span> <a href="https://cloud.google.com/spanner"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">BigQuery</span> <a href="https://cloud.google.com/bigquery"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Beam College</span> <a href="https://beamcollege.dev/"><span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Beam College on GitHub</span> <a href="https://github.com/griscz/beam-college/blob/main/day2/B1_Beam_College_Advanced_Windows_and_Triggers_a_practical_guide_v0_9_0.ipynb"> <span style="font-weight: 400;">site</span></a></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Beam Developer Mailing List</span> <span style="font-weight: 400;">email</span></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Beam User Mailing List</span> <span style="font-weight: 400;">email</span></li> <li style="font-weight: 400;"><span style="font-weight: 400;">Beam Summit</span> <a href="https://2022.beamsummit.org/"><span style="font-weight: 400;">site</span></a></li> </ul> <h5><strong>What’s something cool you’re working on?</strong></h5> <p><span style="font-weight: 400;">Mark is working on a new Apache Beam video series,</span> <a href="https://www.youtube.com/watch?v=65lmwL7rSy4&amp;list=PLIivdWyY5sqIEiHGunZXg_yoS7unlHNJt"> <span style="font-weight: 400;">Getting Started With Apache Beam</span></a><span style="font-weight: 400;">.</span></p> <h5><strong>Hosts</strong></h5> <p><span style="font-weight: 400;">Stephanie Wong and Mark Mirchandani</span></p>