Beam and Spark with Holden Karau

Google Cloud Platform Podcast

Episode | Podcast

Date: Wed, 09 May 2018 00:00:00 +0000

<p><a href="https://twitter.com/holdenkarau">Holden Karau</a> is on the podcast this week to talk all about Spark and Beam, two open source tools that help process data at scale, with <a href="https://twitter.com/Neurotic">Mark</a> and <a href="https://twitter.com/nyghtowl">Melanie</a>.</p> <h5 id="holden-karau">Holden Karau</h5> <p><a href="https://twitter.com/holdenkarau">Holden Karau</a> is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, Beam, and related “big data” tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer and PMC member on Apache Spark and a committer on the SystemML & Mahout projects. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.</p> <h5 id="cool-things-of-the-week">Cool things of the week</h5> <ul> <li>Twitter’s collaboration with Google Cloud <a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2018/a-new-collaboration-with-google-cloud.html"> blog</a> & <a href="https://twitter.com/gregsramblings/status/992506460734025728">tweet</a></li> <li>Kaggle CERN TrackML Particle Tracking Challenge Competition <a href="https://www.kaggle.com/c/trackml-particle-identification">site</a></li> <li>Open-sourcing gVisor, a sandboxed container runtime <a href="https://cloudplatform.googleblog.com/2018/05/Open-sourcing-gVisor-a-sandboxed-container-runtime.html"> blog</a> & <a href="https://github.com/google/gvisor">repo</a></li> <li>Announcing Stackdriver Kubernetes Monitoring <a href="https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc"> blog</a></li> <li>MLPerf: collaborative effort to standardize ML benchmarks <a href="https://mlperf.org/">site</a></li> </ul> <h5 id="interview">Interview</h5> <ul> <li>Spark <a 
href="http://spark.apache.org/">site</a> & <a href="https://spark.apache.org/community.html">community site</a></li> <li>Beam <a href="https://beam.apache.org/">site</a></li> <li>Cloud Dataflow <a href="https://cloud.google.com/dataflow/">site</a> & <a href="https://cloud.google.com/dataflow/docs/">docs</a></li> <li>Cloud Dataproc <a href="https://cloud.google.com/dataproc/">site</a> & <a href="https://cloud.google.com/dataproc/docs/">docs</a></li> <li>Using Spark on Kubernetes Engine <a href="https://cloud.google.com/solutions/spark-on-kubernetes-engine">blog</a></li> <li>Testing future Apache Spark releases and changes on Google Kubernetes Engine and Cloud Dataproc <a href="https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc"> blog</a></li> <li>Spark Packages <a href="https://spark-packages.org/">site</a></li> <li>Spark testing base <a href="https://github.com/holdenk/spark-testing-base">repo</a></li> <li>Flink <a href="https://flink.apache.org/">site</a></li> <li>Arrow <a href="https://arrow.apache.org/">site</a></li> </ul> <p>Upcoming Talks:</p> <ul> <li><a href="https://us.pycon.org/2018/">PyCon 2018</a> & Debugging PySpark <a href="https://us.pycon.org/2018/schedule/presentation/97/">talk</a></li> <li><a href="https://eu.scaladays.org/">Scala Days</a> & Keeping the “fun” in Spark <a href="https://eu.scaladays.org/lect-6920-keeping-the-%22fun%22-in-apache-spark%3A-datasets-and-fp.html"> talk</a></li> <li><a href="https://conferences.oreilly.com/strata/strata-eu">Strata London</a> & Understanding Spark tuning with auto-tuning <a href="https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/64759"> talk</a></li> <li><a href="https://jonthebeach.com/">J on the Beach</a> & General Purpose Big Data Systems are eating the world <a href="https://jonthebeach.com/speakers/29/Holden+Karau">talk</a></li> <li><a 
href="https://databricks.com/sparkaisummit/north-america">Spark Summit 2018</a> & Accelerating TF with Apache Arrow on Spark <a href="https://databricks.com/session/accelerating-tensorflow-with-apache-arrow-on-spark-bonus-making-it-available-in-scala"> talk</a></li> </ul> <h5 id="question-of-the-week">Question of the week</h5> <p>I have a continuous integration build process set up with Container Builder, but it’s all sequential. I want to speed things up by processing parts of it in parallel. How do I do that?</p> <ul> <li>Configure Build Step Order <a href="https://cloud.google.com/container-builder/docs/configuring-builds/configure-build-step-order"> docs</a></li> </ul> <h5 id="where-can-you-find-us-next">Where can you find us next?</h5> <p>Mark can be found streaming <a href="https://agones.dev">Agones</a> development on <a href="https://twitch.tv/markmandel">Twitch</a>.</p> <p>Melanie is speaking at the <a href="https://meetings.internet2.edu/2018-global-summit/">Internet2 Global Summit</a> on May 9th in San Diego, and will also be speaking at the <a href="https://understandrisk.org/event/ur2018/">Understand Risk Forum</a> on May 17th in Mexico City.</p> <p>Special shout out: <a href="https://events.google.com/io/">Google I/O</a> and <a href="https://us.pycon.org/2018/">PyCon</a> are both happening this week.</p>