Scalæ By the Bay has ended
Saturday, November 12 • 2:10pm - 2:50pm
Beyond Shuffling: Scaling Apache Spark

This session will cover personal and community experiences scaling Spark jobs to large datasets, and the resulting best practices, with code snippets to illustrate. The planned topics are:
- Using Spark counters for performance investigation: Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where best to spend our time, using both counters and the UI.
- Working with key/value data, replacing groupByKey for awesomeness: groupByKey makes it too easy to accidentally collect individual records which are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations.
- Effective caching & checkpointing: Being able to reuse previously computed RDDs without recomputing them can substantially reduce execution time. Choosing when to cache or checkpoint, and which storage level to use, can have a huge performance impact.
- Considerations for noisy clusters
- Functional transformations with Spark Datasets: How to get some of the benefits of Spark's DataFrames while still being able to work with arbitrary Scala code.
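As a rough illustration of the groupByKey topic in the abstract, here is a minimal sketch that is not taken from the talk itself. It assumes a Spark 2.x or later dependency on the classpath and a local master; the object name, method names, and sample data are all hypothetical. It contrasts a groupByKey-based sum with reduceByKey and aggregateByKey, which combine values inside each partition before the shuffle instead of holding every value for a key in memory at once.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical names throughout; an illustrative sketch, not the speaker's code.
object GroupByKeySketch {

  // Baseline: groupByKey ships every value for a key across the shuffle
  // and materializes them together before the sum runs.
  def sumsWithGroupByKey(scores: RDD[(String, Int)]): RDD[(String, Int)] =
    scores.groupByKey().mapValues(_.sum)

  // reduceByKey merges values within each partition first, so only one
  // partial sum per key and partition is shuffled.
  def sumsPerKey(scores: RDD[(String, Int)]): RDD[(String, Int)] =
    scores.reduceByKey(_ + _)

  // aggregateByKey lets the accumulator type differ from the value type:
  // here a (sum, count) pair that is turned into an average at the end.
  def averagesPerKey(scores: RDD[(String, Int)]): RDD[(String, Double)] =
    scores
      .aggregateByKey((0L, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1L),   // fold one value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))    // merge accumulators from different partitions
      .mapValues { case (sum, count) => sum.toDouble / count }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("groupByKey-sketch")
      .getOrCreate()
    try {
      // Persisted because two separate jobs below read the same RDD.
      val scores = spark.sparkContext
        .parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)))
        .persist(StorageLevel.MEMORY_ONLY)

      sumsPerKey(scores).collect().foreach(println)
      averagesPerKey(scores).collect().foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

Both replacements return the same per-key results as the groupByKey version for associative, commutative operations such as sums; whether they help in practice depends on how many values each key carries.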

Holden Karau

Developer Advocate, Google
Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning...

Saturday November 12, 2016 2:10pm - 2:50pm PST
Off by One