Scalæ By the Bay has ended
Saturday, November 12 • 2:10pm - 2:50pm
Beyond Shuffling: Scaling Apache Spark

This session will cover personal and community experiences scaling Spark jobs to large datasets, and the resulting best practices, with code snippets to illustrate. The planned topics are:

- Using Spark counters for performance investigation: Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where best to spend our time, using both counters and the UI.
- Working with Key/Value Data: Replacing groupByKey for awesomeness. groupByKey makes it too easy to accidentally collect individual records which are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations.
- Effective caching & checkpointing: Being able to reuse previously computed RDDs without recomputing can substantially reduce execution time. Choosing when to cache, when to checkpoint, and which storage level to use can have a huge performance impact.
- Considerations for noisy clusters
- Functional transformations with Spark Datasets: How to have some of the benefits of Spark's DataFrames while still having the ability to work with arbitrary Scala code.
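
As a rough illustration of the groupByKey topic above, here is a minimal sketch of two commonly used replacements, `reduceByKey` and `aggregateByKey`. The abstract does not say which operations the talk actually covers, so treat this as one plausible reading. The object name, method names, and sample data are hypothetical, and the sketch assumes Spark's core library is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; names and data are illustrative only.
object GroupByKeyAlternatives {

  // Baseline: groupByKey gathers every value for a key into one collection
  // before summing, so a very hot key can produce a very large record.
  def sumWithGroupByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.groupByKey().mapValues(_.sum)

  // reduceByKey combines values within each partition before the shuffle,
  // so only partial sums per key are sent across the network.
  def sumWithReduceByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)

  // aggregateByKey fits when the result type differs from the value type;
  // here a (sum, count) accumulator is turned into a per-key mean.
  def meanWithAggregateByKey(pairs: RDD[(String, Int)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0L, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1L),   // fold one value into the accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))    // merge accumulators across partitions
      .mapValues { case (sum, count) => sum.toDouble / count }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for demonstration only.
    val conf = new SparkConf().setAppName("groupByKey-alternatives").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)))
      println(sumWithReduceByKey(pairs).collect().toMap)
      println(meanWithAggregateByKey(pairs).collect().toMap)
    } finally {
      sc.stop()
    }
  }
}
```

All three methods compute per-key results; the difference is that the second and third never need to hold all values for a key in memory at once, which is the concern the abstract raises about groupByKey.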

Speakers
Holden Karau

Developer Advocate, Google
Holden is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, BEAM, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer and PMC...


Saturday November 12, 2016 2:10pm - 2:50pm
Off by One
