Saturday, November 12 • 2:10pm - 2:50pm
Beyond Shuffling: Scaling Apache Spark


This session will cover personal and community experiences scaling Spark jobs to large datasets, and the resulting best practices, with code snippets to illustrate. The planned topics are:

- Using Spark counters for performance investigation
  Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where best to spend our time, using both counters and the UI.
- Working with key/value data: replacing groupByKey for awesomeness
  groupByKey makes it too easy to accidentally collect individual records that are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations.
- Effective caching & checkpointing
  Being able to reuse previously computed RDDs without recomputing them can substantially reduce execution time. Choosing when to cache or checkpoint, and which storage level to use, can have a huge performance impact.
- Considerations for noisy clusters
- Functional transformations with Spark Datasets
  How to get some of the benefits of Spark's DataFrames while still being able to work with arbitrary Scala code.

Speakers

Holden Karau

Principal Software Engineer, IBM
Holden Karau is a software development engineer and is active in open source. She is a co-author of Learning Spark and Fast Data Processing with Spark, and has taught intro Spark workshops. Prior to IBM she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of computers she...


Saturday November 12, 2016 2:10pm - 2:50pm
Off by One
