Scalæ By the Bay has ended
Saturday, November 12 • 1:40pm - 2:00pm
Spark DataFrames for Data Munging

Have you ever been handed a few hundred gigabytes of data collected by someone else? What’s in that data and how will you analyze it? Data munging is a messy job that most data engineers and data scientists have to deal with. When it needs to be done at scale, one of the best tools for the job is the Spark DataFrame Scala API. DataFrames were first introduced in Spark 1.3, with major improvements in 1.4 through 1.6 and in 2.0.

In this talk, you’ll learn the top reasons why Spark DataFrames, when combined with notebooks, are great for data exploration and data munging:
* Spark is fast, interactive, and scalable.
* Built-in support for semi-structured input, namely JSON.
* Summary statistics and approximate counting for quick overviews of a data set.
* Language-integrated SQL and UDFs for querying the data.
* Numerous utility functions for math, string, and date-time manipulation.
* Datasets, in Spark 2.0, for functional transformations.

This talk will include a live demo of the Spark DataFrame Scala API for data exploration and data munging on a real data set, with a Zeppelin notebook. The data set will be Tweet data in JSON format. The speaker is a data analytics developer who has been data munging with Spark DataFrames since they were first introduced in Spark 1.3. She has over ten years of experience developing analytics and data pipelines at scale.

Speakers
(Susan) Xinh Huynh

Software Engineer, Mesosphere
Susan is a data analytics developer who has been data munging with Spark DataFrames since they were first introduced in Spark 1.3. She has over ten years of experience in analytics, big data, and data science. She is currently working on the Mesos - DC/OS big data stack at Mesosphere...


Saturday November 12, 2016 1:40pm - 2:00pm PST
Off by One