Scalæ By the Bay has ended
Sunday, November 13 • 4:00pm - 4:40pm
Logical Signatures for Spark

Dealing with problems that arise when running a long process over a large dataset can be one of the most time-consuming parts of development. For this reason, many data engineers and scientists save intermediate results and use them to quickly zero in on the sections that have issues, and to avoid rerunning sections that are working as intended.

For data pipelines with several sections, saving and loading intermediate results can become almost as complicated as the core problem the developers are trying to solve. Changes may require previously saved intermediate results to be invalidated and overwritten. This process is typically manual, and it is very easy for a developer to mistakenly use outdated intermediate results. These problems can be even worse when multiple developers share intermediate results.

These issues can be addressed by introducing a logical signature for datasets. For each dataset, we'll compute a signature based on the identity of the input and on the logic applied. If the input and logic stay the same for a dataset between two executions, the signature will be consistent and we can safely load previously saved results. If either the input or the logic changes, the signature will change and the dataset will be freshly computed. With these signatures, we can implement automatic checkpointing that works even among several concurrent users, as well as other useful features.
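The signature idea described in the abstract can be illustrated with a minimal sketch. This is not the speaker's implementation and does not use Spark; it is plain Python written under stated assumptions. The names `signature`, `checkpointed`, `input_id`, `logic_id`, and `store_dir` are hypothetical, and the identity of the logic is assumed to be supplied by the caller as a string (for example, a version label) rather than derived automatically from the code.

```python
import hashlib
import json
import os
import pickle
import tempfile


def signature(input_id: str, logic_id: str) -> str:
    """Hash the identity of the input together with the identity of the logic.

    Both arguments are hypothetical labels chosen by the caller; a real
    system would need its own way of deriving them.
    """
    payload = json.dumps({"input": input_id, "logic": logic_id}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def checkpointed(input_id, logic_id, compute, store_dir):
    """Load a saved result when the signature matches, else compute and save.

    Returns (result, signature, loaded_from_store).
    """
    sig = signature(input_id, logic_id)
    path = os.path.join(store_dir, sig + ".pkl")

    # Same input and same logic give the same signature, so reuse is safe.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), sig, True

    # A changed input or changed logic gives a new signature: compute afresh.
    result = compute()

    # Write to a temporary file and rename it into place, so that another
    # user reading the shared store never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)
    return result, sig, False
```

In this sketch, the signature returned for one stage could be passed as the `input_id` of the next stage, so that a change early in a pipeline changes the signature of every dataset derived from it. Because the file name is the signature itself, several users pointing at the same `store_dir` would share results without coordinating by hand.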

Nimbus Goehausen

Software Engineer, Bloomberg LP

Sunday November 13, 2016 4:00pm - 4:40pm
Off by One
