Xavier started his career as a researcher in Experimental Physics, where he also focused on data processing. He later took part in projects in finance, genomics and software development for academic research. During that time, he worked on time series and on the prediction of biological molecular structures and interactions, and applied Machine Learning methodologies. He developed solutions to manage and process data distributed across data centres.
He has since founded, and now works at, Data Fellas, a company dedicated to distributed computing and advanced analytics, leveraging Scala, Spark and other distributed technologies such as H2O for machine learning.
Sparkling Water on the Spark Notebook: Interactive Genomes clustering
H2O provides advanced Machine Learning capabilities that scale to large datasets. Interoperability between H2O and general-purpose large-scale data manipulation frameworks like Apache Spark is essential to help data scientists work as efficiently as possible, and this is where Sparkling Water shines. The last stone of the edifice is the ability to work interactively on data from a single environment, allowing data scientists to share their results and code. We present here the Spark Notebook working with Sparkling Water to bring the valuable H2O libraries to the Spark environment. We show a genomics data processing use case: Spark and the Spark-based genomics library ADAM are used to efficiently access raw data through domain-specific objects, data preparation is done with Spark, and deep learning from H2O is used to compute a model for population stratification within the set of genomes under investigation.