The Big Data Puzzle - Where Does the Eclipse Piece Fit?


Details
Speaker Bio: J. Langley has ~15 years of experience in software development covering a wide variety of languages, platforms, and methodologies. He has spent most of his career working on US Department of Defense programs, with shorter stints supporting the Canadian Department of National Defence, the Swedish Ministry of Defence, the British Ministry of Defence, and the Saudi Arabian National Guard. He has been working with Eclipse as a development framework since 2009.
Abstract: We will introduce a Big Data configuration that uses Avro and Parquet as data formats, Hadoop for storage, and Spark and Hive for running queries. All of these projects come from the Apache Software Foundation and are widely used in the Data Science field. We will show how Eclipse provides an excellent foundation for IDE support and tooling that makes it easier to develop solutions based on this technology stack.
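
To give a feel for the query side of this stack, here is a minimal sketch that loads a Parquet data set into Spark with Hive support enabled and runs a SQL aggregation over it. The HDFS path and the column names (passenger_count, trip_distance) are illustrative assumptions, not details taken from our data set:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TaxiQuery {
    public static void main(String[] args) {
        // Start a Spark session with Hive support so SQL queries can run
        SparkSession spark = SparkSession.builder()
            .appName("TaxiQuery")
            .enableHiveSupport()
            .getOrCreate();

        // Hypothetical path; substitute the location of the converted data set
        Dataset<Row> trips = spark.read().parquet("hdfs:///data/taxi/trips.parquet");
        trips.createOrReplaceTempView("trips");

        // Average trip distance by passenger count; column names are assumptions
        spark.sql("SELECT passenger_count, AVG(trip_distance) AS avg_distance "
                + "FROM trips GROUP BY passenger_count ORDER BY passenger_count")
             .show();

        spark.stop();
    }
}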
CohesionForce has put together a data set of over 200 million samples based on actual records from New York City taxi cabs. We have used this data to compare file sizes, read/write times, and query speeds across the tooling configuration described above. We have also created tools in Eclipse that transform the data between formats, and we have made those available under the EPL (a minimal conversion sketch follows the links):
https://github.com/CohesionForce/dis-toolkit
https://github.com/LangleyStudios/eclipse-avro
https://github.com/CohesionForce/avroToParquet
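
The kind of format transformation these tools perform can be sketched with the standard parquet-avro API: read generic records out of an Avro container file and stream them into a Parquet writer, reusing the Avro schema. This is a sketch under that assumption, not code from the repositories above:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroToParquet {
    public static void main(String[] args) throws Exception {
        // args[0]: input Avro container file, args[1]: output Parquet file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File(args[0]), new GenericDatumReader<>())) {
            // Reuse the schema embedded in the Avro file for the Parquet output
            Schema schema = reader.getSchema();
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path(args[1]))
                         .withSchema(schema)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                // Copy records one at a time from Avro to Parquet
                for (GenericRecord record : reader) {
                    writer.write(record);
                }
            }
        }
    }
}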
We will give a short description of these projects (not a sales pitch) and discuss possibilities for other ways to use Eclipse in the Big Data field:
Eclipse Modeling Framework - define the data formats
Xtend - generate code and other files
BIRT - create reports of the analysis results
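
To make the first item concrete: whatever modeling front end is used, the pipeline ultimately has to produce a schema for the data. The sketch below builds an Avro schema for a hypothetical taxi-trip record directly in Java using Avro's SchemaBuilder; an EMF/Xtend workflow could generate an equivalent definition. All field names here are illustrative assumptions:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class TaxiTripSchema {
    public static void main(String[] args) {
        // Hypothetical taxi-trip record; every field name is illustrative
        Schema trip = SchemaBuilder.record("TaxiTrip")
            .namespace("com.example.taxi")
            .fields()
              .requiredString("pickupDatetime")
              .requiredDouble("tripDistance")
              .requiredInt("passengerCount")
              .requiredDouble("fareAmount")
            .endRecord();

        // Pretty-print the resulting JSON schema definition
        System.out.println(trip.toString(true));
    }
}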
We would also like to solicit feedback from others who may be using Eclipse in the Data Science field.
