Big Data Schema Design


Details
This presentation will consist of two to four smaller talks. Each mini-talk will describe:
- A specific large-scale data storage challenge.
- Background on why the problem is difficult to solve with traditional, non-scalable data storage.
- A solution to the problem in a distributed data store.
We already have two speakers slated to talk (on Cassandra Schema Design), but we are looking for others. Please contact the organizers if you would like to propose a talk.
Schema Design for Storing Time Series Metrics and Implementing Multi-Dimensional Aggregate Composites with Counters for Reporting - Joe Stein
A walkthrough of a generic, reusable implementation for storing time series data points and multi-dimensional permutations for aggregates of those data points. Composite keys, composite columns, multiget queries, slice (range of columns) queries, and counters will be discussed in detail, covering how to store and retrieve all the aggregate sets of data with that schema. Using this pattern, billions of data points can be collected and reported on from every point of view and pivot, across all permutations, without suffering performance degradation.
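To make the pattern concrete, here is a hedged sketch rather than the speaker's actual Cassandra implementation: an in-memory Python dict stands in for a counter column family, and each incoming data point increments one counter per dimension combination under a composite row key and an hour-bucket column, so any pivot can later be read back as a slice. All names (record, page_views, region, device) are made-up examples.

    from collections import defaultdict
    from itertools import combinations

    # In-memory stand-in for a counter column family: row key -> column -> counter.
    # In Cassandra this would be a counter column family with composite row keys
    # and time-bucket column names; a nested dict only illustrates the layout.
    counters = defaultdict(lambda: defaultdict(int))

    def record(metric, timestamp, value, **dims):
        """Increment one counter per dimension combination (aggregate 'pivot')."""
        hour_bucket = timestamp - (timestamp % 3600)   # column name: hour bucket
        names = sorted(dims)                           # stable dimension order
        for r in range(len(names) + 1):
            for combo in combinations(names, r):
                # Composite row key: metric plus the fixed dimension values for this pivot.
                row_key = (metric,) + tuple(f"{d}={dims[d]}" for d in combo)
                counters[row_key][hour_bucket] += value

    # Example events; metric name and dimensions are illustrative assumptions.
    record("page_views", 1700000000, 3, region="us-east", device="mobile")
    record("page_views", 1700000100, 5, region="us-east", device="desktop")

    # A "slice" over one pivot: total page views in us-east per hour bucket.
    print(dict(counters[("page_views", "region=us-east")]))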
CasBase - Bringing SQL-like Features to NoSQL - Edward Capriolo
Typically, NoSQL data stores relax or remove some features of relational databases. CasBase provides similar features, such as primary key and unique index enforcement, as well as the ability to run optimized range queries on data. This presentation covers the fundamental differences between Cassandra and some relational databases, then describes the techniques used to bridge the gap between the two.
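As a hedged illustration of the kind of features described (not CasBase's actual implementation, which the abstract does not detail), the Python sketch below layers primary-key enforcement and ordered range queries on top of a plain key-value map, the same gap a schema-less store leaves open. The class and method names are invented for the example.

    import bisect

    class UniqueKeyStore:
        """Toy ordered key-value store illustrating primary-key/unique-index
        enforcement and ordered range queries over sorted keys."""

        def __init__(self):
            self._keys = []    # sorted keys, enabling range scans
            self._rows = {}    # key -> row data

        def insert(self, key, row):
            if key in self._rows:                      # enforce uniqueness on write
                raise KeyError(f"duplicate primary key: {key!r}")
            bisect.insort(self._keys, key)
            self._rows[key] = row

        def range_query(self, lo, hi):
            """Return rows with lo <= key < hi using the sorted key index."""
            start = bisect.bisect_left(self._keys, lo)
            end = bisect.bisect_left(self._keys, hi)
            return [(k, self._rows[k]) for k in self._keys[start:end]]

    store = UniqueKeyStore()
    store.insert("user:100", {"name": "alice"})
    store.insert("user:200", {"name": "bob"})
    print(store.range_query("user:100", "user:199"))   # only user:100 falls in range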
Using Frequency Weights and Discretizing Continuous Random Variables to Analyze Big Data Sets with No Loss of Signal in Low-Dimensional Analyses - Ori Stitelman
Hive and Hadoop are great tools for storing and extracting large data sets; however, the statistical tools we wish to apply are often not available in a form that takes advantage of the MapReduce structure, either because the tool has not yet been developed for MapReduce or because the method itself cannot easily be broken down into a map and a reduce step. At that point we are faced with the decision of either downsampling our data or settling for a less than optimal method of analysis. By discretizing continuous variables and using frequency weights, one can avoid this downsampling dilemma and still apply the desired tools without losing signal in low-dimensional problems. We will present an example using a generalized additive model (GAM) in which, by following the proposed approach, we were able to save hours of computation time per analysis and even analyze data sets that would previously have crashed our systems. In this example we used R, a particularly memory-intensive environment, to implement the analysis.
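A minimal sketch of the collapsing step, assuming simulated data: the talk's own example uses R and a GAM, while the Python below only shows how discretizing a continuous predictor and attaching frequency weights shrinks a million raw rows to a small weighted table that a low-dimensional analysis can consume. The data, bin width, and variable names are illustrative.

    import random
    from collections import Counter

    random.seed(0)

    # Pretend this is a large extract from Hive: one continuous predictor
    # and a binary outcome. 1,000,000 raw rows stand in for "big data".
    raw = [(random.gauss(50, 15), random.random() < 0.3) for _ in range(1_000_000)]

    # Discretize the continuous variable into 1-unit bins, then collapse the
    # data to (bin, outcome) cells whose counts serve as frequency weights.
    cells = Counter((round(x), y) for x, y in raw)

    # The weighted table is tiny compared to the raw data, yet preserves all
    # the information needed for a low-dimensional analysis (e.g. a model of
    # the outcome on the binned predictor, fit with these counts as weights).
    weighted = [(bin_x, int(y), w) for (bin_x, y), w in sorted(cells.items())]
    print(f"raw rows: {len(raw):,}  weighted rows: {len(weighted):,}")
    print(weighted[:3])   # (bin, outcome, frequency weight)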
