• What we'll do
Brandon Sherman presents Tidy Data
A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.
The TL;DR: organize your data so that every row corresponds to an observation and every column corresponds to a variable.
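If you want a feel for the idea before the talk, here is a minimal sketch in plain Python. The table, the column names, and the `tidy` helper are all made up for illustration and are not taken from the paper; the point is only that a "wide" table with one column per treatment becomes a "long" table with one row per observation.

```python
# Hypothetical "messy" table: one row per person, one column per treatment.
# The values are the variable "result", spread across two columns.
messy = [
    {"name": "Ada", "treatment_a": 4, "treatment_b": 7},
    {"name": "Grace", "treatment_a": 9, "treatment_b": 2},
]


def tidy(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    out = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the id column is carried along, not unpivoted
            out.append({id_key: row[id_key], var_name: key, value_name: value})
    return out


# Each row is now one observation; each column is one variable.
tidy_rows = tidy(messy, "name", "treatment", "result")
for r in tidy_rows:
    print(r)
```

Two people with two treatments each give four tidy rows, and every row answers the same question: who, which treatment, what result.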
Mohannad Elhamod presents Efficient Determination of Dynamic Split Points in a Decision Tree
"We consider the problem of choosing split points for continuous predictor variables in a decision tree. Previous approaches
to this problem typically either (1) discretize the continuous predictor values prior to learning or (2) apply a dynamic method that considers all possible split points for each potential split. In this paper, we describe a number of alternative approaches that generate a small number of candidate split points dynamically with little overhead. We argue that these approaches are preferable to pre-discretization, and provide experimental evidence that they yield probabilistic decision trees with the same prediction accuracy as the traditional dynamic approach. Furthermore, because the time to grow a decision tree is proportional to the number of split points evaluated, our approach is significantly faster than the traditional dynamic approach."
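To make the trade-off in that abstract concrete, here is a rough sketch in plain Python. It is not the paper's method: `all_split_points` illustrates the "consider every possible split" baseline, while `quantile_split_points` is an assumed stand-in that picks a handful of candidates at evenly spaced empirical quantiles. Both function names and the sample values are hypothetical.

```python
def all_split_points(values):
    """Baseline: one candidate midway between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def quantile_split_points(values, k):
    """Assumed stand-in: at most k candidates at evenly spaced empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    # Index i * n // (k + 1) is always < n for 1 <= i <= k.
    candidates = [ordered[(i * n) // (k + 1)] for i in range(1, k + 1)]
    return sorted(set(candidates))  # drop duplicates from repeated values


values = [2.0, 3.5, 1.0, 8.0, 5.5, 4.0, 9.5, 7.0, 6.0, 0.5]
full = all_split_points(values)
few = quantile_split_points(values, 3)
print(len(full), "candidates to score in the baseline")
print(len(few), "candidates to score with quantiles:", few)
```

With ten distinct values the baseline has nine candidates to score at every node, while the quantile sketch scores three. Since tree-growing time scales with the number of split points evaluated, that gap is where the speedup would come from.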
Dinner will be provided! We usually eat between 6:30 and 7:00, with papers starting at 7.
• Important to know
Big ups to Microsoft Reactor for hosting this month!
As a chapter of Papers We Love we abide by and enforce the PWL Code of Conduct (https://github.com/papers-we-love/seattle/blob/master/code-of-conduct.md) at our events. Please give it a read, plan on acting like an adult, and involve one of the organizers if you need help.
Stop slacking and join us in the #seattle channel at https://papersweloveslack.herokuapp.com!
If you have a paper you'd like to present, or even just a mini, please hit up one of the organizers :) We're always looking for more presenters.