Predictive Analytics - S.E Michigan Message Board › March 23 Data Analysis Workshop
Bloomfield Hills, MI
Four of us spent Saturday working on the Titanic data in the cafe at the Bloomfield Township Public Library. The entire meeting went to preprocessing the data.
Here's where we are right now:
1) The first goal of the workshop is to have one data set that can be used by everyone. We are not quite there: we still have to fill in some missing data for age, and we still need to decide whether there are any other derived attributes we want to include. To fill in the missing data, we agreed to group the age data by sex and pclass, look at the distribution, then pick the value that best represents each group. We still need to finish this. Attributes that are currently included in the model are pclass, sex, age, sibsp, parch, fare, embarked, moniker, nickname. Nickname is a derived attribute. The team also agreed to create a data set with no missing values. We can use this set to check the accuracy of our feature selection models.
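For anyone who wants to try the age fill before the next meeting, here is a rough sketch of the grouping idea in plain Python. It is only an illustration: the rows are made-up toy values, the function names are hypothetical, and using the median as "the value that best represents the group" is an assumption, since we have not picked a statistic yet.

```python
from collections import defaultdict
from statistics import median

# Toy rows standing in for the Titanic data; the values are invented for illustration.
rows = [
    {"pclass": 1, "sex": "female", "age": 38.0},
    {"pclass": 1, "sex": "female", "age": 35.0},
    {"pclass": 1, "sex": "female", "age": None},
    {"pclass": 3, "sex": "male", "age": 22.0},
    {"pclass": 3, "sex": "male", "age": 30.0},
    {"pclass": 3, "sex": "male", "age": None},
]

def ages_by_group(rows):
    """Collect the known ages for each (sex, pclass) group so the distribution can be inspected."""
    groups = defaultdict(list)
    for row in rows:
        if row["age"] is not None:
            groups[(row["sex"], row["pclass"])].append(row["age"])
    return groups

def fill_missing_age(rows, pick=median):
    """Return copies of the rows with missing ages replaced by pick(known ages in the same group).

    `pick` defaults to the median, which is an assumption; swap in whatever
    statistic the group decides on. Rows whose group has no known ages are left as-is.
    """
    groups = ages_by_group(rows)
    filled = []
    for row in rows:
        row = dict(row)  # copy so the original data is untouched
        if row["age"] is None:
            known = groups.get((row["sex"], row["pclass"]))
            if known:
                row["age"] = pick(known)
        filled.append(row)
    return filled

def complete_cases(rows):
    """Keep only the rows with no missing values (the 'no missing data' subset)."""
    return [row for row in rows if all(v is not None for v in row.values())]

filled = fill_missing_age(rows)
no_missing = complete_cases(rows)
```

Passing a different function as `pick` (for example `statistics.mean`) changes the fill value without touching the rest of the code, which should make it easy to compare options once we have looked at the distributions.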
2) There are two new files. The file tit_nmd.csv is in the Anne folder and is the subset of the data that had no missing values for the attributes selected. The file combine__training_test_derived_and_additional_attributes.csv is in Erin's folder. This file combines the training and test data so that whatever missing data is filled in for the training data will also be filled in, the same way, for the test data. After the missing values are filled in, the combined file will be split back into training and test files and we can begin feature selection.
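The combine, fill, then split idea can be sketched like this. Again, this is only an illustration in plain Python with invented toy rows: the `source` marker column and the function names are hypothetical, and the overall median age is just a placeholder fill standing in for whatever method we settle on.

```python
from statistics import median

# Toy rows; the test rows have no "survived" column, as in the Kaggle files.
train = [
    {"survived": 1, "sex": "female", "pclass": 1, "age": 20.0},
    {"survived": 0, "sex": "male", "pclass": 3, "age": None},
]
test = [
    {"sex": "female", "pclass": 2, "age": 40.0},
    {"sex": "male", "pclass": 3, "age": None},
]

def combine(train, test):
    """Stack both sets, tagging each row with its origin so it can be split again later."""
    combined = [dict(row, source="train") for row in train]
    combined += [dict(row, source="test") for row in test]
    return combined

def fill_age(combined):
    """Placeholder fill: replace missing ages with the median of all known ages."""
    known = [row["age"] for row in combined if row["age"] is not None]
    fill_value = median(known)
    return [
        dict(row, age=fill_value) if row["age"] is None else dict(row)
        for row in combined
    ]

def split(combined):
    """Separate the rows by origin and drop the helper 'source' column."""
    def strip(row):
        return {k: v for k, v in row.items() if k != "source"}
    train_out = [strip(row) for row in combined if row["source"] == "train"]
    test_out = [strip(row) for row in combined if row["source"] == "test"]
    return train_out, test_out

train_filled, test_filled = split(fill_age(combine(train, test)))
```

Because the fill value is computed once on the combined rows, both files end up with the same value for the same kind of gap, which is the point of combining them first.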
Clearly, working on the data for this project and the subsequent analyses takes dedicated time. If our goal is to submit to the Kaggle site in May, we probably need at least two more workshops.
Erin, Paul and Mark, please feel free to chime in if I forgot anything.
See everyone on Wednesday night.