Next Meetup

Am I, or Can I Be a Data Scientist?
• What we'll do

Meet members of the Galvanize Data Science team for a discussion about how to start a career in data science! You'll leave this event with a clear perspective and structured options for beginning your career in this quickly growing field. You will also receive a list of the best resources available to plan your studies with Galvanize! Don't forget to RSVP to secure your spot (https://www.eventbrite.com/e/am-i-or-can-i-be-a-data-scientist-tickets-43342348130)!

Galvanize Data Science is for people who are serious about becoming data scientists. Throughout the course, you'll learn the tools, techniques, and fundamental concepts you need to know to make an impact as a data scientist. You'll work through messy, real-world data sets to gain experience across the data science stack: data munging, exploration, modeling, validation, visualization, and communication. Our unique setting in the Galvanize community of startups and tech companies is the perfect place to learn and expand your network! Find out more at https://new.galvanize.com/new-york/data-science

AGENDA

6:00pm: "Ask me Anything" with Galvanize Data Science Instructor Sean Read

7:00pm: Course Overview, Application Process, and Financing Options

8:00pm: Question & Answer. Chat with Instructors, Students, and Galvanize Staff

ABOUT GALVANIZE

Galvanize is the premier dynamic learning community for technology. With campuses located in booming technology sectors throughout the country, Galvanize provides a community for each of the following:

- Education – part-time and full-time training in web development, data science, and data engineering

- Workspace – whether you’re a freelancer, startup, or established business, we provide beautiful spaces with a community dedicated to supporting your company’s growth

- Networking – events in the tech industry happen constantly on our campuses, ranging from popular Meetups to multi-day international conferences

To learn more about Galvanize, visit galvanize.com. If you have specific questions regarding our Data Science Immersive course, please reach out to our Admissions Department at [masked] or (917)[masked].

• What to bring

Laptop, some specific questions, and a desire to learn.

Galvanize NYC

303 Spring Street · New York, NY

What we're about

Galvanize is a co-working space for technology, hosting entrepreneurial teams from the biggest Fortune 500 companies to single entrepreneurs working to build the next disruptive technology. The reason we're putting all these awesome innovators together under the same roof is to cultivate a fertile, synergistic, and dynamic ecosystem that drives growth not only for our members and member companies, but for our educational programs.

Galvanize is a learning space and we are a teaching company. We offer full-time and part-time programs in Data Science and Web Development. Our data science immersive is an incredibly intense, three-month, full-time firehose of a program: it has graduated over 50 cohorts and over 500 students, requires Python fluency and statistical literacy as prerequisites, and is taught at the level of a master's program. Graduates of our immersive have gone on to pursue careers in data science at every major player you've heard of, as well as those that are changing the game in ways you didn't even know were happening.

In addition to our SoHo location, we're located in San Francisco, Seattle, Denver (x2, as Galvanize is a Colorado-based company), Austin, Boulder, and Phoenix. Special shout-out to our primary partners here in New York -- IBM, PwC, Invesco, and State Farm -- but of course thanks as well to all our members and member companies for the amazing, awesome, and exciting work they're doing and for allowing us to help them on their path.

For all inquiries, including food or event space sponsorship, please reach out to Scott Schwartz, Galvanize, at scott[dot]schwartz[at]galvanize[dot]com

If you're interested in learning more about Galvanize's data science immersive, check out http://www.galvanize.com

Or just have a look at the domains our immersive program covers:

## 1. A data scientist must be able to acquire data appropriate for their research goals.

- Query SQL databases

- Selects

- Subsets (WHERE and HAVING)

- Joins

- Understand the structure of relational databases (RDBMS):

- Table schemas

- Foreign keys

- Normal forms

- Star schemas

- Load data from hard files:

- CSV

- JSON

- Scrape data from documents on the web.

- HTTP requests

- GET

- POST

- Basic HTML structure awareness

- Scrape data from web API endpoints.

- HTTP requests send/receive JSON

- Developer tools

- Beautiful Soup

- Regular expressions

- Selenium(?)
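
As a minimal Python sketch of the querying and scraping items above (the database file, table names, and URLs are hypothetical):

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

# Query a SQL database: select, subset (WHERE / HAVING), and join on a foreign key.
conn = sqlite3.connect("sales.db")  # hypothetical database file
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.amount > 0
    GROUP BY c.name
    HAVING SUM(o.amount) > 100
    """
).fetchall()

# Scrape a document on the web: GET request, then parse the HTML with Beautiful Soup.
resp = requests.get("https://example.com/listings")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Hit a JSON API endpoint: the response parses straight into Python objects.
payload = requests.get("https://api.example.com/v1/items").json()  # placeholder URL
```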

## 2. A data scientist must be able to clean and tidy data into a form appropriate for their research goals.

- Load data from databases, flat files, and HTML into Python.

- Use python tools to manipulate data in memory

- Pandas

- Numpy

- PySpark(?)

- Transform data from non-flat formats into flat

- Nested JSON

- XML(?)

- Extract features from text:

- Regular expressions

- NLP vectorization

- Exploratory data analysis:

- Summarize data: central tendency, variance, outliers, missing values.

- Visualize: scatter-plots, histograms.

- Summarize missing values.

- Model specific transformations:

- Normalization and Standardization.

- Missing value encoding / imputation.

- Predictor transformations.

- Response Transformations.

- Dealing with missing data

- Missing at random vs. missing not at random

- Single and multiple imputation

- Preserving raw data
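
A minimal pandas sketch of the loading, flattening, summarizing, and imputation items above (the file and column names are invented for illustration):

```python
import pandas as pd

# Load a flat file; keep the raw frame untouched and work on a copy.
raw = pd.read_csv("survey_raw.csv")   # hypothetical file
df = raw.copy()

# Flatten nested JSON records into a tabular frame.
records = [{"id": 1, "address": {"city": "NYC", "zip": "10013"}}]
flat = pd.json_normalize(records)     # columns: id, address.city, address.zip

# Exploratory summaries: central tendency, spread, and missing values.
print(df.describe())
print(df.isna().sum())

# Single imputation and standardization of a hypothetical numeric column.
df["income"] = df["income"].fillna(df["income"].median())
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()
```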

## 3. A data scientist must be able to visualize and display data to communicate ideas clearly and concisely.

- Appropriate use of titles, labels, and legends.

- Fundamental plot types and their appropriate application

- Scatter plots

- Bar plots

- Line plots

- Histograms

- Dot plots (not histograms drawn with dots, but point estimates of statistics by category)

- Tables

- Use of visual components to convey information

- Multiple plots in one display to encourage comparisons

- Multiple components in one plot to encourage comparisons

- Use of color to distinguish subgroups

- Awareness of color vision impairment and its consequences for design

- Use of transparency to highlight or downplay some information

- Use of confidence / variance bands or intervals to quantify uncertainty.

- Know what to avoid:

- Pie charts

- Donut plots

- Dynamite plots

- Anything 3D

- Avoid showing too much in one picture: communicate a message clearly.

- Choice of plots for reports and presentation: tell a focused story.

- Interactivity (?)
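
A small matplotlib sketch of the items above: titles, labels, a legend, transparency, and two plot types in one display to encourage comparison (the data are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))  # multiple plots in one display

# Scatter plot; transparency de-emphasizes overplotted points.
ax1.scatter(x, y, alpha=0.4, label="observations")
ax1.set_title("Scatter plot")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.legend()

# Histogram of the response.
ax2.hist(y, bins=20, color="tab:blue")
ax2.set_title("Histogram of y")
ax2.set_xlabel("y")
ax2.set_ylabel("count")

fig.tight_layout()
plt.show()
```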

## 4. A data scientist must be able to write and speak clearly, concisely, and transparently about their research and results.

- Understand your audience

- Teammates

- Deep dives

- Try to prove yourself *wrong*

- Lots of figures, code, visuals

- Show and qualify doubt

- Solicit doubts from audience

- Actually address these doubts before moving to the next level

- Other researchers (a different team, with different concerns)

- Medium overview

- Conclusions with data to support them

- Figures ok, visuals still good

- Focus on statistical support

- Solicit doubts from people with a different point of view

- Actually address these doubts before moving to the next level

- Business stakeholders

- High level overview

- Try to prove yourself *correct*

- Conclusions with light yet solid support

- Light figures, focus on business outcomes

- Qualify uncertainty

- Best, expected, and worst case scenarios

- Solicit doubts from people responsible for the bottom line

- Actually address these doubts

- Reports and Presentations

- Appropriate use of slides

- Good for business types

- High level view of work, light on details

- Lots of visuals

- Appropriate use of white papers and reports

- Deeper dives into work

- Detailed arguments and figures

- Good for scientists and technologists from other business/research areas

- Appropriate use of notebooks

- Deepest dive into work

- Includes code

- Don't lie, ever

- Seriously, never ever lie

## 5. A data scientist must be able to build predictive models appropriate for their business or research goals.

- Regression for prediction.

- Linear and Logistic regression.

- Transformations of predictors:

- Encoding of categorical features

- Polynomial terms

- Splines(?)

- Regularization

- Ridge regression

- Lasso regression

- Supervised ML algorithms

- K Nearest Neighbors

- Decision (classification and regression) trees.

- Random Forest.

- Boosting.

- Support Vector Machines

- Recommender systems

- User-user and item-item similarity

- Matrix factorization methods

- Neural networks and convolutional neural networks

- Controlling the Bias and Variance of models

- The bias and variance decomposition

- Cross validation to tune complexity parameters

- Model complexity vs. performance plots

- Training size vs. performance plots

- Measuring the predictive performance of models

- Residual vs. prediction plots

- Predicted vs. actual plots

- Basic error metrics

- Mean squared error and R^2

- Mean absolute error

- Log-loss (logistic loss)

- Estimates of out of sample error

- Hold out validation/testing

- Cross validation for estimating hold out error

- In sample estimates of out of sample error

- Adjusted R^2

- AIC, BIC

- Curse of dimensionality

- Sparsity of evenly distributed data as dimensionality grows

- Recognition of distance based models

- Examples: KNN, clustering

- Non-examples: regression, tree based models

- Use basic techniques to interpret and "look inside" predictive algorithms

- Parameter estimates (for regression)

- Consequences of correlation for interpretability

- Observational vs. experimental data

- All else is never equal

- Partial dependence plots

- Relationship to parameter estimates

- Variable importance measurements
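
A hedged scikit-learn sketch tying several items above together: a hold-out split, cross-validated tuning of a complexity parameter, a hold-out error estimate, and variable importances (the data are synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for a real problem.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross validation to tune a complexity parameter (tree depth).
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,
)
grid.fit(X_train, y_train)

# Hold-out estimate of out-of-sample error.
preds = grid.predict(X_test)
print("test MSE:", mean_squared_error(y_test, preds))

# Variable importance measurements from the fitted forest.
print("importances:", grid.best_estimator_.feature_importances_)
```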

## 6. A data scientist must quantify their uncertainty in their estimates, predictions, and decisions.

- Basic probability.

- Counting, equally likely events.

- Random variables

- Definition

- Density functions

- Distribution functions

- Basic distributions, and the stories they tell.

- Bernoulli

- Binomial

- Poisson

- Exponential

- Uniform

- Conditional probability

- Product/chain rule (for conditional probabilities)

- Independence

- Bayes rule

- Population and sample statistics

- Mean / expectation

- Sample and population variance

- Sample and population standard deviation

- Basic sampling

- Simple random sampling

- Empirical distribution function

- Sampling theory of the mean

- Expectation and variance of the sample mean

- The Law of Large Numbers

- The Central Limit theorem

- The Normal distribution

- Confidence intervals for sample mean

- The Bootstrap

- Bootstrap samples

- Estimating variance of sample statistics with the bootstrap

- Bootstrap confidence intervals

- Bayesian methods

- Prior and posterior distributions

- Likelihood

- Bayesian updates from prior to posterior

- Sampling from the posterior: MCMC(?)
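
A minimal numpy sketch of the bootstrap items above: resample with replacement, recompute the statistic, and read a confidence interval off the percentiles (the sample itself is simulated):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=100)   # simulated data

# Bootstrap the sampling distribution of the mean.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

# Percentile bootstrap confidence interval for the mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.3f}")
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
print(f"bootstrap estimate of the standard error: {boot_means.std(ddof=1):.3f}")
```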

## 7. A data scientist must be able to build inferential models appropriate for answering their business or research questions.

- Null hypothesis significance testing

- Set up null hypothesis and alternative hypothesis

- Derive null distribution

- Define and compute p-values

- Define and compute statistical power

- Define the rejection threshold / region

- Use common tests

- Binomial exact test for one population proportion

- Normal approximate test for one population proportion

- Normal approximate test for two population proportions

- Wald test for two population means

- t statistic

- t distribution

- Chi squared test

- Chi squared statistic

- Chi squared distribution

- Use regression to test conditional dependence

- Probabilistic assumptions of linear / logistic regression

- NHSTs for non-zeroness of regression parameter estimates

- Null and alternative hypothesis for regression parameter estimates

- p-values and confidence intervals

- Bayesian regression(?)

- Priors on regression parameters(?)

- Multilevel regression(?)
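
A small numpy/scipy sketch of the normal-approximation test for two population proportions, written out step by step so the NHST recipe above is visible (the counts are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical counts: conversions out of visitors for two page variants.
conv_a, n_a = 130, 1000
conv_b, n_b = 165, 1000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled proportion under H0: p_a == p_b

# Normal-approximation test statistic and two-sided p-value.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
# Reject H0 at the 5% level if p_value < 0.05.
```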

## 8. A data scientist must be able to use their predictive and inferential models to make decisions.

- Calculation of expectations using predictive models

- Profit / loss optimization using predictive models

- Probabilistic (soft) vs. non-probabilistic (hard) classification

- Calibration of probability models

- Proper scoring rules

- Thresholding probability models

- Hard classification metrics

- Accuracy

- False positive rate

- True positive rate

- Positive predictive value / Precision

- Recall / Sensitivity

- ROC curves and the AUC

- Profit matrices / curves

- Optimizing expected profit

- Rare class problems with hard classification (aka imbalanced classes)

- Problems with naive thresholding

- Problems with accuracy as a performance measure

- Insensitivity of probability models

- Insensitivity of AUC

- AB testing

- Design of experiments

- Control and treatment groups

- Amount of data to collect / length of experiment

- Common mistakes / misinterpretation

- Regression to the mean

- Early stopping

- Case studies

- Clickthrough rate for competing web pages

- Some more...
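
A hedged numpy sketch of thresholding a probability model and turning the resulting confusion counts into metrics and an expected profit (the probabilities, labels, and profit matrix are invented):

```python
import numpy as np

# Hypothetical predicted probabilities and true labels.
proba = np.array([0.9, 0.8, 0.65, 0.4, 0.35, 0.2, 0.1, 0.05])
y_true = np.array([1,   1,   0,    1,   0,    0,   0,   0])

threshold = 0.5
y_hat = (proba >= threshold).astype(int)   # hard classification

tp = np.sum((y_hat == 1) & (y_true == 1))
fp = np.sum((y_hat == 1) & (y_true == 0))
fn = np.sum((y_hat == 0) & (y_true == 1))
tn = np.sum((y_hat == 0) & (y_true == 0))

precision = tp / (tp + fp)                 # positive predictive value
recall = tp / (tp + fn)                    # sensitivity / true positive rate
accuracy = (tp + tn) / y_true.size

# Profit matrix: a made-up dollar value for each cell (TP, FP, FN, TN).
profit = 100 * tp - 20 * fp - 50 * fn + 0 * tn
print(precision, recall, accuracy, profit)
```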

## 9. A data scientist must be able to derive insights from unlabeled data.

- Linear methods for dimensionality reduction

- Basic linear algebra competency

- Matrices and vectors

- Matrix-vector and matrix-matrix multiplication

- Dot products

- Vector projections

- Eigenvalues and eigenvectors of a matrix

- Principal component analysis

- Geometric definition of principal components

- Computation in terms of eigenvectors

- Projection onto principal components

- Variance in data accounted for by principal components

- Singular value decomposition

- Use cases and non-use cases of linear dimensionality reduction

- Visualization of data

- Low dimensional scatterplots

- Image reconstruction

- Curse of dimensionality control

- Poor man's regularization

- Clustering

- Objective of clustering, finding homogeneous classes

- Curse of dimensionality

- Validating clusters

- Clustering methodologies

- K-means

- Hierarchical clustering

- Graph based methods(?)

- Distance measures / Similarity measures

- L1, L2, Hamming

- Cosine distance

- Dynamic time warping

- Sequence / tree edit distances

- Graphs

- Definition

- Encodings

- Edge lists

- Adjacency Matrices

- Measures of centrality

- Vertex centrality

- Edge centrality

- Eigenvector centrality

- PageRank

- Partitions

- Communities

- Modularity

- Algorithms

- Depth first search

- Breadth first search

- Shortest paths
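
A compact numpy / scikit-learn sketch of PCA via the SVD followed by k-means on the projected data (the data, the two components, and the three clusters are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                  # synthetic 10-dimensional data

# PCA by hand: center, take the SVD, project onto the top two components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                             # principal directions
scores = X_centered @ components.T              # low-dimensional projection

# Share of variance accounted for by the first two components.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(f"variance explained by 2 components: {explained:.2%}")

# Cluster in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(labels))
```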

## 10. A data scientist must be able to clearly and cleanly express themselves in code.

- Fundamentals of clean coding

- Knowledge and use of data structures

- Lists

- Tuples

- Dictionaries

- Use of functions to cleanly factor work and process

- Do one thing and one thing only

- Use of objects to encapsulate related data and functionality

- Basic object oriented programming

- Abstraction: create consistent interfaces for related tasks

- Polymorphism: choose appropriate algorithm based on data types

- Encapsulation: Data in objects is operated on by methods

- Testing

- Write unit tests for small pieces of functionality

- Environment

- IPython terminal for checking small snippets of functionality

- Notebooks for

- Research

- EDA

- Presentation of results and process

- Medium level code development

- Modules for

- Final products

- Stable functionality

- General libraries
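
A toy sketch of the clean-coding and testing items above: a small class that encapsulates its data, methods that each do one thing, and a unit test (the class is made up for illustration):

```python
import unittest


class RunningMean:
    """Encapsulates a running total so callers never touch the raw counters."""

    def __init__(self):
        self._total = 0.0
        self._count = 0

    def update(self, value):
        """Do one thing: fold a single observation into the running mean."""
        self._total += value
        self._count += 1

    @property
    def mean(self):
        return self._total / self._count if self._count else float("nan")


class TestRunningMean(unittest.TestCase):
    def test_mean_of_two_values(self):
        rm = RunningMean()
        rm.update(2.0)
        rm.update(4.0)
        self.assertAlmostEqual(rm.mean, 3.0)


if __name__ == "__main__":
    unittest.main()
```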

## 11. A data scientist must be able to organize and manage their work product.

- Version control

- Basic git competency

- git add

- git commit

- git branch

- git merge

- Basic GitHub competency

- forking

- Pull requests

- git push, pull

- Working with others

- Basic git/GitHub workflow

- Actually talking to people

- Reproducible research

- High level documentation of project work

- READMEs

- Dependencies

- Versions

- Data Documentation

- Process for re-pulling data

- Pipeline for data cleanup and transformation

- Modeling pipeline
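
Version control itself lives at the command line, but the reproducibility items above can be sketched as a Python pipeline whose stages (re-pull, clean, model) are separate, documented functions; everything here is a hypothetical skeleton:

```python
"""Reproducible pipeline skeleton: each stage can be re-run from scratch.

Dependencies and versions would be pinned and noted in the project's README.
"""
import pandas as pd


def pull_data(url: str) -> pd.DataFrame:
    """Re-pull the raw data from its source so the project never depends on a stale copy."""
    return pd.read_csv(url)


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Data cleanup and transformation, kept separate from modeling."""
    return raw.dropna(subset=["target"])   # hypothetical column


def model(df: pd.DataFrame) -> dict:
    """Modeling pipeline stub; returns whatever artifacts the README documents."""
    return {"n_rows": len(df)}


if __name__ == "__main__":
    artifacts = model(clean(pull_data("https://example.com/data.csv")))  # placeholder URL
    print(artifacts)
```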

## 12. A data scientist has learned how to learn, and is adaptive to changes in tools and environment.

- Learn how to Learn

- Adapts to the ever changing landscape of tools and resources

- Does not fall in love with their tools

- Does not fall in love with their data

- Courageously learns and grows

- Comfortably attempts work and research when uncertain about their skills

- Pushes their boundaries to constantly be a better data scientist than the day before

- Works well with a diverse set of personalities and skill sets

- Treats co-workers with respect, no matter their job title

- Actually listens, with the intent to understand

- Disagrees respectfully, with evidence and compassion

- Uses feedback constructively to improve work product

- Freely shares credit where earned

- Celebrates the victories of others

- Cultivates awareness of the consequences of their work

- Is aware of the risks of their recommendations, and keeps decision makers fully informed

- Carefully considers the externalities (positive and negative) of their recommendations

- Does not lie, ever

- Tells the stories the data tell, even when uncomfortable

- Strength of recommendations is based on strength of evidence

- Does not plagiarize

- Does not take credit for others' work

- Does not avoid responsibility for mistakes or oversights
