Identifying Smugglers: Local Outlier Detection in Big Geospatial Data

For our November Meetup, we're thrilled to bring you a presentation about a key technique used when analyzing geospatial data -- detection of outliers. As sensor data gets cheaper and more ubiquitous, as business data becomes precisely geotagged, and as locality becomes key in everything from surveys to log files, the well-rounded data scientist needs to be familiar with techniques for effectively working with latitudes, longitudes, points, and geometric objects. Nathan Danneman will talk about techniques he's used for finding geographic outliers -- points that may be signal in the noise, or perhaps noise in the signal -- when you don't have a good model of the data-generating process.

Agenda:

6:30pm -- Networking, Empanadas, and Refreshments

7:00pm -- Introduction

7:15pm -- Presentations and discussion

8:30pm -- Adjourn for Data Drinks (Circa, 2221 I St.)

Abstract:

This talk describes a method for unsupervised, local outlier detection that does not rely on specifying a parametric model for the unlabelled data. The method is a unique amalgam of several “off-the-shelf” techniques, and creates a potent, flexible, scalable solution for identifying local (Type II) outliers. I apply this model to transponder data from ships in the Strait of Hormuz to demonstrate its capabilities, as well as some of the challenges associated with its use.

Bio:

Nathan Danneman is an analytics engineer at Data Tactics, where he analyzes geospatial, textual, and cyber-related data. He holds a PhD in political science from Emory University (2013), with focus areas in applied statistics and international conflict. Some of his past and current work includes quantitative studies of human rights abuses, formal and quantitative modeling of international conflict mediation, and a book on mining and analyzing social media data.

Sponsors:

This event is sponsored by Intridea, Statistics.com, Elder Research, MemSQL, and ParkMe.

Parking:

For those driving, we encourage you to find parking for this event via our sponsor, ParkMe. ParkMe will help you find the closest, cheapest parking, and has iPhone and Android apps.

  • Joshua S.

    Thank you.

    December 17, 2013

  • Joshua S.

    Could someone post a link to this data set? Thanks.

    December 17, 2013

  • Brand N.

    I did a blog that follows up on my question to the community about statistics and data science:
    http://semanticommunity.info/Data_Science/World_Drug_Report_2013#Story

    November 26, 2013

  • Nathan D.

    All, Thanks very much for all of your feedback. My thoughts, and the method generally, will certainly mature as a result of the talk. Also, the data that goes with the R code is now linked to Dropbox from my website, per several great recommendations! All of the materials can be found at www.nathandanneman.com.

    Nathan

    November 24, 2013

  • Doug_S

    And for testing the quality of models, let's try holding out 20 to 30 percent of the points, chosen at random, developing competing models on the remaining points, and then testing how well the models do on the holdout sample.

    November 23, 2013

    • Nathan D.

      I like the idea of this approach; what metric would we use to evaluate it? That is, how would we know if the points in the holdout set were outliers or not?

      November 24, 2013
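The random holdout suggested above could be sketched as follows. This is only an illustration of the split itself, assuming the points fit in a list; `holdout_split` and its parameters are hypothetical names, and it does not answer the follow-up question of how holdout points would be scored without outlier labels.

```python
import random


def holdout_split(points, holdout_frac=0.25, seed=0):
    """Randomly set aside a fraction of the points as a holdout sample."""
    if not 0.0 < holdout_frac < 1.0:
        raise ValueError("holdout_frac must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed so competing models see the same split
    indices = list(range(len(points)))
    rng.shuffle(indices)
    n_holdout = round(len(points) * holdout_frac)
    held = set(indices[:n_holdout])
    # Preserve the original ordering inside each part.
    train = [p for i, p in enumerate(points) if i not in held]
    holdout = [p for i, p in enumerate(points) if i in held]
    return train, holdout
```

Reusing one seed for every competing model keeps the comparison on identical training and holdout points.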

  • Doug_S

    On the "data science" side, we need to see more mucking around in the data, looking for patterns -- without artificial random points, which have much potential to confuse matters rather than clarifying them. I'd like to see what he has in six months or a year. Meanwhile, if we're going to talk about advantages and disadvantages of various methods, let's illustrate it with data sets that have been more thoroughly explored and analyzed.

    November 23, 2013

    • Nathan D.

      Doug, great comment. I like the idea of doing a more heads-up comparison between this method and others. However, the false observations are central to this approach, and are actually key to helping identify non-fake boats!

      November 24, 2013
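Comments on this page describe the approach as a logistic regression trained to separate real boats from randomly generated fake ones. A minimal sketch of that idea follows, assuming fake points drawn uniformly over the bounding box of the real data, a hand-rolled logistic regression with squared terms, and toy 2-D points in place of the transponder data. It is not the speaker's R code, and every function name here is hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def mean_std(values):
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def features(x, y, stats):
    """Standardise a point and add squared terms so the boundary can curve."""
    (mx, sx), (my, sy) = stats
    zx, zy = (x - mx) / sx, (y - my) / sy
    return [1.0, zx, zy, zx * zx, zy * zy]


def fit_logistic(rows, labels, lr=0.5, iters=500):
    """Batch gradient descent on the mean log-loss (assumed settings)."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for f, t in zip(rows, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f))) - t
            for j, fj in enumerate(f):
                grad[j] += err * fj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def score_real_points(real, n_fake, seed=0):
    """Return, for each real point, the fitted probability that it is real."""
    rng = random.Random(seed)
    xs, ys = [p[0] for p in real], [p[1] for p in real]
    # Fake observations: uniform noise over the bounding box of the real data.
    fake = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
            for _ in range(n_fake)]
    pts = real + fake
    stats = (mean_std([p[0] for p in pts]), mean_std([p[1] for p in pts]))
    rows = [features(x, y, stats) for x, y in pts]
    labels = [1.0] * len(real) + [0.0] * len(fake)
    w = fit_logistic(rows, labels)
    return [sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            for f in rows[:len(real)]]


# Toy data: a horizontal "shipping lane" plus one planted point far from it.
rng = random.Random(42)
real = [(rng.uniform(0.0, 1.0), rng.gauss(0.5, 0.05)) for _ in range(200)]
real.append((0.5, 0.95))
scores = score_real_points(real, n_fake=200)
suspect = min(range(len(scores)), key=lambda i: scores[i])  # least "real-looking"
```

Real points that the classifier cannot tell apart from uniform noise get low scores, which is why the fake observations are needed to flag the unusual real ones. Any classifier could stand in for the logistic regression here.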

  • Robert D.

    Great topic, and definitely more on the technical side, as proven by the questions asked after the presentation.

    November 22, 2013

    • Doug_S

      And where are you these days? Rumor has it you left Intridea.

      November 23, 2013

  • Doug_S

    For starters, how about plugging the points into something like ArcGIS and looking (pictorially) for relationships between where things happen and when they happen? ArcGIS will also let you draw density maps of how often things happen, by location -- a way to get some statistical insight without abandoning the visual depiction.

    November 23, 2013

  • Doug_S

    The topic was stimulating, and the presentation introduced it well. However, I don't think this presentation is a good starting point for a discussion of data science versus statistical modeling. What we saw was too preliminary to frame that discussion. We need to see what would happen with exploratory data analysis (a la Tukey and Mosteller, classic text from about 1977), other kinds of statistical models (as another commenter pointed out, logistic regression is not the best modeling approach for outlier detection), maybe nearest-neighbor discriminant analysis.

    November 23, 2013

  • Brand N.

    Continued
    So statistics seems to be inclined to accept a data set and try to model it while data science says look around first to see if there is better data to gain insight into the problem and then do the number crunching.

    What does the community think?

    November 23, 2013

  • Brand N.

    So I am going to write a blog about this presentation because I think it illustrates the difference between statistics and data science. I asked the question after the presentation about combining the 6 column by 600,000 row data set with other data that could/would allow one to be more confident of the results and their value and the presenter agreed that should be done.

    The title seems misleading to me because one does not need or should not use Outlier Detection for Identifying Smugglers in a very limited data set when that has already been done by for example the United Nations Drug Report 2013: http://www.unodc.org/wdr/ with really big geo-spatial data (the entire world for multiple years). One can download a spreadsheet:
    http://www.unodc.org/unodc/secured/wdr/wdr2013/Seizures_2007-2011.xls and do both meaningful statistics and data science.

    November 23, 2013

  • Michael K.

    I just wanted to point out that logistic regression (glm with logit link) is not robust to outliers in the explanatory variables. http://www.sciencedirect.com/science/article/pii/S0167947302003043
    This might not be a problem since you can use any (possibly robust) classifier instead of logistic regression. If you're interested in robust methods R has a lot of useful tools for it: http://cran.r-project.org/web/views/Robust.html

    November 22, 2013

    • Nathan D.

      This is a great point, and was also brought to my attention by another audience member after the talk. I'll note that, given the density of the data, the presence of a few observations with extreme values for one or two covariates is unlikely to present problems in this particular application. That said, there's no reason not to choose a method that would be robust to this concern!

      November 22, 2013

  • Nathan D.

    All,

    Thanks for your wonderful attention, feedback, and company at Data Drinks! R code and slides are now up at my website (http://www.nathandanneman.com/presentations-and-materials). I'd like to share the data as well, but haven't found a clever way to post it to wordpress. Any thoughts?

    Thanks,
    Nathan Danneman

    1 · November 22, 2013

    • Ryan H.

      Agreed, Dropbox works well.

      November 22, 2013

    • Ryan H.

      There's also datahub.io, which sometimes works well.

      November 22, 2013

  • Nevin H.

    Top notch high brow entertainment for all; I guess Emory does more than medical education. If I were a Navy Admiral allocating scarce resources to that part of the world, I wonder if I could test my assumptions about "best time of day to look for bad activity" with this kind of model?

    November 22, 2013

  • Eric W

    Fascinating presentation and some really good, thought-provoking questions following it. However, it would have been neat to compare the results of the logistic models to those obtained from alternative methods such as discriminant analysis.

    November 22, 2013

  • Harlan H.

    Thanks to everyone who attended! Would anyone like to write an event review for the blog? Free publicity! We'll have slides and audio available for you (and they'll be posted publicly soon too). Let me know!

    November 22, 2013

  • Bill E.

    Generally good.

    November 21, 2013

  • David J. E.

    I am very happy with this event. The speaker was very engaging, worked on a hard problem with an intriguing data set, and explained an interesting approach to solve his problem which is also able to be generalized to other contexts and data sets.

    One quibble, though: the speaker presented a slide with model-fitted probabilities that the boats were anomalous. These are not meaningful probabilities because they were generated from a logistic regression model with a case-control design (cases being real boats and controls being the randomly-generated fake boats). In this case, the regression coefficients are all statistically valid except for the intercept term, which will be intrinsically fixed from the start by the ratio of cases to controls. Without correcting the intercept term to reflect the true ratio of real boats to fake boats, it is not possible to interpret the probability output of a logistic model. The odds ratio, however, is still meaningful.

    3 · November 21, 2013
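One standard way to make the intercept adjustment described above is the prior correction for case-control sampling, which shifts the fitted intercept by the log of the ratio between the sample odds and the assumed population odds. The sketch below assumes the analyst can supply a population fraction of cases (`true_frac`), which the talk does not give; with synthetic controls that fraction is a modelling choice rather than an observed quantity. The function names are hypothetical.

```python
import math


def intercept_offset(sample_frac, true_frac):
    """Log of (sample odds of a case) / (assumed population odds of a case)."""
    for frac in (sample_frac, true_frac):
        if not 0.0 < frac < 1.0:
            raise ValueError("fractions must be strictly between 0 and 1")
    sample_odds = sample_frac / (1.0 - sample_frac)
    true_odds = true_frac / (1.0 - true_frac)
    return math.log(sample_odds / true_odds)


def corrected_probability(fitted_logit, sample_frac, true_frac):
    """Probability after subtracting the sampling offset from the fitted logit.

    fitted_logit is the linear predictor (intercept included) from the model
    trained on the case-control sample. The slope coefficients are untouched;
    only the intercept moves.
    """
    adjusted = fitted_logit - intercept_offset(sample_frac, true_frac)
    return 1.0 / (1.0 + math.exp(-adjusted))
```

When the sample and assumed population fractions are equal, the offset is zero and the fitted probabilities are unchanged. Because the same constant is subtracted from every logit, the ranking of boats by score does not change either.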

  • Robert D.

    Looking forward to the meetup tonight. See you all there!

    November 21, 2013

  • carlos r.

    A conflict came up for me so I cannot make it. I look forward to the online collateral

    November 20, 2013

  • A former member

    Can't make it! :( Hack and Tell conflicts. I'd love to see slides if they can be made available afterward though! :)

    November 18, 2013

    • Nathan D.

      That's a shame Aaron. I'll make sure the slides get posted somewhere accessible.

      November 19, 2013

  • Andrea

    Sorry still recovering from a 48 hr hackathon

    November 19, 2013

  • A former member

    Looking forward to it!

    November 19, 2013

  • Michael K.

    I'd be interested in talking about outlier detection and potential academic papers with the presenter. My experience in this area can be found @ mathworks.com/matlabcentral/fileexchange/authors/307195

    November 18, 2013

  • Jason H.

    This should be great.

    November 13, 2013

  • Allen

    Looking forward to it

    November 6, 2013

  • Cindy C.

    eager to learn more

    November 3, 2013

  • Fola

    Na

    November 2, 2013

  • Andrew D.

    Brand new to the Analytics world, been doing the GIS for a while.. I look forward to the meeting..

    October 25, 2013

  • Joe C.

    Thanks for the invite.

    October 23, 2013

  • Ajaya U.

    Looking forward to it. Used WinBUGS for geospatial data for some time.

    October 23, 2013

  • Tom F.

    Looking forward to it!

    1 · October 23, 2013

  • Cindy C.

    eager to learn more

    October 23, 2013

125 went
