Ted Dunning is Chief Applications Architect at MapR Technologies and a contributor to the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. He's also a mentor to the Apache Storm project. Opinionated about software and data mining and passionate about open source, he is an active participant in the Hadoop community and loves helping projects get going with new technologies. He contributed to Mahout clustering, classification, and matrix decomposition algorithms and helped expand the new version of the Mahout Math library. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems, and built fraud-detection systems for ID Analytics (LifeLock). Ted has a PhD in computing science from the University of Sheffield. When he's not doing data science, he plays guitar and mandolin. @ted_dunning
Deep Learning for High Performance Time-series Databases
"Recent developments in deep learning make it possible to improve time series databases. I will show how these methods work and how to implement them using Apache Mahout.
Systems such as the Open Time Series Database (OpenTSDB) make good use of the ability of HBase, MapR tables and related databases to store columns sparsely. This allows a single row to store many time samples and allows raw scans to retrieve a large number of samples very quickly for visualization or analysis. Typically, older data points are batched together and compressed to save space. At high insertion rates, this approach falters, largely because of the limited insert/update rate of HBase. In such situations, it is often better to buffer short segments of data and insert batches that span short time ranges rather than inserting individual data points.
When inserting compressed batches in this fashion, there are a number of obvious strategies that can be used. General compression utilities such as gzip do not normally provide particularly high compression ratios. Bespoke compression systems may provide point solutions with high compression ratios, but they are generally fairly time-intensive to develop. I will describe how deep learning and sparse-coding techniques can be used to build systems that have very high compression levels (50x or more is typical) and which have the very interesting property that the resulting compressed data can often be queried or analyzed directly without ever decompressing the data. Moreover, it is possible to selectively decompress signals only from desired time ranges within a compressed batch.
These new techniques for building time-series databases enable some exciting capabilities. The benefits include the ability to do query push-down into the time-series database from systems like Apache Drill, better visualization systems, and the ability to build an interesting form of anomaly detector on top of the time-series database.
I will describe how to build these systems using Apache Mahout and illustrate the results with several real examples."
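For readers who want a feel for the batching idea in the abstract before the talk, here is a minimal sketch. It is not the speaker's method and does not use the OpenTSDB, HBase, or Mahout APIs. It only groups samples into short time windows and packs each window into one compressed blob, using general-purpose zlib as a stand-in for the learned compression the talk describes. The window width and the names `WINDOW_SECONDS`, `batch_samples`, `encode_batch`, and `decode_batch` are hypothetical.

```python
import struct
import zlib
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical batch width; not taken from the talk


def batch_samples(samples, window=WINDOW_SECONDS):
    """Group (timestamp, value) pairs by the start of their time window."""
    batches = defaultdict(list)
    for ts, value in samples:
        batches[ts - ts % window].append((ts, value))
    return dict(batches)


def encode_batch(points):
    """Pack one batch into a single blob (int64 timestamp + float64 value per point)."""
    raw = b"".join(struct.pack("<qd", ts, value) for ts, value in points)
    return zlib.compress(raw)  # stand-in for a learned, higher-ratio codec


def decode_batch(blob):
    """Recover the (timestamp, value) pairs stored in one blob."""
    raw = zlib.decompress(blob)
    return [struct.unpack_from("<qd", raw, i) for i in range(0, len(raw), 16)]


# Synthetic samples every 5 seconds over 3 minutes (made-up data for illustration).
samples = [(t, 20.0 + 0.5 * (t % 7)) for t in range(0, 180, 5)]

# One write per window instead of one write per data point.
rows = {start: encode_batch(pts) for start, pts in batch_samples(samples).items()}
```

Here each key in `rows` plays the role of a row key for one time window, so a reader interested in a single window can fetch and decode just that blob, for example `decode_batch(rows[60])`.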
Special thanks to Saba El-Hilo for organizing this.
• 6:00PM Doors are open, feel free to mingle
• 6:20 Presentation starts
• 8:00 Off to a nearby watering hole (Mr. Brownstone?) for a pint, food, and/or breakout discussions
By transit, there are a number of high-frequency buses (check Google Maps or the Translink site for your particular case) that will get you there. For drivers, there is a fair bit of street parking (free and pay) in the area, especially after 6.
How to Contact Us / Re Comments
Please note any comments you add to this event (below) will be e-mailed to all members of the group. We're trying to avoid spamming the list, so please do not use comments for jokes, job postings, requests for help programming something or anything else off topic. If you have questions or need to contact us, use the 'contact us' link on the left. Thanks!