AWS at Common Crawl
Stephen Merity, Common Crawl (http://commoncrawl.org/)
Common Crawl (http://commoncrawl.org/) is a non-profit that crawls and releases billions of web pages per month, freely available over HTTP and S3. The August crawl (http://commoncrawl.org/august-2014-crawl-data-available/), released this month, contains 2.8 billion pages across approximately 200TB.
Our Hadoop cluster regularly fluctuates between 50 and 250 nodes, and by using spot instances we usually keep the average spend per day under $100. The Common Crawl data format, WARC, allows reasonable compression ratios while still supporting efficient random access to specific files through S3 byte-range downloads. Although Amazon EMR is more expensive than rolling our own cluster, we let volunteers use it when running experiments over our dataset because it really does simplify their lives. We want our volunteers contributing, not managing a cluster.
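As a rough illustration of why byte-range access works on compressed WARC data, the sketch below assumes each record is stored as its own gzip member, so a slice of the file can be decompressed without reading anything before it. The in-memory `blob` stands in for an S3 object, and `range_header` and `read_record` are hypothetical helper names, not part of any Common Crawl tooling.

```python
import gzip


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`.

    The end position in a Range header is inclusive, hence the -1.
    """
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_record(blob: bytes, offset: int, length: int) -> bytes:
    """Decompress a single gzip member cut out of a larger file.

    Slicing `blob` here stands in for the body returned by a ranged GET.
    """
    return gzip.decompress(blob[offset:offset + length])


# Simulate a tiny WARC file: every record is compressed separately,
# then the compressed members are concatenated.
records = [
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\npage one",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\npage two",
]
members = [gzip.compress(r) for r in records]
blob = b"".join(members)

# Offset and length of the second record, as an index would store them.
offset = len(members[0])
length = len(members[1])

second = read_record(blob, offset, length)
headers = range_header(offset, length)
```

Against the real dataset, the same offset and length would be sent as the `Range` header of an HTTP or S3 request, and only that record's bytes would be transferred.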
Building an Anomaly Detection Service for AWS
Scott Purdy, Engineering Manager, Numenta (http://numenta.com/)
Grok for IT provides an automated anomaly detection service in AWS. This lightning talk will describe how we designed Grok to be packaged and distributed through the AWS Marketplace and to use AWS services such as CloudWatch, giving customers an anomaly detection system that is trivial to integrate into existing AWS setups. Grok also supports anomaly detection on arbitrary streaming metrics through its custom metrics feature. Sample applications that use the anomaly service will be demonstrated.
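This abstract does not describe Grok's detection algorithm, and the sketch below is not it. As a stand-in, it uses a plain rolling z-score to show what flagging anomalies on a streaming custom metric can look like. The class name `RollingZScore`, the window size, the threshold, and the sample values are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScore:
    """Flag values that sit far from the mean of a recent window."""

    def __init__(self, window: int = 20, threshold: float = 3.0):
        self.values = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # z-score above which we flag

    def update(self, x: float) -> bool:
        """Score x against the current window, then add it to the window."""
        anomalous = False
        if len(self.values) >= 2:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                anomalous = abs(x - mu) / sigma > self.threshold
        self.values.append(x)
        return anomalous


# Hypothetical metric stream with one obvious spike.
detector = RollingZScore(window=10, threshold=3.0)
stream = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 50.0, 10.0]
flags = [detector.update(v) for v in stream]
```

Only the spike is flagged here. Note that the spike then enters the window and inflates the standard deviation, which is one reason simple thresholds like this are a weak substitute for a purpose-built detector.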
AWS Data Pipelines at VigLink
Clay Kim, VigLink (http://www.viglink.com/)
Clay will discuss how VigLink uses Data Pipeline for ETL from S3 into Redshift.
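The load step of an S3-to-Redshift pipeline usually comes down to a Redshift COPY from an S3 prefix. The sketch below only builds such a statement as a string. It says nothing about VigLink's actual setup: the helper `build_copy_sql`, the table name, the bucket, the IAM role ARN, and the choice of gzipped CSV are all hypothetical.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads gzipped CSV from S3.

    Inputs are interpolated directly, so they must come from trusted
    configuration rather than user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# Hypothetical values for illustration only.
sql = build_copy_sql(
    table="clicks",
    s3_uri="s3://example-bucket/clicks/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoad",
)
```

In a Data Pipeline definition the equivalent load is configured declaratively rather than written by hand, but the resulting operation against Redshift is the same kind of bulk COPY.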
Migrating from EC2 Classic to VPC
Ben will discuss specifics of migrations, including:
• Managing Security Groups and Subnets
• Migrating "classic" RDS databases to VPC with minimal downtime
• Switching out load balancers and DNS
Things we missed and would have liked to do better: transferring ElastiCache systems, and extending the scripting and automation that we use to launch instances and perform other tasks so that it covers the whole architecture.