Initial MySQL and MongoDB talk: Modern data lake (MySQL,Mongo,HBase,Solr,Hive)


Details
18:00-19:30
Modern data lake: MySQL, MongoDB, HBase and Solr ecosystem for real-time big data overviews
We will describe one of the most challenging projects in our advertising system, Sklik.cz (a Czech counterpart to Google AdWords). We will talk about our journey from a single database to a sharded MySQL system cooperating with an HBase cluster (used for mass aggregation of statistical data and real-time web-based viewing), a MongoDB cluster (used as a history event log server), Solr (used as a real-time full-text search engine), and Hive/Impala (used as a batch BI tool).
The first step was to split one large MySQL database holding relational user data into multiple shards. Each shard is a Galera cluster of at least eight servers. On top of these Galera clusters (about 150 servers in total) we built an in-house routing mechanism. We are now working on Spark jobs that will be able to analyze data across all shards.
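To give a rough idea of what routing over shards involves, here is a minimal sketch. It is not our in-house mechanism: the hash-modulo scheme, the `ShardRouter` class, and the shard names are all assumptions made purely for illustration.

```python
import hashlib


class ShardRouter:
    """Hypothetical router: maps a user ID to one of a fixed list of shards."""

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self._shards = list(shard_names)

    def shard_for(self, user_id: int) -> str:
        # Hash the ID so that consecutive IDs spread evenly over the shards.
        digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]


# Shard names are made up; each would stand for one Galera cluster.
router = ShardRouter(["galera-01", "galera-02", "galera-03"])
target = router.shard_for(123456)
```

A plain modulo scheme like this reshuffles most users when a shard is added, which is one reason real routing layers often keep an explicit lookup table or use consistent hashing instead.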
Historical and current statistical data were migrated into an HBase cluster (fifty servers), where we implemented an aggregation and sorting system using HBase coprocessors and filters. The entire ecosystem is synchronized by Spark jobs.
Data changes are tracked and stored in a MongoDB cluster (we use the MongoRocks engine). From there we can access these changes for web-based viewing and real-time historical reconstruction. The data are also transformed into ORC format via our MongoDB connector and stored in Hadoop for more demanding analytics through Hive/Impala (twenty servers).
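Historical reconstruction from a change log can be pictured as replaying events up to a chosen moment. The sketch below only illustrates that idea: the event shape (`ts`, `changes`) and the `reconstruct` function are hypothetical and do not reflect our actual MongoDB schema.

```python
def reconstruct(events, as_of):
    """Replay change events in time order and return the state at `as_of`.

    Each event is assumed to be a dict with a numeric timestamp `ts`
    and a `changes` dict of field -> new value (hypothetical schema).
    """
    state = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] > as_of:
            break  # later events must not affect the reconstructed state
        state.update(event["changes"])
    return state


# Made-up change log for a single ad campaign.
log = [
    {"ts": 1, "changes": {"name": "Spring sale", "budget": 100}},
    {"ts": 2, "changes": {"budget": 250}},
    {"ts": 3, "changes": {"name": "Summer sale"}},
]
state_at_2 = reconstruct(log, as_of=2)
```

In a real deployment the events would be read from the log store filtered by entity and time range, rather than sorted in memory.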
Text phrases stored in MySQL, HBase, and MongoDB are indexed via SolrCloud (forty servers).
Besides the databases described above, our data lake contains three other open source databases (Cassandra, Aerospike, Couchbase). The whole data lake is built across two datacenters.
Our presentation will demonstrate a way to transform a legacy, slow, non-scalable system into a modern and highly scalable ecosystem where MySQL works alongside NoSQL databases to process large amounts of data in a very short time. While describing this use case, our data lake, and its monitoring, we'll briefly introduce HBase and Solr and their most important aspects to MySQL users who have not yet had a chance to work with them.
Speakers:
Michal Kuchta (Senior BigData Developer, Seznam.cz),
Radim Špigel (Senior Relational Database Developer, Seznam.cz),
Michal Fizek (Senior BigData Developer, Seznam.cz),
Tomáš Komenda (Database Architect, Seznam.cz)
If you have interesting topics to share, please let me know :)

