
Initial MySQL and MongoDB talk: Modern data lake (MySQL, Mongo, HBase, Solr, Hive)

  • Seznam.cz

    Radlická 3294/10, 150 00 Praha 5-Anděl, Czechia, Prague

    50.070435 14.401274

  • Guides will be waiting at the Seznam.cz reception
  • 18:00-18:25

    MongoDB from DBA point of view

    We run MongoDB for many of our services, one of which is the publishing system for our news. We are working on optimal sharding, data replication, containerization, and spreading data across multiple data centers. Šárka Jičinská will talk about her daily work with MongoDB.

    Speaker:

    Šárka Jičinská (DBA and system administrator, Seznam.cz)


    18:30-19:30

    Modern data lake: MySQL, MongoDB, HBase and Solr ecosystem for real-time big data overviews

    We will describe one of the most challenging projects in our advertising system, Sklik.cz (the Czech counterpart of Google AdWords). We will talk about our journey from a single database to a sharded MySQL system cooperating with an HBase cluster (used for mass aggregation of statistical data and real-time web-based viewing), a MongoDB cluster (used as a history event log server), Solr (used as a real-time full-text search engine), and Hive/Impala (used as a batch BI tool).

    The first step was to split one large MySQL database holding relational user data into multiple shards. Each shard is one Galera cluster of at least eight servers. On top of these Galera clusters (about 150 servers in total) we built an in-house routing mechanism. We are now working on Spark jobs that will be able to analyze data across all shards.

    Historical and current statistical data were migrated into an HBase cluster (fifty servers), where we implemented an aggregation and sorting system using HBase coprocessors and filters. The entire ecosystem is synchronized using Spark jobs.

    Data changes are tracked and stored in a MongoDB cluster (we use the MongoRocks engine). From there we can access these changes for web-based viewing and real-time historical reconstruction. The data are transformed into ORC format via our MongoDB connector and stored in Hadoop for more demanding analytics through Hive/Impala (twenty servers).

    Text phrases stored in MySQL, HBase, and MongoDB are indexed via SolrCloud (forty servers).

    Besides the databases described above, our data lake contains three other open-source databases (Cassandra, Aerospike, Couchbase). The whole data lake is built across two data centers.

    Our presentation will demonstrate a way to transform a legacy, slow, non-scalable system into a modern, highly scalable ecosystem where MySQL works alongside NoSQL databases to process large amounts of data in a very short time. While describing this use case, our data lake, and its monitoring, we'll briefly introduce HBase and Solr and their most important aspects to MySQL users who have not yet had a chance to work with them.

    Speakers:

    Michal Kuchta (Senior BigData Developer, Seznam.cz)

    Radim Špigel (Senior Relational Database Developer, Seznam.cz)

    Tomáš Komenda (Database Architect, Seznam.cz)


    If you have interesting topics to share, please let me know :) 
