What we're about
Upcoming events (2)
During our 6th event in our “Hello, “ series of introductory Big Data meetups, we’re going to introduce you to Apache HBase & Apache Phoenix as available through the CDP Operational Database data service. This discussion assumes you may have heard of them, but not that you understand how the combination is similar to or different from a traditional RDBMS.
This presentation will cover the basics of when to use HBase alone versus with Phoenix. We will discuss the strengths and weaknesses of HBase & Phoenix, and how to create & deploy an example Java-based application that works with them.
Some of the questions we will address include:
-How exactly does the Apache HBase architecture support both high availability and massive scalability?
-What are the most important use cases which demonstrate Phoenix’s most compelling features?
-How easy is it to build scalable, distributed applications leveraging developers’ existing knowledge of traditional relational databases?
We’ll answer these questions and more!
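As a taste of the scalability discussion, here is a minimal sketch of HBase-style row-key design: salting spreads sequential writes across pre-split regions (avoiding hotspotting), and a reversed timestamp makes scans return the newest rows first. The table layout, bucket count, and all names here are our own illustration, not code from the talk.

```python
import hashlib
import struct

SALT_BUCKETS = 8  # assumption: table pre-split into 8 regions

def salted_row_key(device_id: str, epoch_millis: int) -> bytes:
    """Build a row key of the form <salt><device_id><reverse-timestamp>."""
    # Deterministic salt: the same device always lands in the same bucket.
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    # Subtract from the max 64-bit value so larger (newer) timestamps
    # produce byte-wise smaller keys, i.e. newest rows sort first.
    reverse_ts = 0xFFFFFFFFFFFFFFFF - epoch_millis
    return bytes([salt]) + device_id.encode() + struct.pack(">Q", reverse_ts)
```

Because HBase stores rows sorted by key, this layout lets a scan for one device start at its salt bucket and read its most recent events immediately; Phoenix offers the same idea declaratively via its `SALT_BUCKETS` table option.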
This is still a tricky time for public gatherings, but Future of Data is committed to providing great tech content & facilitating discussions in the "Big Data" space. To do our part to fight the spread of COVID-19's Delta variant, this will be an exclusively online event with an originating time zone of EDT (the event time displayed on this page will reflect your equivalent local time). Although our Providence, RI-focused group is the host venue, we thought the event might be of interest to our wider membership (you are welcome to sign up for it here).
Many Spark ETL examples show data transformation through Spark SQL, but not all data engineering tasks fit well into the table-oriented SQL paradigm. In some cases, the data may be in a non-tabular format, such as images, PDFs, or other binary data types. Spark is well suited to these sorts of data engineering tasks as well.
In this meetup, we will leverage the PySpark framework to read a large number of Digital Imaging and Communications in Medicine (DICOM) medical images produced by MRI scans, use the numpy and pydicom libraries to transform them into a dataset more suitable for training a machine learning model, and write the resultant PNG objects to an output S3 bucket. We are using the data from the "RSNA-MICCAI Brain Tumor Radiogenomic Classification" Kaggle competition, but this approach can be used for general-purpose DICOM processing.
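To give a flavor of the per-image transform step, here is a simplified sketch assuming a 2-D pixel array like the one pydicom's `Dataset.pixel_array` returns for an MRI slice. The function name and windowing choice (simple min-max scaling) are our own illustration, not the actual meetup code.

```python
import numpy as np

def to_uint8(pixels: np.ndarray) -> np.ndarray:
    """Min-max scale a raw MRI slice into the 0-255 range PNG expects."""
    pixels = pixels.astype(np.float32)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:  # constant (e.g. all-black) slice: avoid divide-by-zero
        return np.zeros(pixels.shape, dtype=np.uint8)
    return ((pixels - lo) / (hi - lo) * 255.0).astype(np.uint8)
```

In the Spark job, a function like this would run inside the executors over the listed DICOM files, with each resulting uint8 array encoded as a PNG and written to the output S3 bucket.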
Link to Related Tutorial:
Link to Related Video:
Cloudera User's Page:
This is still a tricky time for public gatherings, but Future of Data is committed to providing great tech content & facilitating discussions in the "Big Data" space. To do our part to fight the spread of COVID-19's Delta variant, this will be an exclusively online event with an originating time zone of EDT (the event time displayed on this page will reflect your equivalent local time). Although our Northern Virginia-focused group is the host venue, we thought the event might be of interest to our wider membership (you are welcome to sign up for it here).