Realtime Advanced Analytics: Spark Streaming+Kafka, MLlib/GraphX, SQL/DataFrames

Nov 5, 2015 · Genève, Switzerland

Chris Fregly, who recently left Databricks (the company founded by the creators of Spark) to join the IBM Spark Technology Center in San Francisco, will present a real-world, open source, advanced analytics and machine learning pipeline using all 20 open source technologies listed below.
This Meetup is based on Chris's recent "Top-5" Hadoop Summit/Data Science talk, "Spark After Dark". Spark After Dark is a mock online dating site that uses Spark, Spark SQL, DataFrames, MLlib, GraphX, Cassandra, and ElasticSearch, among many other technologies listed below, to generate quality, real-time dating recommendations for its users.
Here are the Spark After Dark slides:
All code, along with the entire pipeline runtime, will be Dockerized and made publicly available on GitHub and the Docker Hub Registry.
Technologies to be demo'd:
1) Apache Zeppelin (Notebook-based Development)
2) Apache Spark SQL/DataFrames (Data Analysis and ETL) 
3) Apache Spark Streaming + Apache Kafka (Real-time Collection of Live Data from Interactive Demo) 
4) Spark Streaming + Real-time Machine Learning (K-Means Clustering, Logistic/Linear Regression)
5) Apache Spark MLlib + GraphX (Generate personalized and non-personalized recommendations using various algorithms and feature engineering techniques, including one-hot encoding)
6) MLlib + PMML Integration (Open Standard Markup Language for Predictive Models) 
7) Highly scalable, NetflixOSS-based Machine Learning Prediction Serving Layer including Service Discovery (Eureka) and Circuit Breakers (Hystrix) for Fault Tolerance
8) Zeppelin + Python-based scikit-learn Machine Learning 
9) Spark + Neo4j = MazeRunner (Real-time Neo4j Graph Updates Beyond GraphX Batch Analytics) 
10) SparkR (Distributed R algorithms)
11) Apache Spark JDBC/ODBC Thrift Server (Beeline and Tableau Analytics Explorer Integration) 
12) Tachyon (Off-heap storage) 
13) Spark Job Server (REST API for managing Spark jobs) 
14) Spark + Cassandra (NoSQL, Lambda Arch Speed Layer) 
15) Spark + ElasticSearch (Distributed Search Engine) 
16) Spark + Redis (Distributed, Persistent Key-Value Store Similar to Memcached)
17) Logstash (Log Agent + Collection) 
18) Kibana (ElasticSearch-based Analytics Explorer UI) 
19) HDFS + Parquet (Columnar Storage Format, Tight Compression, Lightning Fast Columnar Aggregations) 
20) Advanced visualizations within Zeppelin using Python-based matplotlib and ggplot
Reminder that we'll be Docker-izing everything for you to reuse. 
Keep an eye on the GitHub and Docker Hub Registry links under project name "fluxcapacitor":
