3 days. 1300 attendees. 500+ companies. 44% increase in event attendance.
All this for an Open Source project that is not Hadoop?
My colleague Yana Ponomarova and I, both data scientists, were invited to deliver a talk at Spark Summit East, which took place in beautiful New York City.
Spark Summit, organized by Databricks, is the largest event dedicated to the open source project Apache Spark, with 1300 attendees representing 500+ large and medium-sized companies such as Bloomberg, SAP, Capital One, and Weather.com (acquired by IBM), among others.
An Open Source project that is not Hadoop. So what’s all the buzz about?
Spark was originally developed by Matei Zaharia during his Ph.D. at the University of California, Berkeley. The code was later donated to the Apache Software Foundation.
Spark is a distributed computing framework like Hadoop MapReduce. It is built on the concept of the RDD (Resilient Distributed Dataset): a fault-tolerant collection of elements distributed across cluster nodes that can be processed in parallel.
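The RDD programming model can be sketched in plain Python without a Spark installation. The partitions and sample lines below are invented for illustration; in PySpark the same word count would be expressed with flatMap and reduceByKey on an RDD:

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of a dataset, as they would sit on different
# cluster nodes in a real RDD.
partitions = [
    ["spark is fast", "spark scales"],
    ["hadoop mapreduce", "spark and hadoop"],
]

# "Map" phase: each partition is processed independently (in Spark,
# on a separate executor), producing partial word counts.
partial_counts = [
    Counter(word for line in part for word in line.split())
    for part in partitions
]

# "Reduce" phase: partial results are merged, much like reduceByKey.
total = reduce(lambda a, b: a + b, partial_counts)

print(total["spark"])  # "spark" appears 3 times across all partitions
```

Fault tolerance in Spark comes from the fact that each partition can be recomputed from its lineage if a node is lost, rather than from replicating the data itself.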
2015 was a great year for Spark: the project passed the thousand-contributor mark, reached 66k meetup members, and saw the number of summit attendees grow by 350%.
Built on the concept of the RDD, Apache Spark offers many bricks to build upon:
- Spark SQL to query data with SQL-like syntax and manipulate DataFrames, column-oriented distributed collections
- Spark Streaming to process data in micro-batches and apply window operations
- MLlib to provide distributed implementations of machine learning algorithms
- GraphX to run analytics on distributed graphs (OLAP)
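As a rough illustration of the micro-batch windowing idea behind Spark Streaming, here is a plain-Python sketch. The batch contents and window size are invented; in real Spark Streaming this would be a window operation over a DStream:

```python
from collections import deque

# Simulated stream arriving in micro-batches (lists of events), as
# Spark Streaming would deliver them at each batch interval.
micro_batches = [[3, 1], [4], [1, 5], [9, 2]]

WINDOW = 3  # keep the last 3 micro-batches, like a sliding window

window = deque(maxlen=WINDOW)
window_sums = []
for batch in micro_batches:
    window.append(batch)
    # Aggregate over the current window, e.g. a windowed sum.
    window_sums.append(sum(x for b in window for x in b))

print(window_sums)
```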
The Event: Spark Summit East 2016
The event took place over the course of three days.
Day 1: Dedicated to Spark training, with sessions for developers, administrators, and data scientists.
Days 2 and 3: These featured talks from a variety of speakers who shared their experiences, best practices, and Spark features.
New Features
Spark version 2 will be released in April/May with several new features:
- Structured Streaming
- Better performance (5-10x)
- Unifying Datasets & Dataframes
Databricks announced the beta of their Community Edition, a free version of their main product: a micro Spark cluster, a cluster manager, and access to their great notebook environment to learn and build prototypes on their PaaS. You can register on the waiting list.
All talks and slides presented at the event can be viewed on the Spark Summit website.
Here are some talks that are my personal recommendations:
- Spark performance: what’s next, covering how the Tungsten engine is being optimized and upcoming architectural changes
- Building realtime data pipelines with Kafka Connect & Spark Streaming, a great explanation of Kafka Connect, a reliable server for consuming from and producing to almost anything around a Kafka cluster
- Lambda at Weather Scale, a best-of-breed example of Spark & Cassandra
- 5 myths about Spark & Big Data, a gentle introduction to your Spark & big data strategy
- Structuring Spark: DataFrames, Datasets and Streaming, a dive into the future of Spark
- TopNotch: Systematically Quality Controlling Big Data, a best-of-breed example of Spark testing
- Top 5 mistakes when writing Spark applications; if you have not made one of these mistakes, either you are really good or you are doing something wrong
- Continuous Integration for Spark Apps, on how to use Docker for CI without impacting your production cluster
Relationship Extraction from Unstructured Text Based on Stanford NLP with Spark - Session with Yana Ponomarova
About 80% of the information created and used by an enterprise is unstructured data locked inside content, and this figure is growing at twice the rate of structured data. Mastering and using the knowledge scattered across the abundance of unstructured documents in an organization can therefore bring a lot of value.
In the context of our client, a global Oil & Gas company, the valuable information was scattered across large volumes of engineering reports. Those reports had been written by engineers in a free, unconstrained format, often by non-native English speakers, and focused on the technical characteristics of Oil & Gas operations.
The primary challenge for the client was to extract the supply chain relationships (supplier, receiver, object of delivery, and transport) from those reports in order to evaluate the interdependency between its sites around the globe and better manage operational risks. It was obvious that, due to the sheer volume and complexity of these documents, the problem could not be successfully tackled by the company’s analysts. Hence, we developed an automated solution based on a Spark integration of Stanford NLP that processes the semantic structure of the sentences, retrieves pieces of supply chain information, matches them to pieces of the supply chain found in other sentences and other reports, and finally presents the result to the end user in the form of a graph. Implementing this on Spark allowed us to process the entire collection of reports in memory and to easily integrate the external Stanford NLP libraries.
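A heavily simplified sketch of the extraction step follows. The real solution relies on Stanford NLP dependency parses rather than the toy regex below, and the site names, sentence pattern, and function name are invented for illustration:

```python
import re

# Toy stand-in for a Stanford NLP dependency parse: a simple pattern
# over sentences shaped like "<supplier> delivers <object> to <receiver>".
PATTERN = re.compile(r"(\w[\w ]*?) delivers ([\w ]+?) to ([\w ]+)\.")

def extract_relations(report):
    """Return (supplier, object, receiver) triples found in one report."""
    return [m.groups() for m in PATTERN.finditer(report)]

reports = [
    "Site A delivers drilling pipes to Site B.",
    "No supply information in this sentence.",
    "Site C delivers valves to Site A.",
]

# In the real pipeline each report is a record in a Spark RDD/DataFrame
# and the extraction runs in parallel across executors (e.g. via flatMap);
# the resulting triples are then merged into a single supply chain graph.
triples = [t for r in reports for t in extract_relations(r)]
print(triples)
```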
$33 million and 1000 contributors later, Spark is rapidly becoming the poster child of distributed computing frameworks. And that is not even the main reason to adopt it. Spark is much faster than MapReduce thanks to its use of cluster RAM, which minimizes disk and network I/O. What makes it even more attractive to the masses of developers, data scientists, SQL analysts, and others is its ease of access: a simplified framework and support for development in multiple languages such as Java, Scala, Python, and R.
So what are you waiting for? Keep coding and Spark on…
You can listen to our session here: http://ow.ly/1029DM
For a deep dive into Spark, follow me at @nicolasclaudon