Large-Scale Data Processing with Apache Spark
By Kaitlin Logie, January 2017
What is Data Processing?
Data processing is often described as the “collection and manipulation” of data to produce meaningful information. It can involve many important steps.
These methods of processing must be rigorously documented to ensure the utility and integrity of the data. Data processing is very important in research data management!
(image by Offshore)
What is Apache Spark?
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.
(image by hops)
Apache Spark is the largest open-source project in data processing, and it can bring many benefits to your research:
- Speed! Engineered from the ground up for performance, Spark can be up to 100x faster than other software for large-scale data processing by exploiting in-memory computing and other optimizations. It is also fast when data is stored on disk, and in 2014 it set a world record for large-scale on-disk sorting.
- Ease of Use. Spark has easy-to-use APIs (Application Programming Interfaces) for operating on large datasets.
- A unified engine! Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
- Runs everywhere. Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources, which makes it extremely flexible and portable.
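To give a feel for the API, here is a minimal word-count sketch in Python. It is only an illustration, not an official example: the function names `local_word_count` and `spark_word_count`, the application name, and the sample input are all made up for this post. The sketch assumes Spark 2.0 or later with the `pyspark` package installed. The plain-Python function performs the same steps as the Spark version, so you can see what each Spark call is doing.

```python
from collections import Counter


def local_word_count(lines):
    """Plain-Python reference: split lines into words, then count each word."""
    words = (word for line in lines for word in line.split())
    return dict(Counter(words))


def spark_word_count(lines):
    """The same computation with Spark's RDD API (assumes pyspark is installed)."""
    # Imported here so the rest of the file loads without pyspark installed.
    from pyspark.sql import SparkSession

    # "local[*]" runs Spark on this machine using all available cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("word-count-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.parallelize(lines)
            .flatMap(lambda line: line.split())  # one record per word
            .map(lambda word: (word, 1))         # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)     # sum the counts for each word
            .collect()                           # bring the results back to the driver
        )
        return dict(counts)
    finally:
        spark.stop()
```

Calling `spark_word_count(["spark is fast", "spark is easy"])` on a machine with Spark available should give the same counts as `local_word_count` on the same input. The difference is that the Spark version can spread the work across many cores or machines when the data no longer fits on one.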
How do I get started?
Getting started with Apache Spark is not too difficult, whether you come from a Java or a Python background:
- Download the latest release. You can run Spark locally on your laptop.
- Read the quick start guide.
- Work through the free training videos and exercises from Spark Summit 2014.
- Learn how to deploy Spark on a cluster.
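Once Spark is on your laptop, a quick smoke test can confirm that everything works. The sketch below is one possible way to do that in Python, assuming Spark 2.0 or later and the `pyspark` package (installable with `pip install pyspark` for recent releases). The helper names, the `people` table, and the sample rows are hypothetical and exist only for this illustration.

```python
import importlib.util


def pyspark_installed():
    """Return True if the pyspark package can be found in this environment."""
    return importlib.util.find_spec("pyspark") is not None


def run_local_smoke_test():
    """Start a local Spark session, run one SQL query, and return the matching names."""
    from pyspark.sql import SparkSession

    # "local[1]" runs Spark on this machine with a single worker thread.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("smoke-test")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # Made-up sample data, just enough to exercise DataFrames and SQL.
        df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
        df.createOrReplaceTempView("people")
        rows = spark.sql("SELECT name FROM people WHERE age > 40").collect()
        return [row["name"] for row in rows]
    finally:
        spark.stop()
```

If `pyspark_installed()` returns `True`, calling `run_local_smoke_test()` should start Spark, run the query, and return the names of the people older than 40. If it fails, the error message is usually a good starting point for checking your Java and Spark installation.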
(image by rcntec)
If you still need more assistance with a specific research project, or just want a little one-on-one help, come along to one of our Hacky Hours! We will assist you as best we can!