Apache Spark | Open Source

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives can run certain applications up to 100 times faster. Because user programs can load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms, which typically make many passes over the same dataset. Spark can interface with a wide variety of file and storage systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
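
The difference between re-reading input on every pass and loading it once into memory can be illustrated with a small sketch. This is plain Python, not Spark's API: the names CountingSource, iterate_from_disk, iterate_in_memory, and demo are hypothetical and exist only for this illustration, and the "algorithm" is a toy running average standing in for an iterative machine learning job.

```python
import os
import tempfile


class CountingSource:
    """Hypothetical data source that counts how often its file is scanned."""

    def __init__(self, path):
        self.path = path
        self.scans = 0

    def read(self):
        # Each call is a full scan of the file on storage.
        self.scans += 1
        with open(self.path) as f:
            return [float(line) for line in f]


def iterate_from_disk(source, passes):
    """Disk-based style: every pass re-reads the input from storage."""
    estimate = 0.0
    for _ in range(passes):
        data = source.read()
        estimate = 0.5 * estimate + 0.5 * (sum(data) / len(data))
    return estimate


def iterate_in_memory(source, passes):
    """In-memory style: read once, then query the cached copy repeatedly."""
    cached = source.read()
    estimate = 0.0
    for _ in range(passes):
        estimate = 0.5 * estimate + 0.5 * (sum(cached) / len(cached))
    return estimate


def demo(passes=10):
    """Run both styles on the same small file and report the scan counts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.txt")
        with open(path, "w") as f:
            f.write("\n".join(str(x) for x in range(1, 101)))
        disk = CountingSource(path)
        mem = CountingSource(path)
        from_disk = iterate_from_disk(disk, passes)
        from_memory = iterate_in_memory(mem, passes)
        return disk.scans, mem.scans, from_disk, from_memory
```

Both functions compute the same result, but the disk-based version scans the file once per pass while the in-memory version scans it once in total. On a real cluster the saved work is storage and network I/O across many machines, which is where the speedup for iterative workloads comes from.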

Spark is one of the most actively developed open-source projects. In 2014 it had more than 465 contributors, making it the most active project both in the Apache Software Foundation and among open-source Big Data projects.

Category

Data Management

Specialization

Hadoop Tools