The Apache Spark website defines it as follows: “Apache Spark is a unified analytics engine for large-scale data processing”. Spark is a fast, general-purpose cluster computing system. It offers high-level APIs in Java, Scala, Python, and R, and ships with higher-level tools: Spark SQL for SQL and structured data processing, Spark Streaming for processing near-real-time data streams, MLlib for machine learning, and GraphX for graph processing.
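Spark builds on the ideas of the MapReduce model but generalizes and optimizes them, in particular by keeping intermediate data in memory where possible. Instead of just “map” and “reduce” functions, Spark offers a large set of operations called transformations and actions. Transformations (such as map and filter) lazily describe a new dataset, while actions (such as count and collect) trigger the computation and return a result. The Spark execution engine compiles a chain of these operations into a directed acyclic graph (DAG) of stages and tasks that run in parallel on the cluster.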
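The difference between transformations and actions can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's API: the class `LazyDataset` and the helper `square` are hypothetical names invented for illustration, and everything runs in a single process. It only shows the general idea that transformations record a step in a plan, while actions execute the plan and return a result.

```python
import functools


class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, plan=()):
        self._source = list(source)
        self._plan = plan  # recorded transformations, not yet executed

    # Transformations: record a step and return a new dataset. No work is done.
    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    # Internal helper: apply the recorded steps, in order, to the source data.
    def _run(self):
        items = iter(self._source)
        for kind, fn in self._plan:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the plan and return a concrete result.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return functools.reduce(fn, self._run())


calls = []


def square(x):
    calls.append(x)  # side effect so we can observe when work actually happens
    return x * x


numbers = LazyDataset(range(1, 6))

# Two transformations chained together: only the plan is built here.
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)  # still empty, nothing has run

# An action: the whole plan runs now and a single value comes back.
total = pipeline.reduce(lambda a, b: a + b)
print(calls_before_action, total)
```

In this sketch `square` is not called until `reduce` runs, which mirrors the way combining transformations only describes the processing and an action completes it. A real engine would additionally split the plan into tasks and distribute them across a cluster, which this toy model does not attempt.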
Spark is written in Scala, a statically typed high-level programming language. It can run on a Hadoop cluster or in standalone mode using its own built-in cluster manager. In addition, Spark can be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs (resilient distributed datasets), functions, variables, and classes and use them in parallel operations on a cluster. For production deployments, it is usually advised to use a cluster resource manager such as YARN or Mesos. Spark does not require Hadoop; it also works well with Amazon S3, Azure Blob Storage, HBase, Cassandra, and many other data sources.