Spark vs. Hadoop
As mentioned in previous chapters, Spark and Hadoop are two different frameworks with both similarities and differences, and each has its own pros and cons. So which one is better, Spark or Hadoop? There is no exact answer: the platforms are too different for a direct comparison, and everyone can find useful features in both of them. Let’s start with what each of them is designed for.
Both Spark and Hadoop are frameworks for general-purpose data analytics on a cluster of computers. Spark provides in-memory computation to speed up data processing. It can run on top of a Hadoop cluster and can access Hadoop’s data store (HDFS).
What about Hadoop? Its main aim is running map/reduce jobs, so it is a parallel data processing framework. Hadoop as a whole is a general-purpose framework that supports multiple processing models, and Spark is only an alternative to Hadoop MapReduce, not a replacement for Hadoop itself.
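To make the map/reduce model concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and only illustrate the three steps that a real framework would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample = [
    "spark and hadoop",
    "hadoop runs map reduce jobs",
    "spark runs on hadoop",
]
print(word_count(sample))
```

In a real cluster the map and reduce calls would run on many machines in parallel, and the shuffle step would move data between them; this toy version only shows the shape of the computation.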
What to Choose: Spark or Hadoop
As we said above, both Spark and Hadoop have advantages and disadvantages, but there are some properties you should note. The first and main difference is how much RAM each one uses. Spark uses more random access memory than Hadoop, but it “eats” less disk storage, so if you use Hadoop, it’s better to find powerful machines with plenty of disk storage. This small piece of advice will make your work more comfortable and convenient. Also, don’t forget that you can change your decision later; it all depends on your needs.
The next difference between Apache Spark and Hadoop MapReduce is that Hadoop keeps all of its data on disk, while Spark keeps data in memory. The third difference is the way each achieves fault tolerance. Spark uses Resilient Distributed Datasets (RDDs), a data storage model with built-in fault tolerance: an RDD records how it was derived, so lost data can be recomputed rather than replicated across machines, which is why it minimizes network I/O. To find out more about Resilient Distributed Datasets, please re-read the previous chapters.
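The idea of recovering lost data by recomputing it can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: the `ToyDataset` class and its methods are hypothetical names, and a "failure" is simulated simply by discarding the in-memory copy.

```python
class ToyDataset:
    """Toy stand-in for an RDD: remembers its parent and how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # original records (root dataset only)
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function applied to each parent record
        self._cache = None           # in-memory copy of the computed records

    def map(self, fn):
        # Record the transformation instead of running it right away.
        return ToyDataset(parent=self, transform=fn)

    def collect(self):
        # Compute the records only if no in-memory copy exists.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._transform(x) for x in self._parent.collect()]
        return self._cache

    def lose_cache(self):
        # Simulate a node failure that wipes the in-memory data.
        self._cache = None


numbers = ToyDataset(source=[1, 2, 3])
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_cache()          # the in-memory copy is gone
after = squares.collect()     # rebuilt from the recorded derivation
print(before == after)
```

Because each dataset knows its parent and its transformation, nothing has to be copied to another machine in advance; the lost result is simply derived again.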
What’s Better to Learn First: Hadoop or Spark?
I don’t think this is quite the right question. If you learn one of them well, you will have no trouble learning the other. Still, there are two different views on the matter.
The first says: “It’s better to learn Hadoop, because it’s fundamental.” And yes, learning the Hadoop technologies will give you a lot of fundamental knowledge, both theory and practical skills. You may also discover something new while using it.
The second view says: “It’s better to learn Spark, because it’s modern.” That is also true: Spark has a lot of interesting features, which are listed and explained in the next paragraphs. But don’t forget that Spark is only a processing framework that often runs on top of HDFS.
If you are a developer, you may hardly feel the difference between Hadoop and Spark. Spark is a framework that enables parallel computation through function calls; Hadoop is a library in which you write map/reduce jobs as Java classes.
And if you are an operator who runs clusters, the only differences you should notice are in code deployment and configuration monitoring.
Original Features of Apache Spark that Hadoop Doesn’t Have
When it comes to making a decision, it’s worth noting some features specific to Spark that may help you decide which framework suits you better: Apache Spark or Hadoop MapReduce. So let’s go through the best features of the more modern framework (many more are described on the official Apache Spark site):
Speed is really the main feature of Spark. It lets applications run up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster even when running on disk. Spark achieves this by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. As mentioned earlier, Apache Spark uses Resilient Distributed Datasets (RDDs), which store data transparently in memory and use disk storage only when it is needed. This reduces disk reads and writes, which are usually the most time-consuming part of data processing.
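The difference between the two ways of handling intermediate data can be illustrated with a small plain-Python sketch. It is not a benchmark and uses neither Spark nor Hadoop; the stage functions and the two runner functions are hypothetical names chosen for this example. The first runner writes the intermediate result to a temporary file between stages, the second passes it along in memory.

```python
import json
import os
import tempfile


def stage_one(records):
    """First processing stage: double every record."""
    return [r * 2 for r in records]


def stage_two(records):
    """Second processing stage: add everything up."""
    return sum(records)


def run_with_disk(records):
    """Write the intermediate result to a file between the two stages."""
    intermediate = stage_one(records)
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(intermediate, handle)   # extra disk write
        with open(path) as handle:
            reloaded = json.load(handle)      # extra disk read
    finally:
        os.remove(path)
    return stage_two(reloaded)


def run_in_memory(records):
    """Hand the intermediate result straight to the next stage."""
    return stage_two(stage_one(records))


data = [1, 2, 3]
print(run_with_disk(data) == run_in_memory(data))
```

Both runners produce the same answer; the in-memory version simply skips one write and one read per stage boundary, and on large data sets with many stages those skipped disk operations add up.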
Simple to Learn
Spark lets you develop applications in Java, Python, and Scala more quickly. You can create and run apps written in a familiar programming language, and building parallel applications becomes more convenient. You also have a set of more than 80 high-level operators built into the framework.
Combination of Old and New Features
Recent versions of Apache Spark offer more than plain map/reduce. The newer features are SQL queries, streaming, and complex analytics, and you can combine all of them in a single workflow.
Apache Spark now runs on Hadoop, on Mesos, standalone, or in the cloud.
Application Area of These Frameworks
Hadoop is used to process big and fast-growing data sets and is intended for processing unstructured data. Before using it, take into account that by itself it does not give real-time access to the data: the entire data set is processed each time a request is made.
Hadoop is used to build global intelligence systems, machine learning systems, correlation analysis of various data, and statistical systems. Hadoop cannot be used by itself as an operational database. Typically, in a corporate environment, Hadoop is used in conjunction with relational databases. Additional modules and external applications are used to compensate for the framework’s basic disadvantages.
Spark in Memory Database
Spark is a specialized distributed system for fast in-memory data processing. It integrates with Hadoop and, compared with the mechanism provided by Hadoop MapReduce, delivers up to 100 times better performance when processing data in memory and up to 10 times better when the data is placed on disk. The engine can run on the nodes of a Hadoop cluster, under Hadoop YARN, or in standalone mode. It supports processing data stored in HDFS, HBase, Cassandra, Hive, and any Hadoop input format (InputFormat). Unlike MapReduce, Spark does not store intermediate result sets on disk, unless they are too big to fit in RAM. Spark creates RDDs (Resilient Distributed Datasets), which can be stored and processed in memory in full or in part. RDDs have no rigid format.
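Keeping a dataset in memory "in full or in part" can be pictured with the following plain-Python sketch. It is only an illustration under simple assumptions, not how Spark manages storage: the `ToyPartitionStore` class is a hypothetical name, the memory limit is counted in partitions rather than bytes, and anything over the limit is written to a temporary directory.

```python
import os
import pickle
import shutil
import tempfile


class ToyPartitionStore:
    """Keeps partitions in memory up to a limit and puts the rest on disk."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory  # how many partitions fit in "RAM"
        self.in_memory = {}                 # partition id -> records
        self.on_disk = {}                   # partition id -> file path
        self._dir = tempfile.mkdtemp()

    def put(self, partition_id, records):
        if len(self.in_memory) < self.max_in_memory:
            self.in_memory[partition_id] = records
        else:
            # No room left in memory: spill this partition to disk.
            path = os.path.join(self._dir, f"part-{partition_id}.pkl")
            with open(path, "wb") as handle:
                pickle.dump(records, handle)
            self.on_disk[partition_id] = path

    def get(self, partition_id):
        if partition_id in self.in_memory:
            return self.in_memory[partition_id]
        with open(self.on_disk[partition_id], "rb") as handle:
            return pickle.load(handle)

    def close(self):
        # Remove the temporary directory used for spilled partitions.
        shutil.rmtree(self._dir, ignore_errors=True)


store = ToyPartitionStore(max_in_memory=1)
store.put(0, [1, 2, 3])                    # fits in memory
store.put(1, [{"user": "a"}, "free text"])  # spilled to disk
print(sorted(store.in_memory), sorted(store.on_disk))
print(store.get(1))
store.close()
```

Note that the records in a partition are ordinary Python objects of any shape, which mirrors the point that the data has no rigid format.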