Top Pros and Cons of Hadoop

Big Data is one of the major areas of focus in today’s digital world. Enormous amounts of data are generated and collected from the various processes a company carries out. This data can reveal patterns that show how the company can improve its processes, and it also contains feedback from customers. Needless to say, this data is vital to the company and should not be discarded. However, not all of it is useful; the futile portion should be separated from the valuable part and discarded. Various platforms are used to carry out this process, and the most popular among them is Hadoop. Hadoop can efficiently analyse the data and extract the useful information. Like any platform, it comes with its own set of advantages and disadvantages:

Pros

1. Range of data sources

The data collected from various sources may be structured or unstructured. The sources can be social media, clickstream data or even email conversations. Converting all of the collected data into a single format would take considerable time; Hadoop saves this time because it can derive value from data in any form. It also supports a variety of functions such as data warehousing, fraud detection and marketing campaign analysis.
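As a toy illustration of what “deriving value from any form of data” means in practice (the record formats and field names here are invented for illustration, not taken from Hadoop), consider normalizing a JSON clickstream event and a CSV log row into one common shape before analysis:

```python
import csv
import io
import json

def normalize(raw, fmt):
    """Turn a raw record from any supported source into a common dict shape."""
    if fmt == "json":                      # e.g. a clickstream event
        event = json.loads(raw)
        return {"user": event["user"], "action": event["action"]}
    if fmt == "csv":                       # e.g. an exported log row: user,action
        user, action = next(csv.reader(io.StringIO(raw)))
        return {"user": user, "action": action}
    raise ValueError(f"unsupported format: {fmt}")

# Two records from different sources, reduced to one shape
records = [
    normalize('{"user": "alice", "action": "click"}', "json"),
    normalize("bob,open_email", "csv"),
]
print(records)
```

Hadoop pipelines do the same kind of normalization at scale, except the parsing runs in parallel across the cluster.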

2. Cost effective

With conventional methods, companies had to spend a considerable portion of their budget on storing large amounts of data. In some cases they even had to delete large sets of raw data to make space for new data, risking the loss of valuable information. Hadoop solves this problem: it is a cost-effective storage solution because it runs on clusters of commodity hardware. This helps in the long run, since the entire raw data a company generates can be retained. If the company changes the direction of its processes in the future, it can easily refer back to the raw data and take the necessary steps. This would not have been possible with the traditional approach, where raw data was deleted to contain expenses.

3. Speed

Every organization wants its work done at a faster rate, and Hadoop enables this for data processing. Data is stored on a distributed file system, and because the tools that process the data run on the same servers where the data resides, processing happens faster: computation moves to the data rather than data moving to the computation. As a result, you can process terabytes of data within minutes using Hadoop.
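Hadoop’s processing speed comes from the MapReduce model: map tasks run on the nodes holding the data, and their outputs are shuffled and reduced. As a minimal in-process sketch of that model (plain Python, not Hadoop itself), word count looks like:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop mapper would for its local data split
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a Hadoop reducer would after the shuffle
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big wins", "data moves compute to data"]
shuffled = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffled))   # e.g. counts 'data' 3 times, 'big' twice
```

In a real cluster each `map_phase` call would run on the server storing that split of the file, which is exactly why co-locating storage and compute makes the job fast.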

4. Multiple copies

Hadoop automatically replicates the data stored in it, creating multiple copies (by default, HDFS keeps three replicas of each block). This ensures that data is not lost if a node fails. The data a company stores is important and should not be lost unless the company itself discards it.
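The replication factor is configurable cluster-wide through the standard `dfs.replication` property in `hdfs-site.xml`. A minimal fragment, shown here with the stock HDFS default of three replicas:

```xml
<!-- hdfs-site.xml: keep three copies of every block (the HDFS default) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Raising the value trades disk space for resilience; lowering it below three is generally only done on test clusters.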

Cons

1. Lack of Preventive Measures

When handling sensitive data collected by a company, providing the necessary security measures is mandatory. In Hadoop, security measures such as authentication are disabled by default. Whoever is responsible for data analytics should be aware of this fact and take the required measures, such as enabling Kerberos authentication, to secure the data.
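For example, switching authentication from Hadoop’s default “simple” mode (which trusts whatever username the client reports) to Kerberos is done in `core-site.xml`. A minimal fragment, assuming a cluster already provisioned with Kerberos principals and keytabs:

```xml
<!-- core-site.xml: require Kerberos instead of the default "simple" mode -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
```

A full secure deployment also needs per-daemon principal and keytab settings; the fragment above is only the switch that turns security on.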

2. Small Data Concerns

A few big data platforms in the market are not a good fit for small data workloads, and Hadoop is one of them: only large businesses generating big data can make full use of its functions. It cannot perform efficiently in small data environments, because the overhead of distributing work across a cluster outweighs the benefit for small files and datasets.

3. Risky Functioning

Java is one of the most widely used programming languages, and it has also been linked to various controversies because cybercriminals can exploit frameworks built on it. Hadoop is one such framework, built largely in Java. The platform therefore inherits this attack surface and can suffer unforeseen damage if left unsecured.

Every platform used in the digital world comes with its own set of advantages and disadvantages, yet these platforms serve a purpose that is vital to the company. Hence, it is necessary to check whether the pros outweigh the cons. If they do, utilize the pros and take preventive measures to guard against the cons. To learn more about Hadoop and pursue a career in it, enrol for a big data Hadoop certification. You can also build further skills with big data Hadoop online training courses.

KnowledgeHut

Author

KnowledgeHut is an outcome-focused global ed-tech company. We help organizations and professionals unlock excellence through skills development. We offer training solutions under the people and process, data science, full-stack development, cybersecurity, future technologies and digital transformation verticals.
Website : https://www.knowledgehut.com


Suggested Blogs

Types Of Big Data

Big Data is creating a revolution in the IT field, and the use of analytics is increasing drastically every year. We create about 2.5 quintillion bytes of data every day, so the field is expanding in B2C apps. Big Data has entered almost every industry today and is a dominant driving force behind the success of enterprises and organizations across the globe.

Let us first discuss: what is Big Data? “Data” is defined as ‘the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media’. The concept of Big Data is nothing complex; as the name suggests, “Big Data” refers to copious amounts of data that are too large to be processed and analyzed by traditional tools, and that are not stored or managed efficiently. Since the amount of Big Data increases exponentially (more than 500 terabytes of data are uploaded to Facebook alone in a single day), it represents a real problem in terms of analysis.

Types of Big Data

Classification is essential for the study of any subject, so Big Data is widely classified into three main types: structured, unstructured and semi-structured.

1. Structured data

Structured data refers to data that is already stored in databases in an ordered manner. It accounts for about 20% of the total existing data and is used the most in programming and computer-related activities. There are two sources of structured data: machines and humans. All the data received from sensors, weblogs and financial systems is classified as machine-generated data. This includes medical devices, GPS data, usage statistics captured by servers and applications, and the huge amount of data that moves through trading platforms, to name a few. Human-generated structured data mainly includes the data a human inputs into a computer, such as a name and other personal details. When a person clicks a link on the internet, or even makes a move in a game, data is created; companies can use this to figure out customer behaviour and make the appropriate decisions and modifications.

Let’s understand structured data with an example. The top 3 players who have scored the most runs in international T20 matches are as follows:

Player             Country       Scores   No of matches played
Brendon McCullum   New Zealand   2140     71
Rohit Sharma       India         2237     90
Virat Kohli        India         2167     65

2. Unstructured data

While structured data resides in traditional row-column databases, unstructured data is the opposite: it has no clear format in storage. The rest of the data created, about 80% of the total, accounts for unstructured big data. Most of the data a person encounters belongs to this category, and until recently there was not much to do with it except store it or analyze it manually. Unstructured data is also classified by source into machine-generated and human-generated. Machine-generated data accounts for satellite images, scientific data from various experiments, and radar data captured by various facets of technology. Human-generated unstructured data is found in abundance across the internet, since it includes social media data, mobile data and website content. This means that the pictures we upload to Facebook or Instagram, the videos we watch on YouTube and even the text messages we send all contribute to the gigantic heap that is unstructured data. Examples include text, video, audio, mobile activity, social media activity, satellite imagery and surveillance imagery; the list goes on and on.

Unstructured data is further divided into:

a. Captured data: data based on the user’s behaviour. The best example is GPS via smartphones, which tracks the user moment to moment and provides real-time output.

b. User-generated data: unstructured data that users themselves put on the internet, for example tweets and retweets, likes, shares and comments on YouTube, Facebook, etc.

3. Semi-structured data

The line between unstructured data and semi-structured data has always been unclear, since most semi-structured data appears unstructured at a glance. Information that is not in the traditional database format of structured data, but contains some organizational properties that make it easier to process, is classified as semi-structured. For example, NoSQL documents are considered semi-structured, since they contain keywords that can be used to process the document easily.

Difference between structured, semi-structured and unstructured data

Factors                  Structured data                            Semi-structured data                                      Unstructured data
Flexibility              Schema-dependent, less flexible            More flexible than structured, less than unstructured     Flexible; no schema
Transaction management   Matured transactions, concurrency          Transactions adapted from the DBMS, not matured           No transaction management or concurrency
Query performance        Structured queries allow complex joins     Queries over anonymous nodes are possible                 Only textual queries are possible
Technology               Relational database tables                 RDF and XML                                               Character and binary data

Big Data analysis has definite business value, as its analysis and processing can help a company achieve cost reductions and dramatic growth, so it is imperative that you do not wait too long to exploit this opportunity. Big Data is indeed a revolution in the field of IT, and despite the demand, organizations are currently short of experts. To minimize this talent gap, many training institutes offer courses on big data analytics that help you upgrade the skill set needed to manage and analyze big data. If you are keen to take up data analytics as a career, taking up big data training will be an added advantage.
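To make the structured/semi-structured/unstructured distinction concrete, here is a minimal Python sketch (the field names are invented for illustration) showing how a semi-structured JSON record carries its own organizational keys, while the same information as free text does not:

```python
import json

# A semi-structured record: no fixed schema, but self-describing keys
record = '{"player": "Rohit Sharma", "country": "India", "runs": 2237}'

# The same information as unstructured free text
text = "Rohit Sharma of India has scored 2237 T20 runs."

parsed = json.loads(record)
print(parsed["runs"])       # keys give direct, typed access -> 2237

# With free text we can only do a textual search
print("2237" in text)       # -> True, but with no field semantics
```

This is exactly why semi-structured formats are easier to process: the organizational properties travel with the data.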

How to Install Spark on Ubuntu

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. In this article, we will cover the installation procedure of Apache Spark on the Ubuntu operating system.

Prerequisites

This guide assumes that you are using Ubuntu and that Hadoop 2.7 and Java 8 are installed on your system.

System requirements: Ubuntu OS installed, a minimum of 8 GB RAM, and at least 20 GB of free space.

Installation procedure

Step 1: Making the system ready

Before installing Spark, ensure that Java 8 is installed on your Ubuntu machine. If it is not, install it with:

sudo apt-get install oracle-java8-installer

This creates a java-8-oracle directory under /usr/lib/jvm/. Now configure JAVA_HOME and PATH in the .bashrc file, which executes whenever you open a terminal:

vi .bashrc

Press i (insert), then add the following lines at the bottom of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export PATH=$PATH:$JAVA_HOME/bin

Press Esc, then type :wq! and press Enter to save the changes. Verify that Java is installed properly by checking its version; the command below should print it:

java -version

Step 2: Downloading Spark

Go to the official Apache Spark download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.

https://spark.apache.org/downloads.html

Or you can use a direct link to download:

https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

Step 3: Creating a Spark directory

Create a directory called spark under /usr/:

sudo mkdir /usr/spark

The command will prompt for your password. Check that the directory was created:

ll /usr/

Then change into it:

cd /usr/spark

Step 4: Downloading and extracting Spark

Download Spark 2.4.0 into the spark directory:

wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

If you run ll or ls, you should see spark-2.4.0-bin-hadoop2.7.tgz in the spark directory. Extract it:

sudo tar xvzf spark-2.4.0-bin-hadoop2.7.tgz

The archive is extracted as spark-2.4.0-bin-hadoop2.7; confirm with ll.

Step 5: Configuration

Configure the SPARK_HOME path in the .bashrc file. Go to the home directory and open the file:

cd ~
vi .bashrc

Press i, then add SPARK_HOME and PATH as below:

export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

Save and exit: press Esc, type :wq! and press Enter.

Step 6: Testing the installation

Verify that Spark is installed by launching the shell:

spark-shell

If the Spark shell starts, the installation succeeded. Let’s create an RDD and a DataFrame to finish.

a. RDDs can be created in three ways; here we define a list and parallelize it. Copy and paste the lines one by one on the command line:

val nums = Array(1,2,3,5,6)
val rdd = sc.parallelize(nums)

b. Now create a DataFrame from the RDD:

import spark.implicits._
val df = rdd.toDF("num")

This creates a DataFrame with num as a column. To display the data, use:

df.show()

How to uninstall Spark from the Ubuntu system

Follow the steps below to uninstall Spark on Ubuntu.

1. Remove SPARK_HOME from the .bashrc file: go to the home directory (cd ~), open .bashrc with vi, press i, delete the SPARK_HOME and PATH lines added earlier, then press Esc, type :wq! and press Enter to save.

2. Delete the downloaded and extracted Spark installers from the system:

sudo rm -r /usr/spark

This deletes the spark directory. Open a terminal and type spark-shell; you should now get an error, confirming that Spark has been successfully uninstalled from the Ubuntu system. You can also learn more about Apache Spark and Scala here.

How to install Apache Spark on Windows?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. In this document, we will cover the installation procedure of Apache Spark on the Windows 10 operating system.

Prerequisites

This guide assumes that you are using Windows 10 and that the user has admin permissions.

System requirements: Windows 10 OS, at least 4 GB RAM, and free space of at least 20 GB.

Installation procedure

Step 1: Go to the official Apache Spark download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.

Step 2: Once the download is completed, unzip the file using WinZip, WinRAR or 7-Zip.

Step 3: Create a folder called Spark under your user directory and copy-paste the content of the unzipped file into it:

C:\Users\\Spark

Step 4: Go to the conf folder and open the log file called log4j.properties.template. Change INFO to WARN (it can be ERROR, to reduce the log further). This step and the next are optional. Remove the .template extension so that Spark can read the file.

Step 5: Now we need to configure the path. Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables. Add the new user variable (or system variable) SPARK_HOME (to add a new user variable, click the New button under the user variables section) and click OK. Then add %SPARK_HOME%\bin to the Path variable and click OK.

Step 6: Spark needs a piece of Hadoop to run. For Hadoop 2.7, you need to install winutils.exe; you can find it on the page below. Download it.

Step 7: Create a folder called winutils in the C drive and a folder called bin inside it. Then move the downloaded winutils file to the bin folder:

C:\winutils\bin

Add the user (or system) variable HADOOP_HOME, in the same way as SPARK_HOME, and click OK.

Step 8: To run Apache Spark, Java should be installed on your computer. If you don’t have Java installed, follow this process.

Java installation steps: go to the official Java site, accept the licence agreement for Java SE Development Kit 8u201, download the jdk-8u201-windows-x64.exe file, double-click the downloaded .exe file, click Next through the installer windows, and click Close when the installation finishes.

Test the Java installation: open the command line and type java -version; it should display the installed version of Java. You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the user (or system) variables.

1. In the end, the environment variables have three new paths (if you needed to add the Java path; otherwise just SPARK_HOME and HADOOP_HOME).

2. Create the c:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder on first start, but it is best practice to create it yourself:

C:\tmp\hive

Test installation: open the command line and type spark-shell; if the Spark shell starts, the installation is complete. Let’s create an RDD and a DataFrame to finish.

1. RDDs can be created in three ways; here we define a list and parallelize it. Copy and paste these lines one by one on the command line:

val list = Array(1,2,3,4,5)
val rdd = sc.parallelize(list)

2. Now create a DataFrame from the RDD:

import spark.implicits._
val df = rdd.toDF("id")

This creates a DataFrame with id as a column. To display the data, use:

df.show()

How to uninstall Spark from the Windows 10 system

Please follow the steps below.

1. Remove the SPARK_HOME and HADOOP_HOME system/user variables: go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables, find SPARK_HOME and HADOOP_HOME, select them, and press the Delete button.

2. Edit the Path variable: select %SPARK_HOME%\bin and press Delete, select %HADOOP_HOME%\bin and press Delete, then click OK.

Open a Command Prompt and type spark-shell; you should now get an error, confirming that Spark has been successfully uninstalled from the system.
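Because most installation failures on Windows come down to a missing SPARK_HOME, HADOOP_HOME or JAVA_HOME, a small sanity-check sketch can help. This is illustrative Python, not part of Spark; the helper name and sample paths are invented, and it checks a sample mapping rather than the live environment so the result is reproducible:

```python
def missing_spark_vars(env):
    """Return the names of required Spark-related variables absent from env."""
    required = ["SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"]
    return [name for name in required if name not in env]

# A sample mapping standing in for os.environ; HADOOP_HOME is deliberately absent
sample = {
    "SPARK_HOME": r"C:\Users\me\Spark",
    "JAVA_HOME": r"C:\Program Files\Java\jdk1.8.0_201",
}
print(missing_spark_vars(sample))   # -> ['HADOOP_HOME']
```

To check a real machine, pass `os.environ` instead of the sample dict; an empty list means all three variables are set.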