Search

Insights Of Analytics And Big Data

There are a few torching questions which keep running in the brain of an accomplished/yearning investigation experts. The mainstream ones are: Which examination instrument would it be advisable for me to learn now? How much a fresher wins in examination/information science organizations in India? Which city has the most open doors for somebody having abilities set like me? Would I be able to locate a superior occupation in my present city or should I move? How much compensation should I be getting at this moment? What’s more, some more… After the US, India has the biggest request of examination and information science experts. Urban areas like Delhi, Bangalore, Hyderabad, Mumbai and so forth are driving the way at this moment. The one of a kind combination of huge information and machine learning has guaranteed a splendid future for IT experts in our nation and around the globe. Accordingly, an ever increasing number of individuals are moving their employments and endeavouring to get a head begin in examination industry. We have made a select infographic which highlights the real experiences produced from this report. For see these bits of knowledge in detail, you can download the entire report beneath. Enormous Data is rapidly turning into a basically critical driver of business accomplishment crosswise over areas, yet numerous administrators say they don’t think their organizations are prepared to capitalize on it. Bain and Company reviewed administrators at more than 400 organizations around the globe, most with incomes of more than $1 billion. We got some information about their information and investigation capacities and about their basic leadership speed and adequacy. Big data analytics course is used to better understand customers and their behaviors and preferences. Companies are keen to expand their traditional data sets with social media data, browser logs as well as text analytics and sensor data to get a more complete picture of their customers The outcomes were astonishing: We found that lone 4% of organizations are better than average at examination, a world class gather that puts into play the ideal individuals, instruments, information and deliberate core interest. These are the organizations that are as of now utilizing examination bits of knowledge to change the way they work or to enhance their items and administrations. What’s more, the distinction is as of now noticeable. These organizations are: As we depict in a friend brief, “Enormous Data: The hierarchical test,” accomplishing competency in Big Data is a three-section handle that requires setting the desire, developing the investigation ability and sorting out your organization to take advantage of the open door. This concise looks all the more carefully at the second step—developing the investigation ability—to perceive how pioneers utilize Big Data to excel. Types of big data that really aid business predictive analytics and can be used to support sales, marketing. Information, devices, individuals and aim. Pioneers develop their investigation capacities by putting resources into four things: information clever individuals, quality information, best in class instruments, and procedures and impetuses that bolster expository basic leadership. About 33% of organizations don’t do any of these well, and a large number of the rest exceed expectations in just a single or two ranges. In any case, to manufacture a high performing examination machine, you have to do every one of the four well. Accomplishment in every ability relies on upon quality in the others.
Rated 4.5/5 based on 17 customer reviews

Insights Of Analytics And Big Data

3K
Insights Of Analytics And Big Data

There are a few torching questions which keep running in the brain of an accomplished/yearning investigation experts.
The mainstream ones are:

  • Which examination instrument would it be advisable for me to learn now?
  • How much a fresher wins in examination/information science organizations in India?
  • Which city has the most open doors for somebody having abilities set like me?
  • Would I be able to locate a superior occupation in my present city or should I move?
  • How much compensation should I be getting at this moment?

What’s more, some more…

After the US, India has the biggest request of examination and information science experts. Urban areas like Delhi, Bangalore, Hyderabad, Mumbai and so forth are driving the way at this moment. The one of a kind combination of huge information and machine learning has guaranteed a splendid future for IT experts in our nation and around the globe. Accordingly, an ever increasing number of individuals are moving their employments and endeavouring to get a head begin in examination industry.

We have made a select infographic which highlights the real experiences produced from this report. For see these bits of knowledge in detail, you can download the entire report beneath.

Enormous Data is rapidly turning into a basically critical driver of business accomplishment crosswise over areas, yet numerous administrators say they don’t think their organizations are prepared to capitalize on it. Bain and Company reviewed administrators at more than 400 organizations around the globe, most with incomes of more than $1 billion. We got some information about their information and investigation capacities and about their basic leadership speed and adequacy. Big data analytics course is used to better understand customers and their behaviors and preferences. Companies are keen to expand their traditional data sets with social media data, browser logs as well as text analytics and sensor data to get a more complete picture of their customers

The outcomes were astonishing: We found that lone 4% of organizations are better than average at examination, a world class gather that puts into play the ideal individuals, instruments, information and deliberate core interest. These are the organizations that are as of now utilizing examination bits of knowledge to change the way they work or to enhance their items and administrations. What’s more, the distinction is as of now noticeable. These organizations are:

As we depict in a friend brief, “Enormous Data: The hierarchical test,” accomplishing competency in Big Data is a three-section handle that requires setting the desire, developing the investigation ability and sorting out your organization to take advantage of the open door. This concise looks all the more carefully at the second step—developing the investigation ability—to perceive how pioneers utilize Big Data to excel. Types of big data that really aid business predictive analytics and can be used to support sales, marketing.

Information, devices, individuals and aim.

Pioneers develop their investigation capacities by putting resources into four things: information clever individuals, quality information, best in class instruments, and procedures and impetuses that bolster expository basic leadership. About 33% of organizations don’t do any of these well, and a large number of the rest exceed expectations in just a single or two ranges. In any case, to manufacture a high performing examination machine, you have to do every one of the four well. Accomplishment in every ability relies on upon quality in the others.

Jasmini

Jasmini Vinarkar

Blog Author

Jasmini - Has over 7 years of experience in SEO. Currently working at <a />Imarticus Learning</a>education institute offering certified industry-endorsed training in Financial Services, Investment Banking, Business Analysis, IT, Business Analytics & Wealth Management.

Join the Discussion

Your email address will not be published. Required fields are marked *

1 comments

ramakrishnan 03 Aug 2018

Thanks for sharing this valuable information on big data.

Suggested Blogs

Apache Spark Vs Hadoop - Head to Head Comparison

Over the past few years, data science has been one of the most sought-after multidisciplinary fields in the world today. It has established itself as an essential component of numerous industries such as marketing optimisation, risk management, marketing analytics. fraud detection, agriculture, etc. Understandably, this has lead to increasing demand for resorting to different approaches to data.When we talk about Apache Spark and Hadoop, it is really difficult to compare them with each other. We should be aware that both possess important features in the world of data science and big data. Hadoop excels over Apache Spark in some business applications, but when processing speed and ease of use is taken into account, Apache Spark has its own advantages that make it unique. The most important thing to note is, neither of these two can replace each other. However, since they are compatible with each other, they can be used together to produce very effective results for many big data applications.To analyse how important these two platforms are, there is a set of parameters with which we can discuss their efficiencies such as performance, ease of use, cost, data processing, compatibility, fault tolerance, scalability, and security. In this article, we will talk about Apache Spark and Hadoop individually for a bit, followed by stressing these parameters to better understand their significance in data science and big data.What is Hadoop?Hadoop, also known as Apache Hadoop, is a project formed by Apache.org that includes a software library and a framework that enables the usage of simple programming models to distributed processing of large data sets (big data) across computer clusters. Hadoop is quite efficient in scaling up from single computer systems to a lot of commodity system, offering substantial local storage. Due to this, Hadoop is considered as an omnipresent heavyweight in the big data analytics space. There are modules that work together to form the Hadoop framework. Here are the main Hadoop framework modules:Hadoop CommonHadoop Distributed File System (HDFS)Hadoop YARNHadoop MapReduceHadoop’s core is based on the above four modules followed by many others like Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop. These are responsible for improving and extending Hadoop’s power to big data applications and large data set processing.Hadoop is utilised by numerous companies using big data sets and analytics and is the de facto model for big data applications. Initially, it was designed to take care of crawling and searching billions of web pages and collecting their information into a database, This resulted in Hadoop Distributed File System (HDFS), a distributed file system designed to run on commodity hardware and Hadoop MapReduce, a processing technique and a program model for distributed computing based on java.Hadoop comes handy when companies find data sets too large and complex to not being able to process the information in reasonably sufficient time. Since crawling and searching the web are text-based tasks, Hadoop MapReduce comes in handy as it is an exceptional text processing engine.An Overview of Apache SparkAn open-source distributed general-purpose cluster-computing framework, Apache Spark is considered as a fast and general engine for large-scale data processing. Compared to heavyweight Hadoop’s Big Data framework, Spark is very lightweight and faster by nearly 100 times. Although the facts say so, in fact, Spark runs up to 10 times faster on disk. Apart from that, it can perform batch processing but it really is good at streaming workloads, interactive queries, and machine-based learning.✓Streaming workloads✓Interactive queries✓Machine-based learning.Spark engine’s real-time data processing capability has a clear edge over Hadoop MapReduce’s disk-bound, batch processing one. Not only is Spark compatible with Hadoop and its modules, but it is also listed as a module on Hadoop’s project page. And because Spark can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it has its own page and a standalone mode. It can run as a Hadoop module and as a standalone solution which makes it difficult to make direct comparisons.Despite these facts, Spark is expected to diverge and might even replace Hadoop, especially in terms of faster access to processed data. Spark’s cluster computing feature enables it to compete with only Hadoop MapReduce and not the entire Hadoop ecosystem. That is why it can use HDFS despite not having its own distributed file system. To be concise, Hadoop MapReduce uses persistent storage whereas Spark uses Resilient Distributed Datasets (RDDs). What is RDD? This will be stressed in the Fault Tolerance section.The differences between Apache Spark and HadoopLet us have a look at the parameters using which we can compare the features of Apache Spark with Hadoop.Apache Spark vs Hadoop in a nutshellApache SparkParametersHadoopProcesses everything in memoryPerformance-wiseHadoop MapReduce uses batch processingHas user-friendly APIs for multiple programming languagesEase of UseHas add-ons such as Hive and PigSpark systems cost moreCostsHadoop MapReduce systems cost lesserShares every Hadoop MapReduce compatibilityCompatibilityCompliments Apache Spark seamlesslyHas GraphX, its own graph computation libraryData ProcessingHadoop MapReduce operates in sequential stepsSpark uses Resilient Distributed Datasets (RDDs)Fault ToleranceUtilises TaskTrackers to keep the JobTracker tickingComparatively lesser scalabilityScalabilityLarge ScalabilityProvides authentication via shared secret (password authentication)SecuritySupports Kerberos authenticationPerformance-wiseSpark is definitely faster when compared to Hadoop MapReduce. However, they cannot be compared because they perform processing in different styles. Spark is way faster because it processes everything in memory, even using disk for data that does not all fit into memory. The in-memory processing of Spark performs near real-time analytics for data from machine learning, log monitoring, marketing campaigns, Internet of Things sensors, security analytics, and social media sites. Hadoop MapReduce, on the other hand, utilises the batch-processing method so it understandably was never created for mesmerising speed. As a matter of fact, it was initially created to continuously gather information from websites during the times when data in or near real-time were not required.Ease of UseSpark does not only have a good reputation for its excellent performance, but it is also relatively easy to use along with providing additional support for languages like user-friendly APIs for Scala, Java, Python, and Spark SQL. Since Spark SQL is quite comparable to SQL 92, the user requires no additional knowledge to use it.Supported Languages:APIs for ScalaJavaPythonSpark SQL.Additionally, Spark is armed with an interactive mode to allow developers and users get instant feedback for questions and other actions. Hadoop MapReduce makes up for the lack of any interactive mode with add-ons like Hive and Pig, thus easing the workflow of Hadoop MapReduce.CostsApache Spark and Apache Hadoop MapReduce are both free open-source software.However, because Hadoop MapReduce’s processing is disk-based, it utilises standard volumes of memory. This results in companies buying faster disks with a lot of disk space to run Hadoop MapReduce. In stark contrast to this, Spark requires a lot of memory but compensates by settling with a standard amount of disk space running at standard speeds.Apache Spark and Apache Hadoop CompatibilityBoth Spark and Hadoop MapReduce are compatible with each other. Moreover, Spark shares every Hadoop MapReduce compatibility for data sources, file formats, and business intelligence tools via JDBC and ODBC.Data ProcessingHadoop MapReduce is a batch-processing engine. So how does it work? Well, it works in sequential steps.Step 1: Reads data from the clusterStep 2: Performs its operation on the dataStep 3: Writes the results back to the clusterStep 4: Reads updated data from the clusterStep 5: Performs the next data operationStep 6: Writes those results back to the clusterStep 7: Repeat.Spark performs in a similar manner, but the process doesn’t go on. It includes a single step and then to memory.Step 1: Reads data from the clusterStep 2: Performs its operation on the dataStep 3: Writes it back to the cluster.Moreover, Spark has GraphX, its own graph computation library. GraphX presents the same data as graphs and collections. Users have the option to use Resilient Distributed Datasets (RDDs) to transform and join graphs. This will be further addressed below in the Fault Tolerance section.Fault ToleranceThere are two different ways in which Hadoop MapReduce and Spark resolve the fault tolerance issue. Hadoop MapReduce utilises nodes like TaskTrackers to keep the JobTracker ticking. On the process being interrupted, the JobTracker reassigns every pending and in-progress operation to another TaskTracker. Although this process effectively provides fault tolerance, the completion times might get majorly affected even for operations having just a single failure.Spark, in this case, applies Resilient Distributed Datasets (RDDs), fault-tolerant collections of elements that can be operated side by side. References can be provided by RDDs in the form of datasets in an external storage system like shared filesystems, HDFS, HBase, or whatever available data source. This results in allowing a Hadoop InputFormat and Spark can create RDDs from every storage source that is backed by Hadoop. That covers local filesystems or one of those listed earlier.Below-mentioned is five main properties that an RDD possesses:A list of partitionsA function for computing each splitA list of dependencies on other RDDsA Partitioner for key-value RDDs by choice (provided that the RDD is hash-partitioned)Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)The persistence of RDDs to cache a dataset in memory across operations enables the speeding up of future actions by possibly ten folds. The cache of Spark is fault-tolerant, it will recomputed automatically by making use of the original transformations provided any partition of an RDD is lost.ScalabilityIn terms of scaling up, both Hadoop MapReduce and Spark are on equal terms in using the HDFS. Reports say that Yahoo holds a 42,000 node Hadoop cluster with no bounds while the most comprehensive Spark cluster holds 8,000 nodes. However, in order to support output expectations, the cluster sizes are expected to grow along with that of big data.SecurityKerberos authentication, considered to be quite hectic to manage is supported by Hadoop. Nevertheless, companies have been assisted by third-party vendors to leverage Active Directory Kerberos and LDAP for authentication and also allow data encrypt for in-flight and data at rest. Access control lists (ACLs) a traditional file permissions model are supported by Hadoop while it provides Service Level Authorization for user control in job submission, resulting in clients having the right permissions without any fail.For Spark though, it presently offers somewhat inadequate security as it provides authentication via shared secret (password authentication). However, if the user runs Spark on HDFS, then it can utilise HDFS ACLs and file-level permissions. Moreover, running Spark on YARN will enable the latter to have the capacity of using Kerberos authentication. That is the security takeaway from using Spark.  ConclusionApache Spark and Apache Hadoop form the perfect combination for business applications. Where Hadoop MapReduce has been a revelation in the big data market for businesses requiring huge datasets to be brought under control by commodity systems, Apache Spark’s speed and comparative ease of use compliments the low-cost operation involving Hadoop MapReduce.Like we discussed at the beginning of this article that neither of these two can replace one another, Spark and Hadoop form a lethal and effective symbiotic partnership. While Hadoop has features like a distributed file system that Spark does not have, the latter presents real-time, in-memory processing for the required data sets. Both Hadoop and Spark form the perfect combination for the ideal big data scenario. Rest assured, in this situation, both working in the same team is what goes in favour of big data professionals.You would be interested to know that Knowledgehut offers world-class training for Apache Spark and Hadoop. Feel free to check these courses to enhance your knowledge about both Apache Spark and Hadoop.
Rated 4.5/5 based on 12 customer reviews
6608
Apache Spark Vs Hadoop - Head to Head Comparison

Over the past few years, data science has been one... Read More

Why a Career in Big Data Is the Right Choice for You?

Are you in that job market where the Big Data skills are more appreciated? Confused about whether to make a career shift in Big Data or not? What will be the next career options available for me after Big Data? Just spend some time reading this blog and know the answers to all these questions and the reasons for making Big Data as a career choice.  “Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming.” – Chris LynchReasons to Must-Have Big Data in your career1.Increased Job Opportunities for Big Data professionalsWith the technology reaching greater heights, undoubtedly Big Data is becoming a buzz word and a growing need for the organizations in the upcoming years. But, as Jeanne Harris, a senior executive at Accenture Institute said- “Data is useless without the skill to analyze it.”Today, Big Data professionals have a soaring demand across organizations worldwide. Organizations are making huge use of Big Data to stay ahead of the competitive market. The candidates with Big Data skills and expertise are in high demand. According to IBM, the number of jobs for data professionals in the U.S will increase to 2,720,000 by 2020.2. Salary GrowthThe strong demand for Big Data professionals is affecting the wages for qualified professionals. According to Glassdoor, the salary provided by various organizations based on the employees working in these organizations in the US region are as follows:CompanySalaryJ.P. Morgan$93K – $100KCognizant Technology Solutions$92K – $98KCSAA Insurance Group$133K – $144KZipRecruiter$81K – $89KThe salary of Big Data professionals is directly proportional to the factors like the skills earned, education, experience in the domain, knowledge of technology, etc. Also, one needs to understand and solve the real-world Big Data problems and a good grasp of tools and technologies.   3. Massive Big Data adoptionForbes stated that- Big data adoption in enterprises is increased from 17% in 2015 to 59% in 2018, reaching a Compound Annual Growth Rate (CAGR) of 36%. Big Data is steadily spreading its wings across numerous sectors including sales, marketing, research and development, logistics,  strategic management, etc.According to the 'Peer Research – Big Data Analytics' survey by Intel, the decision has incurred that- Big Data is one of the top priorities of the enterprises taking part in the survey as they believe that it improves the performance of their organizations. From the survey, it is found that 45% of the respondents trust that Big Data will offer more business benefits to rank on the top of the Big data market.    “Bigiota Insight out forecasted that the Big Data market is expected to grow to $80 billion from current $40 billion making a revenue of $187 billion.”4. Various options in job titles and responsibilitiesBig Data professionals have an array of job titles open depending on the skills they have achieved so far. The options for the Big Data job aspirants are many where they are free to align their career paths based on their career interests. Some of the job roles Big Data professionals can play are as follows:Data EngineerBusiness Analyst,Visualization SpecialistMachine Learning ExpertAnalytics ConsultantSolution ArchitectBig Data Solution ArchitectBig Data Analyst5. Usage Across numerous firms/industriesToday, Big Data is used almost in every firm. The top 5 industries recruiting Big Data professionals widely are Professional, Scientific and Technical Services (27%), Information Technology (19%), Manufacturing (15%), Finance and Insurance (9%), Retail Trade (9%) and Others 21%.The career path of a Big Data professionalAlthough the term Big Data is used commonly nowadays, there are many career paths available for the Big Data professionals to stand out in the industries that can be explored as per one’s potentiality and interest. The career paths that Big Data professionals can play are:Data ScientistBig Data EngineerBig Data AnalystData Visualization DeveloperMachine Learning EngineerBusiness Intelligence EngineerBusiness Analytics SpecialistMachine Learning ScientistLet us see them in details:Data Scientist:This is the most sought-after career path in Big Data careers. The Data Scientists are the individuals who use their technical and analytical skills to extract meaning from data. They are responsible for collecting, cleaning, and manipulating data.Big Data Engineer:Big Data Engineer is a well-known and more demanding career option. Data Engineers are the professionals responsible for building the designs created by Solution Architects. They are responsible for developing, testing, managing, and maintaining the big data solutions in the enterprises.Big Data Analyst:Being a command on the big data technologies like Hadoop, Hive, Pig, etc. and analytics skills, Data Analyst finds out relevant information from the datasets. This is also most demanding in Big Data career.Data Visualization Developer:The data visualization developers have the responsibilities of designing, conceptualizing, developing the graphics or data visualization, and supporting the data visualization activities. They should have strong technical skills for implementing visualization using tools.Machine Learning Engineer:Today, Machine Learning has become a crucial part of Big Data. Being an expert in machine learning (Machine Learning Engineer) responsible for building the data analysis software to run the product code without human intervention.Business Intelligence Engineer:Business Intelligence Engineer is in more demand today as around 90 percent of IT professionals are planning to increase spending on BI tools, as stated in the Forbes report. BI engineers are responsible for managing the big data warehouses with the help of Big Data tools and solving complex issues related to Big Data.Business Analytics Specialist:Business Analytics Specialist is an expert in Business Analytics field who aids in developing the scripts to test scripts and carrying out testing. They are also responsible for taking up business research activities to analyze the issues for developing cost-effective solutions.Machine Learning Scientist:Machine Learning Scientist work most probably in the research and development department. They are responsible for developing the algorithms to use in adaptive systems, adding product suggestions, and forecasting the demand for the same.Conclusion:As per Entrepreneur, Businesses that use Big Data saw a profit increase from 8 to10 percent and almost 10% reduction in overall cost. Another survey from Forbes states that IBM predicts demand For Data Scientists will reach 28% by the year 2020. As the data pours in, many high-rated companies like Google, Apple, NetApp, Qualcomm, Intuit, FactSet, The MITRE Corporation, Adobe, Salesforce, and so on are investing in Big Data.   According to the most recent McKinsey report, companies based in the U.S. are seeking for hiring 1.5 million Managers and Data Analysts with the strong knowledge and experience in Big Data. One can attain the most in-demand Big Data skills by taking specialized training in Big Data to go for any of the Big Data careers available in the job market.With the rising demand that industries are witnessing, it is an ideal time to add Big data skills to your curriculum vitae and offer yourself the wings to fly in the job market with the ample of Big Data jobs available today!  
Rated 4.5/5 based on 1 customer reviews
9913
Why a Career in Big Data Is the Right Choice for Y...

Are you in that job market where the Big Data skil... Read More

Apache Spark Vs Hadoop MapReduce

Why we need Big Data frameworksBig data is primarily defined by the volume of a data set. Big data sets are generally huge – measuring tens of terabytes – and sometimes crossing the threshold of petabytes. It is surprising to know how much data is generated every minute. As estimated by DOMO:Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth.You can read DOMO's full report, including industry-specific breakdowns, here.To store and process even only a fraction of this amount of data, we need Big Data frameworks as the traditional Databases would not be able to store so much of data nor traditional processing systems would be able to process this data quickly. Here comes the frameworks like Apache Spark and MapReduce to our rescue and help us to get deep insights into this huge amount of structured, unstructured and semi-structured data and make more sense out of it.Market Demands for Spark and MapReduceApache Spark was originally developed in 2009 at UC Berkeley by the team who later founded Databricks. Since its launch Spark has seen rapid adoption and growth. Most of the cutting-edge technology organizations like Netflix, Apple, Facebook, Uber have massive Spark clusters for data processing and analytics. The demand for Spark is increasing at a very fast pace. According to marketanalysis.com report forecast, the global Apache Spark market will grow at a CAGR of 67% between 2019 and 2022. The global Spark market revenue is rapidly expanding and may grow up $4.2 billion by 2022, with a cumulative market valued at $9.2 billion (2019 – 2022).MapReduce has been there for a little longer after being developed in 2006 and gained industry acceptance during the initial years. But at last, 5 years or so with Apache Spark gaining more ground, demand for MapReduce as the processing engine has reduced. But, it cannot be said in black and white that MapReduce will be completely replaced by Apache Spark in the coming years. Both the technologies have their own pros and cons as we will see them below. One solution cannot fit at all the places, so MapReduce will have its own takers depending on the problem to be solved.Also, Spark and MapReduce do complement each other on many occasions.Both these technologies have made inroads in all walks of common man’s life. You name the industry and its there. Be it telecommunication, e-commerce, banking, insurance, healthcare, medicine, agriculture, biotechnology, etc.What is Spark?As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.Spark, instead of just “map” and “reduce” functions, defines a large set of operations called transformations and actions for the developers and which are ultimately transformed to map/reduce by the spark execution engine and these operations are arbitrarily combined for highly optimized performance.Spark is developed in Scala language and it can run on Hadoop in standalone mode using its own default resource manager as well as in Cluster mode using YARN or Mesos resource manager. It is not mandatory to use Hadoop for Spark, it can be used with S3 or Cassandra also. But, in the majority of the cases, Hadoop is the best fit as Spark’s data storage layer.Features of SparkSpeed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. Spark achieves this by minimising disk read/write operations for intermediate results and storing in memory and perform disk operations only when essential. Spark achieves this using DAG, query optimizer and highly optimized physical execution engine.Fault Tolerance: Apache Spark achieves fault tolerance using spark abstraction layer called RDD (Resilient Distributed Datasets), which are designed to handle worker node failure.Lazy Evaluation: All the processing(transformations) on Spark RDD/Datasets are lazily evaluated, i.e. the output RDD/datasets are not available right away after transformation but will be available only when an action is performed.Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.Multiple Language Support: Spark provides multiple programming language support and you can use it interactively from the Scala, Python, R, and SQL shells.Reusability: Spark code can be used for batch-processing, joining streaming data against historical data as well as run ad-hoc queries on streaming state.Machine Learning: Apache Spark comes with out of the box support for machine learning called MLib which can be used for complex, predictive data analytics.Graph Processing: GraphX is Apache Spark's API for graphs and graph-parallel computation. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.Where is Spark usually used?Spark is used by 1000+ organizations in Production. Many of these organizations are known to run Spark clusters of 1000+ nodes. In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce (which sorted 100 TB of data in 23 min, using 2100 machines) using 10X fever machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB. Several production workloads use Spark to do ETL and data analysis on PBs of data. Below are some examples where Spark is used across industries:AsiaInfo: Uses Spark Core, Streaming, MLlib and Graphx and Hadoop to build cost-effective data centre solution for our customers in the telecom industry as well as other industrial sectors.Atp: Predictive models and learning algorithms to improve the relevance of programmatic marketing.Credit Karma: Creates personalized experiences using SparkeBay Inc: Using Spark core for log transaction aggregation and analyticsKelkoo: Using Spark Core, SQL, and Streaming. Product recommendations, BI and analytics, real-time malicious activity filtering, and data mining.More examples can be found on Apache’s  Powered By pageSpark Example in Scala (Spark shell can be used for this)// “sc” is a “Spark context” – this transforms the file into an RDD val textFile = sc.textFile("data.txt") // Return number of items (lines) in this RDD; count() is an action textFile.count() // Demo filtering.  Filter is a transform.  By itself this does no real work val linesWithSpark = textFile.filter(line => line.contains("Spark")) // Demo chaining – how many lines contain “Spark”?  count() is an action. textFile.filter(line => line.contains("Spark")).count() // Length of line with most words.  Reduce is an action. textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b) // Word count – traditional map-reduce.  collect() is an action val word Counts = text File.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) word Counts.collect()Sample Spark Transformationsmap(func): Return a new distributed dataset formed by passing each element of the source through a function func.filter(func): Return a new dataset formed by selecting those elements of the source on which func returns trueunion(other Dataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.Sample Spark Actionsreduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.count(): Return the number of elements in the dataset.The data is referred from the RDD Programming guide.What is MapReduce?MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.Programmers have been writing parallel programs for a long time in different languages like C++, Java, C#, Python. But, they have their own nuances and maintaining these, is the programmer's responsibility. There are chances of application crashing, performance hit, incorrect results. Also, such systems if grows very large is not very fault tolerant or difficult to maintain.MapReduce has simplified all these. Fault tolerance, parallel execution, resources management is all responsibility of the Resource manager and the framework. Programmers have to only concentrate on business logic by writing only map and reduce functions.Brief Description of MapReduce ArchitectureA MapReduce application has broadly two functions called map and reduce.Map: Mapper process takes input as key/value pair, processes them i.e. performs some computation and then produces intermediate results as key/value pairsi.e. map(k1,v1) ---> list(k2,v2)Reduce: Reducer process receives an intermediate key and a set of values in sorted order. It processes these and generates output key/value pairs by grouping values for each key.i.e. reduce(k2, list(v2)) ---> list(v3)Can also define an option function “Combiner” (to optimize bandwidth)If defined, runs after Mapper & before Reducer on every node that has run a map taskCombiner receives as input all data emitted by the Mapper instances on a given nodeCombiner output sent to the Reducers, instead of the output from the MappersIs a "mini-reduce" process which operates only on data generated by one machineHow does MapReduce work?MapReduce is usually applied to huge datasets. A MapReduce job splits the input data into smaller independent chunks called partitions and then processes them independently using map tasks and reduce tasks. Below is an example.MapReduce Word Count (Pseudocode)map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); reduce(String output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result));MapReduceApply a function to all the elements of Listlist1=[1,2,3,4,5]; square x = x * x list2=Map square(list1) print list2 -> [1,4,9,16,25]0Combine all the elements of list for a summarylist1 = [1,2,3,4,5]; A = reduce (+) list1 Print A -> 15Apache Spark vs MapReduceAfter getting off hangover how Apache Spark and MapReduce works, we need to understand how these two technologies compare with each other, what are their pros and cons, so as to get a clear understanding which technology fits our use case.As we can see, MapReduce involves at least 4 disk operations whereas Spark only involves 2 disk operations. This is one reason for Spark to be much faster than MapReduce. Spark also caches intermediate data which can be used in further iterations helping Spark improve its performance further. The more iterative the process the better is the Spark performance due to in-memory processing and caching. This is where MapReduce performance not as good as Spark due to disk read/write operations for every iteration.Let’s see a comparison between Spark and MapReduce on different other parameters to understand where to use Spark and where to use MapReduceAttributesMapReduceApache SparkSpeed/PerformanceMapReduce is designed for batch processing and is not as fast as Spark. It is used for gathering data from multiple sources and process it once and store in a distributed data store like HDFS. It is best suited where memory is limited and processing data size is so big that it would not fit in the available memory.Spark is 10-100 times faster because of in-memory processing and its caching mechanism. It can deliver near real-time analytics. It is used in Credit Card Processing, Fraud detection, Machine learning and data analytics, IoT sensors etcCostAs it is part of Apache Open Source there is no software cost.Hardware cost is less in MapReduce as it works with smaller memory(RAM) as compared to Spark. Even commodity hardware is sufficient.Spark also is Apache Open Source so no license cost.Hardware cost is more than MapReduce as even though Spark can work on commodity hardware it needs a lot more memory(RAM) as compared to MapReduce since it should be able to fit all the data in Memory for optimal performance. Cluster needs little high-end commodity hardware with lots of RAM else performance gets hitEase of UseMapReduce is a bit complex to write. MapReduce is written in Java and the APIs are a bit complex to code for new programmers, so there is a steep learning curve involved. The Pig has SQL like syntax and it is easier for SQL developers to get onboard easily. Also, there is no interactive mode available in MapReduceSpark has APIs in Scala, Java, Python, R for all basic transformations and actions. It also has rich Spark SQL APIs for SQL savvy developers and it covers most of the SQL functions and is adding more functions with each new release. Also, Spark has scope for writing User Defined Analytical Functions and Functions (UDF/UDAF) for anyone who would like to have custom functions.CompatibilityMapReduce is also compatible with all data sources and file formats Hadoop supports. But MapReduce needs another Scheduler like YARN or Mesos to run, it does not have any inbuilt Scheduler like Spark’s default/standalone scheduler.Apache Spark can in standalone mode using default scheduler. It can also run on YARN or Mesos. It can run on-premise or on the cloud. Spark supports most of the data formats like parquet, Avro, ORC, JSON, etc. It also supports multiple languages and has APIs for Java, Scala, Python, R.Data ProcessingMapReduce can only be used for batch processing where throughput is more important and latency can be compromised.Spark supports Batch as well as Stream processing, so fits both use cases and can be used for Lambda design where applications need both Speed layer and slower layer/data processing layerSecurityMapReduce has more security features.MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like Knox Gateway and Sentry.Spark is a bit bare at the moment. Spark currently supports authentication via a shared secret. Spark can integrate with HDFS and it can use HDFS ACLs and file-level permissions. Spark can also run on YARN leveraging the capability of Kerberos.Fault ToleranceMapReduce uses replication for fault tolerance. If any slave daemon fails, master daemons reschedule all pending and in-progress operations to another slave. This method is effective, but it can significantly increase the completion times for operations with a single failure alsoIn Spark, RDDs are the building blocks and Spark also uses it RDDs and DAG for fault tolerance. If an RDD is lost, it will automatically be recomputed by using the original transformations.LatencyMapReduce has high latencySpark provides low latency performanceInteractive ModeMapReduce does not have any interactive mode of operation.Spark can be used interactively also for data processing. It has out-of-the-box support for spark shell for scala/python/RMachine Learning/Graph ProcessingNo support for these. A mahout has to be used for MLSpark has dedicated modules for ML and Graph processingBoth these technologies MapReduce and Spark have pros and cons:MapReduce is best suited for Analysis of archived data where data size is huge and it is not going to fit in memory, and if the instant results and intermediate solutions are not required. MapReduce also scales very well and the cluster can be horizontally scaled with ease using commodity machines.Offline Analytics is a good fit for MapReduce like Top Products per month, Unique clicks per banner.MapReduce is also suited for Web Crawling as well as Crawling tweets at scale and NLP like Sentiment Analysis.Another use case for MapReduce is de-duplicating data from social networking sites, job sites and other similar sites.MapReduce is also heavily used in Data mining for Generating the model and then classifying.Spark is fast and so can be used in Near Real Time data analysis.A lot of organizations are moving to Spark as their ETL processing layer from legacy ETL systems like Informatica. Spark as very good and optimized SQL processing module which fits the ETL requirements as it can read from multiple sources and can also write to many kinds of data sources.Spark can also handle Streaming data so its best suited for Lambda design.Most graph processing algorithms like page rank perform multiple iterations over the same data and this requires a message passing mechanism. Spark has great support for Graph processing using GraphX module.Almost all machine learning algorithms work iteratively. Spark has a built-in scalable machine learning library called MLlib which contains high-quality algorithms that leverage iterations and yields better results than one pass approximations sometimes used on MapReduce.Hadoop MapReduce is more mature as it has been there for a longer time and its support is also better in the open source community. It can be beneficial for really big data use case where memory is limited and data will not fit the RAM. Most of the time, Spark use case will involve Hadoop and other tools like Hive, Pig, Impala and so when these technologies complement each other it will be a win for both Spark and MapReduce.Conclusion:Hadoop Mapreduce is more mature as it has been there for a longer time and its support is also better in the open source community. It can be beneficial for really big data use case where memory is limited and data will not fit the RAM. Most of the time, Spark use case will involve Hadoop and other tools like Hive, Pig, Impala and so when these technologies complement each other it will be a win for both Spark and Mapreduce.
Rated 4.5/5 based on 1 customer reviews
8702
Apache Spark Vs Hadoop MapReduce

Why we need Big Data frameworksBig data is primari... Read More

Useful links