4 Types Of Data Analytics To Improve Decision-Making

If you are on the CSE Stack portal, there’s a good chance that you are already well acquainted with general terms like ‘Data Analytics’, ‘Big Data’ and ‘Business Intelligence’, and know that they mean different things in different circumstances. But have you thought about which BI approach is the right one among the wide range of solutions promising business success?

In this article, I will disambiguate the term ‘Data Analytics’ by splitting it into 4 different types and aligning each with a decision-making objective.

Descriptive Analytics: What happened?

The most common type of analytics, Descriptive Analytics offers the analyst a comprehensive view of key metrics and measures within an organization. It analyses real-time as well as historical data to derive meaningful insights about how a company is doing. The main aim of this basic type of analytics is to summarize what happened and surface the reasons behind past success or failure; because it underpins routine reporting, it is also known as the ‘Reporting Bedrock’.

A business learns from its past behavior and draws inferences from those observations about its likely future outcomes. Descriptive Analytics works best when a business wants to understand its overall performance at an aggregate level and examine it from various angles.

The best example of this would be a profit and loss statement. In the same way, analysts may hold data on a huge population of customers; summarizing the demographic information of those customers is also classified as ‘descriptive analytics’.
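To make this concrete, here is a minimal sketch of descriptive analytics in Spark SQL (Scala). It is an illustration only: the sales figures, column names and application name are all invented, and any aggregation layer would do the same job. In the spark-shell, the session and its implicits already exist.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.appName("DescriptiveDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sales records: (region, month, revenue)
val sales = Seq(
  ("North", "2019-01", 120000.0), ("South", "2019-01", 95000.0),
  ("North", "2019-02", 101000.0), ("South", "2019-02", 98000.0)
).toDF("region", "month", "revenue")

// "What happened?": aggregate key measures at a summary level
sales.groupBy("month")
  .agg(sum("revenue").as("total_revenue"), avg("revenue").as("avg_revenue"))
  .orderBy("month")
  .show()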

Diagnostic Analytics: What made it happen?

The next stop after Descriptive Analytics in understanding the intricacies of Data Analytics is Diagnostic Analytics. After assessing descriptive data, good diagnostic tools enable an analyst to dig deeper into a problem, using drilldowns and queries to isolate the root cause of the trouble. In simple words, in this type of analytics, historical data is examined against other data to answer the question ‘why did it happen?’.

With Diagnostic Analytics, companies can identify dependencies and discern patterns. Organizations prefer this type of analytics because it gives them deeper insight into a specific problem. On the other hand, an organization should keep the detailed underlying data at hand, otherwise data collection can turn out to be time-consuming.

Effectively designed, well-integrated Business Intelligence (BI) dashboards that bring together time-series data, interactive filters and drilldown capabilities are ideal for this kind of analysis.
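Continuing the same hypothetical example (this reuses the sales DataFrame and imports from the sketch above), a diagnostic drilldown takes the weaker month apart by region to locate where the drop came from:

// "Why did it happen?": drill down from the monthly total into regions
sales.filter($"month" === "2019-02")
  .groupBy("region")
  .agg(sum("revenue").as("revenue_feb"))
  .orderBy(desc("revenue_feb"))
  .show()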

Predictive Analytics: What is going to happen?

It is all about making the right predictions. Predictive Analytics involves analysing past data patterns and trends to forecast future business outcomes. By building on the findings of Descriptive and Diagnostic Analytics, it helps a company set realistic goals, execute them effectively and moderate expectations.

Thanks to Predictive Analytics, it is now easy to identify tendencies, clusters and exceptions while forecasting future trends, all of which makes it an extremely valuable tool. By employing machine learning algorithms and statistical approaches, Predictive Analytics estimates the likelihood of an event happening in the future. Remember, though, that these outputs are predictions based on probabilities, and hence not 100% accurate.

Big conglomerates like Amazon and Walmart leverage this high-value type of analytics to anticipate future sales trends, customer behavior, purchase patterns and a lot more.
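As a toy illustration of the idea, and not a production forecasting pipeline, the sketch below fits a linear trend to invented monthly sales figures with Spark MLlib and scores the next month. It reuses the Spark session and implicits from the earlier snippets.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical history: month index vs. units sold
val history = Seq((1, 120.0), (2, 135.0), (3, 149.0), (4, 161.0), (5, 178.0))
  .toDF("month", "label")

// MLlib expects the predictors assembled into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("month"))
  .setOutputCol("features")

val model = new LinearRegression().fit(assembler.transform(history))

// Score month 6: a point estimate from a fitted trend, not a certainty
val next = assembler.transform(Seq((6, 0.0)).toDF("month", "label"))
model.transform(next).select("month", "prediction").show()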

Prescriptive Analytics: What is to be done?

This is where Big Data and Artificial Intelligence get into action. The main objective of Prescriptive Analytics is to prescribe what action should be taken to address a future problem. It is the next stop after Predictive Analytics, helping a business understand the underlying reasons for complications and devise the best course of action.

It offers insights into possible results and outcomes that ultimately maximize key business metrics. It works by combining mathematical models, data and numerous business rules. The data can be external as well as internal, while the business rules are boundaries, preferences, best practices and other constraints. Machine learning, natural language processing, operations research and statistics are a few examples of the mathematical models used.

Though complex in nature, Prescriptive Analytics can have a huge impact on a company’s overall operations and future business growth. The best example of this type of analytics is a traffic application that helps you select the best route home after weighing the distance of each route, your travelling speed and the prevailing traffic constraints in the city you are travelling through.
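The prescriptive step can be sketched in plain Scala as ‘predictions plus business rules’: a model supplies the predicted travel times, and a rule filters out inadmissible options before the best one is chosen. The route names, times and the toll rule below are all invented for illustration.

// Hypothetical routes with model-predicted travel times
case class Route(name: String, predictedMinutes: Double, usesToll: Boolean)

val candidates = Seq(
  Route("highway", 28.0, usesToll = true),
  Route("ring road", 34.0, usesToll = false),
  Route("downtown", 41.0, usesToll = false)
)

// Business rule: avoid toll roads late at night
def allowed(r: Route, hour: Int): Boolean = !(r.usesToll && hour >= 22)

// Prescribe the best admissible action
val best = candidates.filter(r => allowed(r, hour = 23)).minBy(_.predictedMinutes)
println(s"Recommended route: ${best.name} (${best.predictedMinutes} min)")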

Current trends show that an increasing number of companies are embracing Big Data solutions and looking forward to implementing Data Analytics. However, they must select the right type of analytics solution to enhance ROI, increase service quality and reduce operational costs. Do you have any other information or thoughts on this topic? Feel free to share them with us by commenting below.


Eshika Roy

Blog Author

Eshika Roy is a seasoned copywriter working for DexLab Analytics by day, and a hobbyist playing with numbers by night. She brings us this new future face of technology and how it will change our world. Beyond this, she has an inclination for fiction novels, exploring different cuisines, and confectionery and dessert cooking.


Suggested Blogs

Installing Apache Spark on Windows

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. In this document, we will cover the installation procedure of Apache Spark on the Windows 10 operating system.

Prerequisites: This guide assumes that you are using Windows 10 and that the user has admin permissions.

Audience: This document can be used by anyone who wants to install the latest version of Apache Spark on Windows 10.

System requirements: Windows 10 OS, 4 GB RAM, 20 GB free space.

Installation Procedure

Step 1: Go to the official Apache Spark download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.

Step 2: Once the download is completed, unzip the file using WinZip, WinRAR or 7-Zip.

Step 3: Create a folder called Spark under your user directory and copy the contents of the unzipped file into it: C:\Users\<username>\Spark

Step 4: Go to the conf folder and open the log file called log4j.properties.template. Change INFO to WARN (it can be ERROR to reduce the logging further). This step and the next are optional. Remove the .template extension so that Spark can read the file.

Step 5: Now we need to configure the path. Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables. Add a new user variable (or system variable) SPARK_HOME pointing to the Spark folder (to add a new user variable, click the New button under ‘User variables’) and click OK. Then add %SPARK_HOME%\bin to the Path variable and click OK.

Step 6: Spark needs a piece of Hadoop to run. For Hadoop 2.7, you need to install winutils.exe; download it from its distribution page.

Step 7: Create a folder called winutils in the C drive and create a folder called bin inside it. Then move the downloaded winutils file to the bin folder: C:\winutils\bin. Add the user (or system) variable HADOOP_HOME, like SPARK_HOME, pointing to the winutils folder, and click OK.

Step 8: To run Apache Spark, Java should be installed on your computer. If you don’t have Java installed on your system, please follow the process below.

Java Installation Steps: Go to the official Java download page, accept the Licence Agreement for Java SE Development Kit 8u201, and download the jdk-8u201-windows-x64.exe file. Double-click the downloaded .exe file, click Next through the installer windows, and click Close once the installation completes.

Test Java Installation: Open the command line and type java -version; it should display the installed version of Java. You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the user variables (or system variables).

1. In the end, the environment variables have 3 new paths (or two, SPARK_HOME and HADOOP_HOME, if the Java path was already present).

2. Create the c:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder by themselves when Spark is first started; however, it is best practice to create it: C:\tmp\hive

Test Installation: Open the command line and type spark-shell; the Spark shell should start up. We have completed the Spark installation on a Windows system.
Let’s create an RDD and a DataFrame

We will create one RDD and a DataFrame, and then wrap up.

1. We can create an RDD in 3 ways; we will use one of them here. Define any list, then parallelize it to create the RDD. Enter the code below line by line in the spark-shell:

val list = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(list)

The above creates an RDD.

2. Now we will create a DataFrame from the RDD:

import spark.implicits._
val df = rdd.toDF("id")

The above code creates a DataFrame with id as a column. To display the data in the DataFrame, use the command below:

df.show()

How to uninstall Spark from a Windows 10 system: Please follow the steps below.

Remove the SPARK_HOME and HADOOP_HOME system/user variables. To do so, go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables, find SPARK_HOME and HADOOP_HOME, select them, and press the DELETE button. Then edit the Path variable: select %SPARK_HOME%\bin and press DELETE, select %HADOOP_HOME%\bin and press DELETE, then click OK.

Open the Command Prompt and type spark-shell, then press Enter; we now get an error, which confirms that Spark has been successfully uninstalled from the system.

Apache Spark Vs Hadoop MapReduce

Why we need Big Data frameworks

Big data is primarily defined by the volume of a data set. Big data sets are generally huge, measuring tens of terabytes and sometimes crossing the threshold of petabytes. It is surprising to know how much data is generated every minute. As estimated by DOMO, over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there; by 2020, it’s estimated that 1.7 MB of data will be created every second for every person on earth. You can read DOMO’s full report, including industry-specific breakdowns.

To store and process even a fraction of this amount of data, we need Big Data frameworks, as traditional databases would not be able to store so much data, nor would traditional processing systems be able to process it quickly. This is where frameworks like Apache Spark and MapReduce come to our rescue, helping us get deep insights into huge amounts of structured, unstructured and semi-structured data and make more sense of it.

Market Demands for Spark and MapReduce

Apache Spark was originally developed in 2009 at UC Berkeley by the team who later founded Databricks. Since its launch, Spark has seen rapid adoption and growth. Most cutting-edge technology organizations like Netflix, Apple, Facebook and Uber run massive Spark clusters for data processing and analytics, and the demand for Spark is increasing at a very fast pace. According to a marketanalysis.com forecast, the global Apache Spark market will grow at a CAGR of 67% between 2019 and 2022. Global Spark market revenue is expanding rapidly and may reach $4.2 billion by 2022, with a cumulative market value of $9.2 billion (2019-2022).

MapReduce has been around a little longer, having been developed in 2006, and gained industry acceptance during its initial years. But over the last 5 years or so, with Apache Spark gaining more ground, demand for MapReduce as the processing engine has reduced. Still, it cannot be said in black and white that MapReduce will be completely replaced by Apache Spark in the coming years. Both technologies have their own pros and cons, as we will see below. One solution cannot fit all use cases, so MapReduce will have its own takers depending on the problem to be solved. Spark and MapReduce also complement each other on many occasions.

Both these technologies have made inroads into all walks of the common man’s life. You name the industry and it’s there, be it telecommunications, e-commerce, banking, insurance, healthcare, medicine, agriculture or biotechnology.

What is Spark?

As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Instead of just “map” and “reduce” functions, Spark defines a large set of operations called transformations and actions, which are ultimately translated to map/reduce by the Spark execution engine; these operations can be arbitrarily combined for highly optimized performance.

Spark is developed in the Scala language and it can run on Hadoop in standalone mode using its own default resource manager, as well as in cluster mode using the YARN or Mesos resource manager.
It is not mandatory to use Hadoop for Spark; it can also be used with S3 or Cassandra. But in the majority of cases, Hadoop is the best fit as Spark’s data storage layer.

Features of Spark

Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. Spark achieves this by minimising disk read/write operations for intermediate results, storing them in memory and performing disk operations only when essential, using a DAG, a query optimizer and a highly optimized physical execution engine.

Fault Tolerance: Apache Spark achieves fault tolerance using a Spark abstraction layer called RDD (Resilient Distributed Datasets), which is designed to handle worker node failure.

Lazy Evaluation: All processing (transformations) on Spark RDDs/Datasets is lazily evaluated, i.e. the output RDD/Dataset is not available right away after the transformation but only once an action is performed.

Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.

Multiple Language Support: Spark provides multiple programming language support, and you can use it interactively from the Scala, Python, R, and SQL shells.

Reusability: Spark code can be used for batch processing, joining streaming data against historical data, as well as running ad-hoc queries on streaming state.

Machine Learning: Apache Spark comes with out-of-the-box support for machine learning, called MLlib, which can be used for complex, predictive data analytics.

Graph Processing: GraphX is Apache Spark’s API for graphs and graph-parallel computation. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.

Real-Time Stream Processing: Spark Streaming brings Apache Spark’s language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.

Where is Spark usually used?

Spark is used by 1000+ organizations in production, many of which are known to run Spark clusters of 1000+ nodes. In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3x faster than Hadoop MapReduce (which sorted 100 TB in 23 min using 2100 machines) with 10x fewer machines, winning the 2014 Daytona GraySort Benchmark, and has also been used to sort 1 PB. Several production workloads use Spark to do ETL and data analysis on PBs of data. Below are some examples of where Spark is used across industries:

AsiaInfo: Uses Spark Core, Streaming, MLlib, GraphX and Hadoop to build cost-effective data centre solutions for customers in the telecom industry as well as other industrial sectors.

Atp: Predictive models and learning algorithms to improve the relevance of programmatic marketing.

Credit Karma: Creates personalized experiences using Spark.

eBay Inc: Uses Spark Core for log transaction aggregation and analytics.

Kelkoo: Uses Spark Core, SQL, and Streaming for product recommendations, BI and analytics, real-time malicious activity filtering, and data mining.

More examples can be found on Apache’s Powered By page.

Spark Example in Scala (the Spark shell can be used for this)

// "sc" is the Spark context; this turns the file into an RDD
val textFile = sc.textFile("data.txt")
// Return the number of items (lines) in this RDD; count() is an action
textFile.count()
// Demo filtering. filter is a transformation; by itself it does no real work
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Demo chaining: how many lines contain "Spark"? count() is an action
textFile.filter(line => line.contains("Spark")).count()
// Length of the line with the most words. reduce is an action
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
// Word count, the traditional map-reduce. collect() is an action
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()

Sample Spark Transformations

map(func): Returns a new distributed dataset formed by passing each element of the source through the function func.

filter(func): Returns a new dataset formed by selecting those elements of the source on which func returns true.

union(otherDataset): Returns a new dataset that contains the union of the elements in the source dataset and the argument.

Sample Spark Actions

reduce(func): Aggregates the elements of the dataset using the function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Returns the number of elements in the dataset.

The above is based on the RDD Programming Guide.
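As a small sketch tying the sampled operations together in the spark-shell (the values are invented), note how the transformations only build a plan while the actions trigger the actual work:

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4, 5))
val u = a.union(b)              // transformation: nothing executes yet
u.filter(_ % 2 == 0).collect()  // action: returns Array(2, 4)
u.count()                       // action: returns 6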
What is MapReduce?

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Programmers have been writing parallel programs for a long time in different languages like C++, Java, C# and Python, but these have their own nuances, and maintaining them is the programmer’s responsibility. There are chances of the application crashing, performance hits and incorrect results; and such systems, if they grow very large, are not very fault tolerant and are difficult to maintain. MapReduce has simplified all of this: fault tolerance, parallel execution and resource management are all the responsibility of the resource manager and the framework, and programmers have to concentrate only on business logic by writing just the map and reduce functions.

Brief Description of the MapReduce Architecture

A MapReduce application has, broadly, two functions called map and reduce.

Map: The mapper process takes input as key/value pairs, processes them, i.e. performs some computation, and then produces intermediate results as key/value pairs, i.e. map(k1, v1) -> list(k2, v2).

Reduce: The reducer process receives an intermediate key and a set of values in sorted order. It processes these and generates output key/value pairs by grouping the values for each key, i.e. reduce(k2, list(v2)) -> list(v3).

One can also define an optional “Combiner” function (to optimize bandwidth). If defined, it runs after the Mapper and before the Reducer on every node that has run a map task. The Combiner receives as input all the data emitted by the Mapper instances on a given node, and its output is sent to the Reducers instead of the Mappers’ output. It is a “mini-reduce” process which operates only on data generated by one machine.

How does MapReduce work?

MapReduce is usually applied to huge datasets. A MapReduce job splits the input data into smaller independent chunks called partitions and then processes them independently using map tasks and reduce tasks.
Below is an example.

MapReduce Word Count (Pseudocode)

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Map: apply a function to all the elements of a list.

list1 = [1,2,3,4,5]
square x = x * x
list2 = Map square(list1)
print list2 -> [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary.

list1 = [1,2,3,4,5]
A = Reduce (+) list1
print A -> 15
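In Scala, the same two list operations can be written directly; this is just the functional map/reduce idea on an in-memory list, not a distributed job:

val list1 = List(1, 2, 3, 4, 5)
val list2 = list1.map(x => x * x)   // List(1, 4, 9, 16, 25)
val a = list1.reduce(_ + _)         // 15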
Apache Spark vs MapReduce

Now that we have seen how Apache Spark and MapReduce work, we need to understand how the two technologies compare with each other and what their pros and cons are, so as to get a clear understanding of which technology fits our use case.

MapReduce involves at least 4 disk operations per job while Spark involves only 2. This is one reason Spark is much faster than MapReduce. Spark also caches intermediate data, which can be reused in further iterations, improving its performance further. The more iterative the process, the better Spark performs, thanks to in-memory processing and caching; this is where MapReduce suffers, due to disk read/write operations in every iteration.

Let’s compare Spark and MapReduce on several other parameters to understand where to use each.

Speed/Performance. MapReduce is designed for batch processing and is not as fast as Spark; it is used to gather data from multiple sources, process it once and store it in a distributed store like HDFS, and is best suited where memory is limited and the data size is so big that it would not fit in the available memory. Spark is 10-100 times faster because of in-memory processing and its caching mechanism, and can deliver near real-time analytics; it is used in credit card processing, fraud detection, machine learning and data analytics, IoT sensors, etc.

Cost. Both are Apache open source, so there is no software cost for either. Hardware cost is lower for MapReduce, as it works with less memory (RAM) and even commodity hardware is sufficient. Spark can also work on commodity hardware, but it needs a lot more memory, since for optimal performance it should be able to fit all the data in memory; the cluster needs somewhat high-end commodity hardware with lots of RAM, otherwise performance suffers.

Ease of Use. MapReduce is a bit complex to write: it is written in Java and the APIs are complex for new programmers, so there is a steep learning curve (Pig, with its SQL-like syntax, makes it easier for SQL developers to get on board); there is also no interactive mode. Spark has APIs in Scala, Java, Python and R for all basic transformations and actions, plus rich Spark SQL APIs for SQL-savvy developers, covering most SQL functions and adding more with each release; Spark also lets anyone write custom user-defined functions and aggregate functions (UDF/UDAF).

Compatibility. MapReduce is compatible with all the data sources and file formats Hadoop supports, but it needs a scheduler like YARN or Mesos to run; it has no built-in scheduler like Spark’s default/standalone scheduler. Apache Spark can run in standalone mode using its default scheduler, and can also run on YARN or Mesos, on-premise or in the cloud; it supports most data formats like Parquet, Avro, ORC and JSON, and has APIs for Java, Scala, Python and R.

Data Processing. MapReduce can only be used for batch processing, where throughput is more important and latency can be compromised. Spark supports batch as well as stream processing, so it fits both use cases and can be used in Lambda designs where an application needs both a speed layer and a slower batch layer.

Security. MapReduce has more security features: it enjoys all the Hadoop security benefits and integrates with Hadoop security projects like Knox Gateway and Sentry. Spark is a bit bare at the moment, currently supporting authentication via a shared secret; it can integrate with HDFS and use HDFS ACLs and file-level permissions, and can also run on YARN to leverage Kerberos.

Fault Tolerance. MapReduce uses replication for fault tolerance: if any slave daemon fails, master daemons reschedule all pending and in-progress operations to another slave. This method is effective, but it can significantly increase completion times for operations even with a single failure. Spark uses RDDs and the DAG for fault tolerance: if an RDD is lost, it is automatically recomputed by replaying the original transformations.

Latency. MapReduce has high latency; Spark provides low-latency performance.

Interactive Mode. MapReduce has no interactive mode of operation. Spark can be used interactively for data processing, with out-of-the-box spark-shell support for Scala/Python/R.

Machine Learning/Graph Processing. MapReduce has no support for these (Mahout has to be used for ML); Spark has dedicated modules for ML and graph processing.

Conclusion

Both MapReduce and Spark have pros and cons.

MapReduce is best suited for the analysis of archived data, where the data size is huge, it is not going to fit in memory, and instant results and intermediate solutions are not required. MapReduce also scales very well, and a cluster can be horizontally scaled with ease using commodity machines. Offline analytics, like top products per month or unique clicks per banner, is a good fit for MapReduce. It also suits web crawling, crawling tweets at scale, NLP tasks like sentiment analysis, and de-duplicating data from social networking sites, job sites and other similar sites, and it is heavily used in data mining for generating a model and then classifying.

Spark is fast and so can be used for near real-time data analysis. A lot of organizations are moving to Spark as their ETL processing layer from legacy ETL systems like Informatica: Spark has a very good, optimized SQL processing module which fits ETL requirements, as it can read from and write to many kinds of data sources. Spark can also handle streaming data, so it is well suited for Lambda designs. Most graph processing algorithms, like PageRank, perform multiple iterations over the same data and require a message-passing mechanism; Spark has great support for graph processing through its GraphX module. Almost all machine learning algorithms work iteratively.
Spark has a built-in scalable machine learning library called MLlib, which contains high-quality algorithms that leverage iteration and can yield better results than the one-pass approximations sometimes used in MapReduce.

Hadoop MapReduce is more mature, as it has been around longer, and its support in the open-source community is also better. It can be beneficial for really big data use cases where memory is limited and the data will not fit in RAM. Most of the time, a Spark use case will involve Hadoop and other tools like Hive, Pig and Impala, and when these technologies complement each other, it is a win for both Spark and MapReduce.

Understanding Big Data: Best Big Data Frameworks

The massive world of ‘BIG DATA’

If one strolls around any IT office premises, every decade or so (nowadays the time span is even shorter, almost every 3-4 years) one will overhear professionals discussing new jargon from the hottest trends in technology. Around 5-6 years ago, one such term that started ruling IT services is ‘Big Data’, and it is still interpreted in various ways by everyone from laymen to tech geeks.

Although the services industries have been talking widely about big data solutions for the last 5-6 years, the term is believed to have been in use since the 1990s by John Mashey of Silicon Graphics, whereas the credit for coining ‘big data’ in line with its modern definition goes to Roger Mougalas of O’Reilly Media in 2005.

Let’s first understand why everyone is going gaga about ‘Big Data’ and what real-world problems it is supposed to solve, and then we will try to answer the what and how of it.

Why is BIG DATA essential for today’s digital world?

In the pre-smartphone era, the internet and the web had been around for many years, but smartphones made them mobile with on-the-go usage. Social media and mobile apps started generating tons of data. At the same time, smart bands and wearable devices (IoT, M2M) gave newer dimensions to data generation. This newly generated data became the new oil of the world: if it is stored and analyzed, it has the potential to give tremendous insights that can be put to use in numerous ways.

You will be amazed to see the real-world use cases of Big Data. Every industry has a unique use case, and each is unique even to every client implementing the solutions, ranging from data-driven personalized campaigning (you do see that item you browsed on some ‘xyz’ site in your Facebook scrolling; ever wondered how?) to predictive maintenance of huge pipelines carrying oil across countries, where manual monitoring is practically impossible. To relate this to our day-to-day life, every click, every swipe, every share and every like we casually make on social media is helping today’s industries take calculated future business decisions. How do you think Netflix predicted the success of ‘House of Cards’ and spent $100 million on it? Big data analytics is the simple answer.

The biggest challenge in the past was that the traditional methods used to store, curate and analyze data had limitations: they could not process data generated from newer sources, huge in volume, coming from heterogeneous sources and generated really fast (to give you an idea, roughly 2.5 quintillion bytes of data are generated per day as of today; refer to the infographic released by Domo called “Data Never Sleeps 5.0”). This gave rise to the term Big Data and the solutions related to it.

Understanding BIG DATA: Experts’ viewpoint

Big Data literally means massive data (loosely > 1 TB), but that’s not the only aspect of it.
Distributed data, or even complex datasets which cannot be analyzed through traditional methods, can be categorized as ‘Big Data’, and hence the theoretical definition of Big Data makes a lot of sense against this background:

“Gartner (2012) defines, Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

Generic data possessing the characteristics of big data has the 3Vs, namely Variety, Velocity and Volume. But due to the changing nature of data in today’s world, and to extract the most insight from it, 3 more Vs were added to the definition of Big Data: Variability, Veracity and Value. (Diagram: the 6 Vs of Big Data.)

These 6 Vs help in understanding the characteristics of Big Data, but let’s also understand the types of data in Big Data processing. The ‘Variety’ characteristic above refers to the different types of data that can be processed through big data tools and technologies. Let’s drill down a bit to understand what those are:

Structured, e.g. mainframes and traditional databases like Teradata, Netezza, Oracle, etc.

Unstructured, e.g. tweets, Facebook posts, emails, etc.

Semi/multi-structured or hybrid, e.g. e-commerce, demographic, weather data, etc.

As technology advances, a variety of data becomes available, and its storage, processing and analysis are made possible by big data; traditional data processing techniques were able to process only structured data.

Now that we understand what big data is and the limitations of the old traditional techniques in handling such data, we can safely say that we need new technology to handle this data and gain insights from it. Before going further, do you know what the traditional data management techniques were? They are RDBMS (Relational Database Management Systems), and data warehousing and data marts. At a high level, RDBMS catered to OLTP needs and data warehousing/data marts facilitated OLAP needs, but both systems work with structured data.

I hope one can now answer ‘what is big data?’ both conceptually and theoretically. So, it is time we understand how it is done in actual implementations. Only storing ‘big data’ will not help organizations; what’s important is to turn data into insights and business value, and to do so, the following are the key infrastructure elements: data collection, data storage, data analysis, and data visualization/output. All major big data processing framework offerings are based on these building blocks.

In alignment with the above building blocks, the following are the top 5 big data processing frameworks currently being used in the market:

1. Apache Hadoop: the Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. First up is the all-time classic, and one of the top frameworks in use today; so prevalent is it that it has almost become synonymous with Big Data.

2. Apache Spark: a unified analytics engine for large-scale data processing. Apache Spark and Hadoop are often contrasted as an “either/or” choice, but that isn’t really the case.

The above two frameworks are the most popular, but the following 3 are also available and comparable:

3. Apache Storm: a free and open-source distributed real-time computation system. You can also take up Apache Storm training to learn more about Apache Storm.
4. Apache Flink: a streaming dataflow engine, aiming to provide facilities for distributed computation over streams of data. Treating batch processes as a special case of streaming data, Flink is effectively both a batch and a real-time processing framework, but one which clearly puts streaming first.

5. Apache Samza: a distributed stream processing framework.

Frameworks help process data through the building blocks and generate the required insights. Each framework is supported by a whopping number of tools providing the required functionality.

BIG DATA processing framework and technology landscape

The big data tools and technology landscape can be better understood through layered big data architecture; give a good read to the article by Navdeep Singh Gill on XenonStack for understanding the layered architecture of big data. Taking inspiration from layered architecture, the tools available in the market can be mapped to layers to understand the big data technology landscape in depth. Note that layered architecture fits very well with the infrastructure elements/building blocks discussed in the above section. A few of the tools are briefly described below for further understanding:

1. Data Collection / Ingestion Layer

Cassandra: a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Kafka: an event streaming platform, used for building real-time data pipelines and streaming apps.

Flume: a log collector in Hadoop.

HBase: a columnar database in Hadoop.

2. Processing Layer

Pig: a scripting language in the Hadoop framework.

MapReduce: the processing language in Hadoop.

3. Data Query Layer

Impala (Cloudera Impala): a modern, open-source, distributed SQL query engine for Apache Hadoop (often compared with Hive).

Hive: data warehouse software for data query and analysis.

Presto: a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Apache Kafka, and MongoDB.

4. Analytical Engine

TensorFlow: an open-source machine learning library for research and production.

5. Data Storage Layer

Ignite: an open-source distributed database, caching and processing platform designed to store and compute on large volumes of data across a cluster of nodes.

Phoenix: Apache Phoenix is an open-source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store.

PolyBase: a feature introduced in SQL Server 2016, used to query relational and non-relational databases (NoSQL). You can use PolyBase to query tables and files in Hadoop or in Azure Blob Storage, and also to import or export data to/from Hadoop.

Sqoop: an ETL tool.

Big Data in Excel: a few people like to process big datasets with current Excel capabilities, and this is known as Big Data in Excel.

6. Data Visualization Layer

Microsoft HDInsight: Azure HDInsight is a Hadoop service offering hosted in Azure that enables clusters of managed Hadoop instances. It deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability.
Hadoop administration training will give you all the technical understanding required to manage a Hadoop cluster, either in a development or a production environment.

BIG DATA best practices

Every organization, industry or business, small or big, wants to benefit from ‘big data’, but it is essential to understand that it can reach its maximum potential only if the organization adheres to best practices before adopting big data. Answering 5 basic questions helps an organization know whether it needs to adopt Big Data:

Try to answer why Big Data is required for the organization. What problem would it help solve?

Ask the right questions.

Foster collaboration between business and technology teams.

Analyze only what is required to use.

Start small and grow incrementally.

Big Data industry use cases

We have talked about everything in the Big Data world except real use cases. We did discuss a few at the start, but let me give you insights into some real-world, interesting big data use cases; for a few of them, it is no longer a secret. In fact, big data is penetrating to the extent that you can name any industry and plenty of use cases can be told. Let’s begin.

Streaming platforms

As with the example of ‘House of Cards’ at the start of the article, it is no secret that Netflix uses big data analytics. Netflix spent $100 million on 26 episodes of ‘House of Cards’ because they knew the show would appeal to viewers of the original British House of Cards, and had built-in appeal in director David Fincher and actor Kevin Spacey. Netflix typically collects behavioral data and then uses it to create a better experience for the user. But Netflix uses Big Data for more than that: it monitors and analyzes traffic details for various devices, spots problem areas and adjusts network infrastructure to prepare for future demand (the latter being action taken out of big data analytics, i.e. how big data analysis is put to use). It also tries to gain insight into the types of content viewers prefer, which helps it make informed decisions. Apart from Netflix, Spotify is also a well-known use case.

Advertising, media, campaigning and entertainment

For decades, marketers were forced to launch campaigns while blindly relying on gut instinct and hoping for the best. That all changed with digitization and the big data world. Nowadays, data-driven campaigns and marketing are on the rise, and to be successful in this landscape, a modern marketing campaign must integrate a range of intelligent approaches to identify customers, segment them, measure results, analyze data and build upon feedback, all in real time, along with the customer’s profile and history, based on purchasing patterns and other relevant information. Big Data solutions are the perfect fit here.

Event-driven marketing can also be achieved through big data, which is another way of marketing successfully in today’s world. That basically means keeping track of events customers are directly and indirectly involved with, and campaigning exactly when a customer needs it rather than running random campaigns. For example, if you have searched for a product on Amazon/Flipkart, you will see related advertisements on the other social media apps you casually browse through. Bang on, you end up purchasing it, as you needed it anyway and were given the best options to choose from.

Healthcare industry

Healthcare is one of the classic use-case industries for Big Data applications.
The industry generates a huge amount of data: patients’ medical histories, past records, treatments given, available and latest medicines, the latest medical research; the list of raw data is endless. All this data can give insights, and Big Data can contribute to the industry in the following ways:

Diagnosis time can be reduced, so that exactly the required treatment is started immediately. Most illnesses can be treated if the diagnosis is accurate and treatment starts in time. This can be achieved through evidence-based past medical data for similar treatments being made available to the doctor treating the illness, together with the patient’s history and symptoms fed into the system in real time.

The government health department can monitor whether a group of people from one geography is reporting similar symptoms; predictive measures can then be taken in nearby locations to avoid an outbreak, as the cause of such illness could be the same.

The list is long; the above are a few representative examples.

Security

Due to the social media outbreak, personal information is at stake today. Almost everything is digital, and the majority of personal information is available in the public domain; hence privacy and security are major concerns with the rise of social media. The following are a few big data applications here:

Cybercrimes are common nowadays, and big data can help detect and predict crimes.

Threat analysis and detection can be done with big data.

Travel and tourism

Flight booking sites and IRCTC track clicks and hits along with IP addresses, login information and other details, and can apply dynamic pricing to flights and trains as per demand. Big Data enables this dynamic pricing, and mind you, it is real time. I am sure each one of us has experienced this; now you know who is doing it :D

Telecommunications, the public sector, education, social media and gaming, energy and utilities: every industry has implemented, or is implementing, several of these Big Data use cases day in and day out. If you look around, I am sure you will find them on the rise. Big Data is helping industries, consumers and clients make informed decisions, whatever they may be, and hence wherever there is such a need, Big Data can come in handy.

Challenges faced by Big Data in the real world

Although the world is going gaga about big data, there are still a few challenges to implementing and adopting it, and hence the service industries are still striving to resolve these challenges so as to deliver flawless Big Data solutions. An October 2016 report from Gartner found that organizations were getting stuck at the pilot stage of their big data initiatives: "Only 15 percent of businesses reported deploying their big data project to production, effectively unchanged from last year (14 per cent)," the firm said. Let’s discuss a few of these challenges to understand them.

1. Understanding Big Data and answering ‘why’ for the organization one is working with

As I said at the start of the article, there are many versions of Big Data, and understanding the real use cases for the organization decision-makers are working with is still a challenge. Everyone wants to ride the wave, but not knowing the right path is still a struggle. As every organization is unique, it is utmost important to answer ‘why big data’ for each organization. This remains a major challenge for decision-makers in adopting big data.
2. Understanding the organization’s data sources

In today’s world, there are hundreds and thousands of ways information is generated, and being aware of all these sources and ingesting all of them into the big data platform is essential for accurate insight. Identifying the sources is a challenge to address. It’s no surprise, then, that the IDG report found, "Managing unstructured data is growing as a challenge – rising from 31 per cent in 2015 to 45 per cent in 2016." Different tools and technologies are on the rise to address this challenge.

3. Shortage of Big Data talent, and retaining it

Big Data is a changing technology and there are a whopping number of tools in the Big Data technology landscape. Big Data professionals are expected to excel in the current tools and keep up with ever-changing needs. This makes it difficult for employees and employers to create and retain talent within the organization. The solution is constant upskilling, re-skilling and cross-skilling, and increasing the organization’s budget for retaining talent and helping them train.

4. The Veracity V

This V is a challenge, as it stands for inconsistent and incomplete data. To gain insights through a big data model, a key step is to predict and fill in missing information. This is a tricky part, as filling in missing information can decrease the accuracy of the insights and analytics. There is a bunch of tools to address this concern, and data curation is an important step in big data that should follow a proper model. Keep in mind, though, that Big Data is never 100% accurate, and one must deal with that.

5. Security

This aspect tends to be given low priority during the design and build phases of Big Data implementations, and security loopholes can cost an organization dearly; hence it is essential to put security first while designing and developing Big Data solutions. It is equally important to act responsibly with regard to regulatory requirements like GDPR.

6. Gaining valuable insights

Machine learning data models go through multiple iterations to converge on insights, and they also face issues like missing data, which affects accuracy. To increase accuracy, lots of re-processing is required, which has its own lifecycle. Increasing the accuracy of insights is a challenge, one that relates to the missing-data piece and can most likely be addressed by addressing that challenge. It can also be caused by the unavailability of information from all data sources; incomplete information leads to incomplete insights, which may not deliver the required benefit.

Addressing the challenges discussed here helps in gaining valuable insights through the available solutions. With Big Data, the opportunities are endless; once it is understood, the world is yours!

Also, now that you understand BIG DATA, it is worth understanding the next steps. Gary King, a professor at Harvard, says, “Big data is not about the data. It is about the analytics.” You can also take up Big Data and Hadoop training to enhance your skills further.

Did this article help you understand today’s massive world of big data and get a sneak peek into it? Do let us know through the comment section below.
