In this section we look at Spark’s machine learning library, MLlib, and how to use it. We also see how to build pipelines for supervised and unsupervised learning. We cannot go into much detail, however: machine learning is a very broad field and cannot be covered in this tutorial alone.
Artificial intelligence is changing our present and is going to shape our future. It can be considered one of the biggest innovations of this century. In the future it is likely to be so dominant that anyone who does not understand it risks being left behind.
Law enforcement uses visual recognition and natural language processing to process footage from body cameras. The Mars Rover Curiosity even utilizes AI to autonomously select inspection-worthy soil and rock samples with high accuracy.
Today we come across many examples of machines taking over tasks that have always been done by humans. It would not be an exaggeration to think that in the future we could see housekeeping deliveries to hotel rooms being made by a bot. We already know of experiments with pizza deliveries using drones.
Machine learning is a subcategory of artificial intelligence. The goal of machine learning is to enable computers to train and learn on their own. A machine learning algorithm helps the computer pick out patterns in the training data, build a model from them, and predict outcomes based on that past learning, without being explicitly programmed with rules.
In supervised learning, we usually have training examples with the correct labels attached. For example, to classify handwritten digits, a supervised learning algorithm takes as input hundreds or thousands of handwritten digits together with their correct labels. The algorithm trains on these examples and learns the relationship between the images and the associated numbers, so that it can then classify new, unlabeled digits.
EXAMPLES OF SUPERVISED MACHINE LEARNING MODELS
How do you find the underlying structure of a dataset? How do you summarize it and group it most usefully? How do you effectively represent data in a compressed format? These are the goals of unsupervised learning, which is called “unsupervised” because you start with unlabeled data (there’s no Y).
In contrast to supervised learning, it’s not always easy to come up with metrics for how well an unsupervised learning algorithm is doing. “Performance” is often subjective and domain-specific.
EXAMPLES OF UNSUPERVISED MACHINE LEARNING MODELS
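As one illustration of an unsupervised model in Spark, here is a minimal sketch that clusters unlabeled points with k-means from the spark.ml package and scores the result with a silhouette metric. The data, the object name KMeansSketch and the choice of k = 2 are assumptions made up for this example; the silhouette score is only one of several possible quality measures, which reflects how domain-specific "performance" is for unsupervised learning.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark.
object KMeansSketch {
  // Clusters six hand-made 2-D points and returns
  // (cluster id per row, silhouette score).
  def run(spark: SparkSession): (Seq[Int], Double) = {
    // Toy, unlabeled data: one group near (0, 0), one near (9, 9).
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(9.2, 9.1)
    ).map(Tuple1.apply)
    val data = spark.createDataFrame(points).toDF("features")

    // KMeans is an Estimator: fit() learns the cluster centers.
    val model = new KMeans().setK(2).setSeed(1L).fit(data)

    // The fitted model is a Transformer: it adds a "prediction" column.
    val clustered = model.transform(data)

    // No labels are available, so we fall back on an internal metric.
    val silhouette = new ClusteringEvaluator().evaluate(clustered)
    val ids = clustered.select("prediction").collect().map(_.getInt(0)).toSeq
    (ids, silhouette)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("KMeansSketch").getOrCreate()
    val (ids, silhouette) = run(spark)
    println(s"cluster ids: $ids, silhouette: $silhouette")
    spark.stop()
  }
}
```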
When building models to make predictions, we usually train a model on an existing data set. The model may be re-trained as more training data becomes available. For example, we would re-train a recommendation engine based on collaborative filtering as we learn more about the events that lead to product sales or to targeted engagement metrics.
The goal of MLlib is to make machine learning easy to use and adapt for every user, and also to make it more scalable. Each new release has added algorithms and performance improvements, and the developers have also put a lot of effort into making MLlib user-friendly. Like Spark Core, MLlib provides APIs in three programming languages: Scala, Java and Python. This makes MLlib accessible to programmers from diverse backgrounds.
Usually a practical machine learning pipeline consists of the following four stages:
If we take the example of classifying text documents, we might see the following steps involved: text segmentation, feature extraction, text document classification, and finally training a classification model and doing cross-validation. This may look easy at first, since many freely available libraries cover the individual stages, but with huge datasets, connecting the stages into a robust pipeline is not an easy task. Most ML libraries do not support distributed computing over huge datasets, and many of them lack native support for building and tuning pipelines.
Apache Spark’s pipeline API can be found in the package “spark.ml” (org.apache.spark.ml). An ML pipeline consists of multiple stages. The two basic kinds of pipeline stage are Transformer and Estimator.
A Transformer takes a dataset as its input and outputs another, augmented dataset. An Estimator, on the other hand, is fitted on a dataset and produces a model, which is itself a Transformer.
It is very easy to create a pipeline in Spark. We just declare the stages, configure the parameters they need, and chain them in a Pipeline object. Below is an example of a simple text classification pipeline. It consists of a tokenizer, a hashing term-frequency feature extractor and a logistic regression step.
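The difference between the two stage types can be seen in a small sketch. Here Tokenizer plays the role of a Transformer and LogisticRegression the role of an Estimator; the toy data and the object name StagesSketch are assumptions made up for illustration.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark.
object StagesSketch {
  // Returns the column names produced by a Transformer and by a fitted model.
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    // Transformer: transform() returns the input plus a new "words" column.
    val docs = spark.createDataFrame(Seq(
      (0L, "spark makes ml pipelines simple")
    )).toDF("id", "text")
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tokenized = tokenizer.transform(docs)

    // Estimator: fit() learns from labeled data and returns a model.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1)),
      (0.0, Vectors.dense(2.0, -1.0)),
      (1.0, Vectors.dense(0.1, 1.3)),
      (0.0, Vectors.dense(1.9, -0.8))
    )).toDF("label", "features")
    val lr = new LogisticRegression().setMaxIter(10)
    val model: LogisticRegressionModel = lr.fit(training)

    // The model is a Transformer again: it appends prediction columns.
    val scored = model.transform(training)
    (tokenized.columns.toSeq, scored.columns.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("StagesSketch").getOrCreate()
    val (tokenizedCols, scoredCols) = run(spark)
    println(s"after Tokenizer: $tokenizedCols")
    println(s"after fitted model: $scoredCols")
    spark.stop()
  }
}
```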
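import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split the "text" column into words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Hash the words into a fixed-length term-frequency vector.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Logistic regression reads "features" and "label" by default.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
// Chain the three stages into one pipeline.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))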
The pipeline itself is an Estimator, so we can call fit on the entire pipeline to obtain a model, and then use that model to make predictions on new data:
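// trainingDataset: an assumed DataFrame with "text" and "label" columns
val model = pipeline.fit(trainingDataset)
model.transform(testDataset)
  .select("text", "label", "prediction")
  .collect()
  .foreach(println)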
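To tie the pieces together, the following is a self-contained sketch of the same three-stage pipeline, run end to end. It also wraps the pipeline in a CrossValidator, since cross-validation and tuning were named earlier as part of a practical workflow. The toy sentences, the parameter grid values and the object name TextPipelineSketch are assumptions chosen for illustration, not recommended settings.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark.
object TextPipelineSketch {
  // Returns (text, prediction) pairs for the unlabeled test rows.
  def run(spark: SparkSession): Seq[(String, Double)] = {
    // Made-up training data: label 1.0 when the text mentions "spark".
    val training = spark.createDataFrame(Seq(
      ("a b c d e spark", 1.0), ("b d", 0.0),
      ("spark f g h", 1.0), ("hadoop mapreduce", 0.0),
      ("b spark who", 1.0), ("g d a y", 0.0),
      ("spark fly", 1.0), ("was mapreduce", 0.0),
      ("e spark program", 1.0), ("a e c l", 0.0),
      ("spark compile", 1.0), ("hadoop software", 0.0)
    )).toDF("text", "label")
    val test = spark.createDataFrame(Seq(
      Tuple1("spark i j k"), Tuple1("l m n"), Tuple1("mapreduce spark")
    )).toDF("text")

    // The same three stages as in the tutorial text.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Candidate settings to try; the values are arbitrary for this sketch.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // CrossValidator is itself an Estimator wrapping the whole pipeline.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2)
      .setSeed(42L)

    // fit() picks the best parameter combination and refits on all data.
    val cvModel = cv.fit(training)
    cvModel.transform(test)
      .select("text", "prediction")
      .collect()
      .map(row => (row.getString(0), row.getDouble(1)))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("TextPipelineSketch").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```

Because the whole pipeline is tuned as one unit, the feature extraction parameters and the classifier parameters are searched jointly rather than stage by stage.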
If users implement the pipeline interfaces, they can easily plug their own transformers or estimators into a machine learning pipeline. The MLlib APIs also make it easy to share such code outside MLlib. Those who want to explore complete examples can look in the examples folder in the Spark repository.
In this module we were introduced to Spark’s ML module and learned how to use it to build pipelines.
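As a rough sketch of what plugging in a custom stage can look like, the class below extends Spark's UnaryTransformer, which handles the input and output column plumbing for stages that map one column to another. The class PunctuationStripper and the object CustomStageSketch are hypothetical names invented for this example, and the sketch assumes the input column contains no null values.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage, UnaryTransformer}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical custom Transformer: removes punctuation from a string column.
class PunctuationStripper(override val uid: String)
    extends UnaryTransformer[String, String, PunctuationStripper] {

  def this() = this(Identifiable.randomUID("punctuationStripper"))

  // The per-value function applied to the input column (assumes non-null input).
  override protected def createTransformFunc: String => String =
    _.replaceAll("\\p{Punct}", "")

  // Type of the generated output column.
  override protected def outputDataType: DataType = StringType
}

// Hypothetical example object; not part of Spark.
object CustomStageSketch {
  // Runs the custom stage followed by Tokenizer and returns the tokens.
  def run(spark: SparkSession): Seq[String] = {
    val docs = spark.createDataFrame(Seq((0L, "Hello, Spark!"))).toDF("id", "text")

    val stripper = new PunctuationStripper()
      .setInputCol("text").setOutputCol("clean")
    val tokenizer = new Tokenizer().setInputCol("clean").setOutputCol("words")

    // The custom stage is chained like any built-in stage.
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](stripper, tokenizer))

    pipeline.fit(docs).transform(docs)
      .select("words").head().getSeq[String](0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("CustomStageSketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```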