Machine Learning for Humans

Updated on Oct 7, 2025

Table of Contents

Key Terms and Machine Learning Algorithms
Machine Learning Pipelines

Introduction

In this section we look at MLlib, Spark’s machine learning library, and how to use it. We also see how pipelines are built and where supervised and unsupervised learning fit in. Machine learning is a very broad field, so this tutorial covers only the essentials and does not go into detail.

Artificial intelligence is changing our present and is going to shape our future. It can be considered one of the biggest innovations of this century, and understanding it is quickly becoming an essential skill.

Law enforcement uses visual recognition and natural language processing to process footage from body cameras. The Mars Rover Curiosity even utilizes AI to autonomously select inspection-worthy soil and rock samples with high accuracy.

Today we come across many examples of machines taking over tasks that have always been done by humans. It would not be an exaggeration to think that in the future we could see housekeeping deliveries made to hotel rooms by a bot. We already know of experiments with pizza deliveries by drone.

Machine learning is a subcategory of artificial intelligence. Its goal is to enable computers to learn on their own. A machine learning algorithm picks out patterns in the training data, builds a model from them, and then predicts results or outcomes based on that past learning, without being explicitly programmed with rules.

Key Terms and Machine Learning Algorithms

In supervised learning, we have training examples that come with the correct labels. For example, to classify handwritten digits, the input is hundreds or thousands of images of handwritten digits, each with its correct label. The ML algorithm trains on these examples and learns the relationship between the images and the associated numbers, so that it can then classify new, unlabeled digits.

Examples of Supervised Machine Learning Models

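K-nearest neighbors: This model predicts a label from the labels of the most similar known examples. For instance, it can predict how a person would vote when the voting patterns of their neighbors are known.
Linear regression: This model fits a straight-line relationship between an input variable and a numeric target, which shows whether and how strongly the two variables are related.
Decision trees: This model represents a series of choices as branches, each leading to further choices or to an outcome, and predicts by following the branches that match the input.
Naive Bayes: This probabilistic classifier is commonly used to detect spam emails.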
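The k-nearest neighbors idea from the list above can be sketched in a few lines of plain Scala. This is only an illustration: the object name KnnSketch, the Labeled case class, and the toy data are assumptions made for this example and are not part of Spark MLlib.

```scala
// Hypothetical toy example: names and data are illustrative, not from Spark MLlib.
object KnnSketch {
  // One labeled training example: numeric features plus the known label.
  final case class Labeled(features: Vector[Double], label: String)

  // Euclidean distance between two equally sized feature vectors.
  def distance(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Predict by majority vote among the k training examples closest to the query.
  def predict(training: Seq[Labeled], query: Vector[Double], k: Int): String =
    training
      .sortBy(example => distance(example.features, query))
      .take(k)
      .groupBy(_.label)
      .maxBy { case (_, members) => members.size }
      ._1

  // Two well-separated groups of "voters", labeled with the party they voted for.
  val training: Seq[Labeled] = Seq(
    Labeled(Vector(1.0, 1.0), "A"),
    Labeled(Vector(1.2, 0.8), "A"),
    Labeled(Vector(0.9, 1.1), "A"),
    Labeled(Vector(5.0, 5.0), "B"),
    Labeled(Vector(5.2, 4.9), "B"),
    Labeled(Vector(4.8, 5.1), "B")
  )
}
```

Calling KnnSketch.predict(KnnSketch.training, Vector(1.1, 1.0), 3) looks at the three closest voters, all of whom chose "A", and so predicts "A". Sorting the whole training set on every query is fine for a sketch but would be too slow for large datasets.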

How do you find the underlying structure of a dataset? How do you summarize it and group it most usefully? How do you effectively represent data in a compressed format? These are the goals of unsupervised learning, which is called “unsupervised” because you start with unlabeled data (there’s no Y).

In contrast to supervised learning, it’s not always easy to come up with metrics for how well an unsupervised learning algorithm is doing. “Performance” is often subjective and domain-specific.

Examples of Unsupervised Machine Learning Models

Neural networks: These models are used for facial image detection and handwriting recognition. Such tasks are usually trained on labeled data, which makes them supervised, but some neural networks, such as autoencoders, learn from unlabeled data.
Clustering: This model groups the points of an unlabeled dataset by similarity. For example, it can be used in city planning to group houses and study their values based on location.
Latent Dirichlet Allocation (LDA): This model is used in natural language processing to identify common topics in a set of documents.
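As a rough illustration of the clustering entry above, the following sketch groups a handful of houses with Spark's KMeans estimator. The object name ClusteringSketch, the two features, and the numbers are invented for this example; only the Spark classes and methods are real.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example: the object name and the data are illustrative assumptions.
object ClusteringSketch {
  def cluster(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Each row is (distance to the city centre, price), in made-up units.
    // Note that there is no label column: the data is unlabeled.
    val houses = Seq(
      Vectors.dense(1.0, 9.0),
      Vectors.dense(1.5, 8.5),
      Vectors.dense(2.0, 9.5),
      Vectors.dense(20.0, 2.0),
      Vectors.dense(21.0, 2.5),
      Vectors.dense(19.5, 1.5)
    ).map(Tuple1.apply).toDF("features")

    // Ask for two clusters; the seed makes the random initialisation repeatable.
    val model = new KMeans().setK(2).setSeed(1L).fit(houses)

    // transform appends a "prediction" column holding each row's cluster index.
    model.transform(houses)
  }
}
```

The number of clusters has to be chosen up front here, which is one reason why judging the quality of an unsupervised result is often domain-specific.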

When building models used to make predictions, we often train a model based on an existing data set. The model may be re-trained as more and more training data sets become available. For example, we would re-train a recommendation engine based on collaborative filtering as we learned more about the events which led to product sales or targeted engagement metrics.

Machine Learning Pipelines

The goal of MLlib is to make machine learning easy to use and adapt for every user, and to make it more scalable. Each new release has added algorithms and performance improvements, and the developers have also put a lot of effort into making MLlib user-friendly. Like Spark Core, MLlib provides APIs in three programming languages: Scala, Java, and Python. This makes MLlib accessible to programmers from diverse backgrounds.

A practical machine learning pipeline usually consists of the following four stages:

Data pre-processing
Feature extraction
Model fitting
Validation

If we take the example of classifying text documents, the stages might be: text segmentation, feature extraction, training a classification model, and finally cross validation. This may look easy at first, because many freely available libraries cover the individual stages. With huge datasets, however, connecting the stages into a robust pipeline is not an easy task. Most ML libraries do not support distributed computing over huge datasets, and many of them have no native support for building and tuning pipelines.

Apache Spark’s pipeline API can be found in the package “spark.ml” (org.apache.spark.ml). An ML pipeline consists of multiple stages. The two basic kinds of pipeline stage are Transformer and Estimator.

A Transformer takes a dataset as its input and outputs another, augmented dataset, for example with an extra column. An Estimator, on the other hand, is fit on a dataset and produces a model, which is itself a Transformer.
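The difference between the two stage types can be sketched as follows. The object name StagesSketch, the helper names, and the toy data are assumptions made for this illustration. Tokenizer stands in for a Transformer and LogisticRegression for an Estimator.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example: object name, helper names, and data are illustrative.
object StagesSketch {
  // A Transformer: transform() returns the input dataset plus a new "words" column.
  def tokenized(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val docs = Seq((0L, "spark makes pipelines simple")).toDF("id", "text")
    new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
  }

  // An Estimator: fit() learns from the data and returns a model.
  // The returned LogisticRegressionModel is itself a Transformer.
  def fitted(spark: SparkSession): LogisticRegressionModel = {
    import spark.implicits._
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1)),
      (0.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(2.0, 1.3)),
      (1.0, Vectors.dense(0.0, 1.2))
    ).toDF("label", "features")
    new LogisticRegression().setMaxIter(10).fit(training)
  }
}
```

Because a fitted model is a Transformer, the output of fit can be used anywhere a Transformer is expected, which is what lets a pipeline chain both kinds of stage.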

Creating a pipeline in Spark is straightforward. We declare the stages, configure their parameters, and chain them together in a Pipeline object. Below is an example of a simple text classification pipeline. It consists of a tokenizer, a hashing term frequency feature extractor, and a logistic regression step.

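import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split the raw "text" column into a "words" column.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Stage 2: hash the words into a term-frequency vector with 1000 entries.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Stage 3: logistic regression on the "features" column.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
// Chain the three stages in order.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))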

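The pipeline itself is an Estimator, and hence we can call fit on the entire pipeline to train all of its stages in one step. The fitted model is a Transformer that we can then use to make predictions on new data:

// trainingDataset and testDataset are assumed to be DataFrames with "text" and "label" columns.
val model = pipeline.fit(trainingDataset)
model.transform(testDataset)
  .select("text", "label", "prediction")
  .collect()
  .foreach(println)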
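The validation stage mentioned earlier can be sketched with Spark's CrossValidator, which treats a whole pipeline as one estimator and tries several parameter combinations. The object name TuningSketch, the toy documents, and the chosen grid values are assumptions made for this example, not recommendations.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example: object name, data, and grid values are illustrative.
object TuningSketch {
  // Toy data: label 1.0 marks documents that mention "spark".
  def trainingData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("a b c d e spark", 1.0), ("b d", 0.0),
      ("spark f g h", 1.0), ("hadoop mapreduce", 0.0),
      ("b spark who", 1.0), ("g d a y", 0.0),
      ("spark fly", 1.0), ("was mapreduce", 0.0),
      ("e spark program", 1.0), ("a e c l", 0.0),
      ("spark compile", 1.0), ("hadoop software", 0.0)
    ).toDF("text", "label")
  }

  def tune(spark: SparkSession): CrossValidatorModel = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // 2 x 2 = 4 parameter combinations to compare.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // Each combination is trained and scored on 2 folds of the data;
    // the best one is then refit on the full dataset.
    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2)
      .setSeed(42L)
      .fit(trainingData(spark))
  }
}
```

The returned model exposes the averaged score for each combination in avgMetrics and the refit winner in bestModel. Two folds keep the sketch fast on a dozen rows; real datasets usually use more.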

Users who implement the pipeline interfaces can easily plug their own transformers or estimators into a machine learning pipeline. This also makes it easy to share code developed outside MLlib. Those who want to explore complete examples can look in the examples folder in the Spark repository.
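As a minimal sketch of plugging in a custom stage, the class below extends Spark's UnaryTransformer, a base class for transformers that map one input column to one output column. The class name LowerCaser and its behaviour are assumptions invented for this illustration.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical custom stage: lower-cases a string column.
// It does not handle null values; a real implementation should.
class LowerCaser(override val uid: String)
    extends UnaryTransformer[String, String, LowerCaser] {

  // Convenience constructor that generates a unique stage id.
  def this() = this(Identifiable.randomUID("lowerCaser"))

  // The function applied to every value of the input column.
  override protected def createTransformFunc: String => String = _.toLowerCase

  // The type of the output column.
  override protected def outputDataType: DataType = StringType

  // Reject input columns that are not strings.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType.")
}
```

An instance configured with setInputCol and setOutputCol can then be placed in the array passed to a Pipeline's setStages alongside built-in stages such as a tokenizer.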

Conclusion

In this module we were introduced to Spark’s ML module and learned how to use it to build pipelines.
