
A Few Key Differences Between Supervised and Unsupervised Machine Learning

By David Ting May 1, 2018


About David Ting
David Ting has over 20 years of leadership experience at leading technology companies such as Yahoo, NetEase, and IGN. On the technology front, he has held key executive positions at Yahoo, IBM, IGN, and AltaVista, and is a big believer in combining cutting-edge innovation with scalability, simplicity, and quick time-to-market. He holds 6 patents and has won 2 IBM Outstanding Achievement awards and AltaVista Employee of the Year.

Introduction

Many technical articles describe supervised and unsupervised machine learning methods in detail. This guide instead explains a few high-level differences that matter when choosing between the two.

Comparison 1: Labels vs. No Labels

If Supervised Machine Learning (SML) is analogous to “learning with a teacher,” then Unsupervised Machine Learning (UML) is “learning without a teacher.” The teachers, in this case, are the labels. The supervised approach begins by training a model on labeled data, where the input is any set of attributes (the so-called features) that describe the data, and the output is the label. The task of training is to minimize the difference between the label the model predicts and the actual label. The resulting model can then predict: new data is fed into the model and mapped to an output value. In contrast, UML analyzes the data directly to build a model, without any prior knowledge. It relies on exploring the data, with no labels as guidance. Although this may seem improbable, unsupervised learning can be witnessed in everyday life. For example, a visitor to an art exhibition may have no knowledge of art, but after seeing many works, the differences between abstract art and hyper-realistic paintings become discernible.
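
As a minimal sketch of “learning with a teacher,” the Python snippet below trains a nearest-centroid classifier on labeled examples and then maps new data to an output label. The classifier choice, the function names, and the toy art-style scores are illustrative assumptions, not something prescribed by this article.

```python
from collections import defaultdict

def train_centroids(features, labels):
    """Supervised step: average the feature vectors of each labeled class."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(centroids, x):
    """Map a new data point to the label of the closest class centroid."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

# Hypothetical toy data: [brush_stroke_visibility, photo_likeness], each in 0..1.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = ["abstract", "abstract", "hyper-realistic", "hyper-realistic"]

model = train_centroids(features, labels)
print(predict(model, [0.85, 0.15]))
print(predict(model, [0.15, 0.95]))
```

Without the `labels` list, `train_centroids` has nothing to learn from, which is exactly the situation unsupervised methods are meant to handle.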

Comparison 2: Classification vs. Clustering

Supervised machine learning consists of classification and regression, while unsupervised machine learning often relies on clustering (the separation of data into groups of similar objects). In classification, the model learns from labeled data which category each data point belongs to. The main task of supervised machine learning is to find a model that minimizes prediction error. Unsupervised machine learning, on the other hand, focuses on estimating the strength of the connections between data points. An unsupervised learning algorithm can therefore begin forming clusters once it has learned how to recognize similarities.
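
To make the clustering side concrete, here is a small k-means sketch in plain Python. It uses squared Euclidean distance as the similarity measure and, as a simplifying assumption, takes the first k points as initial centers; real implementations usually use random or smarter initialization. The data points are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance: smaller means more similar."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, iterations=10):
    """Group points into k clusters using only their mutual distances."""
    # Assumption: the first k points serve as the initial centers.
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers

# Hypothetical unlabeled data with two visible groups.
points = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.1], [0.9, 1.1], [9.2, 8.9]]
assignment, centers = k_means(points, k=2)
print(assignment)
```

Note that the output is only a cluster index per point; nothing in the algorithm says what either cluster means.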

Comparison 3: Label Upon Classification vs. Label After Clustering

Once a supervised learning algorithm produces results, they are labeled directly, for example as good or bad; that is, once the classification is completed, the label has been created. This can be compared to an herbal medicine pharmacist preparing labels for medicines as they are dispensed.

Figure 1: Dispensing herbal medicine with clear labels.

The result of clustering-based UML is only a group of clusters (association-based UML, such as that used in recommendation systems, won’t be discussed here). This is analogous to separating individual herbal medicines from a mixed pile. All a layman can do is separate the herbs by how they look. To truly classify the herbs by their medical characteristics, the opinion of an herbal medicine expert is required. In a similar fashion, the quality of an unsupervised learning algorithm will depend on the expertise of the people who build it.

Figure 2: Separating medicinal herbs into respective piles.
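
The “label after clustering” step can be sketched as follows: the clusters come first, and names are attached afterwards from a handful of expert-reviewed items. The majority-vote rule, the function name, and the herb names are assumptions made for illustration only.

```python
from collections import Counter

def name_clusters(assignment, expert_labels):
    """Give each cluster the majority label among its expert-reviewed members.

    assignment: cluster id per item, as produced by any clustering step.
    expert_labels: {item_index: label} for the few items an expert reviewed.
    """
    votes = {}
    for index, label in expert_labels.items():
        votes.setdefault(assignment[index], Counter())[label] += 1
    return {
        cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()
    }

# Hypothetical example: six herbs already split into two piles (clusters 0 and 1).
assignment = [0, 0, 1, 1, 0, 1]
# An expert inspects only three of them.
expert_labels = {0: "ginseng", 2: "licorice root", 4: "ginseng"}

cluster_names = name_clusters(assignment, expert_labels)
labeled = [cluster_names.get(c, "unreviewed") for c in assignment]
print(labeled)
```

The clustering itself never sees the expert's labels; they are only used to interpret piles that already exist.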

Comparison 4: Non-transparent vs. Explainable

For every output of a supervised algorithm, the result, or label, always falls into a certain class, e.g., yes or no. Supervised learning, especially methods built on regression models, multiplies the input fields by a weight vector [w1, w2, w3, …, wn] to classify each output a certain way. No justification is available for why those specific weights are used, other than that the learning algorithm produced them. For scenarios such as anti-money laundering that require understandable rules, it is difficult to rely purely on supervised learning methods. In such cases, unsupervised clustering methods can be more helpful for explainability, as they can provide the list of features used to group items, which can then serve as rules describing each cluster.
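
The contrast can be illustrated with a short sketch. The first function produces a yes/no decision from a weight vector; the second lists the field values that every member of a cluster shares, which can be read as a rule. All weights, field names, and records below are hypothetical.

```python
def linear_decision(weights, bias, x):
    """Supervised-style output: a weighted sum of the fields, thresholded at 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "yes" if score > 0 else "no"

def shared_attributes(cluster_rows):
    """Unsupervised-style explanation: fields on which every member agrees."""
    first = cluster_rows[0]
    return {
        field: value
        for field, value in first.items()
        if all(row.get(field) == value for row in cluster_rows)
    }

# Hypothetical learned weights; the numbers alone do not say why they were chosen.
decision = linear_decision([0.8, -1.5, 0.3], -0.2, [1.0, 0.2, 0.5])
print(decision)

# Hypothetical cluster of accounts found by a clustering step.
cluster = [
    {"ip_subnet": "10.0.3", "signup_hour": 3, "email_domain": "a.example"},
    {"ip_subnet": "10.0.3", "signup_hour": 3, "email_domain": "b.example"},
    {"ip_subnet": "10.0.3", "signup_hour": 3, "email_domain": "c.example"},
]
rule = shared_attributes(cluster)
print(rule)
```

The `rule` dictionary reads directly as “same subnet, same signup hour,” whereas the weight vector needs further analysis before it can be explained to a reviewer.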

Comparison 5: The Scalability of DataVisor’s Unsupervised Machine Learning Approach

Adding a field to the data of a working n-dimensional model, so that it has n+1 dimensions, would likely break the original classification or clustering system if the new field is a very strong feature. In supervised machine learning, the weight values would be drastically altered. However, the unsupervised algorithm developed by DataVisor is more scalable and easier to adjust to new fields of data or features.

Choosing Between Supervised Learning and Unsupervised Learning

With the above comparisons in mind, a more informed decision can be made when choosing between a supervised and an unsupervised approach.

  1. If there are no labels on the training data, use unsupervised learning. The better the data is understood, the more accurate the model will be. The following characteristics of the data should be understood: whether each feature value is discrete or continuous, whether there are missing fields, the causes of missing values in a field, whether outliers are present in the data, and the frequency of each feature value.
  2. Can the quality of the data be improved? In practical applications, even if training data is not readily available, some samples can be manually labeled to improve the conditions for supervised learning. Unsupervised learning should be used when the data cannot be classified manually or manual classification is too expensive. For example, in a bag-of-words model, the k-means algorithm is used to group and represent the data, because there is a large amount of high-dimensional data and manually separating it into multiple groups would be too difficult. Imagine 50 sets of 1,000-piece puzzles mixed together. Would it be easy to categorize 50,000 puzzle pieces? In situations like this, unsupervised learning may be the better choice.
  3. With sufficient training labels, supervised learning is a better choice than unsupervised learning, as having guidance is better than spending time exploring. For example, even for excellent students, having the answers to a mock exam is better than working on a problem without knowing whether they are doing the right thing. After completing the mock exam, the students can check the accuracy of their answers. However, if the underlying distribution of the data renders most supervised learning algorithms unfit, it might be more appropriate to choose unsupervised machine learning.
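
The data checks named in item 1 of the list above (discrete or continuous values, missing fields, outliers, value frequencies) can be sketched with Python's standard library. The thresholds, the z-score outlier test, and the sample column are illustrative assumptions rather than recommended settings.

```python
from collections import Counter
from statistics import mean, stdev

def profile_column(values, discrete_max_unique=5, outlier_z=3.0):
    """Summarize one feature column; None marks a missing value.

    The thresholds are illustrative assumptions, not recommended defaults.
    """
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    unique = set(present)
    is_continuous = numeric and len(unique) > discrete_max_unique
    report = {
        "missing": len(values) - len(present),
        "kind": "continuous" if is_continuous else "discrete",
        "frequencies": Counter(present).most_common(3),
        "outliers": [],
    }
    if numeric and len(present) > 2:
        mu, sigma = mean(present), stdev(present)
        if sigma > 0:
            # Flag values whose distance from the mean exceeds outlier_z sigmas.
            report["outliers"] = [
                v for v in present if abs(v - mu) / sigma > outlier_z
            ]
    return report

# Hypothetical transaction amounts with two missing entries and one extreme value.
amounts = [12, 15, 11, None, 14, 13, 12, 500, 15, None, 11, 13]
report = profile_column(amounts, outlier_z=2.5)
print(report["missing"], report["kind"], report["outliers"])
```

The sketch does not explain why values are missing; that part of the checklist still requires knowledge of how the data was collected.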
