
Unsupervised Machine Learning: A 5-Minute Beginner’s Guide

By David Ting June 5, 2018


About David Ting
David Ting has over 20 years of leadership experience at leading technology companies such as Yahoo, NetEase, and IGN. On the technology front, he has held key executive positions at Yahoo, IBM, IGN, and AltaVista, and is a big believer in combining cutting-edge innovation with scalability, simplicity, and quick time-to-market. He holds 6 patents and has won 2 IBM Outstanding Achievement awards and the AltaVista Employee of the Year award.

Background

We get a lot of questions about Unsupervised Machine Learning here at DataVisor, because UML is at the core of our detection platform. In this 5-minute primer on UML, we start by defining the overarching field of Artificial Intelligence, then we drill down to the sub-field of Machine Learning, and lastly we discuss the various machine learning techniques, including UML, and when each ML technique is most effective.

What is Artificial Intelligence?

Artificial intelligence is a broad branch of computer science dealing with the simulation of intelligent behavior in computers. After analyzing the cat in Figure 1A, both a person and a working AI model can identify that Figure 1B is also a cat.

Figure 1A
Figure 1B

This ability is a simulation of human intelligence. The AI model has the ability to identify cartoon cats based on real cats.

But how does the AI model identify that Figure 1B is a cat? A rudimentary method would be for a programmer to manually create an enormous, detailed decision tree, hard coding each branch by hand, that would allow the model to identify the cat. Machine learning is the branch of artificial intelligence that solves this problem by using training data to “teach” an algorithm how to do a task, rather than having to hard code it manually.

What is Machine Learning?

Machine Learning is a branch of Artificial Intelligence that allows algorithms to learn from existing data and then apply that knowledge to new data. In our example of identifying a cat, a machine learning model would analyze large numbers of cat photos and illustrations and it would “learn” to identify cats based on that data.

Figure 2: A training data set of cat photos

Many machine learning algorithms have been developed to help computers identify objects, including neural networks, Bayesian classifiers, decision trees, and clustering algorithms. These algorithms can broadly be grouped into three primary categories: supervised learning, unsupervised learning, and reinforcement learning. We’ll cover each in turn and discuss its most common use cases.

Supervised Machine Learning (and Deep Learning)

Supervised learning is the most common type of machine learning. It requires labeled training data and the training goal is to be able to label the new data (test data) correctly. For example, to teach an algorithm to label e-mails as spam, we manually label a specific number of e-mails as spam or non-spam and provide these to the supervised machine learning model as training data. The model will learn from the e-mails and labels. Once this is complete, we introduce unlabeled new e-mails and the model will be able to identify whether each e-mail is spam or non-spam based on what it learned from the training data set.
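
The spam example above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the six training e-mails are made up, and the naive Bayes word-counting approach is simply one well-known supervised technique chosen here for brevity. The function names `train` and `predict` are our own.

```python
import math
from collections import Counter

# Hypothetical labeled training data: (e-mail text, label).
TRAINING_EMAILS = [
    ("win free money now", "spam"),
    ("free prize claim your money", "spam"),
    ("limited offer win a free prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("please review the project report", "non-spam"),
    ("lunch on monday with the project team", "non-spam"),
]

def train(examples):
    """Count words per label; these counts are the learned model."""
    word_counts = {}
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior: how common this label was in training.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_EMAILS)
# New, unlabeled e-mails that were not part of the training data.
print(predict(model, "claim your free money"))
print(predict(model, "project meeting on monday"))
```

The model never sees a rule such as "free means spam"; it only sees labeled examples, and the word statistics it collects are what let it label the two new e-mails.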

One particularly popular technique, most often applied to supervised learning, is deep learning, in which a computer algorithm simulates the way a human brain learns by creating and reinforcing connections between features, much as the brain creates and reinforces neural connections. A deep learning model analyzes the photos in many different ways, some of them hidden from view. Each analysis step is called a layer, and the deep learning model creates many layers at different levels of abstraction to discover ways to represent the data. Low-level layers might capture basic color or contrast data. Mid-level layers might capture edges and shapes. High-level layers might capture human-recognizable features like whiskers, eyes, and ears. By analyzing the layers at different levels, deep learning is able to learn to recognize cat photos.
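
To make the idea of layers concrete, here is a toy sketch of a two-layer network. It is only an illustration under strong assumptions: the weights are set by hand rather than learned, the inputs are two on/off signals rather than photos, and the names `neuron` and `tiny_network` are hypothetical. What it does show is how a higher layer builds a more abstract feature out of simpler features computed by the layer below it.

```python
def step(value):
    """Simple on/off activation: fire (1) only for a positive input."""
    return 1 if value > 0 else 0

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)

def tiny_network(x1, x2):
    # Layer 1 (lower level): two simple detectors over the raw inputs.
    either_on = neuron((x1, x2), (1.0, 1.0), -0.5)  # fires if at least one input is on
    both_on = neuron((x1, x2), (1.0, 1.0), -1.5)    # fires only if both inputs are on
    # Layer 2 (higher level): combines the detectors into "exactly one input is on",
    # a feature that no single neuron over the raw inputs can compute.
    return neuron((either_on, both_on), (1.0, -1.0), -0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", tiny_network(x1, x2))
```

In a real deep learning model the weights and biases are adjusted automatically during training, and there are far more layers and neurons, but the stacking principle is the same.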

Figure 3: Lynx

Unsupervised Machine Learning

Unsupervised learning is often used to discover patterns within large amounts of unlabeled data. Its training data is unlabeled, and the training goal is to identify clusters of similar data points. For example, an unsupervised learning algorithm should be able to distinguish a group of “cat” photos from a large variety of other pictures, based on the characteristics shared by the photos of cats.

What makes DataVisor’s anti-fraud algorithm unique is its use of unsupervised learning. There are three main applications of unsupervised learning: clustering, anomaly detection, and dimensionality reduction. With clustering, an algorithm assigns observations to groups one by one, so that the members of each group share one or more features. Properly extracting features is the most critical aspect of unsupervised learning. For example, to identify cats, we try to extract the characteristics of cats: fur, limbs, ears, eyes, whiskers, teeth, tongues, and the like. By clustering animals with the same characteristics, cats can be grouped together. At this point we do not yet know what the group is; we only know that all data in the group belongs to the same category. Rabbits and airplanes are not in this category, since their characteristics do not fit. The quality of the features directly determines the effectiveness of the algorithm. If we cluster by weight and ignore body features, it’s difficult to distinguish between rabbits and cats.
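
The point about feature quality can be sketched with a small clustering example. The code below is a minimal k-means implementation written for this illustration; it is not DataVisor’s algorithm, and the animal measurements (ear length, tail length, weight) are invented. The species names are kept only so we can check the result afterwards; the clustering itself never sees them.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k=2, max_iterations=100):
    """Group points into k clusters; returns one cluster index per point."""
    # Deterministic start: first point, then the point farthest from chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so we have converged
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Hypothetical measurements: (species, ear length cm, tail length cm, weight kg).
ANIMALS = [
    ("cat", 6.0, 30.0, 4.0),
    ("cat", 6.5, 28.0, 4.5),
    ("cat", 5.5, 32.0, 3.5),
    ("rabbit", 11.0, 5.0, 4.2),
    ("rabbit", 12.0, 6.0, 3.8),
    ("rabbit", 10.5, 4.5, 4.4),
]

body_labels = k_means([(ears, tail) for _, ears, tail, _ in ANIMALS])
weight_labels = k_means([(weight,) for _, _, _, weight in ANIMALS])

for (species, *_), by_body, by_weight in zip(ANIMALS, body_labels, weight_labels):
    print(f"{species:7s} body-feature cluster={by_body}  weight-only cluster={by_weight}")
```

With the body features, the three cats land in one cluster and the three rabbits in the other, even though the algorithm has no idea what a cat is. With weight alone, the two species overlap and the clusters mix them together.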

DataVisor’s anti-fraud work detects fraudulent activity, including malicious registration, hacking, fraudulent loans, and so on. DataVisor’s strength is modeling user behavior and analyzing relationships between users, which lets it effectively capture fraud groups and stop fraud in a timely manner.

Reinforcement Learning

Reinforcement learning is often used in robotics. The goal of the algorithm is to train the machine to perform various actions. Most of the time, the machine is placed in a specific environment in which it can self-train continuously, and the environment gives positive or negative feedback. The model continuously improves its decision-making by learning from feedback from past actions.
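
The feedback loop described above can be sketched with tabular Q-learning, one common reinforcement learning method (used here as an example, not as the only approach). Everything about the environment is hypothetical: a corridor of five positions where reaching the right end gives positive feedback (+1) and reaching the left end gives negative feedback (-1). The learning rate, discount, and exploration values are arbitrary choices for the sketch.

```python
import random

N_STATES = 5                  # positions 0..4 in a hypothetical corridor
START = 2                     # the machine always starts in the middle
ACTIONS = ("left", "right")
REWARDS = {0: -1.0, 4: 1.0}   # feedback given at the two ends; both end the episode

def step(state, action):
    """The environment: apply an action, return (next_state, reward, done)."""
    next_state = state - 1 if action == "left" else state + 1
    return next_state, REWARDS.get(next_state, 0.0), next_state in REWARDS

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # Q-table: the machine's current estimate of how good each action is in each state.
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                        # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # use what was learned
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward the feedback plus the discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

Nobody tells the machine that "right" is the correct direction. It tries actions, receives feedback from the environment, and gradually prefers the actions that led to positive feedback.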

Which Machine Learning Technique Should I Use?

Different machine learning techniques are appropriate for different situations. So, how do we evaluate the fitness of the algorithm? To start, let’s define a few terms so that we can precisely discuss when an algorithm is successful or unsuccessful.

True Positive (TP): A positive instance that is correctly identified as a positive instance by the model

True Negative (TN): A negative instance that is correctly identified as a negative instance by the model

False Positive (FP): A negative instance that is mis-identified as a positive instance

False Negative (FN): A positive instance that is mis-identified as a negative instance

Take cat identification as an example. Assume that the model has acquired a certain recognition ability through learning. We give it 4 pictures, and the model’s predictions are as follows:

Figure 5: Machine Judgement Result

To understand the effectiveness of a machine learning technique, there are three commonly used evaluation indicators: precision, recall, and accuracy.

Precision: What fraction of positive identifications were correct? This is calculated as TP/(TP+FP).

Recall: What fraction of all truly positive instances did we identify as positive? This is calculated as TP/(TP+FN).

Accuracy: What fraction of predictions (both positive and negative) were correct? This is calculated as (TP+TN)/(TP+TN+FP+FN).

The higher these three indicators, the more effective the algorithm.
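
As a worked example, the three formulas can be computed directly from a list of true labels and model predictions. The eight pictures below are hypothetical (they are not the data from Figure 5), and the helper names are our own; `True` means "cat".

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for two equal-length lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # cat, predicted cat
    tn = sum(1 for a, p in pairs if not a and not p)  # not a cat, predicted not a cat
    fp = sum(1 for a, p in pairs if not a and p)      # not a cat, predicted cat
    fn = sum(1 for a, p in pairs if a and not p)      # cat, predicted not a cat
    return tp, tn, fp, fn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical ground truth and model output for eight pictures.
actual    = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  True,  False, False, False]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print("TP, TN, FP, FN:", tp, tn, fp, fn)       # 2, 3, 2, 1
print("precision:", precision(tp, fp))         # 2 / (2 + 2)
print("recall:", recall(tp, fn))               # 2 / (2 + 1)
print("accuracy:", accuracy(tp, tn, fp, fn))   # (2 + 3) / 8
```

Note that the three indicators can disagree: here the model finds two of the three cats (recall of about 0.67) but only half of its "cat" predictions are right (precision of 0.5), which is why it helps to look at all three together.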

Coordinated fraud is common in today’s online environment and unsupervised algorithms can effectively capture fraud rings. When DataVisor’s unsupervised algorithm is applied to certain fraud scenarios, its accuracy rate can be as high as 99%. This demonstrates the applicability and effectiveness of unsupervised algorithms in the Internet industry.

