Many technical articles describe supervised and unsupervised machine learning methods. This guide explains a few high-level differences to consider when choosing between the two.
Comparison 1: Labels vs. No Labels
If Supervised Machine Learning (SML) is analogous to “learning with a teacher,” then Unsupervised Machine Learning (UML) is “learning without a teacher.” The teachers, in this case, are the labels. The supervised approach trains a model on labeled data: the input is any set of attributes (the so-called features) that describe the data, and the output is the label. Training minimizes the difference between the predicted label and the actual label. The trained model can then predict: new data is fed into the model and mapped to an output value. In contrast, UML has no prior knowledge and analyzes the data directly to build a model, relying on data exploration without labels as guidance. Although this may seem improbable, unsupervised learning can be observed in everyday life. For example, when visiting an art exhibition, one may have no knowledge about art, but after seeing many works, the differences between abstract art and hyper-realistic paintings become discernible.
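The contrast can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions: the two-number features, the "abstract" and "realistic" labels, and the choice of a nearest-centroid classifier and a plain k-means loop are all hypothetical stand-ins, not methods named in this guide.

```python
from math import dist

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

def fit_supervised(features, labels):
    """Learning with a teacher: compute one centroid per known label."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {label: mean(pts) for label, pts in grouped.items()}

def predict(centroids, x):
    """Map new data to the label whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], x))

def kmeans(features, k, iterations=10):
    """Learning without a teacher: group points using the features alone."""
    centroids = list(features[:k])  # naive initialization (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in features:
            nearest = min(range(k), key=lambda i: dist(centroids[i], x))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda i: dist(centroids[i], x)) for x in features]

# Hypothetical data: two made-up measurements per painting.
features = [(1.0, 1.2), (5.0, 5.2), (0.8, 1.0), (5.3, 4.9), (1.1, 0.9), (4.8, 5.1)]
labels = ["abstract", "realistic", "abstract", "realistic", "abstract", "realistic"]

model = fit_supervised(features, labels)   # uses the labels
prediction = predict(model, (0.9, 1.1))    # output is a named label
cluster_ids = kmeans(features, k=2)        # never sees the labels
```

The supervised model returns a named label for new data, while the clustering step only returns anonymous group numbers that still need to be interpreted.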
Comparison 2: Classification vs. Clustering
Comparison 3: Label Upon Classification vs. Label After Clustering
Once a supervised learning algorithm produces a result, that result is directly labeled, for example as good or bad; in other words, the label is created as soon as the classification is completed. This can be compared to a herbal medicine pharmacist preparing labels for medicines as they are dispensed.
Figure 1: Dispensing herbal medicine with clear labels.
Figure 2: Separating medicinal herbs into respective piles.
Comparison 4: Non-transparent vs. Explainable
Every output of a supervised algorithm, the result or label, falls into a certain class, e.g., yes or no. Supervised learning models, especially those built on regression, multiply each field by a weight from a weight vector [w1, w2, w3…wn] to arrive at that classification. No justification is available for why specific weights are used, other than that the learning algorithm produced them. For scenarios such as anti-money laundering that require understandable rules, it is difficult to rely purely on supervised learning methods. In this case, unsupervised clustering methods can offer better explainability, as they can provide the list of features used to group items, which can serve as rules describing how the clusters were formed.
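The weighted scoring described above can be sketched in a few lines of Python. This is a simplified illustration, not any particular product's model: the weights, bias, threshold, and field values are made-up numbers, and a real regression model would learn its weights from labeled data.

```python
def classify(fields, weights, bias=0.0, threshold=0.0):
    """Multiply each field by its weight, sum, and compare to a threshold."""
    score = sum(w * x for w, x in zip(weights, fields)) + bias
    label = "yes" if score > threshold else "no"
    return label, score

# Hypothetical learned weights [w1, w2, w3]; nothing here explains
# why the training procedure settled on these particular values.
weights = [0.5, -1.0, 2.0]

label_a, score_a = classify([2.0, 1.0, 0.5], weights)
label_b, score_b = classify([1.0, 2.0, 0.0], weights)
```

The output is a clear yes or no, but the only available explanation for it is the list of numeric weights.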
Comparison 5: The Scalability of DataVisor’s Unsupervised Machine Learning Approach
Adding a field to the data of a working n-dimensional model, so that it has n+1 dimensions, would likely break the original classification or clustering system if the new field is a very strong feature. The weight values of a supervised machine learning model will be drastically altered. However, the unsupervised algorithm developed by DataVisor is more scalable and easier to adjust to new fields of data or features.
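The effect of a strong new field on supervised weights can be illustrated with a toy least-squares fit. This sketch is an assumption-laden example with made-up data and a plain linear model without an intercept; it does not represent DataVisor's algorithm or any specific production system.

```python
def fit_one_feature(x1, y):
    """Least-squares weight for y ~ w * x1 (no intercept)."""
    return sum(p * t for p, t in zip(x1, y)) / sum(p * p for p in x1)

def fit_two_features(x1, x2, y):
    """Least-squares weights for y ~ w1 * x1 + w2 * x2 (2x2 normal equations)."""
    a = sum(p * p for p in x1)
    b = sum(p * q for p, q in zip(x1, x2))
    c = sum(q * q for q in x2)
    d = sum(p * t for p, t in zip(x1, y))
    e = sum(q * t for q, t in zip(x2, y))
    det = a * c - b * b
    return (d * c - b * e) / det, (a * e - b * d) / det

# Hypothetical data: the target depends weakly on x1 and strongly on x2.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 2.0, 5.0]
y = [p + 3.0 * q for p, q in zip(x1, x2)]

w_before = fit_one_feature(x1, y)                  # model that only sees x1
w1_after, w2_after = fit_two_features(x1, x2, y)   # model after x2 is added
```

With only x1 available, the fit assigns it a weight of about 4.1; once the strong field x2 is added, the weight on x1 drops to 1 and x2 takes a weight of 3, so the original weight no longer holds.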
Choosing Between Supervised Learning and Unsupervised Learning
With the above comparisons in mind, a more informed decision can be made when choosing between a supervised and an unsupervised approach.
- If the training data has no labels, use unsupervised learning. The better the data is understood, the more accurate the model will be. The following characteristics of the data should be understood: whether each feature value is discrete or continuous, whether there are missing fields, the causes of missing values in a field, whether outliers are present in the data, and the frequency of each feature.
- Can the quality of the data be improved? In practical applications, even if training data is not readily available, some samples can be manually labeled to improve the conditions for supervised learning. Unsupervised learning should be used when the data cannot be classified manually or when manual classification is too expensive. For example, in a bag-of-words model, the k-means algorithm is used to group and represent the data because there is a large amount of high-dimensional data, and manually separating it into multiple groups would be too difficult. Imagine 50 sets of 1,000-piece puzzles mixed together. Would it be easy to categorize 50,000 puzzle pieces? In situations like this, unsupervised learning may be more helpful.
- With sufficient training labels, supervised learning is a better choice than unsupervised learning, as having guidance is better than spending time exploring. For example, even for excellent students, having the answers to a mock exam is better than working on a problem without knowing whether they are doing the right thing. After completing the mock exam, the students can check the accuracy of their answers. However, if the underlying distribution of the data renders most supervised learning algorithms unfit, it might be more appropriate to choose unsupervised machine learning.
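The data characteristics listed in the first point (discrete or continuous values, missing fields, outliers, and frequencies) can be checked with a small profiling helper. The following Python sketch is one possible approach under several assumptions: missing values are represented as `None`, a column counts as "discrete" when it has at most `discrete_max_unique` distinct values, "frequency" is read as how often each value occurs, and outliers are flagged with the common 1.5 × interquartile-range rule. The function name and the sample column are hypothetical.

```python
from collections import Counter
from statistics import quantiles

def profile_column(values, discrete_max_unique=10):
    """Summarize one feature column before choosing a learning approach."""
    present = [v for v in values if v is not None]
    summary = {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    numeric = all(isinstance(v, (int, float)) for v in present)

    # Heuristic (assumption): few distinct values means a discrete feature.
    if not numeric or summary["unique"] <= discrete_max_unique:
        summary["kind"] = "discrete"
        summary["frequency"] = dict(Counter(present))
    else:
        summary["kind"] = "continuous"

    # Flag outliers with the 1.5 * IQR rule when there is enough numeric data.
    if numeric and len(present) >= 4:
        q1, _, q3 = quantiles(present, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        summary["outliers"] = [v for v in present if v < low or v > high]
    return summary

# Hypothetical transaction amounts with one missing value and one extreme value.
amounts = [10, 12, 11, None, 13, 12, 11, 10, 12, 500]
summary = profile_column(amounts)
```

A helper like this does not explain why a value is missing; that still requires knowledge of how the data was collected.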