DataVisor Threat Blog:

Automated Feature Engineering

Swetha Basavaraj

Swetha is a Senior Product Manager at DataVisor. She has more than a decade of diverse experience leading teams as a product manager, entrepreneur, and engineer, launching new B2B products at Yahoo, IBX (now Tradeshift), VolvoCars, and IBM. Her ongoing focus is on building scalable enterprise products using the latest technologies, including machine learning.

How to achieve exceptional model accuracy in minutes instead of months with automated feature engineering and unsupervised machine learning.

Sophisticated modeling practices are critical for modern fraud management, and the ability to detect, deter, and defeat massive-scale coordinated attacks is made possible by the power of AI and machine learning.

All modeling processes have three main steps:

  1. Data Collection and Cleansing
  2. Feature Engineering
  3. Model Building and Evaluation

In this post, we’ll discuss feature engineering, which is one of the most important and valuable steps for achieving the highest quality results.

Feature Engineering

Let’s begin by establishing a basic definition of feature engineering. A feature is a characteristic that can help solve a problem using machine learning. The process of extracting such features from a raw dataset is called feature engineering. There is an art to this process, and final results depend on how well this step is managed. Domain expertise and data insights help create the right features that produce the best possible results.

The challenges of manual processes
Feature engineering is often still performed manually by data scientists. A data scientist will analyze data and then, based on their domain expertise and experience, decide what features to create. The goal is better model results, but because so many candidate features are available, overfitting is a common problem: a model with too many features or parameters can fit noise in the training data and then perform poorly on new data. Adequate tools and technical skills are required for successful feature engineering, and even then, the process can still be labor-intensive and time-consuming.

The benefits of automation
Where there is a clear problem to solve, and domain expertise that can be applied, it is possible to standardize certain features that can be used for building models. These features can be automatically derived or extracted from raw data. For example, an IP address is essential for fraud detection. For each IP address in the raw data, we should be able to derive additional features such as ip_prefix, ip_city, check_ip_from_datacenter, ip_country, and more. In this way, we can begin to develop automated processes that increase both efficiency and accuracy.
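
As a rough illustration of what such a derivation might look like, here is a minimal Python sketch using only the standard library. The function name derive_ip_features, the /24 prefix length, and the DATACENTER_RANGES list are assumptions made for this example, not DataVisor's implementation. City and country lookups are left out because they depend on an external geolocation database.

```python
import ipaddress

# Hypothetical datacenter ranges, for illustration only.
# These are reserved documentation blocks, not real hosting providers.
DATACENTER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def derive_ip_features(ip: str, prefix_len: int = 24) -> dict:
    """Derive simple features from a raw IPv4 address string."""
    addr = ipaddress.ip_address(ip)
    # strict=False masks the host bits, giving the enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return {
        "ip_prefix": str(network),
        "check_ip_from_datacenter": any(addr in net for net in DATACENTER_RANGES),
    }
```

A production system would presumably replace the hard-coded list with a maintained data source; the point here is only that each derived feature is a deterministic function of the raw value, which is what makes it a candidate for automation.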

Automated Feature Engineering with DCube

DCube, DataVisor’s comprehensive fraud detection platform, not only provides the necessary tools for modeling (data management, feature engineering, model review) but also automates the feature engineering process by providing hundreds of derived features based on data and mapping. The higher the data quality, the better these derived features will be. These extracted features can include:

  • Transform Features
  • Aggregated Features
  • Global Intelligence Network Features

Transform features
Transform features are created from one or more of the existing attributes of the raw data.

Example: From “event_time,” a user should be able to get derived features such as minute, hour, day, week, month, year, and date, automatically.
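
A transform like this can be sketched in a few lines of Python. The function name derive_time_features and the choice of ISO week numbering are assumptions for illustration; the post does not say how dCube defines "week" or which timestamp format it expects.

```python
from datetime import datetime


def derive_time_features(event_time: str) -> dict:
    """Split an ISO 8601 timestamp string into calendar-based features."""
    ts = datetime.fromisoformat(event_time)
    return {
        "minute": ts.minute,
        "hour": ts.hour,
        "day": ts.day,
        "week": ts.isocalendar()[1],  # ISO week number (assumed convention)
        "month": ts.month,
        "year": ts.year,
        "date": ts.date().isoformat(),
    }
```

For instance, derive_time_features("2019-03-15T14:30:00") yields hour 14, minute 30, and the date "2019-03-15", all from the single raw attribute.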

Aggregated features
To create aggregated features, records are grouped by the value of an attribute (for example, a device ID), and a feature is computed from the grouped records over a specific period of time. There are several out-of-the-box aggregated features calculated automatically by dCube based on the attributes available for feature engineering.

Example: A feature that calculates the total amount of the transactions processed from a particular device within a set 7-day period, counting only transactions whose amount exceeds $500.
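
The example above could be sketched as follows. This is one possible reading, assuming each transaction is a dictionary with hypothetical keys device_id, amount, and event_time, and that the 7-day period is a trailing window ending at a chosen reference time.

```python
from datetime import datetime, timedelta


def total_large_amount_by_device(transactions, as_of, min_amount=500.0, window_days=7):
    """Sum, per device, the amounts of transactions above min_amount
    that fall in the trailing window ending at as_of."""
    window_start = as_of - timedelta(days=window_days)
    totals = {}
    for tx in transactions:
        in_window = window_start < tx["event_time"] <= as_of
        if in_window and tx["amount"] > min_amount:
            device = tx["device_id"]
            totals[device] = totals.get(device, 0.0) + tx["amount"]
    return totals
```

Grouping key, filter threshold, and window length are all parameters, which hints at why a platform can generate many aggregated features from a small set of templates.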

Global Intelligence Network features
These features are derived from fraud data and patterns observed in our Global Intelligence Network (GIN), which comprises data from more than 4 billion global accounts.

Example: GIN provides a reputation score for each of the IPs within the raw data, based on global data and distribution. This score is based on the ratio of users detected as fraudulent to total users on a specific IP.
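
The ratio underlying such a score might be computed along these lines. This is only a sketch of the ratio itself: the function name ip_fraud_ratio and the (ip, user_id, is_detected) input shape are assumptions, and the actual GIN score may well apply further weighting or normalization that the post does not describe.

```python
from collections import defaultdict


def ip_fraud_ratio(observations):
    """Return, per IP, the share of distinct users flagged as detected.

    observations: iterable of (ip, user_id, is_detected) tuples.
    """
    all_users = defaultdict(set)
    detected_users = defaultdict(set)
    for ip, user_id, is_detected in observations:
        all_users[ip].add(user_id)
        if is_detected:
            detected_users[ip].add(user_id)
    # Every IP in all_users has at least one user, so no division by zero.
    return {
        ip: len(detected_users[ip]) / len(users)
        for ip, users in all_users.items()
    }
```

Counting distinct users rather than raw events keeps one very active account from dominating the ratio for its IP.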

Conclusion

Models are only as good as their data and features, and feature engineering is made more efficient and effective when the most important features necessary for fraud detection—as determined by extensive domain expertise—are automatically created. When paired with sophisticated unsupervised machine learning algorithms, automated feature engineering can deliver exceptional model accuracy in minutes instead of months.