Web Scraping

What is Web Scraping?

Web scraping refers to the automated extraction of large volumes of data from web pages and applications. Once obtained, this data can be used for a wide variety of purposes. Web scraping is often referred to as data scraping, screen scraping, or content scraping. Common scraping tools include bots, off-the-shelf or custom-coded scripts, and third-party scraping services. There are both legitimate and fraudulent forms of scraping. A bot that scrapes web page data for inclusion in search engine listings is a valid form of scraping. A bot that scrapes images and text from social media sites to create fake accounts is not. Fraudsters typically use scrapers to obtain real user data that can help make their fake accounts appear authentic.
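To make the definition concrete, here is a minimal sketch of what a custom-coded scraping script does: it parses a page's markup and pulls out the fields it wants. The page content, the `title` class name, and the `ProductScraper` and `scrape_titles` names are all hypothetical and chosen only for illustration. A real scraper would also fetch pages over the network and handle far messier markup.

```python
from html.parser import HTMLParser


class ProductScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_data(self, data):
        # Only keep text that appears inside a matching <h2>.
        if self._in_title and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


# Stand-in for a downloaded page; no network access is needed for this sketch.
PAGE = """
<html><body>
  <h2 class="title">Blue Widget</h2><p>$10</p>
  <h2 class="title">Red Widget</h2><p>$12</p>
</body></html>
"""


def scrape_titles(html):
    """Return the product titles found in the given HTML string."""
    parser = ProductScraper()
    parser.feed(html)
    return parser.titles


titles = scrape_titles(PAGE)
```

The same few lines serve a search engine indexing a page or a fraudster harvesting profile data, which is why the technique itself is neither legitimate nor fraudulent.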

What Should Companies Know About Scraping?

Scraped data comes from legitimate sources, so fake accounts camouflaged with scraped data can fool humans as well as AI algorithms. Fraudsters use scraped data and sophisticated techniques to commit many types of fraud. Examples include:

Application Fraud

A fraudster scrapes social media sites to obtain personal information for thousands, even millions, of people. The scraped personal data could include street address, city, birthdate, and occupation. This information is then used to complete and submit fraudulent loan and credit card applications. Scraped content helps the fraudsters create fraudulent applications that appear legitimate.

Fake Product Listings

A fraud ring scrapes product information and images from popular marketplaces such as Amazon, eBay, and Overstock.com. The data is used to create thousands of fake product listings on peer-to-peer (P2P) marketplaces such as Craigslist, OfferUp, and Wallapop. Fraudsters create fake product listings to scam money from buyers, phish for personal information, or trick buyers into purchasing counterfeit products. Fraudsters also use scraped content to create fake product reviews.

Digital Ad Fraud

A fraud ring creates a botnet that uses scraped content to bypass Authorized Digital Sellers (ads.txt) protections. Ads.txt is an Interactive Advertising Bureau (IAB) initiative that aims to prevent the sale of unauthorized digital advertising. A bot scrapes the content of an ad publisher site and creates fake ad publisher pages on another server so that new ad slots are created. The fraudsters then sell these new fraudulent ad slots, which are under falsified URLs, to authorized resellers that are listed in the original publisher’s ads.txt file. The ad slots, built using scraped content, appear to be from a legitimate ad publisher.
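For readers unfamiliar with ads.txt, the sketch below shows roughly how a buyer might parse a publisher's file and check whether a seller is listed. It assumes the common record layout of ad system domain, seller account ID, relationship (DIRECT or RESELLER), and an optional certification authority ID. The sample file contents, the domains, and the function names are hypothetical, and the parser deliberately ignores parts of the specification such as extension fields.

```python
from typing import List, NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    ad_system_domain: str
    seller_account_id: str
    relationship: str  # "DIRECT" or "RESELLER"
    cert_authority_id: Optional[str]


def parse_ads_txt(text: str) -> List[AdsTxtRecord]:
    """Parse data records from ads.txt content, skipping comments and variables."""
    records = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            continue  # variable lines such as contact=... or malformed rows
        relationship = fields[2].upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue
        cert = fields[3] if len(fields) > 3 and fields[3] else None
        records.append(
            AdsTxtRecord(fields[0].lower(), fields[1], relationship, cert)
        )
    return records


def is_listed_seller(records, ad_system_domain, seller_account_id):
    """True if the (ad system, account) pair appears in the publisher's file."""
    return any(
        r.ad_system_domain == ad_system_domain.lower()
        and r.seller_account_id == seller_account_id
        for r in records
    )


# Hypothetical file contents for illustration only.
SAMPLE_ADS_TXT = """\
# ads.txt for a hypothetical publisher
exampleexchange.com, pub-1234, DIRECT, abc123
resellernetwork.example, 9876, RESELLER
contact=adops@publisher.example
"""

records = parse_ads_txt(SAMPLE_ADS_TXT)
```

Note what this check does and does not establish. It confirms that a seller is listed in the publisher's file, but it says nothing about whether the page behind an ad slot really belongs to that publisher. That is the gap the scheme described above exploits.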

Whether it’s application fraud, fake product listings, or digital ad fraud, bad actors use sophisticated techniques to hide the fraudulent activity behind their fake accounts. Analyzing accounts on an individual basis is an ineffective strategy for fighting coordinated fraud. Only through holistic analysis can companies expect to surface correlated patterns and prevent fake accounts built with scraped data.

A Holistic Approach to Analyzing Accounts with DataVisor

With fraud techniques such as screen scraping, the fraud timeline has essentially three stages at which detection could have an impact. The first stage is the acquisition of data. Accurate fraud detection is virtually impossible at this stage because there are perfectly legitimate use cases for scraping data. Skipping ahead, the third stage is where scraped data is actively used for malicious purposes. Detection is certainly possible during this stage, but it is also essentially too late, as the damage is already in progress. The second stage, preparation, is where there is an opportunity to make a truly significant impact. Data has been obtained via scraping, and a fraudster is laying the groundwork for using that data in one or more attacks. The actions the fraudster takes during this stage leave a digital footprint that, while likely subtle and cleverly obfuscated, sophisticated fraud management solutions such as dCube can reveal through holistic data analysis. Actions that might seem innocent when viewed in isolation are shown to be part of coordinated malicious activity. By surfacing these revealing patterns, dCube can flag potentially fraudulent accounts and actions before they cause any harm.
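As a rough illustration of why looking across accounts matters, the sketch below links accounts that share any attribute value and reports groups above a size threshold. This is not a description of how dCube works. It is a generic union-find grouping, and the attribute names (`ip`, `bio_hash`), the sample accounts, and the `min_size` threshold are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_linked_clusters(
    accounts: Dict[str, Dict[str, str]], min_size: int = 3
) -> List[Set[str]]:
    """Group accounts that are connected through any shared attribute value."""
    parent = {account_id: account_id for account_id in accounts}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index accounts by each (attribute, value) pair they carry.
    by_value = defaultdict(list)
    for account_id, attrs in accounts.items():
        for key, value in attrs.items():
            by_value[(key, value)].append(account_id)

    # Accounts sharing a value are merged into the same cluster.
    for ids in by_value.values():
        for other in ids[1:]:
            union(ids[0], other)

    clusters = defaultdict(set)
    for account_id in accounts:
        clusters[find(account_id)].add(account_id)
    return [members for members in clusters.values() if len(members) >= min_size]


# Hypothetical accounts: no single pair looks alarming on its own.
SAMPLE_ACCOUNTS = {
    "a1": {"ip": "203.0.113.5", "bio_hash": "h1"},
    "a2": {"ip": "203.0.113.5", "bio_hash": "h2"},
    "a3": {"ip": "198.51.100.7", "bio_hash": "h2"},
    "a4": {"ip": "192.0.2.9", "bio_hash": "h9"},
}

suspicious = find_linked_clusters(SAMPLE_ACCOUNTS)
```

Here `a1` and `a3` share nothing directly, yet both connect to `a2`, so all three surface as one group while `a4` does not. A production system would weigh many more signals and would need to avoid linking accounts through attributes that are shared for innocent reasons, such as a common office network.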

Additional References

Blog Post: Emerging Fraud in Marketplaces: How Product Listing Fraud Is Gaining Traction

Blog Post: The Battle of Uncovering Fake Accounts

Solution: Application Fraud

Solution: Marketplaces

Solution: Social Platforms

Solution: Financial Services

Use Case: Spam & Fake Reviews
