Fighting Fraud with Big Data

December 22, 2016

Fraud comes in many forms, whether through misrepresentation, concealment, or intent to deceive. Traditional methods of identifying and fighting fraud have relied on data analysis to detect anomalies that signal a fraud event has taken place. These anomalies fall into two categories: known and unknown.

Known Fraud Schemes

Known fraud schemes can be easy to identify. They have been committed in the past and thus fit a recognizable pattern. Common known fraud schemes over the web include purchase fraud, internet marketing fraud, and retail fraud. Methods for identifying these patterns include tracking user activity, location, and behavior. Location can be tracked through the IP address, for example, to determine whether a user is concealing their identity or executing a transaction from a high-risk international location. A transaction can then be flagged when its location is determined to be high risk. Another form of location tracking uses the physical address. In the past, fraudsters have used unoccupied addresses to accept delivery of goods purchased through online and retail stores. The DPV notes returned by DOTS Address Validation identify vacant addresses in real time, and a vacant delivery address can be treated as a red flag.
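The known-pattern checks described above can be sketched as a handful of simple rules. This is only an illustrative sketch: the `Transaction` fields, the `red_flags` function, and the placeholder country codes are hypothetical names chosen for this example, and the vacancy flag is assumed to have already been derived from an address validation response rather than fetched here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder codes; a real list would come from your own risk data.
HIGH_RISK_COUNTRIES = {"XX", "YY"}


@dataclass
class Transaction:
    ip_country: str        # country resolved from the client's IP address
    uses_anonymizer: bool  # True if the IP appears to hide the user's identity
    address_vacant: bool   # True if address validation reported a vacant address


def red_flags(tx: Transaction) -> List[str]:
    """Return the known-pattern red flags raised by a single transaction."""
    flags = []
    if tx.uses_anonymizer:
        flags.append("concealed identity")
    if tx.ip_country in HIGH_RISK_COUNTRIES:
        flags.append("high-risk location")
    if tx.address_vacant:
        flags.append("vacant delivery address")
    return flags


if __name__ == "__main__":
    tx = Transaction(ip_country="XX", uses_anonymizer=False, address_vacant=True)
    print(red_flags(tx))
```

Returning a list of flags rather than a single yes/no keeps each rule independent, so a reviewer can see which known pattern a transaction matched.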

Identifying the Unknown

Unknown fraud schemes, on the other hand, are much more difficult to identify. They do not fall into known patterns, which makes detection more challenging. This is starting to change with the paradigm shift from reactive to proactive fraud detection made possible by Big Data technologies. With Big Data, the viewpoint becomes much larger: every individual event is analyzed, rather than a random sample of events, in the search for anomalies.
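As a rough illustration of scanning every event rather than a sample, the sketch below scores each transaction amount against the whole dataset. The article does not name a specific detection method; the modified z-score used here (based on the median and the median absolute deviation, with the commonly cited 3.5 cutoff) is a swapped-in technique, and `find_outliers` is a hypothetical name.

```python
from statistics import median


def find_outliers(amounts, threshold=3.5):
    """Return the indices of amounts whose modified z-score exceeds threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        return []  # no spread to measure against in this simple sketch
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]


if __name__ == "__main__":
    amounts = [20, 22, 19, 21, 23, 20, 950, 22]
    print(find_outliers(amounts))
```

Because every event is scored, the single unusual amount in the sample data cannot be missed the way it could be if only a few randomly chosen events were inspected.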

So What Is Big Data?

Big Data is generally defined as datasets that are too large or too complex for traditional data processing applications to handle. Big Data can be described by the following characteristics: Volume, Variety, Velocity, Variability, and Veracity.

Volume: The quantity of generated and stored data.

Variety: The type and nature of the data.

Velocity: The speed at which data is generated and processed.

Variability: Inconsistency of the data set.

Veracity: The quality of the captured data, which can vary greatly.

Tackling Big Data

With the advent of distributed computing tools such as Hadoop, wrangling these datasets into manageable workloads has become a reality. Spreading the workload across a cluster of nodes provides the throughput and storage space necessary to process such large datasets within an acceptable timeframe. Cloud hosting providers such as Amazon offer an affordable way to provision a preconfigured cluster, perform data processing tasks, and immediately shut the cluster down. This reduces infrastructure costs while leveraging the vast hardware resources available through Amazon’s network.
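The idea of spreading a workload across nodes can be sketched with the map-and-reduce pattern that tools such as Hadoop popularized. The example below is plain Python running in a single process, not the Hadoop API: each chunk stands in for the share of data one node would process, and the function names and sample records are hypothetical.

```python
from collections import Counter
from functools import reduce


def split(records, n_chunks):
    """Partition records into n_chunks slices, one per simulated node."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: count transactions per account within one chunk."""
    return Counter(account for account, _amount in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_by_account(records, n_chunks=3):
    # On a real cluster each chunk would be mapped on a different node in parallel.
    partials = map(map_chunk, split(records, n_chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    records = [("a", 10), ("b", 5), ("a", 7), ("c", 1), ("a", 2)]
    print(count_by_account(records))
```

Since the map step only looks at its own chunk and the reduce step only merges partial results, the same logic gives the same answer however many chunks the data is split into, which is what lets the work scale out across a cluster.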

Service Objects Joins the Fight

More recently, Service Objects has been employing Big Data techniques to mine datasets in the hundreds-of-terabytes range, collecting information and analyzing results to improve fraud detection in our various services. This ambitious project will provide an industry-leading advantage in the sheer amount of data collected, helping businesses validate identity, location, and a host of other attributes. Stay tuned for more updates about this exciting project.