Detecting Fraud Using AWS Neptune: An Insurance Industry Example

Awadelrahman M. A. Ahmed
15 min read · Mar 25, 2024

Insurance providers are tasked with identifying fraudulent claims, which can be challenging to detect when evaluating claims individually.

In this post, we use graph technology to analyze the connections between claims and the insured parties involved.

The hypothesis here is that we can uncover fraud networks, where a consortium of fraudsters submits multiple claims, assumes various identities, or engages in numerous minor claims to avoid detection.

This approach leverages interconnected data to reveal patterns and associations indicative of fraudulent activities, enhancing the efficiency of fraud detection within the insurance sector.

To grasp this concept more effectively, let’s consider a scenario commonly encountered in the insurance industry, often referred to as “fraud rings”. This term describes groups of fraudsters who collaborate to file multiple fraudulent claims or manipulate their roles across various incidents to exploit insurance benefits.

A Simple Example of a Fraud Ring

Imagine Alice and Bob are two individuals who have filed insurance claims with the same company. Their claims can be viewed in isolation, without considering any broader context or connections between them:

  • Alice’s Claim: Alice reports a cracked windshield, damaged by a flying object in a public parking lot. She had recently upgraded her insurance policy to include comprehensive coverage, which now covers the damage!
  • Bob’s Claim: Bob files a claim for damage to his car’s bumper, which he says occurred when he accidentally backed into a pole in a public parking lot. Bob, too, had just upgraded his insurance to cover such types of damage!

As the simple figure below shows, both claims seem legitimate and unrelated. Alice’s and Bob’s incidents are plausible, and there’s nothing immediately suspicious about either claim when they are viewed separately!

Individual Claims Analysis

However, when we look beyond the surface and explore additional information, a more complex picture emerges:

  1. Shared Repair Shop: Investigation reveals that both Alice and Bob chose the same repair shop for their repairs. This shop has been previously flagged by the insurance company for investigation due to concerns over billing irregularities and a pattern of repair costs that suspiciously align with the upper limits of policy coverage.
  2. Location Overlap: Further inspection shows that the incidents for which Alice and Bob filed claims both occurred in the same parking lot. This coincidence raises questions about the truthfulness of their separate incidents.
  3. Work Together: A brief review of the clients’ historical records reveals that Alice and Bob previously worked together at the same company. This indicates a relationship they did not disclose.

If we plot this information as a graph, we end up with the following:

It is important to note that while the connections and patterns identified between Alice’s and Bob’s claims raise suspicions of potential fraud, these findings do not definitively prove fraudulent activity. There remains the possibility that the overlap in repair shop choice and the coincidence of their incidents occurring in the same parking lot are purely coincidental.

However, given these pieces of interconnected information, the probability of mere coincidence becomes less likely. The combination of recently upgraded insurance policies, the selection of a repair shop previously flagged for irregularities, and their social ties, all contribute to a pattern that warrants further investigation.

In the context of insurance fraud detection, these connections do not serve as direct evidence of fraud but significantly increase the level of suspicion and justify a more thorough investigation.

From Relational Databases to Graph Databases

Now, let’s scale our scenario to reflect the real-world complexity insurance companies face: imagine hundreds of thousands of claims filed by thousands of people, many of whom might share various types of relationships.

Insurance companies typically store vast amounts of data in relational databases, organized into tables such as individuals, claims, repair shops, incident places, and companies.

Each table holds data related to its respective entity — for instance, the individuals’ table lists personal information, the claims table details each claim filed, and so on.

While relational databases excel at managing structured data, they can make it cumbersome to uncover complex patterns and relationships, such as those indicative of insurance fraud.

In such a vast network, manually identifying potential fraud becomes not just challenging, but practically impossible. It’s here that graph databases, such as AWS Neptune, can be leveraged to highlight the connections and patterns hidden by the data’s complexity.

Graph databases are designed to excel at mapping and analyzing relationships, making them uniquely suited to sorting through this kind of complexity. We are going to explore AWS Neptune in this post.

Amazon Neptune: High-performance graph analytics and serverless database for superior scalability and availability

Creating a Dataset

For this post, we will create our own dataset, keeping it small with only 10 clients.

Remember that the graph representation is not unique: you can choose whichever entities to represent as nodes and whichever relationships to represent as edges. For this post, we will use the following representation:

Node types:

We will define five node types:

  • Individual Nodes: Represent the clients of the insurance company.
  • Claim Nodes: Detail each claim filed, linked to both individuals and specific incidents.
  • Repair Shop Nodes: Represent the repair shops where the damages were fixed.
  • Incident Place Nodes: Indicate the locations where the incidents occurred.
  • Workplace Nodes: Reflect companies where individuals work, revealing connections.

Edge types:

To capture the relationships, we define three types of edges (sketched after the list below):

  • Individuals file claims, connecting individual nodes to claim nodes.
  • Claims are repaired at specific repair shops and occur at particular incident places, linking claim nodes to repair shop and incident place nodes.
  • Individuals work at companies, indicating their employment.
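To make this concrete, here is a tiny illustrative fragment of such a graph. The edge labels and the claim, shop, and place ids below are hypothetical placeholders; the real values come from the CSV files described next:

Ind1 -[works_at]-> comp1
Ind1 -[filed]-> Claim1 -[repaired_at]-> Shop1
Claim1 -[occurred_at]-> Place1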

To download this small dataset, you can access it on my GitHub here.

You will find two CSV files, nodes.csv and edges.csv, which look like the following:

Nodes file
Edges file

Amazon Neptune Database Architecture

Although the AWS documentation thoroughly explains their Neptune database service, I have tried to make this post comprehensive, so you can gain a complete understanding and hands-on experience with our specific use case and dataset.

For those familiar with Amazon Aurora’s cloud-native framework, Neptune’s architecture will seem similar. If Aurora’s architecture is new to you, it’s worth noting both systems share a robust, distributed structure designed to enhance data reliability and accessibility, as shown in the figure below.

Amazon Neptune Architecture

This architecture distributes six copies of data across three Availability Zones (AZs), ensuring high availability and fault tolerance.

The system operates on a quorum-based approach for its transactions, requiring a “three of six read quorum” for reads and a “four of six write quorum” for writes, enhancing both resilience and efficiency.

At the heart of Neptune is a single instance acting as the writer or master node, with support from up to 15 replicas that serve as compute nodes.

Thanks to shared storage, these replicas are exempt from individual data replication, bolstering read performance. The storage architecture benefits from a log-structured design that prioritizes incremental log records, speeding up data transfer between compute and storage layers.

Real-time, continuous data backup to S3 is a key feature, ensuring compute node performance remains unaffected.

Neptune facilitates cluster management through a variety of endpoints: a cluster endpoint for write operations, a reader endpoint that ensures load balancing across replicas, a loader endpoint for importing data (e.g., from S3), and specialized endpoints for Gremlin and SPARQL queries (Gremlin and SPARQL are graph query languages we show next).
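Roughly, and using a hypothetical cluster name, ID, and region (your actual endpoints are listed in the Neptune console), these endpoints look like the following:

my-cluster.cluster-abc123defghi.eu-west-1.neptune.amazonaws.com:8182 (cluster/writer endpoint)
my-cluster.cluster-ro-abc123defghi.eu-west-1.neptune.amazonaws.com:8182 (reader endpoint)
https://my-cluster.cluster-abc123defghi.eu-west-1.neptune.amazonaws.com:8182/loader (bulk loader)
wss://…:8182/gremlin and https://…:8182/sparql (Gremlin and SPARQL query endpoints)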

Neptune’s Supported Graph Query Languages

Neptune supports three of the most popular graph data modeling and query frameworks: Apache TinkerPop, RDF (Resource Description Framework), and openCypher, each with its own specialized query language.

Apache TinkerPop has Gremlin for graph traversal and manipulation, while RDF has SPARQL for the same purpose.

I am not comparing the two languages at this point, but it’s important to know that Neptune supports both and provides a dedicated endpoint for each.

It’s also important to note that while Neptune supports storing both Gremlin and SPARQL graph data within the same cluster, the data for each query language is stored separately. This means the graph data ingested via one query language can only be queried using that same language. In other words, data inserted with Gremlin queries is accessible only through Gremlin, and similarly, data entered through SPARQL is queryable only with SPARQL.

It is also worth noting that Neptune’s bulk loader accepts CSV format for Gremlin, while SPARQL supports N-Triples, N-Quads, RDF/XML, and Turtle. All files must be UTF-8 encoded; you can have a look at all the requirements here.

Creating a Neptune Cluster

Well, the obvious requirement is that you have an AWS account; create one if you do not have one yet!

To create a Neptune cluster, it’s important to understand its accessibility constraints. Neptune clusters can only be accessed through an EC2 instance residing in the same VPC or via Jupyter notebooks executed on SageMaker.

Therefore, alongside your Neptune instance, you’ll need to set up an EC2 instance within the same VPC, ensuring it’s configured with the correct security group to facilitate connectivity.

While setting up these components manually is entirely possible, Amazon provides a CloudFormation template that we can use. It can be found on the Neptune service page, in the user guide under Neptune Setup → Create a DB cluster → Create a cluster; here is the direct link! Select a template appropriate for your region; I will select Europe (Ireland), for instance, and launch the stack. You will start with something like the following interface:

Launched AWS CloudFormation Stack

The next step involves configuring your stack. You’ll be prompted to either stick with the default stack name or choose a new one. Important to note is the Neptune port, which is set to 8182; make sure that your security group is configured to allow inbound connections on this port.

When it comes to choosing the database and EC2 client instance types, we can go for the smaller available options; something like db.t3.medium for the database and t3.medium for the EC2 client is more than sufficient for our small dataset.

During the setup, you’ll be asked to provide a name for an SSH key pair. If you don’t already have a key pair, you can create one from the EC2 Console.

An interesting part of the configuration involves enabling additional features. You have the option to enable a notebook instance type, for which a t2.medium is a good choice. This action sets up a new EC2 instance with a Jupyter notebook on it, managed by SageMaker, which can also be used to query your database.

For the supported languages, selecting both the Gremlin console and the RDF (SPARQL) console will ensure that your EC2 instance is equipped with the necessary consoles to interact with your Neptune cluster.

Once you acknowledge and confirm the setup, the creation process begins. This process provisions all the necessary resources for your Neptune cluster.

Neptune CloudFormation Stack Completed and Resources are Created

In fact this creates three main resources:

A Neptune cluster that you can access from the Neptune UI in your AWS console; it looks like this:

Neptune Cluster is Available

A notebook hosted on SageMaker allows you to query your graph database, which can be immensely useful if you’re looking to utilize the graph database for learning on graphs. This access might enable you to develop models based on Graph Neural Networks on your graph datasets.

A Jupyter notebook is created to access the DB via SageMaker

The third main resource created is the EC2 instance; as we mentioned earlier, Neptune clusters can only be accessed through an EC2 instance residing in the same VPC or via Jupyter notebooks executed on SageMaker. You can look at this instance in the EC2 console, and it looks like this:

EC2 instance to access Neptune Cluster

Connecting to Neptune Cluster

To connect to our Neptune cluster, we have the option of using either the Jupyter notebook or the EC2 instance we set up.

Let’s try a quick connection through the EC2 instance. Accessing the EC2 instance from your local computer can be done using the key pair we created earlier; however, a simpler way is to use the EC2 Instance Connect service from the AWS console. This instance will act as a client that lets us talk to our cluster.

As you may remember, we configured our cluster for both Gremlin and SPARQL, so if we run the command ls we should see that two consoles are available, apache-tinkerpop-gremlin-console and eclipse-rdf4j, as shown below.

Two consoles available via the instance: Gremlin and SPARQL

Of course, we can run Gremlin queries on the command line and create nodes and edges using Gremlin syntax. However, since our data is in CSV files, it is far more convenient and practical to load the data in bulk; we will do that from S3 in the next section.

Loading Data into Neptune from S3

Bulk loading data into Neptune is done using what is called the loader endpoint. In short, the idea is that you initiate an HTTP POST request to this loader endpoint with your request formatted as JSON. This JSON includes the path to your S3 data file, enabling Neptune to access the data directly. But there are some points you must remember:

  1. Your S3 data must be reachable through an S3 VPC endpoint, allowing access from your VPC and the Neptune cluster. Setting up the S3 VPC endpoint can be done through the VPC management console.
  2. Additionally, your Neptune cluster requires an IAM role authorized to read data from S3.
  3. It’s essential for the S3 bucket to reside in the same region as your Neptune cluster to facilitate data loading.

So, the setting will be our Neptune cluster and the client EC2 instance sitting within the same VPC, alongside our S3 bucket in the same region. They are connected through a VPC endpoint, as in the figure below.

Loading data from S3
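If you prefer to script these prerequisites rather than click through the consoles, a rough boto3 sketch could look like the following. The VPC ID, route table ID, cluster identifier, and role ARN are hypothetical placeholders, and I am assuming the IAM role itself (with S3 read access and a trust policy for rds.amazonaws.com, which Neptune uses) already exists:

import boto3

REGION = "eu-west-1"  # must match the Neptune cluster's region
ec2 = boto3.client("ec2", region_name=REGION)
neptune = boto3.client("neptune", region_name=REGION)

# 1. Gateway VPC endpoint so the cluster's VPC can reach S3 (hypothetical IDs).
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName=f"com.amazonaws.{REGION}.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# 2. Attach an existing IAM role (with S3 read access) to the Neptune cluster
#    so the bulk loader is allowed to read from the bucket.
neptune.add_role_to_db_cluster(
    DBClusterIdentifier="my-neptune-cluster",
    RoleArn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
)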

Now, if we revisit our CSV files (previously viewed as dataframes), they will appear as follows, with the id column being mandatory. Additionally, the tilde sign (~) in the header is required for the essential column names:

nodes.csv and edges.csv
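For orientation, the Gremlin bulk-load CSV format uses tilde-prefixed system columns: ~id and ~label for nodes, and ~id, ~from, ~to, and ~label for edges. A minimal, hedged sketch of the two files (the property column, name, and edge label here are illustrative; the exact columns are in the files on GitHub) looks like this:

nodes.csv:
~id,~label,name:String
Ind1,Individual,Alice

edges.csv:
~id,~from,~to,~label
e1,Ind1,Claim1,filed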

We can first load our data into an S3 bucket, which we’ll call fraud-graph-data. It’s important to remember that it should be in the same region as the Neptune cluster! Then you can simply drag and drop the csv files.

nodes.csv and edges.csv in fraud-graph-data S3 bucket which is in the same region as the Neptune cluster

Now, back to our Neptune cluster and EC2 instance; we will use Gremlin to load the data. We need to go into the Gremlin console, load the Gremlin configuration, and then access the remote console by executing these commands one by one:

cd apache-tinkerpop-gremlin-console-3.4.1

bin/gremlin.sh

:remote connect tinkerpop.server conf/neptune-remote.yaml

:remote console

and we should see this:

Now, to load our data, we need to execute the bulk loader command below, which you can find in the Neptune user guide example here.

curl -X POST \
-H 'Content-Type: application/json' \
https://your-neptune-endpoint:port/loader -d '
{
"source" : "s3://bucket-name/object-key-name",
"format" : "format",
"iamRoleArn" : "arn:aws:iam::account-id:role/role-name",
"region" : "region",
"failOnError" : "FALSE",
"parallelism" : "MEDIUM",
"updateSingleCardinalityProperties" : "FALSE",
"queueRequest" : "TRUE"
}'

You fill this command in with your parameters (a hedged Python sketch of the same request follows the parameter list below). You need:

  • The endpoint and port can be obtained from your Neptune console.
  • The path to your data files: writing the folder name loads all files within it; to load only one file, specify the file name. I will use the bucket name.
  • The format is csv.
  • You also need the region. Navigate to your cluster and click on the region menu at the top right.
  • The IAM role can be found in the CloudFormation stack: navigate to Resources, locate the NeptuneStack, and within its Resources, look for the IAM role and copy its ARN. Here is mine:
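If you would rather do this from Python than from curl, a minimal sketch could look like the following. The endpoint, bucket, account ID, and role ARN are hypothetical placeholders, and this assumes IAM database authentication is not enabled on the cluster (otherwise the request must be SigV4-signed):

import requests

# Hypothetical values: replace with your own cluster endpoint, bucket, and IAM role ARN.
NEPTUNE = "https://my-cluster.cluster-abc123defghi.eu-west-1.neptune.amazonaws.com:8182"

payload = {
    "source": "s3://fraud-graph-data/",  # a bucket/prefix loads both nodes.csv and edges.csv
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "eu-west-1",
    "failOnError": "FALSE",
    "parallelism": "MEDIUM",
}

resp = requests.post(f"{NEPTUNE}/loader", json=payload)
print(resp.status_code, resp.json())  # a successful response includes a loadId

# Later, check the load status with the returned loadId:
# load_id = resp.json()["payload"]["loadId"]
# print(requests.get(f"{NEPTUNE}/loader/{load_id}").json())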

Once you update the command with your parameters and execute it on the EC2 instance, you should receive a 200 success code.

Getting a 200 OK status does not necessarily mean your data has loaded correctly; you should do a few more checks. As you can see, you also get a loadId, which can be used with your endpoint URL to check the load status by executing a curl command:

curl -G https://your-neptune-endpoint:port/loader/your-load-id

If everything goes well, you will see a LOAD_COMPLETED status like this:

And from there you can query your data! However, what I find more interesting and want you to see next is Neptune Analytics, which became generally available in 2023. That is in Part 2 here.

Visualizing the Graph

Let us make a quick visualization of this graph dataset and see if we can spot the relationships between clients.

Previously, we loaded our data via the EC2 instance, but sometimes you really want to see a visualization and test some hypotheses. I think having access to the database via a SageMaker notebook is an excellent addition.
From the Neptune UI, you can access your notebook; you will find many examples of how to use this service. I feel the AWS team did well on that!

For our use case, you can still access your data (you could also have loaded the data from S3 into the Neptune cluster using only the notebook), but let us connect to the existing data we just loaded.

You need the following libraries. In most tutorials the AiohttpTransport import is not required, but I find it useful for avoiding some errors.

from __future__ import print_function
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.aiohttp.transport import AiohttpTransport

Then you can establish a connection, using your cluster endpoint and port:

connection = DriverRemoteConnection(f"wss://{SERVER}:{PORT}/gremlin",'g',
transport_factory=lambda:AiohttpTransport(call_from_event_loop=True))

graph=Graph()
g = graph.traversal().withRemote(connection)

We can then run some Gremlin queries; let us see how many nodes and how many edges we have:

print("number_of_nodes",g.V().count().toList())

print("number_of_edges",g.E().count().toList())

You can test many hypotheses to check whether there are suspicious fraud rings. For example, we can query the individuals’ relationships with this query:

%%gremlin
g.V().hasLabel("Individual").out().path()

This code finds and shows the connections starting from the nodes labelled “Individual” to the nodes they are directly linked to, like mapping out their immediate relationships. So, it gives a hint of whether there are strong relationships between individuals. If we run it, we will see these clustered relationships.

As you can see, there is a cluster that is somewhat interesting and indicates a high intensity of relationships. We can zoom in on it below.

So it seems there are three individuals with ids (Ind1, Ind2, Ind3) working at the same company, with id comp1. That in itself is not suspicious: it could simply be that their employer has an offer with the insurance company, so it is very likely they share the same insurer. We can investigate further by running this query:

%%gremlin
g.V().hasLabel("Individual").out().out().path()

This will give us a more interesting result. In the following, you can see that Ind1 and Ind2 used the same repair shop, and they also had their accidents at the same place. This is still plausible, but it warrants a little more investigation if you have records about this repair shop’s reputation, etc.
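To go one step further, a hedged sketch like the one below (written against the gremlin_python connection we set up earlier; the traversal follows the Individual → Claim → RepairShop/IncidentPlace structure of our edges, so adjust it if your own files differ) pulls out pairs of distinct individuals whose claims lead to the same repair shop or incident place:

from gremlin_python.process.traversal import P, T

# Find pairs of distinct individuals whose claims reach the same node two hops away
# (a shared repair shop or incident place).
pairs = (
    g.V().hasLabel("Individual").as_("p1")
        .out()                # Individual -> Claim
        .out().as_("shared")  # Claim -> RepairShop / IncidentPlace
        .in_().in_()          # back out to other individuals
        .hasLabel("Individual")
        .where(P.neq("p1")).as_("p2")
        .select("p1", "shared", "p2").by(T.id)
        .dedup()
        .toList()
)
print(pairs)  # each result maps p1, shared, and p2 to element ids

On a real dataset you would typically also filter by the shared node’s label and count how many nodes each pair shares, but even this simple traversal should be enough to surface overlaps like the one between Ind1 and Ind2 above.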

Conclusion

In conclusion, this post on AWS Neptune for fraud detection has provided a glimpse into the potential of graph databases.

Through a crafted (well, made-up) example, while not entirely reflective of real-world complexities, I hope to have established an understanding of their utility.

It’s clear that what we’ve explored is just the tip of the iceberg. This simple dataset alone showcases the power of graph databases, yet the true value lies in the integration with graph-based neural networks.

While this post merely scratches the surface, aiming to inspire curiosity and exploration, the possibilities with graph databases are vast and varied.
