Recently, I ran into a bug causing corrupt data in an application that our team was maintaining. The application was running stable for several years. What happened? In this blog, I will show what I did to find the cause of the bug, and how to fix it.

Background

A while ago, working for a big retailer, my team took an application into maintenance. This application is part of the logistical landscape and stores records for all the shipments the warehouse sends. This data is stored to be able to serve it (through a REST endpoint) to customers, who can get a status overview of all shipments sent to them.

The application gets its data from a PubSub topic where the warehouse’s shipments are published. Usually, this happens in peaks because the warehouse sends the shipments in batches.

Enter jOOQ

The application is a Spring Boot application, using a PostgreSQL database with jOOQ as ORM layer, running on Java 8 with a mixture of Java and Kotlin classes.

jOOQ is a database-mapping library that uses a ‘database first’ approach. This means that you can generate Java code from your database and use this to write SQL queries in a typesafe and fluent way.

The shipments consist of the shipment itself with one or more lines (stored in a SHIPMENT and SHIPMENT_LINES table). The shipment and its lines are persisted in a transaction.

So, as mentioned, the application ran fine for several years. But, one day, processing a batch of shipments, there was a small database hiccup, rendering our database inaccessible for a few minutes. Checking the logs and data afterwards we noticed that there were some corrupt shipments: shipments without shipments lines.

In the data of the years before, there were no corrupt shipments to be found. So what happened?

In the following paragraph, I will show the core part of the setup that caused the bug. Then I will show how to test and fix it using an example application.

For demonstration purposes I created an example application with the same basic setup as our ‘subject of research’.

Analysis

To determine how the corrupt shipments originated you need to know that for some shipments the database hiccup started after persisting the shipment, but before persisting the shipment line(s) for the shipment. As persisting is done in a transaction our first action was to try to prove if the transaction was not being applied correctly. It’s hard to test hiccups predictably, so we had to find a way to simulate this.

To see what happened, and how to fix it, I added a small repository with the same setup here.

Let’s dive right in

Spring’s DataSourceAutoConfiguration  is used to configure our datasource. This datasource is wrapped in a jOOQ DataSourceConnectionProvider to create the DSLContext (I’ll get back to that later) used by jOOQ.

The setup for the data source of the application is as follows:

Saving a shipment by the repository is done by first storing the shipment and using the returned id (generated by a database sequence) to insert the shipment lines with a reference to the shipment. This runs in a transaction:

According to my theory, there was a problem in the transaction, so let’s try to prove this. This is a bigger challenge, as we need to create a predictable way of allowing the insert into the SHIPMENT_LINES table fail with a database error. In an ideal world all database constraints would be checked already in the application, so it’s hard to write a test for it.

Let’s create a simple application and use the following database tables:

The check constraint for the quantity is not implemented in the code; I will use this to create a test to prove that the inserts are not done within a transaction.

The corrupt shipment will be as follows (the shipment is for an order for a PlayStation 5 with quantity 0 as that’s the only way the order is accepted these days 🙂 ).

Because the standard query cannot be used, since this uses an inner join on shipment lines, I will create a separate query to get the shipments (without joining shipment lines). I created a separate TestShipmentRepository  within our test code, so it does not cause any confusion in our production code:

Now the test. When storing the corruptShipment  we want to check that a database exception is thrown, due to the database constraint. Both results for the queries for the shipments with lines (as used by the regular repository) as well as the test query for the shipments will have to be empty when fetching these from the database. This gives the following test:

The last assertion statement fails, so apparently the shipment is persisted while the shipment line is not. That proves that the transaction is incorrectly configured!

Diving into this, I noticed that jOOQ did not discover the transaction which resulted in this behaviour. Some configuration on this helped us in getting to the following ConnectionProvider  configuration:

So wrapping our datasource in a TransactionAwareDataSourceProxy should help us in linking our jOOQ configuration with our transaction.

Re-running the tests indeed proves this. The shipment is rolled back, the test runs correctly and everything is fine again.

Almost there

Earlier, I promised to get back to the DSLContext configuration above. According to the documentation Spring’s Autoconfiguration should autoconfigure a proper DSLContext. Checking Spring’s JooqAutoConfiguration  we indeed see that the TransactionAwareDataSourceProxy  is used to wrap the DataSourceConnectionProvider .

So it looks like we have been over configuring our application from the start! Doing less is probably what makes stuff better here…

I will write another test for it. The DSLContext should have been autoconfigured by Spring all along. So let’s omit all configuration and just go with Spring’s magic. Spring should try to automatically configure jOOQ based on the added dependency for jOOQ that we configured.

And… it works!

Added ‘bonus’: the exception being thrown is not an exception from jOOQ, but from Spring meaning our code can be a little more generic.

The code in the repository contains three tests to show all cases using profiles as well as an endpoint to play with inserting some shipments.

Aftermath

Quite a ride there! For me this was typical proof that systems can run perfectly fine for several years, while basic assumptions on the code setup are wrong. In a high load environment this can and most likely will bite you eventually.

Trying to understand what went wrong in order to fix it was the hardest, and most time-consuming part. The solution itself serves as a fine example that adding very little code (or even better: just removing some configuration) often makes a huge difference.

So, lessons learned: never underestimate the power of autoconfiguration. Make sure to understand what is going on. And never take @Transactional  (or any explicit code for adding transactions) for granted, without properly checking your application’s configuration.