The joy of sets

Stored procedures are just too procedural

I've recently seen impressive performance gains in a data-centric process, and the underlying problem is generic enough to be of general interest.

Imagine a system which consolidates the trades done in 10 different branches of a supermarket chain. We receive these trades on a batch basis: a file per branch every night. This file contains de-normalised rows of data containing information about a customer (e.g. identified through their loyalty card number), a product, and how much was purchased. We need to convert this data into a normalised schema where product and customer have their own set of relational tables. Our database doesn't store an exhaustive set of every product sold by the supermarket chain, so we only bother to store information about products when they appear in the input files. Let's also assume we're doing this by hand through stored procedures -- no fancy ETL tools here.

The iterative programmer's style of dealing with this problem is likely to be to loop through the input file, performing conditional checks on each row (a PL/SQL-style sketch follows the list below), e.g.:

For each row in the input file:

  • if the product doesn't exist in our database, insert the product
  • if the customer doesn't exist, insert the customer
  • finally, insert the price and quantity which was purchased
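
A minimal PL/SQL sketch of that row-by-row approach might look like the following. The table and column names (branch_feed, products, customers, trades and so on) are assumptions for illustration, not the actual schema:

DECLARE
  CURSOR c_feed IS
    SELECT * FROM branch_feed;   -- hypothetical staging table holding one branch's file
  v_count NUMBER;
BEGIN
  FOR r IN c_feed LOOP
    -- if the product doesn't exist in our database, insert the product
    SELECT COUNT(*) INTO v_count FROM products WHERE product_id = r.product_id;
    IF v_count = 0 THEN
      INSERT INTO products (product_id, description)
      VALUES (r.product_id, r.description);
    END IF;

    -- if the customer doesn't exist, insert the customer
    SELECT COUNT(*) INTO v_count FROM customers WHERE loyalty_card_no = r.loyalty_card_no;
    IF v_count = 0 THEN
      INSERT INTO customers (loyalty_card_no, customer_name)
      VALUES (r.loyalty_card_no, r.customer_name);
    END IF;

    -- finally, insert the price and quantity which was purchased
    INSERT INTO trades (product_id, loyalty_card_no, price, quantity)
    VALUES (r.product_id, r.loyalty_card_no, r.price, r.quantity);
  END LOOP;
  COMMIT;
END;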

We apply the above algorithm to each file we receive from each supermarket branch we are importing data for, and, because major relational databases pride themselves on their ability to deal with concurrent access, we have a thread (or process) per branch, where each thread imports all its trades in its own transaction. We kick them all off at once, hence finishing in approximately as much time as it takes to import the branch with the most data. Right?

Wrong. When certain conditions arise, this process serialises so it takes 10 times longer than expected, as if we'd done each branch one at a time. (In fact, with naive error handling, this approach may never work due to race conditions).

If one product appears for the first time in 2 different branches, let's say the West London branch and the South Manchester branch, then whichever branch encounters that product first in the feed file will insert it. When the other branch then tries to insert this product, it needs to wait for the first branch's transaction to commit or rollback before knowing if it can create the same row (in our database we naturally have primary key constraints to ensure each product is represented by 1 unique row). So the second branch waits for the whole of the first branch's transaction to commit!
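
To make the serialisation concrete, here is the interleaving across two concurrent sessions. The product id 4711 and the column names are made up for the example:

-- Session A (West London branch), inside its own long-running transaction:
INSERT INTO products (product_id, description)
VALUES (4711, 'New product');          -- succeeds, but is not yet committed
-- ... Session A carries on importing the rest of its file ...

-- Session B (South Manchester branch) reaches the same product:
INSERT INTO products (product_id, description)
VALUES (4711, 'New product');
-- blocks here: the database cannot decide whether this row violates the
-- primary key until Session A either commits (constraint violation) or
-- rolls back (the insert succeeds)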

When you're writing PL/SQL and you're thinking at the level of cursors and if-then-else statements, the problems outlined above may not seem obvious. If you're purely a Java/C#/C++ programmer you may start thinking along the lines of "well, perhaps we should reduce the size of our transaction", or "perhaps we can ignore this row and come back to it". These are also illustrations of thinking about the problem in the wrong way.

The correct approach is not to consider the input data as a succession of rows, but rather as a set of sets: identify the set of products you are importing, the set of customers, and the set of prices and quantities. Logic can then be applied to different sets: identify the union of new customers, and the union of all new products, and insert these before inserting the price and volume information in a concurrent manner.
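
A sketch of the set-based version, again using assumed table names: the new products and customers are derived and inserted as sets up front, after which each branch's price and quantity rows can be loaded concurrently with no contention on the reference tables.

-- 1. Insert the set of genuinely new products across all branch feeds
INSERT INTO products (product_id, description)
SELECT f.product_id, MIN(f.description)   -- one row per product, even if feeds disagree on wording
FROM   branch_feed f
WHERE  NOT EXISTS (SELECT 1 FROM products p WHERE p.product_id = f.product_id)
GROUP BY f.product_id;

-- 2. Likewise for the set of new customers
INSERT INTO customers (loyalty_card_no, customer_name)
SELECT f.loyalty_card_no, MIN(f.customer_name)
FROM   branch_feed f
WHERE  NOT EXISTS (SELECT 1 FROM customers c WHERE c.loyalty_card_no = f.loyalty_card_no)
GROUP BY f.loyalty_card_no;

COMMIT;

-- 3. Only now load the price and quantity rows, one transaction per branch,
--    in parallel: no session needs to create shared reference rows any more
INSERT INTO trades (branch_id, product_id, loyalty_card_no, price, quantity)
SELECT branch_id, product_id, loyalty_card_no, price, quantity
FROM   branch_feed
WHERE  branch_id = :branch_id;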

The overhead of the pre-processing required is quickly recouped by the excellent performance achievable once the problem has been reduced in such a way that the lion's share of the work is embarrassingly parallel.

The original problem here arose, once again, from a paradigm clash: procedural programming, such as the constructs at your fingertips in languages like PL/SQL, encourages you to think of solutions to problems which are probably better solved using relational algebra, which ANSI SQL captures pretty well. Of course, PL/SQL, T-SQL and the like will always have their place. But if you consider yourself an iterative or object-oriented programmer, and you find yourself writing stored procedures, the most elegant solutions will often be achieved if you approach the problem thinking in sets, not loops.

About the author

Kenneth Roper is a development team leader at a tier-1 investment bank. He is interested in applications with low-latency requirements or large memory footprints. He spends a lot of time reading garbage collection logs and snow reports.

E-mail : kenneth.roper at codingthearchitecture.com


Re: The joy of sets

This is a very common problem. Most people in the programming world don't think in terms of sets because they deal with data in a row-by-row fashion within their code/application. It's always a shame when you see stored procs loaded with cursors because someone doesn't have a basic grasp of the power of set theory. As you have discovered, the benefits in terms of performance are fantastic.

Re: The joy of sets

During a discussion with the lead developer for a client's batch systems I discovered that he was thinking more along the lines you've just described, only in-database rather than in-memory. He would question why you had a primary key on the products table at all - why not simply remove the duplicates once the batch had finished? You would naturally suspend indexing and constraint checking during batch operations so why is your PK exempt?

The compromise was to do parallel batch loads into one set of unconstrained tables and apply the change set to your "relational" tables once the contention is under control.
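
A rough sketch of that compromise (trades_staging and ext_branch_feed are assumed names): each branch bulk-loads into an unconstrained staging table in parallel, then a single serial step applies the de-duplicated change set to the constrained tables.

-- In parallel, one session per branch: load into a table with no constraints or indexes
-- (ext_branch_feed here stands for an external table over that branch's file)
INSERT /*+ APPEND */ INTO trades_staging
SELECT * FROM ext_branch_feed;
COMMIT;

-- Once every load has committed, apply the change set to the constrained table serially
MERGE INTO products p
USING (SELECT product_id, MIN(description) AS description
       FROM   trades_staging
       GROUP  BY product_id) s
ON (p.product_id = s.product_id)
WHEN NOT MATCHED THEN
  INSERT (product_id, description)
  VALUES (s.product_id, s.description);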

His argument went a step further, though: since there was no other mechanism for creating products or relationships to those products why ever introduce constraints in the first place? By way of an example he demonstrated how one transaction (pretty much exactly of the type you describe) resulted in the same constraint being checked 6 (!) times for the sake of referential integrity that was already implicit from the source data.

Re: The joy of sets

If you have a system where your data integrity is implicit in the source data, and you trust your source data implicitly, and if your business use case allows you the liberty of having periods when the data integrity in your database is breached (e.g. when you disable constraints during batch imports) on the proviso that you re-apply the constraints "by hand", then you are duplicating functionality provided in the database product itself. If you can apply this argument to the normal use case of the system -- i.e. you don't need any constraints at all -- then you can argue that a relational database is not suited to your purposes: there are products which will import data faster and more easily. I guess the design decision to have the bottom layer in any system diagram be an RDBMS is often made with the standardised API in mind, and the flexibility this decision will give you. But if you are not using the features of a relational database, then, dare I suggest, another form of data store may give you benefits in both performance and programming effort.

Re: The joy of sets

Of course you're right! As you point out, there are a lot of "ifs" between where most schemata are and throwing the RDBMS away.

Clever coding is sometimes (frequently?) used to get around problems with the schema, not problems with the data. My concern is that the code is changed by virtue of its ease of modification rather than it being the right design choice; by doing that you may be implementing features of the database (or its costly extensions!) that you're not even aware of. Hence my deference to an experienced batch developer.

It's worth noting that the situation you describe will lead to duplicate rows, not missing rows - the consequences of the former are often far more tolerable, especially with static data like SKU-level items.

Re: The joy of sets

Nice article - it raises some common issues. I recently removed a large body of badly written Java code by replacing it (and a couple of other SQL queries) with a single query that had a... UNION (gasp).

It's only when you've interviewed many candidates that have "SQL" on their CV that you realise how poor most people's knowledge is. Most cannot write an inner join, let alone an outer join or subquery.

SQL is not only performant, when used correctly, but very expressive. A short and simple query can replace a large amount of code.
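
For instance, finding products that have never been purchased - something that in application code typically means loading two collections and looping over one inside the other - is a single outer join. The table names here are illustrative, borrowed from the article's scenario:

SELECT p.product_id, p.description
FROM   products p
LEFT OUTER JOIN trades t ON t.product_id = p.product_id
WHERE  t.product_id IS NULL;      -- products with no matching trade rows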

Re: The joy of sets

I've seen this kind of problem more times than I can count. I once worked on a batch process where the code exported data, pivoted it and then reimported it back into the DB, clearly because the original coder didn't understand SQL/set processing properly. I changed the code to use SQL and reduced the batch running time from 12 hours to 6 minutes. The standard ETL process (whether using a fancy tool or not) works very well:

  1. Get the data into a table with no constraints/checks.
  2. Normalise and process the data, look up foreign keys, perform validations etc. into loading tables (not the production tables).
  3. Once all the validation has been done, move the data into the main production tables.

I'd highly recommend always following a process like this for importing 'untrusted data'.
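
A sketch of steps 2 and 3 under those assumptions (raw_import, trades_loading and trades are hypothetical table names): resolve the foreign keys and validate into a loading table, then move only the clean rows into production.

-- Step 2: normalise into a loading table, looking up foreign keys and validating
INSERT INTO trades_loading (product_id, customer_id, price, quantity)
SELECT p.product_id, c.customer_id, r.price, r.quantity
FROM   raw_import r
JOIN   products  p ON p.product_code    = r.product_code
JOIN   customers c ON c.loyalty_card_no = r.loyalty_card_no
WHERE  r.price    >= 0
AND    r.quantity >  0;   -- rows failing the joins or checks stay behind for investigation

-- Step 3: once validation is complete, move the data into the production table
INSERT INTO trades (product_id, customer_id, price, quantity)
SELECT product_id, customer_id, price, quantity
FROM   trades_loading;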

Re: The joy of sets

This is a similar problem to the one addressed by the distributed processing system Google developed based on the map/reduce concept (http://en.wikipedia.org/wiki/MapReduce). The basic idea has been around for years, but seems to have found its feet again recently as the web has become richer and the need for processing large, set-based data has been realised again. I totally recommend giving Hadoop - an open source Java implementation of the MapReduce concept - a play... a useful tool in any architect's toolbox :)
