The joy of sets
Stored procedures are just too procedural
I've recently seen impressive performance gains in a data-centric process, which is a generic enough concept to be of general interest.
Imagine a system which consolidates the trades done in 10 different branches of a supermarket chain. We receive these trades on a batch basis: a file per branch every night. This file contains de-normalised rows of data containing information about a customer (e.g. identified through their loyalty card number), a product, and how much was purchased. We need to convert this data into a normalised schema where product and customer have their own set of relational tables. Our database doesn't store an exhaustive set of every product sold by the supermarket chain, so we only bother to store information about products when they appear the input files. Let's also assume we're doing this by hand through stored procedures -- no fancy ETL tools here.
The iterative-programmer's style of dealing with this problem is likely to be looping through the input file, and having conditional checks on each row, e.g.:
For each row in the input file:
- if the product doesn't exist in our database, insert the product
- if the customer doesn't exist, insert the customer
- finally, insert the price and quantity which was purchased
We apply the above algorithm to each file we receive from each supermarket branch we are importing data for, and, because major relational databases pride themselves on their ability to deal with concurrent access, we have a thread (or process) per branch, where each thread imports all its trades in its own transaction. We kick them off all at once, hence finishing in approximately as much time as it takes to import the branch with the most data. Right?
Wrong. When certain conditions arise, this process serialises so it takes 10 times longer than expected, as if we'd done each branch one at a time. (In fact, with naive error handling, this approach may never work due to race conditions).
If one product appears for the first time in 2 different branches, let's say the West London branch and the South Manchester branch, then whichever branch encounters that product first in the feed file will insert it. When the other branch then tries to insert this product, it needs to wait for the first branch's transaction to commit or rollback before knowing if it can create the same row (in our database we naturally have primary key constraints to ensure each product is represented by 1 unique row). So the second branch waits for the whole of the first branch's transaction to commit!
When you're writing PL/SQL and you're thinking at the level of cursors and if-then-else statements, the problems outlined above may not seem obvious. If you're purely a Java/C#/C++ programmer you may start thinking along the lines of "well, perhaps we should reduce the size of our transaction", or "perhaps we can ignore this row and come back to it". These are also illustrations of thinking about the problem in the wrong way.
The correct approach is not to consider the input data as a succession of rows, but rather as a set of sets: identify the set of products you are importing, the set of customers, and the set of prices and quantities. Logic can then be applied to different sets: identify the union of new customers, and the union of all new products, and insert these before inserting the price and volume information in a concurrent manner.
The overhead of the pre-processing required is quickly recouped by the excellent performance achievable once the problem has been reduced in such a way that the lion's share of the work is embarrassingly parallel.
The original problem here arose again due to a paradigm clash: procedural programming, such as the constructs at your fingertips in languages like PL/SQL, encourage you to think of solutions to problems which are probably better solved using relational algebra, which ANSI SQL captures pretty well. Of course, PL/SQL, T-SQL and the like will always have their place. But if you consider yourself an iterative or object oriented programmer, and you find yourself writing stored procedures, the most elegant solutions will often be achieved if you approach the problem thinking in sets, not loops.
Re: The joy of sets
Re: The joy of sets
During a discussion with the lead developer for a client's batch systems I discovered that he was thinking more along the lines you've just described, only in-database rather than in-memory. He would question why you had a primary key on the products table at all - why not simply remove the duplicates once the batch had finished? You would naturally suspend indexing and constraint checking during batch operations so why is your PK exempt?
The compromise was to do parallel batch loads into one set of unconstrained tables and apply the change set to your "relational" tables once the contention is under control.
His argument went a step further, though: since there was no other mechanism for creating products or relationships to those products why ever introduce constraints in the first place? By way of an example he demonstrated how one transaction (pretty much exactly of the type you describe) resulted in the same constraint being checked 6 (!) times for the sake of referential integrity that was already implicit from the source data.
Re: The joy of sets
Re: The joy of sets
Of course you're right! As you point out, there're a lot of "ifs" between where most schemata are and throwing the RDBMS away.
Clever coding is sometimes (frequently?) used to get around problems with the schema, not problems with the data. My concern is that the code is changed by virtue of its ease of modification rather than it being the right design choice; by doing that you may be implementing features of the database (or its costly extensions!) that you're not even aware of. Hence my deference to an experienced batch developer.
It's worth noting that the situation you describe will lead to duplicate rows, not missing rows - the consequences of the former are often far more tolerable, especially with static data like SKU-level items.
Re: The joy of sets
It's only when you've interviewed many candidates that have "SQL" on their CV that you realise how poor most people's knowledge is. Most cannot write an inner join, let alone an outer join or subquery.
SQL is not only performant, when used correctly, but very expressive. A short and simple query can replace a large amount of code.
Kenneth Roper is a technical architect at 
