I've recently seen impressive performance gains in a data-centric process, and the idea behind them is generic enough to be of general interest.
Imagine a system which consolidates the trades done in 10 different branches of a supermarket chain. We receive these trades on a batch basis: a file per branch every night. Each file contains de-normalised rows of data, each row holding information about a customer (e.g. identified through their loyalty card number), a product, and how much was purchased. We need to convert this data into a normalised schema in which product and customer each have their own set of relational tables. Our database doesn't store an exhaustive set of every product sold by the supermarket chain, so we only bother to store information about products when they appear in the input files. Let's also assume we're doing this by hand through stored procedures -- no fancy ETL tools here.
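Concretely, the shape of the data might be something like the following sketch. All table and column names here are invented for illustration, and the later examples reuse them.

    -- One de-normalised staging row per purchase, as received from a branch.
    CREATE TABLE staging_trades (
        branch_id           INTEGER,
        card_number         VARCHAR(20),    -- loyalty card, identifies the customer
        customer_name       VARCHAR(100),
        product_id          INTEGER,
        product_description VARCHAR(200),
        price               DECIMAL(10,2),
        quantity            INTEGER
    );

    -- The normalised targets: one row per distinct customer and product.
    CREATE TABLE customers (
        card_number VARCHAR(20) PRIMARY KEY,
        name        VARCHAR(100)
    );

    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        description VARCHAR(200)
    );

    CREATE TABLE trades (
        branch_id   INTEGER,
        card_number VARCHAR(20) REFERENCES customers (card_number),
        product_id  INTEGER REFERENCES products (product_id),
        price       DECIMAL(10,2),
        quantity    INTEGER
    );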
The iterative programmer's style of dealing with this problem is likely to be looping through the input file, with conditional checks on each row, e.g.:
For each row in the input file:
- if the product does not already exist in the product tables, insert it;
- if the customer does not already exist in the customer tables, insert them;
- insert the trade itself (customer, product, price, quantity).
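In PL/SQL this might look roughly like the block below. It's a minimal sketch, assuming the file has already been loaded into the staging_trades table invented above; nothing here is the real system's code.

    DECLARE
        l_count INTEGER;
    BEGIN
        -- Row-by-row import for a single branch (42 stands in for this
        -- thread's branch); each branch runs this in its own transaction.
        FOR rec IN (SELECT * FROM staging_trades WHERE branch_id = 42) LOOP
            -- Insert the product if we haven't seen it before.
            SELECT COUNT(*) INTO l_count
            FROM products WHERE product_id = rec.product_id;
            IF l_count = 0 THEN
                INSERT INTO products (product_id, description)
                VALUES (rec.product_id, rec.product_description);
            END IF;

            -- Likewise for the customer, keyed on loyalty card number.
            SELECT COUNT(*) INTO l_count
            FROM customers WHERE card_number = rec.card_number;
            IF l_count = 0 THEN
                INSERT INTO customers (card_number, name)
                VALUES (rec.card_number, rec.customer_name);
            END IF;

            -- Finally, record the trade itself.
            INSERT INTO trades (branch_id, card_number, product_id, price, quantity)
            VALUES (rec.branch_id, rec.card_number, rec.product_id,
                    rec.price, rec.quantity);
        END LOOP;

        COMMIT;
    END;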
We apply the above algorithm to the file from each branch, and, because major relational databases pride themselves on their ability to deal with concurrent access, we have a thread (or process) per branch, each importing all of its trades in its own transaction. We kick them all off at once, hence finishing in approximately as much time as it takes to import the branch with the most data. Right?
Wrong. When the branches' data overlaps, this process serialises, so it takes 10 times longer than expected, as if we'd done each branch one at a time. (In fact, with naive error handling, this approach may never work at all: two threads can both conclude a product is new, and the second insert then fails on the primary key -- a classic check-then-act race condition.)
If one product appears for the first time in 2 different branches -- let's say the West London branch and the South Manchester branch -- then whichever branch encounters that product first in its feed file will insert it. When the other branch then tries to insert this product, it must wait for the first branch's transaction to commit or roll back before it can know whether it may create the same row (our database naturally has primary key constraints to ensure each product is represented by a single unique row). So the second branch waits for the whole of the first branch's transaction to commit!
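On a timeline, using the illustrative tables from above, the contention looks something like this:

    -- Session A (West London import), inside its own long-running transaction:
    INSERT INTO products (product_id, description)
    VALUES (1234, 'New breakfast cereal');   -- succeeds immediately
    -- ...A carries on importing thousands more rows; transaction still open...

    -- Session B (South Manchester import), running concurrently:
    INSERT INTO products (product_id, description)
    VALUES (1234, 'New breakfast cereal');
    -- B blocks here: the primary key on product_id cannot be checked until
    -- A commits (B then gets a duplicate-key error) or rolls back (B's
    -- insert proceeds). Either way, B is stalled for the rest of A's work.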
When you're writing PL/SQL and thinking at the level of cursors and if-then-else statements, the problems outlined above may not be obvious. If you're purely a Java/C#/C++ programmer, you may start thinking along the lines of "well, perhaps we should reduce the size of our transaction", or "perhaps we can ignore this row and come back to it". These, too, are illustrations of thinking about the problem in the wrong way.
The correct approach is not to consider the input data as a succession of rows, but rather as a set of sets: identify the set of products you are importing, the set of customers, and the set of prices and quantities. Logic can then be applied to each set: take the union of new products and the union of new customers across all branches, insert those first, and only then insert the price and quantity information in a concurrent manner.
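In ANSI SQL, that pre-processing might look something like the sketch below (again reusing the invented staging_trades, products and customers tables, and assuming each product_id carries a consistent description):

    -- Run once, before the parallel phase: insert every product that appears
    -- in any branch's file but is not yet known, in one set-based statement.
    INSERT INTO products (product_id, description)
    SELECT DISTINCT s.product_id, s.product_description
    FROM staging_trades s
    WHERE NOT EXISTS (
        SELECT 1 FROM products p WHERE p.product_id = s.product_id
    );

    -- The same pattern handles new customers.
    INSERT INTO customers (card_number, name)
    SELECT DISTINCT s.card_number, s.customer_name
    FROM staging_trades s
    WHERE NOT EXISTS (
        SELECT 1 FROM customers c WHERE c.card_number = s.card_number
    );

Once these two statements have committed, no two branch threads have any inserts in common, so the per-branch trade inserts can run fully in parallel without blocking one another.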
The overhead of the pre-processing required is quickly recouped: once the problem has been reduced so that the lion's share of the work is embarrassingly parallel, excellent performance is achievable.
The original problem here arose, again, from a paradigm clash: procedural programming, such as the constructs at your fingertips in languages like PL/SQL, encourages you to think of solutions to problems which are probably better solved using relational algebra, which ANSI SQL captures pretty well. Of course, PL/SQL, T-SQL and the like will always have their place. But if you consider yourself an iterative or object-oriented programmer and you find yourself writing stored procedures, the most elegant solutions will often be found by approaching the problem thinking in sets, not loops.