Random Records in T-SQL
Posted on Jan 30, 2006
Recently, I needed to write a simple query that would randomly grab a single product record from the database. In the past, I have used QueryRandomRows() to do this, but this can be very inefficient if you want to pull a random record from a large data set. I happened upon a solution for MS SQL that is extremely simple (this is an oversimplified version of my query):
<cfquery name="qryProducts" datasource="#application.dsn#"> SELECT TOP 1 id, productName FROM products ORDER BY NEWID() </cfquery>
Obviously, you can alter how many rows you return by changing the TOP 1. Also note that using the RAND() function will not work even though it may seem more logical. This is because "repetitive calls of RAND() with the same seed value return the same results." This means that within a single query, RAND() will return the same result for every row.
I am sure I have seen this blogged elsewhere, but I cannot remember where exactly (so forgive me if this is not new).
Comments
Keep in mind that you're having to generate a new ID for every record in the table, every query, which also has performance implications on large tables.
If you table doesn't add/delete rows too frequently, a better solution is to maintain a sequence id (a key that is always sequential), and then do "...where sequenceID >= #randRange(1, maxSequenceId)#...". The determining factor is whether maintaining the sequence ids is more or less expensive than assigning all those IDs all the time. If you never delete, then you can just use the primary key (assuming it's a sequential integer).
Posted By Barney / Posted on 01/30/2006 at 1:04 PM
Thanks Barney, this is a great point. That is the solution recommended by MSDN I believe. It isn't quite as simple when you are retrofitting a solution on an old db table though. In this case, it would require adding and populating a new column, and then ensuring a solution is in place for deletes (i.e. updating the sequence numbers). In my case, this solution ended up being quite fast and certainly faster than returning the entire recordset and using queryRandomRows().
Posted By Brian Rinaldi / Posted on 01/30/2006 at 1:34 PM
No, I totally understand. Fast enough is always fast enough. You don't actually have to update on delete, if you don't want, unless you MUST have perfect randomization. The sequence will handle holes at the cost of a little bit of randomness. That is, if you delete a number, the next larger number will have twice the odds of selection. Then you can rebuild the sequence periodically, returning it to perfection. Updating the sequence is expesive (a data and key update to an average of n/2 rows), so it's far better to do in batches.
You can also intentionally build spaces into the sequence if you need weighted randomization. Say you've got 5 banner ads that you want randomly displayed, but you want the first 3 to be displayed twice as often. So instead of 1,2,3,4,5 (+1,+1,+1,+1,+1) as your sequence, use 2,4,6,7,8 (+2,+2,+2,+1,+1).
A very powerful technique, but as you say, not warranted in all situations.
Posted By Barney / Posted on 01/30/2006 at 1:56 PM
This would be great addition to Ray's Coldfusion Cookbook (assuming you haven't posted it already). Thanks for the tip.
Michael
http://www.michaelhaynie.com
http://www.coldfusionpodcast.com
Posted By Michael Haynie / Posted on 02/03/2006 at 3:22 PM