4.11. Retrieving a Random Sample of Records

Problem

You want to retrieve a result set containing a random sample of records from a result set using SQL Server 2005 or later.

Solution

Use either the TABLESAMPLE T-SQL clause or a custom method. The solution shows how to use both techniques to return a random sample of 10 percent of records from the Person.Contact table in AdventureWorks. In either case, the top 10 sampled rows are returned.

The C# code in Program.cs in the project RetrieveRandomSampleRecords is shown in Example 4-12.

Example 4-12. File: Program.cs for RetrieveRandomSampleRecords solution

using System;
using System.Data;
using System.Data.SqlClient;

namespace RetrieveRandomSampleRecords
{
    class Program
    {
        static void Main(string[] args)
        {
            string sqlConnectString = "Data Source=(local);" +
                "Integrated security=SSPI;Initial Catalog=AdventureWorks;";

            string sqlSelect;
            DataTable dt;
            SqlDataAdapter da;

            // Get the count of records in Person.Contact.
            sqlSelect = "SELECT COUNT(*) FROM Person.Contact";
            int recCount;
            using (SqlConnection connection = new SqlConnection(sqlConnectString))
            {
                SqlCommand command = new SqlCommand(sqlSelect, connection);
                connection.Open();
                recCount = (int)command.ExecuteScalar();
            }
            Console.WriteLine("Person.Contact record count = {0}\n", recCount);

            // Fill the DataTable with Person.Contact data
            sqlSelect = "SELECT * FROM Person.Contact TABLESAMPLE (10 PERCENT)";
            dt = new DataTable();
            da = new SqlDataAdapter(sqlSelect, sqlConnectString);
            da.Fill(dt);

            // Output the top 10 rows using TABLESAMPLE
            Console.WriteLine("---{0} rows retrieved using TABLESAMPLE. " +
                "Displaying top 10 rows.---", dt.Rows.Count);
            for (int i = 0; i < 10; i++)
                Console.WriteLine("ID = {0}\t{1}, {2}",
                    dt.Rows[i]["ContactID"], dt.Rows[i]["LastName"],
                    dt.Rows[i]["FirstName"]);

            // Fill the DataTable with Person.Contact data
            sqlSelect = "SELECT * FROM Person.Contact WHERE " +
                "0.1 >= CAST(CHECKSUM(NEWID(), ContactID) & 0x7fffffff AS float) " +
                "/ CAST(0x7fffffff AS int)";
            dt = new DataTable();
            da = new SqlDataAdapter(sqlSelect, sqlConnectString);
            da.Fill(dt);

            // Output top 10 rows using custom method
            Console.WriteLine("\n---{0} rows retrieved using custom method. " +
                "Displaying top 10 rows.---", dt.Rows.Count);
            for (int i = 0; i < 10; i++)
                Console.WriteLine("ID = {0}\t{1}, {2}",
                    dt.Rows[i]["ContactID"], dt.Rows[i]["LastName"],
                    dt.Rows[i]["FirstName"]);

            Console.WriteLine("\nPress any key to continue.");
            Console.ReadKey();
        }
    }
}

The output is shown in Figure 4-12.

Discussion

The TABLESAMPLE clause, introduced in SQL Server 2005, quickly returns a random, representative sample of the table expressed as either an approximate number of rows or a percentage of the total rows. Unlike the TOP clause, TABLESAMPLE returns a result set containing a sampling of rows from all rows processed by the query. As you can see from the results in Figure 4-12, the TABLESAMPLE clause does not return a truly random sample at the individual row level.

Output for RetrieveRandomSampleRecords solution

Figure 4-12. Output for RetrieveRandomSampleRecords solution

The TABLESAMPLE clause syntax is:

	TABLESAMPLE [SYSTEM] (sample_number [PERCENT | ROWS])
	  [REPEATABLE (repeat_seed)]

Where:

SYSTEM

An ANSI SQL keyword that specifies a database server-dependent sampling method. Although other databases support additional sampling methods that are database server-independent (e.g., DB2 supports BERNOULLI), SYSTEM is the only method supported by SQL Server and the default value if not specified.

sample_number [PERCENT | ROWS])

A numeric expression that specifies the number of rows to return or the percentage of rows in the result set to return.

REPEATABLE (repeat_seed)

The seed used to select rows to be returned in the sample. REPEATABLE indicates that the selected sample can be returned more than once. If the same seed is used, the same rows will be returned each time the query is run as long as no changes have been made to the data in the table.

The TABLESAMPLE clause cannot be used with views or in an inline table-valued function.

The custom technique calculates a CHECKSUM based on a NEWID() and the ContactID value for each row and scales it to a value between 0 and 1. This value is compared to 0.1 to select the top 10 percent of rows.

Get ADO.NET 3.5 Cookbook, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.