13/06/17

Common Table Expressions vs. Derived Tables

By Kathi Kellenberger

I teach a lot of people to write T-SQL. Sometimes these are beginners trying to get their first job in tech, and sometimes they are developers who just need to improve their skills. I am frequently asked this question:

“What is the best technique to use when writing queries?”

The only honest answer I can give is that you have a tool box full of techniques, and you must use the right tool for the job at hand. I dread hearing the words “We saw a presentation on [fill in the blank] and are now changing all our queries to use it.” Sometimes you can start with your favorite method. But that favorite method may not work or might perform poorly, and then you should look at other methods.

One of my favorite techniques for complex queries is to use a common table expression (CTE). (I often start with a CTE, but will change direction as needed). CTEs were introduced back in SQL Server 2005. While there is some additional functionality, i.e. recursion, CTEs work similarly to derived tables. Derived tables are subqueries that are used in the FROM clause instead of named tables. I like using CTEs over derived tables because CTEs are so much easier to read. Derived tables can be nested and often are several layers deep, becoming difficult to read and understand.

Following is an example using the AdventureWorks2014 database showing the difference between using a CTE and a derived table to accomplish the same thing. First is the derived table version of the query, followed by the CTE version. Focus on the FROM clause, and you see how much more comprehensible that clause and the main query become as a result of writing the subquery as a CTE.

_{USE AdventureWorks2014;}

_GO

_{--The derived table}

_{SELECT Orders.CustomerID}

_{, SalesOrderID}

_{, OrderDate}

_{, TotalDue}

_{, CustomerTotal}

_{, AverageSale}

_{, MinimumSale}

_{FROM Sales.SalesOrderHeader AS Orders}

_{JOIN (}

_{SELECT SUM(TotalDue) AS CustomerTotal}

_{, AVG(TotalDue) AS AverageSale}

_{, MIN(TotalDue) AS MinimumSale}

_{, MAX(TotalDue) AS MaximumSale}

_{, CustomerID}

_{FROM Sales.SalesOrderHeader}

_{GROUP BY CustomerID ) AS Sales}

_{ON Orders.CustomerID = Sales.CustomerID;}

_{--The CTE}

_{WITH Sales AS (}

_{SELECT SUM(TotalDue) AS CustomerTotal}

_{, AVG(TotalDue) AS AverageSale}

_{, MIN(TotalDue) AS MinimumSale}

_{, MAX(TotalDue) AS MaximumSale}

_{, CustomerID}

_{FROM Sales.SalesOrderHeader}

_{GROUP BY CustomerID}

₎

_{SELECT Orders.CustomerID}

_{, SalesOrderID}

_{, OrderDate}

_{, TotalDue}

_{, CustomerTotal}

_{, AverageSale}

_{, MinimumSale}

_{FROM Sales.SalesOrderHeader AS Orders}

_{JOIN Sales ON Sales.CustomerID = Orders.CustomerID;}

Both queries separate the aggregate query grouped on CustomerID from the outer query. The performance is the same, and they have the same execution plans. Each query does two clustered index scans of the table, one scan for the aggregate query and one for the outer query.

New Content Item

Once you begin nesting derived tables, because one level depends on another level, the beauty of CTEs becomes apparent. You cannot nest CTEs, but you can join a CTE to another previously defined CTE. Instead of digging to find the innermost nesting, CTEs are simple to organize and read. Following is an example showing the difference. Once again, the derived table approach is first, followed by the CTE version of the same query.

_{--Derived table}

_{SELECT CustomerId, SalesOrderId, OrderDate, TotalDue, MonthSales.OrderYear,}

_{MonthSales.OrderMonth, MonthTotal, YearTotal}

_{FROM (}

_{SELECT CustomerID, SalesOrderID, CAST(OrderDate AS DATE) AS OrderDate,}

_TotalDue,

_{YEAR(OrderDate) AS OrderYear, MONTH(OrderDate) AS OrderMonth}

_{FROM Sales.SalesOrderHeader) AS SOH}

_{JOIN (}

_{SELECT YEAR(OrderDate) AS OrderYear,}

_{MONTH(OrderDate) AS OrderMonth,}

_{SUM(TotalDue) AS MonthTotal, YearTotal}

_{FROM Sales.SalesOrderHeader A}

_{JOIN (}

_{SELECT YEAR(OrderDate) AS OrderYear, SUM(TotalDue) AS YearTotal}

_{FROM Sales.SalesOrderHeader}

_{GROUP BY Year(OrderDate)}

_{) AS YearSales ON YearSales.OrderYear = YEAR(A.OrderDate)}

_{GROUP BY YEAR(OrderDate), Month(OrderDate), YearTotal}

₎

_{AS MonthSales ON MonthSales.OrderYear = SOH.OrderYear}

_{AND MonthSales.OrderMonth = SOH.OrderMonth}

_{ORDER BY SalesOrderID;}

_{--Common table expression}

_WITH

_{MonthSales AS ( --Sales by year and month}

_{SELECT YEAR(OrderDate) AS OrderYear,}

_{MONTH(OrderDate) AS OrderMonth,}

_{SUM(TotalDue) AS MonthTotal}

_{FROM Sales.SalesOrderHeader}

_{GROUP BY YEAR(OrderDate), MONTH(OrderDate)}

_),

_{YearSales AS ( --Sales by year}

_{SELECT SUM(MonthTotal) AS YearTotal,}

_OrderYear

_{FROM MonthSales}

_{GROUP BY OrderYear)}

_{SELECT CustomerID, SalesOrderID, CAST(OrderDate AS Date) AS OrderDate,TotalDue,}

_{YEAR(OrderDate) AS OrderYear,}

_{MONTH(OrderDate) AS OrderMonth, MonthTotal, YearTotal}

_{FROM Sales.SalesOrderHeader AS SOH}

_{JOIN YearSales ON YearSales.OrderYear = YEAR(OrderDate)}

_{JOIN MonthSales ON MonthSales.OrderYear = YEAR(OrderDate)}

_{AND MonthSales.OrderMonth = MONTH(OrderDate)}

_{ORDER BY SalesOrderID;}

These queries return the SalesOrderHeader rows along with total amounts at the month and year level. The values could be used to perform calculations such as comparing an individual sale to the total sales that month.

The derived table in this case is still not too difficult to understand because it is only three levels deep, and I have formatted the query to show the levels. To understand the derived table query, you must look at the deepest level, which is a list of sales by year. The next level up joins to the first query and calculates sales by month and year. The third level returns the un-aggregated rows of the table and joins to the middle layer. The outer query displays columns from the second and third levels.

The query using CTEs returns the same results, but the sections are organized from top to bottom instead of inside out. The MonthSales CTE groups the sales by year and month. The YearSales CTE groups the MonthSales values by year. Finally, in the outer query, the SalesOrderHeader table is joined to both the YearSales and MonthSales CTEs.

When doing some timings, I did find that the CTE was faster in this example. What isn’t obvious from the CTE is that the table is scanned three times. The execution plan shows the three scans:

New Content Item

The CTE query is somewhat misleading. When a CTE is used in multiple places in a query, the query inside the CTE must run each time. The YearSales CTE is not really taking the saved results of the MonthSales CTE and grouping on OrderYear. Each time MonthSales is used in the query, the CTE executes within the context of where it is used. In this case, MonthSales runs once when it is used in YearSales and once when it is joined in the outer query. It appears from how the query is written that the results of MonthSales would be retained in memory for both, but that is not the case.

I do like using CTEs. But when a CTE will be used multiple times within the query, I may switch to saving the results in a temp table instead when the CTE execution is expensive.

Are you a new or intermediate T-SQL developer? Check out my book Beginning T-SQL!

About the Author

Kathi Kellenberger is a consultant with Linchpin People. She enjoys writing and speaking on SQL Server topics, having written over two dozen articles, contributed to four books, and presented at many SQL Server events. In her spare time, Kathi enjoys spending time with family and friends, singing and cycling.