Stephen Balkum

Agile Enthusiast, Leader, and Developer

So You Want to Learn SQL… Session 4

no comment

Homework Answers

Generate a list of products that have a bill of material, stating each subcomponent and the quantity used. You should use the Production.BillOfMaterials table and create a human readable phrase. You might find the STR function useful.

SELECT 'A ' + p.Name + ' contains' + STR(bom.PerAssemblyQty) + ' ' + c.Name
	FROM Production.Product p
	JOIN Production.BillOfMaterials bom ON bom.ProductAssemblyID = p.ProductID
	JOIN Production.Product c ON c.ProductID = bom.ComponentID
	ORDER BY 1

Date Functions

GETDATE()

This first function is self-explanatory. It gets the time right now.

SELECT GETDATE()

DATEDIFF(datepart, startdate, enddate)

This function returns the difference between two dates. Returned value is an integer. Here are some useful dateparts are:
datepart year quarter month day week hour minute second
Abbreviations yy, yyyy qq, q mm, m dd, d wk, ww hh mi, n ss, s

SELECT	DATEDIFF(dd, '8/1/2013', '8/10/2013') AS 'days',
	DATEDIFF(dd, '8/10/2013', '8/1/2013') AS 'negative difference',
	DATEDIFF(hh, '8/1/2013 3:00', '8/2/2013 1:00') AS 'hours',
	DATEDIFF(m, 0, '8/1/2013') AS 'since 1/1/1900'

DATEADD(datepart, number, date)

This function adds the specified number of dateparts to the given date and returns the new date. The dateparts are the same as DATEDIFF.

SELECT	DATEADD(dd, 1, '8/1/2013') AS '1 day forward',
	DATEADD(dd, -1, '3/1/2012') AS 'Leap year known',
	DATEADD(yyyy, 1, '8/12/2013 13:30') AS 'keeps time',
	DATEADD(mm, 1363, 0) AS 'since 1/1/1900'

Combine Functions

Frequently, we want to bucket dates. For example, to count orders per day:

SELECT DATEADD(DD, DATEDIFF(DD, 0, o.OrderDate), 0) AS 'Day', COUNT(*) AS 'Count'
	FROM Sales.SalesOrderHeader o
	WHERE o.OrderDate > '7/1/2008'
	GROUP BY DATEADD(DD, DATEDIFF(DD, 0, o.OrderDate), 0)
	ORDER BY Day

Notice, you still have to GROUP BY any column not involved in an aggregate function, regardless of its complexity.

UNION ALL

To produce a single result set that is a combination of two queries, use UNION ALL. All of the columns of the two queries must have exactly the same data types. Here is a single list of all the address types and phone number types we have in our database.

SELECT a.AddressTypeID AS ID, a.Name, 'Address' AS Type
	FROM Person.AddressType a
UNION ALL
SELECT p.PhoneNumberTypeID, p.Name, 'Phone'
	FROM Person.PhoneNumberType p
ORDER BY Name

Subqueries

A subquery, a query within a query, can be used almost anywhere to access data in a more useful way. Useful can be a different structure for joining or it could be for performance purposes.
Let’s take a look at this year’s high value orders product 863. A high value order containing product 863 is >=$40,000. So, we start with our a query using what we already know how to do:

SELECT o.SalesOrderID, o.TotalDue, p.FirstName + ' ' + p.LastName AS CustomerName
	FROM Sales.SalesOrderHeader o
	JOIN Sales.Customer c on c.CustomerID = o.CustomerID
	JOIN Person.Person p ON p.BusinessEntityID = c.PersonID
	JOIN Sales.SalesOrderDetail od ON od.SalesOrderID = o.SalesOrderID
	WHERE o.TotalDue > 40000
		AND od.ProductID = 863
	ORDER BY o.OrderDate

The problem is filtering to product 863 requires the SalesOrderDetail table. If we join to SalesOrderDetail, our rows may expand for any order with multiple line items of that product. We don’t need to display any data from OrderItems, but the data we need to filter on is there.

WHERE EXISTS (subquery)

So, we will add a special WHERE clause that just looks for the existence of data in a query:

SELECT o.SalesOrderID, o.TotalDue, p.FirstName + ' ' + p.LastName AS CustomerName
	FROM Sales.SalesOrderHeader o
	JOIN Sales.Customer c on c.CustomerID = o.CustomerID
	JOIN Person.Person p ON p.BusinessEntityID = c.PersonID
	WHERE o.TotalDue > 40000
		AND EXISTS (SELECT *
				FROM Sales.SalesOrderDetail od
				WHERE od.SalesOrderID = o.SalesOrderID
					AND od.ProductID = 863)
	ORDER BY o.OrderDate

There are two keys takeaways with this query:
1) We didn’t have to do a join to OrderItems and get needless row duplication.
2) The EXISTS check is very fast since it doesn’t really return data, just the existence of rows. Once it finds a single row, it moves on and does not have to find all of the rows.
Now, we can expand our query to show two products: 863 and 854. However, high value orders for product 854 are >= $80,000. We can UNION two queries:

SELECT o.SalesOrderID, o.TotalDue, p.FirstName + ' ' + p.LastName AS CustomerName, 863 AS ProductID
	FROM Sales.SalesOrderHeader o
	JOIN Sales.Customer c on c.CustomerID = o.CustomerID
	JOIN Person.Person p ON p.BusinessEntityID = c.PersonID
	WHERE o.TotalDue > 40000
		AND EXISTS (SELECT *
				FROM Sales.SalesOrderDetail od
				WHERE od.SalesOrderID = o.SalesOrderID
					AND od.ProductID = 863)
UNION ALL
SELECT o.SalesOrderID, o.TotalDue, p.FirstName + ' ' + p.LastName AS CustomerName, 854
	FROM Sales.SalesOrderHeader o
	JOIN Sales.Customer c on c.CustomerID = o.CustomerID
	JOIN Person.Person p ON p.BusinessEntityID = c.PersonID
	WHERE o.TotalDue > 80000
		AND EXISTS (SELECT *
				FROM Sales.SalesOrderDetail od
				WHERE od.SalesOrderID = o.SalesOrderID
					AND od.ProductID = 854)
ORDER BY SalesOrderID

JOINing to a Subquery

With everything we have seen in this session, we can now create a psuedo table and join to it. Our request is for a two-week rolling average for the past four weeks. Since our sample database is all historical data, we will use August 1, 2008, rather than GETDATE().
We will start by creating our time ranges. Here is the current two-week date range:

SELECT	DATEADD(dd, -1, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)) as 'StartDate',
	DATEADD(dd, 13, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)) as 'EndDate'

Add in the previous 4 weeks:

SELECT	DATEADD(dd, -1, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)) as 'StartDate',
	DATEADD(dd, 13, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)) as 'EndDate'
UNION ALL
SELECT	DATEADD(dd, -8, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)),
	DATEADD(dd, 6, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0))
UNION ALL
SELECT	DATEADD(dd, -15, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)),
	DATEADD(dd, -1, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0))
UNION ALL
SELECT	DATEADD(dd, -22, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)),
	DATEADD(dd, -8, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0))

Our base query for AOV and order count:

SELECT	AVG(o.TotalDue) AS AOV, COUNT(*) AS Count
	FROM Sales.SalesOrderHeader o

We definitely don’t want to run this query since it looks at all orders. However, we can join to our query of date ranges to remove unwanted data:

SELECT w.startdate, AVG(o.TotalDue) AS AOV, COUNT(*) AS Count
	FROM Sales.SalesOrderHeader o
	JOIN (
		SELECT	DATEADD(dd, -1, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)) as 'StartDate',
			DATEADD(dd, 13, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)) as 'EndDate'
		UNION ALL
		SELECT	DATEADD(dd, -8, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)),
			DATEADD(dd, 6, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0))
		UNION ALL
		SELECT	DATEADD(dd, -15, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)),
			DATEADD(dd, -1, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0))
		UNION ALL
		SELECT	DATEADD(dd, -22, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0)),
			DATEADD(dd, -8, DATEADD(ww, DATEDIFF(ww, 0, '8/1/2008'), 0))
) W ON W.StartDate <= o.OrderDate AND w.EndDate > o.OrderDate
GROUP BY w.startdate

Notice that a row in Orders will join to multiple rows in the subquery if it belongs to multiple date ranges. This allows the order to be counted in two different two-week rolling averages.

Homework

1) List the two-week rolling total quantity by territory for the current period and 3 prior periods by week. The result set should have three columns: Period Start Date, Territory Name, and Total Quantity.
2) List the total sales for each day of the week by territory for July 2008. The columns in the result set should be: Territory, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday. Do not display the territory if it had zero sales all week.

Next Session will be just the homework answers as I have exhausted what I consider necessary for the goal of this series.

So You Want to Learn SQL… Session 3

no comment

Homework Answers

List AOV per credit card type and country region code for July, 2008. Drop rows with statistically irrelevant results, i.e. less than 10. Sort by card type, AOV descending, then country.

SELECT cc.CardType, cr.Name AS Country, AVG(oh.TotalDue) AS AOV
	FROM Sales.SalesOrderHeader oh
	JOIN Sales.CreditCard cc ON cc.CreditCardID = oh.CreditCardID
	JOIN Sales.SalesTerritory t ON t.TerritoryID = oh.TerritoryID
	JOIN Person.CountryRegion cr ON cr.CountryRegionCode = t.CountryRegionCode
	WHERE oh.OrderDate >= '7/1/2008' AND oh.OrderDate < '8/1/2008'
 	GROUP BY cc.CardType, cr.Name
 	HAVING COUNT(*) > 10
	ORDER BY AOV DESC, Country

OUTER JOINs

LEFT [OUTER] JOIN and RIGHT [OUTER] JOIN

All rows from one of the tables, plus the rows from the other table.  The columns will contain NULL when a matching row is not found.

LEFT

  • All rows from the existing table, plus matching rows from the joining table
  • This is by far the most common outer join in my experience
  • LEFT [OUTER] JOIN Table2 t2 ON t2.PrimaryKey = t1.ForeignKey

RIGHT

  • All rows from the joining table, plus columns from the existing table
  • This join is rather rare
  • RIGHT [OUTER] JOIN Table2 t2 ON t2.PrimaryKey = t1.ForeignKey

Compare the output, especially the row count, of these queries:

SELECT *
	FROM HumanResources.JobCandidate

SELECT *
	FROM Person.BusinessEntity

SELECT jc.JobCandidateID, be.BusinessEntityID
	FROM HumanResources.JobCandidate jc
	JOIN Person.BusinessEntity be ON be.BusinessEntityID = jc.BusinessEntityID

SELECT jc.JobCandidateID, be.BusinessEntityID
	FROM HumanResources.JobCandidate jc
	LEFT JOIN Person.BusinessEntity be ON be.BusinessEntityID = jc.BusinessEntityID

SELECT jc.JobCandidateID, be.BusinessEntityID
	FROM HumanResources.JobCandidate jc
	RIGHT JOIN Person.BusinessEntity be ON be.BusinessEntityID = jc.BusinessEntityID

FULL [OUTER] JOIN

All rows from both tables

  • Columns will contain NULL when matching row not found
  • This join is rather rare
  • FULL [OUTER] JOIN Table2 t2 ON t2.PrimaryKey = t1.ForeignKey

CROSS JOIN

All rows from existing table matched to all rows in joining table and vice versa

  • So rare I have only ever done this once for a real request in 15 years of working with SQL
  • CROSS JOIN Table2
  • No relationship specified

A contrived example: compare the row counts and data of these two queries:

SELECT *
	FROM Person.AddressType

SELECT at1.Name, at2.Name
	FROM Person.AddressType at1
	CROSS JOIN Person.AddressType at2

In the second query you can see every row in at1 is joined to every row in at2 creating all possible combinations.

LEFT OUTER JOIN in Action

Move thru this progression, notice the result set changes, and continue to note the row count stays the same.  We start with the list of all products.  Our goal is to get a list of the product names, category, subcategory, and model as a human readable phrase.  A product has a name and can have an optional subcategory and optional model.  That means of these two optional foreign keys in Product, either of them could be NULL.

SELECT p.ProductID, p.Name, p.ProductSubcategoryID, p.ProductModelID
	FROM Production.Product p

(504 rows)

Join to ProductModel to get the model name.

SELECT p.ProductID, p.Name, p.ProductSubcategoryID, pm.Name
	FROM Production.Product p
	LEFT JOIN Production.ProductModel pm ON pm.ProductModelID = p.ProductModelID

Join to ProductSubcategory to get the subcategory name.

SELECT p.ProductID, p.Name, sc.Name, pm.Name
	FROM Production.Product p
	LEFT JOIN Production.ProductModel pm ON pm.ProductModelID = p.ProductModelID
	LEFT JOIN Production.ProductSubcategory sc ON sc.ProductSubcategoryID = p.ProductSubcategoryID

Join to ProductCategory to get the category name. Notice we must continue to use a left join to ProductCategory even though ProductSubcategory.ProductCategoryID may not be NULL.

SELECT p.ProductID, p.Name, sc.Name, c.Name, pm.Name
	FROM Production.Product p
	LEFT JOIN Production.ProductModel pm ON pm.ProductModelID = p.ProductModelID
	LEFT JOIN Production.ProductSubcategory sc ON sc.ProductSubcategoryID = p.ProductSubcategoryID
	LEFT JOIN Production.ProductCategory c ON c.ProductCategoryID = sc.ProductCategoryID

Now, let’s build a readable phrase.

SELECT p.ProductID, p.Name + ', a ' + pm.Name + ' model in ' + sc.Name + ' within the ' + c.Name + ' category'
	FROM Production.Product p
	LEFT JOIN Production.ProductModel pm ON pm.ProductModelID = p.ProductModelID
	LEFT JOIN Production.ProductSubcategory sc ON sc.ProductSubcategoryID = p.ProductSubcategoryID
	LEFT JOIN Production.ProductCategory c ON c.ProductCategoryID = sc.ProductCategoryID

Hmmmm. Where we had NULL model and subcategory names, the concatenation is still NULL. Let’s use the COALESCE function to fill in the gaps.

SELECT p.ProductID, p.Name + COALESCE(', a ' + pm.Name, ', an unknown') + ' model in ' + COALESCE(sc.Name + ' within the ' + c.Name, ' no specified') + ' category'
	FROM Production.Product p
	LEFT JOIN Production.ProductModel pm ON pm.ProductModelID = p.ProductModelID
	LEFT JOIN Production.ProductSubcategory sc ON sc.ProductSubcategoryID = p.ProductSubcategoryID
	LEFT JOIN Production.ProductCategory c ON c.ProductCategoryID = sc.ProductCategoryID

The COALESCE function can accept as many arguments as you like. It tests the first argument for a NULL value. If the it is non-NULL, it will return that value and quit. If it is NULL, it will test the second argument similarly. The last argument will be the default value.

Homework

Generate a list of products that have a bill of material, stating each subcomponent and the quantity used. You should use the Production.BillOfMaterials table and create a human readable phrase.  You might find the STR function useful.  Here are a few representative rows:

A BB Ball Bearing contains      1 Lock Ring
A BB Ball Bearing contains      2 Cone-Shaped Race
A Front Derailleur contains      1 Front Derailleur Cage

Moving the sessionState connection string to a separate config file

no comment

Typically, the sessionState for SQL is specified in the web.config something like this:

<sessionState mode="SQLServer"
    sqlConnectionString="Data Source=dbserver;User Id=username;Password=password"
    cookieless="false" timeout="20"/>

However, to make switching connection strings between development, QA, and live environments easier in our build process, we store all of our connection strings in a separate config file which is referenced in the web.config in this way:

<connectionStrings configSource="DevConnections.config"/>

Our build process automatically swaps out the configSource value based on the target environment. It was only logical to want to move the sessionState settings to a separate config file as well.  The build process could make a very similar step to modify the sessionState for the target environment.  So, are web.config now contains this line:

<sessionState configSource="LiveSessionState.config" />

and LiveSessionState.config contains:

<sessionState mode="SQLServer"
    sqlConnectionString="Data Source=dbserver;User Id=username;Password=password"
    cookieless="false" timeout="20"/>

At this point, our live application throws an exception because the sessionState connection string does not specify an Initial Catalog.  To be honest, this is first time I’ve noticed this connection string doesn’t give the Initial Catalog like all of our other connection strings.  This is because sessionState assumes the database name is ASPState.  By moving the settings to a separate config file, we have to specify the database name?  Well, OK.  So, we add it and we get a new exception because we specifed the Initial Catalog.  Frustration begins to set in since we only use a SQL session state server in our live environment.

Digging deeper, there is an additional parameter, allowCustomSqlDatabase, for sessionState that allows us to specify the Initial Catalog in case we didn’t want to use ASPState and we added it:

<sessionState mode="SQLServer" allowCustomSqlDatabase="true"
    sqlConnectionString="Data Source=dbserver;Initial Catalog=ASPState;User Id=username;Password=password"
    cookieless="false" timeout="20"/>

This feels a little quirky, but the problem is solved.

Current State of Azure’s Preview AutoScale

no comment

Today at the Austin Code Camp, Aaron Murray and I gave a presentation on our experience with Azure’s AutoScale which is in preview. Take a look at our slides.

So You Want to Learn SQL… Session 2

no comment

Review

Homework Answer

For orders in July of 2008, list the territory IDs, average total sale and number of sales per territory ID, sorted by best to worst sales

SELECT TerritoryID AS ID, AVG(TotalDue) AS AOV, COUNT(*) AS NumOrders
	FROM sales.SalesOrderHeader
	WHERE --OrderDate > '7/1/2008' AND OrderDate < '2008-8-1'
		OrderDate BETWEEN '7/1/2008' AND '8/1/2008'
	GROUP BY TerritoryID
	ORDER BY AOV DESC

COUNTing

COUNT(*) is very efficient since it has no restrictions and counts every qualifying row.  COUNT(column) does not count NULL values, but otherwise counts every row and pretty fast.  COUNT(DISTINCT column) requires the database engine to track the values it has already seen.  It also does not count NULL values.  This is the least efficient choice, so use it only when needed.

Take a look at the data from this query:

SELECT SalesOrderID, CreditCardID
	FROM Sales.SalesOrderHeader
	WHERE OrderDate > '7/1/2008'

Now, observe the differences between the variations of COUNT:

SELECT COUNT(*), COUNT(distinct CreditCardID), COUNT(CreditCardID)
	FROM Sales.SalesOrderHeader
	WHERE OrderDate > '7/1/2008'

HAVING

The HAVING clause is the equivalent of WHERE for aggregate functions.  Pesky aggregate functions.

Example: How many orders by shipping method since 8/7/2013:

SELECT TerritoryID, COUNT(*) as NumOrders
	FROM Sales.SalesOrderHeader
	WHERE OrderDate > '7/1/2008'
	GROUP BY TerritoryID

But let’s only look at methods that have a statistically relevant number of orders:

SELECT TerritoryID, COUNT(*) as NumOrders
	FROM Sales.SalesOrderHeader
	WHERE OrderDate > '7/1/2008'
	GROUP BY TerritoryID
	HAVING COUNT(*) > 10

Normalization

Without getting technical, one of the biggest goals in transactional database design is to reduce data duplication.  This means a data value, or collection of values like an address, do not appear more than once.  The amount of data duplication in a database schema is known as Normalization.  This is why tables get split out to more tables and we have dependencies from one table to another.

When we design a table like Sales.SalesOrderHeader, we might put the shipping address in the table.  However, once a customer orders a second time from us we find the same shipping address appearing more than once in our data.  So, we pull the shipping address columns out to a separate table and leave behind a reference ID.

Foreign Keys

This reference ID is called a Foreign Key. It has the same value as the Primary Key in the table it references. A table can have a compound Primary Key which is multiple columns that together form a unique ID for the table. Referencing a table with a compound Primary Key requires all of columns involved in the key, thus a compound Foreign Key.

Take a look at the columns in the Orders table in the Object Explorer and you will see several Foreign Keys (FK).

[INNER] JOIN … ON … [AND]

We join tables together because there is desired related data for our query.  For example, to get the name of the customer for an order, we will have to join from the orders table to the customers table.

  • There multiple types of Joins with the most common, and default, being the Inner Join.
  • Draw the columns of another table into the query by specifying the relationship between the query and the incoming table:
    JOIN JoiningTable ON JoiningTable.PrimaryKey = ReferencingTable.ForeignKey
  • For compound keys, specify all of the relationships separated by AND:
    JOIN JoiningTable ON JoiningTable.PrimaryKey1 = ReferencingTable.ForeignKey1 AND JoiningTable.PrimaryKey2 = ReferencingTable.ForeignKey2
  • Any rows that cannot be matched across the join will be dropped from the result set. This usually happens when the FK column contains NULL.
  • Make a habit now of using table aliases to save typing. Table aliases are required for future complex join structures.

Here is a list of the number of orders by territory:

SELECT t.Name, COUNT(*)
	FROM Sales.SalesOrderHeader o
	JOIN Sales.SalesTerritory t ON t.TerritoryID = o.TerritoryID
	WHERE o.OrderDate > '7/1/2008'
	GROUP BY t.Name

Count of orders by shipping method and territory which requires joining from SalesOrderHeader to SalesTerritory and SalesOrderHeader to ShippingMethod:

SELECT sm.Name, t.Name, COUNT(*)
	FROM Sales.SalesOrderHeader o
	JOIN Sales.SalesTerritory t ON t.TerritoryID = o.TerritoryID
	JOIN Purchasing.ShipMethod sm ON sm.ShipMethodID = o.ShipMethodID
	WHERE o.OrderDate > '7/1/2008'
	GROUP BY t.Name, sm.Name

Count of orders by shipping state which requires joining from SalesOrderHeader to Address and then from Address to StateProvice:

SELECT sp.Name, COUNT(*)
	FROM Sales.SalesOrderHeader o
	JOIN Person.Address a ON a.AddressID = o.ShipToAddressID
	JOIN Person.StateProvince sp ON sp.StateProvinceID = a.StateProvinceID
	WHERE o.OrderDate > '7/1/2008'
	GROUP BY sp.Name

Homework Exercise

List AOV per credit card type and country region code for July, 2008. Drop rows with statistically irrelevant results, i.e. less than 10. Sort by card type, AOV descending, then country.