It is often required to retrieve most recent record when building ETL to create data mart. The source table is normally very big, which make us pay attention to performance .The performance could be an issue even with data warehousing appliance like Netezza database. Three methods are introduced in this document. Though many people use the method 1 (Use a Complete Subquery), it is not a good solution, as this method could results in duplicated records, and poor performance. Based on my experience, the method 3 (Use Window Functions) is best option, as it provides you no duplicates and best performance.
Use a Complete Subquery
FROM SourceFact F
INNER JOIN (SELECT SourceID,Max(TIMESTAMP_SUBMITTED) AS LAST_TIMESTAMP_SUBMITTED
GROUP BY SourceID) F1
ON F.SourceID = F1.SourceID
AND F.TIMESTAMP_SUBMITTED = F1.last_TIMESTAMP_SUBMITTED;
if more than one sourceID with the same TIMESTAMP_SUBMITTED, then more than records for this sourceID will be selected, which breaks the uniqueness of this query. Also, the performance is very poor when the table SourceFact is very big.
Use Correlated SubqueriesCorrelated subqueries are subqueries that depend on the outer query. It’s like a for loop in SQL. The subquery will run once for each row in the outer query:
WHERE ( SourceID, TIMESTAMP_SUBMITTED )
IN (SELECT SourceID,Max(TIMESTAMP_SUBMITTED) AS LAST_TIMESTAMP_SUBMITTED
GROUP BY SourceID);
This solution is even worse for big table, as it result in very poor performance.
Use Window FunctionsWindow function is little complicated, but it yields much better performance, 10-100 times for big table. In addition, it will grantee the uniqueness, which is extremely important in data warehouse design.
FROM (SELECT ROW_NUMBER() OVER (
PARTITION BY SourceID
ORDER BY TIMESTAMP_SUBMITTED desc) LAST_TIMESTAMP_SUBMITTED,
FROM SourceFact) F
WHERE F.last_TIMESTAMP_SUBMITTED = 1;