You’ve Come to the Right Place

Magnus PS
7 min readMar 21, 2023

--

The Art of the Start with Window Functions

Photo by Sander Sammy on Unsplash

A lot of the content out there on window functions leaves us high and dry.

Confused. Unclear. And sometimes downright dumb.

Window functions are powerful, yet a lot of the content out there glosses over details or gets too technical, too early.

It can be difficult to know where to start and get answers to questions like:

  1. What are window functions and how are they written?
  2. What are some practical examples?
  3. Where do I go for more on the topic vs. where do I even start?

If you’ve been looking for answers to questions like these, then you’ve come to the right place.

Photo by Gia Oris on Unsplash

What are window functions? And why are they useful?

Window functions perform a calculation across a set of rows related to the current row based on the “window” we specify.

Where a GROUP BY clause returns just aggregate values, a window function returns aggregate values and non-aggregate values side-by-side.

Photo by Julia Kho on Medium

That’s the “X factor”.

Being able to look at the data of interest and your aggregation side-by-side.

It keeps our code clean and performant.

To wrap our head around what all this jibber-jabber means and what a window function looks like, let’s consider a “generalized” code snippet:

Image by author

On line 2 we elect column_01, column_02, and column_03from table_01, whereas on line 3 we notice a lengthier expression.

That is our window function.

Let’s break it down:

  • window_function is where we’d enter the name of the intended window function (ie. SUM). We apply the window function to column_03.
  • OVER is our window function keyword. It tells us “this is a window function”.
  • PARTITION BY determines the “window”. It divides the rows into partitions upon which we operate. We partition by column_01.
  • ORDER BY determines the order of rows within the window. We order by column_02.

There are different ways to customize this generalized code snippet and that depends on the functionality you’re looking for. Which depends on the type of window function.

The two most popular types are aggregate and value window functions:

  1. Aggregate functions perform a calculation across a “window”, the set of rows we specify, and return the related value for the current row. AVG, COUNT, MAX, MIN, and SUM are examples of aggregate window functions
  2. Value functions return a value for the current row from a “window”, the set of rows we specify. LEAD, LAG, NTILE, FIRST_VALUE, LAST_VALUE are examples of value window functions.

There are also ranking window functions, which you can check out on your own, but that’s enough rat-a-tat of the keyboard for now.

Let’s explore some practical examples!

Photo by Sunbolt Recruiting

We’ll start with an aggregate window function.

We want to compare employee’s salaries with the average for their position:

Image by author

We elect the name, position, and salary columns from the employees table and define our window function to take the average salary, partitioned by position and output as the avg_position_salary.

We did not use the ORDER BY clause for this particular window function because it wasn’t necessary to answer the question at hand.

Let’s consider the output:

Image by author

Our data is output with the addition of an avg_position_salary column. With this, we can eye where an employee’s salary lies relative to the average. For example, Willy Washington has an above average salary for his position.

If we wanted to take it a step further, we could calculate the difference between an individual’s salary and the avg_position_salary and ROUND our output to 2 decimal places.

Alright-alright-alright aggregate, let’s see a value window function example.

We want to calculate the time spent in each phase based on our event logs:

Image by author

On line 3 we elect all columns from the event table.

On line 4, we define our window function to take the LEAD start_time (the start time of the next event) partitioned by employee_id and customer_id and ordered by start_time.

  • Our window is defined by employee_id and customer_id. Until either ID changes, we’re in the same window. As soon as one (or both) changes, we enter the next window.
  • Within that partition, the earliest start_time is at the top and the latest is at the bottom. It follows this chronology because we elected start_time as our order by column.
  • The LEAD function returns the next row’s start_time (where applicable) for our window.

On line 5, we reuse our window function to calculate duration, how long the event took place using end_time- start_time.

In this way we get the event, its start_time, end_time, and duration all side-by-side:

Image by author

Our data is output with the addition of 2 columns using our LEAD window function: end_time and duration.

With these columns, we can eye how much time each employee spent in each event for a given customer on a given date.

For example, for employee_id 1, customer_id 1 nearly 3 hours and 30 minutes were spent checked in before the work was complete.

Note: “Issue PO” events don’t have an associated end_time or duration because they’re the last event in the series (Check in → Complete → Issue PO). The LEAD window function doesn’t have a next timestamp to provide so its NULL.

If we wanted to take this query a step further we could calculate how much time was spent in each phase (1) per employee_id or (2) based on the event s themselves.

Let’s consider the 2nd option:

Image by author

Lines 3–7 we’ve already reviewed. The only update is that we bundle our query as a CTE, a temporary result set (times) to be used in our grouping.

In lines 11–15 we group our results by event (“Check In”, “Complete” or “Issue PO”) and calculate the total_duration per group:

Image by author

We take the sum of durations (duration) per event (“Check In”, “Complete” or “Issue PO”) and that gives us our total_duration.

Do you remember when you were a kid and we were taught to count using M&Ms?

It’s the same thing. The colors (ie. red, green, yellow) were our “groups” and the “aggregate function” was the number of M&Ms of each color (the count).

If we turn our eye back to the output above, we observe that more time is spent in the Complete than the Check In phase.

We can interpret this to mean, more time is spent in a state of completing the work, before issuing the PO than is spent doing the actual work. That could identify a hold up in getting paid by the customer sooner …

This is one example of how a window function could be used out in the real world.

Photo by Jens Lelie on Unsplash

With regard to window functions, das es das for your intro :)

Naturally, this leads to thoughts of “What’s next?”.

Where do you go from here?

  1. If you’d like to go deeper on window functions, check out Julia Kho’s in-depth article on the topic. I referred to it a bunch in writing this article.
  2. If your foundation needs work or you want a refresh, check out Nikolai Shuster’s Udemy course. It’s the single best online course I’ve ever taken.

You’ve got a start.

Next up?

A stronger foundation and understanding, right on to crushing that oft-dreaded SQL screening.

Go get it!

I hope you gained from reading this and would love to hear your thoughts on this post and what you’d be interested in learning more about.

Whether DM or comment below, feedback is fully welcome 🙂

Thank you for reading through!
Magnus

--

--

Magnus PS

Writer | Data Analyst | Project Manager | "Health Nut"