David G. Khachatrian

I recently had an epiphany about just how similarly one should read Python list comprehensions and SQL queries.

Comprehending List Comprehensions

First, we should make sure we’re on the same page about reading Python’s list comprehensions.

Let’s say we had a regular nested for-loop:

ll = [list('f00'), list('bar'), list('ba2'), list('f122bu22')]
res = []

for l in ll:  # (1)
    if len(l) < 5:  # (2)
        for ch in l:  # (3)
            if ch.isnum():  # (4)
                res.append(int(ch)**2 + 3)  # we will hoist the argument of append "above" the loop; "(0)"

# assert res == [3, 3, 7]  # passes assertion
# ...

That works, but it’s a bit unwieldy. From its verbosity, you’d expect we’re solving the traveling salesman problem or something – but we’re just mapping some values to other values. A list comprehension helps make it clear that we’re performing a mapping from one iterable to another.

How do you convert the above to a list comprehension? Move the argument of the final append to “above” the loop and then chain together the loop’s statements going from top to bottom. Using the comments above, we chain together from (0) to (4):

ll = [list('f00'), list('bar'), list('ba2'), list('f122bu22')]

# "collapsed", one-liner version of previous loop
res = [int(ch)**2 + 3  #(0)
            for l in ll  #(1)
            if len(l) < 5  #(2)
            for ch in l  #(3)
            if ch.isnum() #(4)
        ]
# without spaces:
# res = [int(ch)**2 + 3 for l in ll if len(l) < 5 for ch in l if ch.isnum()]

Bonus: Dictionary Comprehensions.

We can consider something quite similar for dictionary comprehensions. We’ll do something similar to the above; the integer result will be our value, and a unique, hashable version of the list it came from will be our key:

ll = [list('f00'), list('bar'), list('ba2'), list('f122bu22')]
res_dict = {}

for l in ll:  # (1)
    if len(l) < 5:  # (2)
        for i,ch in enumerate(l):  # (3)
            if ch.isnum():  # (4)
                res_dict[(''.join(l),i)] = int(ch)**2 + 3  # we will hoist the argument of append "above" the loop; "(0)"

# assert res_dict == {('f00', 1): 3, ('f00', 2):3, ('ba2', 2): 7}  # passes assertion
# ...

We convert to a dictionary comprehension with the same “hoisting” process:

ll = [list('f00'), list('bar'), list('ba2'), list('f122bu22')]

# "collapsed", one-liner version of previous loop
res_dict = {(''.join(l), i):int(ch)**2 + 3  # (0)
            for l in ll  # (1)
            if len(l) < 5  # (2)
            for i,ch in enumerate(l)  # (3)
            if ch.isnum() # (4)
}
# without spaces:
# res_dict = {(''.join(l), i): int(ch)**2 + 3 for l in ll if len(l) < 5 for i,ch in enumerate(l) if ch.isnum()}

SQL follows a similar “parsing” process!

You may have difficulty remembering what order to place the different commands of a SQL query: “Does WHERE come before or after a GROUP BY? What about an aggregation command like COUNT?”

You can first think of it as an unrolled “for-loop” of sorts, but using SQL syntax. Say we had a table of transactions. We want to get the transaction_ids and associated customers for any customer (that have unique user_ids) with more than 5 transactions whose purchase_price had been greater than 2000 units of currency. The (syntactically incorrect) “unrolled” format could be:

-- invalid, "unrolled" SQL query
FROM transactions -- (1)
    WHERE purchase_price > 2000 -- (2)
        GROUP BY user_id -- (3)
            HAVING COUNT(user_id) > 5 -- (4)
                SELECT `user_id`, `transaction_id` -- (0)

Notice the parallel with the Python unrolled for-loop:

for (1):
    if (2):
        for (3):
            if (4):
                res.append((0))

Each expression below Line X is acting on the subset of the table that remains after being filtered by Line X. In SQL, the filter may also be based on an aggregation. (This would be trickier to perform in a Python for-loop.) Just like converting the unrolled-for loop into a list comprehension, we can convert the “unrolled” SQL query into a valid query by “hoisting” the SELECT statement to the very top!

-- valid SQL query
SELECT `user_id`, `transaction_id`
    FROM transactions
        WHERE purchase_price > 2000 -- `WHERE` is a filter for *all* records
            GROUP BY user_id
                HAVING COUNT(user_id) > 5  -- `HAVING` filters groups

Since SQL can perform these aggregations with simple one-liners (e.g. GROUP BY), not all SQL queries can be written as list comprehensions (unless you do some weird gymnastics with Python’s functools.reduce which would probably make the code less readable than an “unrolled” for-loop). But the “reading order” and the idea of “hoistng the result upwards” apply to both.

Bonus: SQL Aliasing.

A nicety about having the SELECT portion at the top is that any aliases you may have used for the results can be used in the query itself. In general, any aliases “earlier” in the query can be used in “later” parts of the query. As a silly example, let’s say we wanted to rename user_id to just user in our result. We could write:

-- valid SQL query
SELECT `user_id` AS `user`, `transaction_id`
    FROM transactions
        WHERE purchase_price > 2000 -- `WHERE` is a filter for *all* records
            GROUP BY user -- can also write `GROUP BY user_id`; either the original expression or the alias works
                HAVING COUNT(user_id) > 5  -- `HAVING` filters groups

Bonus$^2$: Even more matching syntax, and SQL appendices.

Python’s with A as X, B as Y context manager with aliasing acts analogously to SQL’s WITH a AS x, b AS y; they provide a context and appropriate aliasing for something that is likely to be used within the contextual expression. (Of course, in Python, context managers are much more flexible than simple aliasing and not all contexts involve aliasing.)

SQL’s JOIN .. ON .. essentially is the start of a “new loop”. The query would look something like the following:

SELECT id,cool_thing FROM table_a

-- now we join our results from above to some table_b
-- in this case, the pivot/matching key is table_a.id == table_b.serial_no
-- (in SQL, comparator is single '=')
OUTER JOIN table_b on table_a.id=table_b.serial_no

OK, but why are they written this way?

This sort of declaration might seem a little alien – in a way, it looks like we’re using a variable before assignment! For example, in res = [x.lower() for x in my_list], it looks like we should get a NameError: 'x' is not defined for calling x.lower() “too soon”. (Of course, a programming language will know its own “idiosyncrasies”, so it’s not like getting the error is actually a concern. It just might feel wrong.)

I suspect the point is that it’s clearer what our result is/looks like, especially if the programmer named things clearly. You don’t have to dig through the filters and aggregations to figure out that we’re getting user_ids and transaction_ids in the SQL example. It’s the first thing you read in the line. Same thing with the Python example: you know you’re getting some ints created by casting some chars, without having to comb through the loops and conditionals. When you want to inspect, modify, maintain, or extend a codebase, you make things much easier if you make your code/logic as “self-explanatory”/self-documenting as possible. And like with any good writing, it helps to “get to the point”. This sort of syntax can help achieve these desiderata (oh man where did that come from) goals.

Further reading.

I was apparently not the first to note the comparison between SQL and Python comprehensions, but it doesn’t seem like the analogy has spread far and wide.

I have more or less recapitulated an explanation of list comprehensions found elsewhere on the Internet, but it doesn’t seem to come up near the top of search results about Python list comprehensions.

There are some Python recipes for doing SQL-like joins using Python list comprehensions.

Wrapping up.

There is quite a bit of similarity in the syntax/”reading order” of SQL’s queries and Python’s comprehensions (themselves borrowed from Haskell). Both follow a declarative style of programming, a style which some may not be too familiar with. Hopefully this connection will help those comfortable with Python’s comprehensions parse (and write!) SQL queries, and vice-versa.

- DK