This is a story about problem solving. True, it involves Python and database code, but eventually what I’m describing is running into a problem, understanding its root cause and finding a creative way to solve it.

To understand this blog post, all you need to know is some SQL. I’ll try to take care of the rest.

Entities & Schema

For the sake of simplicity, we’ll be talking about Books and Authors. The relation between them is many-to-many: a book may have more than one author, and an author can write more than one book.

The data is stored in a PostgreSQL database in 3 tables: Book, Author, and BookToAuthor. Each entity (book and author) has an ID, and the last table simply maps between these IDs.

book_id author_id
1 3
2 9
3 6
3 9

The database code is written in Peewee. Peewee is an ORM library in Python, which means it converts data from certain SQL databases (PostgreSQL is among them) to Python objects and vice versa. When you query the database, you do it by writing object-oriented code instead of raw SQL statements.

The Task

The task I had to perform was simple: find all books whose authors are exactly a certain list of authors, e.g. find all books whose authors are Jane Austen, J.K. Rowling and Harper Lee (all three!).

To do that, I performed a two-level query:

  1. First, for each book ID, aggregate all its author IDs into an array (as seen in the table below).
  2. Then, return only the books whose aggregated array equals the given array.
book_id author_ids
1 [3]
2 [9]
3 [6, 9]

This was my code for the first step:

# First-Level Query
subquery = BookToAuthor.select(
    BookToAuthor.book_id,
    fn.ARRAY_AGG(Book.author_id.distinct()).alias('author_ids')).group_by(Book.book_id)

In SQL, this translates into:

SELECT book_id, 
       ARRAY_AGG(DISTINCT author_id) AS author_ids
FROM BookToAuthor

This would return for each book an aggregated array of author IDs.

Now the code for the whole process. Before reading it, notice that:

  • The certain_authors variable is a list of the author IDs I’m interested in, say [6, 9].
  • The .c operator is Peewee’s way of accessing a subquery result’s column.
  • contains is a Peewee method allowing to test if a database array contains the given elements.
# First-Level Query
subquery = BookToAuthor.select(
    BookToAuthor.book_id,
    fn.ARRAY_AGG(Book.author_id.distinct()).alias('author_ids')).group_by(Book.book_id)

# Second-Level Query
results = BookToAuthor.select(
    BookToAuthor.book_id).join(
    # Join with subquery on the book_id column
    subquery, on=(subquery.book_id == BookToAuthor.book_id)).where(
    # Choose only book IDs whose authors are (certain_authors)
    subquery.c.author_ids.contains(certain_authors))

Whoopie doo. This looked good to me.

It wasn’t. or: The Problem

The problem was I didn’t get any results; I got an error.

ProgrammingError: operator does not exist: integer[] ~~* unknown
LINE 1: ...= "t1"."book_id") WHERE ("t2"."author_ids" ILIKE '%[1...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

Peewee was using ILIKE string operator. It was not treating author_ids as an array.

Using Peewee’s .sql() method, I could confirm this by seeing the full query to which my code was translated:

SELECT t1.book_id 
FROM BookToAuthor AS t1
INNER JOIN (SELECT t1.book_id, 
                   ARRAY_AGG(DISTINCT x.author_id) AS author_ids 
            FROM BookToAuthor AS t1 
            GROUP BY t1.book_id) AS t2 
      ON (t2.book_id = t1.book_id) 
WHERE t2.author_ids ILIKE '%<certain_authors>%'

It was time to see why Peewee chose a string operator - a case-insensitive string match - instead of array-contains.

Debugging

With Pycharm I did a lot of F7 and F8 to step into and over the code. Eventually I broke on the following method inside the ColumnBase class:

def contains(self, rhs):
    return Expression(self, 'ILIKE', '%%%s%%' % rhs)

Peewee reached a place I didn’t want it to reach - a class for generic column types. In this class, the method contains treats its operands as strings.

However, Peewee has a more specific class named ArrayField. There, there’s a different implementation for the method contains which is mapped onto a different SQL operator:

def contains(self, *items):
    return Expression(self, '@>', ArrayValue(self, items))

This was exactly what I needed - the @> operator which checks if an array contains one or more of another array’s elements. But Peewee never got there :(

I needed to find a way to override this behavior and tell Peewee that the two operands are arrays, and that contains should be translated into @> and not into ILIKE1.

Using Peewee’s Expression

Peewee allows adding custom expressions. An expression consists of a left-hand and right-hand side operands and an operator.

For example, the last contain method called an expression which used ArrayField values as both operands, and the operator @>.

And so I added my new, custom expression inside a function:

def array_equals(lhs, rhs):
    return Expression(lhs, '@>', rhs)

Good! My new query looked like this:

# First-Level Query (same as before)
subquery = BookToAuthor.select(
    BookToAuthor.book_id,
    fn.ARRAY_AGG(Book.author_id.distinct()).alias('author_ids')).group_by(Book.book_id)

# Second-Level Query
results = BookToAuthor.select(
    BookToAuthor.book_id).join(
    subquery, on=(subquery.book_id == BookToAuthor.book_id)).where(
    # Only this line changed:
    array_equals(subquery.c.author_ids, certain_authors))

I happily ran the query and… meh. An error again:

ProgrammingError: operator does not exist: integer[] @> record
LINE 1: ...= "t1"."book_id") WHERE ("t2"."author_ids" @> (1, 2))
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts

Look at the query though - the operator changed! Peewee was now building my query with the array-contains operator @>, which was a huge step in the process of solving this issue.

However, as the error indicated - something remained incorrect with the arguments’ types. Since the author_ids operand was seen as an integer array (see the first line in the error message), it must have been the right-hand side operand - the list literal - that was incorrectly parsed.

One Last Trick

Looking at the examples in PostgreSQL documentation, I saw that array operations were done with explicit type declaration - arrays are preceded with the keyword ARRAY.

Completely impatient but with a strong feeling of almost-victory, I replaced the python variable with a pure SQL expression:

# Only the relevant line-of-code from before...
array_equals(subquery.c.author_ids, SQL('ARRAY{}'.format(certain_authors)))

This was a hacky solution, but it produced my oh-so-desired SQL query:

SELECT t1.book_id 
FROM BookToAuthor AS t1 
INNER JOIN (SELECT t1.book_id, 
                   ARRAY_AGG(DISTINCT t1.author_id) AS author_ids 
                   FROM BookToAuthor AS t1 
                   GROUP BY t1.book_id) AS t2 
      ON (t2.book_id = t1.book_id)
WHERE (t2.author_ids @> ARRAY[6, 9])'

My almost-victory was now victory.

Lessons Learned

This extremely-specific scenario had important takeaways for me to recall in the future:

  1. Don’t be afraid to debug library code. We often use it as a black box, but in cases like mine it was important to dig in and confirm where the problem lay.
  2. The idea of using Expressions was my colleague’s - Tal Shahnovsky. When things don’t work, I personally tend to go to the place of frustration and think “Nothing will ever solve this”. Someone else’s clear mind may bring some good ideas into the process, give them a chance.
  3. A better knowledge of Peewee would have helped me in coming up with Expression as a solution. If you’re working with a certain library and having trouble - invest some time in getting to know the module and its capabilities better. This may pay off.

If any of you readers has an idea of solving this problem better, more elegantly or just differently, I’ll be pleased to hear from you. Do not hesitate to contact.



1You may think to yourself: “Why not just execute a raw SQL query?”. mat Peewee indeed has this functionality - build a query using a string and execute it with execute_sql. However, the resulting rows are returned as tuples and not as objects, which would make my code ugly AF. But this was what I intended to do as a last resort.