Optimizing Django Database Queries – Part 3

In the last 2 parts I mostly talked about proper indexing. But indexing is not everything. Sure, it might be helpful in some situations, but today I’d like to show you a situation where a search can be improved without them.

Disclaimer: I failed to write queries returning proper results. I have no idea why is it so and if you want to help, you can go to this question on stackoverflow. Nonetheless I haven’t written anything for 4 weeks and I think that in spite of those problems the example I’m going to use still can be used as the example for most important issue.

The Problem

let’s imagine we want to list books and view score they have as opposed to the score of books in the same category.

Models for our problems:

class Category(models.Model):
    name = models.CharField(max_length=100)


class Book(models.Model):
    title = models.CharField(max_length=200)
    authors = models.ManyToManyField('Author', related_name='books')
    category = models.ForeignKey('Category', on_delete=models.CASCADE)

    objects = BookQuerySet.as_manager()


class Author(models.Model):
    # Irrelevant


class Review(models.Model):
    summary = models.CharField(max_length=200)
    book = models.ForeignKey('Book', on_delete=models.CASCADE)
    score = models.PositiveSmallIntegerField(
        default=1,
        validators=[
            MinValueValidator(1),
            MaxValueValidator(5)
        ],
    )

Data:

[{'score': 5, 'book': 5, 'book__category': 3},
 {'score': 2, 'book': 4, 'book__category': 2},
 {'score': 4, 'book': 1, 'book__category': 1},
 {'score': 4, 'book': 2, 'book__category': 1},
 {'score': 4, 'book': 3, 'book__category': 1},
 {'score': 5, 'book': 1, 'book__category': 1},
 {'score': 5, 'book': 2, 'book__category': 1},
 {'score': 5, 'book': 3, 'book__category': 1},
 {'score': 5, 'book': 3, 'book__category': 1}]

Queries

At first we’ll try to use subquery (warning: this is the part that does not work properly). Counting score for the category is easy:

Category.objects.all().annotate(
            cs=models.Avg('book__review__score')
        )

Now we can integrate it with the Book query:

class BookQuerySet(models.QuerySet):
    def with_reviews_per_category(self):
        categories = Category.objects.filter(
            book=models.OuterRef('pk')
        ).annotate(
            cs=models.Avg('book__review__score')
        )
        qs = self.annotate(
            category_score=models.Subquery(categories.values('cs')[:1])
        )
        return qs

Now we can check the efficiency:

In [10]: pprint(Book.objects.with_reviews_per_category().values().explain())
('Seq Scan on reviews_book  (cost=0.00..6052.79 rows=160 width=490)\n'
 '  SubPlan 1\n'
 '    ->  Limit  (cost=37.74..37.76 rows=1 width=36)\n'
 '          ->  GroupAggregate  (cost=37.74..37.78 rows=2 width=36)\n'
 '                Group Key: u0.id\n'
 '                ->  Sort  (cost=37.74..37.74 rows=2 width=6)\n'
 '                      Sort Key: u0.id\n'
 '                      ->  Nested Loop Left Join  (cost=0.44..37.73 rows=2 '
 'width=6)\n'
 '                            Join Filter: (u1.id = u2.book_id)\n'
 '                            ->  Nested Loop  (cost=0.29..16.38 rows=1 '
 'width=8)\n'
 '                                  ->  Index Scan using reviews_book_pkey on '
 'reviews_book u1  (cost=0.14..8.16 rows=1 width=8)\n'
 '                                        Index Cond: (id = reviews_book.id)\n'
 '                                  ->  Index Only Scan using '
 'reviews_category_pkey on reviews_category u0  (cost=0.15..8.17 rows=1 '
 'width=4)\n'
 '                                        Index Cond: (id = u1.category_id)\n'
 '                            ->  Index Scan using '
 'reviews_review_book_id_9a657eea on reviews_review u2  (cost=0.15..17.10 '
 'rows=340 width=6)\n'
 '                                  Index Cond: (book_id = reviews_book.id)')

As you can see the cost is over 6000 units. Can we do better?

Window functions

Let’s take another look at our requirement: display score for each category next to book data. This looks a little bit like a GROUP BY but without actually grouping the results by category. The answer is using a window function.

We can write another query:

class BookQuerySet(models.QuerySet):
    #...

    def with_reviews_per_category_window(self):
        return self.annotate(
            category_score=models.Window(
                expression=models.Avg('review__score'),
                partition_by=[models.F('category_id')]
            )
        ).distinct()

How does it perform?

In [13]: pprint(Book.objects.with_reviews_per_category_window().values().explain
    ...: ())
('HashAggregate  (cost=118.12..124.52 rows=640 width=490)\n'
 '  Group Key: reviews_book.id, reviews_book.title, reviews_book.description, '
 'reviews_book.category_id, avg(reviews_review.score) OVER (?)\n'
 '  ->  WindowAgg  (cost=87.52..105.37 rows=1020 width=490)\n'
 '        ->  Sort  (cost=87.52..90.07 rows=1020 width=460)\n'
 '              Sort Key: reviews_book.category_id\n'
 '              ->  Hash Right Join  (cost=13.60..36.54 rows=1020 width=460)\n'
 '                    Hash Cond: (reviews_review.book_id = reviews_book.id)\n'
 '                    ->  Seq Scan on reviews_review  (cost=0.00..20.20 '
 'rows=1020 width=6)\n'
 '                    ->  Hash  (cost=11.60..11.60 rows=160 width=458)\n'
 '                          ->  Seq Scan on reviews_book  (cost=0.00..11.60 '
 'rows=160 width=458)')

Conclusion

We’ve gone down from over 6000 units to a little bit over 100 just by using a different tool. I’m by no mean an SQL expert (and the fact that I failed to properly write the first query sadly backs up this fact :-P), but I hope that you get what is really important here: sometimes writing a query differently will do the magic for you.

I strongly encourage you to get down to the SQL, try to understand it and search for the best ways of accomplishing your tasks.

Subscribe to the newsletter to get the latest posts.

Additional resources

See also