Before we move on to the topic, I’d like to apologize for the irregular updates, but I’m going through a few courses that take a lot of time. On the other hand, from time to time they result in inspirations for new posts. One of such a case is a course focused on working with PostgreSQL, during which I started to think how I can improve the way I write database queries. In my everyday work I use Django ORM, which I use a lot (and attack from time to time). However, it’s crucial to understand what is executed under the hood. In this post I’d like to show you how you can optimize your Django database queries.
Example Project
We’re going to create a very simple Django project consisting of a single application and 4 models (that you can think of as database tables). Link to the full code is at the end if you wanted to review everything on your own and better understand what I did here. Anyway, we want to be able to add books with some description ad enable ratings for them. So we have a category, author and book. Book can only be in 1 category, but have multiple authors. There is also a review that points to a specific book and rate the book in the scale of 1-5.
Our models are following:
from django.db import models from django.core.validators import MaxValueValidator, MinLengthValidator class Category(models.Model): name = models.CharField(max_length=100) class Book(models.Model): title = models.CharField(max_length=200) description = models.TextField() authors = models.ManyToManyField('Author', related_name='books') category = models.ForeignKey('Category', on_delete=models.CASCADE) class Author(models.Model): first_name = models.CharField(max_length=100) last_name = models.CharField(max_length=100) class Review(models.Model): summary = models.CharField(max_length=200) text = models.TextField() book = models.ForeignKey('Book', on_delete=models.CASCADE) score = models.PositiveSmallIntegerField( default=1, validators=[ MinLengthValidator(1), MaxValueValidator(5) ] )
Database Migrations
Django’s ORM has the ability to create migration files that transform database structure to the desired state. We can guess what is this. Creating 4 tables with proper fields, and constraints like primary and foreign keys. After running ./manage.py makemigrations
we get this file:
# Generated by Django 3.1.3 on 2020-11-29 06:30 import django.core.validators from django.db import migrations, models import django.db.models.deletion class Migration(migrations.Migration): initial = True dependencies = [ ] operations = [ migrations.CreateModel( name='Author', fields=[ ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')), ('first_name', models.CharField(max_length=100)), ('last_name', models.CharField(max_length=100)), ], ), migrations.CreateModel( name='Book', fields=[ ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')), ('title', models.CharField(max_length=200)), ('description', models.TextField()), ('authors', models.ManyToManyField(related_name='books', to='reviews.Author')), ], ), migrations.CreateModel( name='Category', fields=[ ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')), ('name', models.CharField(max_length=100)), ], ), migrations.CreateModel( name='Review', fields=[ ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')), ('summary', models.CharField(max_length=200)), ('text', models.TextField()), ('score', models.PositiveSmallIntegerField(default=1, validators=[django.core.validators.MinLengthValidator(1), django.core.validators.MaxValueValidator(5)])), ('book', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='reviews.book')), ], ), migrations.AddField( model_name='book', name='category', field=models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='reviews.category'), ), ]
But is this all? In order to get information what SQL will be exactly run, we can run ./manage.py sqlmigrate 0001 reviews
command to display the migration 0001 of “reviews” app.
BEGIN; -- -- Create model Author -- CREATE TABLE "reviews_author" ("id" serial NOT NULL PRIMARY KEY, "first_name" varchar(100) NOT NULL, "last_name" varchar(100) NOT NULL); -- -- Create model Book -- CREATE TABLE "reviews_book" ("id" serial NOT NULL PRIMARY KEY, "title" varchar(200) NOT NULL, "description" text NOT NULL); CREATE TABLE "reviews_book_authors" ("id" serial NOT NULL PRIMARY KEY, "book_id" integer NOT NULL, "author_id" integer NOT NULL); -- -- Create model Category -- CREATE TABLE "reviews_category" ("id" serial NOT NULL PRIMARY KEY, "name" varchar(100) NOT NULL); -- -- Create model Review -- CREATE TABLE "reviews_review" ("id" serial NOT NULL PRIMARY KEY, "summary" varchar(200) NOT NULL, "text" text NOT NULL, "score" smallint NOT NULL CHECK ("score" >= 0), "book_id" integer NOT NULL); -- -- Add field category to book -- ALTER TABLE "reviews_book" ADD COLUMN "category_id" integer NOT NULL CONSTRAINT "reviews_book_category_id_7c540557_fk_reviews_category_id" REFERENCES "reviews_category"("id") DEFERRABLE INITIALLY DEFERRED; SET CONSTRAINTS "reviews_book_category_id_7c540557_fk_reviews_category_id" IMMEDIATE; ALTER TABLE "reviews_book_authors" ADD CONSTRAINT "reviews_book_authors_book_id_author_id_eac937c9_uniq" UNIQUE ("book_id", "author_id"); ALTER TABLE "reviews_book_authors" ADD CONSTRAINT "reviews_book_authors_book_id_7916c3da_fk_reviews_book_id" FOREIGN KEY ("book_id") REFERENCES "reviews_book" ("id") DEFERRABLE INITIALLY DEFERRED; ALTER TABLE "reviews_book_authors" ADD CONSTRAINT "reviews_book_authors_author_id_8ecc127f_fk_reviews_author_id" FOREIGN KEY ("author_id") REFERENCES "reviews_author" ("id") DEFERRABLE INITIALLY DEFERRED; CREATE INDEX "reviews_book_authors_book_id_7916c3da" ON "reviews_book_authors" ("book_id"); CREATE INDEX "reviews_book_authors_author_id_8ecc127f" ON "reviews_book_authors" ("author_id"); ALTER TABLE "reviews_review" ADD CONSTRAINT "reviews_review_book_id_9a657eea_fk_reviews_book_id" FOREIGN KEY ("book_id") REFERENCES "reviews_book" ("id") DEFERRABLE INITIALLY DEFERRED; CREATE INDEX "reviews_review_book_id_9a657eea" ON "reviews_review" ("book_id"); CREATE INDEX "reviews_book_category_id_7c540557" ON "reviews_book" ("category_id"); COMMIT;
What is important to notice here are the indexes set on foreign keys. This makes perfect sense since it speeds up JOIN operations. More can be found in the documentation:
A database index is automatically created on the
From: https://docs.djangoproject.com/en/dev/ref/models/fields/#foreignkeyForeignKey
. You can disable this by settingdb_index
toFalse
. You may want to avoid the overhead of an index if you are creating a foreign key for consistency rather than joins, or if you will be creating an alternative index like a partial or multiple column index.
Writing simple SELECT statements
We’re going to use the interactive shell to play around with the queries. You can load sample data from “dumps” directory using python src/manage.py loaddata dumps/dump.json
command. A common task would be to find books assigned to a specific author. This can be done in a single query, but I’ll split them into 2 for readability and display the SELECT on books by calling query
property on the queryset.
(venv) ➜ src ./manage.py shell Python 3.8.5 (default, Jul 28 2020, 12:59:40) Type 'copyright', 'credits' or 'license' for more information IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: from reviews.models import * In [2]: author = Author.objects.get(id=1) In [4]: str(Book.objects.filter(authors=author).query) Out[4]: 'SELECT "reviews_book"."id", "reviews_book"."title", "reviews_book"."description", "reviews_book"."category_id" FROM "reviews_book" INNER JOIN "reviews_book_authors" ON ("reviews_book"."id" = "reviews_book_authors"."book_id") WHERE "reviews_book_authors"."author_id" = 1'
What interest us most, however, is whether the index was actually used. Django has recently introduced an explain()
method which returns the EXPLAIN
results, so that you don’t have to copy the query to psql, Pgadmin or other tool to see what will be returned:
In [6]: print(Book.objects.filter(authors=author).explain()) Hash Join (cost=14.91..26.94 rows=10 width=458) Hash Cond: (reviews_book.id = reviews_book_authors.book_id) -> Seq Scan on reviews_book (cost=0.00..11.60 rows=160 width=458) -> Hash (cost=14.79..14.79 rows=10 width=4) -> Bitmap Heap Scan on reviews_book_authors (cost=4.23..14.79 rows=10 width=4) Recheck Cond: (author_id = 1) -> Bitmap Index Scan on reviews_book_authors_author_id_8ecc127f (cost=0.00..4.23 rows=10 width=0) Index Cond: (author_id = 1)
We’ll stop here for a moment to analyze this output, since it’s very important if you want to properly optimize your Django queries. You can reach for links listen in the Additional Resources section if you want to have more extensive explanation. What interests us most here is that index scanning was actually used. This is indicated by the Bitmap Index Scan
and the Bitmap Heap Scan
near the arrows. The arrows point to a node, which you can of as a function call that executes part of the plan. This should be read inside-out.
Secondly that we should pay attention to are the cost indicators. They show us node execution cost in arbitrary units that can be adjusted (but we won’t go into such details). The first number shows the cost at which a node can start working and the second one is the expected query cost. We can verify this by passing analyze=True
keyword to the explain()
method.
Creating Own Indexes
Now I’d like to show how filtering non-indexed field works and how can indexing improve the search results. Let’s imagine we want to gather reviews that have the score above 3. Our query would be:
In [75]: print(Review.objects.filter(score__gt=3).explain(analyze=True)) Seq Scan on reviews_review (cost=0.00..1.10 rows=3 width=460) (actual time=0.006..0.009 rows=6 loops=1) Filter: (score > 3) Rows Removed by Filter: 44 Planning Time: 0.058 ms Execution Time: 0.017 ms
The estimated time is 1 unit of execution. Can we do better? Let’s index the score column using a default b-tree index.
In [77]: print(Review.objects.filter(score__gt=3).explain(analyze=True)) Seq Scan on reviews_review (cost=0.00..1.62 rows=17 width=460) (actual time=0.007..0.010 rows=6 loops=1) Filter: (score > 3) Rows Removed by Filter: 44 Planning Time: 0.103 ms Execution Time: 0.019 ms
But, there’s more to this. Since we’re creating an index, we need time to calculate its value and store it. This means loosing time on write queries and disk space, while gaining nothing. Unfortunately, when I measured how much time it took, it turned out that the results were not statistically significant. My guess is that for more complex indexes I would be able to tell something more.
To sum up this part, you need to remember that while making a decision about using an index, you need to collect some data about the performance before creating it and afterwards in order to ensure that it actually did help to improve the performance.
More to come
That’s it for today. I’m working on examples that would show how to properly implement text searching and other techniques of improving your queries, so stay tuned. You may want to subscribe to the newsletter in order not to miss those posts and learn some more serious stuff on how to optimize Django queries.