In the previous post about optimizing Django database queries, I showed how to detect improper use of indexes and what indexes Django creates by default. Today I’d like to show a better use of indexing for a potentially expensive operation: text search.
Task: Search an Author By Name
As you might remember from the previous post, we had an Author model in our system consisting of 2 character fields. We might want to implement a search that tries to match authors by last name, so that when a user starts typing “kin”, they’ll be presented with authors like “King”, “Kinsley”, “Viking”, etc. One way of doing this is to run a regular substring search:
SELECT first_name, last_name FROM authors WHERE UPPER(last_name) LIKE UPPER('%kin%');
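In the Django ORM, this corresponds to a case-insensitive icontains lookup, which generates essentially the same SQL (we’ll see the exact query it produces later on):

# Case-insensitive substring match; generates UPPER(...) LIKE UPPER('%kin%')
Author.objects.filter(last_name__icontains='kin')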
For large datasets this will probably be very inefficient. What we can do is make use of indexing to speed up searches. The proper type of index for this kind of query is a GIN index with support for trigram operations.
What is a GIN index? It stands for Generalized Inverted Index and is used to search across composite values such as text, geographical points, etc.
And what is a trigram? It’s a fragment of text consisting of 3 consecutive letters. So “king” will be split into trigrams like “kin” and “ing”. The advantage of this type of search is that it is more “resistant” to typos: when the user searches for “kinh”, there will still be a match with “king”. Moreover, it allows us to sort results by relevance. If we actually have an author named “Kinh” in the database, we can sort the results and display “Kinh” before “King”, since the first one is more relevant: it matches 2 out of 2 trigrams, while the second matches only 1.
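To make the idea concrete, here is a minimal Python sketch of trigram similarity. It is only an illustration: PostgreSQL’s pg_trgm extension also pads words with spaces before splitting them, so the real trigram sets and scores differ slightly.

# Illustrative only: pg_trgm pads words with spaces, so its scores differ slightly.
def trigrams(word):
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)  # shared trigrams / all trigrams

print(similarity('kinh', 'kinh'))  # 1.0 - exact match, most relevant
print(similarity('kinh', 'king'))  # ~0.33 - only 'kin' is shared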
How to Use Trigrams and GIN Indexes in Django and PostgreSQL?
First, we need to install the trigram and GIN extensions. For this, we create a migration:
# Generated by Django 3.1.3 on 2020-11-30 21:05
from django.db import migrations
from django.contrib.postgres.operations import TrigramExtension, BtreeGinExtension


class Migration(migrations.Migration):

    dependencies = [
        ('reviews', '0002_auto_20201129_1839'),
    ]

    operations = [
        TrigramExtension(),
        BtreeGinExtension(),
    ]
Afterwards, we need to add the Postgres extensions package to the INSTALLED_APPS setting:
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'django.contrib.postgres',
    'reviews',
]
After applying the migration, we can see the extensions in our database.
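If you don’t have a database GUI at hand, one way to confirm this is to query the pg_extension catalog from a Django shell; this is just an illustrative check, equivalent to running \dx in psql:

# List installed PostgreSQL extensions; expect 'pg_trgm' and 'btree_gin' among them.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SELECT extname FROM pg_extension;")
    print([row[0] for row in cursor.fetchall()])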

Before we apply the index, let’s run a few test queries so that we can later verify it has the desired effect. To do so, we populate the database with 250 000 records. We’ll be searching for short strings such as “ari” and “bar” with 3 different methods:
- Simple substring search, case insensitive,
- Trigram similarity search,
- Trigram similarity with ordering the results.
In each case I’ll print the resulting SQL query and the query plan generated with EXPLAIN ANALYZE.
First, substring search, no indexing:
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)

In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
  Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
  Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms
Second, the trigram similarity search:
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari

In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather  (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on reviews_author  (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
        Filter: ((last_name)::text % 'bar'::text)
        Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms
And finally, trigram similarity with ordering:
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC

In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
  Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
  Sort Method: quicksort  Memory: 1485kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
        Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
        Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms
Applying a GIN Index to Optimize the Database Query
In order to create an index, we need to run the following command:
CREATE INDEX "reviews_aut_last_name_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);
What is important here is that we need to pass the “opclasses”, which in this case is gin_trgm_ops. Otherwise Postgres won’t use the index for searching. I spent hours banging my head against the wall until I discovered what was wrong. You can read more about opclasses in the additional resources linked below.
Luckily, the newest version of the ORM allows us to declare this index directly on the model:
from django.contrib.postgres.indexes import GinIndex
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [
            GinIndex(
                name='review_author_ln_gin_idx',
                fields=['last_name'],
                opclasses=['gin_trgm_ops'],
            )
        ]
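For reference, running makemigrations for this change should produce a migration containing an AddIndex operation roughly like the sketch below; the dependency name here is illustrative.

# Sketch of the auto-generated migration; the dependency is a placeholder.
from django.contrib.postgres.indexes import GinIndex
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('reviews', '0003_previous_migration'),  # placeholder name
    ]

    operations = [
        migrations.AddIndex(
            model_name='author',
            index=GinIndex(
                fields=['last_name'],
                name='review_author_ln_gin_idx',
                opclasses=['gin_trgm_ops'],
            ),
        ),
    ]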
Now we can create the migration and apply it using python manage.py makemigrations && python manage.py migrate. Having done this, we can finally measure the performance again:
In [4]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author  (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.213..69.108 rows=4456 loops=1)
  Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
  Rows Removed by Filter: 254496
Planning Time: 0.972 ms
Execution Time: 69.247 ms

In [5]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Bitmap Heap Scan on reviews_author  (cost=38.01..717.84 rows=259 width=16) (actual time=3.792..6.918 rows=431 loops=1)
  Recheck Cond: ((last_name)::text % 'ari'::text)
  Rows Removed by Index Recheck: 2985
  Heap Blocks: exact=1375
  ->  Bitmap Index Scan on review_author_ln_gin_idx  (cost=0.00..37.94 rows=259 width=0) (actual time=3.681..3.682 rows=3416 loops=1)
        Index Cond: ((last_name)::text % 'ari'::text)
Planning Time: 0.989 ms
Execution Time: 6.979 ms

In [6]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort  (cost=12725.93..12941.72 rows=86317 width=20) (actual time=183.890..185.161 rows=27356 loops=1)
  Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
  Sort Method: quicksort  Memory: 2868kB
  ->  Seq Scan on reviews_author  (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.023..179.401 rows=27356 loops=1)
        Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
        Rows Removed by Filter: 231596
Planning Time: 0.069 ms
Execution Time: 185.979 ms
As you can see, the performance improved in just one case, but the improvement was considerable: from about 80 ms down to about 7 ms.
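A practical takeaway for the autocomplete use case: one option is to filter with the trigram_similar lookup, which can use the GIN index, and apply the TrigramSimilarity annotation only to order the already-narrowed result set. A sketch, with a helper name of our own choosing:

# Hypothetical helper: index-friendly trigram filter first, then relevance ordering.
from django.contrib.postgres.search import TrigramSimilarity

def search_authors(term, limit=10):
    return (
        Author.objects
        .filter(last_name__trigram_similar=term)               # can use the GIN index
        .annotate(similar=TrigramSimilarity('last_name', term))
        .order_by('-similar')[:limit]                           # most relevant first
    )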
Final Thoughts
To sum up, we have seen how we can improve a common task that most of us have probably come across in our developer lives. You need to remember, however, that with great speed come great costs. Indexes take up space, make writes slower, and, as we’ve seen both in this and in the previous post, can have no impact on your database performance at all. Therefore, you should always measure whether your index actually helped and ensure that it was implemented correctly. Otherwise, you’ll end up like me, looking for answers for hours 😉
I am planning to do one or two more posts on this topic, so if I’ve got your attention, you might want to subscribe to the newsletter to get notified about the newest articles.
Additional Resources:
- Project on GitLab
- Postgres docs on trigram search
- Postgres docs on GIN indexes
- My question on Stack Overflow, troubleshooting the problems when applying the index
- opclasses in Postgres
- Previous blog post