How to solve the N+1 problem for non-DB requests in Django?

To remind, the N+1 problem means a code like this:

for comment in Comment.objects.filter(post_id=1).all():
    print(comment.author.avatar_url)

makes N+1 requests to the database (hence the name):

SELECT author_id FROM appname_comment WHERE post_id = 1;
SELECT avatar_url FROM appname_author WHERE id = 1;
...
SEELCT avatar_url FROM appname_author WHERE id = n;

despite only 2 requests were really needed here (not considering a single-request JOIN or nested SELECT for reasons that will be clear later):

SELECT author_id FROM appname_comment WHERE post_id = 1;
SELECT avatar_url FROM appname_author WHERE id IN (1, ..., n);

Indeed, it’s a well-known problem and Django offers a simple solution to it:

for comment in Comment.objects.prefetch_related("author").filter(post_id=1).all():
    print(comment.author.avatar_url)

But what if the “related” data isn’t stored in a DB table? For instance, if the user data such as avatars belongs to a separate service which is available over an HTTP API only.

Consider the public People API by Google being such a service. Even if their user data were stored in a relational database Django could work with, they obviously wouldn’t give us access to it, so we have to request their servers over HTTP:

For simplicity, a dict is used here. In a real-world application, it’s better to return a class instance.

class Comment(models.Model):
    author_id = models.CharField(max_length=21)
    # ...
    @property
    def author(self):
        res = requests.get(
            "https://people.googleapis.com/v1/people/{}?personFields=photos&key={}".format(
                self.author_id, settings.GOOGLE_KEY))
        # ... error checking and caching omitted ...
        return res.json()

Now, we are facing the N+1 problem again:

for post in Comment.objects.filter(post_id=1).all():
    print(comment.author["photos"][0]["url"]) # Google is requested N times.

A typical solution is to prefetch the data manually (sometimes called an “application-level join”). The most effective way is probably using a hash join:

comments = Comment.objects.filter(post_id=1).all()
ids = set(["resourceNames=people/" + comment.author_id for comment in comments])
res = requests.get(
    "https://people.googleapis.com/v1/people:batchGet?personFields=photos&{}&key={}".format(
        "&".join(ids), settings.GOOGLE_KEY))
authors = {}
for response in res.json()["responses"]:
    person = response["person"]
    authors[person["resourceName"]] = person
for post in comments:
    author = authors["people/" + comment.author_id]
    print(author["photos"][0]["url"]) # Google is requested 1 time.

The code looks ugly, doesn’t it? Officially, there is no way to use the beautiful prefetch_related magic as seen above for non-DB data, since it works by requesting a QuerySet from the field’s descriptor under the hood. But examining the source code reveals the fact, that this QuerySet is immediately converted to a list internally, so if a field descriptor returned a list instead, Django wouldn’t spot a difference.

Let’s try to rewrite the Comment class. First, replace the author property with a descriptor (in fact, the @property decorator has already created one for us, but it’s not enough to make prefetch_related work):

class AuthorDescriptor(object):
    # ...

class Comment(models.Model):
    # ...
    author =  AuthorDescriptor()

On prefetch_related, Comment.author.is_cached(post) is called for each post first. For now, let’s skip this step (note, this cache is not about your fancy memcached or redis cluster, it’s an instance-level cache):

class UserDescriptor(object):
    # ...
    def is_cached(self, obj):
        return False # TODO
    # ...

If Django realizes, something isn’t cached yet (and it does, since the method above always returns False), it collects the uncached post instances and calls Prop.author.get_prefetch_queryset(posts, None), which returns a 6-item tuple:

  1. The QuerySet. As mentioned in the beginning, it’s OK to return a list here. In our case, it’s a list of dictionaries we get from response.json()["responses"].
  2. A callable which maps an item from the list above to its unique ID. If you remember the “ugly” code in the beginning, it’s the part where we filled the authors dict. Django does it too behind the scenes, but we only need to provide a mapper.
  3. A callable which maps a comment instance to its author’s ID. If you recall the “ugly” code again, it’s the part where we finally query the authors dict. This way, Django does a hash join too, which is quite effective.
  4. A boolean which tells Django if a single instance or a an authors list is returned per comment.
  5. A field name or a cache key, depending on the boolean below.
  6. If True, the __set__ method of the descriptor specified by the field name above is called with the new value. Otherwise, the value is saved in obj._state.fields_cache[cache_key] automatically.

Having this in mind, we can implement the get_prefetch_queryset method (error checking omitted for simplicity):

class UserDescriptor(object):
    # ...
    def get_prefetch_queryset(self, comments, qs):
        ids = set(
            ["resourceNames=people/" + comment.author_id for comment in comments]
        )
        data = requests.get(
            "https://people.googleapis.com/v1/people:batchGet?personFields=photos&{}&key={}".format(
                "&".join(ids), settings.GOOGLE_KEY)
        ).json()
        return (
            data["responses"],
            lambda response: response["person"]["resourceName"],
            lambda comment: "people/" + comment.author_id,
            True, # Single author instance per comment.
            "author",
            False, # Store the author in comment._state.fields_cache["author"].
        )
    # ...

The descriptor’s __get__ method should return obj._state.fields_cache["author"] at least. A good idea is to replicate the plain old @property behavior to alow single-instance requests as well:

class UserDescriptor(object):
    # ...
    def __get__(self, obj, objtype):
        if not obj:
            return self
        data = obj._state.fields_cache.get("author")
        if not data:
            data = requests.get(
                "https://people.googleapis.com/v1/people/{}?personFields=photos&key={}".format(
                    obj.author_id, settings.GOOGLE_KEY)
            ).json()
            # ... error checking omitted ...
            obj._state.fields_cache["author"] = data
        return data

Now, prefetch_related("author") should work.

As an optimization, the is_cached method can be updated now:

class UserDescriptor(object):
    # ...
    def is_cached(self, obj):
        return "author" in obj._state.fields_cache
    # ...

Further work

Up till now, there have been a real model field named author_id and a “fake” one to request the data (author). It might be a good idea to wrap the code above in a custom field the way ForeignKey works:

class Comment(models.Model):
    # ...
    author = GoogleAccountField() # Creates author_id and requests the API.

This is left as an exercise for the reader.