Land Technologies

Embracing the Power of Data Loaders

Posted by DanielM on March 18, 2019 · 6 min read · open-source, graphql

We use Facebook’s implementation of Data Loaders in our GraphQL API, but the concept is very generic and could be of wider interest.

Essentially, Data Loaders exist to optimize function calls. Specifically, they will (a) de-duplicate calls and (b) batch up calls to a given function. Typically the function call will be a request to access something in a database, but in principle it could be any read-only operation.

[Image: lorries loading up]

A basic example of using Data Loaders

Before we demonstrate the power of this optimisation, let’s imagine defining a simple Data Loader:

const userLoader = new DataLoader(async ids => {
  // This is pseudo-code rather than a real database query syntax...
  const users = await usersTable.find({ in: ids });
  return ids.map(id => users.find(u => u.id === id));
});

The function we passed into the DataLoader constructor takes an array of ids and returns a matching array of user objects. You can now call userLoader.load(userId) to get a single user for a specific userId. That’s basically all there is to it.

With this optimisation in your toolkit, you can cleanly write code that would otherwise look nasty. Here is a slightly contrived example of de-duplicating and batching requests for a user object:

async function getUserEmail(userId) {
  // const user = await usersTable.findOne(userId); // without dataLoader
  const user = await userLoader.load(userId); // with dataLoader
  return user.email;
}

async function getUserName(userId) {
  // const user = await usersTable.findOne(userId);  // without dataLoader
  const user = await userLoader.load(userId); // with dataLoader
  return user.name;
}

// ...

// (a) example of de-duplicating
const userId = "123";
console.log(await getUserEmail(userId), await getUserName(userId));

// (b) example of batching
const userIds = ["456", "789"];
userIds.forEach(async id => console.log(await getUserEmail(id)));

In example (a), the userLoader.load function is called twice, both times with userId='123'. Normally this would be wasteful - in our example the call requires a slow network request and puts load on a database - but because we are using a DataLoader, the second call can re-use the stored result of the first call. This kind of optimisation is known as memoization.

In this example, it’s easy to imagine cutting out the second database call simply by passing around the user object rather than a userId, but in real-world examples that is generally more painful to do because of deeply nested function calls and complex conditional branches.

In example (b), the userLoader.load function is again called twice, but this time with different values. Importantly, because Array.forEach does not wait for async functions to resolve, the two calls to userLoader.load within getUserEmail can be batched up into a single query on the database - see the pseudo-code implementation of userLoader for details. Again, while it might be easy to optimise this manually here without a Data Loader, in real-world examples it is typically a lot more work and ends up making the code a lot more complex.
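To make the mechanics concrete, here is a toy sketch of the two behaviours described above - a cache for de-duplication, plus a queue that collects keys during the current tick and flushes them in one batch on the next microtask. This is NOT the real dataloader implementation (which handles errors, cache control, and more), just an illustration of the idea:

```javascript
// Toy sketch of DataLoader's two optimisations. Not the real library.
class TinyLoader {
  constructor(batchFunc) {
    this.batchFunc = batchFunc;
    this.cache = new Map(); // key -> Promise (memoization / de-duplication)
    this.queue = [];        // keys collected during the current tick
  }

  load(key) {
    if (this.cache.has(key)) return this.cache.get(key); // de-duplicate
    const promise = new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (this.queue.length === 1) {
        // First key this tick: schedule one flush for the whole batch.
        queueMicrotask(() => this.flush());
      }
    });
    this.cache.set(key, promise);
    return promise;
  }

  async flush() {
    const batch = this.queue;
    this.queue = [];
    try {
      const results = await this.batchFunc(batch.map(item => item.key));
      batch.forEach((item, i) => item.resolve(results[i]));
    } catch (err) {
      batch.forEach(item => item.reject(err));
    }
  }
}

// Demo: three loads in the same tick result in a single batch call.
let batchCalls = 0;
const demoLoader = new TinyLoader(async ids => {
  batchCalls += 1;
  return ids.map(id => ({ id, email: `${id}@example.com` }));
});

Promise.all([
  demoLoader.load("123"),
  demoLoader.load("123"), // de-duplicated: reuses the first promise
  demoLoader.load("456"), // batched into the same flush
]).then(() => console.log(batchCalls)); // logs 1
```

The real library is more capable (per-key error handling, cache priming, configurable batching), but the shape is the same.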

When to use DataLoaders

Initially, we focused on using Data Loaders for simple queries, like loading a user by id, but we quickly realised that far more complex queries can also benefit from this optimisation pattern.

For example, here we have a loader that counts the number of users on an account, but does it for multiple accounts in one go:

const userCountLoader = new DataLoader(async accountIds => {
  // this is valid Mongoose syntax
  const counts = await userModel.aggregate([
    { $match: { _account: { $in: accountIds } } },
    { $group: { _id: "$_account", count: { $sum: 1 } } }
  ]);
  // Accounts with no users have no entry in counts, so default to 0
  // instead of crashing when find() returns undefined.
  return accountIds.map(id => counts.find(a => a._id === id)?.count ?? 0);
});

We also use loaders with Elasticsearch aggregations, and with Postgres queries that use GeoJSON as the “key” rather than simply an id.
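One wrinkle with object keys like GeoJSON is that they compare by reference, so two structurally identical keys would not de-duplicate; the real library solves this with its cacheKeyFn option, which maps each key to a cache key. Here is a self-contained sketch of that idea, with an illustrative in-memory lookup standing in for the database query (keyedLoader and the result shape are hypothetical, for demonstration only):

```javascript
// Sketch: de-duplicating loads keyed by objects (e.g. GeoJSON geometries).
// Structurally equal geometries are distinct references, so the cache is
// keyed by a serialised form of the key instead.
function keyedLoader(batchFunc, cacheKeyFn) {
  const cache = new Map();
  return {
    load(key) {
      const cacheKey = cacheKeyFn(key);
      if (!cache.has(cacheKey)) {
        cache.set(cacheKey, batchFunc([key]).then(results => results[0]));
      }
      return cache.get(cacheKey);
    },
  };
}

let queries = 0;
const geoLoader = keyedLoader(
  async geometries => {
    queries += 1;
    return geometries.map(g => ({ matchedType: g.type })); // stand-in result
  },
  geometry => JSON.stringify(geometry)
);

const point = { type: "Point", coordinates: [0, 51.5] };
const samePoint = { type: "Point", coordinates: [0, 51.5] }; // equal, new object

// Both loads hit the same cache entry despite being different references.
geoLoader.load(point);
geoLoader.load(samePoint);
// queries === 1
```

With the real library, the equivalent is passing `{ cacheKeyFn: geometry => JSON.stringify(geometry) }` as the DataLoader options argument.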

In summary, loaders can be used in a lot of places!

Loader Context

One important aspect of Data Loaders that we haven’t mentioned yet is their lifetime, i.e. how long their internal state exists for. As with any other variable, you can architect whatever scope/lifetime you feel is appropriate. But when, as is typical, Data Loaders are used in a server handling hundreds or thousands of incoming requests, it is recommended that each Data Loader instance be scoped to the lifetime of a specific request/response. This ensures that data cannot leak from one user to another, and makes it unlikely that the data stored within a DataLoader is out of date.

In a GraphQL server specifically, the documentation recommends putting DataLoader instances in the GraphQL “context” object, which is simply an object that is created before the main body of the request is processed, and is available to all GraphQL resolvers during processing. We adopted this approach initially, but found it a bit clunky, so instead we came up with this helper function, that we call loaderWithContext:

// loader-with-context.js

const DataLoader = require("dataloader");

module.exports = function factory(batchFunc, opts) {
  const store = new WeakMap();

  return function getLoader(ctx) {
    let loader = store.get(ctx);
    if (!loader) {
      loader = new DataLoader(keys => batchFunc(keys, ctx), opts);
      store.set(ctx, loader);
    }
    return loader;
  };
};

Now, instead of…

// v1
const ctx = {
  // lives for lifetime of request/response
  ...
  myLoader: new DataLoader(batchFunc)
};
ctx.myLoader.load(value);

…we do…

// v2
const myLoader = loaderWithContext(batchFunc); // global singleton

myLoader(ctx).load(value);

These two approaches expect the same batchFunc and have very similar syntaxes for calling .load, but they enable quite different usage patterns.

Before we get to that, note how in v1, myLoader contains the state, whereas in v2 it does not. Instead, in v2 the state is linked to the ctx object itself, and persists as long as there is a reference to ctx. If you are not familiar with WeakMap, I recommend reading the docs.
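To make the WeakMap behaviour concrete, here is a self-contained sketch of the pattern, with a trivial stand-in for DataLoader so it runs without the library (makeContextLoader mirrors the factory above; the stand-in loader does no batching or caching):

```javascript
// Self-contained sketch of per-context loader state. The same ctx object
// always gets back the same loader; a different ctx gets a fresh one.
function makeContextLoader(batchFunc) {
  const store = new WeakMap(); // ctx -> loader; entries vanish with the ctx
  return function getLoader(ctx) {
    let loader = store.get(ctx);
    if (!loader) {
      // Trivial stand-in for DataLoader: loads one key at a time.
      loader = { load: key => batchFunc([key], ctx).then(r => r[0]) };
      store.set(ctx, loader);
    }
    return loader;
  };
}

const userNameLoader = makeContextLoader(async ids => ids.map(id => ({ id })));

const requestA = {}; // stands in for one request's ctx object
const requestB = {}; // a different request's ctx

console.log(userNameLoader(requestA) === userNameLoader(requestA)); // true
console.log(userNameLoader(requestA) === userNameLoader(requestB)); // false
```

Because the WeakMap holds the ctx keys weakly, a request’s loader (and its cached data) becomes collectable as soon as the request’s ctx object is dropped, which is exactly the per-request lifetime recommended above.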

So how does this change usage patterns?

  1. We don’t need to worry about initializing loaders in one place and using them somewhere else. Instead, we import the singleton loader at the point we want to use it and simply pass in the ctx object. By reducing the need to jump between multiple files this makes the code easier to write and easier to read.

  2. Because loaders aren’t properties of the context object itself, they do not need to be exposed for the whole server to see. Instead, loaders can be written that are “private” to some piece of code (often code within a module). This makes it less dangerous to create large numbers of loaders with very specific behaviors.

  3. As you can see in the loader-with-context.js file, we provide the context to the batchFunc as a second argument. This means that if the context contains user/account information, the loader can access it. A typical function might look like this:

    const myItemLoader = loaderWithContext(async (itemIds, { accountId }) => {
      const items = await itemTable.find({ id: { in: itemIds }, account: accountId });
      ...
    });

    Something similar can be done with the original syntax, but it generally ends up involving various wrapper methods and extra complications, whereas this is very clear about its inputs and doesn’t require additional wrapping.

  4. We can create a loader that combines other loaders by passing the ctx down into those other loaders:

    const fooLoader = loaderWithContext(...);
    const barLoader = loaderWithContext(...);
    
    const fooOrBarLoader = loaderWithContext((foosOrBars, ctx) => {
      // The batch function must resolve to an array of values, so the
      // per-item promises are combined with Promise.all.
      return Promise.all(
        foosOrBars.map(fb =>
          fb.isFoo ? fooLoader(ctx).load(fb) : barLoader(ctx).load(fb)
        )
      );
    });

    This is hard and/or ugly when done with the original syntax.

In summary, this simple change in syntax makes it much easier to use Data Loaders more widely without sacrificing readability. Incidentally, it may also improve performance slightly, because loaders are now initialised on demand rather than on every single request.

Final word

If it wasn’t already clear, we love Data Loaders because they give you optimisation at no cost to readability or complexity. We can’t imagine using GraphQL without them!

