Parallel database reloads with Node.js and Drush

Have you ever found yourself working on a large Drupal site, pleased with whatever new piece of functionality you’re adding but simultaneously terrified of inevitably needing to reload the database? The thought of waiting twenty, thirty, forty, or more minutes to pull down a database is completely unacceptable and a huge waste of valuable developer time.

Our friends at Lullabot certainly have been thinking about this problem and created the mysql-parallel set of scripts to address the issue. The idea is to break a database dump and import into per-table tasks that can be run in parallel, greatly improving the overall performance of a database dump and import. I had some problems with the script, though: while the general approach is certainly appreciated, it was trickier to use than I was happy with, and I found myself writing another script to wrap mysql-parallel because, like a good terminal nerd, I’m lazy and don’t like typing out long commands or series of commands.

Wrapping mysql-parallel still wasn’t a great solution for me though, so I decided to take the idea and run with it in a different direction. The result? Two node.js modules, one general and one specific:

  1. drush-reloadp – a node.js module for reloading Drupal databases in parallel using drush.
  2. drush-node – a node.js module for executing arbitrary drush commands.

drush-reloadp

Aside from being a little tricky to use, the issues I had with mysql-parallel were that, although the dumps and imports happen in parallel, the script:

  1. Does not take into account the number of processors available on the source and destination machines, which could lead to unacceptable load being generated during a database dump.
  2. Runs all of the dumps in parallel, and only starts the imports after the dumps have finished.

drush-reloadp addresses both of these concerns.

After you’ve installed the module globally (npm install -g drush-reloadp) you can execute it against a source and destination as follows:


drush-reloadp -s @fourkitchens.stage -d @fourkitchens.local

The -s and -d flags stand for source and destination and correspond to drush aliases, which in turn contain all the information needed to connect to the source and destination servers and databases.

When the command is executed, the module will first determine how many processors are available on the source and destination servers. These values are used to limit the number of processes that will execute in parallel, so the machines are never placed under so much load that they start swapping.
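
As a rough illustration (not the module’s literal code), the core counts can be gathered with os.cpus() locally and a quick SSH call to the source machine; the host name below is a placeholder:

var exec = require('child_process').exec;
var os = require('os');

// Cores on this (destination) machine.
var localCores = os.cpus().length;

// Ask the remote (source) machine how many cores it has. nproc is a common
// way to get this on Linux.
exec('ssh stage.example.com nproc', function (err, stdout) {
  if (err) throw err;
  var remoteCores = parseInt(stdout.trim(), 10);
  // Cap parallelism at the smaller machine so neither box is pushed into swap.
  var concurrency = Math.min(localCores, remoteCores);
  console.log('Running at most ' + concurrency + ' parallel jobs');
});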

After the number of cores is determined, the module drops all of the destination database’s tables so that only clean data will be imported. Once that’s complete, the module determines the tables that need to be imported from the source. Here’s where things start to get interesting: the module creates a job queue using the node.js async module, which allows you to execute jobs in parallel without exceeding a concurrency limit (no swapping here!). Additionally, you can create a queue and fill it up over time, so as soon as we start receiving dumps we begin adding to an import queue and processing those jobs. The result is a safe but fast execution of all the tasks we need to fill up our database.
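
Here’s a small sketch of that pattern using the async module. The table list, temp file paths, and concurrency value are placeholders, and drush-reloadp’s real implementation differs in the details, but the shape is the same: dumps run under a concurrency limit, and every finished dump immediately queues an import:

var async = require('async');
var exec = require('child_process').exec;

var concurrency = 4; // derived from the core counts above

// Import queue: each job loads one table's dump into the destination site.
var importQueue = async.queue(function (table, done) {
  exec('drush @fourkitchens.local sqlc < /tmp/' + table + '.sql', done);
}, concurrency);

// Dump tables from the source, at most `concurrency` at a time, pushing each
// table onto the import queue as soon as its dump finishes.
var tables = ['node', 'users', 'field_data_body'];
async.eachLimit(tables, concurrency, function (table, done) {
  var cmd = 'drush @fourkitchens.stage sql-dump --tables-list=' + table +
    ' > /tmp/' + table + '.sql';
  exec(cmd, function (err) {
    if (err) return done(err);
    importQueue.push(table);
    done();
  });
}, function (err) {
  if (err) {
    console.error('Dump failed:', err);
  } else {
    console.log('All dumps finished; imports are draining in the background.');
  }
});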

After the import is complete, drush updb is run against the destination site to run any locally pending update hooks and clear site caches. Some people might not like this. If you don’t, open a pull request to add a flag to disable this feature.

drush-node

In true node.js fashion, I separated the underlying functionality needed for this process into pieces as small and modular as I could. The result was the drush-node module, which allows node.js code to run arbitrary drush commands. There’s nothing too fancy here: the module just runs child_process.exec() with whatever drush command you ask for.

A nice feature, though, is that the module uses JavaScript promises (via the promised-io module), so commands can be chained together in code like this:

var drush = require('drush');
var group = require('promised-io/promise').all([
  drush.init(),
  drush.exec('updb'),
  drush.exec('fra'),
  drush.exec('cc all')
]);

group.then(function (res) {
  console.log(res.join("r"));
});
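
Under the hood, each exec call boils down to something like the following sketch. This is illustrative rather than the module’s actual source, but it shows the idea: a promised-io Deferred wrapped around child_process.exec():

var exec = require('child_process').exec;
var Deferred = require('promised-io/promise').Deferred;

// Run an arbitrary drush command and resolve the promise with its output.
exports.exec = function (command) {
  var deferred = new Deferred();
  exec('drush ' + command, function (err, stdout, stderr) {
    if (err) {
      deferred.reject(stderr || err);
    } else {
      deferred.resolve(stdout);
    }
  });
  return deferred.promise;
};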

Caveats

As mentioned in the drush-reloadp README, this solution isn’t without its flaws. Consistency is sacrificed in the name of speed here. In the use cases I’ve had this hasn’t been a problem, but you will want to take this into account if the source site you’re dumping from is highly trafficked (although you might want to consider not dumping directly from an active, highly trafficked site anyway).

Furthermore, this solution won’t work with Pantheon due to the unique way the DROPs architecture works. I’d be interested in ideas for expanding functionality to support Pantheon, but an interim solution might be to have an overnight job that copies the site database onto a non-Pantheon server your team has access to. Then your team could use drush-reloadp as usual.

My hope is that other developers can leverage this general code to create new, interesting, and useful node.js scripts of their own that integrate node.js and Drupal tasks, and that developers with huge databases to import will be able to spend more time coding and less time on reddit looking at pictures of cats.