How to handle large queries in WordPress

Do you sometimes want to see your code fail? Sometimes I write a piece of code and think: it would be much more interesting now if this broke, because then I’d have to dig down and find out why.

I made an update to Wisdom recently to improve the performance of the reporting. As I’ve been gathering data with the plugin for several months now, the queries are getting far too big. Running a query that includes over 20,000 posts is likely to cause issues on most servers, so I started looking for more efficient ways of doing things.

I’d done some research on batch processing WordPress queries and how best to cache or store the results. But before building any of that, I tried something quick and easy that I thought might work. Strangely though, as it was loading I found myself hoping that it wouldn’t work. I wanted it to fail. Why?

The answer is that I like tinkering with stuff. For some weird reason, I was actually looking forward to playing around with batch processing, so I would be disappointed if my quick and easy method was successful. Watching the spinner on the reload button rotate while the query was running, I knew that if it succeeded I could spend the day doing something fruitful and worthwhile.

Fortunately, it failed. Unfortunately, I spent longer than just the rest of the day devising a method that worked.

One solution for working with large queries

The title of this post tends to imply that it provides all the answers, a definitive how-to guide to working with massive queries in WordPress. It’s certainly not that. I think you could equally add a question mark to the end of the title. What I’ve written up here is only one way to work with large query results – there are alternative methods, I’m sure, and probably better ones.

The method I settled on uses AJAX to break the query up into more manageable parts, say querying 1000 posts at a time instead of all 25,000. I found this article by Pippin Williamson on batch processing export data, which was extremely useful and I based my initial code on it. I extended and modified it in a couple of ways, finding ways to slim down the query and using transients to store data. I also looked at Thomas Griffin’s article on handling and retrieving data.

The standard WP_Query

In almost all normal situations, a query like the following is going to work fine. This is the kind of query you’d use to display custom post types on a page or in a widget:

$args = array(
  'post_type'  => 'my-post-type',
  'posts_per_page'  => 10,
);
$new_query = new WP_Query( $args );
if( $new_query->have_posts() ) {
  while( $new_query->have_posts() ) :
    $new_query->the_post();
    // Do things here with the query
  endwhile;
  // Restore the global post data after the custom loop
  wp_reset_postdata();
}

In almost all cases, that query is going to be ideal for what you need. However, I wanted to be able to query tens of thousands of posts – data on all the sites being tracked by Wisdom – in order to transform that data into reports. And I found that at that query size I was starting to run into out-of-memory issues.

Only return post IDs

The first change I made was to use the fields parameter in the basic query:

$args = array( 
  'post_type' => 'tracked-plugin',
  'posts_per_page' => -1,
  'fields' => 'ids',
  'post_status' => 'publish'
);

Note the 'fields' => 'ids' parameter? This hugely reduces the size of the data returned by the query, because it returns only the IDs of the posts and not the rest of the post data. Immediately, this meant that queries were no longer timing out; they were now completing successfully. The only problem was that they didn’t contain any of the data I wanted.
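
To make that concrete, here’s a minimal sketch (reusing the $args array above; the $site_ids variable name is just for illustration) of what the ID-only query returns – a flat array of integers rather than full WP_Post objects:

// With 'fields' => 'ids', get_posts() returns just an array of post IDs,
// e.g. array( 101, 102, 103, ... ), which keeps the memory footprint small
$site_ids = get_posts( $args );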

What data did I need?

I mentioned that this was part of an update to Wisdom, which is a plugin that allows plugin developers to track how their plugins are being used. It returns data on each plugin user’s site and environment, lets you see what options have been set in your plugin and what other plugins are being used. It presents this data in charts.

All the data is saved in post meta fields: each site that is tracked is saved as a post. Hence the high number of posts to query.
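
As a quick illustration, each tracked site’s values can be read back with get_post_meta(). The wisdom_plugin_slug key appears in the batch-query code further down; the second key here is purely hypothetical:

// $site_id is the ID of one tracked-plugin post, i.e. one tracked site
$plugin_slug = get_post_meta( $site_id, 'wisdom_plugin_slug', true );
// Hypothetical example of another per-site value stored the same way
$wp_version  = get_post_meta( $site_id, 'wisdom_wp_version', true );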

Returning just the post IDs had reduced the size of the query but it had also stripped out all the meta field data that I needed for the reports.

Using AJAX with WP_Query

This is where the batch processing comes in. I modified the form that specifies each report’s parameters.
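
Roughly, the form looks something like this – a simplified sketch, not the exact markup. The nonce, transient, wisdom_plugin and page fields are the ones the PHP handler below expects, the CSS classes match the jQuery selectors, and the specific values are just illustrative:

<form class="wisdom-ajax-params">
  <?php
  // Nonce field checked by wisdom_do_batch_query() below
  wp_nonce_field( 'wisdom_batch_query', 'wisdom_batch_query' );
  ?>
  <!-- Name of the transient used to store this report's data between batches (illustrative value) -->
  <input type="hidden" name="transient" value="wisdom_report_data">
  <!-- Admin page to redirect back to when the batches are done (illustrative value) -->
  <input type="hidden" name="page" value="wisdom-reports">
  <!-- Example report parameter: restrict to one plugin slug, or 'all' -->
  <select name="wisdom_plugin">
    <option value="all">All plugins</option>
  </select>
  <button type="submit" class="wisdom-run-query">Run report</button>
</form>
<div class="wisdom-batch-progress"></div>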

Instead of just a simple GET form, I used jQuery to pass the parameters off to a function in PHP:

( function( $ ) {
  // Object to hold the batch-processing function so it can be passed around
  var self = {};

  $('body').on( 'submit', '.wisdom-ajax-params', function(e) {
    e.preventDefault();
    $('.wisdom-run-query').prop( 'disabled', true );
    var params = $(this).serialize();
    $('.wisdom-batch-progress').append( '<div class="spinner is-active"></div>' );
    // Start the process from an offset of 0
    self.process_offset( 0, params, self );
  });

  self.process_offset = function( offset, params, self ) {
    $.ajax({
      type: 'POST',
      url: ajaxurl,
      data: {
        params: params,
        action: 'wisdom_do_batch_query',
        offset: offset,
      },
      dataType: "json",
      success: function( response ) {
        if( 'done' == response.offset ) {
          // All batches finished: redirect to the report page
          window.location = response.url;
        } else {
          // Not done yet: run the next batch with the updated offset
          self.process_offset( parseInt( response.offset, 10 ), params, self );
        }
      }
    });
  }
}( jQuery ) );

So, when you submit the form, jQuery collects the form parameters (e.g. dates, specific plugins, the report type) using $(this).serialize(), then passes those parameters to the process_offset function. This in turn passes the parameters, via admin-ajax.php, to the PHP function wisdom_do_batch_query. It also sends an offset value, starting at 0.
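
One thing the snippets above assume: the PHP callback has to be registered on WordPress’s AJAX hook so that admin-ajax.php knows how to route the request. A minimal sketch, assuming the reports screen is only ever used by logged-in admin users:

// The part after 'wp_ajax_' must match the 'action' value sent by the jQuery above
add_action( 'wp_ajax_wisdom_do_batch_query', 'wisdom_do_batch_query' );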

The PHP function that receives this data can then start to run its smaller batch queries. The code I’ve reproduced below is a much shorter version of the code I ended up with, just as an example of how it’s done:

function wisdom_do_batch_query() {
  $offset = absint( $_POST['offset'] );
  parse_str( $_POST['params'], $params );
  $params = (array) $params;
  $increment = 1000; // You can set your increment value here
  if( ! wp_verify_nonce( $params['wisdom_batch_query'], 'wisdom_batch_query' ) ) {
    die(); // You'll need to ensure you've set a nonce in your form
  }
  $transient = esc_attr( $params['transient'] );
  $plugin_data = array();
  // Record all data in big array
  if( $offset == 0 ) {
    // This is our first pass
    // Save some useful data for the query
    $plugin_data['plugin_ids'] = wisdom_get_query_ids( $params );
    $plugin_data['total_plugins'] = count( $plugin_data['plugin_ids'] );
    // Ensure any existing transient is deleted
    delete_transient( $transient );
  } else {
    // We save the data to this transient in blocks, then pick it back up again at the start of a new batch
    $plugin_data = get_transient( $transient );
    // Set timestamp
    $plugin_data['last_run'] = time();
  }
  $url = '';
  if( $offset > $plugin_data['total_plugins'] ) {
    $offset = 'done';
    $args = array(
      'post_type' => 'tracked-plugin',
      'page' => $params['page'], // Set other params here as needed
    );
    $url = add_query_arg( $args, admin_url( 'edit.php' ) );
  } else {
    // Query the tracked-plugins
    $args = array(
      'post_type' => 'tracked-plugin',
      'posts_per_page' => $increment,
      'offset' => $offset,
      'fields' => 'ids',
      'no_found_rows' => true,
      'post__in' => $plugin_data['plugin_ids']
    );
    $plugins = get_posts( $args );
    foreach( $plugins as $plugin_id ) {
      // Break these up so we only collect the data we need for each report
      $plugin_slug = get_post_meta( $plugin_id, 'wisdom_plugin_slug', true );
      if( ! empty( $plugin_slug ) ) {
        if( isset( $plugin_data['slugs'][esc_attr( $plugin_slug )] ) ) {
          $plugin_data['slugs'][esc_attr( $plugin_slug )]++;
        } else {
          $plugin_data['slugs'][esc_attr( $plugin_slug )] = 1;
        }
      }
    }
    $offset += $increment; // Move the offset on, ready for the next batch
    set_transient( $transient, $plugin_data );
  }
  echo json_encode( array( 'offset' => $offset, 'url' => $url ) );
  exit;
}

// Returns the array of post IDs to be used by the batch query
function wisdom_get_query_ids( $params=array() ) {
  // Query the IDs only for all tracked-plugins
  $args = array(
    'post_type' => 'tracked-plugin',
    'posts_per_page' => -1,
    'fields' => 'ids',
    'post_status' => 'publish'
  );
  // An example of using the $params to refine the query
  if( ! empty( $params['wisdom_plugin'] ) && $params['wisdom_plugin'] != 'all' ) {
    // This will refine the query to only include meta_key / meta_value pairs
    $args['meta_query'][] = array(
      'key' => 'wisdom_plugin_slug',
      'value' => $params['wisdom_plugin'],
      'compare' => '='
    );
  }
  $id_query = new WP_Query( $args );
  if( $id_query->have_posts() ) {
    return $id_query->posts;
  }
  return false;
}

In essence, what happens here is:

  1. The function receives some data from JavaScript, namely a value for offset and some query parameters
  2. If the offset value is 0, it runs some additional functions to get the post IDs that we want to query, and drops them into the $plugin_data array
  3. If the offset value is greater than 0 but less than the total number of posts we want to query, it grabs data that we’ve already processed from a transient and re-populates the $plugin_data array. New data that is being processed on this pass will get added to that array
  4. It runs a new, smaller query using the post IDs that we collected on our first pass, using 'post__in'. The number of posts and offset values are determined by the $increment value – set this to whatever you feel is best
  5. The actual query is run using get_posts( $args ) and will only return post IDs
  6. We iterate through each ID and grab the data we want, adding that to the $plugin_data array
  7. After the query has run, increase the offset value by the increment
  8. Save the $plugin_data array to a transient
  9. Return the data to JavaScript
  10. The process will keep running until the value of offset is greater than the total number of posts to query

Now, I think the big thing here is using a transient to store the data between batches. Each time a batch runs, it reloads the data stored in the transient, adds to it, then saves the transient again ready for the next batch.
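
Stripped down to just that caching step, the pattern each batch follows looks like this (a sketch reusing the $transient and $plugin_data names from the code above):

// Pick up whatever previous batches have already collected
$plugin_data = get_transient( $transient );
if( false === $plugin_data ) {
  $plugin_data = array();
}
// ... add this batch's results to the $plugin_data array here ...
// Then save it again, ready for the next batch to pick up
set_transient( $transient, $plugin_data );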

Why transients?

Yes, that’s a good question and at this point I’d be very happy for anyone to jump in with a comment on the rights and wrongs of using a transient here and, even better, with some alternatives. For a long time while working on this, I fought against using transients as it seemed messy somehow. I considered passing the ever-growing array of data back to JavaScript, and then back again to the PHP function, but that didn’t feel right on several levels. I also considered using a global variable but again it just felt wrong.

Eventually I had to cave in to the notion of using transients because I couldn’t see a better way. Transients seemed the worst of all possible solutions, except for all the others.

Smaller queries for specific reports

Having got this far, I decided that it was going to be better in the long term to run multiple smaller queries and save results in different transients rather than try to collect all data in one huge super-transient. So the initial query that Wisdom runs to get data will be enough for a summary report: it gets the active/deactivated statuses for all tracked sites.

Then, when you decide you want to run a different report, it’ll check whether that report already exists by searching for the transient where the data would be stored. If it exists, it uses this data – no need to run another query. If it doesn’t exist, it runs a smaller query based on the parameters of the report required.
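
In code terms, that check is essentially just a get_transient() call before deciding whether to kick off a new batch run – a sketch with a hypothetical helper name and transient key:

// Return the cached report data if a previous batch run already built it,
// or false if the report still needs to be generated (hypothetical helper)
function wisdom_maybe_get_report_data( $transient_key ) {
  $report_data = get_transient( $transient_key );
  if( false !== $report_data ) {
    return $report_data; // No need to run the batch query again
  }
  return false; // Caller should start a new batch query for this report
}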

Final thoughts

I’ve gone from a mega query which caused out-of-memory issues to multiple smaller queries that store their results in a transient. Once the initial query has been run, which may take a few seconds depending on your dataset, the reporting is really quick because all the work has been done up front. I have a nagging feeling that there’s a better way to build up the array of results while the batch process is running than storing it in a transient – but for now I’m happy with the way it’s all working.

One more thing

About two days after publishing this article, I came across this podcast from Drew Jaynes, who works on the Affiliate WP plugin. He isn’t discussing WP_Query specifically, but there’s a lot of interesting stuff in there, plus links to code.
