Processing thousands of Salesforce records in Drupal with drushd

Drupal/Salesforce Integration

DesignHammer completed a major Salesforce project this year which involved restructuring a client's Drupal/Salesforce integration. The client was using a forked, hacked-apart version of the Salesforce module from 2009 along with a custom implementation of Salesforce Outbound Messaging to do bidirectional syncing of 30,000 Drupal user accounts.

Our initial work involved replacing the forked module with the latest stable release of the Drupal Salesforce module, and we put a significant amount of time into submitting and reviewing patches for the project.

Problems importing the Salesforce records

After the initial work was completed, the client increased the number of Lead/Contact records in Salesforce to 100,000, about 75% of which required syncing with Drupal. This caused problems.

The first issue we encountered was the frequency of updates. Each night, a processing script in Salesforce runs updates on all records, which can affect anywhere from a few thousand to tens of thousands of Contact/Lead records.

Outbound Messaging, which provides real-time updates of data from Salesforce to Drupal, sends messages faster than Drupal can send back a response acknowledging them, so the Outbound Messaging queue in Salesforce would quickly fill up and fail. After some consideration, we switched the import implementation to SF Import, a module that polls Salesforce for updates in a two-part process: first it grabs all the Salesforce IDs that have changed since the last poll, then it processes those IDs by passing them through sf_user_import().
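In rough terms, the two-part poll looks something like the sketch below. The function names come from the Salesforce and SF Import modules, but the surrounding control flow is a simplified assumption for illustration, not the module's actual code:

```php
<?php
// Simplified sketch of SF Import's two-part polling cycle.
// salesforce_api_connect() and sf_user_import() are provided by the
// Salesforce / SF Import modules; the wiring here is illustrative.
$sf = salesforce_api_connect();

// Part 1: collect the IDs of all Contacts changed since the last poll.
// $last_poll is assumed to be a stored timestamp from the previous run.
$updated = $sf->client->getUpdated('Contact', $last_poll, time());

// Part 2: hand each changed record to the importer.
foreach ((array) $updated->ids as $sfid) {
  sf_user_import($sfid);
}
```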

But we quickly ran into a new problem: processing the data. Processing is triggered by cron, which is limited by PHP's max_execution_time setting; in this case, the limit was 2 minutes. After that, the queue would sit idle until the next cron run 28 minutes later. This meant that, at best, we were only processing a few hundred records per day. The import queue quickly filled up to more than 110,000 items.

Cutting the Salesforce import down to size

As a first step, we looked at what could be done to improve the efficiency of the import process. For one thing, we noticed that about 30-40% of the records stored in {sf_import_queue} were failing on import because they lacked an e-mail address. These records were never meant to sync with Salesforce, so they could be removed from the queue. We wrote a drush script that walks through the queue, looks up each corresponding record in Salesforce, and deletes the queue item if the record has no e-mail address.
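A cleanup command along these lines would look roughly like the following. The command name, the field lookup, and the exact queue schema are assumptions sketched from the SF Import module's conventions, not our actual script:

```php
<?php
// Hypothetical sketch of the queue-cleanup drush command callback.
function drush_sf_queue_cleanup() {
  $sf = salesforce_api_connect();
  $result = db_query("SELECT sfid FROM {sf_import_queue}");
  while ($row = db_fetch_object($result)) {
    // Look up the record in Salesforce and check for an e-mail address.
    $records = $sf->client->retrieve('Email', 'Contact', array($row->sfid));
    $record = is_array($records) ? reset($records) : $records;
    if (empty($record->fields->Email)) {
      db_query("DELETE FROM {sf_import_queue} WHERE sfid = '%s'", $row->sfid);
      drush_log(dt('Removed @sfid: no e-mail address.', array('@sfid' => $row->sfid)), 'ok');
    }
  }
}
```

Because each item requires a round-trip to Salesforce, a real version would want to batch the retrieve() calls rather than fetch records one at a time.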

Then we looked at preventing these records from entering the queue in the first place. The SF Import module was using the Salesforce API's getUpdated() method, which returns the SFIDs of all records of a given type that changed between two points in time. We committed a patch that allows SOQL queries to be used instead of getUpdated(), giving us more flexibility in this process. With that in place, we used our own hand-written query that checks for an e-mail address when selecting SFIDs to import. This reduced the overall size of the queue by about 40%.
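The actual query isn't reproduced here, but the idea is along these lines. The field names and timestamp handling are assumptions for illustration:

```php
// Hypothetical SOQL used in place of getUpdated(): only select records
// that both changed since the last poll and have an e-mail address.
// $last_poll_timestamp would be an ISO 8601 datetime from the previous run.
$soql = "SELECT Id FROM Contact " .
        "WHERE Email != null " .
        "AND SystemModstamp > $last_poll_timestamp";
$result = $sf->client->query($soql);
```

Filtering on the Salesforce side like this means the no-e-mail records never reach Drupal at all, rather than being queued and discarded later.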

Using drushd for processing

Once we were able to limit the size of the queue, we looked for a way to process it more efficiently. The major problem with relying on cron is that we were limited by max_execution_time. We experimented with increasing cron frequency, but that wasn't enough. We needed to be able to process records from the queue continually.

We looked at a few different options and settled on drushd, a library for creating daemonized drush commands. Using this library, we were able to create a Salesforce import daemon that listens for additions to the {sf_import_queue} table and processes them continually. This was a much more efficient way to work through the queue.
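The shape of such a daemon is sketched below. The base class name and method hooks are assumptions modeled on the drushd examples, not verified against the library's actual API:

```php
<?php
// Sketch of a drushd-based import daemon (class and method names assumed).
class SalesforceImportDaemon extends ProcessItemsDaemon {

  // Claim a batch of queued SFIDs to work on.
  protected function getItems() {
    $items = array();
    $result = db_query_range("SELECT sfid FROM {sf_import_queue}", 0, 50);
    while ($row = db_fetch_object($result)) {
      $items[] = $row->sfid;
    }
    return $items;
  }

  // Hand each changed record to the SF Import pipeline, then dequeue it.
  protected function processItem($sfid) {
    sf_user_import($sfid);
    db_query("DELETE FROM {sf_import_queue} WHERE sfid = '%s'", $sfid);
  }
}
```

Because the daemon runs outside of cron, it is not bound by max_execution_time and can keep draining the queue as fast as new items arrive.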

drushd will occasionally reach a high level of memory usage. If it reaches 85% of the total PHP memory limit, it is supposed to refork and continue processing. However, we weren't able to get this working correctly; instead, the script simply died. So we wrote a simple module that implements hook_cron(). The module checks whether the process ID for the import daemon is still active, and if it isn't, it starts the daemon up again.
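The watchdog hook amounts to only a few lines. This is a sketch under assumed names: the variable holding the PID and the drush command string are hypothetical, though posix_kill() with signal 0 is a standard way to test whether a process exists:

```php
<?php
/**
 * Implements hook_cron().
 *
 * Restart the import daemon if its recorded PID is no longer running.
 * Variable and command names here are illustrative.
 */
function sf_daemon_watchdog_cron() {
  $pid = variable_get('sf_import_daemon_pid', 0);
  // Signal 0 performs the permission/existence check without actually
  // delivering a signal to the process.
  if (!$pid || !posix_kill($pid, 0)) {
    // Relaunch the daemon in the background via drush.
    exec('drush salesforce-import-daemon start > /dev/null 2>&1 &');
  }
}
```

Since cron already runs every half hour, this gives the daemon a worst-case restart lag of one cron interval, which was acceptable for our import volume.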

Using this method we are able to consistently handle tens of thousands of imports from Salesforce Contacts/Leads to Drupal users every day without a significant hit on performance or responsiveness on the site.

Check out the code for the daemon and module on DesignHammer's Github and let us know what you think. If you've approached this problem from a different angle please let us know in the comments section.



Very interesting post - daemonizing php is hard :-)

Did you look at the Waiting Queue module?

With a queue like Beanstalkd, and something like Supervisord to keep the workers alive, you can process a queue very efficiently, without the crappy cron problems.

We did look at Waiting Queue. The documentation and example provided with drushd made it a much simpler starting point. I don't have any familiarity with Beanstalkd/Supervisord and didn't want to go down that path if I could get drushd to work. I'd be interested to see some examples of how Waiting Queue is being used though.


Nice use case for drushd! I'd be curious to know why the process was just dying when it reforks. Can you open an issue on d.o with some steps to replicate the issue? Thanks!
