Testing Drupal data migrations with CasperJS

At DesignHammer, we've used the Migrate module on many projects, with a variety of use cases: Drupal-to-Drupal migrations; importing content from CSVs or XML to populate a new site build; importing XML from an eBook as nodes for an online version of the book content, and so on. It's a flexible, stable module that provides a great API for getting content from point A to point B. (Note: if you are not familiar with the Migrate module, you should check out the documentation — for the purposes of this post, you just need to know that it's a framework to help developers move content from an external source into Drupal.)

But while Migrate lets you automate the process of importing the content, *validating* what gets migrated is often a manual review process. In most of the projects we've worked on, if our migration code gets article #123 migrated correctly, then we are pretty confident that article #456 will look fine too, because the data is in a consistent format.

Clearly, this is an optimistic assumption and it doesn't always hold up. On a recent project, the source data is anything but consistent, so while one migrated item might look correct in a manual review, any number of other items could have problems. Adding to the challenge, this is not a one-time migration but a regular, nightly import of data, so we can't afford to have errors from the migration showing up in production.

In this post, we'll look at how we can implement a testing framework for ongoing imports of data into Drupal using some of our favorite tools: CasperJS, Migrate module, and Jenkins.

The Scenario

We have a Drupal 7 site where about 90% of the content is pulled from an external data source once per day. The external data source is itself supplying data based on an import from a CRM. So, the workflow is something like: the client edits content in a CRM, that data gets exported to another server as XML, then that server provides a SOAP API for our Drupal site to pull down XML. Finally, for reasons that I won't go into here, we need to convert the XML to CSV format.

There are a couple of points where data can get munged as it makes its way from the CRM to the Drupal site. More importantly, the CRM data is not always correct to begin with: sometimes invalid cross-references are created between different entities in the CRM. For example, if we are importing Authors and the Publications they have written, we would expect that the Publication XML contains valid Author IDs, *and* that the Author IDs are the correct Author references for the data. This is a nice assumption, but we ran into problems over and over again where it wasn't the case. In other cases, fields that should not have HTML in them (e.g. fields that map to node titles) sometimes do.

We can't go ahead with deploying a site where a nightly import has the potential to break a lot of the content, so we need to look at setting up a process to prevent any problems from ever reaching production.

Validating the migration

Preparation

First thing we need to do is get the source data in a format that CasperJS can understand — JSON. As I mentioned earlier, we are importing XML from the data source and converting it to CSV for our Migrate module code. This is done through a custom module that grabs the XML from the external source, then converts the XML to CSV (here's a snippet for anyone interested). To get JSON, we read the CSV back into an array, then write the array as JSON.
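The CSV-to-JSON step can be sketched roughly as follows. In the project this happens inside a custom Drupal module, so the JavaScript below only illustrates the transformation. The column names are hypothetical, and the parser assumes a header row and simple fields with no quoted commas or embedded newlines.

```javascript
// Minimal sketch: turn CSV text into an array of row objects, then into JSON.
// Assumes a header row and simple fields (no quoted commas, no embedded newlines).
function csvToRows(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) {
    return [];
  }
  const headers = lines[0].split(',').map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(',');
    const row = {};
    headers.forEach((header, i) => {
      // Missing trailing cells become empty strings.
      row[header] = (cells[i] || '').trim();
    });
    return row;
  });
}

// Hypothetical sample data; the real column names are not given in this post.
const sampleCsv = 'author_id,last_name,initials\n2751,Smith,J.R.\n2752,Jones,A.';
const sampleRows = csvToRows(sampleCsv);
const sampleJson = JSON.stringify(sampleRows, null, 2);
```

Keying each row by its column header means the test can refer to fields by name rather than by position, which helps if the column order of the source files changes.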

API module for locating migrated data

The second thing needed is an API for our Drupal site to return content to CasperJS based on the source ID. For example, let's say we have a Publication with ID 123. After running a migration, we have a Publication node with a node ID of 456. What we want is for CasperJS to be able to find the migrated entity based on the source ID, so we can check whether the data was migrated accurately. To do this, I wrote an API module with a hook_menu() entry for returning content. The idea is that I can now visit https://example.com/api/publication/123 and the API module will redirect me to https://example.com/node/456. What's more, the API module will also set messages or errors if more than one Drupal entity matches a source ID, or if the item is not found. That way, CasperJS can run assertions based on those status messages.
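To make the lookup behavior concrete, here is a small JavaScript model of the decision logic. It is not the Drupal module itself (that is PHP built on hook_menu()). The in-memory mapping, the function name, and the message strings are all assumptions made for illustration.

```javascript
// Model of the API module's decision logic (not the actual Drupal code).
// `mapping` is a hypothetical list of { sourceId, nid } pairs standing in for
// the lookup the real module performs against migrated content.
function resolveSourceId(mapping, sourceId) {
  const matches = mapping.filter((entry) => entry.sourceId === sourceId);
  if (matches.length === 0) {
    return { status: 'error', message: 'No content found for source ID ' + sourceId };
  }
  if (matches.length > 1) {
    return { status: 'error', message: 'Duplicate content found for source ID ' + sourceId };
  }
  // Exactly one match: report where the request would be redirected.
  return { status: 'ok', redirect: '/node/' + matches[0].nid, message: 'Content found' };
}

const publicationMap = [
  { sourceId: 123, nid: 456 },
  { sourceId: 200, nid: 501 },
  { sourceId: 200, nid: 502 }, // deliberately duplicated source ID
];

const found = resolveSourceId(publicationMap, 123);     // redirect to /node/456
const duplicate = resolveSourceId(publicationMap, 200); // duplicate error
const missing = resolveSourceId(publicationMap, 999);   // not-found error
```

Returning a distinct message for each outcome is what lets the test tell a missing item apart from a duplicated one.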

Dynamic migration

Next, we need to write a dynamic migration test in CasperJS.

Recently I wrote about Using CasperJS Drush and Jenkins to test Drupal and I recommend glancing through that to see an example of the way a typical test might be written: you define some behaviors that will get run on a predefined set of paths, and then test them.

The problem is that the tests described in that post perform a predefined number of steps. In our case, we have a variable number of items (in fact, thousands of items) that we need to check. Fortunately, CasperJS provides an approach for dynamic tests, which is just what we need.

Let's look at some code to implement this.

Validating entities

In this example, we are looping through source data containing rows of Publication Authors that are migrated into Drupal as entities. Each Publication Author entity contains references to Publication entities. In the test, we verify that publications linked with publication authors in the source data have been accurately migrated as entities with entity references in Drupal. We also verify that a few fields for the Publication Authors (name, initials, remote ID, etc) have been accurately migrated.

Here's a link to the full test as it's a bit lengthy — for this post, I'll break down the individual parts.

Kicking things off

At the top of the test script, we have this code:

This is just telling CasperJS about our test.

Defining variables

Next up we define a couple variables we'll use in the test.

Running the test

Now it's time to kick off the test! The test starts with a command telling CasperJS to run the check() method:

So, what's in check()?

json is the variable we defined above. It contains rows of publication author data. This snippet of code is saying, loop through the entire data set of publication authors, and call the verifyData() method with the current row; then after we're finished with the source data tell CasperJS that the test is completed.
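As a rough illustration of that loop, here is a self-contained sketch. It is an assumption about the general shape of such code, not the actual test script. `fakeCasper` is a hypothetical stand-in so the snippet runs without CasperJS, and `authorRows` plays the role of the json variable.

```javascript
// Stand-in for the CasperJS instance so this sketch runs on its own.
// In a real test, `this` inside check() would be the casper object.
const fakeCasper = {
  finished: false,
  run(step) { step.call(this); },   // the real runner would execute queued steps first
  done() { this.finished = true; }, // stands in for telling CasperJS the test is complete
};

// Hypothetical source rows; the real data holds publication author records.
const authorRows = [{ author_id: '2751' }, { author_id: '2752' }];
let currentRow = 0;
const visited = [];

// Placeholder for the real verifyData(), which loads the entity and runs assertions.
function verifyData(row) {
  visited.push(row.author_id);
}

// Process one row, then schedule check() again; finish when the rows run out.
function check() {
  if (currentRow < authorRows.length) {
    verifyData.call(this, authorRows[currentRow]);
    currentRow++;
    this.run(check);
  } else {
    this.done();
  }
}

fakeCasper.run(check);
```

Because check() re-schedules itself after each row, the number of steps does not need to be known up front, which is the point of a dynamic test.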

Verifying the row

Now for the core of the test, the verifyData() method. I'm not going to go into great detail, as the full gist has comments explaining, but the basic overview is:

1. For the current publication author that verifyData() is looking at, grab a list of publication IDs from our other source data (var pubAuthorJoin).
2. Visit our site's API (/api/publication-author/%) to load the publication author entity in Drupal. Note that if response.status isn't 200, we retry a few times (waiting 5 seconds each time) to make sure that there's an actual problem with loading the entity, as opposed to a hiccup on the server that didn't return the page when CasperJS requested it.
3. Run the assertions. In this example we are verifying that the last name, initials, author ID, and external ID were migrated correctly. We also loop through the pubsForAuthor variable (from step #1) and make sure that all the publications that should be linked with this author have been migrated and referenced correctly in Drupal.
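The three steps above can be modeled as plain functions. This is a sketch of the logic only. The field names, helper names, and sample values are hypothetical, the 5-second wait between retries is left out, and the real test runs CasperJS assertions rather than collecting a list of failures.

```javascript
// Step 1: collect the publication IDs linked to one author in the join data.
// Field names (author_id, pub_id) are hypothetical.
function pubsForAuthor(joinRows, authorId) {
  return joinRows
    .filter((row) => row.author_id === authorId)
    .map((row) => row.pub_id);
}

// Step 2: call `fetchStatus` until it returns 200 or the attempts run out.
// The real test also waits 5 seconds between attempts; the wait is omitted here.
function loadWithRetry(fetchStatus, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (fetchStatus() === 200) {
      return { ok: true, attempts: attempt };
    }
  }
  return { ok: false, attempts: maxAttempts };
}

// Step 3: compare source values with what Drupal returned and list any mismatches.
function compareAuthor(source, migrated, expectedPubIds) {
  const failures = [];
  ['last_name', 'initials', 'author_id'].forEach((field) => {
    if (source[field] !== migrated[field]) {
      failures.push(field + ' does not match');
    }
  });
  expectedPubIds.forEach((pubId) => {
    if (migrated.linked_pubs.indexOf(pubId) === -1) {
      failures.push('Pub ID ' + pubId + ' is not linked');
    }
  });
  return failures;
}

// Hypothetical data for illustration only.
const joinData = [
  { author_id: '2751', pub_id: '10828' },
  { author_id: '2751', pub_id: '2516' },
  { author_id: '9999', pub_id: '77' },
];
const sourceAuthor = { author_id: '2751', last_name: 'Smith', initials: 'J.R.' };
const migratedAuthor = {
  author_id: '2751', last_name: 'Smith', initials: 'J.R.', linked_pubs: ['10828'],
};

const statuses = [503, 200]; // first request fails, second succeeds
const loadResult = loadWithRetry(() => statuses.shift(), 3);
const expectedPubs = pubsForAuthor(joinData, '2751');
const problems = compareAuthor(sourceAuthor, migratedAuthor, expectedPubs);
```

In this sample, the migrated author is missing one publication reference, so `problems` ends up with a single entry naming the unlinked publication ID.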

Success

And that's it! Here's what some of the output from the test looks like:

# Checking content returned by /api/publication-author/2751
PASS Success message present
PASS No duplicate content found.
PASS Titles match
PASS Last name is present.
PASS Initials present.
PASS Publication Author ID exists.
PASS Pub ID 10828 is linked.
PASS Pub ID 2516 is linked.

And if there are failures, we'll get a log stating which assertion failed, the relevant ID associated with the failure, and so on. You can even configure CasperJS to take a screenshot when there is a failure to help debug what is going on.

More complex tests

Each content migration for Drupal has its own set of CasperJS tests. Some are quite complex — for example, on a staff member page, there are views showing publications for a staff member, services they provide, etc, and we have CasperJS check all the links to make sure that the cross-references are accurate — while others are much more straightforward. There aren't really many limitations on this, other than the amount of time your tests will take to run.

Implementing the testing and deployment process with Jenkins

A continuous integration and delivery system is essential to making this all work well. (For an overview of Jenkins, check out this Slideshare.) Here's the general order of operations for the work Jenkins does:

1. Rebuild @stage by importing the DB from @production, disabling a couple of modules and tweaking a few settings in the process
2. Run an import of the third-party data source XML and store it on the server
3. Run a Drush script that does an initial pass of validating the source data: the idea is that there are some basic validations of the source data that can be done well before we run the content migrations, e.g. checking each file for required fields, unique values, etc.
4. Kick off a migration on @stage using the latest data set
5. Run the CasperJS tests (one at a time, so that we can fail quickly if needed) to iterate over the source data and verify that content and fields migrated correctly to Drupal
6. If all tests pass, then a data pull and migration are initiated in @prod
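The initial validation pass over the source data could look something like the sketch below. The project does this in a Drush script, so the JavaScript here only illustrates the kinds of checks involved (required fields, unique values, and author references that resolve). All function and field names are hypothetical.

```javascript
// Sketch of a source-data sanity check: required fields and unique IDs.
// All field names are hypothetical.
function validateSource(rows, options) {
  const errors = [];
  const seen = new Set();
  rows.forEach((row, index) => {
    options.required.forEach((field) => {
      if (row[field] === undefined || row[field] === '') {
        errors.push('Row ' + index + ': missing ' + field);
      }
    });
    const id = row[options.uniqueKey];
    if (seen.has(id)) {
      errors.push('Row ' + index + ': duplicate ' + options.uniqueKey + ' ' + id);
    }
    seen.add(id);
  });
  return errors;
}

// Check that every author ID referenced by a publication exists in the author data.
function findBrokenReferences(publications, authors) {
  const authorIds = new Set(authors.map((a) => a.author_id));
  return publications
    .filter((pub) => !authorIds.has(pub.author_id))
    .map((pub) => 'Publication ' + pub.pub_id + ' references unknown author ' + pub.author_id);
}

// Hypothetical sample data with one missing field, one duplicate ID,
// and one publication pointing at an author that does not exist.
const authorData = [
  { author_id: '1', last_name: 'Smith' },
  { author_id: '1', last_name: '' },
];
const publicationData = [
  { pub_id: '10', author_id: '1' },
  { pub_id: '11', author_id: '2' },
];

const fieldErrors = validateSource(authorData, { required: ['author_id', 'last_name'], uniqueKey: 'author_id' });
const referenceErrors = findBrokenReferences(publicationData, authorData);
```

Running checks like these before the migration means obviously bad source files can stop the build early, well before the slower CasperJS tests start.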

If there are errors, we get a report stating exactly which rows of the source data did not pass the tests, and Jenkins stores a copy of the imported content, so we can quickly locate the source of any issues. This also means that we have a framework for testing changes to Migration code implementation — if we need to adjust a field mapping, or add a new migration that ties in to the other ones, we can run the tests on the new code and see if there are any issues.

The whole process takes between 2 and 3 hours. While it is time-consuming, the end result is that we can run drush migrate-import with confidence in production, in an automated way, and not have to worry about data corruption or failures.

So, that's the gist of how we are validating our Drupal migrations and preventing problems from reaching production. If you have questions or suggestions please let us know in the comments below.

Comments

Awesome. We were just talking about this during Stevector's session at midcamp.

This is really great stuff.

But the problem that I see is that you are only verifying the last few steps of the chain. It sounds like the CRM is pretty wonky so if something breaks there it has the potential to bork the whole process, potentially in a way that Casper still passes. You might need some tests of the entire system.

I'm assuming that you're not re-migrating all content every day, only content that's new or changed. In that case it becomes a bit more difficult to create a test of the whole system. But one thing you could do is have a test record in the CRM. Create some process that causes that record's timestamp to change every day. That way it will be re-imported every day. And you can verify in Drupal that every field has the value that you are expecting.

Thanks for your comments.

But the problem that I see is that you are only verifying the last few steps of the chain. It sounds like the CRM is pretty wonky so if something breaks there it has the potential to bork the whole process, potentially in a way that Casper still passes. You might need some tests of the entire system.

Yeah, totally. I glossed over this in the post, but it's mentioned here:

Run a Drush script that does an initial pass of validating the source data — the idea is that there are some basic validations of the source data that can be done well before we run the content migrations, e.g. checking each file for required fields, unique values, etc.

The script does a sanity check—do we have at least X number of records for a particular data type, are all the required fields there, if the publication data contains a reference to an author ID, does the publication author data contain that ID, etc. This way we can catch potential failures well before CasperJS is in the picture.

I like the idea of a test record for each data type though, and may look at implementing that as an additional safety check.

Hi,

Can someone tell me how to redirect to another webpage using CasperJS? For example, on a webpage I have a "logout" button. I want to automate the process so that clicking the "logout" button redirects the page to the logged-out page. Currently it's staying on the same page.
