At DesignHammer, we've used the Migrate module on many projects, with a variety of use cases: Drupal-to-Drupal migrations; importing content from CSVs or XML to populate a new site build; importing XML from an eBook as nodes for an online version of the book content, and so on. It's a flexible, stable module that provides a great API for getting content from point A to point B. (Note: if you are not familiar with the Migrate module, you should check out the documentation — for the purposes of this post, you just need to know that it's a framework to help developers move content from an external source into Drupal.)
But while Migrate lets you automate the process of importing the content, validating what gets migrated is often a manual review process. In most of the projects we've worked on, if our migration code gets article #123 migrated correctly, then we are pretty confident that article #456 will look fine too, because the data is in a consistent format.
Clearly, this is an optimistic assumption and it doesn't always hold up. On a recent project, the source data is anything but consistent, so while a manual review of one migrated item might look correct, any number of other items could have problems. Adding to the challenge is that this is not a one-time migration, but a regular, nightly import of data, so we can't afford to have errors from the migration showing up in production.
We have a Drupal 7 site where about 90% of the content is pulled from an external data source once per day. The external data source is itself supplying data based on an import from a CRM. So, the workflow is something like: the client edits content in a CRM, that data gets exported to another server as XML, then that server provides a SOAP API for our Drupal site to pull down XML. Finally, for reasons that I won't go into here, we need to convert the XML to CSV format.
There are a couple points where data can get munged as it makes its way from the CRM to the Drupal site. More importantly, the CRM data is not always correct to begin with — sometimes invalid cross-references are created between different entities in the CRM. For example, if we are importing Authors and Publications they have written, we would expect that the Publication XML contains valid Author IDs, and that the Author IDs are the correct Author references for the data. This is a nice assumption, but we ran into problems over and over again where it isn't the case. In other cases, fields that should not have HTML in them (e.g. fields that map to node titles) sometimes have HTML.
We can't go ahead with deploying a site where a nightly import has the potential to break a lot of the content, so we need to look at setting up a process to prevent any problems from ever reaching production.
Validating the migration
First thing we need to do is get the source data in a format that CasperJS can understand — JSON. As I mentioned earlier, we are importing XML from the data source and converting it to CSV for our Migrate module code. This is done through a custom module that grabs the XML from the external source, then converts the XML to CSV (here's a snippet for anyone interested). To get JSON, we read the CSV back into an array, then write the array as JSON.
API module for locating migrated data
The second thing needed is an API for our Drupal site to return content to CasperJS based on the source ID. For example, let's say we have a Publication with ID 123. After running a migration, we have a Publication node with a node ID of 456. What we want is for CasperJS to be able to find the migrated entity based on the source ID, so we can compare if the data get migrated accurately. To do this, I wrote an API module with a
hook_menu() entry for returning content. The idea being that I can now visit
http://example.com/api/publication/123 and the API module will redirect me to
http://example.com/node/456. What's more, the API module will also set messages or errors if more than one Drupal entity matches a source ID, or if the item is not found — that way CasperJS can run assertions based on those status messages.
Next, we need to write a dynamic migration test in CasperJS.
Recently I wrote about Using CasperJS Drush and Jenkins to test Drupal and I recommend glancing through that to see an example of the way a typical test might be written: you define some behaviors that will get run on a predefined set of paths, and then test them.
The problem is, the tests described in that post look at a predefined number of tests/steps to perform. In our case, we have a variable number of items (in fact, thousands of items) that we need to check. Fortunately CasperJS provides an approach for dynamic tests — just what we need.
Let's look at some code to implement this.
In this example, we are looping through source data containing rows of Publication Authors that are migrated into Drupal as entities. Each Publication Author entity contains references to Publication entities. In the test, we verify that publications linked with publication authors in the source data have been accurately migrated as entities with entity references in Drupal. We also verify that a few fields for the Publication Authors (name, initials, remote ID, etc) have been accurately migrated.
Here's a link to the full test as it's a bit lengthy — for this post, I'll break down the individual parts.
Kicking things off
At the top of the test script, we have this code:
This is just telling CasperJS about our test.
Next up we define a couple variables we'll use in the test.
Running the test
Now it's time to kick off the test! The test starts with a command telling CasperJS to run the
So, what's in
json is the variable we defined above. It contains rows of publication author data. This snippet of code is saying, loop through the entire data set of publication authors, and call the
verifyData() method with the current row; then after we're finished with the source data tell CasperJS that the test is completed.
Verifying the row
Now for the core of the test, the
verifyData() method. I'm not going to go into great detail, as the full gist has comments explaining, but the basic overview is:
- For the current publication author that
verifyData()is looking at, grab a list of publication IDs from our other source data (
- Visit our site's API (
/api/publication-author/%) to load the publication author entity in Drupal. Note that if
response.statusisn't 200, we retry a few times (waiting 5 seconds each time) to make sure that there's an actual problem with loading the entity, as opposed to a hiccup on the server that didn't return the page when CasperJS requested it.
- Run the assertions. In this example we are verifying that the last name, initials, author ID, and external ID were migrated correctly. We also loop through the
pubsForAuthorvariable (from step #1) and make sure that all the publications that should be linked with this author have been migrated and referenced correctly in Drupal.
And that's it! Here's what some of the output from the test looks like:
# Checking content returned by /api/publication-author/2751
PASS Success message present
PASS No duplicate content found.
PASS Titles match
PASS Last name is present.
PASS Initials present.
PASS Publication Author ID exists.
PASS Pub ID 10828 is linked.
PASS Pub ID 2516 is linked.
And if there are failures, we'll get a log stating which assertion failed, the relevant ID associated with the failure, and so on. You can even configure CasperJS to take a screenshot when there is a failure to help debug what is going on.
More complex tests
Each content migration for Drupal has its own set of CasperJS tests. Some are quite complex — for example, on a staff member page, there are views showing publications for a staff member, services they provide, etc, and we have CasperJS check all the links to make sure that the cross-references are accurate — while others are much more straightforward. There aren't really many limitations on this, other than the amount of time your tests will take to run.
Implementing the testing and deployment process with Jenkins
A continuous integration delivery system is essential to making this all work well. (For an overview of Jenkins, check out this Slideshare.) Here's the general order of operations for the work Jenkins does:
@stageby importing the DB from
@production, disabling a couple of modules and tweaking a few settings in the process
- Run an import of the 3rd party data source XML and store it on the server
- Run a Drush script that does an initial pass of validating the source data — the idea is that there are some basic validations of the source data that can be done well before we run the content migrations, e.g. checking each file for required fields, unique values, etc.
- Kick off a migration on
@stageusing the latest data set
- Run the CasperJS tests (one at a time, so that we can fail quickly if needed) to iterate over the source data and verify that content and fields migrated correctly to Drupal
- If all tests pass, then a data pull and migration are initiated in
If there are errors, we get a report stating exactly which rows of the source data did not pass the tests, and Jenkins stores a copy of the imported content, so we can quickly locate the source of any issues. This also means that we have a framework for testing changes to Migration code implementation — if we need to adjust a field mapping, or add a new migration that ties in to the other ones, we can run the tests on the new code and see if there are any issues.
The whole process takes between 2 - 3 hours. While it is time consuming, the end result is that we can run
drush migrate-import with confidence in production, in an automated way, and not have to worry about data corruption or failures.
So, that's the gist of how we are validating our Drupal migrations and preventing problems from reaching production. If you have questions or suggestions please let us know in the comments below.