A Drupal Migration Story, Part II: Data Cleanup and Transformations

Clean data is good data. In this post, we'll look at how to clean up and transform data during a migration.

EJ /

A migration is a great opportunity to clean up your data because clean data is good data. In this post, we’ll look at how to clean up and transform data during a migration. The assumption here is that we’re migrating a bunch of run-of-the-mill basic- or article-type nodes that don’t have any specialized functionality: think news articles, blog posts, etc. And without further ado…

The (Dirty) Data

‘Dirty’ data can be the result of many things, including warrantless copy-and-paste from other sources, a lack of data validation, or simply a change in data structure. In our case, we will assume the good ol’ copy-and-paste from word processors and email clients. Here’s a sample of raw, dirty web copy that includes a lot of HTML tags and inline styles.

"copy": [<p style="font-weight:bold;"><i><span>A</span></i><i><span>liqua Lorem sint elit ipsum ullamco duis consectetur!</span></i><span></span></p>
<div class="layout layout--onecol"><div class="layout__region layout__region--content"><div class="block block-ctools-block block-entity-fieldnodecreated"><div class="content"><span class="field field--name-created field--type-created field--label-above"><strong>Consequat excepteur ea et proident et veniam occaecat cupidatat eu irure deserunt labore. Sint laborum et culpa dolore esse cupidatat culpa ex officia cillum consectetur anim fugiat. Laborum nostrud Lorem nisi est ipsum mollit commodo eiusmod nostrud cupidatat elit excepteur ut. Cillum culpa duis pariatur. Lorem culpa cupidatat velit Lorem tempor ut amet excepteur.</strong></span></div> <div class="content"><span class="field field--name-created field--type-created field--label-above"><a href="https://urldefense.com/lorem-ipsum">Laboris in aute est excepteur occaecat duis adipisicing exercitation labore Lorem fugiat mollit</a><br /></span></div></div></div></div><div class="layout layout--twocol-section layout--twocol-section--67-33"><div class="layout__region layout__region--first"><div class="block block-layout-builder block-field-blocknodearticlebody"><div class="content"><div property="schema:text" class="clearfix text-formatted field field--name-body field--type-text-with-summary field--label-hidden field__item">
<p><strong>&nbsp;</strong></p>
<p>Ex deserunt elit culpa do reprehenderit consequat. Mollit nulla aliqua cillum minim eiusmod. Eiusmod adipisicing tempor aute Lorem ipsum laborum irure aute irure nisi sint occaecat sunt sit. Eiusmod pariatur est in reprehenderit elit tempor esse anim voluptate dolor Lorem consequat Lorem quis.</p>
<p>Quis aliqua irure irure sunt labore proident nulla. Anim pariatur sit est laborum cillum. Laborum nisi do sint reprehenderit tempor ut <a href="/internal-link">dolor</a> amet.</p></div></div></div></div></div><table width="100"><tbody><tr><td>a table</td></tr></tbody></table>

Our aim here is to clean up all the extraneous and unwanted HTML tags and inline styles, and transform the data into simple, clean HTML. The best place to do this is in the migration’s process stage. Let’s write a custom process plugin to do this.

The Process Plugin

A process plugin is a class that implements MigrateProcessInterface (usually by extending ProcessPluginBase) and is used to transform data during a migration. In our case, we’ll write a custom general-purpose process plugin that can clean up any copy. Drush can scaffold this plugin for us and give us a good starting point.

drush generate plugin:migrate:process

Following the prompts, we’ll type the (machine) name of our custom module and provide a (machine) name for the plugin ID; Drush will auto-suggest the plugin class name based on the ID. Next, we can add any dependencies as needed. Below is the generated plugin class.

<?php

declare(strict_types = 1);

namespace Drupal\ten_mig\Plugin\migrate\process;

use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;

/**
 * Provides a ten_mig_copy_cleanup plugin.
 *
 * Usage:
 *
 * @code
 * process:
 *   bar:
 *     plugin: ten_mig_copy_cleanup
 *     source: foo
 * @endcode
 *
 * @MigrateProcessPlugin(id = "ten_mig_copy_cleanup")
 */
final class CopyCleanup extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property): mixed {
    // @todo Transform the value here.
    return $value;
  }

}

Cleanup on Aisle Copy

To accomplish our mission, we’ll employ the HTML Purifier library via the ezyang/htmlpurifier package. We’ll install the package via Composer like any other: composer require ezyang/htmlpurifier. With the library installed, we can now use it in our plugin. HTML Purifier’s configuration documentation is a good place to start to understand all the options it offers. In our case, we’ll go with a few, yet important, options.

$config->set('AutoFormat.RemoveEmpty', true);
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', true);
$config->set('HTML.TargetBlank', true);
$config->set('HTML.TargetNoreferrer', true);
$config->set('HTML.TargetNoopener', true);
$config->set('HTML.Allowed', 'a[href|title|rel|target],b,blockquote,caption,cite,em,figcaption,figure,h2,h3,h4,h5,h6,i,img[alt|style|src|title],li,ol,p,strong,table,tbody,td,th,tr,ul,');

Our config options will remove empty tags, non-breaking spaces, and add target="_blank", rel="noreferrer", and rel="noopener" to all external anchor tags. We’ll also allow only a few tags and their attributes. Here, I’m choosing to keep tables, even though they are not always great in terms of accessibility and responsiveness. However, I will flag them in Drupal’s logs for editorial review post migration.
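
To see how options like these get wired into a purifier instance, here is a minimal sketch of the purification step on its own, outside the plugin. It assumes ezyang/htmlpurifier is installed and that Composer's autoloader is already loaded (as it is inside Drupal). The function name cleanCopy() and the shortened allow-list are illustrative only, not the code from the repository, and the link-target options are left out to keep it short. Turning off the definition cache is also an assumption made so the sketch runs without a writable cache directory; in a real migration you may prefer to point Cache.SerializerPath at a writable location instead.

```php
<?php

declare(strict_types=1);

/**
 * Strips unwanted markup from a chunk of copy (hypothetical helper).
 */
function cleanCopy(string $html): string {
  $config = HTMLPurifier_Config::createDefault();

  // Assumption: skip the on-disk definition cache so this runs without
  // a writable cache directory.
  $config->set('Cache.DefinitionImpl', null);

  // Drop empty elements, including ones that only hold &nbsp;.
  $config->set('AutoFormat.RemoveEmpty', true);
  $config->set('AutoFormat.RemoveEmpty.RemoveNbsp', true);

  // Shortened allow-list for illustration. Tags and attributes that are
  // not listed are stripped; the text inside them is kept.
  $config->set('HTML.Allowed', 'a[href|title|rel|target],em,i,li,ol,p,strong,ul');

  $purifier = new HTMLPurifier($config);

  return $purifier->purify($html);
}
```

Calling cleanCopy() on something like `<div class="x"><p style="font-weight:bold;"><span>Hello</span></p></div>` should leave the paragraph and its text while dropping the wrapper div, the span, and the inline style. Inside the plugin, the same calls would sit in transform(), with the result returned in place of $value.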

Occasionally, links coming from Outlook will be obscured with a urldefense.com URL (as in our case). This is obviously not ideal or user-friendly for general links on a web page. Trying to determine the endpoint of these URLs is outside the scope of this post, but for good measure, let’s flag them as well.

To accomplish all this and make things a little easier on ourselves, let’s create an array of all the things we want to flag so we have a single place to manage them. Having a key-value pair of the undesirables means when we flag them, we can also provide what type it is.

$patterns = [
  'urls' => '/https:\/\/urldefense\.com\S*/i',
  'hasTable' => '/<table\S*/i',
];
    
$types = [];

// ...

// Check whether any of the $patterns match $purified; if they do,
// log a Drupal message.
foreach ($patterns as $type => $pattern) {
  if (preg_match($pattern, $purified)) {
    $types[] = $type;
  }
}

if (!empty($types)) {
  \Drupal::logger('ten_mig')->warning('Node ID %id has a pattern violation of type(s): %types', [
    '%id' => $row->getSourceProperty('nid'),
    '%types' => implode(', ', $types),
  ]);
}
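
As a quick sanity check of the flagging logic outside a migration run, here is a small standalone sketch. The helper name findFlaggedTypes() and the sample markup are made up for illustration, and the regular expressions are written on the assumption that the intent is to match a urldefense.com link and an opening table tag.

```php
<?php

declare(strict_types=1);

/**
 * Returns the keys of every pattern that matches the given markup.
 *
 * @param string $html
 *   The (already purified) markup to inspect.
 * @param array<string, string> $patterns
 *   Regular expressions keyed by the type of issue they detect.
 *
 * @return string[]
 *   The matching type keys, in the order the patterns were given.
 */
function findFlaggedTypes(string $html, array $patterns): array {
  $types = [];
  foreach ($patterns as $type => $pattern) {
    if (preg_match($pattern, $html) === 1) {
      $types[] = $type;
    }
  }
  return $types;
}

$patterns = [
  'urls' => '/https:\/\/urldefense\.com\S*/i',
  'hasTable' => '/<table\S*/i',
];

// Hypothetical sample: one obscured link and one table.
$sample = '<p><a href="https://urldefense.com/lorem-ipsum">Link</a></p>'
  . '<table><tbody><tr><td>a table</td></tr></tbody></table>';

echo implode(', ', findFlaggedTypes($sample, $patterns)), PHP_EOL;
// Prints: urls, hasTable
```

Keeping the check in a small function like this makes it easy to add a new pattern to the array and confirm it fires before running the full migration.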

Bring it Home

With our plugin in place, we can now use it in our migration YAML file. Below is a snippet of how we can use it in a migration.

process:
  body/value:
    plugin: ten_mig_copy_cleanup
    source: body
  body/format:
    plugin: default_value
    default_value: full_html

Depending on which tags you’re allowing, the body/format can remain basic_html or, as in our case and because of the inclusion of tables, be set to full_html. Of course, you may also choose to customize your site’s text formats to allow only the tags you’re allowing in the plugin.

And that’s it! We’ve cleaned up our data and transformed it into something clean, lean, and manageable.

PS: The GitHub repository has been updated with the full code for this plugin, updated news.json sample data, an optional ‘News’ content type, and comments galore.