Importing archives into WordPress

I’m starting with the import process not because it is an exceptionally good place to start when preparing to move a site to WordPress but because it’s one of the few things I got right from the get-go when I transferred my first news site to WordPress.

The best way to import content into WordPress, in my experience, is using WordPress’s XML import.

WordPress’s XML files follow an easy-to-grasp but powerful structure that, in general, goes something like this:

[xml]<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:excerpt="http://wordpress.org/export/1.0/excerpt/"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.0/">
<channel>
<title>My example site</title>
<pubDate>Thu, 28 May 2009 16:06:40 +0000</pubDate>

<item>
<category domain="category" nicename="my-category"><![CDATA[My Category]]></category>
<category domain="tag" nicename="my-tag"><![CDATA[My Tag]]></category>
<title><![CDATA[My Post Title]]></title>
<dc:creator><![CDATA[My Name]]></dc:creator>
<pubDate>Tue, 06 Jul 2010 10:51:32 +0000</pubDate>
<guid isPermaLink="false"></guid>
<content:encoded><![CDATA[My post content.]]></content:encoded>
<excerpt:encoded><![CDATA[My post excerpt.]]></excerpt:encoded>
<wp:post_date>2010-07-06 10:51:32</wp:post_date>
<wp:post_date_gmt>2010-07-06 10:51:32</wp:post_date_gmt>
<wp:postmeta>
<wp:meta_key>my_post_meta_key</wp:meta_key>
<wp:meta_value>My Post Meta Value</wp:meta_value>
</wp:postmeta>
</item>
</channel>
</rss>
[/xml]


That’s a fairly simple usage of WordPress XML, and when we imported our content from the Bangor Daily News we did a lot more.

For example, we imported all our posts with a hidden post meta (_old_id) value of the article’s ID in our old CMS. Then, we used the CP Redirect plugin as a template for a new plugin to redirect people clicking on old links to the new URL.
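To make the hidden post meta idea concrete, here is a minimal Python sketch of how one item carrying an _old_id value might be generated with nothing but the standard library. It is an illustration, not the script we ran: the function name `build_item`, the layout of the `article` dict and the WXR 1.0 namespace URIs are all assumptions made for the example. Note that ElementTree escapes text instead of wrapping it in CDATA, which is equivalent as far as an XML parser is concerned.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used by WXR 1.0 exports (assumed version for this sketch).
WP = "http://wordpress.org/export/1.0/"
CONTENT = "http://purl.org/rss/1.0/modules/content/"
DC = "http://purl.org/dc/elements/1.1/"

# Register prefixes so the output reads <wp:...> rather than <ns0:...>.
ET.register_namespace("wp", WP)
ET.register_namespace("content", CONTENT)
ET.register_namespace("dc", DC)


def build_item(article):
    """Build an <item> element from one legacy article (hypothetical dict layout)."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = article["title"]
    ET.SubElement(item, f"{{{DC}}}creator").text = article["author"]
    ET.SubElement(item, f"{{{CONTENT}}}encoded").text = article["body"]
    ET.SubElement(item, f"{{{WP}}}post_date").text = article["date"]

    # Keep the old CMS ID as post meta. The leading underscore is what
    # makes WordPress treat the key as hidden.
    meta = ET.SubElement(item, f"{{{WP}}}postmeta")
    ET.SubElement(meta, f"{{{WP}}}meta_key").text = "_old_id"
    ET.SubElement(meta, f"{{{WP}}}meta_value").text = str(article["old_id"])
    return item


example = build_item({
    "title": "My Post Title",
    "author": "My Name",
    "body": "My post content.",
    "date": "2010-07-06 10:51:32",
    "old_id": 12345,  # made-up ID for illustration
})
xml_text = ET.tostring(example, encoding="unicode")
```

A redirect plugin can then look up the post whose _old_id matches the ID in an old URL and send the reader to the new permalink.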

We also found that using the <dc:creator> tag for every byline quickly overwhelmed us. As with any newspaper, there are thousands of people who have written just one or a few articles for us, and we didn’t want to create accounts for all of them. Instead, we created a whitelist of authors we wanted to come in as users (basically just BDN staff, frequent contributors and freelancers), and the rest of the posts came in with a default username and with the author’s name in a post meta field named _byline.
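The whitelist logic is simple enough to sketch in a few lines of Python. This is only an illustration under assumptions: the names in `AUTHOR_WHITELIST`, the `staff` default login and the function `resolve_author` are invented for the example and are not taken from our script.

```python
# Hypothetical whitelist mapping printed bylines to WordPress logins.
AUTHOR_WHITELIST = {
    "Jane Reporter": "jreporter",
    "Sam Freelancer": "sfreelancer",
}

# Hypothetical catch-all account for everyone not on the whitelist.
DEFAULT_LOGIN = "staff"


def resolve_author(byline, whitelist=AUTHOR_WHITELIST, default_login=DEFAULT_LOGIN):
    """Return (login for <dc:creator>, extra post meta) for one byline."""
    name = byline.strip()
    login = whitelist.get(name)
    if login is not None:
        # Whitelisted authors come in as real users; no extra meta needed.
        return login, {}
    # Everyone else is filed under the default user, with the original
    # name preserved in the _byline post meta field.
    return default_login, {"_byline": name}
```

The theme can then print the _byline value when it exists and fall back to the user’s display name when it doesn’t.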

We don’t embed images in posts. Rather, we query for all images attached to the post and display them at the top of the post and in the sidebar (more about this in a later post). So we imported the images natively, as items of their own in the XML, so they would become attachments. The WordPress importer can copy all attachments onto the server as it runs, so we didn’t have to worry about getting all the images off our old server. Importing the images is just as easy as importing posts: <wp:post_parent> is set to the ID of the parent post, <wp:status> is set to inherit and <wp:post_type> is set to attachment. The image path goes in <wp:attachment_url>, and the caption goes in <excerpt:encoded>.
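Here is a minimal Python sketch of an attachment item built from just the fields named above. The function name `build_attachment`, the example URL and the WXR 1.0 namespace URIs are assumptions for illustration; a real item would likely carry a title and dates as well.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used by WXR 1.0 exports (assumed version for this sketch).
WP = "http://wordpress.org/export/1.0/"
EXCERPT = "http://wordpress.org/export/1.0/excerpt/"

ET.register_namespace("wp", WP)
ET.register_namespace("excerpt", EXCERPT)


def build_attachment(parent_id, image_url, caption):
    """Build an <item> that imports one image as an attachment of a post."""
    item = ET.Element("item")
    # The caption travels in the excerpt field.
    ET.SubElement(item, f"{{{EXCERPT}}}encoded").text = caption
    # Tie the image to its post and mark it as an attachment.
    ET.SubElement(item, f"{{{WP}}}post_parent").text = str(parent_id)
    ET.SubElement(item, f"{{{WP}}}status").text = "inherit"
    ET.SubElement(item, f"{{{WP}}}post_type").text = "attachment"
    # Where the importer should fetch the file from.
    ET.SubElement(item, f"{{{WP}}}attachment_url").text = image_url
    return item


attachment = build_attachment(
    parent_id=42,  # made-up post ID
    image_url="http://example.com/images/photo.jpg",  # placeholder URL
    caption="My photo caption.",
)
attachment_xml = ET.tostring(attachment, encoding="unicode")
```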

We split the XML into files of 1,000 posts each, which left us with more than 100 XML files. We also imported everything onto a local machine and then pushed the database back up to our web server. All told, importing everything took several solid days of work.
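Splitting the archive into batches is the easy part. A minimal Python sketch, assuming the posts are available as any iterable; the helper name `chunked` and the `import-001.xml` file naming are made up for the example:

```python
from itertools import islice


def chunked(items, size=1000):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def plan_files(posts, size=1000, prefix="import"):
    """Pair each batch with a hypothetical output file name."""
    return [
        (f"{prefix}-{number:03d}.xml", batch)
        for number, batch in enumerate(chunked(posts, size), start=1)
    ]


# Stand-in data: 2,500 fake post IDs instead of real articles.
plan = plan_files(range(2500))
file_names = [name for name, _ in plan]
batch_sizes = [len(batch) for _, batch in plan]
```

Each batch would then be wrapped in its own <rss> and <channel> elements and written out as a separate file, so a failed import only costs you one batch.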

If you’re working on a site much larger than ours, you might consider importing posts directly into WordPress using the API, but to be honest I’m not sure how much overhead that would save.

The script, which you’ll have to modify a bit but hopefully not too much, is on GitHub.

4 thoughts on “Importing archives into WordPress”

  1. This is great stuff! While it is fairly easy to import archives like this, I do share your pain as we suffered through converting 20+ sites to WP. Love the idea of this blog, really wish we could have started with version 3.0.

    Have you encountered any performance issues during peak times yet? If so, care to share what measures you took to keep your servers from tilting?

    We had to do some major caching work to keep our sites up and running. Learned fairly quickly that a single blog with 100k posts, 700k meta during high traffic times was not a good combination!

    Again, thanks for sharing!

    1. Dave,
      We use some fairly industry-standard caching tools, and haven’t had any problems with high traffic so far. Varnish, Memcached, HyperDB and Litespeed are the big tools we use. No matter the CMS, running a large site without caching isn’t recommended.
