Converting WordPress Posts to Lektor

written by Mickey on 2019-02-19

As promised in the previous installment of this column, I have now imported all my old blog posts from WordPress. In case you need to do the same, here's how I did it:

1. Export your posts from WordPress. Log in to your WordPress administration interface, then click Tools → Export Data. Select the content you want to export (blog posts in my case) and WordPress will hand you an XML file with all your content to download. Here's a fragment for a single post without comments:

<item>
        <title>A New Blog</title>
        <link>https://archive.vanille.de/a-new-blog/</link>
        <pubDate>Wed, 14 Jun 2006 02:31:04 +0000</pubDate>
        <dc:creator><![CDATA[mickey]]></dc:creator>
        <guid isPermaLink="false">http://www.vanille-media.de/site/?p=5</guid>
        <description></description>
        <content:encoded><![CDATA[Finally, in coincidence with my new site <a title="VanilleMedia" href="http://www.vanille-media.de">VanilleMedia</a>, I started a new blog. It's not that I was unsatisfied with my last handcoded one, but these days it looks like there is a tendency to use all those nice planet sites -- i.e. <a title="planet.linuxtogo.org" href="http://planet.linuxtogo.org">planet.linuxtogo.org</a> or <a title="planet.maemo.org" href="http://planet.maemo.org">planet.maemo.org</a> -- syndicating blogs from different places, which needs a standardized XML format. I really didn't want to reinvent the wheel here, so after evaluating a lot of content management systems and blog packages, I settled on using wordpress for the complete site.
This also marks the start of me blogging in english -- I guess blogging in german wouldn't be all that useful for most of those planet sites.]]></content:encoded>
        <excerpt:encoded><![CDATA[]]></excerpt:encoded>
        <wp:post_id>95</wp:post_id>
        <wp:post_date><![CDATA[2006-06-14 04:31:04]]></wp:post_date>
        <wp:post_date_gmt><![CDATA[2006-06-14 02:31:04]]></wp:post_date_gmt>
        <wp:comment_status><![CDATA[open]]></wp:comment_status>
        <wp:ping_status><![CDATA[open]]></wp:ping_status>
        <wp:post_name><![CDATA[a-new-blog]]></wp:post_name>
        <wp:status><![CDATA[publish]]></wp:status>
        <wp:post_parent>0</wp:post_parent>
        <wp:menu_order>0</wp:menu_order>
        <wp:post_type><![CDATA[post]]></wp:post_type>
        <wp:post_password><![CDATA[]]></wp:post_password>
        <wp:is_sticky>0</wp:is_sticky>
        <category domain="category" nicename="general"><![CDATA[general]]></category>
</item>
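
To get a feel for what a converter has to pull out of that export, here is a minimal sketch that reads such an item with Python's standard xml.etree.ElementTree module. The namespace URIs are assumptions (a real export declares them on its rss element, and they may differ between WordPress versions), and the embedded sample is a shortened stand-in for the fragment above, not the full file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check the <rss> element of your own export file.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Shortened stand-in for a real export file.
SAMPLE = """<rss version="2.0"
    xmlns:wp="http://wordpress.org/export/1.2/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
    <title>A New Blog</title>
    <dc:creator><![CDATA[mickey]]></dc:creator>
    <content:encoded><![CDATA[Finally, I started a new blog.]]></content:encoded>
    <wp:post_date><![CDATA[2006-06-14 04:31:04]]></wp:post_date>
    <wp:post_name><![CDATA[a-new-blog]]></wp:post_name>
    <wp:status><![CDATA[publish]]></wp:status>
    <wp:post_type><![CDATA[post]]></wp:post_type>
</item>
</channel>
</rss>"""


def published_posts(xml_text):
    """Return one dict per published blog post found in the export."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        # Skip pages, attachments and anything that is not a published post.
        if item.findtext("wp:post_type", namespaces=NS) != "post":
            continue
        if item.findtext("wp:status", namespaces=NS) != "publish":
            continue
        posts.append({
            "title": item.findtext("title"),
            "author": item.findtext("dc:creator", namespaces=NS),
            "slug": item.findtext("wp:post_name", namespaces=NS),
            "date": item.findtext("wp:post_date", namespaces=NS),
            "html": item.findtext("content:encoded", namespaces=NS),
        })
    return posts


posts = published_posts(SAMPLE)
print(posts[0]["slug"], posts[0]["date"])
```

The title, author, date and slug collected here are the same values that show up in the frontmatter of the markdown files in the next step.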

2. Convert the XML file into individual markdown files. A number of tools can do this. I settled on pelican-import, which is part of the Pelican static site generator. Call it like this:

pelican-import -m markdown --wpfile -o posts vanille-archive.xml

and it will convert all your blog posts into individual files like a-new-blog.md:

Title: A New Blog
Date: 2006-06-14 04:31
Author: mickey
Category: general
Slug: a-new-blog
Status: published

Finally, in coincidence with my new site [VanilleMedia](http://www.vanille-media.de "VanilleMedia"), I
started a new blog. It's not that I was unsatisfied with my last handcoded one, but these days it
looks like there is a tendency to use all those nice planet sites -- i.e.
[planet.linuxtogo.org](http://planet.linuxtogo.org "planet.linuxtogo.org") or
[planet.maemo.org](http://planet.maemo.org "planet.maemo.org") -- syndicating blogs from different
places, which needs a standardized XML format. I really didn't want to reinvent the wheel here, so
after evaluating a lot of content management systems and blog packages, I settled on using wordpress
for the complete site. This also marks the start of me blogging in english -- I guess blogging in german
wouldn't be all that useful for most of those planet sites.

Note that this process loses the comments from your readers. These comments are super-important to me, so I will soon start working on enhancing pelican-import to render the comments (into a separate markdown file) as well.
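
Until then, the comments are still sitting in the export file, nested inside each item. The following is only a rough sketch of how they could be pulled out and rendered as markdown; it is not part of pelican-import. The namespace URI is an assumption, the sample comment is made up for illustration, and comments_as_markdown is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

WP = {"wp": "http://wordpress.org/export/1.2/"}  # assumed namespace URI

# Made-up item with a single comment, for illustration only.
SAMPLE_ITEM = """<item xmlns:wp="http://wordpress.org/export/1.2/">
    <title>A New Blog</title>
    <wp:post_name><![CDATA[a-new-blog]]></wp:post_name>
    <wp:comment>
        <wp:comment_author><![CDATA[Example Reader]]></wp:comment_author>
        <wp:comment_date><![CDATA[2006-06-15 09:00:00]]></wp:comment_date>
        <wp:comment_content><![CDATA[Welcome back!]]></wp:comment_content>
        <wp:comment_approved><![CDATA[1]]></wp:comment_approved>
    </wp:comment>
</item>"""


def comments_as_markdown(item):
    """Render the approved comments of one <item> as a markdown string."""
    sections = []
    for comment in item.findall("wp:comment", WP):
        # Unapproved comments (spam, pending moderation) are left out.
        if comment.findtext("wp:comment_approved", namespaces=WP) != "1":
            continue
        author = comment.findtext("wp:comment_author", default="anonymous", namespaces=WP)
        date = comment.findtext("wp:comment_date", default="", namespaces=WP)
        text = comment.findtext("wp:comment_content", default="", namespaces=WP)
        sections.append(f"**{author}** wrote on {date}:\n\n{text.strip()}\n")
    return "\n".join(sections)


item = ET.fromstring(SAMPLE_ITEM)
markdown = comments_as_markdown(item)
print(markdown)
```

Writing the result to a file named after wp:post_name would keep each post's comments next to the post itself.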

3. Adjust the markdown files to make them work with Lektor. Lektor expects a slightly different format for the frontmatter, so I wrote a quick'n'dirty Python script to adjust the necessary things. Here's the script pelican2lektor.py:

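import os

# Convert every markdown file that pelican-import wrote into the current directory.
# Filtering on ".md" skips this script itself and the "lektor" output directory.
postnames = sorted( name for name in os.listdir() if name.endswith( ".md" ) )
for postname in postnames:
    print( f"Converting {postname}..." )
    # One directory per post, named after the file without its ".md" extension.
    lektordirname = os.path.join( "lektor", os.path.splitext( postname )[0] )
    lektorfilename = os.path.join( lektordirname, "contents.lr" )
    os.makedirs( lektordirname, exist_ok=True )

    with open( postname, encoding="utf-8" ) as infile:
        postlines = infile.readlines()

    # The Pelican header is the block of "Key: value" lines before the first blank line.
    headerlength = next( ( i for i, line in enumerate( postlines ) if not line.strip() ), len( postlines ) )
    meta = dict()
    for line in postlines[:headerlength]:
        key, _, value = line.partition( ":" )
        meta[key.strip().lower()] = value.strip()

    # Keep only the date part; the time of day is dropped.
    date = meta["date"].split()[0]

    post = """title: %s
---
author: %s
---
pub_date: %s
---
body:
%s"""

    # Everything after the blank line that ends the header is the post body.
    body = "".join( postlines[headerlength + 1:] )
    post = post % ( meta["title"], meta["author"], date, body )

    with open( lektorfilename, "w", encoding="utf-8" ) as outfile:
        outfile.write( post )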

Running this creates a lektor directory with the appropriate subdirectories – one for every blog post – and the corresponding contents.lr files:

title: A New Blog
---
author: mickey
---
pub_date: 2006-06-14
---
body:
Finally, in coincidence with my new site [VanilleMedia](http://www.vanille-media.de "VanilleMedia"), I started a new blog. It's not that I was unsatisfied with my last handcoded one, but these days it looks like there is a tendency to use all those nice planet sites -- i.e. [planet.linuxtogo.org](http://planet.linuxtogo.org "planet.linuxtogo.org") or [planet.maemo.org](http://planet.maemo.org "planet.maemo.org") -- syndicating blogs from different places, which needs a standardized XML format. I really didn't want to reinvent the wheel here, so after evaluating a lot of content management systems and blog packages, I settled on using wordpress for the complete site.
This also marks the start of me blogging in english -- I guess blogging in german wouldn't be all that useful for most of those planet sites.
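
With a few hundred converted posts it is worth checking that none of them came out with an empty field. Here is a small sketch of such a check. It uses a deliberately simplified reader that splits the text at the --- separator lines, not Lektor's own parser, and read_contents_lr is a hypothetical helper name; the sample is a shortened version of the file above.

```python
def read_contents_lr(text):
    """Split contents.lr text into a dict of field name -> value.

    Simplified on purpose: fields are separated by lines consisting of '---',
    and everything after the first ':' of a field is its value.
    """
    fields = {}
    block = []
    # The extra '---' at the end flushes the last field.
    for line in text.splitlines() + ["---"]:
        if line.rstrip() == "---":
            if "".join(block).strip():
                key, _, value = "\n".join(block).partition(":")
                fields[key.strip()] = value.strip()
            block = []
        else:
            block.append(line)
    return fields


SAMPLE = """title: A New Blog
---
author: mickey
---
pub_date: 2006-06-14
---
body:
Finally, I started a new blog.
"""

fields = read_contents_lr(SAMPLE)
# List every expected field that is absent or empty.
missing = [name for name in ("title", "author", "pub_date", "body") if not fields.get(name)]
print(fields["pub_date"], missing)
```

Running this over every contents.lr in the lektor directory and printing the files with a non-empty missing list gives a quick overview of posts that need manual attention.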

As long as some content errors remain (missing comments, broken links, missing images, bogus format statements, etc.), which I will hopefully fix over the next few days, I'm going to leave the archive at archive.vanille.de intact.