hallowelt/migrate-confluence

Migrate Confluence XML export to MediaWiki import data

This is a command line tool to convert the contents of a Confluence space into a MediaWiki import data format.

Prerequisites

  1. PHP >= 8.2 with the xml extension must be installed
  2. pandoc >= 3.1.6 must be installed and available in the PATH (https://pandoc.org/installing.html).
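
The prerequisites above can be verified before starting. The following is a minimal sketch, not part of the tool: the helper names version_ge and check_prereqs are hypothetical, and the version comparison assumes a `sort` that supports `-V` (as in GNU coreutils).

```shell
#!/usr/bin/env bash

# Succeeds if version $1 is greater than or equal to version $2.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Checks PHP (>= 8.2, with the xml extension) and pandoc (>= 3.1.6).
check_prereqs() {
  local php_version pandoc_version

  command -v php >/dev/null || { echo "php not found in PATH" >&2; return 1; }
  command -v pandoc >/dev/null || { echo "pandoc not found in PATH" >&2; return 1; }

  php_version="$(php -r 'echo PHP_VERSION;')"
  # The first line of `pandoc --version` looks like "pandoc 3.1.6".
  pandoc_version="$(pandoc --version | head -n 1 | awk '{print $2}')"

  version_ge "$php_version" 8.2 \
    || { echo "PHP $php_version is older than 8.2" >&2; return 1; }
  php -m | grep -qix 'xml' \
    || { echo "PHP xml extension is not loaded" >&2; return 1; }
  version_ge "$pandoc_version" 3.1.6 \
    || { echo "pandoc $pandoc_version is older than 3.1.6" >&2; return 1; }
}
```

Running check_prereqs in the shell that will later run the migration should print nothing and return 0 when all requirements are met.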

Installation

  1. Download migrate-confluence.phar from https://github.com/hallowelt/migrate-confluence/releases/tag/latest
  2. Make sure the file is executable, e.g. by running chmod +x migrate-confluence.phar
  3. Move migrate-confluence.phar to /usr/local/bin/migrate-confluence (or somewhere else in the PATH)

Workflow

Export "space" from Confluence

  1. Create an export of your Confluence space

  2. Save it to a location that is accessible by this tool (e.g. /tmp/confluence/input/Confluence-export.zip)
  3. Extract the ZIP file (e.g. to /tmp/confluence/input/Confluence-export)
    1. The folder should contain the files entities.xml and exportDescriptor.properties, as well as the folder attachments
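
To catch an incomplete export early, the expected folder layout can be checked after extraction. This is a small sketch under the assumptions stated in the comments; the function name check_export_dir is hypothetical and not part of the tool.

```shell
#!/usr/bin/env bash

# Verifies that an extracted Confluence export contains the files and the
# folder listed above. Reports every missing entry and returns non-zero
# if anything is missing.
check_export_dir() {
  local dir="$1" status=0

  [ -f "$dir/entities.xml" ] \
    || { echo "missing file: $dir/entities.xml" >&2; status=1; }
  [ -f "$dir/exportDescriptor.properties" ] \
    || { echo "missing file: $dir/exportDescriptor.properties" >&2; status=1; }
  [ -d "$dir/attachments" ] \
    || { echo "missing folder: $dir/attachments" >&2; status=1; }

  return "$status"
}
```

With the example paths from above, a possible call is `unzip /tmp/confluence/input/Confluence-export.zip -d /tmp/confluence/input/Confluence-export && check_export_dir /tmp/confluence/input/Confluence-export`. This assumes the ZIP has no extra top-level folder; if it does, pass that inner folder to the check instead.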

Migrate the contents

  1. Create the "workspace" directory (e.g. /tmp/confluence/workspace/ )
  2. From the parent directory (e.g. /tmp/confluence/ ), run the migration commands
    1. Run migrate-confluence analyze --src input/ --dest workspace/ to create "working files". After the script has run, you can review those files and adjust them if required (e.g. to apply structural changes).
    2. Run migrate-confluence extract --src input/ --dest workspace/ to extract all contents, such as wiki page contents, attachments and images, into the workspace
    3. Run migrate-confluence convert --src workspace/ --dest workspace/ (yes, --src workspace/ ) to convert the wiki page contents from Confluence Storage XML to MediaWiki WikiText
    4. Run migrate-confluence compose --src workspace/ --dest workspace/ (yes, --src workspace/ ) to create importable data

If you re-run the scripts, you will need to clean up the "workspace" directory first!
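
The four commands and the cleanup can be combined into one wrapper. The following is a sketch only: the function name run_migration is hypothetical, and it runs all steps without pausing, so any manual changes to the working files created by analyze are not possible in this form (and an existing workspace is deleted).

```shell
#!/usr/bin/env bash

# Runs the complete migration from the parent directory given as $1
# (e.g. /tmp/confluence), which must contain the "input" directory.
# WARNING: removes an existing "workspace" directory before starting.
run_migration() {
  local base="$1"
  (
    cd "$base" &&
    rm -rf -- workspace &&
    mkdir -p workspace &&
    migrate-confluence analyze --src input/ --dest workspace/ &&
    migrate-confluence extract --src input/ --dest workspace/ &&
    migrate-confluence convert --src workspace/ --dest workspace/ &&
    migrate-confluence compose --src workspace/ --dest workspace/
  )
}
```

The steps are chained with `&&` so that a failing step stops the run; for example, `run_migration /tmp/confluence` returns the exit status of the first failing command. To edit the working files between analyze and extract, run the commands one by one instead.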

Import into MediaWiki

  1. Copy the "workspace/result" directory (e.g. /tmp/confluence/workspace/result/) to your target wiki server (e.g. /tmp/result)
  2. Go to your MediaWiki installation directory
  3. Make sure you have the target namespaces set up properly. See workspace/space-id-to-prefix-map.php for reference.
  4. Make sure $wgFileExtensions is set up properly. See workspace/attachment-file-extensions.php for reference.
  5. Use php maintenance/importImages.php /tmp/result/images/ to first import all attachment files and images
  6. Use php maintenance/importDump.php /tmp/result/output.xml to import the actual pages

You may need to update your MediaWiki search index afterwards.
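
The two import commands can be wrapped so that the files are always imported before the pages. This is a sketch under assumptions: the function name import_result is hypothetical, and the result directory must be given as an absolute path because the function changes into the MediaWiki directory.

```shell
#!/usr/bin/env bash

# Imports a "result" directory into MediaWiki.
#   $1: MediaWiki installation directory
#   $2: absolute path to the copied result directory (e.g. /tmp/result)
import_result() {
  local wiki_dir="$1" result_dir="$2"

  [ -d "$result_dir/images" ] \
    || { echo "missing folder: $result_dir/images" >&2; return 1; }
  [ -f "$result_dir/output.xml" ] \
    || { echo "missing file: $result_dir/output.xml" >&2; return 1; }

  (
    cd "$wiki_dir" &&
    # Attachments and images first, then the actual pages.
    php maintenance/importImages.php "$result_dir/images/" &&
    php maintenance/importDump.php "$result_dir/output.xml"
  )
}
```

The check at the top avoids a half-finished import when the copy to the wiki server was incomplete. The namespace and $wgFileExtensions setup from steps 3 and 4 still has to be done by hand before calling it.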

Config file

It is possible to use a YAML file to configure the commands analyze, extract and convert. For an example, see /doc/config.sample.yaml. The configuration file can be applied by adding the option --config /tmp/config.yaml.

Not all parameters of config.sample.yaml have to be present in the config file. For any parameter that is missing, the default will be used.
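
As an illustration, the same config file can be passed to each of the three commands that support it. This is a sketch: the function name migrate_with_config is hypothetical, placing --config after the other options is an assumption, and the function is meant to be run from the parent directory of input/ and workspace/.

```shell
#!/usr/bin/env bash

# Runs analyze, extract and convert with the config file given as $1.
migrate_with_config() {
  local config="$1"

  [ -f "$config" ] || { echo "config file not found: $config" >&2; return 1; }

  migrate-confluence analyze --src input/ --dest workspace/ --config "$config" &&
  migrate-confluence extract --src input/ --dest workspace/ --config "$config" &&
  migrate-confluence convert --src workspace/ --dest workspace/ --config "$config"
}
```

A call such as `migrate_with_config /tmp/config.yaml` would then be followed by the compose step as described in the workflow.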

NSFileRepo compatibility

The tool is compatible with the MediaWiki extension NSFileRepo (https://www.mediawiki.org/wiki/Extension:NSFileRepo), which restricts access to files and images to a given set of user groups associated with protected namespaces.

If NSFileRepo is used, the images cannot be uploaded with the script maintenance/importImages.php; use extensions/NSFileRepo/maintenance/importFiles.php instead.

Example: php extensions/NSFileRepo/maintenance/importFiles.php /tmp/result/images/

User spaces

In Confluence, user spaces are protected. In MediaWiki, this is not possible for the User namespace. Therefore, user spaces are migrated to a namespace User<username>, which can be protected in BlueSpice for MediaWiki.

Included MediaWiki wikitext templates

  • AttachmentsSectionEnd
  • AttachmentsSectionStart
  • Excerpt
  • Info
  • InlineComment
  • Layout
  • Layouts.css
  • Note
  • Panel
  • RecentlyUpdated
  • SubpageList
  • SubpageListRow
  • Tip
  • Warning

Be aware that those pages may be overwritten by the import if they already exist in the target wiki.

Included upload files

  • Icon-info.svg
  • Icon-note.svg
  • Icon-tip.svg
  • Icon-warning.svg

Be aware that those files may be overwritten by the import if they already exist in the target wiki.

Required MediaWiki extensions

The output generated by the tool contains certain elements that need additional extensions to be enabled.

  1. TemplateStyles
  2. ParserFunctions
  3. SimpleTasks (https://www.mediawiki.org/wiki/Extension:SimpleTasks)
  4. Semantic MediaWiki
  5. HeaderTabs

Manual post-import maintenance

Cleanup Categories

If the tool cannot migrate some content or functionality, it will create one of the following categories, so you can manually fix the issues after the import:

  • Broken_link
  • Broken_user_link
  • Broken_page_link
  • Broken_image
  • Broken_layout
  • Broken_macro/<macro-name>

Not migrated

  • User identities
  • Comments
  • Various macros
  • Various layouts
  • Blog posts
  • Files of a space which cannot be assigned to a page

Creating a build

  1. Clone this repo
  2. Run composer update --no-dev
  3. Run box compile to actually create the PHAR file in dist/. See also https://github.com/humbug/box

TODO

  • Reduce multiple linebreaks (<br />) to one
  • Remove line breaks and arbitrary formatting (e.g. <b>) from headings
  • Mask external images (<img />)
  • Preserve filename of "Broken_attachment"
  • Merge multiple <code> lines into <pre>
  • Remove bold/italic formatting from wikitext headings (e.g. === '''Some heading''' ===)
  • Fix unconverted HTML lists in wikitext (e.g. <ul><li>==== Lorem ipsum ====</li><li>'''<span class="confluence-link"> </span>[[Media:Some_file.pdf]]'''</li></ul><ul>)
  • Remove empty confluence storage format fragments (e.g. <span class="confluence-link"> </span>, <span class="no-children icon">)