Generate a redirect map from two sitemaps for website migration.

jsphpl/redirect-mapper

redirect-mapper

Takes two lists of URLs and outputs a mapping that assigns each entry in list 1 an item from list 2 along with a score that indicates how likely the two refer to the same thing.

Use case

This script was created to automatically generate a map of redirects when migrating a website. The input lists would be sitemaps of the old and the new website, each a plain text file containing one URL per line. The URLs are required to be "pretty", meaning not just /post.php?id=123 but rather something like /blog/why-wordpress-sucks, and should ideally have their protocol and domain parts removed.
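Stripping the protocol and domain can be done before the lists are passed to map.py. The following is a minimal sketch using only the standard library; the helper name `to_path` and the example URLs are made up for illustration and are not part of this repository.

```python
from urllib.parse import urlsplit


def to_path(url: str) -> str:
    """Drop scheme and host, keeping the path (and the query string, if any)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"  # a bare domain maps to the site root
    return f"{path}?{parts.query}" if parts.query else path


# Hypothetical sitemap entries, one URL per item.
urls = [
    "https://old-website.com/blog/hello-world",
    "http://old-website.com/about/",
    "https://old-website.com",
]

for url in urls:
    print(to_path(url))
```

Entries that are already plain paths pass through unchanged, so the helper can be applied to a mixed list.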

It can of course be used as a generic tool to fuzzy match two sets of strings. It uses the Levenshtein distance metric as implemented by python-Levenshtein.
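To illustrate the idea of scoring candidates by edit distance, here is a rough, dependency-free sketch. It does not reproduce map.py: the script relies on python-Levenshtein, and how it turns a distance into a score is not documented here. The normalisation used below (one minus distance divided by the longer length) and the function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, lower for worse matches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(item, candidates):
    """Return the (candidate, score) pair with the highest similarity."""
    return max(((c, similarity(item, c)) for c in candidates), key=lambda p: p[1])


new_urls = ["/about", "/blog/hello-world-2", "/contact"]
match, score = best_match("/blog/hello-world", new_urls)
print(f"{match} ({score:.2f})")
```

Note that `best_match` always returns something, even when every candidate is a poor fit, which is the behaviour the warning below describes.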

Warning: Always check the results manually. Never trust the output of the script blindly. It will assign each item in list 1 one item from list 2, even if it's a really bad match.

map.py usage

  1. Clone this repository: git clone https://github.com/jsphpl/redirect-mapper
  2. Enter it: cd redirect-mapper
  3. Install dependencies: python setup.py install
  4. Use it:
$ python map.py [-h] [-t VALUE] [-c PATH] [-d] list1 list2

Generates a redirect map from two sitemaps for website migration.

By default, all matches are dumped on the standard output. If an item
from list1 is exactly contained in list2, it will be assigned right
away, without calculating distance or checking for ambiguity.

Issues & Documentation: https://github.com/jsphpl/redirect-mapper

positional arguments:
  list1                 List of target items for which to find matches. (1 item per line)
  list2                 List of search items on which to search for matches. (1 item per line)

optional arguments:
  -h, --help            show this help message and exit
  -t VALUE, --threshold VALUE
                        Range within which two scores are considered equal. (default: 0.05)
  -c PATH, --csv PATH   If specified, the output will be formatted as CSV and written to PATH
  -d, --drop-exact      If specified, exact matches will be omitted from the output

Examples

Generate a list of redirects

Say you want to know where to redirect all the URLs from old_sitemap.txt. Pass it as the first argument, like so:

python map.py old_sitemap.txt new_sitemap.txt

Adjust ambiguity threshold

To influence the level at which two matches are considered equally good, use the -t VALUE argument.

python map.py -t 0.1 old_sitemap.txt new_sitemap.txt
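One way to picture the threshold: if the runner-up score lies within VALUE of the best score, the two candidates count as equally good and the match is ambiguous. The sketch below illustrates that reading only; the function name is hypothetical, and what map.py does with an ambiguous match is not shown here.

```python
def is_ambiguous(scores, threshold=0.05):
    """True if the second-best score is within `threshold` of the best one."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return False  # a single candidate cannot be ambiguous
    return (ranked[0] - ranked[1]) <= threshold


candidate_scores = [0.90, 0.80, 0.40]
print(is_ambiguous(candidate_scores))                 # default threshold 0.05
print(is_ambiguous(candidate_scores, threshold=0.1))  # wider range, as with -t 0.1
```

Raising the threshold makes more matches count as ambiguous; lowering it makes the tool more willing to commit to the top candidate.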

Omit exact matches

If the results are used to set up 301 redirects on the new website to catch all traffic arriving at old URLs, exact matches can be omitted: they will be handled by the actual pages existing on the new site (list2). Use the -d flag for this.

python map.py -d old_sitemap.txt new_sitemap.txt
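The exact-match shortcut and the effect of -d can be pictured as a simple partition of list1. This is an illustrative sketch with a hypothetical helper name, not the code in map.py.

```python
def split_exact(list1, list2):
    """Partition list1 into items found verbatim in list2 and the rest."""
    targets = set(list2)
    exact = [item for item in list1 if item in targets]
    remaining = [item for item in list1 if item not in targets]
    return exact, remaining


old = ["/about", "/blog/hello-world", "/contact-us"]
new = ["/about", "/blog/hello-world-2", "/contact"]

exact, remaining = split_exact(old, new)
print("no redirect needed:", exact)
print("need fuzzy matching:", remaining)
```

With --drop-exact, only the entries in the second group would end up in the output, since the first group resolves to live pages on the new site.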

Save output to CSV file

Specify the output filename with -c PATH.

python map.py -c results.csv old_sitemap.txt new_sitemap.txt

Aggregating URLs from an XML sitemap

A helper script, aggregate.py, crawls an XML sitemap and outputs a flat list of URLs in the format map.py expects as input. Together with that tool, the whole process of generating a redirect map could look like the following. Afterwards, manually check results.csv, taking special care of matches with a low score (≤0.8).

python aggregate.py https://old-website.com/sitemap.xml > old.txt
python aggregate.py https://new-website.com/sitemap.xml > new.txt
python map.py --drop-exact --csv results.csv old.txt new.txt
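To make the manual review easier, the low-scoring rows can be pulled out first. The sketch below assumes each CSV row holds the old URL, the matched new URL and the score, in that order; that column layout is an assumption, so check it against an actual results.csv before relying on it. The sample rows are invented.

```python
import csv
import io


def low_score_rows(csv_file, cutoff=0.8):
    """Return (source, target, score) rows with score <= cutoff, worst first."""
    rows = []
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        try:
            score = float(row[2])  # assumed: score is the third column
        except ValueError:
            continue  # skip a header row, if there is one
        if score <= cutoff:
            rows.append((row[0], row[1], score))
    return sorted(rows, key=lambda r: r[2])


# Invented sample standing in for open("results.csv", newline="").
sample = io.StringIO(
    "/blog/old-post,/blog/new-post,0.85\n"
    "/team,/about/team,0.55\n"
    "/contact-us,/contact,0.73\n"
)

for source, target, score in low_score_rows(sample):
    print(f"{score:.2f}  {source} -> {target}")
```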

aggregate.py usage

$ python aggregate.py [-h] URL/PATH

Aggregates URLs from a set of XML sitemaps listed under the entry path.

This script processes the XML file at given path, opens all sitemaps
listed inside, and prints all URLs inside those maps to stdout.
It should support most sitemaps that comply with the spec at
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd.

It was tested with sitemaps generated by the following WP plugins:
 - (Google XML Sitemaps)[https://wordpress.org/plugins/google-sitemap-generator/]
 - (XML Sitemap & Google News feeds)[https://wordpress.org/plugins/xml-sitemap-feed/]
 - (Yoast SEO)[https://wordpress.org/plugins/wordpress-seo/]

Issues & Documentation: https://github.com/jsphpl/redirect-mapper

positional arguments:
  URL/PATH    Path or URL of the root sitemap.

optional arguments:
  -h, --help  show this help message and exit
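
The process described in the help text (open the root sitemap, follow every sitemap it lists, collect the URLs) can be sketched with the standard library as follows. This is not the code in aggregate.py: the function name, the injected `fetch` callable and the example.com documents are stand-ins so the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org 0.9 documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(location, fetch):
    """Return all page URLs reachable from the sitemap at `location`.

    `fetch` is any callable that maps a location to the XML text found there.
    """
    root = ET.fromstring(fetch(location))
    if root.tag.endswith("sitemapindex"):
        # A sitemap index: recurse into every child sitemap it lists.
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text.strip(), fetch))
        return urls
    # A plain urlset: collect its <loc> entries.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


# In-memory stand-ins for remote sitemap files (hypothetical content).
DOCS = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/blog/hello-world</loc></url>"
        "</urlset>"
    ),
    "https://example.com/pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/about</loc></url>"
        "<url><loc>https://example.com/contact</loc></url>"
        "</urlset>"
    ),
}

for url in collect_urls("https://example.com/sitemap.xml", DOCS.__getitem__):
    print(url)
```

Passing the fetcher in as an argument keeps the parsing logic testable; for real use it could be swapped for a function that reads a local file or downloads the URL.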