Auto-generating a redirect map for a website migration
Whenever you redesign a website or migrate it to a new system and the URLs change, you need to set up 301 redirects. They point the URLs of the old system to the appropriate pages on the new site and help you keep your search engine rankings. So if you don't want to throw away the SEO efforts of the past years, you'd be well advised to set them up.
So how are you gonna go about it? Right, take a list of pages on the old site and find a matching page on the new site for each entry. If there is a page with exactly the same URL, good, no need to redirect. If not, another page needs to be found as a destination for all traffic looking for the old page that no longer exists.
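That first pass, separating old URLs that still exist from those that need a new target, can be sketched with a plain set lookup. The function name `split_old_urls` and the sample lists are made up for illustration.

```python
def split_old_urls(old_urls, new_urls):
    """Split old URLs into those that still exist and those needing a redirect."""
    new_set = set(new_urls)
    # Same URL on the new site: nothing to do.
    unchanged = [u for u in old_urls if u in new_set]
    # URL is gone: a redirect target has to be found.
    needs_redirect = [u for u in old_urls if u not in new_set]
    return unchanged, needs_redirect


# Illustrative sample data (hypothetical)
old = ["/blog", "/about-us", "/books/faust"]
new = ["/blog", "/about-us", "/books/goethe/faust"]

unchanged, needs_redirect = split_old_urls(old, new)
print(unchanged)       # ['/blog', '/about-us']
print(needs_redirect)  # ['/books/faust']
```

Only the entries in `needs_redirect` require the manual (or assisted) matching described next.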
Auto-generate that list
To ease the process, I thought it would be great to have a tool that performs some sort of string matching between the old and new URLs. The output of this tool would still need to be checked manually, but it could serve as a basis for the redirect plan.
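To give an idea of what such string matching can look like, here is a minimal sketch using `difflib.SequenceMatcher` from Python's standard library. It is an assumption for illustration only and not necessarily how the tool linked below works internally; the helper names `similarity` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings: 1.0 for identical, 0.0 for nothing in common."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(old_url, new_urls):
    """Return the new URL most similar to old_url, together with its score."""
    best = max(new_urls, key=lambda candidate: similarity(old_url, candidate))
    return best, similarity(old_url, best)


# Illustrative sample data (hypothetical)
candidates = ["/books/goethe/faust", "/blog", "/about-us"]
match, score = best_match("/books/faust", candidates)
print(match)  # /books/goethe/faust
```

A high score only says that the strings look alike, not that the pages cover the same content, which is why the result still needs a human check.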
Here's the tool: https://github.com/jsphpl/redirect-mapper
Let's say we have two simple lists of URLs. The first one is old_urls.txt:

/books/faust
/books/romeo-and-juliet
/ebooks/from-zero-to-hero
/blog
/about-us
/deprecated-page

The second one is new_urls.txt:

/books/goethe/faust
/books/shakespeare/romeo-and-juliet
/ebooks/random-wannabe/from-zero-to-hero
/blog
/about-us
/newly-added-page
Now all entries in old_urls.txt need to be assigned a target from the new_urls.txt set. To do so, we simply use our tool:
python map.py -c redirects.csv old_urls.txt new_urls.txt
It puts its suggestions for our redirect map into redirects.csv:

Item (list1),Match (list2),Score,Ambiguous,Exact,Alternatives
/books/faust,/books/goethe/faust,0.77,False,False,
/books/romeo-and-juliet,/books/shakespeare/romeo-and-juliet,0.79,False,False,
/ebooks/from-zero-to-hero,/ebooks/random-wannabe/from-zero-to-hero,0.77,False,False,
/blog,/blog,1.0,False,True,
/about-us,/about-us,1.0,False,True,
/deprecated-page,/newly-added-page,0.61,False,False,
This file contains one line for each entry in our old_urls.txt along with:
- Match: The most similar URL from new_urls.txt (the elected candidate)
- Score: The similarity score of the two URLs. 1.0 means identical, 0.0 would theoretically mean no similarities at all
- Ambiguous: An indicator of whether there are more candidates in new_urls.txt with a score close to that of the elected candidate, which are probably alternative matches
- Exact: An indicator of whether "Match" is identical to "Item"
- Alternatives: In case of ambiguity, candidates with a score close to the winner's are listed as alternative matches
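If you'd rather inspect the file from a script than from a spreadsheet, the columns can be read with Python's `csv` module. This is a sketch that assumes the header names shown in the sample output above. How several alternatives are separated inside the Alternatives cell is not shown there, so the sketch keeps that cell as a raw string. `load_rows` is a hypothetical helper name.

```python
import csv
import io

# A few rows taken from the sample output above
SAMPLE = """\
Item (list1),Match (list2),Score,Ambiguous,Exact,Alternatives
/books/faust,/books/goethe/faust,0.77,False,False,
/blog,/blog,1.0,False,True,
/deprecated-page,/newly-added-page,0.61,False,False,
"""


def load_rows(fileobj):
    """Parse the CSV into dicts with typed values."""
    rows = []
    for raw in csv.DictReader(fileobj):
        rows.append({
            "item": raw["Item (list1)"],
            "match": raw["Match (list2)"],
            "score": float(raw["Score"]),
            "ambiguous": raw["Ambiguous"] == "True",
            "exact": raw["Exact"] == "True",
            "alternatives": raw["Alternatives"],  # kept as raw text
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["item"], "->", row["match"], row["score"])
```

For the real file, `open("redirects.csv", newline="")` can be passed to `load_rows` in place of the `io.StringIO` object.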
We can now open this file using our spreadsheet editor of choice and carefully inspect each line, removing false matches.
- Remove all exact matches (column Exact=True), as they don't need a redirect.
- Sort by Ambiguous and carefully inspect all matches with this column set to True. Replace the Match with a URL from the Alternatives column, if more appropriate. If neither the Match nor any of the Alternatives are appropriate redirect targets, try to find a matching page using your brain and/or the new website's search feature. If that also fails, remove the entire line from the spreadsheet.
- Sort by Score ascending and check whether the Match column likely represents the corresponding new page for the old "Item" page. If not, replace Match with a more appropriate target. If no appropriate target exists, remove the entire line.
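The review order from the steps above can also be prepared in a script before opening the spreadsheet. The following is a minimal sketch, assuming each row is already available as a dict with the keys `item`, `match`, `score`, `ambiguous` and `exact` (hypothetical names, as is the `triage` function). It only orders the rows; judging whether a match is appropriate remains manual work.

```python
def triage(rows):
    """Drop exact matches and order the remaining rows for manual review."""
    # Exact matches keep their URL and need no redirect.
    needs_redirect = [r for r in rows if not r["exact"]]
    # Ambiguous matches come first: their alternatives deserve a close look.
    ambiguous = [r for r in needs_redirect if r["ambiguous"]]
    # The rest is sorted by score ascending, so the weakest matches surface first.
    by_score = sorted(
        (r for r in needs_redirect if not r["ambiguous"]),
        key=lambda r: r["score"],
    )
    return ambiguous, by_score


# Illustrative rows (hypothetical, shaped like the sample output)
rows = [
    {"item": "/books/faust", "match": "/books/goethe/faust",
     "score": 0.77, "ambiguous": False, "exact": False},
    {"item": "/blog", "match": "/blog",
     "score": 1.0, "ambiguous": False, "exact": True},
    {"item": "/deprecated-page", "match": "/newly-added-page",
     "score": 0.61, "ambiguous": False, "exact": False},
]

ambiguous, by_score = triage(rows)
for row in by_score:
    print(row["score"], row["item"], "->", row["match"])
```

In this sample, `/deprecated-page` is listed first, which is exactly the kind of low-score suggestion that usually turns out to be a false match.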