I migrated my blog from WordPress to Astro. One challenge was migrating the URLs: a link to a blog post or article should still work after the migration, ideally without having to set up redirects. In my old blog the URLs followed the scheme /[year]/[month]/[slug], so a post published at /2025/12/title-of-post should keep exactly that URL. To ensure this, one task was to validate the URLs:
- Capture the current URLs to my posts
- Validate each link at the new site
- Fix any URL that returns 404
Obtain list of URLs
The first task was to get a list of URLs for my blog posts. The easiest way is to take them from my sitemap.xml file. I have a repo / program sitemap-crawler that takes a sitemap.xml file as input and outputs the URLs it contains.
npm start https://www.itsfullofstars.de/sitemap_index.xml
The output is stored in the file pages.txt and contains all discovered links. Those are the links that should work after the migration.
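If the sitemap-crawler tool is not at hand, a rough alternative sketch with curl and grep gives a similar result. It assumes a WordPress-style sitemap index whose <loc> entries point to sub-sitemaps; this is not the tool I used, just an illustration.
# Sketch: collect all page URLs referenced by the sitemap index into pages.txt
curl -s https://www.itsfullofstars.de/sitemap_index.xml \
  | grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||' \
  | while read -r sitemap; do
      # each index entry is itself a sitemap that lists the actual pages
      curl -s "$sitemap" | grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||'
    done > pages.txt
Either way, pages.txt ends up as a plain list of URLs: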
https://www.itsfullofstars.de/2018/04/openvpn-connection-test/
https://www.itsfullofstars.de/2018/04/setup-openvpn-troubleshooting/
https://www.itsfullofstars.de/2018/04/solving-reverse-proxy-error-err_content_decoding_failed/
https://www.itsfullofstars.de/2018/06/clone-a-scp-git-repository-from-command-line/
https://www.itsfullofstars.de/2018/06/download-resources-from-sap-cloud-for-your-ci-job/
https://www.itsfullofstars.de/2018/06/sap-web-ide-invalid-backend-response-received-by-scc/
https://www.itsfullofstars.de/2018/07/how-to-use-find-to-sort-files-across-folders/
https://www.itsfullofstars.de/2018/08/how-to-access-ui5-model-data/
https://www.itsfullofstars.de/2018/09/err_content_decoding_failed/
...
Prepare URL list
The links in pages.txt point to my live server itsfullofstars.de. To test the migration and check whether all links work, I had to adjust them to point to the local instance of my new website running at http://localhost:4321. This can be done with any tool that can replace https://www.itsfullofstars.de with http://localhost:4321.
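For example, a sed one-liner can do the rewrite in place. This is just a minimal sketch; on BSD/macOS sed the in-place flag needs an argument (-i '').
# swap the production host for the local Astro dev server in pages.txt (GNU sed)
sed -i 's|https://www.itsfullofstars.de|http://localhost:4321|g' pages.txt
After the replacement, every entry in pages.txt points to the local dev server: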
http://localhost:4321/2018/04/openvpn-connection-test/
http://localhost:4321/2018/04/setup-openvpn-troubleshooting/
http://localhost:4321/2018/04/solving-reverse-proxy-error-err_content_decoding_failed/
http://localhost:4321/2018/06/clone-a-scp-git-repository-from-command-line/
http://localhost:4321/2018/06/download-resources-from-sap-cloud-for-your-ci-job/
http://localhost:4321/2018/06/sap-web-ide-invalid-backend-response-received-by-scc/
http://localhost:4321/2018/07/how-to-use-find-to-sort-files-across-folders/
http://localhost:4321/2018/08/how-to-access-ui5-model-data/
http://localhost:4321/2018/09/err_content_decoding_failed/
...
Validate URLs
The script below is taken from Check the Status of a URL Without Downloading Using wget.
#!/bin/bash
# Check every URL in pages.txt and log its HTTP status
input_file="pages.txt"
log_file="status_log.txt"

while read -r url; do
  echo "Checking $url..."
  # --spider only checks the URL without downloading it; keep the last HTTP status line
  status=$(wget --spider --tries 1 --server-response "$url" 2>&1 | grep "HTTP/" | tail -1)
  echo "$url - $status" >> "$log_file"
done < "$input_file"

echo "Status check completed. Results saved to $log_file."
Running the script outputs the result of wget for every link listed in pages.txt. The results are logged to status_log.txt; each line contains the checked URL and the HTTP status returned by wget: 200 for OK, 404 for Not Found.
http://localhost:4321/2018/04/setup-openvpn-troubleshooting/ - HTTP/1.1 200 OK
http://localhost:4321/2018/04/solving-reverse-proxy-error-err_content_decoding_failed/ - HTTP/1.1 200 OK
http://localhost:4321/2018/06/clone-a-scp-git-repository-from-command-line/ - HTTP/1.1 200 OK
http://localhost:4321/2018/06/download-resources-from-sap-cloud-for-your-ci-job/ - HTTP/1.1 200 OK
http://localhost:4321/2018/06/sap-web-ide-invalid-backend-response-received-by-scc/ - HTTP/1.1 200 OK
http://localhost:4321/2018/07/how-to-use-find-to-sort-files-across-folders/ - HTTP/1.1 200 OK
http://localhost:4321/2018/08/how-to-access-ui5-model-data/ - HTTP/1.1 404 Not Found
In the above example, the last URL returned an error: HTTP 404 Not Found. To get a better overview of the erroneous URLs, I filtered the results to only include the URLs with HTTP 404.
grep 404 ./status_log.txt > failing_urls.txt
The file failing_urls.txt now contains all URLs that returned HTTP status code 404.
http://localhost:4321/2018/02/activating-the-clickjacking-framing-protection-service/ - HTTP/1.1 404 Not Found
http://localhost:4321/2021/01/one-mobile-platform-to-rule-them-all/ - HTTP/1.1 404 Not Found
http://localhost:4321/2021/05/nichts-geht-uber-sicherheit/ - HTTP/1.1 404 Not Found
http://localhost:4321/2021/07/how-is-my-website-doing/ - HTTP/1.1 404 Not Found
http://localhost:4321/2022/02/apache-ah00561-size-of-a-request-header-field-exceeds-server-limit/ - HTTP/1.1 404 Not Found
http://localhost:4321/2022/05/wie-man-mit-einer-meldung-zu-einem-dsgvo-verstos-richtig-umgeht/ - HTTP/1.1 404 Not Found
http://localhost:4321/2022/07/a-different-kind-of-certification/ - HTTP/1.1 404 Not Found
http://localhost:4321/2022/08/migrating-from-single-disk-to-raid5/ - HTTP/1.1 404 Not Found
...
These URLs are failing. For every entry in the list I then only had to find out why the link failed. In most cases it was because either the slug or the date part was wrong.
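To speed up that last step, a small helper like the one below can point to near-matches. This is only a sketch of my own; it assumes the Astro content lives under src/content/blog, which may differ in other setups.
# For each failing URL, take the slug (last path segment) and search the content
# folder for files that mention it, to spot typos in the slug or date part.
while read -r line; do
  url="${line%% *}"            # strip the " - HTTP/1.1 404 Not Found" suffix
  slug="$(basename "$url")"    # e.g. how-is-my-website-doing
  echo "== $url"
  grep -ril "$slug" src/content/blog || echo "   no file mentions '$slug'"
done < failing_urls.txt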