I am currently the Head of DevRel at unSkript. unSkript is building an open source platform to help SRE and DevOps teams build automations that reduce their daily ‘toil.’ When I think of toil, I think of all the bulls*it manual tasks that we have to do every day – just to keep everything up and running.

So, I spend a lot of my time creating (and writing about) DevOps automations. But I also put on my DevRel hat and do my DevRel-y tasks to help my part of the company grow. And, just like for SRE/DevOps teams, some of these tasks are very repetitive – you might even call them ‘toil’.

Since I am spending a lot of time automating away toil for others, it’s natural to think about how I might automate away some of the ‘toil’ in my own role. I have written about the RunBook I created to collect daily metrics and store them in a central database. But there are several other RunBooks I have created to make my life easier.

404s

So, you’re working in the docs, moving stuff around, and you accidentally break a link. This happens. We’ve all done it. And a lot of the time you catch it. But recently, a colleague found a broken link in our docs. We thought – are there tools that can help us find issues like 404s in our docs? I didn’t even bother to do a search, since I figured I could automate this with a Jupyter Notebook pretty quickly.

Build a Jupyter NoteBook

I’m building the automation in a NoteBook using the unSkript framework, so there is an extra cell that initializes unSkript. But the rest of the NoteBook can be run in any Jupyter environment. Here’s the link on GitHub.

The input to this RunBook is a Sitemap.

A sitemap is an XML document that provides information about your site. Search engines use it to help index your site (and submitting the sitemap to Google in the Google Search Console will help your SEO). Every platform I have worked with for docs or websites automatically generates a sitemap for you (and it is generally found at /sitemap.xml).
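For reference, a simplified sitemap looks something like this (the URLs below are placeholders, not from our actual docs):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/getting-started</loc>
  </url>
  <url>
    <loc>https://example.com/docs/installation</loc>
  </url>
</urlset>

Each <url> entry has a <loc> element holding a page’s address – those are the values we’ll pull out in Step 1.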

Step 1: Read in the sitemap, and collect a list of the URLs it contains

import requests
import xmltodict

#This Action reads in the Sitemap, converts the XML to a dictionary,
# and then extracts every URL into a list

# 'sitemap' is the RunBook's input parameter: the URL of your sitemap
# (generally found at https://yoursite.com/sitemap.xml)

response = requests.get(sitemap)
contents = response.text
# Parse the XML data to a dictionary
xml_dict = xmltodict.parse(contents)
#print(xml_dict['urlset'])
urlList = []
for url in xml_dict['urlset']['url']:
    urlList.append(url['loc'])

print("sitemap read in, list of urls created")

The List variable urlList now has every URL extracted from the sitemap.

Step 2: Check every link on every page

Let’s loop through each URL in urlList (extracted from the sitemap), read in the HTML, and extract every link using the BeautifulSoup library. Then we make a request to each URL found on the page and save the HTTP status that is returned, along with the page where the link was seen. If the response is anything but 200, we know which page has the bad link, and which link is broken!

Note: the docs have a lot of cross-referencing, so if a link has already been checked, we don’t need to check it again on a second page. (This does mean that if the same bad link appears on multiple pages, it may take a few runs to find each instance.) Also, we exclude all references to localhost and 127.0.0.1, as those URLs appear in the docs but will fail in testing.

import requests
from bs4 import BeautifulSoup


urls = urlList
#urls = ["https://unskript.com"]
links = {}

for url in urls:
    # get the HTML of the page and pull out every anchor tag
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for link in soup.find_all('a'):
        link_url = link.get('href')
        # skip anchors with no href attribute
        if not link_url:
            continue
        # only test links we have not already seen, that start with http(s),
        # and that are not on the excluded hosts below
        if (link_url not in links) and link_url.startswith("http") and ("localhost" not in link_url) and ("127.0.0.1" not in link_url) and ("runbooks.sh" not in link_url):
            #print(link_url, url)
            #we want to add it
            link_response = requests.get(link_url)
            link_status = link_response.status_code
            data = {"status": link_status, "first_seen": url}
            links[link_url] = data
print("list completed")

Step 3: List the non-200 responses.

A 200 response means that the link loaded as expected.

If we see anything in the 3xx range, the page has moved, and we should update the link to point at the new location. (With requests’ default settings redirects are followed automatically, so a 3xx status will rarely surface here unless redirect following is disabled.)

A 404 means that the page was not found and the link is broken, so we should find the new page and fix the link. A 403 is forbidden – which probably means the link works, but the server is blocking the page content from script access (screen scrapers and the like use the same approach we are using).
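This last step is just a quick walk through the links dictionary we built in Step 2. A minimal sketch of that cell might look like this (the exact report format in the actual RunBook may differ):

# Report every link that did not come back with a 200,
# along with the page where the link was first seen
for link_url, data in links.items():
    if data["status"] != 200:
        print(data["status"], link_url, "first seen on", data["first_seen"])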

Robot v. Robot

When testing the sitemap from unSkript.com, I received two errors:

https://wellfound.com/company/unskript/jobs 403
https://www.linkedin.com/company/unskript-inc 999

Both of these pages work, but they block automations from scraping their content (one is our job board, and the other is our LinkedIn page). The links are fine. It’s just the robots at these two companies not liking my robot, and blocking it.
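If you don’t want these known false positives cluttering the report, one option is to skip domains you have already verified by hand – something like this sketch (the domain list is just the two examples above, and KNOWN_BOT_BLOCKERS is a name I made up for illustration):

# Domains that block automated requests, but whose links we have verified manually
KNOWN_BOT_BLOCKERS = ["wellfound.com", "linkedin.com"]

for link_url, data in links.items():
    is_blocked_domain = any(domain in link_url for domain in KNOWN_BOT_BLOCKERS)
    if data["status"] != 200 and not is_blocked_domain:
        print(data["status"], link_url, "first seen on", data["first_seen"])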

Conclusion

Everyone hates broken links – especially on a site that you are in charge of. Rather than pay for a service to regularly check the links in our docs and on our website, we now have a RunBook that can regularly check for broken links on our site.

Want to try it for yourself? The RunBook is Open Source. It’s currently set to run in the unSkript Automation framework, but if you delete the first cell, it’ll run in any Jupyter environment. And you can be assured that every link on your site is working!
