Downloading the entire CAD-comic archive with Python

Over the weekend I had some time to play around with Python. I had not touched Python for the last 7-8 years, and even then had only briefly used it. Though I am not a fan of some parts of the language, the speed with which you can whip something together is actually quite impressive.

Even though I had almost no experience with the language, in about two hours I had a script that could download all CTRL+ALT+DEL comics for a given year, or a range of years.

The complete code is available on my GitHub. Here I will just explain the general idea that I had.

First of all, the archive is available at http://www.cad-comic.com/cad/archive, and you can choose which year you want. Selecting a year simply appends it to the archive endpoint, for example for 2013: http://www.cad-comic.com/cad/archive/2013.

On this page you find a list of all the comics published that year, each with a URL referring to it. So the first step was to grab the URLs of all the comics for a given year. Then, for each of those URLs, I had to fetch the image displayed on that page. By looking at the source, I determined that I could find it inside the _content_ div, as the only image there.
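
In BeautifulSoup terms, that lookup boils down to something like this (just a preview, the full method comes later):

contentDiv = soup.find(id="content") # the div that holds the comic
imageSource = contentDiv.find_all("img")[0]["src"] # the first (and only) image inside it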

Using BeautifulSoup, scraping a website is easy and fast. In addition, Python's urllib2 library makes sending web requests just as easy.
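
For completeness, the snippets below assume roughly these imports and module-level globals. The full script on GitHub is the authoritative version; in particular, the exact value of baseUrl is my assumption here:

import os
import sys
from datetime import datetime

import urllib2
from bs4 import BeautifulSoup

baseUrl = "http://www.cad-comic.com" # joined with the partial comic URLs (assumed value)
scrapingYear = "" # the year currently being scraped, shared through globals
index = 0 # running counter used to prefix the filenames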

Let's take the code apart now.

Before we begin with the actual code, an important step is to set up the headers for the requests we will be making. If we do not do this, our requests will fail, as they get blocked by the server that CAD-comic runs on.

# Pass some headers, so the website allows our requests.
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

First we have to scrape the archive page for a given year. Since we know how to build the URL for that year, we just send a request to it and use BeautifulSoup to fetch all the hrefs on the page. In addition, I want to filter the hrefs down to the ones that actually point to a comic; otherwise I would also pick up links to other parts of the site, such as social media. I noticed that every comic URL contains its publication date, and therefore the year, so my filter can simply look for the year in the href string. This should keep false positives to a minimum.

""" Scrape an archive for a given year, finds retrieves all the comic URLs for this year."""
def scrapeArchiveForComics(year):
    print year
    global scrapingYear
    global index
    index = 0 # We set the index to 0 again, because we are creating a new list of comics
    scrapingYear = year
    archiveUrl = "http://www.cad-comic.com/cad/archive/" + year

    # Look for all the URLs (a-tags) containing '2002'
    request = urllib2.Request(archiveUrl,headers=hdr);
    req = urllib2.urlopen(request)
    soup = BeautifulSoup(req.read(),'html.parser')
    aTags = soup.find_all("a")
    comicTags = filter(urlContainsYear,aTags)
    comicUrls = map(mapFetchHrefFromImageUrl,comicTags)
    # These are sorted by how they appear in the archive. They need to be reversed (last on archive = first chronologically)
    return (list(reversed(comicUrls)))
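
Called on its own, this returns the partial comic URLs for the year in chronological order, something like this (hypothetical output):

comicUrls = scrapeArchiveForComics("2013")
# e.g. ['/cad/20130101', '/cad/20130104', ...]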

For filtering them, I did not do it inline (I'm sure you can do this in Python with a lambda expression, and it would probably be more elegant). In fact, I am sure that a lot of this Python code will make seasoned Pythonistas want to smack me with a book on Python.

Anyway, I made a global of the year that I am scraping, so that in my urlContainsYear method I can just look for it in the URL.

def urlContainsYear(url):
    global scrapingYear
    stringUrl = str(url) # Because BeautifulSoup makes it a 'tag' normally
    return scrapingYear in stringUrl

To get the actual href from the BeautifulSoup tag elements, I used a map function. Once again, this should probably be a lambda.

def mapFetchHrefFromImageUrl(aTag):
    return str(aTag["href"]) # aTag is a BeautifulSoup a-tag; grab its href attribute

This method returns URLs like these: /cad/20120123, /cad/20120213, ...
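
As an aside, the whole filter-and-map dance could probably be collapsed into a single list comprehension, along these lines (a sketch, not what the script actually uses):

# Filter the a-tags and extract their hrefs in one go
comicUrls = [str(a["href"]) for a in soup.find_all("a") if scrapingYear in str(a)]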

Now that I have these URLs, I can build the full URLs by joining them with the base URL, and then look for the image in the resulting HTML. For this, I use another method that I call for each returned URL.

def scrape(partialComicURL):
    global scrapingYear
    global index
    print "Scraping CAD-comics.."
    site = baseUrl + partialComicURL
    request = urllib2.Request(site, headers=hdr)
    f = urllib2.urlopen(request)
    soup = BeautifulSoup(f.read(), 'html.parser')

    # Find the image source
    contentDiv = soup.find(id="content")
    imageSource = contentDiv.find_all('img')[0]["src"] # src attribute of the first (and only) image on the page

    # Format the filename: just take the last part of the comic URL (after the 2nd slash of the partial)
    extension = imageSource[-3:]
    filename = str(index) + " : " + (partialComicURL.split("/")[2]) + "." + extension
    index += 1

    # Write the image to a file
    imageRequest = urllib2.Request(imageSource, headers=hdr)
    imageFile = open(scrapingYear + "/" + filename, "wb")
    imageFile.write(urllib2.urlopen(imageRequest).read())
    imageFile.close()
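
One nitpick on my own code: slicing off the last three characters works because the extensions are .jpg and .png, but os.path.splitext would be the more robust way to get the extension. A sketch:

extension = os.path.splitext(imageSource)[1].lstrip(".") # 'jpg' or 'png', without the dot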

I believe this code is quite self-explanatory to anyone familiar with Python. I do keep an index, which might look odd, but it is there so I can prefix the filename with a number, ensuring the comics are in the right order when viewing them locally. Order can be quite important for CAD, as some storylines run over multiple comics. That is also the reason why I reverse the list at the end of the scrapeArchiveForComics method: the website returns it most-recent first, the reverse of what we actually want.

In addition, I format the filename a bit more. I just use the date of the comic, and I look at the image source to determine whether the extension should be .jpg or .png; it turns out that at some point Tim (the author of CAD) decided to change the image format.

Those are actually the two most interesting methods that we need to make this work. The main method I wrote takes some CLI arguments to fill in the parameters for these methods.

def main():
    if len(sys.argv) > 1:
        year = sys.argv[1]
        if year == "all":
            startYear = int(sys.argv[2]) if len(sys.argv) == 3 else 2002
            print "Scraping for all years"
            thisYear = datetime.now().year
            yearRange = range(startYear, thisYear + 1) # +1 to include the current year, since range excludes the end
            for archiveYear in yearRange:
                downloadForYear(str(archiveYear))
            print "Done!"
        else:
            print "Scraping for year: " + year
            downloadForYear(str(year))
            print "Done!"
    else:
        print "Pass a year, starting from 2002 (sample usage: python main.py 2002)"

This calls a downloadForYear method which, in addition to starting the scraping, also creates a folder to store the images.

""" main method to download for year"""
def downloadForYear(year):
    comicUrls = scrapeArchiveForComics(year)
    try:
        os.mkdir(year) # Create a folder to store the comics.
    except OSError:
            pass # the folder already exists, should we maybe empty it?

    i = 0
    for comic in comicUrls:
        print "Scraping CAD-comic #" + str(i) + " from: " + str(len(comicUrls)) + " for year: " + year
        scrape(comic)
        i+=1

That is pretty much the whole code. I did omit some parts here, but the full source is available on GitHub as mentioned earlier, so make sure to check that out if you have any issues getting this to run. Plus, if you want to fix things about this code, PRs are always welcome! 🙂

I must say that even though the syntax of Python might look a bit alien to me (coming from a background in Java and Haskell), the amount of work you can get done with so little effort is very appealing. In fact, I am sure that I will end up writing more small Python scripts in the future, as I had a great deal of fun playing around with the language.