Downloading funghi, from Wikimedia Commons

#!/usr/bin/env python
# print the urls of all the images in a category of Wikimedia Commons
# example:
# $ python get_commons.py "Category:Illustrations_of_fungi"

# pipe to wget for download:
# $ python get_commons.py [category] | wget -i - --wait 1

import sys
import json
import urllib2
from urllib import quote

def make_api_query(category, q_continue=""):
    if q_continue:
        q_continue = '&gcmcontinue=' + q_continue
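    # quote the category so spaces and other special characters survive in the URL;
    # rawcontinue asks the API for the 'query-continue' paging format read below
    url = 'https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=' + quote(category) + q_continue + '&gcmlimit=500&prop=imageinfo&iiprop=url&rawcontinue=&format=json'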
    request = json.loads(urllib2.urlopen(url).read())
    if 'error' in request:
        sys.exit(request['error']['info'])
    for page in request['query']['pages'].values():
        try:
            print page['imageinfo'][0]['url']
        except KeyError: pass
    # there is a maximum of 500 results in one request, for paging
    # we use the query-continue value:
    if 'query-continue' in request:
        q_continue = quote(request['query-continue']['categorymembers']['gcmcontinue'])
        make_api_query(category, q_continue)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python get_commons.py [category]")
    make_api_query(sys.argv[1])


3 Comments

We used a variation on this script while creating a design for Radio Panik.

This is an example of a simple script written in the Python programming language. print outputs lines to the terminal. There is one command-line argument: the name of the category whose image files the script will retrieve. If you just ran the script on its own, python get_commons.py "Category:Illustrations_of_fungi", you would get a list of URLs printed to the terminal, one per line. The script was designed to be used in a pipeline with the wget program (which, if you’ve installed Homebrew, you can install by typing brew install wget). In the wget command, we specify -i -, which means: read the list of URLs from standard input.
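To make the pipeline idea concrete, here is a minimal sketch of the receiving end. The helper read_urls is hypothetical: it is not part of the script or of wget, it only imitates what wget -i - does with its input, namely take one URL per line from a stream such as standard input. Skipping blank lines is an assumption of this sketch.

```python
import io

def read_urls(stream):
    """Return the non-empty lines of a text stream, one URL per line."""
    urls = []
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines (an assumption of this sketch)
            urls.append(line)
    return urls

# A small in-memory stream stands in for standard input here;
# in a real pipeline you would pass sys.stdin instead.
sample = io.StringIO(u"http://example.org/a.jpg\n\nhttp://example.org/b.jpg\n")
sample_urls = read_urls(sample)
```

In a real pipeline the stream would be sys.stdin, filled by whatever program sits to the left of the | character.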

The script requests a URL that returns a list of images and their properties in JSON format (just like the Twitter API provides). The script contains a function that calls a URL from the Wikimedia Commons API, then extracts the image locations available in the response. If there are more images available than can be returned at once, the API also returns a continue value that, added to the query, provides the next set of images, and so on. To handle this, the function that queries the API checks whether such a continue value is present and, if so, calls itself with this value as an argument.

The fact that a programming function can call itself is rather baffling at first. This is what programmers call recursion. Python is not optimized for recursion, but there are styles of programming that are built around this concept. Recursion is an appealing analytical trick. I imagine there is some kind of abstraction gland stimulated by paradoxes and self-referential logic tricks, a kind of weird sister to sexual excitement, like the symbols preferred by medieval alchemists. Jenseits likes circles, spirals and Ouroboroses.
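For comparison, the same paging can be written as a loop instead of a function that calls itself. This is only a sketch, not the script's own method: fetch_page is a hypothetical stand-in for the network request, and the FAKE responses are invented data shaped like the 'query' and 'query-continue' structures the script reads.

```python
def image_urls(response):
    """Collect the image URLs from one decoded API response."""
    urls = []
    for page in response.get('query', {}).get('pages', {}).values():
        info = page.get('imageinfo')
        if info:  # pages that are not files (e.g. subcategories) have no imageinfo
            urls.append(info[0]['url'])
    return urls

def all_image_urls(fetch_page):
    """Follow the continue value in a loop until no further batch is offered.

    fetch_page is a hypothetical callable: it takes a continue value
    ('' for the first request) and returns the decoded JSON as a dict.
    """
    urls = []
    q_continue = ''
    while True:
        response = fetch_page(q_continue)
        urls.extend(image_urls(response))
        if 'query-continue' not in response:
            return urls
        q_continue = response['query-continue']['categorymembers']['gcmcontinue']

# Invented responses standing in for the network (two batches, then the end).
FAKE = {
    '': {'query': {'pages': {'1': {'imageinfo': [{'url': 'http://example.org/a.jpg'}]}}},
         'query-continue': {'categorymembers': {'gcmcontinue': 'next'}}},
    'next': {'query': {'pages': {'2': {'imageinfo': [{'url': 'http://example.org/b.jpg'}]},
                                 '3': {'title': 'Category:Sub'}}}},
}

urls = all_image_urls(lambda q: FAKE[q])
```

The loop does the same work as the recursive version, but it does not add a new function call for every batch, so a very large category cannot run into Python's recursion limit (1000 nested calls by default).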
