Self-Supervised Learning with Image网


Until fairly recently, deep learning models needed a LOT of data to get decent performance. Then came an innovation called transfer learning, which we’ve covered in some previous posts. We train a network once on a huge dataset (such as ImageNet, or the entire text of Wikipedia), and it learns all kinds of useful features. We can then retrain or ‘fine-tune’ this pretrained model on a new task (say, elephant vs zebra), and get incredible accuracy with fairly small training sets. But what do we do when there isn’t a pretrained model available?

Pretext tasks (left) vs downstream task (right). I think I need to develop this style of illustration – how else will readers know that this blog is just a random dude writing on weekends? 🙂

Enter Self-Supervised Learning (SSL). The idea here is that in some domains, there may not be vast amounts of labeled data, but there may be an abundance of unlabeled data. Can we take advantage of this by using it somehow to train a model that, as with transfer learning, can then be re-trained for a new task on a small dataset? It turns out the answer is yes – and it’s shaking things up in a big way. This fastai blog post gives a nice breakdown of SSL, and shows some examples of ‘pretext tasks’ – tasks we can use to train a network on unlabeled data. In this post, we’ll try it for ourselves!

Follow along in the companion notebook.


Read the literature on computer vision, and you’ll see that ImageNet has become THE way to show off your new algorithm. Which is great, but coming in at 1.3 million images, it’s a little tricky for the average person to play with. To get around this, some folks are turning to smaller subsets of ImageNet for early experimentation – if something works well in small-scale tests, *then* we can try it in the big leagues. Leading this trend have been Jeremy Howard and the fastai team, who often use ImageNette (10 easy classes from ImageNet), ImageWoof (some dog breeds from ImageNet) and most recently Image网 (‘ImageWang’, 网 being ‘net’ in Chinese).

Image网 contains some images from both ImageNette and ImageWoof, but with a twist: only 10% of the images are labeled to use for training. The remainder are in a folder, unsup, specifically for use in unsupervised learning. We’ll be using this dataset to try our hand at self-supervised learning, using the unlabeled images to train our network on a pretext task before trying classification.

Defining Our Pretext Task

A pretext task should be one that forces the network to learn underlying patterns in the data. This is a new enough field that new ideas are being tried all the time, and I believe that a key skill in the future will be coming up with pretext tasks in different domains. For images, there are some options explained well in this fastai blog. Options include:

  • Colorization of greyscale images
  • Classifying corrupted images
  • Image In-painting (filling in ‘cutouts’ in the image)
  • Solving jigsaws

For fun, I came up with a variant of the image in-painting task that combines it with colorization. Several sections of the input image are blurred and turned greyscale. The network tries to replace these regions with sensible values, with the goal being to have the output match the original image as closely as possible. One reason I like the idea of this as a pretext task is that we humans get something similar. Each time we move our eyes, things that were in our blurry, greyscale peripheral vision are brought into sharp focus in our central vision – another input for the part of our brain that’s been pretending they were full HD color the whole time 🙂

Here are some examples of the grey-blurred images and the desired outputs:

Input/Output pairs for our pretext task, using the RandomGreyBlur transform

We train our network on this task for 15 epochs, and then save its parameters for later use in the downstream task. See the notebook for implementation details.
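The real RandomGreyBlur transform lives in the notebook. Purely as an illustration of the idea, here is a minimal NumPy sketch. The function name random_grey_blur, the patch size, the number of patches and the simple box blur are all assumptions made for this example, not the notebook's implementation.

```python
import numpy as np


def random_grey_blur(img, n_patches=3, patch=16, k=5, rng=None):
    """Return a copy of `img` (H x W x 3 floats in [0, 1]) in which a few
    square patches have been turned greyscale and box-blurred.

    All parameter names and defaults are hypothetical choices for this sketch.
    """
    rng = rng or np.random.default_rng()
    out = img.copy()
    h, w, _ = img.shape
    pad = k // 2
    for _ in range(n_patches):
        top = int(rng.integers(0, h - patch + 1))
        left = int(rng.integers(0, w - patch + 1))
        region = img[top:top + patch, left:left + patch]
        grey = region.mean(axis=2)  # simple channel average as the grey value
        # Box blur: average each pixel over a k x k neighbourhood (edge-padded).
        padded = np.pad(grey, pad, mode="edge")
        blurred = np.zeros_like(grey)
        for dy in range(k):
            for dx in range(k):
                blurred += padded[dy:dy + patch, dx:dx + patch]
        blurred /= k * k
        # Write the same grey value into all three colour channels.
        out[top:top + patch, left:left + patch] = blurred[..., None]
    return out


# Example pair for the pretext task: corrupted input, original as the target.
rng = np.random.default_rng(0)
target = rng.random((64, 64, 3))       # stand-in for a real image
inp = random_grey_blur(target, rng=rng)
```

The network would then be trained to map inp back to target, for example with a pixel-wise loss.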

Downstream Task: Image Classification

Now comes the fun part: seeing if our pretext task is of any use! We’ll follow the structure of the Image网 leaderboard here, looking at models for different image sizes trained with 5, 20, 80 or 200 epochs. The theory here is that we’d hope that our pretext task has given us a decent network, so we should get some results after 5 epochs, and keep getting better and better results with more training.

Results from early testing

The notebook goes through the process, training models on the labeled data provided with Image网 and scoring them on the validation set. This step can be quite tedious, but the 5-epoch models are enough to show that we’ve made an improvement on the baseline, which is pretty exciting. For training runs of 20 epochs or more, we still beat a baseline with no pre-training, but fall behind the current leaderboard entry based on simple inpainting. There is much tweaking to be done, and the runs take ~1 minute per epoch, so I’ll update this when I have more results.

Where Next?

Image网 is fairly new, and the leaderboard still needs filling in. Now is your chance for fame! Play with different pretext tasks (e.g. try just greyscale instead of blurred greyscale – it’s a single line of code to change), or tweak some of the parameters in the notebook and see if you can get a better score. And someone please do 256px?

Beyond this toy example, remember that unlabeled data can be a useful asset, especially if labeled data is sparse. If you’re ever facing a domain where a pretrained model is unavailable, self-supervised learning might come to your rescue.

Meta ‘Data Glimpse’ – Google Dataset Search

Christmas came in January this year, with Google’s release of ‘Dataset Search‘. They’ve indexed millions of cool datasets and made it easy to search through them. This post isn’t about any specific dataset, but rather I just wanted to share this epic new resource with you.

Google’s Dataset Search

I saw the news as it came out, which meant I had the pleasure of sharing it with my colleagues – all of whom got nerd sniped to some degree, likely resulting in much loss of revenue and a ton of fun had by all 🙂 A few minutes after clicking the link I was clustering dolphin vocalizations and smiling to myself. If you’re ever looking for an experiment to write up, have a trawl through the datasets on there and pick one that hasn’t got much ML baggage attached – you’ll have a nice novel project to brag about.

Clustering Dolphin noises

Say what you like about Google, there are people there doing so much to push research forward. Tools like Colab, Google Scholar, and now Dataset Search make it easy to do some pretty amazing research from anywhere. So go on – dive in 🙂

Swoggle Part 1- RL Environments and Literate Programming with NBDev

I’m going to be exploring the world of Reinforcement Learning. But there will be no actual RL in this post – that’s for part two. This post will do two things: describe the game we’ll be training our AI on, and show how I developed it using a tool called NBDev which is making me so happy at the moment. Let’s start with NBDev.

What is NBDev?

Like many, I started my programming journey editing scripts in Notepad. Then I discovered the joy of IDEs with syntax highlighting, and life got better. I tried many editors over the years, benefiting from better debugging, code completion, stylish themes… But essentially, they all offer the same workflow: write code in an editor, run it and see what happens, make some changes, repeat. Then came Jupyter notebooks. Inline figures and explanations. Interactivity! Suddenly you don’t need to re-run everything just to try something new. You can work in stages, seeing the output of each stage before coding the next step. For some tasks, this is a major improvement. I found myself using them more and more, especially as I drifted into Data Science.

But what about when you want to deploy code? Until recently, my approach was to experiment in Jupyter, and then copy and paste code into a separate file or files which would become my library or application. This caused some friction – which is where NBDev comes in.

~~~~~ “Create delightful python projects using Jupyter Notebooks” – NBDev website ~~~~~

With NBDev, everything happens in your notebooks. By adding special comments like #export to the start of a cell, you tell NBDev how to treat the code. This means you can write a function that will be exported, write some examples to illustrate how it works, plot the results and surround it with nice explanations in markdown. The exported code gets placed in a neat, well-ordered .py file that becomes your final product. The notebooks become documentation, and the extra examples you added to show functionality work as tests (although you can also add more formal unit testing). An extra line of code uploads your library for others to install with pip. And if you’re following their guide, you get a documentation site and continuous integration that updates whenever you push your changes to GitHub.
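To make that concrete, here is roughly what a pair of notebook cells might look like. The #export marker is the one mentioned above; the roll_dice function is a made-up example for this post, not code from the Swoggle library.

```python
#export
import random


def roll_dice(sides=6, rng=random):
    "Roll a dice with `sides` faces and return a value from 1 to `sides`."
    return rng.randint(1, sides)


# --- next cell: no marker, so it stays in the notebook ---
# It shows how the function behaves and doubles as a simple test.
rolls = [roll_dice() for _ in range(10)]
assert all(1 <= r <= 6 for r in rolls)
```

Only the first cell would end up in the exported .py file; the second one remains as a worked example in the docs.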

The upshot of all this is that you can effortlessly create good, clean code and documentation without having to switch between notebooks, editors and separate documentation. And the process you followed, the journey that led to the final design choices, is no longer hidden. You can show how things developed, and include experiments that justify a particular choice. This is ‘literate programming’, and it feels like a major shift in the way I think about software development. I could wax lyrical about this for ages, but you should just go and read about it in the launch post here.

What on Earth is Swoggle?

Christmas, 2019. Our wedding has brought a higher-than-normal influx of relatives to Cape Town, and when this extended family gets together, there are some things that are inevitable. One of these, it turns out, is the invention of new games to keep the cousins entertained. And thus, Swoggle was born 🙂

A Swoggle game in progress – 2 players are left.

The game is played on an 8×8 board. There are usually 4 players, each with a base in one of the corners. Players can move (a dice determines how far), “spoggle” other players (capturing them and placing them in “swoggle spa” – none of this violent terminology) or ‘swoggle’ a base (gently retiring the base’s owner from the game – no killing here). To make things interesting, there are four ‘drones’ that can be used as shields or to take an occupied base. Moving with a drone halves the distance you can travel, to make up for the advantages. A player with a drone can’t be spoggled by another player unless they too have a drone, or they ‘powerjump’ from their base (a half-distance move) onto the droned player. Maybe I’ll make a video one day and explain the rules properly 🙂

So, that’s the game. Each round is fairly quick, so we usually play multiple rounds, awarding points for different achievements. Spoggling (capturing) a player: 1 point. Swoggling (taking out a base): 3 points. Last one standing: 5 points. The dice rolls add lots of randomness, but there is still plenty of room for tactics, sibling rivalry and comedic mistakes.

Game Representation

If we’re going to teach a computer to play this, we need a way to represent the game state, check if moves are valid, keep track of who’s in the swoggle spa and which bases are still standing, etc. I settled on something like this:

Game state representation

There is a Cell in each x, y location, with attributes for player, drone and base. These cells are grouped in a Board, which represents the game grid and tracks the spa. The Board class also contains some useful methods like is_valid_move() and ways to move a particular player around. At the highest level, I have a Swoggle class that wraps a board, handles setting up the initial layout, provides a few extra convenience functions and can be used to run a game manually or with some combination of agents (which we’ll cover in the next section). Since I’m working in NBDev, I have some docs with almost no effort, so check out the generated docs for details on this implementation. Here’s what the documentation system turned my notebooks into:

Part of the generated documentation

The ability to write code and comments in a notebook, and have that turn into a swanky docs page, is borderline magical. Mine is a little messy since this is a quick hobby project. To see what this looks like in a real project, check out the docs for NBDev itself or Fastai v2.
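Coming back to the Cell and Board structure described earlier, here is a stripped-down sketch of how such a representation could look. It is not the code from the swoggle library: the field names, the find() helper and especially the movement rule inside is_valid_move() (Manhattan distance no greater than the dice roll, ignoring drones, bases and captures) are simplifying assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """One square of the grid."""
    player: Optional[int] = None  # id of the player standing here, if any
    drone: bool = False           # True if a drone sits on this square
    base: Optional[int] = None    # id of the player whose base this is, if any


@dataclass
class Board:
    """The game grid plus the list of captured players (the 'spa')."""
    size: int = 8
    spa: List[int] = field(default_factory=list)
    cells: List[List[Cell]] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        # cells[y][x] holds the Cell at column x, row y.
        self.cells = [[Cell() for _ in range(self.size)] for _ in range(self.size)]

    def find(self, player: int) -> Optional[Tuple[int, int]]:
        """Return the (x, y) position of `player`, or None if not on the board."""
        for y, row in enumerate(self.cells):
            for x, cell in enumerate(row):
                if cell.player == player:
                    return x, y
        return None

    def is_valid_move(self, player: int, x: int, y: int, roll: int) -> bool:
        """Simplified rule (an assumption): stay on the board and move
        between 1 and `roll` squares, measured as Manhattan distance."""
        pos = self.find(player)
        if pos is None or player in self.spa:
            return False
        if not (0 <= x < self.size and 0 <= y < self.size):
            return False
        distance = abs(x - pos[0]) + abs(y - pos[1])
        return 1 <= distance <= roll


board = Board()
board.cells[0][0].player = 1            # put player 1 in a corner
can_move = board.is_valid_move(1, 3, 0, roll=3)
```

A wrapper class in the spirit of Swoggle would then own a Board, place the bases and drones, and ask agents for moves in turn.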

Creating Agents

Since the end goal is to use this for reinforcement learning, it would be nice to have an easy way to add ‘Agents’ – code that defines how a player in the game will make a move in a given situation. It would also be useful to have a few non-RL agents to test things out and, later, to act as opponents for my fancier bots. I implemented two types of agent:

  • RandomAgent Simply picks a random but valid move by trial and error, and makes that move.
  • BasicAgent Adds a few simple heuristics. If it can take a base, it does so. If it can spoggle a player, it does so. If neither of these options is possible, it moves randomly.

You can see the agent code here. The notebook also defines a few other useful functions, such as win_rates() to pit different agents against each other and see how they do. This is fun to play with – after a few experiments it’s obvious that the board layout and order of players matters a lot. A BasicAgent going last will win ~62% of games against three RandomAgents – not unexpected. But of the three RandomAgents, the one opposite the BasicAgent (and thus furthest from it) will win the majority of the remaining games.
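As a rough sketch of the two agent types (not the library's actual code), the heuristics can be written in a few lines. One simplification to flag: this version assumes the game hands the agent a list of already-valid moves, each a dict with hypothetical takes_base and spoggles flags, rather than finding a valid move by trial and error.

```python
import random


class RandomAgent:
    """Picks any of the valid moves it is offered."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def choose(self, moves):
        return self.rng.choice(moves)


class BasicAgent(RandomAgent):
    """Prefers taking a base, then spoggling a player, else moves randomly."""

    def choose(self, moves):
        for key in ("takes_base", "spoggles"):
            preferred = [m for m in moves if m.get(key)]
            if preferred:
                return self.rng.choice(preferred)
        return super().choose(moves)


# Hypothetical move descriptions for a single turn.
moves = [
    {"to": (2, 3)},
    {"to": (4, 3), "spoggles": True},
    {"to": (7, 7), "takes_base": True},
]
best = BasicAgent(seed=0).choose(moves)
```

With this interface, something like win_rates() only needs to loop over games and count which agent is left standing.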

Next Step: Reinforcement Learning!

This was a fun little holiday coding exercise. I’m definitely an NBDev convert – I feel so much more productive using this compared to any other development approach I’ve tried. Thank you Jeremy, Sylvain and co for this excellent tool!

Now, the main point of this wasn’t just to get the game working – it was to use it for something interesting. And that, I hope, is coming soon in Part 2. As I type this, a neural network is slowly but surely learning to follow the rules and figuring out how to beat those sneaky RandomAgents. Wish it luck, stay tuned, and, if you’re *really* bored, pip install swoggle and watch some BasicAgents battle it out 🙂

Snapshot Serengeti – Working with Large Image Datasets

Driven Data launched a competition around the Snapshot Serengeti database – something I’ve been intending to investigate for a while. Although the competition is called “Hakuna Ma-data” (which where I come from means something like “there is no data”), this is actually the largest dataset I’ve worked with to date, with ~5TB of high-res images. I suspect that that’s putting people off (there are only a few names on the leaderboard), so I’m writing this post to show how I put together an entry, run through some tricks for dealing with big datasets, give you a notebook to get started quickly and try out a fun new tool I’ve found for monitoring long-running experiments using Neptune. Let’s dive in.

The Challenge

The goal of the competition is to create a model that can correctly label the animal(s) in an image sequence from one of many camera traps scattered around the Serengeti plains, which are teeming with wildlife. You can read more about the data and the history of the project on their website. There can be more than one type of animal in an image, making this a multi-label classification problem.

Some not-so-clear images from the dataset

The drivendata competition is interesting in that you aren’t submitting predictions. Instead, you have to submit everything needed to perform inference in their hidden test environment. In other words, you have to submit a trained model and the code to make it go. This is a good way to practice model deployment.


The approach I took to modelling is very similar to the other fastai projects I’ve done recently. Get a pre-trained resnet50 model, tune the head, unfreeze, fine-tune, and optionally re-train with larger images right at the end. It’s a multi-label classification problem, so I followed the fastai planet labs example for labeling the data. You can see the details of the code in the notebook (coming in the next section) but I’m not going to go over it all again here. The modelling in this case is less interesting than the extra things needed to work at this scale.
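For anyone new to multi-label problems, the labeling step boils down to giving each image a set of tags rather than a single class. Here is a small sketch of that idea in plain Python. The file names and species are invented for the example, and the space-separated tag format is an assumption modelled on how multi-label tags are commonly stored, not a copy of the notebook's code.

```python
# Hypothetical annotations: image name -> animals seen in it.
annotations = {
    "seq_001.jpg": ["zebra", "wildebeest"],
    "seq_002.jpg": ["elephant"],
    "seq_003.jpg": ["zebra"],
}

# The full, ordered list of classes found in the annotations.
classes = sorted({s for species in annotations.values() for s in species})


def to_tag_string(species):
    """Join the labels for one image into a single space-separated string."""
    return " ".join(sorted(species))


def to_multi_hot(species, classes):
    """Encode the labels as a 0/1 vector with one slot per class."""
    present = set(species)
    return [1 if c in present else 0 for c in classes]


rows = [
    (name, to_tag_string(species), to_multi_hot(species, classes))
    for name, species in annotations.items()
]
```

The tag strings are what you would write to a labels CSV, while the multi-hot vectors are what the model's sigmoid outputs get compared against.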

Starter Notebook

I’m a big fan of making data science and ML more accessible. For anyone intimidated by the scale of this contest, and not too keen on following the path I took in the rest of this post, I’ve created a Google Colab Notebook to get you started. It shows how to get some of the data, label it, create and train a model, score your model like they do in the competition and create a submission. This should help you get started, and will give a good score without modification. The notebook also has some obvious improvements waiting to be made – using more data, training the model further…..

Training a quick model in the starter notebook

The code in the notebook is essentially what I used for my first submission, which is currently the top out of the… 2 total submissions on the leaderboard. As much as I like looking good, I’ll be much happier if this helps a bunch of people jump ahead of that score! Please let me know if you use this, so that I know it was useful to someone.

Moar Data – Colab won’t cut it

OK, so there definitely isn’t 5TB of storage on Google Colab, and even though we can get a decent score with a fraction of the data, what if we want to go further? My approach was as follows:

  • Create a Google Cloud Compute instance with all the fastai libraries etc installed, by following this tutorial. The resultant machine has 50GB memory, a P100 GPU and 200GB disk space by default. It comes with most of what’s required for deep learning work, and has the added bonus of having jupyter + all the fastai course notebooks ready to get things going quickly. I made sure not to make the instance preemptible – we want to have long-running tasks going, so having it shut down unexpectedly would be sad.
  • Add an extra disk to the compute instance. This tutorial gave me the main steps. It was quite surreal typing in 6000 GB for the size! I mounted the disk at /ss_ims – that will be my base folder going forward.
  • Download a season of data, and then begin experimenting while more data downloads. No point having that pricey GPU sitting idle!
  • Train the full model overnight, tracking progress.
  • Submit!

Mounting a scarily large disk!

I won’t go into the cloud setup here, but in the next section let’s look at how you can track the status of a long-running experiment.

Neptune ML – Tracking progress

I’d set the experiments running on my cloud machine, but due to lack of electricity and occasional loss of connection I couldn’t simply leave my laptop running and connected to the VM to show how the model training was progressing. With so many images, each epoch of training took ages, and I had a couple of models crash early in the process. This was frustrating – I would try to leave it going overnight but if the model failed in the evening it meant that I had wasted some of my few remaining cloud credits on a machine sitting idle. Luckily, I had recently seen how to monitor progress remotely, meaning I could check my phone while I was out and see if the model was working and how good it was getting.

Tracking loss and metrics over time with Neptune

The process is pretty simple, and well documented here. You sign up for an account, get an API key and add a callback to your model. This will then let you log in to Neptune from any device, and track your loss, any metrics you’ve added and the output of the code you’re running. I could give more reasons why this is useful, but honestly the main motivation is that it’s cool! I had great fun surreptitiously checking my loss from my phone every half hour while I was out and about.

Tracking model training with neptune

Where next?

I’m out of cloud credits, and as an ‘independent scientist’ my funding situation doesn’t really justify spending more money on cloud compute to try a better entry. If you’d like to sponsor some more work, I may have another go with a properly trained model. I did manage to experiment on using more than the first image in a sequence, and using Jeremy Howard’s trick of doing some final fine-tuning on larger images – would be interesting to see how much these improve the score in this contest.

I hope this post encourages more of you to try this contest out! As the starter notebook shows, you can get close to the top (beating the benchmark) with a tiny fraction of the data and some simple tricks. Give it a try and report how you do in the comments!

Deep Learning + Remote Sensing – Using NNs to turn imagery into meaningful features

Every now and again, the World Bank conducts something called a Living Standards Measurement Study (LSMS) survey in different countries, with the purpose being to learn about people, their incomes and expenses, how they’re doing economically and so on. These surveys provide very useful info to various stakeholders, but they’re expensive to conduct. What if we could estimate some of the parameters they measure from satellite imagery instead? That was the goal of some researchers at Stanford back in 2016, who came up with a way to do just that and wrote it up into this wonderful paper in Science. In this blog post, we’ll explore their approach, replicate the paper (using some more modern tools) and try a few experiments of our own.

Predicting Poverty: Where do you start?

Nighttime lights

How would you use remote sensing to estimate economic activity for a given location? One popular method is to look at how much light is being emitted there at night – as my 3 regular readers may remember, there is a great nighttime lights dataset produced by NOAA that was featured in a data glimpse a while back. It turns out that the amount of light sent out does correlate with metrics such as assets and consumption, and this data has been used in the past to model things like economic activity (see another data glimpse post for more on that). One problem with this approach: the low end of the scale gets tricky – nighttime lights don’t vary much below a certain level of expenditure.

Looking at daytime imagery, we see many things that might help tell us about the wealth in a place: type of roofing material on the houses, the number of roads, how built-up an area is…. But there’s a problem here too: these features are quite complicated, and training data is sparse. We could try to train a deep learning model to take in imagery and spit out income level, but the LSMS surveys typically only cover a few hundred locations – not a very large dataset, in other words.

Jean et al’s sneaky trick

The key insight in the paper is that we can train a CNN to predict nighttime lights (for which we have plentiful data) from satellite imagery, and in the process it will learn features that are important for predicting lights – and that these features will likely also be good for predicting our target variable as well! This multi-step transfer learning approach did very well, and is a technique that’s definitely worth keeping in mind when you’re facing a problem without much data.

But wait, you say. How is this better than just using nightlights? From the article: “How might a model partially trained on an imperfect proxy for economic well-being—in this case, the nightlights used in the second training step above—improve upon the direct use of this proxy as an estimator of well-being? Although nightlights display little variation at lower expenditure levels (Fig. 1, C to F), the survey data indicate that other features visible in daytime satellite imagery, such as roofing material and distance to urban areas, vary roughly linearly with expenditure (fig. S2) and thus better capture variation among poorer clusters. Because both nightlights and these features show variation at higher income levels, training on nightlights can help the CNN learn to extract features like these that more capably capture variation across the entire consumption distribution.” (Jean et al, 2016). So the model learns expenditure-dependent features that are useful even at the low end, overcoming the issue faced by approaches that use nightlights alone. Too clever!

Can we replicate it?

The authors of the paper shared their code publicly but… it’s a little hard to follow, and is scattered across multiple R and Python files. Luckily, someone has already done some of the hard work for us, and has shared a PyTorch version in this GitHub repository. If you’d like to replicate the paper exactly, that’s a good place to start. I’ve gone a step further and consolidated everything into a single Google Colab notebook that borrows code from the above and builds on it. The rest of this post will explain the different sections of the notebook, and why I depart from the exact method used in the paper. Spoiler: we get a slightly better result with far fewer images downloaded.

Getting the data

The data comes from the Fourth Integrated Household Survey 2016-2017. We’ll focus on Malawi for this post. The notebook shows how to read in several of the CSV files downloaded from the website, and combine them into ‘clusters’ – see below. For each cluster location, we have a unique ID (HHID), a location (lat and lon), an urban/rural indicator, a weighting for statisticians, and the important variable: consumption (cons). This last one is the thing we’ll be trying to predict.

The relevant info from the survey data

One snag: the lat and lon columns are tricksy! They’ve been shifted to protect anonymity, so we’ll have to consider a 10km buffer around the given location and hope the true location is close enough that we get useful info.

Adding nighttime lights

Getting the nightlights value for a given location

To get the nightlight data, we’ll use a Python library to run Google Earth Engine queries. You’ll need a GEE account, and the notebook shows how to authenticate and get the required data. We can get the nightlights for each cluster location (getting the mean over an 8km buffer around the lat/lon points) and add this number as a column. To give us a target to aim at, we’ll compare any future models to a simple model based on these nightlight values only.

Downloading static maps images

Getting imagery for a given location

The next step takes a while: we need to download images for the locations. BUT: we don’t just want one for each cluster location – instead, we want a selection from the surrounding area. Each of these will have its own nightlights value, so that we get a larger training set to build our model on. Later, we’ll extract features for each image in a cluster and combine them. Details are in the notebook. The code takes several hours to run, but at the end of it you’ll have thousands of images ready to use.
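To give a feel for the "selection from the surrounding area" step, here is a small sketch that scatters sample points around a cluster's lat/lon. It is an illustration only: the function name, the 5 km half-width, the square sampling area and the roughly 111 km per degree conversion are assumptions for this example, and the notebook may well choose its locations differently.

```python
import math
import random


def sample_locations(lat, lon, n=20, radius_km=5.0, rng=None):
    """Return `n` (lat, lon) points drawn uniformly from a square of
    half-width `radius_km` centred on the given location."""
    rng = rng or random.Random()
    km_per_deg_lat = 111.32  # approximate length of one degree of latitude
    # A degree of longitude shrinks with the cosine of the latitude.
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat))
    points = []
    for _ in range(n):
        dy_km = rng.uniform(-radius_km, radius_km)
        dx_km = rng.uniform(-radius_km, radius_km)
        points.append((lat + dy_km / km_per_deg_lat, lon + dx_km / km_per_deg_lon))
    return points


# Example: 20 points around an arbitrary location in Malawi.
points = sample_locations(-13.96, 33.79, n=20, rng=random.Random(0))
```

Each sampled point would then get its own image download and its own nightlights value.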

Tracking requests/sec in my Google Cloud Console

You’ll notice that I only generate 20 locations around each cluster. The original paper uses 100. Reasons: 1) I’m impatient. 2) There is a rate limit of 25k images/day, and I didn’t want to wait (see #1), 3) The images are 400 x 400, but are then shrunk to train the model. I figured I could split the 400px image into 4 (or 9) smaller images that overlap slightly, and thus get more training data for free. This is suggested as a “TO TRY” in the notebook, but hint: it works. If you really wanted to get a better score, trying this or adding more imagery is an easy way to do so.
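The splitting trick is easy to sketch. The helper below only computes crop boxes; the name crop_boxes and the 20px overlap are assumptions for this example rather than values from the notebook. The boxes use the (left, upper, right, lower) order that Pillow's Image.crop expects.

```python
def crop_boxes(size=400, n=2, overlap=20):
    """Return n * n square crop boxes that tile a size x size image,
    with neighbouring crops sharing roughly `overlap` pixels."""
    # Crop side length, rounded up so the crops span the whole image.
    crop = -(-(size + (n - 1) * overlap) // n)
    if n > 1:
        # Spread the crop origins evenly from 0 to size - crop.
        starts = [round(i * (size - crop) / (n - 1)) for i in range(n)]
    else:
        starts = [0]
    return [(x, y, x + crop, y + crop) for y in starts for x in starts]


four = crop_boxes(400, n=2)   # 4 crops of 210 x 210
nine = crop_boxes(400, n=3)   # 9 crops of 147 x 147
```

Each box can be passed straight to Image.crop, and every crop inherits the nightlights value of the image it came from.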

Training a model

I’ll be using fastai to simplify the model creation and training stages. Before we can create a model, we need an appropriate databunch to hold the training data. An optional addition at this stage is to add image transforms to augment our training data – which I do with tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.) as suggested in the fastai satellite imagery example based on Planet labs. The notebook has the full code for creating the databunch:

Data ready for modelling

Next, we choose a pre-trained model and re-train it with our data. Remember, the hope is that the model will learn features that are related to night lights and, by extension, consumption. I’ve had decent results with resnet models, but in the shared notebook I stick with models.vgg11_bn to more closely match the original paper. You could do much more on this model training step, but we pick a learning rate, train for a few epochs and move on. Another place to improve!

Training the model to predict nightlights

Using the model as a feature extractor

This is a really cool trick. We’ll hook into one of the final layers of the network, with 512 outputs. We’ll save these outputs as each image is run through the network, and they’ll be used in later modelling stages. To save the features, you could remove the last few layers and run the data through, or you can use a trick I learnt from this TDS article and keep the network intact.

Cumulative explained variance of top PCA features

512 (or 4096, depending on the model and which layer you pick) is a lot of features. So we use PCA to get 30 or so meaningful features from those 512 values. As you can see from the plot above, the top few components explain most of the variance in the data. These top 30 PCA components are the features we’ll use for the last step in the process: predicting consumption.
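If PCA is new to you, the reduction step can be sketched with nothing more than NumPy's SVD. This is a generic illustration on random stand-in data, not the notebook's code (which may well use a library implementation instead); the function name pca_reduce is made up for this example.

```python
import numpy as np


def pca_reduce(features, n_components=30):
    """Project `features` (n_samples x n_features) onto the top principal
    components and return (reduced, cumulative_explained_variance)."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    reduced = centered @ vt[:n_components].T
    return reduced, np.cumsum(explained)


rng = np.random.default_rng(0)
fake_features = rng.normal(size=(200, 512))  # stand-in for the 512 CNN outputs
reduced, cumulative = pca_reduce(fake_features, n_components=30)
```

Plotting cumulative against the component index gives a curve like the one in the figure above; on real CNN features it rises much faster than on this random stand-in data.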

Putting it all together

For each image, we now have a set of 30 features that should be meaningful for predicting consumption. We group the images by cluster (aggregating their features). Now, for each cluster, we have the target variable (‘cons’), the nighttime lights (‘nl’) and 30 other potentially useful features. As we did right at the start, we’ll split the data into a test and a train set, train a model and then make predictions to see how well it does. Remember: our goal is to be better than a model that just uses nighttime lights. We’ll use the r^2 score when predicting log(y), as in the paper. The results:

  • Score using just nightlights (baseline): 0.33
  • Score with features extracted from imagery: 0.41

Using just the features derived from the imagery, we got a significant score increase. We’ve successfully used deep learning to squeeze some useful information out of satellite imagery, and in the process found a way to get better predictions of survey outcomes such as consumption. The paper got a score of 0.42 for Malawi using 100 images to our 20, so I’d call this a success.
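The grouping and scoring steps described above can be sketched in plain Python. Two assumptions to flag: the per-cluster aggregation here is a simple mean (the post only says the features are aggregated), and r^2 is computed as the coefficient of determination on log values. The function names are made up for this example.

```python
import math
from collections import defaultdict


def aggregate_by_cluster(image_features):
    """image_features: iterable of (cluster_id, feature_list) pairs.
    Returns {cluster_id: mean feature vector} (mean is an assumption)."""
    groups = defaultdict(list)
    for cluster_id, feats in image_features:
        groups[cluster_id].append(feats)
    return {
        cid: [sum(col) / len(col) for col in zip(*rows)]
        for cid, rows in groups.items()
    }


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy data: three images from two clusters, two features each.
per_cluster = aggregate_by_cluster([
    ("a", [1.0, 2.0]),
    ("a", [3.0, 4.0]),
    ("b", [5.0, 6.0]),
])

# Score predictions of log(consumption) against log of the true values.
cons = [1.5, 2.0, 4.0]                       # made-up consumption values
log_true = [math.log(c) for c in cons]
log_pred = [math.log(c) for c in [1.4, 2.2, 3.5]]
score = r2_score(log_true, log_pred)
```

The per-cluster vectors (plus the nightlights value, if you want it) become the inputs to whatever final regression model you choose.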


There are quite a few ways you can improve the score. Some are left as exercises for the reader 🙂 Here are a few that I’ve tried:
1) Tweaking the model used in the final step: 0.44 (better than the paper)
2) Using sub-sampling to boost size of training dataset + using a random forest model: 0.51 (!)
3) Using a model trained for classification on binned NL values (as in paper) as opposed to training it on a regression task: score got worse
4) Cropping the downloaded images into 4 to get more training data for the model (no other changes): 0.44 up from 0.41 without this step. >0.5 aggregating features of 3 different subsets of images for each cluster
5) Using a resnet-50 model: 0.4 (no obvious change this time – score likely depends less on model architecture and more on how well it is trained)

Other potential improvements:
– Download more imagery
– Train the model used as a feature extractor better (I did very little experimentation or fine-tuning)
– Further explore the sub-sampling approach, and perhaps make multiple predictions on different sub-samples for each cluster in the test set, and combine the predictions.

Please let me know if any of these work well for you. I’m less interested in spending more time on this – see the next section.

Where next

I’m happy with these results, but don’t like a few aspects:

  • Using static maps from Google means we don’t know the date the imagery was acquired, and makes it hard to extend our predictions over a larger area without downloading a LOT of imagery (meaning you’d have to pay for the service or wait weeks)
  • Using RGB images and an imagenet model means we’re starting from a place where the features are not optimal for the task – hence the need for the intermediate nighttime lights training step. It would be nice to have some sort of model that can interpret satellite imagery well already and go straight to the results.
  • Downloading from Google Static Maps is a major bottleneck. I used only 20 images / cluster for this blog – to do 100 per cluster and for multiple countries would take weeks, and to extend predictions over Africa months. There is also patchy availability in some areas.

So, I’ve been experimenting with using Sentinel 2 imagery, which is freely available for download over large areas and comes with 13 bands over a wide spectrum of wavelengths. The resolution is lower, but the imagery still has lots of useful info. There are also large, labeled datasets like the EuroSAT database that have allowed people to pretrain models and achieve state of the art results for tasks like land cover classification. I’ve taken advantage of this by using a model pre-trained on this imagery for land cover classification tasks (using all 13 bands) and re-training it for use in the consumption prediction task we’ve just been looking at. I’ve been able to basically match the results we got above using only a single Sentinel 2 image for each cluster.

Using Sentinel imagery addresses these concerns – we can get imagery for an entire country, and make predictions for large areas, at different dates, without needing to rely on Google’s Static Maps API. More on this project in a future post…


As always, I’m happy to answer questions and explain things better! Please let me know if you’d like the generated features (to save having to run the whole modelling process), more information on my process or tips on taking this further. Happy hacking 🙂

Packaging a classification model as a web app

My shiny new web app, available here

In my previous post I introduced fastai, and used it to identify images with potholes. Since then, I’ve applied the same basic approach to the Standard Bank Tech Impact Challenge: Animal classification with pretty decent results. A first, rough model was able to score 97% accuracy thanks to the magic of transfer learning, and by unfreezing the inner layers and re-training with a lower learning rate I was able to up the accuracy to over 99% for this binary classification problem. It still blows my mind how good these networks are at computer vision.

Zebra or Elephant?

This was exciting and fun. But I wanted to share the result, and my peer group aren’t all that familiar with log-loss scores. How could I get the point across and communicate what this means? Time to deploy this model as a web application 🙂

Exporting the model for later use

Final training step, saving weights and exporting to a file in my Google Drive

I knew it was possible to save some of the model parameters with model.save(‘name’), but wasn’t sure how easy it would be to get a complete model definition. Turns out, enough people want this that you can simply call model.export(‘model_name’). So I set my model training again (I hadn’t saved last time) and started researching my next step while Google did my computing for me.

Packaging as an app

I expected this step to be rather laborious. I’d need to set up a basic app (planned to use Flask), get an environment with pytorch/fastai set up and deploy to a server or, just maybe, get it going on Heroku. But then I came across an exciting page in the fastai docs: ‘Deploying on Render‘. There are essentially 3 steps:
– Fork the example repository
– Edit the file to add a link to your exported model
– Sign up with Render and point it at your new GitHub repository.
Then hit deploy! You can read about the full process in the aforementioned tutorial. Make sure your fastai is a recent version, and that you export the model (rather than just saving the weights).

The resultant app is available online. I used an earlier model with 97% accuracy (since I’m enjoying that top spot on the leaderboard ;)) but it’s still surprisingly accurate. It even gets cartoons right!

Please try it out and let me know what you think. It makes a best guess – see what it says for non-animals, or flatter your friends by classifying them as pachyderms.


There seems to be a theme to my last few posts: “Things that sound hard are now easy!”. It’s an amazing world we live in. You can make something like this! It took 20 minutes, with me doing setup while the model trained! Comment here with links to your sandwich-or-not website, your am-I-awake app, your ‘ask-a-computer-if-this-dolphin-looks-happy’ business idea. Who knows, one of us might even make something useful 🙂

Yes, that is apparently an elephant…

UPDATE: I’ve suspended the service for now, but can re-start it if you’d like to try it. Reach out if that’s the case 🙂

Pothole Detection (aka Johno tries fastai)

This week saw folks from all over the AI space converge in Cape Town for the AI Expo. The conference was inspiring, and I had a great time chatting to all sorts of interesting people. There were so many different things happening (which I’m not going to cover here), but the one that led to this post was a hackathon run by Zindi for their most recent Knowledge competition: the MIIA Pothole Image Classification Challenge. This post will cover the basic approach used by many entrants (thanks to Jan Marais’ excellent starting notebook) and how I improved on it with a few tweaks. Let’s dive in.

The Challenge

The dataset consists of images taken from behind the dashboard of a car. Some images contain potholes, some don’t – the goal is to correctly discern between the two classes. Some example pictures:

Train and test data were collected on different days, and at first glance it looks like this will be a tough challenge! It looks like the camera is sometimes at different angles (maybe to get a better view of potholes) and the lighting changes from pic to pic.

The first solution

Jan won a previous iteration of this hackathon, and was kind enough to share a starting notebook (available here) with code to get up and running. You can view the notebook for the full code, but the steps are both simple and incredibly powerful:

  • Load the data into a ‘databunch’, containing both the labeled training data and the unlabeled test data. 15% of the training data is held back as a validation set. The images are scaled to 224px squares and grouped into batches.
The images are automatically warped randomly each time (to make the model more robust). This can be configured, but the default is pretty good.
  • Create a model: learn = cnn_learner(data, resnet18, metrics=accuracy). This single line does a lot! It downloads a pre-trained network (resnet18) that has already been optimised for image classification. It reconfigures the output of that network to match the number of classes in our problem. It links the model to the data, freezes the weights of the internal layers, and gives us a model ready for re-training on our own classes.
  • Pick a learning rate, by calling learn.lr_find() followed by learn.recorder.plot() and picking one just before the graph bottoms out (OK, it’s more complicated than that but you can learn the arcane art of lr optimization elsewhere)
*sucks thumb* A learning rate of 0.05 looks fine to me Bob.
  • Fit the model with learn.fit_one_cycle(3, lr) (Change number of epochs from 3 to taste), make predictions, submit!

There is some extra glue code to format things correctly, find the data and so on. But this is in essence a full image classification workflow, in a deceptively easy package. Following the notebook results in a log-loss score of ~0.56, which was on par with the top few entries on the leaderboard at the start of the hackathon. In the starter notebook Jan gave some suggestions for ways to improve, and it looks like the winners tried a few of those. The best score of the day went to Ivor (Congrats!!) with a log-loss of 0.46. Prizes were won, fun was had and we all learned how easy it can be to build an image classifier by standing on the shoulders of giants.
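
Since the leaderboard is scored on log-loss, it helps to see what the number means. Below is a small, dependency-free sketch of binary log-loss (mean cross-entropy). It assumes predictions are probabilities for the ‘pothole’ class; the exact clipping Zindi applies, if any, is not known here.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
print(log_loss(labels, [0.5, 0.5, 0.5, 0.5]))  # always guessing 50/50 gives ln(2), about 0.693
print(log_loss(labels, [0.9, 0.1, 0.8, 0.2]))  # confident and right: much lower
print(log_loss(labels, [0.1, 0.9, 0.8, 0.2]))  # confident but wrong on two: much higher
```

Lower is better, and confident wrong answers are punished heavily. For a two-class problem, anything under about 0.693 beats always answering 50/50.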

Making it better

As the day kicked off, I dropped a few hints about taking a look at the images themselves and seeing how one could get rid of unnecessary information. An obvious answer would be to crop the images a little – there aren’t potholes in the dashboard or the sky! I don’t think anyone tried it, so let’s give it a go now and see where we get. One StackOverflow page later, I had code to crop and warp an image:

Before and after warping. Now the road is the focus, and we’re not wasting effort on the periphery.

I ran my code to warp all the images and store them in a new folder. Then I basically re-ran Jan’s starting notebook using the warped images (scaled to 200×200), trained for 5 epochs with a learning rate of 0.1, made predictions and…. 0.367 – straight to the top of the leader-board. The image warping and training took 1.5 hours on my poor little laptop CPU, which sort of limits how much iterating I’m willing to do. Fortunately, Google Colab gives a free GPU, cutting that time to a few minutes.


My time in the sun

Thanks to Google’s compute, it didn’t take long to have an even better model. I leave it to you dear readers to figure out what tweaks you’ll need to hop into that top spot.

My key takeaway from this is how easy it’s become to do this sort of thing. The other day I found code from 2014 where I was trying to spot things in an image with a kludged-together neural network. The difference between that and today’s exercise, using a network trained on millions of images and adapting it with ease thanks to a cool library and a great starting point… it just blows my mind how much progress has been made.

Why are you still reading this? Go enter the competition already! 🙂

Trying Automated ML

Some students had asked me for my opinion on automated tools for machine learning. The thought occurred that I hadn’t done much with them recently, and it was about time I gave the much-hyped time savers a go – after all, aren’t they going to make data scientists like me redundant?

In today’s post, I’ll be trying out Google’s AutoML tool by throwing various datasets at it and seeing how well it does. To make things interesting, the datasets I’ll be using will be from Zindi competitions, letting us see where AutoML would rank on the player leader-board. I should note that these experiments are a learning exercise, and actually using AutoML to win contests is almost certainly against the rules. But with that caveat out the way, let’s get started!

How it works

AutoML (and other similar tools) aims to automate one step of the ML pipeline – that of model selection and tuning. You give it a dataset to work on, specify column types, choose an output column and specify how long you’d like it to train for (you pay per hour). Then sit back and wait. Behind the scenes, AutoML tries many different models and slowly optimizes network architecture, parameters, weights… essentially everything one could possibly tweak to improve performance gets tweaked. At the end of it, you get a (very complicated) model that you can then deploy with their services or use to make batch predictions.

The first step with AutoML tables – Importing the data.

The resultant models are fairly complex (mine were ~1GB each fully trained) and are not something you can simply download and use locally – you must deploy them via Google (for an extra fee). This, coupled with the cost of training models, makes it fairly expensive to experiment with if you use up your trial credits – so use them wisely.

Fortunately, there are other ways to achieve broadly the same result. For example, AutoKeras. Read more about that here.

Experiment 1: Farm Pin Crop Detection

This competition involves a classification problem, with the goal being to predict which crop is present in a given field. The training data is provided as field outlines and satellite images – not something that can effortlessly slot into AutoML tables. This meant that the first step was to sample the image bands for the different fields, and export the values to CSV files for later analysis (as described in this post). This done, I uploaded the resultant training file to cloud storage, selected the table, chose my input and output columns and hit go.

AutoML ‘Evaluate’ tab showing model performance.

The scoring metric for this competition is log loss. My previous best (using the same training data to train a random forest model) scored around 0.64 (~20th on the leaderboard). So a score of <0.6 looked promising. I uploaded the test set, hit predict and then manually cleaned up the output to match the submission format for Zindi. Score? 0.546, putting me in 12th place. No feature engineering besides sampling some satellite images, no manual tweaking of model parameters…. not bad!

I was quite pleased with this result. I enjoy the feature engineering side of things, but the tedium of hyper-parameter tuning is less appealing to me. If this tool can magically let me skip that step, it’s a win in my book! I may re-visit this with some added features, information from more images and perhaps a trick or two to enlarge the training set.

Experiment 2: Traffic Jam

Spurred on by the first success, I turned to the Traffic Jam competition since I still had the dataset on my laptop. This was a regression problem, with the goal being to predict the number of tickets sold for a given trip into Nairobi. The training data was fairly sparse, with only ~2000 rows to work from. Still, I figured it was worth a shot and threw a few node hours worth of Google-managed ML magic at the problem.

An MAE of 3.4, hypothetically equivalent to ~3rd place!

The evaluation results had me excited – an MAE of 3.4 would have placed the model in third place had the competition remained open. I hastily uploaded the predictions to Zindi, to see the score of… 5.3 (160th place). Now, I might be missing some glaring error in the way I formatted predictions for upload, but I suspect that the issue is with AutoML. It’s not really designed for such small datasets. From the website: “Depending on how many features your dataset has, 1,000 rows might not be enough to train a high-performing model.” The impressive MAE shown in the results tab is for one particular test set, and it seems that for the Zindi test set we were simply not as lucky. Another potential factor: The random test set will have sampled from the same date range as the training data, whereas the Zindi test set was for a different time period. In cases like this, a non-shuffled test/train split can be a better indicator of true performance.
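
To illustrate the difference between the two kinds of split, here is a minimal pandas sketch on synthetic data. The column names (`travel_date`, `tickets`) and the 80/20 ratio are assumptions, not the competition’s actual schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic trips: one row per trip with a travel date and tickets sold.
trips = pd.DataFrame({
    "travel_date": pd.date_range("2018-01-01", periods=200, freq="D"),
    "tickets": rng.integers(1, 40, size=200),
})

# Shuffled split: test rows come from the same date range as the training rows.
shuffled = trips.sample(frac=1.0, random_state=0)
rand_train, rand_test = shuffled.iloc[:160], shuffled.iloc[160:]

# Time-based split: train on the earliest 80%, test on the most recent 20%.
ordered = trips.sort_values("travel_date")
time_train, time_test = ordered.iloc[:160], ordered.iloc[160:]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Trivial baseline: always predict the training mean.
baseline = time_train["tickets"].mean()
print("MAE on later period:", mae(time_test["tickets"], np.full(len(time_test), baseline)))
```

With the time-based split, every test row is later than every training row, which mimics being scored on a future period.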

So, we’ve learnt something new! The magic tool isn’t magic, and just like any other method it needs good training data to make good predictions.

Experiment 3: Sendy

I couldn’t resist trying it out once more on the newly launched Sendy Competition. I merged the Riders info into the train and test sets, uploaded the data, gave it an hour of training time and set it going. The goal is to minimize RMSE when predicting travel time between two locations (for deliveries). I also did some modelling myself while I waited for the AutoML training to finish.

Scores (RMSE for predicted time in seconds)
My first attempt (Catboost on provided data): 734 (7th place when this post was written)
First place: 721
Google AutoML: 724 (4th place until I convince them to remove my latest entry)

Not too shabby! To me, one of the great uses of a tool like this is to give a ballpark for what a good model looks like. Without the Zindi leaderboard, I wouldn’t have a way to gauge my model performance. Is it good? Could it get better with the same data? Now I can compare to the AutoML, using it as a ‘probably close to best’ measure.

Where next?

These quick tests have convinced me that these automated tools can be a useful part of my workflow, but are not a complete replacement for manual experimentation, exploration, feature engineering and modelling. I intend to play around more with AutoML and other tools in the near future, so stay tuned for a continuation of this series.

Mapping Change in Cropland in Zimbabwe (Part 1)

I’m going to be working on a project that will ultimately require manually outlining tobacco fields in Zimbabwe. To help locate potential fields, it would be nice to have a model that can predict whether a given area contains cropland. Training such a model requires labeled fields – a chicken and egg scenario that should have me resigned to hours of manual work. But I much prefer not to do things if I can possibly get a computer to do it, and this post (along with one or more sequels) will document the ways I’ve made my life easier, and the lessons learnt in the process.

Trick #1: Standing on shoulders

I recently encountered a project that already did a lot of the hard work tagging fields all over Africa. Their results (featured in today’s Data Glimpse post) look great, but the training data used isn’t published. Now, I could just use their published map, but I’m also interested in change over time, while their map is based solely on 2015 data. What if we train a new model on the output of their model? This isn’t generally a great idea (since you’re compounding errors) but it might be good enough for our purposes.

Satellite image (left) and my predicted cropland (right, in red)

In Google Earth Engine (script available here), I created a composite image from Landsat 8 images taken in 2015, including NDVI, band values from a greenest-pixel composite and band values from late in the year (planting season for most crops). This is to be the input to our model. I then sampled 2500 points, recording the inputs (the bands of the composite image) and the desired output (the cropland probability made available by the team). This data was used to train a random forest model (framing the task as a classification problem) and the predictions compared to the predictions from the QED data. The result: 99% accuracy.
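
For anyone unfamiliar with NDVI, it is a simple ratio of the near-infrared and red bands. The Earth Engine script does this on the server side; the numpy sketch below just shows the arithmetic on a few made-up reflectance values.

```python
import numpy as np


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    # Leave pixels with zero total reflectance as NaN rather than dividing by zero.
    return np.divide(nir - red, denom, out=np.full_like(denom, np.nan), where=denom != 0)


# Made-up surface reflectance values for three pixels.
nir = np.array([0.50, 0.30, 0.05])  # Landsat 8 band 5 (near infrared)
red = np.array([0.10, 0.25, 0.04])  # Landsat 8 band 4 (red)
print(ndvi(nir, red))
```

Healthy vegetation reflects strongly in the near infrared and absorbs red, so dense green crops give values well above zero while bare soil and water sit near or below it.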

Confusion matrix and accuracy

What does this accuracy figure mean? How is it so high? It’s less astonishing when we look more deeply. This is a model of the same type as that used by the QED team, with roughly the same inputs. It isn’t surprising that it can quickly replicate the decision function so accurately. It’s highly unlikely that it’s this accurate when compared to the ground truth. But we can say the following: we now have a model that is very similar to that used by the QED team to predict cropland probability for the year 2015.

Now what? Looking at change over time

The model takes Landsat 8 image data as its inputs. It was trained on 2015 data, but there is no reason why we can’t make predictions based on other years, and see where these predictions differ from the 2015 ones. Subtracting two years’ predictions gives a difference image, shown below for 2015 – 2018. Red indicates areas where cropland is predicted in 2018 and not 2015 (new cropland). Black and green are areas where the model predicts no change or less cropland in 2018.

Difference Image (2018). Potential new cropland shown in red.
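
The subtraction itself is simple. Here is a toy numpy version with two tiny made-up binary cropland maps; it assumes the predictions have been thresholded to 0/1 and that the difference is taken as 2018 minus 2015.

```python
import numpy as np

# Tiny made-up binary cropland maps (1 = cropland, 0 = not) for two years.
pred_2015 = np.array([[1, 0, 0],
                      [1, 1, 0],
                      [0, 0, 0]])
pred_2018 = np.array([[1, 1, 0],
                      [0, 1, 0],
                      [0, 1, 0]])

# +1 = new cropland, 0 = no change, -1 = cropland no longer predicted.
diff = pred_2018 - pred_2015

new_cropland = diff == 1
lost_cropland = diff == -1
print("new cropland pixels:", int(new_cropland.sum()))
print("lost cropland pixels:", int(lost_cropland.sum()))
```

In the real difference image, the +1 pixels are the ones drawn in red.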

I don’t want to trust this model too much, but if nothing else this shows some areas where there might be fields that have appeared in the last few years. I now have a much better idea where to look, and where to dig deeper with manual inspection of images from different years.

Conclusions and next steps

This sets the scene for my next steps: manually outlining fields, differentiating between different crop types, training an improved model, adding more inputs… Stay tuned for part 2.

New Database: Forest Change in Different Regions

Forest loss is a major problem facing many parts of the world right now. Trees are being cleared to make way for agriculture, or simply cut down for fuel and timber. Tracking this loss is an important goal, and much work has been done in this area.

One of the best datasets on the topic is the Hansen Global Forest Change [1] dataset, available for free on the Google Earth Engine platform. This dataset tracks forest loss since the year 2000, and has become a key tool in fighting deforestation.

Forest cover (green), loss (red) and gain (blue) – from the Hansen dataset [1]

There is only one issue that I have with this data: it is HUGE! Approximately 1.22 TB. For anyone unable to write the code needed to analyse the data in GEE, this size means that downloading the data or importing it into traditional mapping applications is not feasible. And often we don’t need all of this data, instead simply requiring a few key stats on an area of interest. Consider wanting a graph of forest loss in your country over the last 20 years: it’s a nice visual to help you make a point, but it’s not worth learning to code or downloading >1TB of data for.

This leads to today’s project. I wrote some code that takes in a file specifying the boundaries of different regions. It then aggregates the data from the Hansen dataset over each of the specified regions. For example, I used the Large Scale International Boundary Polygons (LSIB) [2] map of the world’s countries as an input, ending up with total forest loss, loss per year and forest cover for every country in a convenient 98 KB csv file. It also outputs a version of the input file as a shapefile, with added attributes containing the summarized forest change data. The former is all you need to plot change over time, see which regions have experienced the most loss or identify which country has lost the most forest in the last ten years. The latter is nice for creating colorful maps displaying this information – it’s only ~60MB, and loads quickly into the mapping software on my laptop.

Forest loss in different regions

The Earth Engine code is available here. The rest of this post will explain how to use the generated datasets (available here) for simple analyses.

Viewing the shapefile in QGIS

QGIS [3] is an open source GIS application. The vector file (available here) can be opened in QGIS with ‘Open Data Source Manager’ -> ‘Vector Layer’ -> browse to the .shp file and click ‘Add’. By default, it looks uniform. To see the information better, right click on the layer, open properties and change the style from ‘single symbol’ to ‘graduated’:

Setting the style of the vector layer in QGIS

With these settings applied, the differences between countries become apparent. Play with the colours and classes until it looks good. To query the exact value of the loss in a given country, use the ‘Identify Features’ tool (Ctrl-Shift-I) and click to see all the attributes. To create a beautiful PDF map, consult a tutorial such as this one for all the fancy output options.

Forest loss displayed in QGIS

Analyzing the data with Python + Pandas

The smaller csv file (available here) is good for cases where the country outlines are not required. It is possible to open the file in Excel or Google Sheets, but let’s stretch our Python muscles and make some simple plots. A notebook with the full code for this example is available in the GitHub repository.

The first step is loading the data: we import the necessary libraries then load the data into a pandas DataFrame with “df = pd.read_csv(‘countries_w_hansen.csv’)”. For our first plot, let’s look at the total loss (from the ‘loss’ column) for different world regions:

Plotting forest loss for different regions
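
A plot like this boils down to a groupby and a sum. The sketch below uses a tiny stand-in DataFrame so it runs without the csv file. Only the ‘loss’ column name comes from the real data; the ‘region’ and ‘country’ column names and all the numbers are made up.

```python
import pandas as pd

# Tiny stand-in for countries_w_hansen.csv (hypothetical columns and values,
# apart from the 'loss' column name).
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Africa", "Africa", "Europe", "Europe"],
    "loss": [120.0, 80.0, 30.0, 10.0],
})

# Total loss per region, largest first.
loss_by_region = df.groupby("region")["loss"].sum().sort_values(ascending=False)
print(loss_by_region)

# With matplotlib installed, loss_by_region.plot(kind="bar") draws the chart.
```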

The Hansen data encodes the years different areas experienced loss events. This data is captured in the ‘Group X’ columns. We can sum these columns to see the total loss each year, and note the worrying trend:

Forest loss per year

Of course, we have the country data, and can focus on a single country or region using df.loc:

Forest loss over time in Africa. The drop looks encouraging… until you consider the latest date this data was updated (2018 was still ongoing)
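
Both the yearly totals and the single-region view can be sketched in a few lines. Again this uses a small stand-in DataFrame: the exact ‘Group 1’, ‘Group 2’… column naming and the ‘region’ column are assumptions, so check `df.columns` on the real file.

```python
import pandas as pd

# Stand-in data: one 'Group N' column per loss year (naming is an assumption).
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["Africa", "Africa", "Europe"],
    "Group 1": [5.0, 3.0, 1.0],
    "Group 2": [6.0, 4.0, 1.0],
    "Group 3": [8.0, 5.0, 2.0],
})

year_cols = [c for c in df.columns if c.startswith("Group ")]

# Total loss per year across all rows.
loss_per_year = df[year_cols].sum()

# Same thing, restricted to one region with df.loc and a boolean mask.
africa_per_year = df.loc[df["region"] == "Africa", year_cols].sum()

print(loss_per_year)
print(africa_per_year)
```

Swapping the mask for `df["country"] == "..."` gives the same series for a single country.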

Where next?

This data is fairly depressing, but my hope is that an exploration of it doesn’t end with resignation. There are things we can do, ways we can help reduce this loss. Take a look at the data. Share the stats on your country, and push for change. Post those graphs on Facebook, call your representatives and demand action, find an organization working to fight this… If we’re serious about saving our planet, we’re all going to have to be involved.


[1] – Hansen, M. C., P. V. Potapov, R. Moore, M. Hancher, S. A. Turubanova, A. Tyukavina, D. Thau, S. V. Stehman, S. J. Goetz, T. R. Loveland, A. Kommareddy, A. Egorov, L. Chini, C. O. Justice, and J. R. G. Townshend. 2013. “High-Resolution Global Maps of 21st-Century Forest Cover Change.” Science 342 (15 November): 850–53. Data available on-line at:

[2] – LSIB: Large Scale International Boundary Polygons, Simplified. The United States Office of the Geographer provides the Large Scale International Boundary (LSIB) dataset. The detailed version (2013) is derived from two other datasets: a LSIB line vector file and the World Vector Shorelines (WVS) from the National Geospatial-Intelligence Agency (NGA).

[3] – QGIS. A Free and Open Source Geographic Information System.

[4] – GitHub repository containing data and code: