Turtle Recall: A Contrastive Learning Approach

NB: A scoring glitch caused this approach to look very good on the leaderboard, but local validation and a fix from Zindi later confirmed that it isn’t as magical as it first seemed. Still interesting from an educational point of view but if you’re looking to compete I’d suggest investigating alternate strategies.


Zindi has a competition running to identify individual turtles based on images from different views. This presents an interesting challenge for a few reasons:
1) There are relatively few images per turtle (10-50 each) and these have been taken from multiple angles. Given how similar they are, simply treating this as a normal multi-class classification challenge is hard.
2) There is an emphasis on generalization – it would be great if the organizations involved could add additional turtles without expensive re-training of models.

One potential approach that should help address these problems is to learn useful representations – some way to encode an image in a meaningful way such that the representations of images of one individual are all ‘similar’ by some measure while at the same time being dissimilar to the representations of images from other individuals. If we can pull this off, then given a new image we can encode it and compare the resulting representation with those of all known turtle images. This gives a ranked list of the most likely matches as well as a similarity score that could tell us if we’re looking at a completely new turtle.

To keep this post light on code, I have more info and a working example in this colab notebook. I’m also working on a video and will update this post once that’s done. And a modified version of this might be posted on Zindi learn, which again will be linked here once it’s up.

Contrastive Learning

The goal of contrastive learning is to learn these useful representations in an unsupervised or loosely-supervised fashion (aka self-supervised learning). A typical approach is to take some images, create augmented versions of those images and then embed both the originals and the augmented versions with some encoder network. The objective is to maximise the similarity between an image and its augmented version while minimising the similarity between that image and all the rest of the images in the batch. The trick here is that augmentation is used to create two ‘versions’ of an image. In our turtle case, we also have pictures of the same individual from different angles which can be used in place of (or in addition to) image augmentations to get multiple versions depicting one individual.

Top two rows: 16 turtles. Bottom 2 rows: augmented versions of different views of those same 16 turtles.

In my implementation, we generate a batch by picking batch_size turtles and then creating two sets of images with different pictures of those turtles. A resnet50 backbone acts as the encoder and is used to create embeddings of all of these images. We use a contrastive loss function to calculate a loss and update the network weights.
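To make the loss a little more concrete, here is a minimal sketch in plain Python. It is not the notebook's code: it assumes cosine similarity, a temperature of 0.1 and an InfoNCE-style cross-entropy in which row i of the first set should match row i of the second set. The function names are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss (an assumption): emb_a[i] and emb_b[i] show the same turtle.

    Each row of emb_a is a query whose correct match is the same row of emb_b;
    every other row of emb_b acts as a negative.
    """
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    total = 0.0
    for i, query in enumerate(a):
        logits = [sum(x * y for x, y in zip(query, key)) / temperature for key in b]
        # Numerically stable log-sum-exp over all candidates
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy for the correct pair
    return total / len(a)

# Matching pairs line up in the first call and are swapped in the second
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each embedding is closest to its partner (`aligned`) and large when it is closer to a different turtle (`shuffled`), which is the signal the encoder is trained on.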

You can check the notebook or the video for more details on the implementation here. With the bugs ironed out, the training loop runs and the loss shrinks nicely over time. But the question arises: how do we tell if the representations being learnt are actually useful?

Key reference for going deeper: SimCLR – A Simple Framework for Contrastive Learning of Visual Representations

Representational Similarity Matrices

Remember, our end goal is to be able to tell which individual turtle is in a new image. If things are working well, we’ll feed the new image through our encoder model to get a representation and then compare that to the encoded representations of the known turtles. All pictures of a given individual should be ‘similar’ in this space, but should not be similar to images of other individuals. A neat way to visualize this is through something called a Representational Similarity matrix. We take, say, 16 images of 5 different turtles. We embed them all and compute all possible pair-wise similarities and then plot them as a heatmap:

A Representational Similarity Matrix (RSM) comparing embeddings of 16 images from each of 5 turtles.

The images are obviously identical to themselves – hence the thin bright diagonal. But here you can also see that images of a given turtle seem to be similar to others of that same turtle – for instance, the bottom right 16×16 square shows that all images of the red turtle are quite similar to each other. This also shows us which turtles might be regularly confused (pink and yellow, for example) and which are relatively easy to disambiguate (pink and green).
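The matrix itself is cheap to compute. A rough sketch, assuming cosine similarity as the measure and using tiny made-up 2D embeddings in place of real encoder outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """All pairwise similarities; entry [i][j] compares image i with image j."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]

# Toy embeddings: two images of turtle A followed by two images of turtle B
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
rsm = similarity_matrix(embeddings)
```

Passing `rsm` to a heatmap function such as matplotlib's `imshow` gives a plot like the one above, provided the images are sorted by turtle so that each individual forms a block.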

RSMs are a useful tool for quickly getting a feel for the kind of representations being learnt, and I think more people should use them to add visual feedback when working on this kind of model. Looking at RSMs for images in the training set vs a validation set, or for different views, can shed more light on how everything is working. Of course, they don’t tell the whole story and we should still do some other evaluations on a validation set.

So does it work?

I trained a model on a few hundred batches with an embedding size of 100. For the test set, I took the turtle_ids of the most similar images in the training set to each test image and used those as the submission. If there were no images with a similarity above 0.8 I added ‘new_turtle’ as the first guess. This scores ~0.4 in local testing and ~0.36 on the public leaderboard. This is pretty good considering we ignored the image_position label, the label balance and various flaws in the data! However, a classification-based baseline with FastAI scores ~0.6 and the top entries are shockingly close to perfect with mapk scores >0.98 so we have a way to go before this is competitive.

One benefit of our approach: adding a new turtle to the database doesn’t require re-training. Instead, we simply encode any images of that individual we have and add the embeddings to the list of possible matches we’ll use when trying to ID new images.
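The matching step and the "no re-training" point can be sketched together. This is an illustration rather than the notebook's code: the turtle ids, the 2D embeddings and the choice of five guesses per image are assumptions, while the 0.8 threshold comes from the description above.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def rank_matches(query, database, top_k=5, new_turtle_threshold=0.8):
    """Return up to top_k turtle ids, most similar first.

    database is a list of (turtle_id, embedding) pairs. If no known image has a
    similarity above the threshold, 'new_turtle' becomes the first guess.
    """
    scored = sorted(
        ((cosine_similarity(query, emb), turtle_id) for turtle_id, emb in database),
        reverse=True,
    )
    guesses = []
    if not scored or scored[0][0] <= new_turtle_threshold:
        guesses.append("new_turtle")
    for _, turtle_id in scored:
        if turtle_id not in guesses:  # keep each turtle once, at its best rank
            guesses.append(turtle_id)
    return guesses[:top_k]

# Hypothetical database of known turtles
database = [("t_001", [1.0, 0.0]), ("t_001", [0.9, 0.1]), ("t_002", [0.0, 1.0])]

known = rank_matches([1.0, 0.05], database)
unknown = rank_matches([-1.0, 0.1], database)

# Adding a new individual is just an append; the encoder is untouched
database.append(("t_003", [-1.0, 0.0]))
after_adding = rank_matches([-1.0, 0.1], database)
```

Before the append, the last query matches nothing well and gets `new_turtle` as its first guess. After the append, the same query ranks `t_003` first.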

Where Next?

There are many ways to improve on this:

  • Experiment with parameters such as embedding size, batch size, augmentation types, training approach, regularization etc.
  • Incorporate the image_position labels, either doing separate models for different angles, filtering potential matches based on the test labels or finding some way to feed the label into the model as an extra type of conditioning.
  • Experiment with fine-tuning the model on the classification task. Since it has now (theoretically) learnt good representations, we could likely fine-tune it with a classification loss and get even better competition performance (at the cost of lower generalizability).
  • Explore automated data cleaning. Some images are out-of-domain, showing random background as opposed to turtle faces. Some images are just bad quality, or just don’t work with center-cropping.
  • Try different models as the backbone
  • Investigate label balance

…And many more. I hope this post gets you excited about the competition! Feel free to copy and adapt the notebook (with attribution please) and let me know if you manage to make any improvements. See you on the leaderboard 🙂


Playing with Tweet Sentiment Analysis

The average sentiment of the most recent 200 tweets from each country’s capital city.

A mentee of mine has been working on web scraping for NLP projects and her most recent target was Twitter. She’s working on something cool (stay tuned) but in the meantime, I thought I’d share a few of my own experiments. You can follow along and see full code examples in this colab notebook.

Scraping Tweets with Twint

Scraping tweets from a specific user

I used twint – a scraper written in Python which gives a lot of functionality while avoiding the need for API keys, authentication etc. You can target specific users, locations, topics and dates (see their wiki for details) which makes this a powerful tool for finding and downloading tweets. For my tests today, I chose a few well-known Twitter personalities from my feed. I also scraped tweets from capital cities around the world, using the ‘Lang’ configuration option to focus on English tweets to make comparison easier (yes, I know, this is not ideal).

Sentiment Score with roBERTa

NLTK’s SIA can give a quick and easy sentiment score for a piece of text, but many tweets use more obscure language and styles that aren’t well-captured by the default lexicon or the approach as a whole. Luckily, tweet sentiment analysis is a popular task and there are pre-trained deep learning models available that do a pretty good job out-of-the-box. I used a roBERTa model fine-tuned on the TweetEval task. The model card on huggingface had all the code needed to classify a piece of text, making it very simple to get started. I’m so glad this trend of making models accessible with key info is catching on!

The model outputs three scores corresponding to the labels ‘negative’, ‘neutral’ and ‘positive’. We can combine the positive and negative scores to get a combined sentiment score running from -1 (very negative) to +1 (very positive). From this, we can get stats like ‘average sentiment’, but I wanted a better way to see at a glance what a user’s tweets look like. Hexbin plots to the rescue 🙂 These show the distribution of tweets in both sentiment and tweet length. You can see that Musk tends to tweet shorter, more neutral tweets while Gates favours mid-length positive ones and Lomborg tends heavily towards grumpy full-length rants 😂
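One simple way to build such a combined score is to subtract the negative probability from the positive one. That exact formula is an assumption here (the notebook may combine them differently), and the example scores are made up:

```python
def combined_sentiment(scores):
    """scores: dict of probabilities for 'negative', 'neutral' and 'positive'.

    Returns a value between -1 (very negative) and +1 (very positive).
    """
    return scores["positive"] - scores["negative"]

def average_sentiment(all_scores):
    """Mean combined score over a list of per-tweet score dicts."""
    values = [combined_sentiment(s) for s in all_scores]
    return sum(values) / len(values)

# Made-up model outputs for three tweets
tweets = [
    {"negative": 0.7, "neutral": 0.2, "positive": 0.1},
    {"negative": 0.1, "neutral": 0.8, "positive": 0.1},
    {"negative": 0.05, "neutral": 0.15, "positive": 0.8},
]
per_tweet = [combined_sentiment(s) for s in tweets]
overall = average_sentiment(tweets)
```

Because the three probabilities sum to one, a confidently neutral tweet lands near zero, which is why neutral tweeters cluster around the middle of the hexbin plots.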

Scoring Countries

I was curious: what would we see if we grabbed some tweets from the capital city of each country and found the average sentiment score? Where do the positive tweeters live? Ideally, we’d account for different languages, grab a wide selection of tweets covering a longer timeline and do all sorts of other careful analyses. But since this entire project is the result of one night’s insomnia I just grabbed the latest 200 English tweets from each country’s capital (using the countryinfo library to get the coordinates) and went with those. Plotting the average sentiment as a choropleth map using Plotly gives us the title image of this post. Don’t read too much into this – it’s just a demo to show what might be possible with a bit more work.


Data Science gives us the tools to ask questions about the world around us. And thanks to the kind folks who put so much effort into the libraries and tools we can access for free, it doesn’t have to be hard! I hope this post inspires you to ask your own questions. Feel free to modify and share the code, and PLEASE tag me on Twitter @johnowhitaker with your own visualizations and extensions. Happy scraping 🙂

EDIT: I made a Huggingface space where you can try this for yourself: https://huggingface.co/spaces/johnowhitaker/twitter_viz

Data Glimpse: Predicted Historical Air Quality for African Cities

Air quality has been in the news a lot recently. Smoke from fires has had thousands of Californians searching for info around the health hazards of particulate matter pollution. Lockdown-induced changes have shown how a reduction in transport use can bring a breath of fresh air to cities. And a respiratory virus sweeping the globe has brought forward discussions around lung health and pollution, and the health burden associated with exposure to unhealthy levels of pollutants. There are thousands of air quality sensors around the world, but if you view a map of these sensors, it’s painfully obvious that some areas are underserved, with a marked lack of data:

Air Quality from sensors around the world. Source: https://waqi.info/

The ‘gap in the map’ was the motivation for a weekend hackathon hosted through Zindi, which challenged participants to build a model capable of predicting air quality (specifically PM2.5 concentration) based on available satellite and weather data.

The hackathon was a success, and was enough of a proof-of-concept that we decided to put a little more work into taking the results and turning them into something useful. Yasin Ayami and I spent a bit of time re-creating the initial data preparation phase (pulling the data from the Sentinel 5P satellite data collections in Google Earth Engine, creating a larger training set of known air quality readings etc) and then we trained a model inspired by the winning solutions that is able to predict historical air quality with a mean absolute error of less than 20.

Dashboard for exploring air quality across Africa (http://www.datasciencecastnet.com/airq/)

A full report along with notebooks and explanation can be found in this GitHub repository. But the good news is that you don’t need to re-create the whole process if you’d just like a look at the model outputs – those predictions are available in the repository as well. For example, to get the predictions for major cities across Africa you can download and explore this CSV file. And if you don’t want to download anything, I’ve also made a quick dashboard to show the data, both as a time-series for whatever city you want to view and as a map showing the average for all the locations.

I’ve tagged this post as a ‘Data Glimpse’ since the details are already written up elsewhere 🙂 I hope it’s of interest, and as always let me know if you have any questions around this. J.

Self-Supervised Learning with Image็ฝ‘


Until fairly recently, deep learning models needed a LOT of data to get decent performance. Then came an innovation called transfer learning, which we’ve covered in some previous posts. We train a network once on a huge dataset (such as ImageNet, or the entire text of Wikipedia), and it learns all kinds of useful features. We can then retrain or ‘fine-tune’ this pretrained model on a new task (say, elephant vs zebra), and get incredible accuracy with fairly small training sets. But what do we do when there isn’t a pretrained model available?

Pretext tasks (left) vs downstream task (right). I think I need to develop this style of illustration – how else will readers know that this blog is just a random dude writing on weekends? 🙂

Enter Self-Supervised Learning (SSL). The idea here is that in some domains, there may not be vast amounts of labeled data, but there may be an abundance of unlabeled data. Can we take advantage of this by using it somehow to train a model that, as with transfer learning, can then be re-trained for a new task on a small dataset? It turns out the answer is yes – and it’s shaking things up in a big way. This fastai blog post gives a nice breakdown of SSL, and shows some examples of ‘pretext tasks’ – tasks we can use to train a network on unlabeled data. In this post, we’ll try it for ourselves!

Follow along in the companion notebook.


Read the literature on computer vision, and you’ll see that ImageNet has become THE way to show off your new algorithm. Which is great, but coming in at 1.3 million images, it’s a little tricky for the average person to play with. To get around this, some folks are turning to smaller subsets of ImageNet for early experimentation – if something works well in small scale tests, *then* we can try it in the big leagues. Leading this trend have been Jeremy Howard and the fastai team, who often use ImageNette (10 easy classes from ImageNet), ImageWoof (Some dog breeds from ImageNet) and most recently Image็ฝ‘ (‘ImageWang’, ็ฝ‘ being ‘net’ in Chinese).

Image็ฝ‘ contains some images from both ImageNette and ImageWoof, but with a twist: only 10% of the images are labeled to use for training. The remainder are in a folder, unsup, specifically for use in unsupervised learning. We’ll be using this dataset to try our hand at self-supervised learning, using the unlabeled images to train our network on a pretext task before trying classification.

Defining Our Pretext Task

A pretext task should be one that forces the network to learn underlying patterns in the data. This is a new enough field that new ideas are being tried all the time, and I believe that a key skill in the future will be coming up with pretext tasks in different domains. For images, there are some options explained well in this fastai blog. Options include:

  • Colorization of greyscale images
  • Classifying corrupted images
  • Image In-painting (filling in ‘cutouts’ in the image)
  • Solving jigsaws

For fun, I came up with a variant of the image in-painting task that combines it with colorization. Several sections of the input image are blurred and turned greyscale. The network tries to replace these regions with sensible values, with the goal being to have the output match the original image as closely as possible. One reason I like the idea of this as a pretext task is that we humans get something similar. Each time we move our eyes, things that were in our blurry, greyscale peripheral vision are brought into sharp focus in our central vision – another input for the part of our brain that’s been pretending they were full HD color the whole time 🙂

Here are some examples of the grey-blurred images and the desired outputs:

Input/Output pairs for our pretext task, using the RandomGreyBlur transform
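For a feel of what such a transform does, here is a dependency-free sketch. It is a stand-in for the idea, not the RandomGreyBlur code from the notebook: the channel-mean greyscale, the 3x3 box blur, the patch sizes and the function names are all assumptions.

```python
import random

def grey_blur_patch(image, top, left, size):
    """Return a copy of image with one square patch turned grey and box-blurred.

    image is a list of rows, each row a list of (r, g, b) tuples.
    """
    height, width = len(image), len(image[0])
    grey = [[sum(px) / 3 for px in row] for row in image]  # simple channel mean
    out = [row[:] for row in image]
    for y in range(top, min(top + size, height)):
        for x in range(left, min(left + size, width)):
            # 3x3 box blur on the grey image, clipped at the borders
            neighbours = [
                grey[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            value = sum(neighbours) / len(neighbours)
            out[y][x] = (value, value, value)
    return out

def random_grey_blur(image, n_patches=3, size=2, rng=random):
    """Corrupt several random patches; the untouched image is the training target."""
    corrupted = image
    for _ in range(n_patches):
        top = rng.randrange(len(image))
        left = rng.randrange(len(image[0]))
        corrupted = grey_blur_patch(corrupted, top, left, size)
    return corrupted

# A tiny all-red 4x4 'image' as the original
original = [[(255, 0, 0)] * 4 for _ in range(4)]
model_input = random_grey_blur(original, rng=random.Random(0))
```

Each training pair is then (`model_input`, `original`): the network sees the corrupted version and is scored on how closely its output matches the clean one.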

We train our network on this task for 15 epochs, and then save its parameters for later use in the downstream task. See the notebook for implementation details.

Downstream Task: Image Classification

Now comes the fun part: seeing if our pretext task is of any use! We’ll follow the structure of the Image็ฝ‘ leaderboard here, looking at models for different image sizes trained with 5, 20, 80 or 200 epochs. The theory here is that we’d hope that out pretext task has given us a decent network, so we should get some results after 5 epochs, and keep getting better and better results with more training.

Results from early testing

The notebook goes through the process, training models on the labeled data provided with Image็ฝ‘ and scoring them on the validation set. This step can be quite tedious, but the 5-epoch models are enough to show that we’ve made an improvement on the baseline, which is pretty exciting. For training runs 20 epochs and greater, we still beat a baseline with no pre-training, but fall behind the current leaderboard entry based on simple inpainting. There is much tweaking to be done, and the runs take ~1 minute per epoch, so I’ll update this when I have more results.

Where Next?

Image็ฝ‘ is fairly new, and the leaderboard still needs filling in. Now is your chance for fame! Play with different pretext tasks (for eg, try just greyscale instead of blurred greyscale – it’s a single line of code to change), or tweak some of the parameters in the notebook and see if you can get a better score. And someone please do 256px?

Beyond this toy example, remember that unlabeled data can be a useful asset, especially if labeled data is sparse. If you’re ever facing a domain where a pretrained model is unavailable, self-supervised learning might come to your rescue.

Meta ‘Data Glimpse’ – Google Dataset Search

Christmas came in January this year, with Google’s release of ‘Dataset Search’. They’ve indexed millions of cool datasets and made it easy to search through them. This post isn’t about any specific dataset, but rather I just wanted to share this epic new resource with you.

Google’s Dataset Search

I saw the news as it came out, which meant I had the pleasure of sharing it with my colleagues – all of whom got nerd sniped to some degree, likely resulting in much lost revenue and a ton of fun had by all 🙂 A few minutes after clicking the link I was clustering dolphin vocalizations and smiling to myself. If you’re ever looking for an experiment to write up, have a trawl through the datasets on there and pick one that hasn’t got much ML baggage attached – you’ll have a nice novel project to brag about.

Clustering Dolphin noises

Say what you like about Google, there are people there doing so much to push research forward. Tools like Colab, Google Scholar, and now Dataset Search make it easy to do some pretty amazing research from anywhere. So go on – dive in 🙂

Swoggle Part 1- RL Environments and Literate Programming with NBDev

I’m going to be exploring the world of Reinforcement Learning. But there will be no actual RL in this post – that’s for part two. This post will do two things: describe the game we’ll be training our AI on, and show how I developed it using a tool called NBDev which is making me so happy at the moment. Let’s start with NBDev.

What is NBDev?

Like many, I started my programming journey editing scripts in Notepad. Then I discovered the joy of IDEs with syntax highlighting, and life got better. I tried many editors over the years, benefiting from better debugging, code completion, stylish themes… But essentially, they all offer the same workflow: write code in an editor, run it and see what happens, make some changes, repeat. Then came Jupyter notebooks. Inline figures and explanations. Interactivity! Suddenly you don’t need to re-run everything just to try something new. You can work in stages, seeing the output of each stage before coding the next step. For some tasks, this is a major improvement. I found myself using them more and more, especially as I drifted into Data Science.

But what about when you want to deploy code? Until recently, my approach was to experiment in Jupyter, and then copy and paste code into a separate file or files which would become my library or application. This caused some friction – which is where NBDev comes in.

~~~~~ “Create delightful python projects using Jupyter Notebooks” – NBDev website ~~~~~

With NBDev, everything happens in your notebooks. By adding special comments like #export to the start of a cell, you tell NBDev how to treat the code. This means you can write a function that will be exported, write some examples to illustrate how it works, plot the results and surround it with nice explanations in markdown. The exported code gets placed in a neat, well-ordered .py file that becomes your final product. The Notebook(s) becomes documentation, and the extra examples you added to show functionality work as tests (although you can also add more formal unit testing). An extra line of code uploads your library for others to install with pip. And if you’re following their guide, you get a documentation site and continuous integration that updates whenever you push your changes to GitHub.
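As a rough picture of how this looks in practice, here are two notebook cells written out as plain Python. Only the #export comment comes from the description above; the function and the cell-separator comments are made up for illustration.

```python
# --- notebook cell 1: marked for export into the library's .py file ---
#export
import random

def roll_die(sides=6, rng=random):
    "Roll a die with `sides` faces and return the result."
    return rng.randint(1, sides)

# --- notebook cell 2: no #export, so it stays in the notebook ---
# It shows readers how the function behaves and doubles as a simple test.
rolls = [roll_die() for _ in range(100)]
assert all(1 <= r <= 6 for r in rolls)
```

The first cell is the part that would end up in the library; the second is the kind of inline example that serves as both documentation and a test.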

The upshot of all this is that you can effortlessly create good, clean code and documentation without having to switch between notebooks, editors and separate documentation. And the process you followed, the journey that led to the final design choices, is no longer hidden. You can show how things developed, and include experiments that justify a particular choice. This is ‘literate programming’, and it feels like a major shift in the way I think about software development. I could wax lyrical about this for ages, but you should just go and read about it in the launch post here.

What on Earth is Swoggle?

Christmas, 2019. Our wedding has brought a higher-than-normal influx of relatives to Cape Town, and when this extended family gets together, there are some things that are inevitable. One of these, it turns out, is the invention of new games to keep the cousins entertained. And thus, Swoggle was born 🙂

A Swoggle game in progress – 2 players are left.

The game is played on an 8×8 board. There are usually 4 players, each with a base in one of the corners. Players can move (a die determines how far), “spoggle” other players (capturing them and placing them in “swoggle spa” – none of this violent terminology) or ‘swoggle’ a base (gently retiring the base’s owner from the game – no killing here). To make things interesting, there are four ‘drones’ that can be used as shields or to take an occupied base. Moving with a drone halves the distance you can travel, to make up for the advantages. A player with a drone can’t be spoggled by another player unless they too have a drone, or they ‘powerjump’ from their base (a half-distance move) onto the droned player. Maybe I’ll make a video one day and explain the rules properly 🙂

So, that’s the game. Each round is fairly quick, so we usually play multiple rounds, awarding points for different achievements. Spoggling (capturing) a player: 1 point. Swoggling (taking out a base): 3 points. Last one standing: 5 points. The dice rolls add lots of randomness, but there is still plenty of room for tactics, sibling rivalry and comedic mistakes.
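The scoring is simple enough to tally in a few lines. The point values are the ones listed above; the player names and event format are made up for illustration.

```python
from collections import Counter

# Points per achievement, as listed above
POINTS = {"spoggle": 1, "swoggle": 3, "last_standing": 5}

def tally_scores(events):
    """events: iterable of (player_name, achievement) pairs from one or more rounds."""
    scores = Counter()
    for player, achievement in events:
        scores[player] += POINTS[achievement]
    return scores

# A hypothetical round
events = [
    ("Alice", "spoggle"),
    ("Bob", "swoggle"),
    ("Alice", "swoggle"),
    ("Alice", "last_standing"),
]
scores = tally_scores(events)
```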

Game Representation

If we’re going to teach a computer to play this, we need a way to represent the game state, check if moves are valid, keep track of who’s in the swoggle spa and which bases are still standing, etc. I settled on something like this:

Game state representation

There is a Cell in each x, y location, with attributes for player, drone and base. These cells are grouped in a Board, which represents the game grid and tracks the spa. The Board class also contains some useful methods like is_valid_move() and ways to move a particular player around. At the highest level, I have a Swoggle class that wraps a board, handles setting up the initial layout, provides a few extra convenience functions and can be used to run a game manually or with some combination of agents (which we’ll cover in the next section). Since I’m working in NBDev, I have some docs with almost no effort, so check out https://johnowhitaker.github.io/swoggle/ for details on this implementation. Here’s what the documentation system turned my notebooks into:

Part of the generated documentation

The ability to write code and comments in a notebook, and have that turn into a swanky docs page, is borderline magical. Mine is a little messy since this is a quick hobby project. To see what this looks like in a real project, check out the docs for NBDev itself or Fastai v2.
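For readers who want a feel for the Cell and Board structure described in this section without opening the docs, here is a stripped-down sketch. It is not the swoggle library's code: the attribute types, the starting layout (each player standing on their own corner base) and the method names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Cell:
    player: Optional[int] = None  # id of the player standing here, if any
    drone: bool = False           # whether a drone is on this cell
    base: Optional[int] = None    # id of the player whose base this is, if any

@dataclass
class Board:
    size: int = 8
    cells: List[List[Cell]] = field(default_factory=list)
    spa: List[int] = field(default_factory=list)  # captured players

    def __post_init__(self):
        self.cells = [[Cell() for _ in range(self.size)] for _ in range(self.size)]
        last = self.size - 1
        # Assumed layout: one base per corner, with its owner starting on it
        for player_id, (x, y) in enumerate([(0, 0), (0, last), (last, 0), (last, last)]):
            self.cells[x][y].base = player_id
            self.cells[x][y].player = player_id

    def find_player(self, player_id):
        """Return the (x, y) of a player, or None if they are not on the grid."""
        for x, row in enumerate(self.cells):
            for y, cell in enumerate(row):
                if cell.player == player_id:
                    return (x, y)
        return None

    def spoggle(self, player_id):
        """Move a player from the grid into the spa."""
        location = self.find_player(player_id)
        if location is not None:
            x, y = location
            self.cells[x][y].player = None
            self.spa.append(player_id)

board = Board()
board.spoggle(3)
```

Keeping the spa on the board object means a single structure holds everything needed to describe the game state, which is handy later when that state has to be fed to an agent.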

Creating Agents

Since the end goal is to use this for reinforcement learning, it would be nice to have an easy way to add ‘Agents’ – code that defines how a player in the game will make a move in a given situation. It would also be useful to have a few non-RL agents to test things out and, later, to act as opponents for my fancier bots. I implemented two types of agent:

  • RandomAgent Simply picks a random but valid move by trial and error, and makes that move.
  • BasicAgent Adds a few simple heuristics. If it can take a base, it does so. If it can spoggle a player, it does so. If neither of these options are possible, it moves randomly.

You can see the agent code here. The notebook also defines a few other useful functions, such as win_rates() to pit different agents against each other and see how they do. This is fun to play with – after a few experiments it’s obvious that the board layout and order of players matters a lot. A BasicAgent going last will win ~62% of games against three RandomAgents – not unexpected. But of the three RandomAgents, the one opposite the BasicAgent (and thus furthest from it) will win the majority of the remaining games.
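The two agent types can be sketched against a toy game. None of this is the library's actual interface: ToyGame, its movement rule, and the is_valid_move(player, target, roll) and occupant(target) signatures are assumptions made so that the example runs on its own.

```python
import random

class ToyGame:
    """Stand-in for the real game: a move is valid within `roll` steps (Manhattan)."""
    size = 4

    def __init__(self):
        self.positions = {0: (0, 0)}
        self.contents = {(0, 2): "base", (1, 0): "player"}

    def is_valid_move(self, player_id, target, roll):
        px, py = self.positions[player_id]
        distance = abs(target[0] - px) + abs(target[1] - py)
        return 0 < distance <= roll

    def occupant(self, target):
        return self.contents.get(target)

class RandomAgent:
    def __init__(self, player_id, rng=None):
        self.player_id = player_id
        self.rng = rng or random.Random()

    def choose_move(self, game, roll, max_tries=1000):
        # Trial and error: keep guessing squares until one is valid
        for _ in range(max_tries):
            target = (self.rng.randrange(game.size), self.rng.randrange(game.size))
            if game.is_valid_move(self.player_id, target, roll):
                return target
        return None  # gave up without finding a valid move

class BasicAgent(RandomAgent):
    def choose_move(self, game, roll, max_tries=1000):
        candidates = [
            (x, y)
            for x in range(game.size)
            for y in range(game.size)
            if game.is_valid_move(self.player_id, (x, y), roll)
        ]
        # Heuristics in priority order: take a base, else capture a player
        for wanted in ("base", "player"):
            for target in candidates:
                if game.occupant(target) == wanted:
                    return target
        return super().choose_move(game, roll, max_tries)  # otherwise move randomly

game = ToyGame()
base_move = BasicAgent(0).choose_move(game, roll=2)    # a base is within reach
capture_move = BasicAgent(0).choose_move(game, roll=1)  # only a player is within reach
random_move = RandomAgent(0, random.Random(0)).choose_move(game, roll=1)
```

Making BasicAgent a subclass keeps the random fallback in one place, so any fancier agent only has to override the decision it actually improves on.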

Next Step: Reinforcement Learning!

This was a fun little holiday coding exercise. I’m definitely an NBDev convert – I feel so much more productive using this compared to any other development approach I’ve tried. Thank you Jeremy, Sylvain and co for this excellent tool!

Now, the main point of this wasn’t just to get the game working – it was to use it for something interesting. And that, I hope, is coming soon in Part 2. As I type this, a neural network is slowly but surely learning to follow the rules and figuring out how to beat those sneaky RandomAgents. Wish it luck, stay tuned, and, if you’re *really* bored, pip install swoggle and watch some BasicAgents battle it out 🙂

Snapshot Serengeti – Working with Large Image Datasets

Driven Data launched a competition around the Snapshot Serengeti database – something I’ve been intending to investigate for a while. Although the competition is called “Hakuna Ma-data” (which where I come from means something like “there is no data”), this is actually the largest dataset I’ve worked with to date, with ~5TB of high-res images. I suspect that that’s putting people off (there are only a few names on the leaderboard), so I’m writing this post to show how I did an entry, run through some tricks for dealing with big datasets, give you a notebook to get started quickly and try out a fun new tool I’ve found for monitoring long-running experiments using neptune.ml. Let’s dive in.

The Challenge

The goal of the competition is to create a model that can correctly label the animal(s) in an image sequence from one of many camera traps scattered around the Serengeti plains, which are teeming with wildlife. You can read more about the data and the history of the project on their website. There can be more than one type of animal in an image, making this a multi-label classification problem.
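In a multi-label setup, the target for each image is a vector with one slot per class rather than a single class index. A small sketch, with a made-up class list and a 0.5 decision threshold as assumptions:

```python
def multi_hot(labels, classes):
    """Encode a set of labels as a 0/1 vector with one entry per class."""
    return [1 if c in labels else 0 for c in classes]

def decode(probabilities, classes, threshold=0.5):
    """Turn per-class probabilities back into the list of predicted labels."""
    return [c for c, p in zip(classes, probabilities) if p >= threshold]

# Illustrative class list, not the competition's full label set
classes = ["elephant", "wildebeest", "zebra"]

target = multi_hot({"zebra", "wildebeest"}, classes)
predicted = decode([0.1, 0.8, 0.6], classes)
```

Each class gets its own independent yes/no decision, so an image can come back with several animals or with none at all.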

Some not-so-clear images from the dataset

The drivendata competition is interesting in that you aren’t submitting predictions. Instead, you have to submit everything needed to perform inference in their hidden test environment. In other words, you have to submit a trained model and the code to make it go. This is a good way to practice model deployment.


The approach I took to modelling is very similar to the other fastai projects I’ve done recently. Get a pre-trained resnet50 model, tune the head, unfreeze, fine-tune, and optionally re-train with larger images right at the end. It’s a multi-label classification problem, so I followed the fastai planet labs example for labeling the data. You can see the details of the code in the notebook (coming in the next section) but I’m not going to go over it all again here. The modelling in this case is less interesting than the extra things needed to work at this scale.

Starter Notebook

I’m a big fan of making data science and ML more accessible. For anyone intimidated by the scale of this contest, and not too keen on following the path I took in the rest of this post, I’ve created a Google Colab Notebook to get you started. It shows how to get some of the data, label it, create and train a model, score your model like they do in the competition and create a submission. This should help you get started, and will give a good score without modification. The notebook also has some obvious improvements waiting to be made – using more data, training the model further…..

Training a quick model in the starter notebook

The code in the notebook is essentially what I used for my first submission, which is currently the top out of the… 2 total submissions on the leaderboard. As much as I like looking good, I’ll be much happier if this helps a bunch of people jump ahead of that score! Please let me know if you use this, so that I know it was useful to someone.

Moar Data – Colab won’t cut it

OK, so there definitely isn’t 5TB of storage on Google Colab, and even though we can get a decent score with a fraction of the data, what if we want to go further? My approach was as follows:

  • Create a Google Cloud Compute instance with all the fastai libraries etc installed, by following this tutorial. The resultant machine has 50GB memory, a P100 GPU and 200GB disk space by default. It comes with most of what’s required for deep learning work, and has the added bonus of having jupyter + all the fastai course notebooks ready to get things going quickly. I made sure not to make the instance preemptible – we want to have long-running tasks going, so having it shut down unexpectedly would be sad.
  • Add an extra disk to the compute instance. This tutorial gave me the main steps. It was quite surreal typing in 6000 GB for the size! I mounted the disk at /ss_ims – that will be my base folder going forward.
  • Download a season of data, and then begin experimenting while more downloads. No point having that pricey GPU sitting idle!
  • Train the full model overnight, tracking progress.
  • Submit!
Mounting a scarily large disk!

I won’t go into the cloud setup here, but in the next section let’s look at how you can track the status of a long-running experiment.

Neptune ML – Tracking progress

I’d set the experiments running on my cloud machine, but due to lack of electricity and occasional loss of connection I couldn’t simply leave my laptop running and connected to the VM to show how the model training was progressing. With so many images, each epoch of training took ages, and I had a couple of models crash early in the process. This was frustrating – I would try to leave it going overnight but if the model failed in the evening it meant that I had wasted some of my few remaining cloud credits on a machine sitting idle. Luckily, I had recently seen how to monitor progress remotely, meaning I could check my phone while I was out and see if the model was working and how good it was getting.

Tracking loss and metrics over time with neptune.ml

The process is pretty simple, and well documented here. You sign up for an account, get an API key and add a callback to your model. This will then let you log in to neptune.ml from any device, and track your loss, any metrics you’ve added and the output of the code you’re running. I could give more reasons why this is useful, but honestly the main motivation is that it’s cool! I had great fun surreptitiously checking my loss from my phone every half hour while I was out and about.

Tracking model training with neptune

Where next?

I’m out of cloud credits, and as an ‘independent scientist’ my funding situation doesn’t really justify spending more money on cloud compute to try a better entry. If you’d like to sponsor some more work, I may have another go with a properly trained model. I did manage to experiment on using more than the first image in a sequence, and using Jeremy Howard’s trick of doing some final fine-tuning on larger images – would be interesting to see how much these improve the score in this contest.

I hope this post encourages more of you to try this contest out! As the starter notebook shows, you can get close to the top (beating the benchmark) with a tiny fraction of the data and some simple tricks. Give it a try and report how you do in the comments!

Deep Learning + Remote Sensing – Using NNs to turn imagery into meaningful features

Every now and again, the World Bank conducts something called a Living Standards Measurement Study (LSMS) survey in different countries, with the purpose being to learn about people, their incomes and expenses, how they’re doing economically and so on. These surveys provide very useful info to various stakeholders, but they’re expensive to conduct. What if we could estimate some of the parameters they measure from satellite imagery instead? That was the goal of some researchers at Stanford back in 2016, who came up with a way to do just that and wrote it up into this wonderful paper in Science. In this blog post, we’ll explore their approach, replicate the paper (using some more modern tools) and try a few experiments of our own.

Predicting Poverty: Where do you start?

Nighttime lights

How would you use remote sensing to estimate economic activity for a given location? One popular method is to look at how much light is being emitted there at night – as my 3 regular readers may remember, there is a great nighttime lights dataset produced by NOAA that was featured in a data glimpse a while back. It turns out that the amount of light sent out does correlate with metrics such as assets and consumption, and this data has been used in the past to model things like economic activity (see another data glimpse post for more on that). One problem with this approach: the low end of the scale gets tricky – nighttime lights don’t vary much below a certain level of expenditure.

Looking at daytime imagery, we see many things that might help tell us about the wealth in a place: type of roofing material on the houses, the number of roads, how built-up an area is…. But there’s a problem here too: these features are quite complicated, and training data is sparse. We could try to train a deep learning model to take in imagery and spit out income level, but the LSMS surveys typically only cover a few hundred locations – not a very large dataset, in other words.

Jean et al’s sneaky trick

The key insight in the paper is that we can train a CNN to predict nighttime lights (for which we have plentiful data) from satellite imagery, and in the process it will learn features that are important for predicting lights – and that these features will likely also be good for predicting our target variable as well! This multi-step transfer learning approach did very well, and is a technique that’s definitely worth keeping in mind when you’re facing a problem without much data.

But wait, you say. How is this better than just using nightlights? From the article: “How might a model partially trained on an imperfect proxy for economic well-being – in this case, the nightlights used in the second training step above – improve upon the direct use of this proxy as an estimator of well-being? Although nightlights display little variation at lower expenditure levels (Fig. 1, C to F), the survey data indicate that other features visible in daytime satellite imagery, such as roofing material and distance to urban areas, vary roughly linearly with expenditure (fig. S2) and thus better capture variation among poorer clusters. Because both nightlights and these features show variation at higher income levels, training on nightlights can help the CNN learn to extract features like these that more capably capture variation across the entire consumption distribution.” (Jean et al, 2016). So the model learns expenditure-dependent features that are useful even at the low end, overcoming the issue faced by approaches that use nightlights alone. Too clever!

Can we replicate it?

The authors of the paper shared their code publicly but… it’s a little hard to follow, and is scattered across multiple R and Python files. Luckily, someone has already done some of the hard work for us, and has shared a pytorch version in this GitHub repository. If you’d like to replicate the paper exactly, that’s a good place to start. I’ve gone a step further and consolidated everything into a single Google Colab notebook that borrows code from the above and builds on it. The rest of this post will explain the different sections of the notebook, and why I depart from the exact method used in the paper. Spoiler: we get a slightly better result with much fewer images downloaded.

Getting the data

The data comes from the Fourth Integrated Household Survey 2016-2017. We’ll focus on Malawi for this post. The notebook shows how to read in several of the CSV files downloaded from the website, and combine them into ‘clusters’ – see below. For each cluster location, we have a unique ID (HHID), a location (lat and lon), an urban/rural indicator, a weighting for statisticians, and the important variable: consumption (cons). This last one is the thing we’ll be trying to predict.

The relevant info from the survey data

One snag: the lat and lon columns are tricksy! They’ve been shifted to protect anonymity, so we’ll have to consider a 10km buffer around the given location and hope the true location is close enough that we get useful info.
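
To make the buffer idea concrete, here is a minimal sketch of turning a (shifted) cluster location into a bounding box that covers a 10km radius. The function name buffer_bbox, the Earth radius constant and the example coordinates are my own illustrative choices, not taken from the notebook, which may handle the buffer differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def buffer_bbox(lat, lon, radius_km=10.0):
    """Return (min_lat, min_lon, max_lat, max_lon) for a square that
    contains a circle of radius_km around (lat, lon)."""
    # A km of latitude spans roughly the same angle everywhere.
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    # A km of longitude spans a larger angle away from the equator.
    dlon = math.degrees(
        radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    )
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


# Illustrative coordinates only (roughly central Malawi)
bbox = buffer_bbox(-13.96, 33.79)
```

Anything we compute for a cluster (nightlights, imagery) then gets drawn from inside this box rather than from the exact point, since the exact point can't be trusted.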

Adding nighttime lights

Getting the nightlights value for a given location

To get the nightlight data, we’ll use the python library to run Google Earth Engine queries. You’ll need a GEE account, and the notebook shows how to authenticate and get the required data. We can get the nightlights for each cluster location (getting the mean over an 8km buffer around the lat/lon points) and add this number as a column. To give us a target to aim at, we’ll compare any future models to a simple model based on these nightlight values only.

Downloading static maps images

Getting imagery for a given location

The next step takes a while: we need to download images for the locations. BUT: we don’t just want one for each cluster location – instead, we want a selection from the surrounding area. Each of these will have its own nightlights value, so that we get a larger training set to build our model on. Later, we’ll extract features for each image in a cluster and combine them. Details are in the notebook. The code takes several hours to run, but at the end of it you’ll have thousands of images ready to use.
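
As a rough illustration of picking a selection of locations from the surrounding area, the sketch below draws random points inside a circle around a cluster. This is an assumption on my part about how one might do it: the notebook (and the original paper) may use a regular grid or a different radius, and sample_locations and its parameters are hypothetical names.

```python
import math
import random


def sample_locations(lat, lon, n=20, radius_km=10.0, seed=0):
    """Draw n (lat, lon) points uniformly inside a circle around a cluster."""
    rng = random.Random(seed)
    km_per_deg_lat = 111.32  # approximate length of one degree of latitude
    km_per_deg_lon = 111.32 * math.cos(math.radians(lat))
    points = []
    for _ in range(n):
        # The square root keeps the points evenly spread over the disc's
        # area instead of bunching up near the centre.
        r = radius_km * math.sqrt(rng.random())
        theta = rng.uniform(0, 2 * math.pi)
        dlat = r * math.sin(theta) / km_per_deg_lat
        dlon = r * math.cos(theta) / km_per_deg_lon
        points.append((lat + dlat, lon + dlon))
    return points


# Illustrative coordinates only
locations = sample_locations(-13.96, 33.79, n=20)
```

Each sampled point would then get its own image download and its own nightlights lookup.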

Tracking requests/sec in my Google Cloud Console

You’ll notice that I only generate 20 locations around each cluster. The original paper uses 100. Reasons: 1) I’m impatient. 2) There is a rate limit of 25k images/day, and I didn’t want to wait (see #1), 3) The images are 400 x 400, but are then shrunk to train the model. I figured I could split the 400px image into 4 (or 9) smaller images that overlap slightly, and thus get more training data for free. This is suggested as a “TO TRY” in the notebook, but hint: it works. If you really wanted to get a better score, trying this or adding more imagery is an easy way to do so.
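
Here is a small sketch of the splitting idea: computing crop boxes that tile a 400px image into 4 (or 9) slightly overlapping pieces. The 20px overlap and the helper name overlapping_crops are assumptions for illustration. The boxes are in (left, upper, right, lower) order, which is the form Pillow's Image.crop accepts.

```python
def overlapping_crops(size=400, grid=2, overlap=20):
    """Return (left, upper, right, lower) boxes that tile a size x size
    image into grid x grid crops sharing `overlap` pixels with neighbours."""
    # Tiles are made slightly larger than size / grid so that they overlap.
    tile = (size + (grid - 1) * overlap) // grid
    step = tile - overlap
    boxes = []
    for row in range(grid):
        for col in range(grid):
            left, upper = col * step, row * step
            boxes.append((left, upper, left + tile, upper + tile))
    return boxes


boxes = overlapping_crops()  # four 210 x 210 crops of a 400 x 400 image
```

Calling overlapping_crops(grid=3) gives the 9-crop variant. Every crop inherits the nightlights value of the image it came from, which is an approximation but one that gives more training data without more downloads.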

Training a model

I’ll be using fastai to simplify the model creation and training stages. Before we can create a model, we need an appropriate databunch to hold the training data. An optional addition at this stage is to add image transforms to augment our training data – which I do with tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.) as suggested in the fastai satellite imagery example based on Planet labs. The notebook has the full code for creating the databunch:

Data ready for modelling

Next, we choose a pre-trained model and re-train it with our data. Remember, the hope is that the model will learn features that are related to night lights and, by extension, consumption. I’ve had decent results with resnet models, but in the shared notebook I stick with models.vgg11_bn to more closely match the original paper. You could do much more on this model training step, but we pick a learning rate, train for a few epochs and move on. Another place to improve!

Training the model to predict nightlights

Using the model as a feature extractor

This is a really cool trick. We’ll hook into one of the final layers of the network, with 512 outputs. We’ll save these outputs as each image is run through the network, and they’ll be used in later modelling stages. To save the features, you could remove the last few layers and run the data through, or you can use a trick I learnt from this TDS article and keep the network intact.

Cumulative explained variance of top PCA features

512 (or 4096, depending on the model and which layer you pick) is a lot of features. So we use PCA to get 30 or so meaningful features from those 512 values. As you can see from the plot above, the top few components explain most of the variance in the data. These top 30 PCA components are the features we’ll use for the last step in the process: predicting consumption.

Putting it all together

For each image, we now have a set of 30 features that should be meaningful for predicting consumption. We group the images by cluster (aggregating their features). Now, for each cluster, we have the target variable (‘cons’), the nighttime lights (‘nl’) and 30 other potentially useful features. As we did right at the start, we’ll split the data into a test and a train set, train a model and then make predictions to see how well it does. Remember: our goal is to be better than a model that just uses nighttime lights. We’ll use the r^2 score when predicting log(y), as in the paper. The results:

  • Score using just nightlights (baseline): 0.33
  • Score with features extracted from imagery: 0.41

Using just the features derived from the imagery, we got a significant score increase. We’ve successfully used deep learning to squeeze some useful information out of satellite imagery, and in the process found a way to get better predictions of survey outcomes such as consumption. The paper got a score of 0.42 for Malawi using 100 images to our 20, so I’d call this a success.
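
For anyone who wants to see the aggregation and scoring steps spelled out, here is a minimal pure-Python sketch. It assumes the per-image features are averaged within each cluster and that the score is the usual coefficient of determination on log-transformed consumption. Both are my simplifications (the notebook uses pandas and scikit-learn, and the paper's exact r^2 definition may differ), and the function names are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_by_cluster(rows):
    """rows: iterable of (cluster_id, feature_list) pairs, one per image.
    Returns {cluster_id: mean feature vector over that cluster's images}."""
    grouped = defaultdict(list)
    for cluster_id, feats in rows:
        grouped[cluster_id].append(feats)
    return {
        cid: [sum(col) / len(col) for col in zip(*feats)]
        for cid, feats in grouped.items()
    }


def r2_log(y_true, y_pred):
    """Coefficient of determination computed on log-transformed values."""
    log_true = [math.log(v) for v in y_true]
    log_pred = [math.log(v) for v in y_pred]
    mean_true = sum(log_true) / len(log_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(log_true, log_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in log_true)
    return 1 - ss_res / ss_tot


# Toy data: two images for cluster "a", one for cluster "b"
toy_rows = [("a", [1.0, 2.0]), ("a", [3.0, 4.0]), ("b", [5.0, 6.0])]
cluster_feats = aggregate_by_cluster(toy_rows)
```

A score of 1 means perfect predictions, and 0 means no better than always predicting the mean of log(cons), which helps put 0.33 and 0.41 in context.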


There are quite a few ways you can improve the score. Some are left as exercises for the reader 🙂 Here are a few that I’ve tried:
1) Tweaking the model used in the final step: 0.44 (better than the paper)
2) Using sub-sampling to boost size of training dataset + using a random forest model: 0.51 (!)
3) Using a model trained for classification on binned NL values (as in paper) as opposed to training it on a regression task: score got worse
4) Cropping the downloaded images into 4 to get more training data for the model (no other changes): 0.44 up from 0.41 without this step. >0.5 aggregating features of 3 different subsets of images for each cluster
5) Using a resnet-50 model: 0.4 (no obvious change this time – score likely depends less on model architecture and more on how well it is trained)

Other potential improvements:
– Download more imagery
– Train the model used as a feature extractor better (I did very little experimentation or fine-tuning)
– Further explore the sub-sampling approach, and perhaps make multiple predictions on different sub-samples for each cluster in the test set, and combine the predictions.
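
To give a flavour of the sub-sampling idea, here is one way it could look: for each cluster, average the features over several random subsets of its images, so that one cluster yields several training rows that share the same target. This is a sketch of my reading of the approach, not the exact notebook code; subsample_features, the subset size and the number of subsets are all illustrative.

```python
import random


def subsample_features(image_feats, n_subsets=3, subset_size=10, seed=0):
    """image_feats: list of per-image feature lists for ONE cluster.
    Returns n_subsets feature vectors, each the mean over a random
    subset of that cluster's images."""
    rng = random.Random(seed)
    k = min(subset_size, len(image_feats))
    aggregated = []
    for _ in range(n_subsets):
        subset = rng.sample(image_feats, k)
        aggregated.append([sum(col) / k for col in zip(*subset)])
    return aggregated


# Toy cluster with 20 images and 2 features per image
toy_feats = [[float(i), float(2 * i)] for i in range(20)]
extra_rows = subsample_features(toy_feats)
```

At test time the same function can produce several feature vectors per cluster, with the final prediction being the average of the model's predictions over them.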

Please let me know if any of these work well for you. I’m less interested in spending more time on this – see the next section.

Where next

I’m happy with these results, but don’t like a few aspects:

  • Using static maps from Google means we don’t know the date the imagery was acquired, and makes it hard to extend our predictions over a larger area without downloading a LOT of imagery (meaning you’d have to pay for the service or wait weeks)
  • Using RGB images and an imagenet model means we’re starting from a place where the features are not optimal for the task – hence the need for the intermediate nighttime lights training step. It would be nice to have some sort of model that can interpret satellite imagery well already and go straight to the results.
  • Downloading from Google Static Maps is a major bottleneck. I used only 20 images / cluster for this blog – to do 100 per cluster and for multiple countries would take weeks, and to extend predictions over Africa months. There is also patchy availability in some areas.

So, I’ve been experimenting with using Sentinel 2 imagery, which is freely available for download over large areas and comes with 13 bands over a wide spectrum of wavelengths. The resolution is lower, but the imagery still has lots of useful info. There are also large, labeled datasets like the EuroSAT database that have allowed people to pretrain models and achieve state of the art results for tasks like land cover classification. I’ve taken advantage of this by using a model pre-trained on this imagery for land cover classification tasks (using all 13 bands) and re-training it for use in the consumption prediction task we’ve just been looking at. I’ve been able to basically match the results we got above using only a single Sentinel 2 image for each cluster.

Using Sentinel imagery addresses these concerns – we can get imagery for an entire country, and make predictions for large areas, at different dates, without needing to rely on Google’s Static Maps API. More on this project in a future post…


As always, I’m happy to answer questions and explain things better! Please let me know if you’d like the generated features (to save having to run the whole modelling process), more information on my process or tips on taking this further. Happy hacking 🙂

Packaging a classification model as a web app

My shiny new web app, available here

In my previous post I introduced fastai, and used it to identify images with potholes. Since then, I’ve applied the same basic approach to the Standard Bank Tech Impact Challenge: Animal classification with pretty decent results. A first, rough model was able to score 97% accuracy thanks to the magic of transfer learning, and by unfreezing the inner layers and re-training with a lower learning rate I was able to up the accuracy to over 99% for this binary classification problem. It still blows my mind how good these networks are at computer vision.

Zebra or Elephant?

This was exciting and fun. But I wanted to share the result, and my peer group aren’t all that familiar with log-loss scores. How could I get the point across and communicate what this means? Time to deploy this model as a web application 🙂

Exporting the model for later use

Final training step, saving weights and exporting to a file in my Google Drive

I knew it was possible to save some of the model parameters with model.save(‘name’), but wasn’t sure how easy it would be to get a complete model definition. Turns out, enough people want this that you can simply call model.export(‘model_name’). So I set my model training again (I hadn’t saved last time) and started researching my next step while Google did my computing for me.

Packaging as an app

I expected this step to be rather laborious. I’d need to set up a basic app (planned to use Flask), get an environment with pytorch/fastai set up and deploy to a server or, just maybe, get it going on Heroku. But then I came across an exciting page in the fastai docs: ‘Deploying on Render‘. There are essentially 3 steps:
– Fork the example repository
– Edit the file to add a link to your exported model
– Sign up with Render and point it at your new GitHub repository.
Then hit deploy! You can read about the full process in the aforementioned tutorial. Make sure your fastai is a recent version, and that you export the model (not just saving weights).

The resultant app is available at zebra-vs-elephant.onrender.com. I used an earlier model with 97% accuracy (since I’m enjoying that top spot on the leaderboard ;)) but it’s still surprisingly accurate. It even gets cartoons right!

Please try it out and let me know what you think. It makes a best guess – see what it says for non-animals, or flatter your friends by classifying them as pachyderms.


There seems to be a theme to my last few posts: “Things that sound hard are now easy!”. It’s an amazing world we live in. You can make something like this! It took 20 minutes, with me doing setup while the model trained! Comment here with links to your sandwich-or-not website, your am-I-awake app, your ‘ask-a-computer-if-this-dolphin-looks-happy’ business idea. Who knows, one of us might even make something useful 🙂

Yes, that is apparently an elephant…

UPDATE: I’ve suspended the service for now, but can re-start it if you’d like to try it. Reach out if that’s the case 🙂

Pothole Detection (aka Johno tries fastai)

This week saw folks from all over the AI space converge in Cape Town for the AI Expo. The conference was inspiring, and I had a great time chatting to all sorts of interesting people. There were so many different things happening (which I’m not going to cover here), but the one that led to this post was a hackathon run by Zindi for their most recent Knowledge competition: the MIIA Pothole Image Classification Challenge. This post will cover the basic approach used by many entrants (thanks to Jan Marais’ excellent starting notebook) and how I improved on it with a few tweaks. Let’s dive in.

The Challenge

The dataset consists of images taken from behind the dashboard of a car. Some images contain potholes, some don’t – the goal is to correctly discern between the two classes. Some example pictures:

Train and test data were collected on different days, and at first glance it looks like this will be a tough challenge! It looks like the camera is sometimes at different angles (maybe to get a better view of potholes) and the lighting changes from pic to pic.

The first solution

Jan won a previous iteration of this hackathon, and was kind enough to share a starting notebook (available here) with code to get up and running. You can view the notebook for the full code, but the steps are both simple and incredibly powerful:

  • Load the data into a ‘databunch’, containing both the labeled training data and the unlabeled test data, with 15% of the training data held out as a validation set. The images are scaled to 224px squares and grouped into batches.
The images are automatically warped randomly each time (to make the model more robust). This can be configured, but the default is pretty good.
  • Create a model: learn = cnn_learner(data, resnet18, metrics=accuracy). This single line does a lot! It downloads a pre-trained network (resnet18) that has already been optimised for image classification. It reconfigures the output of that network to match the number of classes in our problem. It links the model to the data, freezes the weights of the internal layers, and gives us a model ready for re-training on our own classes.
  • Pick a learning rate, by calling learn.lr_find() followed by learn.recorder.plot() and picking one just before the graph bottoms out (OK, it’s more complicated than that but you can learn the arcane art of lr optimization elsewhere)
*sucks thumb* A learning rate of 0.05 looks fine to me Bob.
  • Fit the model with learn.fit_one_cycle(3, lr) (Change number of epochs from 3 to taste), make predictions, submit!

There is some extra glue code to format things correctly, find the data and so on. But this is in essence a full image classification workflow, in a deceptively easy package. Following the notebook results in a log-loss score of ~0.56, which was on par with the top few entries on the leaderboard at the start of the hackathon. In the starter notebook Jan gave some suggestions for ways to improve, and it looks like the winners tried a few of those. The best score of the day went to Ivor (Congrats!!) with a log-loss of 0.46. Prizes were won, fun was had and we all learned how easy it can be to build an image classifier by standing on the shoulders of giants.
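
If log-loss is new to you, here is a minimal sketch of the binary version, so the scores above have some context. It assumes labels of 0/1 and predicted probabilities for the pothole class; the clipping value eps is my own choice, and Zindi's exact implementation may differ in such details.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy. y_true holds 0/1 labels and p_pred the
    predicted probability of class 1. Lower is better."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so we never take log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Always guessing 0.5 scores ln(2), about 0.693, whatever the labels are.
baseline = log_loss([1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

So a score of ~0.56 is better than hedging with 0.5 everywhere, and confident wrong answers are punished heavily, which is why well-calibrated probabilities matter as much as raw accuracy here.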

Making it better

As the day kicked off, I dropped a few hints about taking a look at the images themselves and seeing how one could get rid of unnecessary information. An obvious answer would be to crop the images a little – there aren’t potholes in the dashboard or the sky! I don’t think anyone tried it, so let’s give it a go now and see where we get. One StackOverflow page later, I had code to crop and warp an image:

Before and after warping. Now the road is the focus, and we’re not wasting effort on the periphery.

I ran my code to warp all the images and store them in a new folder. Then I basically re-ran Jan’s starting notebook using the warped images (scaled to 200×200), trained for 5 epochs with a learning rate of 0.1, made predictions and…. 0.367 – straight to the top of the leader-board. The image warping and training took 1.5 hours on my poor little laptop CPU, which sort of limits how much iterating I’m willing to do. Fortunately, Google Colab gives a free GPU, cutting that time to a few minutes.


My time in the sun

Thanks to Google’s compute, it didn’t take long to have an even better model. I leave it to you dear readers to figure out what tweaks you’ll need to hop into that top spot.

My key takeaway from this is how easy it’s become to do this sort of thing. The other day I found code from 2014 where I was trying to spot things in an image with a kludged-together neural network. The difference between that and today’s exercise, using a network trained on millions of images and adapting it with ease thanks to a cool library and a great starting point… it just blows my mind how much progress has been made.

Why are you still reading this? Go enter the competition already! 🙂