Language Models for Protein Sequence Classification

A 3D model of a protein kinase

We recently hosted a Zindi hackathon in partnership with Instadeep that challenged participants to predict the functional class of protein kinases (enzymes with some very specific functions in the cell) based on nothing but their amino acid sequences. This kind of sequence classification task has lots of potential applications – there is a lot of un-labelled data lying around on every computational biologist’s computer, and a tool that could guess a given protein’s function would be mighty handy.

Just one problem – it’s not exactly a simple task! There are 20-something amino acids which we represent as letters. Given a sequence like ‘AGASGSUFOFBEASASSSSSASBBBDGDBA’ (frantically monkey-types for emphasis) we need to find a way to a) encode this as something a model can make sense of and b) do the making-sense-of-ing! Fortunately, there’s another field where we need to go from a string of letters to something meaningful: Natural Language Processing. Since I’d just been watching the NLP lesson in the latest amazing fastai course I felt obliged to try out the techniques Jeremy was talking about on this sequence classification task.

The Basic Approach

Tokenized input (left) and class (right)

Treating this as a language task and drawing inspiration from ULMFiT[1], this was my basic approach:

  • I tokenized the sequences using ‘subword tokenization’ which captures not just individual amino acids as tokens but common groupings as well (eg ‘EELR’ is encoded as a single token). I think this basic approach was suggested by the SentencePiece paper[4] and it’s now part of fastai[5].
  • I then created a ‘pretext task’ of sequence completion to train a ‘language model’ (based on the AWD-LSTM architecture[2]). The model learns to predict the next token in a sequence with ~32% accuracy – the hope is that in doing so it also learns useful embeddings and some sort of latent understanding of how these sequences are structured.
  • We keep most of this network as the ‘encoder’ but modify the final layers for the actual task: sequence classification. Thanks to the pre-training, the model can very quickly learn the new task. I can get to 98% accuracy in a couple of minutes by training on only a small subset of the data.
  • Training the model for the sequence classification task takes a while on the full competition dataset, but it eventually reaches 99.8% accuracy with a log_loss on the test set (as used in the competition) of 0.08, which is equivalent to 3rd place.
  • Doing the normal tricks of ensembling, training a second model on reversed sequences etc quite easily bumps this up to glory territory, but that’s the boring bit.
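To make the ensembling step a little more concrete, here is a minimal sketch that averages the class probabilities from two models (say, one trained on forward sequences and one on reversed sequences) and scores each with log loss. The labels and probabilities are made up for illustration, and this is not the competition code.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

def average_ensemble(probs_a, probs_b):
    """Element-wise mean of two models' predicted class probabilities."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(probs_a, probs_b)]

# Hypothetical predictions for 3 sequences and 2 classes
y_true = [0, 1, 1]
forward_model = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
reversed_model = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]

ensemble = average_ensemble(forward_model, reversed_model)
loss_forward = log_loss(y_true, forward_model)
loss_reversed = log_loss(y_true, reversed_model)
loss_ensemble = log_loss(y_true, ensemble)
print(loss_forward, loss_reversed, loss_ensemble)
```

Because the negative log is convex, the averaged predictions never score worse than the mean of the two individual log losses, which is part of why this trick is such a reliable bump.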

It was fun to see how well this worked. You can find a more detailed write-up of the initial experiments on that competition dataset here. Spurred by these early results, I figured it was worth looking into this a little more deeply. What have others been doing on this task? Is this approach any good compared to the SOTA? Has anyone tried this particular flow on this kind of problem?

Getting Formal

It should come as no surprise that the idea of treating sequence classification like a language modelling task has already occurred to some people. For example, UDSMProt[7] turns out to have very nearly the same approach as that outlined above (self-five!). Their GitHub is a great resource.

There are other approaches as well – for example, ProtCNN[6] and DEEPred[8] propose their own deep learning architectures to solve these kinds of tasks. And there are some older approaches such as BLAST and its derivatives[9] that have long been standards in this field and still do decently (although they seem to be getting out-performed by these newer techniques).

So, we’re not the first to try this. However, I couldn’t find any papers using anything like the ‘subword’ tokenization. They either use individual amino acids as tokens, or in rare cases some choice of n-grams (for example, triplets of amino acids). The advantage of subword tokenization over these is that it can scale between the complexity of single-acid encodings and massive n-gram approaches by simply adjusting the vocabulary size.
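To illustrate how the vocabulary size controls the granularity, here is a toy byte-pair-encoding style tokenizer in plain Python. It is only a sketch of the general idea: SentencePiece (and fastai's wrapper around it) is far more capable, and the three sequences below are invented for the example.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Replace each occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_merges(sequences, num_merges):
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    corpus = [list(seq) for seq in sequences]  # start from single-acid tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges

def tokenize(seq, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

# Invented toy corpus in which the motif 'EELR' recurs
toy_sequences = ["MKEELRAG", "GAEELRKM", "EELRMMKA"]
merges = learn_merges(toy_sequences, num_merges=3)
print(tokenize("MKEELRAG", []))      # zero merges: one token per amino acid
print(tokenize("MKEELRAG", merges))  # a few merges: common groupings become tokens
```

With zero merges this is exactly the single-acid baseline. Each extra merge adds one entry to the vocabulary, and on this toy corpus three merges are enough for 'EELR' to become a single token.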

Your Homework

I did some initial tests – this definitely smells promising, but there is a lot of work to do for this to be useful to anyone, and I don’t currently have the time or compute to give it a proper go. If you’re looking for a fun NLP challenge with the potential to turn into some interesting research, this could be the job for you! Here are my suggestions:

  • Pick one or more benchmarks. Classification of the Pfam dataset is a nice one to start with. The ProtCNN paper[6] (quick link) has done a bunch of the ‘standard’ algorithms and shared their split as a Kaggle dataset, so you can quickly compare to those results.
  • Get some data for language model training. The Swiss-Prot dataset is a nice one, and for early tests even just the Pfam dataset is enough to try things out.
  • Train some language models. Do single-acid tokenization as a baseline and then try subword tokenization with a few different vocab sizes to compare.
  • See which models do best on the downstream classification task. Lots of experimenting to be done on sequence length, training regime and so on.
  • For bonus points, throw a transformer model or two at this kind of problem. I bet they’d be great, especially if pre-trained on a nice big dataset.
  • If (as I suspect) one of these does very well, document your findings, try everything again in case it was luck and publish it as a blog or, if you’re a masochist, a paper.
  • … profit?

I really hope someone reading this has the motivation to give this a go. If nothing else it’s a great learning project for language modelling and diving into a new domain. Please let me know if you’re interested – I’d love to chat, share ideas and send you the things I have tried. Good luck 🙂


[1] – Howard, J. and Ruder, S., 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

[2] – Merity, S., Keskar, N.S. and Socher, R., 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

[3] – Smith, L.N., 2017, March. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 464-472). IEEE.

[4] – Kudo, T. and Richardson, J., 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

[5] – Howard, J. and Gugger, S., 2020. Fastai: A layered API for deep learning. Information, 11(2), p.108.

[6] – Bileschi, M.L., Belanger, D., Bryant, D.H., Sanderson, T., Carter, B., Sculley, D., DePristo, M.A. and Colwell, L.J., 2019. Using deep learning to annotate the protein universe. bioRxiv, p.626507. (ProtCNN)

[7] – Strodthoff, N., Wagner, P., Wenzel, M. and Samek, W., 2020. UDSMProt: universal deep sequence models for protein classification. Bioinformatics, 36(8), pp.2401-2409. (UDSMProt)

[8] – Rifaioglu, A.S., Doğan, T., Martin, M.J., Cetin-Atalay, R. and Atalay, V., 2019. DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Scientific Reports, 9(1), pp.1-16. (DEEPred)

[9] – Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), pp.3389-3402.

Personal Metrics

This is just a quick post following on from some recent conversations in this area. tldr: Tracking some data about yourself is a great exercise, and I highly recommend it. In this post I’ll share a few of the tools I use, and dig around in my own data to see if there are any interesting insights….

Time Tracker: Toggl

The first source of data is my time tracker: Toggl. It’s simple to use, and has a web app as well as a good Android app. As a consultant, this is useful for billing etc, but it has also just become a general habit to log what I’m working on. It’s good motivation not to context-switch, and it’s a great way to keep track of what I’m up to. A good day of work can sometimes mean 4 hours on the clock, since I tend not to log small tasks or admin, but it’s still good enough that I’ll bill clients based on the hours logged. Toggl lets you do some reporting within the app, but you can also export the data to CSV for later analysis. Here’s my last two years, total seconds per month:
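If you want to reproduce this kind of monthly total from a CSV export, something like the sketch below would do it. The column names (‘Start date’, ‘Duration’) and the sample rows are assumptions for illustration; check them against your own export, which will have more columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: column names and rows are invented for this example
sample_csv = """Start date,Duration
2020-06-01,01:30:00
2020-06-15,02:00:15
2020-07-02,00:45:00
"""

def duration_to_seconds(text):
    """Convert 'HH:MM:SS' into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def seconds_per_month(csv_file):
    """Sum logged seconds per 'YYYY-MM' month."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        month = row["Start date"][:7]  # keep only the 'YYYY-MM' part
        totals[month] += duration_to_seconds(row["Duration"])
    return dict(totals)

monthly = seconds_per_month(io.StringIO(sample_csv))
print(monthly)
```

For a real file, swap the `io.StringIO` object for `open("your_export.csv", newline="")`.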

Time logged per month (as of August 12)

As you can see, I’ve been busier than normal the past few months – one of the reasons this blog hasn’t had any new posts for a while!

Daily mood and activities: daylio

Daylio is a smartphone app that asks ‘How was your day?’ every day, and optionally lets you log activities for the day. I’ve made it a habit, although tracking stopped for a few months at the start of the pandemic :/ One thing I like about this (and the previous thing I used) is that it forces you to evaluate how you’re feeling. Was today great, or merely good? Why was it ‘Meh’? And by quantifying something less concrete than simply hours worked, it lets me see what I can do to optimize for generally better days.

Time worked on days marked as Average, Good or Great

Mondays are my lowest day, followed by Wednesdays. Being outdoors bumps my rating from ~4 (good) to nearly 4.5 (5 being ‘great’). As you can see in the image above, lots of work tends to mean not-so-great days. Around 3 hours per day logged (4-6 hours work) is where I start properly having fun, and if I can fit in activities like birding or something creative then it’s even closer to optimum. I’m in a pretty good place now despite the busyness – the average score (~4.3) is much higher than when I was still in uni trying to balance work and assignments (3.3). It’s nice to see this – on tougher days it’s amazing to look back and see how many good or great ones there are, and how lovely life is overall.

Moar data: uLogMe

I recently found a project called uLogMe (by Karpathy of all people), and after reading his post about it I decided to give it a go. If you’re keen to try it, look for a fork on GitHub as the original project is deprecated. I only use the logging scripts, which keep track of active window title and number of keystrokes in each 9s window. This is really fun data, as you can identify different activities, find patterns, see trends in when you’re most active… As one example, look at a fairly typical day from last month:

Keystroke intensity over time

You can see me start a little late, since it’s winter. After an initial burst of work I went on a long walk looking for insects (there was a bioblitz on) before hacking away during my 10am meeting. There are spikes of activity and periods of very little (meetings) or no (breaks) activity. 6-8pm is my class time, so I’m tapping away in demos as I teach, especially in the second half of the lesson.

Check out Karpathy’s post to see what else it’s possible to do with this data.

Putting it all together

I can’t wait to get a fitness tracker to add sleep tracking, exercise and heart rate. But even without those, I have some really great data to be playing with. I can see relationships between external factors (travel, activities, work) and my mood, explore how much time goes into different projects, graph the number of characters being typed in different applications (spoiler: I use Jupyter a LOT) and generally put some hard numbers behind my intuition around how I’m spending my time and how that’s affecting me.

A small subset of the data now waiting to be analysed

I hope that this post marks a return to this blog for me (hours are trending downwards now that some courses I teach are wrapping up) and that it inspires you to find some personal data to track! If you still aren’t convinced, here’s a TED talk that might push you over the edge. Happy hacking 🙂

Swoggle Part 2 – Building a Policy Network with PyTorch, dealing with Cheaty Agents and ‘Beating’ the Game

In part 1, we laid the groundwork for our Reinforcement Learning experiments by creating a simple game (Swoggle) that we’d be trying to teach our AI to play. We also created some simple Agents that followed hard-coded rules for play, to give our AI some opponents. In this post, we’ll get to the hard part – using RL to learn to play this game.

The Task

Reinforcement Learning (Artist’s Depiction)

We want to create some sort of Agent capable of looking at the state of the game and deciding on the best move. It should be able to learn the rules and how to win by playing many games. Concretely, our agent should take in an array encoding the dice roll, the positions of the players and bases etc, and it should output one of 192 possible moves (64 squares, with two special kinds of move to give 64*3 possible actions). This agent shouldn’t just be a passive actor – it must also be able to learn from past games.
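One way to picture the 192-way output is as a flat index over (move type, square) pairs. The sketch below shows one possible encoding; the ordering is my own assumption for illustration and may not match how the Swoggle code lays out its actions.

```python
N_SQUARES = 64
N_MOVE_TYPES = 3  # a normal move plus the two special kinds of move
N_ACTIONS = N_SQUARES * N_MOVE_TYPES  # 192

def encode_action(square, move_type):
    """Map (square, move_type) to a single index in [0, N_ACTIONS)."""
    return move_type * N_SQUARES + square

def decode_action(action):
    """Inverse of encode_action: recover (square, move_type) from the index."""
    move_type, square = divmod(action, N_SQUARES)
    return square, move_type

print(N_ACTIONS, encode_action(square=5, move_type=2), decode_action(133))
```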

Policy Networks

In RL, a ‘policy’ is a map from game state to action. So when we talk about ‘Policy Learners’, ‘Policy Gradients’ or ‘Policy Networks’, we’re referring to something that is able to learn a good policy over time.

The network we’ll be training

So how would we ‘learn’ a policy? If we had a vast archive of past games, we could treat this as a supervised learning task – feed in the game state, chosen action and eventual reward for each action in the game history to a neural network or other learning algorithm and hope that it learns what ‘good’ actions look like. Sadly, we don’t have such an archive! So, we take the following approach:

  • Start a game (an ‘episode’)
  • Feed the game state through our policy network, which initially will give random output probabilities on each possible action
  • Pick an action, favoring those for which the network output is high
  • Keep making actions and feeding the resultant game state through the network to pick the next one, until the game ends.
  • Calculate the reward. If we won, +100. If we lost, -20. Maybe an extra +0.1 for each valid move made, and some negative reward for each time we tried to break the rules.
  • Update the network, so that it (hopefully) will better predict which moves will result in positive rewards.
  • Start another game and repeat, for as long as you want.
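The reward and update steps above can be sketched in a few lines. The win, loss and valid-move values come from the list; the penalty for an invalid attempt is a placeholder I picked, and the loss shown is the vanilla REINFORCE objective, which is an assumption about the general technique rather than a copy of the notebook.

```python
import math

def episode_reward(won, valid_moves, invalid_attempts,
                   win_reward=100.0, loss_reward=-20.0,
                   valid_bonus=0.1, invalid_penalty=-1.0):
    """Total reward for one game. invalid_penalty is a hypothetical value."""
    outcome = win_reward if won else loss_reward
    return outcome + valid_bonus * valid_moves + invalid_penalty * invalid_attempts

def reinforce_loss(action_probs, reward):
    """Vanilla REINFORCE loss: -reward * sum of log-probs of the chosen actions.

    Minimizing this raises the probability of the chosen actions when the
    reward is positive and lowers it when the reward is negative.
    """
    return -reward * sum(math.log(p) for p in action_probs)

reward = episode_reward(won=True, valid_moves=10, invalid_attempts=2)
loss = reinforce_loss([0.5, 0.25], reward)
print(reward, loss)
```

In a real implementation the probabilities would be network outputs that carry gradients (PyTorch tensors, for example) so that the loss can be backpropagated.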

Here’s a notebook where I implement this. The code borrows a little from this implementation (with associated blog post that explains it well). Some things I changed:

  • The initial example (like most resources you’ll find if you look around) chooses a problem with a single action – up or down, for example. I modified the network to take in 585 inputs (the Swoggle game state representation) and give out 192 outputs for the 64*3 possible actions an agent could take. I also added the final sigmoid layer since I’ll be interpreting the outputs as probabilities.
  • Many implementations either take random actions (totally random) or look at the argmax of the network output. This isn’t great in our case – random actions are quite often invalid moves, but the top output of the network might also be invalid. Instead, we sample an action from the probability distribution represented by the network output. This is like the approach Andrej Karpathy takes in his classic ‘Pong from Pixels’ post (which I highly recommend).
  • This game is dice-based (which adds randomness) and not all actions are possible at all times, so I needed to add code to handle cases where the proposed move is invalid. In those cases, we add a small negative reward and try a different action.
  • The implementation I started from used a parameter epsilon to shift from exploration (making random moves) to optimal play (picking the top network output). I removed this – by sampling from the prob. distribution, we keep our agent on its toes, and it always has a chance of acting randomly/unpredictably. This should make it more fun to play against, while still keeping its ability to play well most of the time.
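Here is a rough sketch of the sample-then-retry idea described in the list above. The `is_valid` callback, the penalty value and the retry limit are all hypothetical stand-ins for whatever the game engine provides.

```python
import random

def choose_action(probs, is_valid, rng=random, invalid_penalty=-0.5, max_tries=100):
    """Sample actions from the network output until a valid one turns up.

    Returns (action, penalty), where penalty accumulates over rejected
    proposals. The weights do not need to sum to 1.
    """
    penalty = 0.0
    actions = range(len(probs))
    for _ in range(max_tries):
        action = rng.choices(actions, weights=probs, k=1)[0]
        if is_valid(action):
            return action, penalty
        penalty += invalid_penalty  # small negative reward, then try again
    raise RuntimeError("no valid action sampled")

# Toy example: three actions, of which action 1 is currently illegal
rng = random.Random(0)
action, penalty = choose_action([0.1, 0.6, 0.3], lambda a: a != 1, rng=rng)
print(action, penalty)
```

Note that the retries happen against the same game state, so the dice roll stays fixed while the agent looks for a legal move.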

This whole approach takes a little bit of time to internalize, and I’m not best placed to explain it well. Check out the aforementioned ‘Pong from Pixels’ post and google for Policy Gradients to learn more.

Success? Or Cheaty Agents?

OpenAI’s glitch-finding players (source:

Early on, I seemed to have hit upon an excellent strategy. Within a few games, my Agent was winning nearly 50% of games against the basic game AI (for a four player game, anything above 25% is great!). Digging a little deeper, I found my mistake. If the agent proposed a move that was invalid, it stayed where it was while the other agents moved around. This let it ‘camp’ on its base, or wait for a good dice roll before swoggling another base. I was able to get a similar win-rate with the following algorithm:

  1. Pick a random move
  2. If it’s valid, make the move. If not, stay put (not always a valid action but I gave the agent control of the board!)
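In code, the two steps amount to something like this sketch, where `is_valid` is a hypothetical callback standing in for the game’s move checker and `None` means ‘stay put’:

```python
import random

def cheaty_agent_move(n_actions, is_valid, rng=random):
    """Pick a random action; return it if valid, otherwise None (stay put)."""
    action = rng.randrange(n_actions)
    return action if is_valid(action) else None

# Toy check with a rule that only allows even-numbered actions
rng = random.Random(42)
moves = [cheaty_agent_move(192, lambda a: a % 2 == 0, rng=rng) for _ in range(10)]
print(moves)
```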

That’s it – that’s the ‘CheatyAgent’ algorithm 🙂 Fortunately, I’m not the first to have flaws in my game engine exploited by RL agents – check out the clip from OpenAI above!

Another bug: See where I wrote sr.dice() instead of dice_roll? This let the network re-roll if it proposed an invalid move, which could lead to artificially high performance.

After a few more sneaky attempts by the AI to get around my rules, I finally got a setup that forced the AI to play by the rules, make valid moves and generally behave like a good and proper Swoggler should.

Winning for real

Learning to win!!!

With the bugs ironed out, I could start tweaking rewards and training the network! It took a few goes, but I was able to find a setup that let the agent learn to play in a remarkably short time. After a few thousand games, we end up with a network that can win against three BasicAgents about 40-45% of the time! I used the trained network to pick moves in 4000 games, and it won 1856 of them, confirming its superiority to the BasicAgents, who hung their heads in shame.

So much more to try

I’ve still got plenty to play around with. The network still tries to propose lots of invalid moves. Tweaking the rewards can change this (note the orange curve below that tracks ratio of valid:invalid moves) but at the cost of diverting the network from the true goal: winning games!

Learning to make valid moves, but at the cost of winning.

That said, I’m happy enough with the current state of things to share this blog. Give it a go yourself! I’ll probably keep playing with this, but unless I find something super interesting, there probably won’t be a part 3 in this series. Thanks for coming along on my RL journey 🙂

Behind the scenes of a Zindi Contest

User comments

Ever wondered what goes into launching a data science competition? If so, this post is for you. I spent the last few days working on the Fowl Escapades: Southern African Bird Call Audio Identification Challenge on Zindi, and thought it would be fun to take you behind the scenes a little to show how it all came together.

Step 1: Inspiration

Many competitions spring from an existing problem in need of a solution. For example, you may want a way to predict when your delivery will arrive based on weather, traffic conditions and the route your driver will take. In cases like this, an organization will reach out to Zindi with this problem statement, and move to stage 2 to see if it’s a viable competition idea. But this isn’t the only way competitions are born!

Sometimes, we find a cool dataset that naturally lends itself to answering an interesting problem. Sometimes we start with an interesting problem, and go looking for data that could help find answers. And occasionally, we start with nothing but a passing question at the end of a meeting: ‘does anyone have any other competition ideas?’. This was the case here.

I had been wanting to try my hand at something involving audio data. Since I happen to be an avid birder, I thought automatic birdsong identification would be an interesting topic. For this to work, we’d need bird calls – lots of them. Fortunately, after a bit of searching I found the star of this competition: Hundreds of thousands of calls from all over the world! A competition idea was born.

Step 2: Show me the data

To run a competition, you need some data (unless you’re going to ask the participants to find it for themselves!). This must:

  • Be shareable. Anything confidential needs to be masked or removed, and you either need to own the data or have permission to use it. For the birdsong challenge, we used data that had CC licences but we still made sure to get permission from xeno-canto and check that we’re following all the licence terms (such as attribution and non-modification).
  • Be readable. This means no proprietary formats, clear variable definitions, sensible column names, and ideally a guide for reading in the data.
  • Be manageable. Some datasets are HUGE! It’s possible to organize contests around big datasets, but it’s worth thinking about how you expect participants to interact with the data. Remember – not everyone has fast internet or free storage.
  • Be useful. This isn’t always easy to judge, which is why doing data exploration and building a baseline model early on is important. But ideally, the data has some predictive power for the thing you’re trying to model!

Visualizing birdsongs

By the time a dataset is released as part of a competition, it’s usually been through several stages of preparation. Let’s use the birdsong example and look at a few of these steps.

  • Collection: For an organization, this would be an ongoing process. In our example case, this meant scraping the website for files that met our criteria (Southern African birds) and then downloading tens of thousands of mp3 files.
  • Cleaning: A catch-all term for getting the data into a more usable form. This could be removing unnecessary data, getting rid of corrupted files, combining data from different sources…
  • Splitting and Masking: We picked the top 40 species with the most example calls, and then split the files for each species into train and test sets, with 33% of the data kept for the test set. Since the file names often showed the bird name, we used ''.join(random.choices(string.ascii_uppercase + string.digits, k=6)) to generate random IDs. However you approach things, you’ll need to make sure that the answers aren’t deducible from the way you organize things (no sorting by bird species for the test set!)
  • Checking (and re-checking, and re-checking): Making sure everything is in order before launch is vital – nothing is worse than trying to fix a problem with the data after people have started working on your competition! In the checking process I discovered that some mp3s had failed to download properly, and others were actually .wav files with .mp3 as the name. Luckily, I noticed this in time and could code up a fix before we went live.
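As a sketch of the splitting and masking step, the snippet below holds out roughly 33% of each species’ files and gives every file a random 6-character ID using the same expression quoted above. The file names, the record layout and the duplicate-ID check are my own additions for illustration, not the exact script used for the competition.

```python
import random
import string

def random_id(rng, k=6):
    """Random ID made of uppercase letters and digits."""
    return "".join(rng.choices(string.ascii_uppercase + string.digits, k=k))

def split_and_mask(files_by_species, test_fraction=0.33, seed=0):
    """Split each species' files into train/test and assign masked IDs."""
    rng = random.Random(seed)
    train, test, used = [], [], set()
    for species, files in files_by_species.items():
        files = list(files)
        rng.shuffle(files)
        n_test = round(len(files) * test_fraction)
        for i, original_name in enumerate(files):
            new_id = random_id(rng)
            while new_id in used:  # short random IDs can collide, so re-draw
                new_id = random_id(rng)
            used.add(new_id)
            record = {"id": new_id, "species": species, "source": original_name}
            (test if i < n_test else train).append(record)
    rng.shuffle(test)  # avoid leaking species through the order of the test set
    return train, test

# Hypothetical file names for two species
files_by_species = {
    "species_a": [f"a{i}.mp3" for i in range(9)],
    "species_b": [f"b{i}.mp3" for i in range(6)],
}
train, test = split_and_mask(files_by_species)
print(len(train), len(test))
```

The species and source fields stay private; only the masked IDs (and, for the training set, the labels) would be shared with participants.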

Many of these steps are the same when approaching a data science project for your own work. It’s still important to clean and check the data before launching into the modelling process, and masking is useful if you’ll need to share results or experiments without necessarily sharing all your secret info.

Step 3: Getting ready for launch

Aside from getting the data ready, there are all sorts of extra little steps required to arrive at something you’re happy to share with the world. An incomplete list of TODOs for our latest launch:

  • Decide on a scoring metric. This will be informed by the type of problem you’re giving to participants. In this case, we were torn between accuracy and log loss, and ended up going with the latter. For other cases (eg imbalanced data), there are a host of metrics. Here’s a guide:
  • Put together an introduction and data description. What problem are we solving? What does the solution need to do? What does the training data look like? This will likely involve making some visualizations, doing a bit of research, finding some cool images to go with your topic…
  • Social media. This isn’t part of my job, but I gather that there is all sorts of planning for how to let people know about the cool new thing we’re putting out into the world 🙂
  • Tutorials. Not essential, but I feel that giving participants a way to get started lowers the barriers to entry and helps to get more novices into the field. Which is why, as is becoming my habit, I put together a starter notebook to share as soon as the contest launches.
A confusion matrix – one way to quickly see how well a classification algorithm is working. (from the starter notebook)
  • Baseline/benchmark. This is something I like to do as early as possible in the process. I’ll grab the data, do the minimal cleaning required, run it through some of my favorite models and see how things go. This is nice in that it gives us an idea of what a ‘good’ score is, and whether the challenge is even doable. When a client is involved, this is especially useful for convincing them that a competition is a good idea – if I can get something that’s almost good enough, imagine what hundreds of people working for prize money will come up with! If there’s interest in my approach for a quick baseline, let me know and I may do a post about it.
  • Names, cover images, did you check the data???, looking at cool birds, teaser posts on twitter, frantic scrambles to upload files on bad internet, overlaying a sonogram on one of my bird photos… All sorts of fun 🙂

Fine-tuning the benchmark model

I could add lots more. I’ve worked on quite a few contests with the Zindi team, but usually I’m just part of the data cleaning and modelling steps. I’ve had such a ball moving this one from start to finish alongside the rest of the team, and I really appreciate all the hard work they do to keep us DS peeps entertained!

Try it yourself!

I hope this has been interesting. As I said, this whole process has been a blast. So if you’re sitting on some data, or know of a cool dataset, why not reach out and host a competition? You might even convince them to let you name it something almost as fun as ‘Fowl Escapades’. 🙂

Zindi UberCT Part 3: Uber Movement

Uber Movement has launched in Cape Town

Today, Uber Movement launched in Cape Town. This is good news, since it means more data we can use in the ongoing Zindi competition I’ve been writing about! In this post we’ll look at how to get the data from Uber, and then we’ll add it to the model from Part 2 and see if it has allowed us to make better predictions. Unlike the previous posts, I won’t be sharing a full notebook to accompany this post – you’ll have to do the work yourself. That said, if anyone is having difficulties with anything mentioned here, feel free to reach out and I’ll try to help. So, let’s get going!

Getting the data

My rough travel ‘zones’

Zindi provided some aggregated data from Uber Movement at the start of the competition. This allows you to get the average travel time for a route, but not to see the daily travel times (it’s broken down by quarter). But on the Uber Movement site, you can specify a start and end location and get up to three months of daily average travel times. This is what we’ll be using.

Using sophisticated mapping software (see above), I planned 7 routes that would cover most of the road segments. For each route, I chose a start and end zone in the Uber Movement interface (see table above) and then I downloaded the data. To do it manually would have taken ages, and I’m lazy, so I automated the process using pyautogui, but you could also just resign yourself to a few hours of clicking away and get everything you need. More routes here would have meant better data, but this seemed enough to give me a rough traffic proxy.

Some of the travel times data

I manually tagged each segment with the equivalent Uber Movement trip I would be using to quantify traffic in that area, using QGIS. This let me link this ‘zone id’ from the segments shapefile to my main training data, and subsequently merge in the Uber Movement travel times based on zone id and datetime.
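The merge itself is just a lookup on (zone id, date). Here is a minimal sketch using plain dictionaries; the field names and values are invented for illustration, and with pandas you would do the same thing with a merge on the two key columns.

```python
def merge_travel_times(segment_rows, travel_times):
    """Attach the daily average travel time for each row's zone and date."""
    merged = []
    for row in segment_rows:
        # Travel times are daily, so match on the date part of the timestamp
        key = (row["zone_id"], row["datetime"][:10])
        merged.append({**row, "travel_time": travel_times.get(key)})
    return merged

# Hypothetical training rows and Uber Movement daily averages (seconds)
segment_rows = [
    {"segment_id": "S1", "zone_id": 3, "datetime": "2018-10-01 08:00"},
    {"segment_id": "S2", "zone_id": 5, "datetime": "2018-10-01 09:00"},
    {"segment_id": "S1", "zone_id": 3, "datetime": "2018-10-02 08:00"},
]
travel_times = {(3, "2018-10-01"): 620.0, (5, "2018-10-01"): 410.5}

merged = merge_travel_times(segment_rows, travel_times)
print(merged)
```

Rows with no matching travel time get `None`, which you would then need to fill or drop before modelling.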

Does it work?

Score (y axis) vs threshold for predicting a 1. In my case, a threshold of ~0.35 was good.

In the previous post, the F1 score on my test set was about 0.082. This time around, without anything changed except the addition of the Uber data, the score rises above 0.09. Zindi score: 0.0897. This is better than an equivalent model did without the Uber Movement data, but it’s still not quite at the top – for that a little more tweaking will be needed 🙂
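Picking the threshold for predicting a 1 is a simple sweep: try a range of cut-offs and keep the one with the best F1 score. The sketch below uses made-up labels and probabilities purely to show the mechanics.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, thresholds):
    """Return (best_f1, threshold); ties go to the higher threshold."""
    scored = [(f1_score(y_true, [int(p >= t) for p in probs]), t)
              for t in thresholds]
    return max(scored)

# Invented validation labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.6]
best_f1, threshold = best_threshold(y_true, probs, [0.2, 0.35, 0.5])
print(best_f1, threshold)
```

Do the sweep on held-out data rather than the training set, otherwise the chosen threshold will look better than it really is.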

I’m sorry that this post is shorter than the others – it was written entirely in the time I spent waiting for data to load or models to fit, and is more of a show-and-tell than a tutorial. That said, I hope that I have achieved my main goal: showing that the Uber Movement data is a VERY useful input for this challenge, and giving a hint or two about where to start playing with it.

(PS: This model STILL ignores all of the SANRAL data. Steal these ideas and add that in, and you’re in for a treat. If you do this, please let me know? Good luck!)

Mapping Change in Cropland in Zimbabwe (Part 2)

In Part 1, I showed a quick way to get a model that predicts cropland extent, using someone else’s model as a starting point. This was a fun exercise, but in today’s post I’d like to show a more conventional approach to achieve the same goal, and then use that to track change in land cover over time within a region.

Training Data

This time, we’ll generate training data manually. For convenience, I’m changing the goalposts slightly: in this post, we’ll be making a simple model to distinguish between open land (fields, grassland, bare earth) and woodland. In the area of interest, this pretty much covers all the bases. Collecting data is a simple but laborious process – examples of each class are outlined in Google Earth Engine and saved as two separate FeatureCollections:

Some open areas (red) and woodland (green) manually outlined for training.


We’ve covered modelling in GEE before, so I won’t go into details here. Sentinel 2 imagery is used, and I pretty much followed the docs to create a classifier and then apply it to the input image over the whole area. The model is fairly accurate, and a quick visual double-check confirms that it’s doing a good job of masking the open areas:

Open area masked (left) vs input image (right)

Change over time

By choosing fields and wooded areas for training that have been present for decades, we can use the same training data to build models on imagery from different years. To track change in open land area, we can make a prediction for each year and sum the area that is classified as ‘open land’ with the following code snippet:

Getting the total area classified as open land over an ROI (ward 8)
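The original snippet ran inside Earth Engine and isn’t reproduced here. As a stand-in, this plain-Python sketch shows the underlying arithmetic on a tiny made-up classified grid: count the pixels labelled as open land and multiply by the area of one pixel. The class value of 1 for ‘open land’ and the 10 m pixel size are assumptions for the example.

```python
def open_land_area_ha(classified, pixel_size_m=10.0, open_value=1):
    """Total area (hectares) of pixels classified as open land."""
    n_open = sum(row.count(open_value) for row in classified)
    area_m2 = n_open * pixel_size_m ** 2
    return area_m2 / 10_000  # 1 hectare = 10,000 square metres

# Made-up classification for one year: 1 = open land, 0 = woodland
classified_year = [
    [1, 0, 1],
    [1, 1, 0],
]
area = open_land_area_ha(classified_year)
print(area)
```

Running the same function over the classified image for each year gives the series to plot. With coarser imagery, pass the matching pixel size (for example 30 for 30 m pixels).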

For my ROI, the total open land trends steadily upwards. For dates earlier than 2015, I used Landsat 7 imagery as the input. From 2015 to 2018, Sentinel 2 Imagery was used as well as Landsat for comparison. In some years (2010 and 2018/19) there were enough cloudy images that I combined two years for the estimate. Some of the Landsat 7 imagery isn’t the best quality, and there are some issues with this approach that mean I wouldn’t trust the figures to be incredibly accurate. BUT, we’ve accomplished our goal: the figures show the change in land cover over time:


I hope this inspires you to try something like this for yourself, in an area that you’re interested in. I don’t think I’ll come back to this topic, although I’ll keep working on this project to turn it into something reliable (adding more training data, properly assessing accuracy, incorporating ground-truth data to verify etc etc). This post also marks the end of the Pioneer project mentioned here. My posting schedule will likely slow down, and you can expect some more diverse posts in the near future. Stay tuned!

Data Glimpse: Cropland and Settlement maps from QED.AI

The point of this Data Glimpse post is to feature a wonderful yet badly publicized data source: the QED.AI maps. Using crowd-sourced data, they built really accurate maps of fields and settlements for the whole of Africa. They also make related spatial layers available (Enhanced Vegetation Index for different years, soil metrics etc). Their focus is “data systems and AI for health and agriculture”. The soil maps draw heavily on the AfSIS project, which makes the data from thousands of soil samples available.

The interface showing cropland probability

The QED maps interface makes it really easy to download all the available maps at 1km resolution. I’m not going to do any further analysis in this post – these maps are useful without modification, and it was really interesting for me to see the distribution of agriculture in Africa. The cropland probability map will be making an appearance in the next post.

Data Glimpse: Nighttime Lights

This ‘Data Glimpse’ post will look at the Global Radiance-Calibrated Nighttime Lights dataset [1], available through Google Earth Engine. However, the method shown here can be used with any Raster data source. To avoid repetition, I’ll refer back to this post any time I aggregate raster data over a shapefile.

The Data

The dataset is a collection of images from different years showing nighttime lights all over the globe. This information can be used to see where people are [2] and estimate measures such as economic activity in an area [3]. Nighttime lights have also been used in some great research estimating the Global Human Footprint and highlighting the last wild places on earth [4].

Nighttime lights displayed in GEE

Each image contains two bands: ‘avg_vis’, which is the measure of illumination, and ‘cf_cvg’ describing cloud cover (used as a data quality metric).

Aggregating the Data by Region

Instead of a large raster image, we might want to aggregate the data by region. For example, we might want to look at how the amount of light visible at night in National Parks has changed over time. To get the data in the form that we want, we first need to define the regions that we’re interested in. This script that I made to illustrate the idea uses a landuse map of Zimbabwe as an example, but one could just as easily use Country outlines or draw a region with the ‘Draw a shape’ tool in GEE.

With the input region(s) defined, the key step is to use the reduceRegions function to add properties to each feature (area) that summarize the underlying raster. For example, with an image of nighttime illumination in the year 2000 called ‘lights_2000’ and the landuses map, we can add the mean illumination in each area with var landuse_with_lights = lights_2000.reduceRegions(landuses, ee.Reducer.mean());. The result can be exported as a shapefile or CSV file (see the script for details) and displayed or analysed in whatever software you like.
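
If it helps to see what a mean reducer over regions is doing conceptually, here is a small NumPy analogue. It is not the Earth Engine API: it assumes the regions have already been rasterised into an integer zone id per pixel, and the function name `zonal_mean` and the toy arrays are made up for illustration.

```python
import numpy as np

def zonal_mean(values, zones):
    """Mean of `values` within each zone id in `zones`.

    `zones` must be a non-negative integer array with the same shape
    as `values`. Returns a dict {zone_id: mean}.
    """
    flat_zones = zones.ravel()
    sums = np.bincount(flat_zones, weights=values.ravel())
    counts = np.bincount(flat_zones)
    # Zone ids that never occur give 0/0; silence the warning, skip them below
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
    return {int(z): float(means[z]) for z in np.unique(flat_zones)}

# Toy 'lights' raster and a matching land use raster (0 = park, 1 = town)
lights = np.array([[0.0, 2.0, 4.0],
                   [0.0, 6.0, 8.0]])
landuse = np.array([[0, 1, 1],
                    [0, 1, 1]])

zone_means = zonal_mean(lights, landuse)
```

In GEE the regions stay as vector features and the pixel-to-region matching happens on the server, but the summary attached to each feature is the same kind of per-region statistic.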

Average nighttime illumination over Zimbabwe

Change over Time

One of the nice things about this dataset is that it contains values for several different years. I took a look at the data from 2000 and 2010, with the goal of seeing if protected areas (forest lands, national parks etc) had seen an increase in nighttime lights (an indicator that people are moving into these areas). Most protected areas in Zimbabwe had almost no nighttime lights recorded, and those that did show (on average) a drop in the amount of nighttime lights (2010 values are ~20% lower than those for 2000). In the few places where lights had increased, the increase seems to be due to safari camps rather than encroachment from neighboring districts. The data can’t tell the whole story, and poor coverage plus the relative dimness of firelight might mean that some encroachment is missed, but it was encouraging to see that the wilderness areas are still largely dark and empty – just the way they should be.


[1] –
[2] – Elvidge, C.D., Imhoff, M.L., Baugh, K.E., Hobson, V.R., Nelson, I., Safran, J., Dietz, J.B. and Tuttle, B.T., 2001. Night-time lights of the world: 1994–1995. ISPRS Journal of Photogrammetry and Remote Sensing, 56(2), pp.81-99.
[3] – Wu, J., Wang, Z., Li, W. and Peng, J., 2013. Exploring factors affecting the relationship between light consumption and GDP based on DMSP/OLS nighttime satellite imagery. Remote Sensing of Environment, 134, pp.111-119.
[4] – Sanderson, E.W., Jaiteh, M., Levy, M.A., Redford, K.H., Wannebo, A.V. and Woolmer, G., 2002. The human footprint and the last of the wild: the human footprint is a global map of human influence on the land surface, which suggests that human beings are stewards of nature, whether we like it or not. BioScience, 52(10), pp.891-904.

Data Glimpse: South Africa’s Hydrological Data

South Africa’s Department of Water Affairs (DWA) makes all kinds of data publicly available through their data portal. The download interface is a little clunky, but simple once you get the hang of it. This short post will take a look at some typical data, and list some of the ways this could be used in the future.

The DWA website, after selecting ‘Verified data’.

Most of the data comes from monitoring stations, each of which is assigned a unique ID. The easiest way to find stations in your area of interest is via the ‘Station Catalogue’ link visible in the above screenshot. Stations typically measure depth in a dam or river.

With a station chosen, the next step is to specify the date range and type of data you’d like to download. The available dates and information are listed in the Station Catalogue. I picked a station in the Pongola river system, and saved the data file generated by the website as ‘daily_flows.txt’. This is a text file with variables separated by whitespace, and can be loaded into a pandas dataframe for analysis as follows:

Loading the data.
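
For readers following along in text form, a minimal sketch of that loading step might look like the code below. The column layout is an assumption on my part (a YYYYMMDD date followed by a flow value, with no header lines), and the names `load_daily_flows`, `date` and `flow` are hypothetical, so check them against the file you actually download.

```python
import io
import pandas as pd

def load_daily_flows(source):
    """Read a whitespace-separated flow file into a date-indexed DataFrame.

    Assumes two columns per line: date as YYYYMMDD, then the flow value.
    """
    df = pd.read_csv(
        source,
        sep=r"\s+",               # any run of spaces or tabs separates fields
        names=["date", "flow"],   # hypothetical column names
        dtype={"date": str},      # keep dates as text until parsed explicitly
    )
    df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
    return df.set_index("date")

# Stand-in for 'daily_flows.txt' so the example runs without the download
sample = io.StringIO("20180101 12.5\n20180102 12.5\n20180103 9.0\n")
flows = load_daily_flows(sample)
```

With the real file you would pass the path instead, e.g. `load_daily_flows("daily_flows.txt")`, and probably add `skiprows` or extra column names to match whatever header and quality fields the download contains.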

With the data thus loaded, it’s fairly easy to plot the flow over a given year, or calculate monthly averages. Here’s a plot showing the daily flow rate out of Jozini dam in 2018. Note that the graph has many flat areas – this is because this is a managed flow, with the amount of water released from the dam regulated by local authorities (somewhat badly, in this case [2]).

A plot of the daily flow rate.
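
The monthly averages mentioned above are a one-liner once the data has a date index. Here is a small sketch using a synthetic, step-like series (invented values, not the Jozini data) that also counts how many days the flow was unchanged from the day before, which is one crude way to quantify those flat stretches:

```python
import pandas as pd

# Synthetic managed flow: constant in January, then a lower constant in February
dates = pd.date_range("2018-01-01", "2018-02-28", freq="D")
flow = pd.Series([10.0] * 31 + [4.0] * 28, index=dates, name="flow")

# Monthly averages, labelled by the first day of each month
monthly_mean = flow.resample("MS").mean()

# Days where the flow is identical to the previous day (the flat areas)
flat_days = int((flow.diff() == 0).sum())
```

Swapping `"MS"` for `"YS"` gives yearly averages on a longer record.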

A notebook showing more plots and an example of re-sampling for yearly averages is available here.

So what can you do with this data? Here are some ideas (let me know if you’d like to see any as future posts):
– Get dam level data for dams all over South Africa and animate the levels over time, to illustrate the recent drought and the (alarming) longer trend.
– Use the data to learn hydrodynamic modelling (see [1])
– Combine with rain data to see how watershed capture has changed with agriculture and land use change
– Look for the change in river flows after new projects (dams, diversions and so on)

I hope you’ve enjoyed this brief glimpse at some fun data. Please let me know if you do something with this, or if you have some data that you’d like featured.

[1] – Birkhead, A.L., Brown, C.A., Joubert, A.R., Singh, A. and Tlou, T., 2018. The Pongola Floodplain, South Africa–Part 1: Two-dimensional hydrodynamic modelling in support of an environmental flows assessment. Water SA, 44(4), pp.731-745.

[2] – Lanyi, Shira. 2018. “Paradise Lost: The Struggle to Preserve the Pongola River and its Inhabitants.” Open Rivers: Rethinking Water, Place & Community, no. 11.

Data Glimpse: Visualizing Economic Activity with the G-Econ Project data

This is the first ‘data glimpse’ – a short exploration of an existing dataset, with code and examples showing some of the ways the data can be used. For today’s glimpse, I’ll be playing with the ‘G-Econ’ dataset [1], as recommended by <jonasmendes> on Pioneer. This dataset looks at economic activity for different locations, as opposed to breaking it down by country. There is data available from 1990, 2000 and 2005, broken down by ‘grid cell’ (a square one degree wide and one degree high).

Economic Activity by Grid Cell – G-Econ data for 1990

Loading the data

The data is shared as a Microsoft Excel worksheet [2]. There are 27,446 rows, and it’s a little overwhelming visually. Spreadsheets aren’t my forte, so my first step was to load the data into a Pandas DataFrame in a Jupyter notebook (available here for anyone who wants to follow along). With the data ready, I set out on the most obvious task: showing the data as a map. A few minutes of StackOverflow later, we have a visual and a GeoTiff file that can be opened in mapping software such as QGIS:
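
The gridding step itself is simple enough to sketch here. This is not the notebook code: it assumes each row identifies its one-degree cell by the latitude and longitude of the cell’s south-west corner, and `rows_to_grid` plus the example values are made up for illustration. Writing the array out as a GeoTiff needs a library such as rasterio or GDAL and is left out.

```python
import numpy as np

def rows_to_grid(rows):
    """Place (lat, lon, value) rows into a 180 x 360 global grid.

    Assumes lat/lon give the south-west corner of a 1-degree cell.
    Row 0 is the northernmost band; cells with no data stay NaN.
    """
    grid = np.full((180, 360), np.nan)
    for lat, lon, value in rows:
        row = 89 - int(np.floor(lat))    # lat 89 -> row 0, lat -90 -> row 179
        col = int(np.floor(lon)) + 180   # lon -180 -> col 0, lon 179 -> col 359
        grid[row, col] = value
    return grid

# Two invented cells: one in southern Africa, one in western Europe
cells = [(-18, 31, 2.5), (51, 0, 40.0)]
grid = rows_to_grid(cells)
```

From there, something like matplotlib’s `imshow` gives the quick visual.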

Asking questions

Because the data is aggregated by location (as opposed to population), it can answer some interesting questions. How does economic output vary with temperature or rainfall? How ‘centralized’ is industry in different regions? What’s the deal with all that $$$ hugging the coastlines? Let’s dig in.

Environmental Factors

First up, the effect of temperature:

Not much gets done where it’s cold, it seems

What about rainfall?

Economic Activity (2000) vs max precipitation (mm rainfall)

And finally, distance to the ocean:

Coasts are the place to be?

It appears that the most productive places are those where people like to be: accessible, not too hot, not too dry but not constantly drenched… A Goldilocks zone for human activity. The data already contains these environmental variables – I highly encourage you to try your own plots, or to read up the more thorough analyses in [1].

Comparing Countries

There are many ways we could compare countries. A bar plot of average economic activity per grid cell, perhaps, or comparison between the most productive single grid cell in each country. I was interested to see which countries had the most spread. The GIF below shows this dramatically: the top few cells in Russia are responsible for a huge chunk of the economic activity, while India has much more of a spread:

Scaled fraction of the total economic activity in four countries.

For the code, see the GitHub repository associated with this post.
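
The repository has the real thing, but the idea behind the comparison fits in a few lines: sort a country’s cells from most to least productive and track the running share of the total. A sketch with invented numbers (the function name `cumulative_share` is hypothetical):

```python
import numpy as np

def cumulative_share(cell_values):
    """Running fraction of the total, taking cells from largest to smallest."""
    ordered = np.sort(np.asarray(cell_values, dtype=float))[::-1]
    return np.cumsum(ordered) / ordered.sum()

# Invented examples: one very concentrated country, one more evenly spread
concentrated = cumulative_share([90, 5, 3, 2])
spread = cumulative_share([30, 25, 25, 20])
```

A curve that shoots up after the first cell or two means a handful of locations account for most of the activity; a curve that climbs slowly means the output is spread across many cells.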


I hope you’ve enjoyed this quick, informal look at a fun dataset. I’m planning on doing more of these ‘Data Glimpse’ posts, since they take less time than a full write-up. The trade-off is that quality is lower, since I’m not going to invest time into perfectly labelled axes, long explanations or extra figures. Let me know what you think about this plan!

[1] – Nordhaus, W., Azam, Q., Corderi, D., Hood, K., Victor, N.M., Mohammed, M., Miltner, A. and Weiss, J., 2006. The G-Econ database on gridded output: Methods and data. Yale University, New Haven, 6.
[2] – (accessed June 2019)