The HerdHover project is a collaboration with Dr. Blair Costelloe and others in the Department of Collective Behaviour. Blair’s initial hope was to use hovering drones, filming herds of animals in the unconfined wild, to recreate the overhead camera views that have been used for years to study fish schools in white tanks in the lab. Most of my Ph.D. has been devoted to making this dream a reality as the technical lead of the project.


Studying collective behavior is hard because there are lots of individuals interacting and influencing each other at the same time. To truly understand the dynamics of a group one must know what all individuals are doing at all times. Traditional methods of studying animal behavior are intractable because one human cannot meaningfully watch 50 fish at the same time, for example, and say how every individual is influencing every other. Therefore, behavior must be well quantified, and, if it is to be widely usable, the process must be automated.

Tools for automatically quantifying the behavior of animal groups in the lab have existed for many years. With full control over the experimental environment, it is possible to create conditions that make it easy to track and describe all individuals over long periods of time. Animals, however, did not evolve in the lab, and a one meter by one meter empty box only approximates nature in the coarsest sense.

To truly understand how animal groups have evolved to behave in both their social and surrounding environments one must study them in nature. That is the goal of the HerdHover project.

In the Field

While the tools we have developed are widely applicable, the project currently focuses on the study of two species of zebra in Laikipia County, Kenya. We use off-the-shelf DJI Phantom drones to film various herds of ungulates while flying at a height of around 80 meters. We use a relay of multiple drones to extend the length of observations beyond the battery life of a single drone: before the first drone comes home, the next drone is already hovering above and filming the same scene. In this way we can film a group for arbitrarily long periods of time. Unlike with fixed cameras, it is easy for us to follow a herd as it moves over the landscape. With these simple field techniques we can be recording a group within five minutes of first seeing it.

In addition to filming the groups, we use a senseFly eBee fixed-wing drone to create a very detailed 3D map (1 pixel ~ a few centimeters) of the landscape in which these groups are observed.

From Videos to Data

Understanding the Challenge

The videos we bring back from the field are far from usable data. As noted above, human brains struggle to understand even a ten-individual complex system in real time, so just watching the videos over and over again is unhelpful. Features of the videos must be extracted and quantified so that computational tools can be used to aid analysis. The challenge, however, is scale: recording just the position of every individual in a single forty-five minute video (30 fps) of twenty individuals yields over 1.6 million location points. Processing this video by hand, even at the aggressive pace of one second per click, would take three tedious months of work. That time only grows as more features are tracked. To be tractable, this data must be extracted automatically.
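The numbers above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the video length, frame rate, group size, and click pace come from the text, while the 8-hour working day and 5-day week are assumptions added here to turn hours into months.

```python
# Back-of-the-envelope annotation cost for one observation.
minutes = 45          # video length (from the text)
fps = 30              # frame rate (from the text)
individuals = 20      # group size (from the text)

frames = minutes * 60 * fps        # frames in the video
points = frames * individuals      # one location point per animal per frame

seconds_per_click = 1              # the "aggressive pace" from the text
hours = points * seconds_per_click / 3600

# Assumption (not from the source): 8-hour days, 5 days per week.
work_days = hours / 8
work_weeks = work_days / 5

print(f"{points:,} points, {hours:.0f} hours, "
      f"{work_days:.2f} working days, {work_weeks:.2f} working weeks")
```

Under these assumptions the clicking alone fills about eleven working weeks, before any breaks or corrections, which is the scale of the three months quoted above.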

Even if tracking were easy, however, knowing where an animal is in a video is very different from knowing where it actually was on the ground. One complication is that both the drone and the animals are moving at the same time. Another complication is that the landscape the animals live on has a lot of local variation in elevation. So, even once we know where all individuals in a video are, if they all move to the left, did they really all move to the left? Or did the drone just move to the right? Are the individuals that are farther apart on the left side of the frame really farther apart than the animals on the right side of the frame? Or are they just on a hill, so that the ground on the left is closer to the drone camera than the ground on the right? From just looking at points moving on the screen, one can’t say. This means that after tracking what we care about in the videos, we also need a way to map pixels in video space to latitude/longitude points in the real environment.

Object Detection and Tracking

We track not only the location of individuals but also nine body posture points on each individual. Tracking the individuals’ locations is a two-step process. First, we run a fine-tuned Faster R-CNN model with a ResNet-101 backbone on frames downsampled to 1080x2048. The model detects four classes: buffalo, gazelle, zebra, and a fourth class for animals that are too close together to detect independently. Second, the regions containing these very close individuals are cropped out of the full-resolution 2160x4096 frame and fed into a RetinaNet model that is trained only to distinguish the locations of close individuals.
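The hand-off between the two detectors depends on mapping a box found in the downsampled frame back to the full-resolution frame. The sketch below shows only that coordinate bookkeeping, not the detectors themselves. The function names, the `pad` margin, and the reading of the frame sizes as height x width are assumptions made for illustration.

```python
import math

def scale_box(box, from_size=(1080, 2048), to_size=(2160, 4096)):
    """Map an (x_min, y_min, x_max, y_max) box between frame resolutions.

    Sizes are (height, width); this ordering is an assumption.
    """
    scale_y = to_size[0] / from_size[0]
    scale_x = to_size[1] / from_size[1]
    x0, y0, x1, y1 = box
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)

def crop_bounds(box, frame_size=(2160, 4096), pad=0):
    """Integer (left, top, right, bottom) crop bounds, clamped to the frame.

    `pad` is a hypothetical margin in pixels added around the box.
    """
    height, width = frame_size
    x0, y0, x1, y1 = box
    left = max(0, math.floor(x0 - pad))
    top = max(0, math.floor(y0 - pad))
    right = min(width, math.ceil(x1 + pad))
    bottom = min(height, math.ceil(y1 + pad))
    return left, top, right, bottom

# A "too close together" detection in downsampled coordinates (made-up values).
cluster_box = (100, 50, 160, 90)
full_res_box = scale_box(cluster_box)
crop = crop_bounds(full_res_box, pad=16)
print(full_res_box, crop)
```

The resulting bounds could then be used to slice the full-resolution frame before passing the crop to the second model.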

We use energy-minimization-based techniques to connect individual detections from one frame to the next, building trajectories of individuals throughout each observation. The result is a set of track segments that can be connected together with a GUI to deal with occlusions and other tracking errors.
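To give a feel for what connecting detections across frames involves, here is a deliberately simplified stand-in: a brute-force search for the pairing of detections that minimizes total squared displacement between two consecutive frames. It is not the energy formulation used in the project, the function name and sample points are hypothetical, and the exhaustive search is only practical for a handful of animals.

```python
import math
from itertools import permutations

def link_detections(prev_pts, next_pts):
    """Pair each point in prev_pts with a distinct point in next_pts.

    Returns (pairs, cost), where pairs is a list of (prev_index, next_index)
    and cost is the minimal total squared distance. Brute force: tries every
    possible pairing, so it is only suitable for small groups.
    """
    if len(prev_pts) > len(next_pts):
        raise ValueError("expected at least as many next points as previous points")
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(next_pts)), len(prev_pts)):
        cost = sum(math.dist(prev_pts[i], next_pts[j]) ** 2
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost

# Three animals in frame t, and the same three (in shuffled order) in frame t+1.
frame_t = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
frame_t1 = [(10.5, 0.2), (0.3, -0.1), (5.2, 5.4)]
pairs, cost = link_detections(frame_t, frame_t1)
print(pairs, round(cost, 2))
```

A real tracker also has to handle missed detections, false positives, and animals entering or leaving the frame, which is where the track segments and the manual GUI step come in.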

Equipped with complete tracks for every individual, we crop the full-resolution frame tightly around each individual’s location and use these crops for posture tracking. Our posture tracking technique was designed by Jake Graving and is described in this paper published in eLife.

Mapping from Video to Earth

Without making certain assumptions, it isn’t possible to directly map from the 2D coordinates of an image to the 3D coordinates of the world. Luckily we know some things about what’s happening in the video. Most importantly, we know that all the animals we see are standing on the ground. Slightly surprisingly, we also benefit from the movement of the drone as we fly over the landscape while following the zebra herds.

We extract spatially overlapping video frames from each observation as the drone moves and use them to create 3D maps of the underlying landscape. The map is geo-referenced with information from the drone’s on-board GPS sensor. This gives us not only a model of the ground on which the animals must be standing but also, for each extracted frame, a camera matrix that projects into this world. We assume each animal stands at the intersection between the ground described in the map and the ray cast from the animal’s location in the camera frame through the camera matrix. Since it is computationally prohibitive to build this map with every frame in the observation, we use local features in the landscape to understand how the drone moves between these extracted “ground truth” frames. We get further precision by manually linking the map generated from the still video frames with the map generated from the dedicated mapping drone that we fly after the observation. The result is average precision well under a meter.
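The ray-and-ground assumption can be illustrated with a small sketch. It takes a ray that has already been expressed in world coordinates (in practice this would come from the camera matrix), marches along it until it drops below the terrain, and then refines the crossing by bisection. The function name, the step sizes, and the toy terrain functions are all assumptions for illustration; the project's actual ground model is the 3D map described above.

```python
def intersect_ray_with_ground(origin, direction, ground_height,
                              max_range=500.0, step=1.0, refine_iters=40):
    """Return the (x, y, z) point where the ray first meets the ground, or None.

    origin, direction: 3-tuples in world coordinates (meters).
    ground_height: hypothetical function (x, y) -> terrain elevation z.
    """
    def point_at(t):
        return tuple(o + t * d for o, d in zip(origin, direction))

    def gap(t):
        # Height of the ray above the terrain at distance t along the ray.
        x, y, z = point_at(t)
        return z - ground_height(x, y)

    t_prev, t = 0.0, step
    while t <= max_range:
        if gap(t) <= 0.0:
            # The crossing lies between t_prev and t: refine by bisection.
            lo, hi = t_prev, t
            for _ in range(refine_iters):
                mid = 0.5 * (lo + hi)
                if gap(mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            return point_at(hi)
        t_prev, t = t, t + step
    return None

# Drone 80 m up, looking forward and down along a unit-length ray.
camera = (0.0, 0.0, 80.0)
ray = (0.6, 0.0, -0.8)

flat = intersect_ray_with_ground(camera, ray, lambda x, y: 0.0)
hill = intersect_ray_with_ground(camera, ray, lambda x, y: 0.1 * x)
print(flat, hill)
```

In this toy setup the same pixel ray lands about 60 m ahead of the drone on flat ground but roughly 4 m closer on the gentle slope, which is why the terrain model matters for recovering true positions.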

Putting this all together, we get not only position but also posture information for all individuals in a group in their natural habitat at a frequency of 30 Hz with high spatial precision.


We are moving into analysis now. To begin, we are working to understand how individuals place themselves in the environment and in the herd. Are you, for instance, more likely to face in the same direction as your near neighbors, or in a different direction so that as a group you see more of the environment?