There are a lot of insects in the world. Like a lot a lot. Recent estimates put the number of insect species at a whopping 5.5 million. Only about a million have names, leaving the other 4.5 million or so species undescribed. To put this into context, there are 8,053 described amphibian species, and the projected number of undescribed amphibian species remains in the thousands. While I consider all life to be equally important (and I’m particularly fond of amphibians; I mean, just look at these), the gap in our understanding of insect diversity is especially concerning. Insects are important for agriculture, flowering plants, decomposition, and a host of other functions. Without a firm understanding of existing insect diversity, scientists can’t accurately determine the causes of rapid declines in insect populations across the globe.
Needless to say, describing 4.5 million species of insect is an onerous task. That’s why, as part of my PhD dissertation, I’m devising a new metric of diversity that skirts around this problem. Although I would love to go into excruciating detail about what it is, I’ll leave that up to my dissertation chapter. Briefly, I use DNA sequences from insects collected by researchers from across the globe to summarize the genetic diversity of insect communities, rather than their species diversity. This makes data collection much more feasible across a broad scale. However, exactly what it tells us about insect communities is still uncertain.
To remedy this, I’m using machine learning. Machine learning, the part-statistics, part-algorithmic wizardry that helps Snapchat turn your face into an alien and Facebook recommend you custom underwear, can also be used to answer important scientific questions. In a biodiversity context, machine learning can help clarify what about the environment, whether that is climate, topography, human-mediated habitat loss, etc., best predicts a selected metric of diversity. In my project planning stages, I decided to use this approach to assess the relationships my new metric has with other global-scale processes. However, I only had a rudimentary knowledge of the capabilities of a machine learning approach.
That’s why I decided to take the two-day Applied Machine Learning in R workshop in Austin, Texas. The workshop’s goal was to “step through the process of building, visualizing, testing, and comparing models that are based on prediction.” This was exactly what I needed. Unfortunately, since it was part of rstudio::conf 2019, an industry conference for tech professionals, it was prohibitively expensive. Luckily, I was able to secure a Provost’s Digital Innovation Training Grant, which covered the costly tuition. As a University of Texas at Austin alumnus, I easily found a couch to crash on, which cut expenses significantly. The prime location and generous help from the GC Digital Initiatives paved my way to the workshop.
In between eating street tacos and taking dips at Barton Springs Pool, I learned the basics of performing machine learning analyses in R, a programming language centered around statistical analysis. Going in, I had a strong background in R, but I did not have a working knowledge of machine learning. The workshop’s exercises and strong leadership gave me more confidence in using the machine learning techniques I was already familiar with and introduced me to algorithms I had never heard of. It was one of these new algorithms that I decided to use for my dissertation chapter. For the statistically and astronomically inclined, I’ll point out that it’s named MARS (short for multivariate adaptive regression splines). It strikes the balance between predictive power and explanatory ease that I am aiming for in my project.
After some months of pulling my hair out, working through the kinks of how to wrangle 4.7 million insect DNA sequences, I was able to use my new machine learning skills to perform preliminary analyses of my data. The results, which I presented as a poster at the Evolution conference in Providence, Rhode Island, point toward my metric being a useful summary of insect biodiversity. I received plenty of positive feedback, which I am currently using to finalize my analyses and publish this work in a professional journal. Initially, I thought this project would take up a large chunk of my PhD, but with the help of the Applied Machine Learning in R workshop and GC Digital Initiatives, I was able to jumpstart my analyses and will put out a stronger product, sooner. In due course, I hope my research leads to a stronger understanding of insect diversity and how we humans are impacting it.