Chapter 6 Wildlife Bioacoustics Detection
—Zhongqi Miao
Executive Summary
Automatically detecting sound events with artificial intelligence has become increasingly popular in bioacoustics, particularly for wildlife monitoring and conservation. Conventional methods predominantly use supervised learning, which depends on substantial amounts of manually annotated bioacoustic data. Manual annotation, however, is tremendously resource-intensive in both human labor and cost, and it requires considerable domain expertise. This expertise requirement undermines the validity of crowdsourced annotation platforms such as Amazon Mechanical Turk. Additionally, the supervised learning framework restricts the application scope to predefined categories in closed-set settings.
To address these challenges, we developed an approach that leverages a multi-modal contrastive learning technique called Contrastive Language-Audio Pre-training (CLAP). CLAP allows classes to be defined flexibly at inference time through descriptive text prompts and can perform zero-shot transfer on previously unencountered datasets. We found that, without fine-tuning or additional training, an out-of-the-box CLAP model could generalize effectively across nine bioacoustics benchmarks covering a wide variety of sounds unfamiliar to the model. We showed that CLAP achieved comparable, if not superior, ...
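To illustrate the zero-shot mechanism described above, the sketch below shows how CLAP-style inference reduces to nearest-neighbor matching in a shared embedding space: the audio clip and each candidate text prompt are embedded, and the prediction is the prompt with the highest cosine similarity. The toy embeddings and label names here are illustrative assumptions; in practice the vectors would come from CLAP's audio and text encoders.

```python
import numpy as np

def zero_shot_classify(audio_emb, text_embs, labels):
    """Pick the label whose text-prompt embedding is most similar
    to the audio-clip embedding (CLAP-style zero-shot inference)."""
    # L2-normalize so the dot product equals cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = t @ a
    return labels[int(np.argmax(scores))], scores

# Toy embeddings standing in for CLAP encoder outputs (hypothetical values)
rng = np.random.default_rng(0)
clip = rng.normal(size=8)                      # "audio" embedding of a bird call
labels = ["bird call", "frog croak", "rain"]
text_embs = np.stack([
    clip + 0.1 * rng.normal(size=8),           # prompt close to the clip
    rng.normal(size=8),                        # unrelated prompts
    rng.normal(size=8),
])

pred, scores = zero_shot_classify(clip, text_embs, labels)
print(pred)
```

Because no classifier head is trained, swapping in a new species only requires writing a new text prompt, which is what makes the class set flexible at inference time.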