Animal Sound Classifier

Classify 10 animal sounds (dog, cat, cow, frog, pig, hen, rooster, sheep, crow, insects) from a 5-second .wav clip. Compares a classical Random Forest baseline against a small CNN trained on log-mel spectrograms.

Built as a learning project and a portfolio piece — real human labels, raw audio in, and an honest side-by-side comparison of a classical and a deep model on the same held-out fold.

What this teaches

How audio becomes a feature vector (MFCCs) or an image (mel-spectrogram).
Why a classical baseline matters before reaching for deep learning.
What a fair held-out evaluation looks like (ESC-50’s pre-defined folds).
Where the two model families actually differ — and where they don’t.
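To make the first point concrete, here is a minimal sketch of how a variable-length MFCC matrix can be collapsed into a fixed-length vector. It assumes 20 coefficients, so that mean plus standard deviation gives the 40 dimensions listed in the results; the actual coefficient count in features.py may differ. The function name and the toy matrix are made up for illustration.

```python
import statistics

def summarize_mfcc(mfcc):
    """Collapse an (n_coeffs x n_frames) MFCC matrix into one fixed-length vector.

    Each row holds one coefficient tracked across time frames. Taking the mean
    and standard deviation of every row discards timing but gives a vector
    whose length does not depend on how many frames the clip has.
    """
    means = [statistics.fmean(row) for row in mfcc]
    stds = [statistics.pstdev(row) for row in mfcc]
    return means + stds

# Toy stand-in for real MFCCs: 20 coefficients x 5 frames of made-up values.
toy_mfcc = [[float(c + t) for t in range(5)] for c in range(20)]
vector = summarize_mfcc(toy_mfcc)
print(len(vector))  # 40
```

The spectrogram route skips this summary step and keeps the full time-frequency grid, which is what lets a CNN see temporal patterns that the mean and standard deviation throw away.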

Results

Model	Features	Test accuracy (fold 5)
Random Forest	40-d MFCC summary (mean + std)	63.8%
Small CNN	128-bin log-mel spectrogram	66.3%

The headline: with only 320 training clips for 10 classes, the CNN barely edges out the Random Forest. The 2.5-point gap amounts to two clips on the 80-clip test fold (53 correct versus 51), well within what a single small split can produce by chance. That’s the lesson: deep learning doesn’t automatically win on small datasets, and a strong classical baseline keeps you honest. The gap would likely widen with data augmentation, a larger backbone, or pretraining; see “What I’d do next” below.
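A quick back-of-the-envelope check of how small that gap is. The test-fold size of 80 is derived here from 10 classes × 40 clips split over five folds; the standard-error formula is the usual binomial approximation, used as a rough guide and not as a formal significance test.

```python
import math

# 10 classes x 40 clips each, with one of five folds held out for testing.
N_TEST = 10 * 40 // 5  # 80

def correct_count(accuracy_pct, n=N_TEST):
    """Number of correctly classified clips implied by a reported percentage."""
    return round(accuracy_pct / 100 * n)

def standard_error(accuracy_pct, n=N_TEST):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)

rf_correct = correct_count(63.8)
cnn_correct = correct_count(66.3)
print(rf_correct, cnn_correct)            # 51 53
print(f"{standard_error(63.8):.1f}")      # 5.4
```

With a standard error of roughly five points on each number, a 2.5-point difference does not separate the two models.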

The off-diagonal of each confusion matrix is where the story is: hen/rooster and cow/sheep are the consistent mix-ups, which makes acoustic sense.

[Figures: confusion matrices for the Random Forest (left) and the CNN (right)]
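One way to pull the most frequent mix-ups out of a set of predictions, sketched in plain Python. The labels below are invented purely to show the mechanics and are not the project's actual predictions; `top_confusions` is a hypothetical helper, not something train.py is known to contain.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs where the model was wrong.

    These pairs are the off-diagonal cells of the confusion matrix.
    """
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

# Made-up labels for illustration only.
y_true = ["hen", "hen", "hen", "rooster", "cow", "cow", "sheep", "dog"]
y_pred = ["rooster", "rooster", "hen", "hen", "sheep", "cow", "cow", "dog"]

for (true_label, predicted), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {predicted}: {count}")
```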

Quickstart

# 1. install
python -m venv .venv
.\.venv\Scripts\Activate.ps1    # Windows PowerShell; on macOS/Linux use: source .venv/bin/activate
pip install -r requirements.txt

# 2. one-time data download (~600 MB)
python download_data.py

# 3. train both models (CPU works; CNN takes a few minutes)
python train.py

# 4a. predict from the command line
python predict.py data/animals/1-30226-A-0.wav

# 4b. or run the demo app
streamlit run app.py

Repo layout

sound-classifier/
├── download_data.py    # pulls ESC-50, filters to animals
├── features.py         # MFCC summary + log-mel spectrogram
├── train.py            # trains RF + CNN, writes results.json + figures/
├── predict.py          # CLI inference
├── app.py              # Streamlit demo
├── notebook.py         # cell-marked exploration walkthrough
├── requirements.txt
└── README.md

Approach notes

Why ESC-50 animals. Real human labels (no clustering tricks), balanced classes (40 clips per class), and built-in 5-fold cross-validation. Small enough to train on a laptop, well-known enough to compare against published benchmarks.

Why both models. A Random Forest on hand-crafted features is the cheapest thing that could possibly work — if a CNN can’t beat it, you have a problem. The comparison itself is the lesson.

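Train/test split. Folds 1–4 train, fold 5 tests. No clip ever appears in both, and ESC-50 keeps fragments cut from the same source recording in a single fold, so near-duplicates don’t leak across the split. Note that the standard ESC-50 protocol averages accuracy over all five folds; the numbers here come from the fold-5 split only.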
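ESC-50 encodes the fold in the first field of each filename ({fold}-{clip_id}-{take}-{target}.wav), so the split can be recovered from names alone. A minimal sketch follows; the helper names are hypothetical, and apart from the clip used in the Quickstart the paths are made up.

```python
from pathlib import Path

def fold_of(path):
    """Return the ESC-50 fold number encoded as the first filename field."""
    return int(Path(path).name.split("-")[0])

def split_by_fold(paths, test_fold=5):
    """Hold out one fold for testing and train on every other fold."""
    train = [p for p in paths if fold_of(p) != test_fold]
    test = [p for p in paths if fold_of(p) == test_fold]
    return train, test

paths = [
    "data/animals/1-30226-A-0.wav",  # the Quickstart example clip (fold 1)
    "data/animals/3-0001-A-5.wav",   # made-up filename
    "data/animals/5-0002-B-1.wav",   # made-up filename
]
train_paths, test_paths = split_by_fold(paths)
print(len(train_paths), len(test_paths))  # 2 1
```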

Per-clip normalization for the CNN. Each spectrogram is z-scored so the model can’t cheat by reading recording loudness instead of acoustic content.
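Why this works: on a log scale, turning the gain up adds a constant to every bin, and subtracting the clip's own mean removes that constant. A dependency-free sketch, assuming the statistics are taken over the whole spectrogram (all bins and frames); the `eps` guard and the function name are my additions, and the project itself may normalize differently.

```python
import statistics

def zscore_clip(spec, eps=1e-6):
    """Normalize one spectrogram (a list of rows) to zero mean and unit variance."""
    values = [v for row in spec for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # eps avoids division by zero on a perfectly flat (e.g. silent) clip.
    return [[(v - mean) / (std + eps) for v in row] for row in spec]

# Toy log-mel values in dB; "loud" is the same clip recorded 20 dB hotter.
quiet = [[-60.0, -50.0], [-55.0, -45.0]]
loud = [[v + 20.0 for v in row] for row in quiet]

same = all(
    abs(a - b) < 1e-9
    for row_q, row_l in zip(zscore_clip(quiet), zscore_clip(loud))
    for a, b in zip(row_q, row_l)
)
print(same)  # True
```

After normalization the two versions are indistinguishable, so overall loudness carries no signal the CNN could latch onto.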

What I’d do next

Cross-dataset eval: pull a handful of animal clips from Freesound and measure the accuracy drop. That gap is the distribution-shift story.
Data augmentation: time-shift, pitch-shift, and SpecAugment on the spectrograms — usually worth a few points for the CNN.
A pretrained audio backbone (YAMNet or a small AST) — the obvious next step once the baseline is solid.
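As a rough idea of what the augmentation step could look like, here is a sketch of two of the transforms named above: a circular time-shift on the waveform and a SpecAugment-style frequency mask on the spectrogram. Function names, parameters, and the fill value of 0.0 (the mean after per-clip z-scoring) are assumptions for illustration. Pitch-shift is left out because it needs a resampling library.

```python
import random

def time_shift(samples, max_shift, rng):
    """Circularly shift a waveform by a random offset in [-max_shift, max_shift]."""
    if not samples:
        return []
    shift = rng.randint(-max_shift, max_shift) % len(samples)
    # Samples pushed off one end wrap around to the other.
    return samples[-shift:] + samples[:-shift] if shift else list(samples)

def freq_mask(spec, max_width, rng, fill=0.0):
    """Blank a random band of consecutive mel bins (rows), SpecAugment-style."""
    width = rng.randint(0, min(max_width, len(spec)))
    start = rng.randint(0, len(spec) - width)
    return [
        [fill] * len(row) if start <= i < start + width else list(row)
        for i, row in enumerate(spec)
    ]

rng = random.Random(0)  # seeded so the example is repeatable
wave = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = time_shift(wave, max_shift=2, rng=rng)

spec = [[1.0, 2.0, 3.0] for _ in range(8)]  # toy 8-bin x 3-frame spectrogram
masked = freq_mask(spec, max_width=3, rng=rng)
```

Both transforms would be applied to training clips only, with fresh random draws each epoch, so the test fold stays untouched.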

Credits

Dataset: ESC-50 by Karol J. Piczak (CC BY-NC).