This blog post details my second project completed while studying at
Metis. The code for this project can be found
here.
Project Overview
The guidelines for this project were as follows:
Create a SQL database to store all tabular data. Make queries from this database to access data while performing
analysis and modeling.
Choose a project objective that requires use of supervised classification algorithms. Experiment with Random Forest,
Logistic Regression, XGBoost, K-Nearest Neighbors, as well as ensembling any combination of models.
Deploy the model in a Flask application or other
interactive visualization.
It took me a bit longer than expected to decide on a project. I found myself stuck in a mental loop:
Search for interesting datasets.
Find something promising that would be a suitable binary or multi-classification problem.
Realize that the dataset had been downloaded by thousands of people, has been used in large
Kaggle competitions, etc.
Self-doubt. (“Is my project original enough? Will it stand out?”)
Repeat.
Eventually, I convinced myself that as long as I picked something of interest to me and did a good job of applying
what I had learned, it could be a successful project…but in all honesty, it was the time limit that forced the decision.
This experience taught me the balance between thoughtfully planning a project and efficiently delivering a useful MVP
on a deadline. It's better to finish a simple but useful project than to come up empty-handed with an overly complex
or over-perfected one.
My project objective was to classify spoken audio of speakers from six countries as American or not-American. (This
was the result of narrowing the scope of my initial project objective, as explained in Obstacles below.)
The Dataset
I used the Speaker Accent Recognition Dataset
from the UCI Machine Learning Repository. This dataset
includes data extracted from over 300 audio recordings of speakers from six different countries. Half of the data contains
speakers from the United States, and the other half is divided among Spain, France, Germany, Italy, and the United Kingdom.
(The original paper is referenced in References.)
The data was easy to acquire (a simple .csv download). However, in order to add a learning element to the project, I
created a SQL database on an AWS EC2 instance to store the data and access it remotely.
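For anyone curious, the pattern was roughly the following. This is a minimal sketch, assuming a PostgreSQL instance on the EC2 box; the connection string, file name, and table name are placeholders rather than my actual setup.

```python
# Rough sketch: push the accent CSV into a remote database, then query it back.
# The connection string, file name, and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@my-ec2-host:5432/accents")

df = pd.read_csv("accent-mfcc-data.csv")
df.to_sql("accent_features", engine, if_exists="replace", index=False)

# Later, pull the table back into a DataFrame for EDA and modeling
data = pd.read_sql("SELECT * FROM accent_features", engine)
```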
Obstacles
Modifying Initial Scope
As with my last project, I had to
iteratively narrow the scope. Initially, I intended to build a multiclass classifier
for all of the accents present in the dataset, but the data was too limited. Not only was the total size of the
dataset particularly small, but this was further compounded by the class imbalance between American accents and all
other accents. Unsurprisingly, I pivoted towards creating a binary classifier to distinguish between American and
all other accents.
Inability to Reproduce Features
I had hoped to increase the sample size for each non-American accent by transforming audio from the
Speech Accent Archive into MFCC features. However, I was unable to reproduce the existing
data with a couple of different Python packages (librosa and
python-speech-features). Therefore, I wouldn’t have been
able to trust the results of applying these packages to new audio recordings. This further contributed to the need to
narrow the scope of the project to a binary classifier.
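For reference, my extraction attempts looked roughly like the sketch below. The librosa calls are real, but the number of coefficients and the averaging over time are guesses at how the original features were produced, which is exactly where the mismatch came in.

```python
# Attempt to reproduce the dataset's MFCC features from a raw recording.
# n_mfcc=12 and the mean over time are assumptions about the original pipeline.
import librosa

y, sr = librosa.load("speaker_sample.wav", sr=None)   # keep the native sample rate
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)   # shape: (12, n_frames)
features = mfccs.mean(axis=1)                         # one summary value per coefficient
```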
Interpretability of Features
Due to the esoteric nature of the field of psychoacoustics, MFCC features are not interpretable to most people
(including myself). If I told you that “an accent being American is dependent on the 10th MFCC”, that would be
effectively meaningless. This was unfortunate for two reasons:
I couldn’t intelligently and creatively engineer features to improve my model. A more brute-force feature
engineering approach was somewhat helpful, but having domain-specific knowledge would have been advantageous.
The classifier itself wasn’t interpretable. In other words, I couldn’t draw any useful conclusions on the
relationship between particular MFCC and particular accents.
Modeling & Results
Before doing any feature engineering or tuning, I trained K-Nearest Neighbors, Random Forest, and Logistic
Regression models on the original dataset and recorded their training and validation ROC AUC scores to serve as baselines.
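The baseline setup looked something like this sketch, where X holds the MFCC features and y the binary label (American vs. not); those two variables are assumed from the data-loading step.

```python
# Fit untuned baseline models and record training/validation ROC AUC.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

baselines = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: train ROC AUC {train_auc:.3f}, validation ROC AUC {val_auc:.3f}")
```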
Exploratory Data Analysis & Feature Engineering
The data was of high quality and minimal cleaning was necessary, so I moved quickly into exploratory data analysis.
Upon plotting the distributions of each feature (separated by class), I noticed that some features had bimodal
distributions for only one class.
For features like x5 (unitless), the distribution of instances labeled “American” has a distinct second mode.
I tried adding a boolean feature indicating whether a given feature fell within the range of its second mode, hoping
to put more weight on that behavior and “enforce” the separability in the data.
Unfortunately, this neither improved the ROC AUC scores nor reduced the overfitting of the baseline models.
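The flag itself was a one-liner along these lines; the bounds below are illustrative, not the actual edges of the second mode.

```python
# Boolean indicator for whether x5 falls inside the range of its second mode.
lower, upper = 2.0, 4.0   # hypothetical mode boundaries read off the distribution plot
df["x5_in_second_mode"] = df["x5"].between(lower, upper).astype(int)
```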
I then automatically generated all interaction terms between the original features and plotted their feature
importances.
A subset of the interaction features’ importances.
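Generating and ranking the interaction terms can be done with a couple of scikit-learn utilities; here is a sketch, where the Random Forest is simply a convenient way to get importances and X, y are assumed from earlier.

```python
# Create all pairwise interaction terms and rank them by Random Forest importance.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = pd.DataFrame(
    poly.fit_transform(X), columns=poly.get_feature_names_out(X.columns)
)

rf = RandomForestClassifier(random_state=42).fit(X_interactions, y)
importances = pd.Series(rf.feature_importances_, index=X_interactions.columns)
print(importances.sort_values(ascending=False).head(10))
```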
I used the information gained from the feature importance graph to add various interaction features, but eventually
realized that simply removing x9 (one of the original MFCC features) yielded the best improvement in ROC AUC
as compared to the baseline model.
Model Selection & Tuning
The next step was to tune the hyperparameters of each baseline model, using the final set of selected features, to
achieve the best ROC AUC score and the least overfitting for each. This was all done using models from
scikit-learn. For the Logistic Regression model, tuning involved:
Running the model with both L2 (Ridge) and L1 (Lasso) regularization.
Optimizing the inverse regularization strength to strike a balance between maximizing the ROC AUC validation score
and reducing overfitting (minimizing the difference between the training and validation ROC AUC scores).
The difference between training and validation ROC AUC scores with respect to inverse regularization strength.
The training and validation scores plotted separately.
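The sweep behind those plots can be reproduced with a simple loop; a sketch, assuming the train/validation split from the baseline step above.

```python
# Sweep inverse regularization strength C and track the train/validation AUC gap.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

for C in np.logspace(-3, 2, 11):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"C={C:.3g}: val AUC {val_auc:.3f}, overfit gap {train_auc - val_auc:.3f}")
```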
This analysis resulted in a Logistic Regression with L1 regularization and C = 0.1, which reduced
overfitting to a negligible amount while retaining a good ROC AUC score of 0.85.
The K-Nearest Neighbors (KNN) model was optimized using the same metrics, but by tuning the “number of neighbors”
hyperparameter instead. The optimal KNN used 6 neighbors and achieved an ROC AUC score of 0.92, with reduced overfitting.
Finally, for the Random Forest I tweaked the number of estimators, max depth, and maximum number of observations per
leaf. The best Forest yielded an ROC AUC score of 0.83.
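Both of those searches fit naturally into GridSearchCV scored by cross-validated ROC AUC. A sketch with illustrative grids (and min_samples_leaf standing in for the per-leaf constraint):

```python
# Cross-validated grid searches for the KNN and Random Forest hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

knn_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(3, 16))},
    scoring="roc_auc",
    cv=5,
).fit(X_train, y_train)

rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 300], "max_depth": [3, 5, None], "min_samples_leaf": [1, 3, 5]},
    scoring="roc_auc",
    cv=5,
).fit(X_train, y_train)

print(knn_search.best_params_, knn_search.best_score_)
print(rf_search.best_params_, rf_search.best_score_)
```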
Among these three models, the KNN performed the best. However, I wanted to try ensembling different combinations of
these three models to see if I could outperform the individual KNN. It turned out that the best model was an ensemble
of the KNN and Logistic Regression, determined by examining a more detailed set of scores including accuracy, precision,
and recall of the classifier.
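One straightforward way to build that kind of ensemble is scikit-learn’s VotingClassifier with soft voting, plugging in the tuned hyperparameters from above. This is a sketch of the idea, not necessarily how I wired it up at the time.

```python
# Soft-voting ensemble of the tuned KNN and L1 logistic regression.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=6)),
        ("logreg", LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    ],
    voting="soft",   # average the predicted probabilities of the two models
)
ensemble.fit(X_train, y_train)
```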
Bonus: I also played around with XGBoost, but had limited time to optimize the hyperparameters and was satisfied
enough with the performance of my existing ensemble model.
Threshold Tuning
Classification models in scikit-learn use a threshold of 0.5 by default to classify predicted probabilities as the
positive class (for probabilities above the threshold) or the negative class (for probabilities below). However, by
changing the threshold, you can trade off precision and recall, and sometimes even improve accuracy.
Performance of model at different thresholds.
A threshold of 0.46 yielded negligible change in accuracy while providing a slightly better balance between
precision and recall.
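The scan behind that comparison is just a loop over predict_proba output; a sketch, reusing the ensemble and validation split from the earlier sketches.

```python
# Evaluate accuracy, precision, and recall at a range of probability thresholds.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

probs = ensemble.predict_proba(X_val)[:, 1]

for threshold in np.arange(0.30, 0.71, 0.02):
    preds = (probs >= threshold).astype(int)
    print(
        f"{threshold:.2f}: accuracy {accuracy_score(y_val, preds):.3f}, "
        f"precision {precision_score(y_val, preds):.3f}, "
        f"recall {recall_score(y_val, preds):.3f}"
    )
```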
Final Scoring
Scoring the final model on the test set yielded an overall accuracy of 0.89. I chose accuracy as the main scoring
metric because, when classifying accents, there’s no particular reason to favor precision or recall (whereas you
might care more about those metrics in higher-risk domains such as medicine).
Test set confusion matrix of final model.
Flask App
I deployed the model in a Flask application, originally on Google App Engine and later migrated to Heroku because it was cheaper.
The app allows you to select audio examples of each accent, listen to the sample, view its MFCC coefficients, and
play around with them to yield different predicted classes (accents).
Screenshot of the Flask application.
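Stripped of templates and styling, the core of such an app is a single prediction route. This is a minimal sketch; the route, request fields, and pickle file name are hypothetical, not the app’s actual code.

```python
# Minimal Flask prediction endpoint for the accent classifier.
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("ensemble_model.pkl", "rb") as f:   # hypothetical pickled model file
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    mfccs = np.array(request.json["mfccs"]).reshape(1, -1)   # 12 MFCC values from the UI
    prob_american = model.predict_proba(mfccs)[0, 1]
    return jsonify({"american_probability": float(prob_american)})
```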
Summary
I learned a couple of important things from completing this project:
It’s important to understand the limitations of your dataset before going too far down an impossible path. In my
case, it was wise to quickly switch to a binary classification after realizing how small and imbalanced my
dataset was.
Methodically working through model selection and tuning in an organized fashion can lead you to a model you are
happy with. It can help you avoid the infinite loop of model tweaking.
Disclaimer: Much of this post assumes that the reader has some basic data science knowledge, as opposed to my
inaugural post, which was a self-reflective, career-oriented piece.
I just completed the first major independent project during my time as a student at
Metis. We were given two weeks to complete a project that satisfied the
following constraints:
Train a model that predicts a continuous, numeric value using a mix of numerical and categorical data.
The model must be fit using only linear regression, polynomial regression, or any regularization-enhanced variant of
linear regression (Lasso, Ridge, ElasticNet).
At least a portion of the training data must be acquired via web-scraping.
Use a relatively small training set (on the order of hundreds or thousands of records).
Generally, I enjoyed working through the full data science workflow for the first time and being able to go deep into a
data set. This blog post details my process, from the initial search for data to deploying the model.
Motivation
My opinion going into my first project was that aside from all technical metrics, there are two determinants of a
compelling portfolio project:
The project should use a topic/data set that excites you. That excitement and passion will come through in your work.
The project should not just produce “interesting” or “cool” results. The results should be actionable and carry
value. Your future hiring manager might like to see something that is fun, but they also most likely want to see
that you can make valuable predictions and/or interpretations with data.
As a music producer
and someone exploring music technology as a potential career path, I felt compelled to search for music-related data.
I find the field of music recommendation and personalization to be exciting. Reading the
Spotify Engineering blog gave me the prior knowledge that they
are on the cutting edge of music technology and that they make interesting, high-quality data sets available through
their API. One particular Spotify data set
that stood out to me contains
“track audio features”.
These data are generated by a proprietary algorithm for every track on the Spotify platform, and contain such features
as “danceability”, “energy”, and “loudness”. Most of these features are numerically encoded on a scale of 0 to 1.
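Pulling those features is straightforward with the spotipy client; a rough sketch, assuming credentials are set in the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables and with a placeholder track ID.

```python
# Fetch track audio features from the Spotify Web API via spotipy.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

track_ids = ["<spotify-track-id>"]            # placeholder: any valid Spotify track ID
features = sp.audio_features(track_ids)[0]
print(features["danceability"], features["energy"], features["loudness"])
```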
As far as yielding valuable conclusions, the best way to apply this data seemed to be predicting the success of a song
based on its various qualities and understanding which factors carry the most weight. Choosing a target that
accomplished this goal took some time, but was a good lesson in selecting project scope.
Narrowing the Scope
Initial Scope
The first thing that came to mind to predict was Pitchfork album ratings. Pitchfork is
arguably the most well-established music blog on the internet. The website’s most prominent feature is its album reviews,
each of which contains a written review and a rating on a scale of 0 to 10. While these reviews have received
heavy criticism and scrutiny, their stamp of approval has been credited with jump starting the successful music careers
of artists like Arcade Fire and
Bon Iver. Surely, a good review
on Pitchfork is a strong indicator of (and sometimes reason for) a successful album.
However, initial exploratory data analysis showed that Pitchfork album rating might not be a promising target variable,
for two main reasons:
When I plotted all of my feature data against Pitchfork album rating, there was no obvious relationship. Could I have
chosen different features? Maybe, but I was really enjoying exploring the Spotify audio feature data and believed there
was some value in it.
Pitchfork Album rating plotted versus a few track audio features/metadata, showing very little feature-by-feature
correlation with the target.
Running a baseline Linear Regression model on the data yielded a validation R-squared score of 0.091, which felt like
too low a starting point for extensive feature engineering and tuning, given the time constraint.
The process of choosing an appropriate project scope.
Why was this model performing so poorly? If you remember, the Spotify audio features are available for individual tracks,
but the Pitchfork ratings are for albums. I had to do some data processing to aggregate the track-level
audio features into album-level features, which mostly amounted to taking averages across each album. My suspicion is that since an album
can contain songs with varied “vibes”, averaging features like “energy” and “danceability” neutralized the significance
of these metrics. Beyond that, it seems that album reviews and ratings are highly subject to the tastes of the person
writing a particular review, or perhaps which side of the bed they woke up on that morning. Something so subjective
is unlikely to show a clear pattern in the data.
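The aggregation step itself was roughly the following, assuming a track-level DataFrame with an album identifier; the column names here are assumptions about how the merged data was laid out.

```python
# Average track-level audio features up to the album level, then join the ratings.
album_features = (
    tracks_df
    .groupby("album_id")[["danceability", "energy", "loudness", "valence", "tempo"]]
    .mean()
    .reset_index()
)
albums = album_features.merge(pitchfork_ratings, on="album_id")
```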
If I had been asked by an employer or client specifically to predict Pitchfork album ratings, I would have persisted.
While I’m looking forward to having a job, being a student gave me the flexibility to pivot my project slightly.
Setting a New Target
My next objective was to find a track-level feature that in some way represented commercial success. Spotify, being
the good data stewards that they are, provides a “popularity score” for each track on their platform via the API.
Spotify doesn’t provide documentation on how they calculate popularity, a score given on a 0-100 scale, but they
do state that it is highly dependent on the number of streams a track has, and how recent those streams are - in other
words, how “hot” or “viral” a track has been. Once again, by creating a pair plot of the new target (popularity), I
saw no obvious correlations and knew that I hadn’t quite hit the mark on my project scope. Running another baseline
linear regression model doubled my previous R-squared score to around 0.2, but this still didn’t seem like a good
starting point for feature engineering and tuning, given my relative lack of exposure to those skills at the time.
Popularity plotted versus a few track audio features/metadata, showing very little feature-by-feature
correlation with the target.
Choosing the Final Scope
It was then that I learned the value of stepping away from the Jupyter Notebook for a few minutes to think about the
problem at hand from a practical, common-sense perspective. “Different genres of music are popular for different
reasons, right?”, I thought to myself. For instance, a fan of Folk/Country probably won’t like a “danceable”, “high
energy” song, but a fan of electronic music might. Fortunately, I already had a column for genre in my data set, so
I separated the data by genre and plotted a correlation matrix for each. Hip Hop music showed the strongest
correlations between the audio features and popularity, so I decided to finalize my scope and title my project “Predicting Popularity
of Hip Hop Music on Spotify”. I did some data cleaning on the filtered hip hop data, and persisted my data set in
preparation for the next phase of the project.
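The per-genre check amounts to a groupby and a correlation matrix; a sketch, again assuming the column names.

```python
# For each genre, see how strongly each audio feature correlates with popularity.
for genre, group in tracks_df.groupby("genre"):
    corr = group.select_dtypes("number").corr()
    print(genre)
    print(corr["popularity"].drop("popularity").sort_values(ascending=False).head())
```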
Feature Engineering, Modeling, & Technicals
The first way I improved the baseline model was by adding new features and doing feature engineering. Both of these
tasks can take you down deep rabbit holes, so given the time limit, I set a reasonable goal: add at least one
brand-new column of data and one new feature created by transforming an existing feature or combining features.
Feature Addition: Artist Followers
While Spotify’s definition of song “popularity” implies that even up-and-coming artists can have very popular songs,
artist follower count (available via Spotify’s API) still sounded like a strong contender for predicting song popularity.
After querying Spotify’s API for each artist of each song and adding the number of followers as a column to the data,
I plotted a correlation matrix:
Correlation matrix of data after adding number of artist followers as a feature.
Sure enough, artist followers was the most highly correlated feature.
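The lookup itself reuses the spotipy client from the earlier sketch; follower counts live on the artist object, and the column names below are assumptions.

```python
# Query follower counts for each unique artist and map them back onto the tracks.
follower_counts = {
    artist_id: sp.artist(artist_id)["followers"]["total"]
    for artist_id in tracks_df["artist_id"].unique()
}
tracks_df["artist_followers"] = tracks_df["artist_id"].map(follower_counts)
```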
Feature Engineering: Log Transforming Artist Followers
The feature I added turned out to be the feature I engineered. Upon inspecting the plot of popularity versus artist
followers, I noticed that the relationship could be fit to a logarithmic equation (log(x)).
Song Popularity vs. Artist Followers, a sparse, but logarithmic relationship.
Log transforming the feature (taking the log of artist followers) yielded a relatively strong linear relationship.
Song Popularity vs. the Log of Artist Followers, showing a linear relationship.
Finally, plotting the correlation matrix with the addition of Log(Artist Followers) shows a more highly correlated
feature to use in a model that predicts song popularity.
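The transformation is a one-liner; using log1p (rather than a plain log) is a small assumption on my part, since it handles any zero-follower artists gracefully.

```python
# Log-transform artist followers; log1p avoids -inf for artists with zero followers.
import numpy as np

tracks_df["log_artist_followers"] = np.log1p(tracks_df["artist_followers"])
```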
Modeling & Evaluation of Metrics
I began developing the final model by splitting the full data set into a training/validation (train-val) set (75% of
data) and test set (25% of data). Running cross validation on a linear regression model using the train-val set
yielded a training R-squared score of 0.61, a validation R-squared score of 0.58, and a root mean square error
(RMSE) of 12.29.
In an attempt to reduce model overfitting, I ran both a Lasso and Ridge regression with varying regularization
parameters (alpha). Both yielded similar best results of a training R-squared score equal to 0.62, validation R-squared
score of 0.61, and similar RMSE to the baseline linear regression. Both models succeeded in reducing complexity
and raising the validation R-squared score, without affecting RMSE.
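A sketch of that comparison, assuming X and y hold the hip hop features and the popularity target; the alpha grid and cross-validation settings are illustrative.

```python
# Compare plain, Lasso, and Ridge regression with cross-validated R^2 and RMSE.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

alphas = np.logspace(-3, 3, 25)
models = {
    "Linear": LinearRegression(),
    "Lasso": LassoCV(alphas=alphas, cv=5),
    "Ridge": RidgeCV(alphas=alphas),
}

for name, model in models.items():
    r2 = cross_val_score(model, X_trainval, y_trainval, cv=5, scoring="r2")
    rmse = -cross_val_score(
        model, X_trainval, y_trainval, cv=5, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: R^2 {r2.mean():.3f}, RMSE {rmse.mean():.2f}")
```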
From there, I experimented to see if I could raise the R-squared score and lower RMSE. I tried a combination of
polynomial regression and Lasso regression to eliminate unneeded polynomial features, which yielded very similar
results. Since the coefficients of polynomial regression are harder to interpret than linear features, and there
was no improvement in model performance, I threw the model out.
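The polynomial-plus-Lasso experiment was essentially a pipeline like this sketch, with degree-2 terms and scaling so the Lasso penalty treats features evenly.

```python
# Polynomial expansion followed by Lasso, which zeroes out unhelpful polynomial terms.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LassoCV

poly_lasso = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LassoCV(cv=5),
)
poly_lasso.fit(X_trainval, y_trainval)
print(poly_lasso.score(X_trainval, y_trainval))   # R^2 on the train-val set
```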
Another experiment I tried was pre-removing the features from the dataset that the initial Lasso removed (by reducing
their coefficients to zero). I then ran additional Lasso and Ridge regressions on that dimension-reduced feature set to
see if that had any effect on the model performance. This also yielded similar results, so I decided to go with the
model that I felt would be the easiest to interpret: the Lasso model run on the reduced feature set. Compared to the
original baseline regression, it had lower complexity and fewer features, along with a few slight advantages: a
validation R-squared score on the higher end of all of my models (0.619), the smallest gap between training and
validation R-squared scores (0.05), indicating very little overfitting, and an RMSE on the lower end of all of my
models (12.1).
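The pre-removal step described above can be done by fitting a Lasso and keeping only the features with non-zero coefficients; a sketch, assuming X_trainval is a DataFrame as in the earlier sketches.

```python
# Drop the features whose Lasso coefficients were shrunk to zero, then refit on the rest.
import numpy as np
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5).fit(X_trainval, y_trainval)
kept = X_trainval.columns[np.abs(lasso.coef_) > 1e-8]   # features the Lasso retained
X_reduced = X_trainval[kept]
```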
Scoring the model on the test set yielded a final R-squared score of 0.579 and RMSE of 13.8, meaning that, on average,
predictions of song popularity are off by about 13.8 points. If an artist is trying to determine whether their song
will be popular, a predicted score of, say, 70 out of 100 might in reality land anywhere from roughly 57 to 83. Based
on my understanding of the distribution of popularity on Spotify, even a score at the lower end of that range indicates
a successful song. This model could be improved quite a bit, but it still provides potential value for hip hop artists
trying to predict the impact of their music.
As an additional quality check on the model, I evaluated a few of the essential linear regression assumptions:
The residuals have a fairly constant variance for all predictions.
The residuals are approximately normally distributed.
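Those checks came down to two quick plots: residuals versus predictions for constant variance, and a residual histogram for normality. A sketch, assuming final_model is the Lasso fitted on the reduced feature set from the earlier sketches.

```python
# Residual diagnostics: variance across predictions and approximate normality.
import matplotlib.pyplot as plt

preds = final_model.predict(X_reduced)
residuals = y_trainval - preds

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(preds, residuals, alpha=0.5)
ax1.axhline(0, color="red")
ax1.set(xlabel="Predicted popularity", ylabel="Residual", title="Residuals vs. predictions")
ax2.hist(residuals, bins=30)
ax2.set(xlabel="Residual", ylabel="Count", title="Residual distribution")
plt.show()
```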
Final Steps
Deploying the Model
After scoring the final model, I challenged myself to create a
simple web application
(using Streamlit) that allows anyone to play around with the
model parameters and make predictions.
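The Streamlit app boils down to a handful of sliders feeding the trained model; a minimal sketch, where the feature set and file name are simplified assumptions rather than the app’s real inputs.

```python
# Minimal Streamlit front end: sliders for a few features, live popularity prediction.
import pickle
import numpy as np
import streamlit as st

with open("popularity_model.pkl", "rb") as f:   # hypothetical pickled model file
    model = pickle.load(f)

st.title("Predicting Popularity of Hip Hop Music on Spotify")
danceability = st.slider("Danceability", 0.0, 1.0, 0.5)
energy = st.slider("Energy", 0.0, 1.0, 0.5)
log_followers = st.slider("Log(artist followers)", 0.0, 20.0, 10.0)

features = np.array([[danceability, energy, log_followers]])
st.write("Predicted popularity:", round(float(model.predict(features)[0]), 1))
```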
As a final bonus, I learned how to make a Dockerfile for the application and deploy it to Google Cloud App Engine.
Update (2020-09-17): I’ve since migrated the app to Heroku because it is much cheaper.
Lessons Learned
Aside from the theory and technical skills learned, I took away a couple of general lessons from this project:
Refining scope can lead to a much more successful project, model, or product.
When feature engineering and model tuning get tedious, take a step back and think about the problem at hand from a
more global, common-sense perspective.