
Beats to Blocks

Peter Akdemir, Dustin La, Joshua Quizon, Rahul Shah

pja4@njit.edu, drl3@njit.edu, jbq2@njit.edu, rns22@njit.edu

CS 485 - Machine Listening

New Jersey Institute of Technology

Dr. Mark Cartwright

Overview

Given any song in .wav format, the program splits it into its individual stems and converts each stem to MIDI. From the MIDI files, it creates a Minecraft Note Block cover using the same or similar instruments as the original song. This allows any song to be converted into a Note Block cover automatically, without manually identifying notes, hand-writing a MIDI file, or building the cover entirely within Minecraft.

Motivation

Many covers of existing songs are made using Minecraft Note Blocks. However, the process of converting a song into Note Block form is obscure. This project aims to automate that process with machine learning models that split a song into stems, transcribe the stems into MIDI, and map the MIDI onto a set of Minecraft Note Blocks. While a production-level version of this project could be used to create more Note Block covers of existing songs, it also enables experimentation with and analysis of different combinations of models for stem splitting and music transcription.

What is a Minecraft Note Block?

Minecraft Note Blocks are interactive blocks in Minecraft which allow players to produce a range of musical notes on different instruments. Each block can be tuned to a specific note by right-clicking, which cycles through a set of pitches. Different instruments can be mimicked depending on the block type placed beneath the Note Block. This functionality enables the creation of elaborate musical sequences within the game. See below for the different sounds a Note Block can make:

Machine Learning Models

Spleeter

A music source separation tool built on pre-trained models created by Deezer. It takes a waveform as input and outputs up to five stems: vocals, drums, bass, piano, and other. A paper by Hennequin et al. states that it is one of the best-performing 4-stem separation models on the musdb18 benchmark.
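A minimal sketch of how Spleeter can be called from Python (the choice of the 5-stem model and the paths are illustrative):

from spleeter.separator import Separator

# Load Spleeter's pre-trained 5-stem model (vocals, drums, bass, piano, other).
separator = Separator("spleeter:5stems")

# Writes vocals.wav, drums.wav, bass.wav, piano.wav, and other.wav
# under output/<track name>/.
separator.separate_to_file("Track00001.wav", "output/")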

Demucs

Short for “Deep Extractor for Music Sources,” Demucs is a stem-splitting model created by Facebook that takes a waveform as input and outputs a 4-way split of bass, drums, vocals, and other sounds. Specifically, we use the Transformer-based version of Demucs, which improves separation by modeling both long- and short-range musical context. A paper by Défossez reports that it performs particularly well on the bass and drum stems.
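A minimal sketch of invoking Demucs from Python by shelling out to its command-line interface (the model name and output directory are assumptions, not necessarily the exact settings we used):

import subprocess

# Run the Demucs CLI on the input track; this writes bass.wav, drums.wav,
# other.wav, and vocals.wav under separated/htdemucs/Track00001/.
subprocess.run(
    ["demucs", "-n", "htdemucs", "-o", "separated", "Track00001.wav"],
    check=True,
)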

MT3

A music transcription model created by Google Magenta that takes a waveform as input and transcribes it to an equivalent MIDI file. According to a paper by Gardner et al., presented at ICLR 2022, it performs well on frame, onset, and offset estimation. While MT3 is not lightweight, its transcription capabilities are strong: it can detect multiple instruments and the notes each of them plays.

Spotify Basic Pitch

In the context of automatic music transcription (AMT), Spotify Basic Pitch aims to be both capable and lightweight enough for a production environment. The library accepts a waveform as input and translates it into MIDI. According to a paper by Bittner et al., presented at ICASSP 2022, the system exceeds comparable AMT systems on frame-wise onset, multipitch, and note-activation estimation. It is important to note that, unlike MT3, Spotify Basic Pitch only estimates notes, not instruments.
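A minimal sketch of transcribing a stem with Basic Pitch's Python interface (file names are illustrative):

from basic_pitch.inference import predict

# predict() returns the raw model output, a PrettyMIDI object, and a list of note events.
model_output, midi_data, note_events = predict("other.wav")

# Save the transcription of this stem as a MIDI file.
midi_data.write("other.mid")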

Open Note Block Studio

Open Note Block Studio is software that allows users to create Note Block songs and map MIDI files to Minecraft Note Blocks through a graphical user interface. Within our program, we create Minecraft Note Block Studio (NBS) files using the OpenNBS Python package. To produce our final audio outputs, we use the nbswave Python package to transform each .nbs file into a .wav file.
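A minimal sketch of building an NBS file programmatically, assuming the pynbs interface from the OpenNBS project (the note values and names are illustrative):

import pynbs

# Start an empty Note Block Studio project and add one harp note
# (instrument 0) at tick 0 on layer 0; key 45 is a mid-range pitch.
song = pynbs.new_file(song_name="Beats to Blocks demo")
song.notes.append(pynbs.Note(tick=0, layer=0, instrument=0, key=45))

# Save the project so it can be opened in Open Note Block Studio
# or rendered to audio with nbswave.
song.save("demo.nbs")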

Approach

Starting from the input .wav file, we use combinations of Demucs and Spleeter to split the song into stems, and then transcribe the stems to MIDI with MT3 or Spotify Basic Pitch. For each combination of stem splitter and transcriber, we combine specific MIDI files and convert them into Minecraft Note Blocks. This step uses the OpenNBS and nbswave Python packages to produce an NBS file and a final .wav output that represents the Minecraft Note Block cover.

Pipelines

These pipelines represent our combinations of Demucs, Spleeter, MT3, and Spotify Basic Pitch.

Pipeline #1 - MT3 Only

This pipeline uses only the MT3 model: the original source file is used directly as input to MT3, with no stem splitting.

Pipeline #2 - Demucs, Spleeter, MT3

This pipeline uses Demucs to split the track into bass, drums, vocals, and other sounds. It then passes the Demucs other stem through Spleeter to obtain separate piano and other stems. Finally, it transcribes all of these files through MT3. The specific files chosen for this pipeline are:

Pipeline #3 - Spleeter and MT3

This pipeline uses Spleeter to split the track into piano, drums, bass, and other. These stems are then transcribed into MIDI using MT3. The specific files chosen for this pipeline are:

Pipeline #4 - Demucs, Spleeter, Spotify Basic Pitch

This pipeline uses Demucs to split the track into bass, drums, vocals, and other sounds. It then passes the Demucs other stem through Spleeter to obtain separate piano and other stems. Finally, it transcribes all of these files through Spotify Basic Pitch. The specific files chosen for this pipeline are:

Pipeline #5 - Spleeter and Spotify Basic Pitch

This pipeline uses Spleeter to split the track into piano, drums, bass, and other. These stems are then transcribed into MIDI using Spotify Basic Pitch. The specific files chosen for this pipeline are:
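As a compact view of these five combinations, the following sketch expresses them as configuration data (the dictionary and names are ours for illustration, not the project's actual code):

# Each pipeline pairs a sequence of stem splitters with a transcriber.
PIPELINES = {
    1: ([], "mt3"),                          # no splitting, MT3 on the full mix
    2: (["demucs", "spleeter"], "mt3"),      # Demucs, then Spleeter on the other stem
    3: (["spleeter"], "mt3"),
    4: (["demucs", "spleeter"], "basic_pitch"),
    5: (["spleeter"], "basic_pitch"),
}

for number, (splitters, transcriber) in PIPELINES.items():
    split_desc = " -> ".join(splitters) if splitters else "no splitting"
    print(f"Pipeline #{number}: {split_desc}, transcribed with {transcriber}")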

Dataset

In this project, we assessed the performance of each model on all 20 tracks of the BabySlakh dataset. BabySlakh is a subset of Slakh2100, consisting of its first 20 tracks (i.e., Track00001 through Track00020, inclusive). All of the audio is in .wav format with a sample rate of 16 kHz. You can find the dataset here.

We used a subset of Slakh2100 because Slakh2100 was one of the datasets used to train the MT3 model. Additionally, we chose it over other datasets because it has well-formatted annotations for evaluation: it clearly defines the instrument classes used and includes ground-truth MIDI transcriptions. We reduced each BabySlakh track to its first minute due to the extensive run time of splitting and transcribing every stem of every track through our program.
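A minimal sketch of trimming a track to its first minute before it enters the pipeline, using librosa and soundfile (paths are illustrative, and this is not necessarily the exact preprocessing code we ran):

import librosa
import soundfile as sf

# Load only the first 60 seconds at the dataset's native 16 kHz sample rate.
audio, sr = librosa.load("Track00001/mix.wav", sr=16000, duration=60.0)

# Write the trimmed clip that the splitting and transcription steps consume.
sf.write("Track00001_first_minute.wav", audio, sr)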

Outputs and Results

The following are the results of feeding Track00001 from the BabySlakh dataset through our program: stem splitting with Demucs and Spleeter, MIDI transcription with MT3 and Spotify Basic Pitch, and production of the final audio outputs.

Input File

This is our input file from the BabySlakh dataset: the first minute of Track00001, which we feed into our program:

Demucs Splitting

Putting our input file of Track00001 through Demucs creates the following splits of bass.wav, drums.wav, vocals.wav, and other.wav:

bass.wav
drums.wav
vocals.wav
other.wav

Spleeter Splitting

Putting our input file of Track00001 through Spleeter creates the following splits of bass.wav, drums.wav, vocals.wav, other.wav, and piano.wav:

bass.wav
drums.wav
vocals.wav
other.wav
piano.wav

Demucs into Spleeter

After obtaining the outputs from Demucs, we put the Demucs other.wav split through Spleeter to further isolate the piano instrument, producing the following two files, other.wav and piano.wav:

other.wav
piano.wav

Spotify Basic Pitch Transcription

After obtaining the split files from Demucs, Spleeter, and Demucs into Spleeter, the files are then transcribed to MIDI using Spotify Basic Pitch:

Demucs

bass.mid
drums.mid
vocals.mid
other.mid

Spleeter

bass.mid
drums.mid
vocals.mid
other.mid
piano.mid

Demucs into Spleeter

other.mid
piano.mid

MT3 Transcription

After obtaining the split files from Demucs, Spleeter, and Demucs into Spleeter, the files are then transcribed to MIDI using MT3:

Demucs

bass.mid
drums.mid
vocals.mid
other.mid

Spleeter

bass.mid
drums.mid
vocals.mid
other.mid
piano.mid

Demucs into Spleeter

other.mid
piano.mid

Combining MIDIs

The following are the five pipelines described above, with their MIDI stem files combined into a single MIDI file per pipeline (a sketch of the merging step follows this list):

Pipeline #1 Combined MIDI
Pipeline #2 Combined MIDI
Pipeline #3 Combined MIDI
Pipeline #4 Combined MIDI
Pipeline #5 Combined MIDI
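A minimal sketch of how several stem MIDIs could be merged into one per-pipeline MIDI file with pretty_midi (the use of pretty_midi and the file names here are assumptions for illustration, not necessarily our exact code):

import pretty_midi

# Merge the chosen stem MIDIs into a single multi-track MIDI file
# by copying every instrument track into one container.
combined = pretty_midi.PrettyMIDI()
for path in ["bass.mid", "drums.mid", "other.mid", "piano.mid"]:
    stem = pretty_midi.PrettyMIDI(path)
    combined.instruments.extend(stem.instruments)
combined.write("pipeline_combined.mid")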

MIDI to Noteblock Studio (NBS) to Final Audio Outputs

The combined MIDI files are then converted into Minecraft Note Blocks as Note Block Studio (NBS) files using the OpenNBS Python package. Finally, the NBS files are rendered to .wav audio with the nbswave Python package to produce the final outputs (a sketch of this rendering step follows this list):

Pipeline #1
Pipeline #2
Pipeline #3
Pipeline #4
Pipeline #5
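As a rough sketch of that final rendering step, assuming nbswave's render_audio helper (file names are illustrative):

from nbswave import render_audio

# Render the Note Block Studio project to a .wav file using the
# default Minecraft note block sounds.
render_audio("pipeline1.nbs", "pipeline1.wav")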

Evaluation

We evaluated the performance of the specific MIDI files chosen for each pipeline in two ways. The following shows our evaluation methods, results, and conclusion.

Evaluation Method #1

First, we compared the frequencies of instrument classes in the transcribed MIDI files to those in the ground-truth MIDI files. For the predicted MIDIs, we assigned a representative instrument class to notes whose MIDI program numbers fell within the designated range for that class; for example, notes with program numbers between 0 and 7, inclusive, are assigned the “piano” class. Note that this method only applies to the pipelines that use MT3 (Pipelines #1, #2, and #3), because the Spotify Basic Pitch pipelines (Pipelines #4 and #5) transcribe everything to a single piano instrument rather than to multiple instrument classes as MT3 does.
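A minimal sketch of that mapping, assuming the standard General MIDI program-number families of eight (the class names in our actual evaluation code may differ slightly):

# General MIDI groups program numbers into families of eight:
# 0-7 = piano, 8-15 = chromatic percussion, ..., 120-127 = sound effects.
GM_CLASSES = [
    "piano", "chromatic percussion", "organ", "guitar",
    "bass", "strings", "strings (continued)", "brass",
    "reed", "pipe", "synth lead", "synth pad",
    "synth effects", "ethnic", "percussive", "sound effects",
]

def instrument_class(program: int) -> str:
    """Map a MIDI program number (0-127) to its representative class."""
    return GM_CLASSES[program // 8]

print(instrument_class(3))   # piano
print(instrument_class(30))  # guitar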

Evaluation Method #2

To further evaluate our results quantitatively across all pipelines (Pipelines #1-#5), we measured the note onset, offset, and pitch accuracy of each pipeline. We used precision, recall, F1 score, and note overlap to understand how well each pipeline performed each type of detection in the MIDIs it predicted for a Minecraft Note Block cover. Specifically, we evaluated note accuracy based on the following combinations of note properties:

  1. Onset, offset, and pitch prediction
  2. Onset and pitch prediction
  3. Onset prediction
  4. Offset prediction

We use the following mir_eval functions to evaluate the above combinations:

The following thresholds are defined for evaluation:
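As a hedged illustration of how such note-level scores can be computed with mir_eval (the note values are made up and the tolerances shown are mir_eval's defaults; the exact functions and thresholds used in the project are not reproduced here):

import numpy as np
import mir_eval

# Reference and estimated notes as (onset, offset) intervals in seconds
# plus pitches in Hz; illustrative values only.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0]])
ref_pitches = np.array([440.0, 660.0])
est_intervals = np.array([[0.02, 0.48], [0.51, 0.95]])
est_pitches = np.array([440.0, 659.0])

# 1. Onset, offset, and pitch matching; overlap is the mean overlap ratio of matched notes.
p, r, f, overlap = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches)

# 2. Onset and pitch only: ignore offsets by setting offset_ratio to None.
p_on, r_on, f_on, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches, offset_ratio=None)

# 3. Onset-only matching and 4. offset-only matching.
onset_p, onset_r, onset_f = mir_eval.transcription.onset_precision_recall_f1(
    ref_intervals, est_intervals)
offset_p, offset_r, offset_f = mir_eval.transcription.offset_precision_recall_f1(
    ref_intervals, est_intervals)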

Barplots - Evaluation Method #1

Pipeline #1: True vs. Predicted Instrument Class Frequencies

This bar plot shows the frequencies of predicted and true instrument classes for Pipeline #1, transcribed by MT3 with no stem splitting. Four instrument classes (synth effects, percussive, ethnic, and sound effects) went undetected in the MIDI that MT3 generated. Overall, the predicted instrument class frequencies are close to those in the ground truth, except for the guitar, piano, and string (continued) classes, where the gap between predicted and actual frequencies is larger.

Pipeline #2: True vs. Predicted Instrument Class Frequencies

This bar plot shows the frequencies of predicted and true instrument classes for Pipeline #2 (Demucs and Spleeter, transcribed by MT3). This pipeline overestimated the frequencies of all instrument classes except bass and synth lead. The predicted MIDI also failed to detect instruments from four classes: percussive, ethnic, synth effects, and sound effects.

Pipeline #3: True vs. Predicted Instrument Class Frequencies

This bar plot shows the frequencies of predicted and true instrument classes for Pipeline #3 (Spleeter, transcribed by MT3). This pipeline also overestimated the frequencies of most instrument classes, and the predicted MIDI again failed to detect instruments from the percussive, ethnic, synth effects, and sound effects classes. The results are similar to Pipeline #2, except that the overestimation here is even larger.

Heatmaps - Evaluation Method #2

Onset, Offset, and Pitch

These evaluation results consider note onset, offset, and pitch accuracy. Across these three properties, Pipeline #1 performed best in precision, recall, and F1. However, Pipelines #4 and #5 (the ones that use Spotify Basic Pitch) had the best overlap between true and predicted notes.

Onset and Pitch

These evaluation results consider onset and pitch accuracy. Overall, Pipeline #1 performed best in precision, recall, and F1, while Pipeline #5 performed worst; for overlap, the ranking reverses, with Pipeline #5 best and Pipeline #1 worst. This highlights the respective strengths of MT3 and Basic Pitch: MT3 is better at pitch accuracy, while Basic Pitch is better at onset detection.

Onset

These evaluation results consider only note onset prediction. In terms of precision, Pipelines #4 and #5 performed best, but in terms of recall they performed worst; on F1 they still come out ahead. Overall, the range of scores is quite small.

Offset

These are evaluation results that consider only note offset prediction performance. Pipelines that use Basic Pitch for transcription (Pipelines #4 and #5) clearly performed the best.

Conclusion

With Evaluation Method #1, which applies only to the instrument class predictions of Pipelines #1, #2, and #3, the goal was to see whether stem splitting prior to MT3 transcription helps. As observed, MT3 transcription without prior stem splitting led to better instrument class identification; prior stem splitting generally led to an overestimation of instrument class frequencies.

Evaluation Method #2 was used to calculate precision, recall, F1 score, and overlap for each pipeline under each comparison criterion (onset, offset, and pitch). Considering all metrics together, the MT3 pipelines (Pipelines #1, #2, and #3) performed best, while in terms of onset and offset accuracy alone, the Spotify Basic Pitch pipelines (Pipelines #4 and #5) performed best.

Listen in Minecraft!

By importing an NBS file created by our pipelines into Open Note Block Studio, we can convert it into a Minecraft schematic. This schematic can then be placed into a Minecraft world and played within the game. Below is Track00001 of our dataset, generated with Pipeline #1 and played in Minecraft using actual Note Blocks:

© 2023 Beats to Blocks