OMR Act 4 - Three Machine Learning Approaches
During the dataset creation, you may have seen some pixel-wise segmentation masks and bounding boxes, so you may already have guessed what comes next. In this article, we explore three different techniques in our machine learning pipeline to tackle the problem of OMR.
Disclaimer: We focus on simple toy examples.
The following techniques will be applied:
- Pixel-wise semantic segmentation
- Simple regression
- Bounding-box regression
01 - Semantic Segmentation
You may ask yourself why you would want to do a pixel-wise segmentation in the first place. If your main goal is to read the musical symbols and retrieve something that the computer understands, then yes, you may not need it at all. However, it can be beneficial as a pre-processing step, for understanding the problem space we are dealing with, and simply for the sake of trying out an easily interpretable approach of manageable complexity.
One of the benefits of a pixel-wise segmentation is that the input size can vary. Typically, the U-Nets selected for these tasks are fully convolutional and thus $N_{\text{in}} = N_{\text{out}}$, with $N$ as the number of neurons.
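A practical caveat to the variable input size: encoder-decoder networks that halve the resolution several times usually still need each image side to be divisible by a fixed factor. The sketch below computes how much padding an image would need. The factor of 32 is an assumption (it corresponds to an encoder with five downsampling stages), and `padding_to_multiple` is a hypothetical helper, not part of any library.

```python
from typing import Tuple


def padding_to_multiple(height: int, width: int, multiple: int = 32) -> Tuple[int, int]:
    """Return (pad_bottom, pad_right) so that both sides become divisible by `multiple`."""
    pad_h = (-height) % multiple  # 0 if the height is already divisible
    pad_w = (-width) % multiple
    return pad_h, pad_w


if __name__ == "__main__":
    for h, w in [(128, 256), (100, 300)]:
        pad_h, pad_w = padding_to_multiple(h, w)
        print((h, w), "->", (h + pad_h, w + pad_w))
```

Padding only the bottom and right edges keeps the pixel coordinates of the original image unchanged, so the predicted mask can simply be cropped back afterwards.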
First of all, we need to load the data, which is easiest by writing a custom Dataset:
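    from dataclasses import dataclass
    from typing import Any, List, Optional, Tuple

    import numpy as np
    import torch
    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import transforms


    @dataclass
    class Sample:
        image: torch.Tensor
        mask: torch.Tensor


    class SheetMusicSegmentationDataset(Dataset):
        def __init__(
            self,
            samples: List[Tuple[str, str]],  # pairs of (image path, mask path)
            transform: Optional[Any] = None,
        ) -> None:
            self.all_files = samples
            self.transform = transform
            self.pil_to_tensor = transforms.PILToTensor()

        def __len__(self) -> int:
            return len(self.all_files)

        def __getitem__(self, idx: int) -> Sample:
            img_path, mask_path = self.all_files[idx]
            img = Image.open(img_path).convert("L")  # grayscale, one channel
            mask = np.load(mask_path)
            sample = Sample(
                image=self.pil_to_tensor(img),  # shape (1, H, W)
                # (2, 1, 0) maps a (W, H, C) array to (C, H, W);
                # use (2, 0, 1) if the mask is stored as (H, W, C)
                mask=torch.from_numpy(mask).permute(2, 1, 0).float(),
            )
            if self.transform:
                sample = self.transform(sample)
            return sample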
For the model, we use a U-Net with an EfficientNet encoder from the segmentation_models_pytorch library. As this problem is quite simple, we don't even use pre-trained weights.
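    import segmentation_models_pytorch as smp


    class EfficientNet(smp.Unet):
        def __init__(self, n_classes: int) -> None:
            super().__init__(
                in_channels=1,  # expects a grayscale image
                encoder_name="efficientnet-b3",
                # the library default is "imagenet"; None trains from scratch
                encoder_weights=None,
                classes=n_classes,
                activation="sigmoid",
            )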
For the loss, we use the JaccardLoss from the pytorch_toolbelt library.
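To give an intuition for what a Jaccard (intersection over union) loss measures, here is a minimal pure-Python sketch of its soft variant on flattened masks. This is only an illustration: the function name and the `eps` smoothing term are assumptions, and the library implementation may differ in details such as smoothing or per-class averaging.

```python
from typing import Sequence


def soft_jaccard_loss(
    pred: Sequence[float], target: Sequence[float], eps: float = 1e-7
) -> float:
    """1 - IoU, computed on probabilities instead of hard 0/1 decisions."""
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 - (intersection + eps) / (union + eps)


if __name__ == "__main__":
    target = [1.0, 0.0, 0.0, 0.0]
    print(soft_jaccard_loss([1.0, 0.0, 0.0, 0.0], target))  # perfect overlap
    print(soft_jaccard_loss([1.0, 1.0, 0.0, 0.0], target))  # half of the union overlaps
    print(soft_jaccard_loss([0.0, 1.0, 0.0, 0.0], target))  # no overlap
```

Because the loss is normalised by the union, thin structures such as staff lines are not drowned out by the large background area, which is one common reason to prefer it over a plain per-pixel loss.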
Without going into too much detail regarding the training process, we will show some results for different datasets.
The first dataset that we will use is the already discussed set from Act 3. It always includes exactly two quarter notes, which are placed at random heights. The stems point upwards, and there is no clef or accidental, so the staff lines are always of equal length. Examples of the dummy dataset are shown below:
Now, let's use the very same network with a more advanced dataset. This dataset, which we name DummyOMRv02, has quarter and half notes placed at random heights on one staff.
It may include flat and sharp accidentals, has stems in both directions, and contains a varying number of notes per image, which leads to varying image sizes.
02 - Simple Regression
Ultimately, our goal is to read sheet music, i.e. given an image of a score, we want to retrieve a digitized data representation of the score. The semantic segmentation as we used it may serve as a pre-processing step; however, we would still need to apply advanced computer vision to get to our representation. Instead, we could just take our data as it is and try to predict the representation directly (and thus be able to render it again!). We do this by phrasing the problem as a regression problem.
Again, we focus on our toy example, i.e. DummyOMRv01.
To represent the data, we transform the score into a single vector of 1s and 0s.
A score with no note will be all zeros.
A score with one note will have exactly one 1, and a score with two notes will have exactly two 1s.
The position of each 1 within the vector encodes where the note is (on the staff).
For example, a C in octave 4 is at the 24th position, a D in octave 4 at the 25th position, and so on.
A chord which has a C and an E, both in octave 4, will have a 1 at the 24th as well as the 26th position.
Ground Truth: [0 1 0 1 0 0 0 0 0 0 0 0 ...]
Prediction: [0 1 0 0 1 0 0 0 0 0 0 0 ...]
With this encoding, we can formulate the problem as a regression, where the network will predict this vector of 0s and 1s which can be parsed and rendered by our program immediately (without any further computer vision tasks / logic).
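The encoding described above can be sketched as follows. The positions given in the text (C4 at 24, D4 at 25, E4 at 26) are consistent with counting diatonic steps from A0 at position 1; that anchoring, the vector length of 52, and all function names are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

STEPS = "CDEFGAB"
VECTOR_LENGTH = 52  # assumption: one slot per diatonic step from A0 to C8


def position(step: str, octave: int) -> int:
    """1-based slot of a note, e.g. A0 -> 1 and C4 -> 24."""
    pos = 7 * octave + STEPS.index(step) - 4
    if not 1 <= pos <= VECTOR_LENGTH:
        raise ValueError(f"{step}{octave} is outside the encodable range")
    return pos


def encode(notes: Sequence[Tuple[str, int]]) -> List[int]:
    """Turn a list of (step, octave) pairs into a multi-hot vector."""
    vector = [0] * VECTOR_LENGTH
    for step, octave in notes:
        vector[position(step, octave) - 1] = 1
    return vector


def decode(vector: Sequence[float], threshold: float = 0.5) -> List[Tuple[str, int]]:
    """Read the notes back from a (possibly non-binary) network output."""
    notes = []
    for index, value in enumerate(vector):
        if value >= threshold:
            octave, step_index = divmod(index + 1 + 4, 7)
            notes.append((STEPS[step_index], octave))
    return notes


if __name__ == "__main__":
    chord = encode([("C", 4), ("E", 4)])
    print(decode(chord))
```

Decoding with a threshold is what makes the output immediately renderable: no further image processing is needed, only a pass over the vector.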
As the loss function, a weighted mean squared error is used, and as the model a ResNet50 that was pre-trained on the ImageNet dataset. The pre-training improved the performance significantly: without it, I was unable to get any meaningful results.
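The article does not spell out how the mean squared error is weighted. One plausible reading, sketched below purely as an assumption, is to up-weight the rare 1 entries so that predicting all zeros is not a cheap solution for such a sparse target. The function name and the weight of 10 are hypothetical.

```python
from typing import Sequence


def weighted_mse(
    pred: Sequence[float], target: Sequence[float], positive_weight: float = 10.0
) -> float:
    """MSE in which errors on target entries equal to 1 count `positive_weight` times."""
    total = 0.0
    for p, t in zip(pred, target):
        weight = positive_weight if t == 1 else 1.0
        total += weight * (p - t) ** 2
    return total / len(target)


if __name__ == "__main__":
    target = [0, 1, 0, 0]
    all_zero = [0.0, 0.0, 0.0, 0.0]
    print(weighted_mse(all_zero, target, positive_weight=1.0))  # plain MSE
    print(weighted_mse(all_zero, target))  # missing the note is penalised more
```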
Below, the convergence as well as some predictions for DummyOMRv01 are shown.
Although we see a rapid convergence in the beginning, the network is not able to converge to a global minimum.
As for the predictions, it correctly predicted samples 2 and 4, but it was slightly off for sample 3 (it predicted just one note) and far off for sample 1 (it predicted more than 10 notes).
Problem: Expressiveness of Loss Function?
One problem may be the expressiveness of the loss function. For example, consider the following situation:
Ground Truth: [0 1 0 1 0 0 0 0 0 0 0 0 ...]
Prediction 1: [0 1 0 0 1 0 0 0 0 0 0 0 ...]
Prediction 2: [0 1 0 0 0 0 0 0 0 0 0 1 ...]
Predictions 1 and 2 both contain the "correct amount" of 1s, but in both predictions, one of them is at a wrong position.
The loss (mean squared error) will be the same.
However, semantically, prediction 1 is much closer to the ground truth than prediction 2, as the position of each 1 encodes a pitch.
Prediction 1 is therefore just one step off, while prediction 2 is very far away from the solution.
Our loss function doesn't encode these semantics, which may be one reason why the training is slow and gets stuck in a local minimum.
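The situation above can be checked numerically. The sketch below computes the mean squared error for both predictions and, for comparison, a position-aware distance based on cumulative sums (a one-dimensional earth mover's distance). The second measure is only an illustration of what "encoding the semantics" could look like; it is not what was used in the experiments.

```python
from itertools import accumulate
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cumulative_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of absolute differences between running sums: grows with how far a 1 is displaced."""
    return sum(abs(p - t) for p, t in zip(accumulate(pred), accumulate(target)))


ground_truth = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
prediction_1 = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
prediction_2 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

if __name__ == "__main__":
    for name, pred in [("Prediction 1", prediction_1), ("Prediction 2", prediction_2)]:
        print(name, "MSE:", mse(pred, ground_truth),
              "cumulative distance:", cumulative_distance(pred, ground_truth))
```

Both predictions get the same MSE, while the cumulative distance is 1 for prediction 1 and 8 for prediction 2, matching the intuition that the first is one step off and the second is far away.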
Problem: Lack of Transfer Learning
Compared to the semantic segmentation, there is little point in feeding examples from different distributions to this approach.
If the network is given an image unlike anything it has seen before (e.g. one where 30 notes are present), it will produce garbage for the vector.
This approach is therefore generally not well suited for transfer learning, where a model trained on toy examples is expected to still yield good results on different distributions.
However, by far the largest benefit of this approach is that none of the post-processing steps required by other techniques are needed. The result is the data itself, represented in a way that is immediately clear.
03 - Bounding Box Regression
Finally, as a third approach, we tackle something that is a mix between the pixel-wise semantic segmentation and the simple regression. Even though the semantic segmentation gave us nice and (human-)interpretable results, the amount of post-processing needed is still enormous. Instead of predicting which pixel belongs to which class, it would be much easier to predict the locations of the classes or, more specifically, their bounding boxes. With these bounding boxes, we get results that are much easier to interpret and that we can parse into a digitized score using simple logic.
At first, I had a lot of trouble getting a minimal working example for the SSD bounding box detection using this tutorial with a network pre-trained on ImageNet. After I finally got it working, I progressed quickly and tried out many things, including different datasets and data augmentations such as photometric distortions, cropping, expanding, rotation and more. Results are shown for the more complex datasets, including data augmentation.
In this case, we actually predict two different classes: one for the note heads and one for the staff. The staff is needed so that we can reconstruct the heights of the notes. Without knowing the location of the staff, we cannot infer the heights.
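How the staff box makes the note heights recoverable can be sketched in a few lines. The sketch assumes boxes in the form (x_min, y_min, x_max, y_max) with y growing downwards, and a staff box whose top and bottom edges coincide with the outer staff lines. Both the box format and the helper name `staff_step` are assumptions for illustration, not the parser used in the experiments.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def staff_step(staff: Box, note_head: Box) -> int:
    """Diatonic step of a note head relative to the bottom staff line.

    0 is the bottom line, 1 the first space, 8 the top line;
    negative values lie below the staff.
    """
    _, staff_top, _, staff_bottom = staff
    # Five lines span four gaps, i.e. eight half-gaps (one per diatonic step).
    half_gap = (staff_bottom - staff_top) / 8
    center_y = (note_head[1] + note_head[3]) / 2
    return round((staff_bottom - center_y) / half_gap)


if __name__ == "__main__":
    staff = (0.0, 40.0, 200.0, 80.0)
    heads = [(50.0, 55.0, 62.0, 65.0), (90.0, 80.0, 102.0, 90.0)]
    # Sorting by x_min restores the reading order of the notes.
    for head in sorted(heads, key=lambda box: box[0]):
        print(head, "-> step", staff_step(staff, head))
```

Mapping a step to an actual pitch additionally requires a reference such as a clef, which the toy datasets do not contain, so the sketch stops at the staff-relative step.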
Convergence and predictions for the more complex DummyOMRv02 are shown below.
The convergence is quick, although it never reaches zero (mainly due to the augmentations used).
Investigation: Transfer Learning with Bbox Regression
Although I expected the network to perform well in transfer learning, such as using a network that has been trained on just two note heads to predict scores with three or more note heads, it actually did not. Data augmentation as well as a variety of notes proved to be very important for solving the bounding box regression.
In the following, we will be using three different models:
- Trained on DummyOMRv01
- Trained on DummyOMRv01 + Augmentations
- Trained on DummyOMRv02 + Augmentations
On the DummyOMRv01 dataset, all of them perform well, just as expected.
On DummyOMRv01 with the only change of containing three instead of two notes, only the model trained on the more diverse DummyOMRv02 is able to make good predictions.
The first two models perform equally badly.
On the DummyOMRv02 dataset, the last model performs well, as expected.
Although models one and two both do not perform well on these samples, it is interesting to see that v01 + Aug performs significantly better than v01 without augmentation.
Therefore, the network did learn the concept of note heads (at least in some way, and at this specific resolution).
Summary
Although I initially thought a U-Net with semantic segmentation made the most sense due to the flexibility of input size = output size, I didn't even try to use the results of the network to get back to the note representation. Investigating the other end of the spectrum, i.e. directly predicting a vector representing the data, ultimately didn't achieve the expected performance on a dummy dataset and was therefore not explored further. The most promising results were achieved with the bounding box regression, where a small parser from bounding boxes back to the score could also be written in a simplified form in around 100 lines of code. It has proven to be a good balance between computer-interpretability, performance and robustness when applied to unseen data. However, there is no magic, and more complex or complete examples could not be detected so far.
Next Steps
It may be interesting to investigate whether the output of the semantic segmentation can be used as an input to the bounding box regression and whether this makes it possible to predict the note heads with an even higher accuracy.
More modern techniques such as vision transformers or attention may be used. In addition to that, it may be possible to apply some seq2seq approaches, as reading music notes is similar to reading text.
Final Words
Over the course of Act 1 to Act 4, I have shown the implementation and usage of the notensatz library to render music.
Due to the design of a filetype-agnostic rendering, it is possible to create various formats that come in handy when creating datasets for machine learning tasks.
In combination with the typestate builder pattern and a RandomConfig struct, creating such datasets is incredibly easy, user-friendly and fast.
Alongside other use-cases for this library, it may serve as an educational toolbox to teach and explore machine learning topics. I heavily promote the field of optical music recognition for educational purposes as it is
- visually intuitive,
- well structured and human interpretable,
- simple, yet scalable to an infinite complexity, and
- possible to solve in various ways.