OMR Act 4 - Three Machine Learning Approaches

During the dataset creation, you may have seen some pixel-wise segmentation masks and bounding boxes, so you already got a preview of what comes next. In this article, we will explore three different techniques in our machine learning pipeline to tackle the problem of OMR.
Disclaimer: We focus on simple toy examples
The following techniques will be applied:
  1. Pixel-wise semantic segmentation
  2. Simple regression
  3. Bounding-box regression

01 - Semantic Segmentation

You may ask yourself why we want to do a pixel-wise segmentation in the first place. If the main goal is to read the musical symbols and retrieve something the computer understands, then indeed, you may not need it at all. However, it may be beneficial as a pre-processing step, for understanding the problem space we are dealing with, and simply for the sake of trying out an easily interpretable approach of manageable complexity.
One of the benefits of pixel-wise segmentation is flexibility in input size. The U-Nets typically selected for these tasks are fully convolutional, so the output resolution matches the input resolution: $H_{\text{out}} \times W_{\text{out}} = H_{\text{in}} \times W_{\text{in}}$.

First of all, we need to load the data, which is easiest done by writing a custom Dataset:
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


@dataclass
class Sample:
    image: torch.Tensor
    mask: torch.Tensor


class SheetMusicSegmentationDataset(Dataset):
    def __init__(
        self,
        samples: List[Tuple[str, str]],
        transform: Optional[Any] = None,
    ) -> None:
        # each entry holds the paths of an image and its corresponding mask
        self.all_files = samples
        self.transform = transform
        self.pil_to_tensor = transforms.PILToTensor()

    def __len__(self) -> int:
        return len(self.all_files)

    def __getitem__(self, idx: int) -> Sample:
        img_path, mask_path = self.all_files[idx]
        img = Image.open(img_path).convert("L")  # load as grayscale
        mask = np.load(mask_path)
        sample = Sample(
            image=self.pil_to_tensor(img),
            mask=torch.from_numpy(mask).permute(2, 1, 0).float(),  # reorder to channel-first
        )
        if self.transform:
            sample = self.transform(sample)
        return sample
For the model, we use a U-Net with an EfficientNet-B3 encoder, making use of the segmentation_models_pytorch library. As this problem will be quite simple, we don't even use pre-trained weights.
import segmentation_models_pytorch as smp


class EfficientNet(smp.Unet):
    def __init__(self, n_classes: int) -> None:
        super().__init__(
            in_channels=1,  # expects a grayscale image
            encoder_name="efficientnet-b3",
            encoder_weights=None,  # train from scratch, no pre-trained weights
            classes=n_classes,
            activation="sigmoid",
        )
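To see the size flexibility mentioned at the beginning in action, here is a quick, purely illustrative sanity check; for this encoder, the sides just need to be divisible by 32 because of the five downsampling stages.
model = EfficientNet(n_classes=1)
for h, w in [(128, 512), (256, 1024)]:
    out = model(torch.rand(1, 1, h, w))  # batch of one grayscale image
    assert out.shape == (1, 1, h, w)     # output resolution matches the input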
For the loss, we will make use of the pytorch_toolbelt library and use the JaccardLoss. Without going into too much detail regarding the training process, we will show some results for different datasets.
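As an illustration of how these pieces fit together, a minimal training-loop sketch could look as follows. The batch size, learning rate, and epoch count are placeholders, dataset stands for a SheetMusicSegmentationDataset instance with equally sized samples, and mode="binary" assumes a single foreground class (with several mask channels, "multilabel" would be the choice):
from pytorch_toolbelt.losses import JaccardLoss
from torch.utils.data import DataLoader

def collate(batch: List[Sample]) -> Tuple[torch.Tensor, torch.Tensor]:
    # stack the Sample fields into batched tensors, scaling images to [0, 1]
    images = torch.stack([s.image.float() / 255.0 for s in batch])
    masks = torch.stack([s.mask for s in batch])
    return images, masks

model = EfficientNet(n_classes=1)
loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate)
criterion = JaccardLoss(mode="binary", from_logits=False)  # the model already applies a sigmoid
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(20):
    for images, masks in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()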

The first dataset that we will use is the already discussed set from Act 3. It always includes exactly two quarter notes placed at random heights. The stems point upwards, there is no clef or accidental, and the staff lines are therefore always of equal length. Examples of the dummy dataset are shown below:
Six samples of the DummyOMRv01 dataset.
After running the pipeline, the convergence and prediction results look very promising. The loss converges quickly, and no staff or stem lines bleed into the prediction.
Convergence of the training for the semantic segmentation.
input image (left) - ground truth (middle) - prediction (right)
Let's try out some transfer learning and apply the prediction to images outside of the sample distribution:
Simple network applied to different distribution.
Even for those vastly different examples, the network does an overall good job at filtering out staff lines. However, it detects anything blob-like as a note head, which is best visible in the last row, where the pixels of a flat accidental got detected as a note head. Still, due to the fully convolutional U-Net architecture, it is possible to feed in any size and retrieve a somewhat reasonable output. Pretty cool!

Now, let's use the very same network with a more advanced dataset. This dataset, which we name DummyOMRv02, has quarter and half notes placed at random heights on one staff. It may include flat and sharp accidentals, have stems in both directions, and contain a varying number of notes per image, and therefore comes in varying sizes.
Six samples of the DummyOMRv02 dataset.
Predictions on this dataset are plotted below. In general, the note heads are detected clearly with high accuracy and resolution. The only visible problem is that some of the lower-resolution quarter note heads have a small hole, as if they were half notes.
Predictions of network trained on DummyOMRv02.
It may be an insightful investigation to separate the filled and hollow note heads into two different classes and see how the network performs. However, I have not investigated semantic segmentation with multiple classes so far.


02 - Simple Regression

Ultimately, our goal is to read sheet music, i.e. given an image of a score, we want to retrieve a digitized data representation of it. The semantic segmentation as we used it may serve as a pre-processing step; however, we would still need to apply advanced computer vision to get to our representation. Instead, we could take our data as it is and try to predict the representation directly (and thus be able to render it again!). We do this by phrasing the problem as a regression problem.

Again, we focus on our toy example, i.e. DummyOMRv01. To represent the data, we transform the score into a single vector of 1s and 0s. A vector with no note will be all zeros, a score with one note will have exactly one 1, and a score with two notes will have exactly two 1s. The position of each 1 within the vector encodes where the note is (on the staff): a C in octave 4 sits at the 24th position, a D in octave 4 at the 25th position, and so on. A chord consisting of a C and an E, both in octave 4, will have a 1 at the 24th as well as the 26th position. For example:
Ground Truth: [0 1 0 1 0 0 0 0 0 0 0 0 ...]
Prediction:   [0 1 0 0 1 0 0 0 0 0 0 0 ...]
With this encoding, we can formulate the problem as a regression, where the network predicts this vector of 0s and 1s, which can then be parsed and rendered by our program immediately (without any further computer vision logic).
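To make the encoding concrete, here is a sketch of how such a target vector could be built. The vector length and the exact pitch-to-index mapping are illustrative assumptions, not the original implementation:
import numpy as np
from typing import List

VECTOR_SIZE = 64  # assumed length of the target vector

def encode_notes(staff_positions: List[int]) -> np.ndarray:
    # set a 1 at the index of each present note, e.g. C4 -> 24, D4 -> 25, ...
    target = np.zeros(VECTOR_SIZE, dtype=np.float32)
    target[staff_positions] = 1.0
    return target

encode_notes([24, 26])  # a chord of C4 and E4 sets positions 24 and 26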

As loss function, a weighted mean squared error is used, and as model a ResNet50 pre-trained on ImageNet. The pre-training boosted the performance significantly, to the point where I was unable to get any meaningful results without it.
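A sketch of this setup, reusing VECTOR_SIZE from the encoding sketch above; the exact weighting scheme of the loss is an illustrative guess, and the grayscale score images would need to be repeated to three channels to match the pre-trained stem:
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, VECTOR_SIZE)  # predict the 0/1 vector

def weighted_mse(pred: torch.Tensor, target: torch.Tensor, pos_weight: float = 10.0) -> torch.Tensor:
    # up-weight the rare 1-entries so the network cannot just predict all zeros
    weights = 1.0 + (pos_weight - 1.0) * target
    return (weights * (pred - target) ** 2).mean()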

Below, the convergence as well as some predictions for DummyOMRv01 are shown. Although we see rapid convergence in the beginning, the network is not able to converge to a global minimum. For the predictions, it correctly predicted samples 2 and 4, but it failed slightly for sample 3 (it predicted just one note) and badly for sample 1 (it predicted more than 10 notes).
Convergence of the training for the regression.
input image (left) - ground truth (middle) - prediction (right)
The reasons why I wasn't able to solve the regression remain unclear for now and need further investigation, especially as this is an artificial dummy dataset without any data augmentation, which should be easy to learn.
Problem: Expressiveness of Loss Function?
One problem may be the expressiveness of the loss function. For example, consider the following situation:
Ground Truth: [0 1 0 1 0 0 0 0 0 0 0 0 ...]
Prediction 1: [0 1 0 0 1 0 0 0 0 0 0 0 ...]
Prediction 2: [0 1 0 0 0 0 0 0 0 0 0 1 ...]
Predictions 1 and 2 both contain the "correct" number of 1s, but in both predictions one of them is at a wrong position. The loss (mean squared error) will be the same for both. Semantically, however, prediction 1 is much closer to the ground truth than prediction 2, since each 1 encodes a position: prediction 1 is just one step off, while prediction 2 is very far away from the solution. Our loss function does not encode these semantics, which may be one reason why training is slow and gets stuck in a local minimum.
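A quick numerical check of this argument, using the vectors from above:
import torch

gt     = torch.tensor([0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])
pred_1 = torch.tensor([0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
pred_2 = torch.tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])

# both predictions differ from the ground truth in exactly two entries,
# so the mean squared error is identical: 2/12
print(torch.mean((pred_1 - gt) ** 2))  # tensor(0.1667)
print(torch.mean((pred_2 - gt) ** 2))  # tensor(0.1667)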
Problem: Lack of Transfer Learning
Compared to the semantic segmentation, there is no point in feeding examples from other distributions to this approach. If the network has never seen a comparable image before (e.g. one where 30 notes are present), it will produce garbage for the vector. Unlike the segmentation, where even a toy example yielded reasonable results on different distributions, this approach is generally not well suited for transfer learning.

However, by far the largest benefit of this approach is that none of the post-processing steps necessary in the other techniques are needed at all. The result is the data, represented in a way that is immediately usable.
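To make this concrete, decoding a prediction back into notes reduces to simple thresholding (a sketch; the 0.5 cutoff and the helper name are illustrative choices):
import numpy as np
from typing import List

def decode_notes(pred: np.ndarray, threshold: float = 0.5) -> List[int]:
    # indices of entries above the threshold are the predicted staff positions
    return [i for i, value in enumerate(pred) if value > threshold]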


03 - Bounding Box Regression

Finally, as a third approach, we tackle something that is a mix between pixel-wise semantic segmentation and simple regression. Even though the semantic segmentation gave us nice and (human-)interpretable results, the amount of post-processing needed is still enormous. Instead of predicting which pixel belongs to which class, it would be much easier to predict the locations of the classes, more specifically, bounding boxes of the classes. With these bounding boxes, we get results that are much easier to interpret and that we can parse into a digitized score with simple logic.

At first, I had a lot of trouble getting a minimum working example for the SSD bounding box detection using this tutorial with a network pre-trained on ImageNet. After I finally got it working, I progressed quickly and tried out many things, including different datasets and data augmentations such as photometric distortions, cropping, expanding, rotation, and more. Results are shown for the more complex datasets, including data augmentation.

In this case, we actually predict two different classes: one for the note heads and one for the staff. The staff is needed so that we can reconstruct the notes' heights; without knowing the location of the staff, we cannot infer them.
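To sketch why the staff box suffices, here is a hypothetical reconstruction of a note's staff step from the two predicted box classes. The box format and the rounding are assumptions for illustration, not the parser mentioned later:
def staff_step(note_box, staff_box) -> int:
    # boxes are (x_min, y_min, x_max, y_max); the staff box spans the five staff lines
    note_center_y = (note_box[1] + note_box[3]) / 2
    line_spacing = (staff_box[3] - staff_box[1]) / 4  # five lines enclose four gaps
    half_step = line_spacing / 2                      # note heads sit on lines and in gaps
    return round((staff_box[3] - note_center_y) / half_step)  # steps above the bottom line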

Convergence and predictions for the more complex DummyOMRv02 are shown below. The convergence is quick, although the loss never reaches zero (mainly due to the augmentations used).
Convergence of the training for the bounding box regression.
ground truth (left) - prediction (right)
The first and third images are predicted perfectly. For the second image, with many more note heads, two bounding boxes are drawn in blue. I plotted boxes in blue if the confidence was low (in this case just 20%), while most of the other notes are predicted with a confidence of more than 90%.

Investigation: Transfer Learning with Bbox Regression

Although I expected the network to perform well in transfer learning, for example using a network trained on just two note heads to predict scores with three or more note heads, it actually did not. Data augmentation as well as variation in the notes proved to be very important for solving the bounding box regression.

In the following we will be using three different models:
  1. Trained on DummyOMRv01
  2. Trained on DummyOMRv01 + Augmentations
  3. Trained on DummyOMRv02 + Augmentations
In the first example, all three models predict samples of the DummyOMRv01 dataset. All of them perform well, just as expected.
Predictions on DummyOMRv01 by the three models: v01, v01 + Aug, v02 + Aug.
In the next example, the three models predict samples of a dataset similar to DummyOMRv01, with the only change that it contains three instead of two notes. Here, only the model trained on the more diverse DummyOMRv02 is able to make good predictions; the first two models perform equally badly.
Predictions on the three-note variant by the three models: v01, v01 + Aug, v02 + Aug.
In this example, we are testing against the DummyOMRv02 dataset. As expected, the last model performs well. Although models one and two both do not perform well on these samples, it is interesting to see that v01 + Aug performs significantly better than v01 without augmentation. The network therefore did learn the concept of note heads (at least in some way, and at this specific resolution).
Predictions on DummyOMRv02 by the three models: v01, v01 + Aug, v02 + Aug.
For the last example, we used something that is even further away from the training distribution. Here, all networks fail to detect anything meaningful.
Predictions on out-of-distribution samples by the three models: v01, v01 + Aug, v02 + Aug.

Summary

Although I initially thought a U-Net with semantic segmentation made the most sense due to the flexibility of input size = output size, I never tried to use the results of that network to get back to the note representation. The other end of the spectrum, i.e. predicting a vector representing the data directly, ultimately didn't achieve the expected performance even on a dummy dataset and was therefore not explored further. The most promising results were achieved with the bounding box regression, for which a small, simplified parser from bounding boxes back to the score could be written in around 100 lines of code. It has shown to be a good balance between computer-interpretability, performance, and robustness when applied to unseen data. However, there is no magic: more complex or complete scores could not be detected so far.

Next Steps

It may be interesting to investigate whether the output of the semantic segmentation can be used as input to the bounding box regression, and whether it is possible to predict the note heads with an even higher accuracy.

More modern techniques such as vision transformers or attention may be used. In addition, it may be possible to apply seq2seq approaches, as reading music notes is similar to reading text.

Final Words

Following Act 1 to Act 4, I have shown the implementation and usage of the notensatz library to render music. Thanks to the file-type-agnostic rendering design, it is possible to create various formats that come in handy when building datasets for machine learning tasks. In combination with the typestate builder pattern and a RandomConfig struct, creating such datasets is incredibly easy, user-friendly, and fast.

Alongside other use cases for this library, it may serve as an educational toolbox to teach and explore machine learning topics, and I heavily promote the field of optical music recognition for educational purposes. Thanks for reading!
