Musical notes recognition using artificial neural networks.
Moise, Adrian; Constantin, Adrian; Bucur, Gabriela et al.
1. INTRODUCTION
Research on artificial neural networks has alternated between periods
of intense activity and periods of disappointing results. The first
decade of the 21st century appears to be a period in which research
focuses more on practical applications in very diverse areas. From
Hermann von Helmholtz in 1869 and Pavlov (Pavlov, 1927), who developed
theories about learning, through Hebb (Hebb, 1961), who formulated
the principle of synaptic plasticity, to Kohonen (Kohonen, 1995) and
Hopfield (Hopfield, 1982), who developed new structures and training
methods, each period has produced practical applications of artificial
neural networks. In the last decade, the main domains in which
artificial neural networks have proved their utility and efficiency are
function approximation, data classification, pattern recognition, shape
recognition, voice identification, industrial process control, robotics,
and financial prediction. In (Yadid-Pecht et al., 1996) musical notes
are recognized using a modified Neocognitron model, while the method
described here uses feed-forward neural networks. This paper belongs to
the area of pattern recognition and automatic image-to-sound
conversion. The scales are relatively simple and the notes taken into
consideration are whole, half, quarter and eighth notes.
2. NOTE FEATURES
The authors suggest the following phases to solve the proposed
problem: acquiring an image using a web camera; identifying the stave
lines of the current scale and deleting them from the image; identifying
the properties of each note and erasing the current note; identifying
the notes by using the procedural method; exporting the characteristics
to the neural network; training the net using the training set; testing
the network; displaying the notes and playing the scale. Finding the
characteristics is the main processing step; its objective is to
identify the properties of each musical note. It includes image
pre-processing and extraction of the note properties. This step is
critical because it has no redundancy: if the extraction of the note
properties fails, the whole program is affected.
Input data. The program accepts as input an image that is processed
to obtain the characteristics. The image may contain noise pixels, the
stems and flags may not be "standard", the stave lines may not be
exactly parallel, and the notes are green. The image must satisfy the
following requirements: it contains a five-line stave, there are no
overlapping notes (no two or three voices), and the distance between
two consecutive notes is at least the width of one note (this value can
be set in the program).
Image conversion to binary (bitmap). The first step after acquiring
the image is to obtain its black and white version. The conversion is
made as follows: read each pixel of the image, row by row and column by
column, and identify its color levels (red - R, green - G, blue - B).
Then, convert the color to grayscale using the formula for the
effective luminance of a pixel (Moise, 2005):
Y = 0.3 × R + 0.59 × G + 0.11 × B. (1)
Then, the grayscale image is converted to binary using a threshold
procedure. After this step, the scale is converted to black and white.
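The conversion can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name to_binary and the threshold value 128 are not given in the paper, and black pixels are coded as 1 because later sections refer to "a 1 (black) pixel".

```python
import numpy as np

def to_binary(rgb, threshold=128):
    """Convert an H x W x 3 RGB array (values 0..255) to a binary image.

    Returns 1 for black (ink) pixels and 0 for white pixels.
    The threshold value is an assumption, not taken from the paper.
    """
    rgb = np.asarray(rgb, dtype=float)
    # Effective luminance of each pixel, Eq. (1)
    y = 0.3 * rgb[..., 0] + 0.59 * rgb[..., 1] + 0.11 * rgb[..., 2]
    # Pixels darker than the threshold become 1 (black)
    return (y < threshold).astype(np.uint8)
```

Note that with these weights a pure green pixel of level 200 has luminance 118 and is therefore treated as ink under the assumed threshold.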
Noise rejection. Captured images contain noise pixels or groups of
noise pixels caused by uneven lighting of the scale or by incorrect
conversion. The noise rejection function eliminates every black pixel
whose 4-connected and 8-connected neighbors are all white, that is,
every isolated black pixel.
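A minimal sketch of such a filter is given below. It reads the rule as "remove a black pixel when all eight of its neighbors are white", which is one interpretation of the description; the function name is hypothetical and pixels outside the image are assumed to be white.

```python
import numpy as np

def remove_isolated_pixels(img):
    """Clear black pixels (1) whose eight neighbours are all white (0)."""
    img = np.asarray(img, dtype=np.uint8)
    h, w = img.shape
    # Pad with white so that border pixels also have eight neighbours
    padded = np.pad(img, 1, mode="constant", constant_values=0)
    # Number of black pixels among the eight neighbours of every pixel
    neighbours = sum(
        padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w].astype(int)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    cleaned = img.copy()
    cleaned[(img == 1) & (neighbours == 0)] = 0
    return cleaned
```

This removes single noise pixels only; groups of noise pixels would need a further step, such as discarding small connected components.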
Stave lines. After noise rejection, the next step is to enclose the
image in an area bordered by two vertical lines, so that all stave
lines have the same length after the procedure is applied. The authors
call this process identifying the start and stop points. The algorithm
is the following: read the image columns upwards, starting from the
lower left corner. The first column in which five transitions from 1 to
0 and five transitions from 0 to 1 are found is taken as the left-hand
border line. Similarly, to get the right-hand border line, the columns
are read starting from the lower right corner. After applying these
procedures, the original image is bordered by the two lines just found,
as in Fig. 1.
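The search for the two border columns might look as follows. This is a sketch, not the authors' code: the function names are hypothetical, "five transitions" is interpreted as exactly five in each direction, and the image is assumed to have white margins above and below the stave so that every stave line produces one transition of each kind.

```python
import numpy as np

def count_transitions(column):
    """Count 0->1 and 1->0 transitions when a column is read bottom to top."""
    col = np.asarray(column, dtype=int)[::-1]  # bottom pixel first
    diff = np.diff(col)
    return int((diff == 1).sum()), int((diff == -1).sum())

def stave_bounds(img, n_lines=5):
    """Return (start, stop) column indices of the stave, or None."""
    img = np.asarray(img)
    hits = []
    for x in range(img.shape[1]):
        rising, falling = count_transitions(img[:, x])
        # A column crossing only the stave lines shows n_lines of each
        if rising == n_lines and falling == n_lines:
            hits.append(x)
    if not hits:
        return None
    # First hit scanning from the left, first hit scanning from the right
    return hits[0], hits[-1]
```

The image can then be cropped to the columns between the two returned indices.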
Identifying characteristics. To identify a note, one must find the
characteristics that define it. The first is the stave line to which
the note is related. Two kinds of relationship can exist between a note
and a line: the note is on line n or under line n. The note head gives
the second characteristic: full head or empty head. The flag gives the
third characteristic: note with or without flag.
[FIGURE 1 OMITTED]
In order to find the line interacting with the note, one should
identify the following points: the left end, the bottom end, the right
end, and the upper end of the note. To find the left end of a note, the
image is read column by column, starting from the lower left corner,
until a 1 (black) pixel is found. Because the left end of the note may
consist of several pixels (a segment), the pixel in the middle of the
segment is kept. To find the bottom end of a note, the image is read
from the left end towards the right, row by row. The reading ends
when the y coordinate of a pixel is smaller than the previous y
coordinate. The relative center of the note is then found as follows:
given the left end (x_s, y_s) and the bottom end (x_j, y_j), the center
of the note has the coordinates (x_j, y_s). The center shows whether
the note head is full or empty: if the central pixel has the value 1,
the note has a full head; otherwise the head is empty. The upper
end and the right end of the note can be found by reading the image
starting from the center of the note in a vertical direction (for the
upper end) and in a horizontal direction (for the right end).
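The following sketch shows one way to locate the left end, the bottom end and the center of a note head, and to test whether the head is full. It assumes that the stave lines have already been deleted and that the image contains a single note head. The function names are hypothetical, and the bottom end is found by following the lowest black pixel of each column until the contour starts to rise, which is only one possible reading of the description. Row indices grow downwards, so "rising" means a decreasing row index.

```python
import numpy as np

def left_end(img):
    """(x_s, y_s): middle of the black segment in the first black column."""
    cols = np.flatnonzero(img.any(axis=0))
    if cols.size == 0:
        return None
    x_s = int(cols[0])
    rows = np.flatnonzero(img[:, x_s])
    return x_s, int(rows[len(rows) // 2])

def bottom_end(img, x_s):
    """(x_j, y_j): lowest point reached before the lower contour rises."""
    best_x, best_y = x_s, int(np.flatnonzero(img[:, x_s]).max())
    for x in range(x_s + 1, img.shape[1]):
        rows = np.flatnonzero(img[:, x])
        if rows.size == 0:
            break
        y = int(rows.max())
        if y > best_y:        # contour still going down
            best_x, best_y = x, y
        elif y < best_y:      # contour starts to rise: stop reading
            break
    return best_x, best_y

def note_head(img):
    """Return the centre (x_j, y_s) and True if the head is full."""
    img = np.asarray(img)
    x_s, y_s = left_end(img)
    x_j, _ = bottom_end(img, x_s)
    return (x_j, y_s), bool(img[y_s, x_j] == 1)
```

For a filled disk the central pixel is black and the head is reported as full; for a ring of the same size the central pixel is white and the head is reported as empty.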
The flag and the stem. To find the length of a note (stem and flag),
the image is read according to the representation in Fig. 2. During
this reading, positive edges (changes from white to black) and
negative edges (changes from black to white) are found. If, at the end
of the reading, the maximum number of positive edges followed by
negative edges equals 2, the note has a flag; if it equals 1, the note
has only a stem.
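Since Fig. 2 is not available, the exact reading path is unknown. The sketch below assumes that the region above the note head is read row by row, from left to right, and that each white-black-white pulse on a row corresponds to one stroke (stem or flag). The function names and the scan direction are assumptions.

```python
import numpy as np

def max_pulses_per_row(region):
    """Largest number of white->black->white pulses found on any row."""
    region = np.asarray(region, dtype=int)
    if region.size == 0:
        return 0
    # Pad with white so every positive edge is followed by a negative edge
    padded = np.pad(region, ((0, 0), (1, 1)), mode="constant")
    diff = np.diff(padded, axis=1)
    positive = (diff == 1).sum(axis=1)  # white -> black edges per row
    return int(positive.max())

def stem_and_flag(region):
    """Return (has_stem, has_flag) from the maximum pulse count."""
    pulses = max_pulses_per_row(region)
    return pulses >= 1, pulses == 2
```

A region containing a single vertical stroke gives (True, False); a region in which some row crosses both the stem and a detached part of the flag gives (True, True).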
The characteristics are converted into binary as follows. The
first 7 bits represent the number of the line that interacts with the
note. For example, if the note interacts with line 3, the binary
number is 0001000. Bit 8 represents the note position relative to the
interacting line: it is 1 if the note is on the line and 0 if the note
is under the line. Bit 9 represents the head of the note: it is 1 if
the note has a full head and 0 if it has an empty head. Bit 10
represents the stem: it is 1 if the note has a stem and 0 if it has
not. Bit 11 represents the flag: it is 1 if the note has a flag and 0
if it has not. If the note in the example above is on the line and has
a full head, a stem and a flag, its characteristics are 00010001111.
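The encoding can be written as a short function. The name encode_note is hypothetical, and the line index is taken as zero-based so that line 3 yields 0001000 as in the example of the text; this numbering is an assumption, since the paper does not state where the count starts.

```python
def encode_note(line, on_line, full_head, has_stem, has_flag, n_lines=7):
    """Return the 11-bit feature string described in the text.

    ASSUMPTION: `line` is zero-based, so that line 3 gives 0001000.
    """
    if not 0 <= line < n_lines:
        raise ValueError("line index out of range")
    line_bits = ["0"] * n_lines
    line_bits[line] = "1"
    # Bits 8..11: position, head, stem, flag
    flags = [on_line, full_head, has_stem, has_flag]
    return "".join(line_bits) + "".join("1" if f else "0" for f in flags)
```

For instance, encode_note(3, True, True, True, True) reproduces the string 00010001111 given in the text.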
3. NETWORK DESIGN
The 11 features described above are the inputs of a fully connected
neural network with 11 neurons in the input layer, a hidden layer of
100 neurons, and 2 neurons in the output layer. The activation function
of the output-layer neurons is linear. The back-propagation algorithm
(Chauvin & Rumelhart, 1995) was used to train the network; a fragment
of the training set is shown in Tab. 1. For example, 1000000 represents
the note DO. In the table, Le is the note length, L1, ..., L7 are the
lines of the stave, Pos is the position of the note relative to the
line, Full means full head, and Stem and Flag indicate the existence of
a stem and a flag.
Output t1 indicates the note number and output t2 gives the note
length (a real number: 1 for a whole note, 0.5 for a half note, 0.25
for a quarter note or 0.125 for an eighth note). The network was
trained with different activation functions.
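A network of this shape (11 inputs, 100 hidden neurons, 2 linear outputs) can be sketched with NumPy as below. This is an illustration, not the authors' implementation: the hidden activation is fixed to the hyperbolic tangent, training uses plain batch gradient descent rather than any particular toolbox algorithm, and the weight initialisation, learning rate and function names are assumptions.

```python
import numpy as np

def init_net(rng, n_in=11, n_hidden=100, n_out=2):
    """Small random weights; the initialisation scheme is an assumption."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(net, x):
    """Hidden layer with tanh, output layer with linear activation."""
    h = np.tanh(x @ net["W1"] + net["b1"])
    y = h @ net["W2"] + net["b2"]
    return h, y

def train_step(net, x, t, lr=0.005):
    """One batch gradient-descent step; returns the loss before the update.

    The loss is half the squared error summed over the two outputs and
    averaged over the samples.
    """
    h, y = forward(net, x)
    err = y - t
    n = x.shape[0]
    # Back-propagate through the linear output and the tanh hidden layer
    grad_W2 = h.T @ err / n
    grad_b2 = err.mean(axis=0)
    delta_h = (err @ net["W2"].T) * (1.0 - h ** 2)
    grad_W1 = x.T @ delta_h / n
    grad_b1 = delta_h.mean(axis=0)
    net["W1"] -= lr * grad_W1
    net["b1"] -= lr * grad_b1
    net["W2"] -= lr * grad_W2
    net["b2"] -= lr * grad_b2
    return 0.5 * float((err ** 2).sum(axis=1).mean())
```

Each row of x is one 11-bit feature vector and each row of t holds the targets (t1, t2); calling train_step repeatedly on the training set plays the role of the training phase.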
[FIGURE 2 OMITTED]
Twelve different activation functions were tried for the hidden-layer
neurons, and training completed with only six of them: sigmoid, radial
basis, hyperbolic tangent, triangular basis, linear saturation, and
symmetric linear saturation. The triangular basis was the fastest and
the sigmoid was the slowest. After comparing the results for the
different activation functions, different training algorithms were
considered. The fastest algorithm was traingdx (gradient descent with
momentum and variable learning rate) and the slowest was gradient
descent with variable learning rate.
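For reference, the six activation functions can be written as follows. The definitions are the usual textbook ones (they match the functions of the same names in the MATLAB neural network toolbox, which the algorithm name traingdx suggests, although the paper does not state which software was used). The dictionary keys are hypothetical names chosen for this sketch.

```python
import numpy as np

ACTIVATIONS = {
    # 1 / (1 + e^-n), output in (0, 1)
    "sigmoid": lambda n: 1.0 / (1.0 + np.exp(-n)),
    # e^(-n^2), maximum 1 at n = 0
    "radial_basis": lambda n: np.exp(-n ** 2),
    # tanh(n), output in (-1, 1)
    "hyperbolic_tangent": np.tanh,
    # 1 - |n| on [-1, 1], zero elsewhere
    "triangular_basis": lambda n: np.maximum(0.0, 1.0 - np.abs(n)),
    # n clipped to [0, 1]
    "linear_saturation": lambda n: np.clip(n, 0.0, 1.0),
    # n clipped to [-1, 1]
    "symmetric_linear_saturation": lambda n: np.clip(n, -1.0, 1.0),
}
```

Any of these functions can replace the hidden-layer activation of a feed-forward network, provided its derivative is used consistently in the back-propagation step.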
Playing a scale. The method described above was used to play two
scales, with 4 and 3 notes. The results obtained after training the
network are shown in Tab. 2. The outputs differ from their target
values by errors on the order of 0.01 (at most about 0.02), so each
output rounds unambiguously to a note number and a standard note length.
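Turning the two real-valued outputs into a note that can be displayed and played amounts to rounding, as in the sketch below. The function name is hypothetical. The note names for 2, 3 and 5 (re, mi, sol) follow Tab. 2 and DO is the first note according to the text; the remaining names (fa, la, si) are filled in from the standard solfege order and are an assumption.

```python
# Note numbers to names; 4, 6 and 7 are assumed from the solfege order
NOTE_NAMES = {1: "do", 2: "re", 3: "mi", 4: "fa", 5: "sol", 6: "la", 7: "si"}
LENGTHS = {1.0: "whole", 0.5: "half", 0.25: "quarter", 0.125: "eighth"}

def decode_output(t1, t2):
    """Map the outputs (t1, t2) to a note name and a note length."""
    number = int(round(t1))
    # Pick the standard length closest to the network output
    length = min(LENGTHS, key=lambda v: abs(v - t2))
    return NOTE_NAMES.get(number, "unknown"), LENGTHS[length]
```

Applied to the first row of Tab. 2, decode_output(4.99868, 0.119873) gives ("sol", "eighth").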
4. CONCLUSIONS
The authors developed an application for musical notes recognition
and playing scales directly from a video image. The main contributions
of the paper are the algorithms developed to find the 11 features of the
notes. These features have been used as inputs for a feed forward net.
One problem with this method is the capture of the scale and its
binary representation in computer memory: even if the scale is ideal,
meaning that the stave lines are parallel, its representation in memory
will not be identical to it. This work can be continued by considering
other artificial neural structures that could be better suited to
recognizing and playing more difficult scales with many more styles of
notes.
5. REFERENCES
Chauvin, Y., Rumelhart, D. E. (1995). Backpropagation: Theory,
Architecture, and Applications, Lawrence Erlbaum, ISBN 0-8058-1259-8,
Hillsdale, New Jersey, U.S.A.
Hebb, D. O. (1961). Distinctive features of learning in the higher
animal, Oxford University Press, London, England
Hopfield, J.J. (1982). Neural networks and physical systems with
emergent collective computational abilities, Proceedings of the National
Academy of Sciences of the U.S.A., vol. 79 no. 8, pp. 2554-2558, U.S.A.,
April 1982.
Kohonen, T. (1995). Self-Organizing Maps, Springer, Vol. 30, ISBN
3-540-67921-9, Berlin, Germany
Moise, A. (2005). Neural networks for pattern recognition, MatrixRom
Ed, ISBN 973-685-904-5, Bucharest, Romania
Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the
Physiological Activity of the Cerebral Cortex, Translated and Edited by
G. V. Anrep, Oxford University Press, London, England
Yadid-Pecht, O., Gerner, M., Dvir, L., Brutman, E., & Shimony,
U. (1996). Recognition of handwritten musical notes by a modified
Neocognitron. Machine Vision and Applications, vol. 9, no. 2, pp. 65-72,
ISSN 0932-8092, Springer, Germany
Tab. 1. Identifying the length of a note
Le     L1   ...   L7   Pos   Full   Stem   Flag   t1   t2
full   1    0     0    1     0      0      0      1    1
Tab. 2. The network output after training
Note no.   Note value      Length
Note 1     4.99868 (sol)   0.119873 (eighth)
Note 2     1.99765 (re)    0.131964 (eighth)
Note 3     2.00323 (re)    0.229565 (quarter)
Note 4     4.98409 (sol)   0.253838 (quarter)
Note 5     2.99876 (mi)    0.122752 (eighth)
Note 6     2.99876 (mi)    0.122752 (eighth)
Note 7     2.99876 (mi)    0.494526 (half)