Using a neural network to determine a cat's emotional state by its meow.
Cosoi, Alexandru Catalin; Vlad, Madalin Stefan; Sgarciu, Valentin, et al.
Abstract: Early on, cats notice that meowing brings attention, contact, food and play from their human companions. Some behaviorists suggest that certain cats seem to alter their meows to suit different purposes, and that some people can differentiate between, say, the "I'm Hungry!" meow and the "Let Me Out!" meow (Schultz, 2006). Using a neural network and some sound processing techniques, these emotional states can be determined with acceptable results, creating a communication bridge between the two species.
Key words: cat, meow, emotional state, neural networks
1. INTRODUCTION
After purring, the second most common vocalization is the meow. Rarely heard between cats, this vocalization seems tailor-made for communication between cats and humans. The meow is the most often used of the vowel patterns: vocalizations produced with the mouth first open and then gradually closing (Schultz, 2006).
* The sound cats make when highly aroused by the sight of prey is called chirping.
* When a cat is frustrated (such as when an indoor cat finds he is unable to get to the birds at the feeder), you may hear him chatter.
* When a neonate kitten is cold, isolated from his mother or trapped, he issues a distress call, also sometimes called an anger wail. As the kitten matures, the distress call is used when play is too rough or the cat finds something else to protest.
Although the task may seem hard and complex, designing a system that can differentiate between a cat's emotional states is feasible, and plain neural network theory offers a suitable solution.
Instead of using a supervised learning algorithm, we let the neural network itself decide to which emotional state a given meow belongs. The idea was to use a self-organizing neural network that behaves like a clustering algorithm, splitting the sounds into categories, while a human operator names each category after the emotional state observed in real life. We decided to use adaptive resonance theory for the initial grouping.
Adaptive Resonance Theory (ART) was developed by Carpenter and Grossberg over the period of 1976-1986, during their studies of the behavior of models of systems of neurons. The ART architectures use a combination of feedback, higher-level control, and nonlinearities to form regions in state space that correspond to concepts, based on the statistical structure of the input. The earlier ART models developed grandmother cells at an output layer. New patterns to be classified were judged either as new or as new examples of old concepts. If they were sufficiently far away from previously classified patterns in the state space, new grandmother cells were formed. On the other hand, if they were within a certain distance from a previously classified pattern, they would be interpreted as prototypes of known concepts. As such, ART models employ a combination of bottom-up (input to output) competitive learning with top-down (output to input) learning (Krafft, 2002).
As can be seen in Fig. 1, the main components are the attentional subsystem and the orienting subsystem. The attentional subsystem consists, among others, of two fields of neurons, F1 and F2, where each field may consist of several layers of neurons. These fields are connected with feedforward and feedback connection weights. The connection weights form the long-term memory (LTM) components of the system and multiply the signals along these pathways. The concept of short-term memory (STM) is associated with the pattern of activity that develops on a field as an input pattern is processed. The orienting subsystem is necessary to stabilize the processing of STM and the learning in LTM. As the figure shows, the F1 field receives input from up to three sources: the bottom-up input, the top-down input from F2, and the gain control signal. To avoid the possibility that mere feedback from F2 generates spontaneous activity at F1, i.e., to prevent the system from hallucinating, system dynamics are limited so that at least two of the three inputs must be active to generate activity at the F1 field. This is called the 2/3 rule in ART. The same rule applies to the three possible input sources of the F2 level.
[FIGURE 1 OMITTED]
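For intuition, the following minimal Python sketch implements the ART-1 variant of this scheme for binary inputs (the continuous-valued spectra of Section 2 would strictly require ART-2); the vigilance and choice parameter values are illustrative assumptions, as the paper reports no parameters:

    import numpy as np

    class ART1:
        # Minimal ART-1 clusterer for binary feature vectors: a sketch of
        # the bottom-up/top-down matching and vigilance test described
        # above. Parameter values are assumptions, not from the paper.
        def __init__(self, vigilance=0.8, beta=0.5):
            self.rho = vigilance   # vigilance: degree of match required for resonance
            self.beta = beta       # choice parameter for the F2 competition
            self.prototypes = []   # top-down LTM templates, one per category

        def present(self, pattern):
            pattern = np.asarray(pattern, dtype=bool)
            # F2 competition: rank existing categories by the choice function
            scores = [np.logical_and(pattern, w).sum() / (self.beta + w.sum())
                      for w in self.prototypes]
            for j in sorted(range(len(scores)), key=lambda j: -scores[j]):
                w = self.prototypes[j]
                # vigilance test; on failure the orienting subsystem resets F2
                if np.logical_and(pattern, w).sum() / max(pattern.sum(), 1) >= self.rho:
                    # resonance: fast learning moves the template to the intersection
                    self.prototypes[j] = np.logical_and(pattern, w)
                    return j
            # no category resonates: recruit a new "grandmother cell"
            self.prototypes.append(pattern.copy())
            return len(self.prototypes) - 1

Raising the vigilance forces finer-grained categories, which offers one reading of why Section 3 reports 56 categories for only six emotional states.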
2. PROPOSED METHOD
The main issue in this research project was the data extraction subsystem. The best method we found to address this problem was to apply an FFT over a predetermined number of points.
Our data (wav files with different meowing samples) had an average length of 18000 samples at a sample rate of 11025 Hz. Considering these high values, and based on our intuition, we judged that a feature vector of 1000 values would be appropriate for this task. To obtain it, we used a 1000-point FFT.
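As an illustration, such a feature extractor can be written in a few lines of Python, taking the paper's "1000-point FFT" literally (numpy crops or zero-pads the signal to n_points); the mono mixdown and amplitude normalization steps are our assumptions, not details given in the paper:

    import numpy as np
    from scipy.io import wavfile

    def meow_features(path, n_points=1000):
        # Reduce a meow recording (~18000 samples at 11025 Hz) to a
        # fixed-length spectral vector via an n_points-point FFT.
        rate, signal = wavfile.read(path)
        if signal.ndim > 1:            # mix stereo down to mono (assumption)
            signal = signal.mean(axis=1)
        signal = signal.astype(float)
        peak = np.abs(signal).max()
        if peak > 0:                   # amplitude-normalize (assumption)
            signal = signal / peak
        spectrum = np.fft.fft(signal, n=n_points)  # crops or zero-pads to n_points
        return np.abs(spectrum)        # 1000 magnitude values as features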
It can be seen in Fig. 2 and Fig. 3 that the signal representations for two different emotional states are quite different, and our data suggests that a neural network will not only be able to differentiate between them, but may also detect more subtle differences.
Our experiment used samples of the following emotional states:
* Anger
* Hunger
* Purring
* Frustration
* Sleepiness
* Joy
Since none of us has a cat, our two primary sources of wav files containing different types of meowing were the internet and friends with a lot of patience. We obtained almost 500 distinct samples and, by inserting small amounts of noise, multiplied the corpus to 5000 samples. From these we extracted 1000 samples, making sure that, for a conclusive test, the test set contained at least one original sample together with all of its noise-modified variants, leaving a training corpus of 4000 meow samples and a test corpus of 1000 meow samples.
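This corpus expansion (roughly 500 originals to 5000 samples) amounts to noise-based augmentation; a minimal sketch follows, assuming additive Gaussian noise at a small fraction of peak amplitude, since the paper does not specify the noise model:

    import numpy as np

    def augment(samples, copies=9, noise_level=0.02, seed=0):
        # Each original yields itself plus `copies` noisy variants:
        # 500 originals x (1 + 9) = 5000 samples, matching Section 2.
        rng = np.random.default_rng(seed)
        corpus = []
        for s in samples:
            corpus.append(s)
            for _ in range(copies):
                noise = rng.normal(0.0, noise_level * np.abs(s).max(), size=s.shape)
                corpus.append(s + noise)
        return corpus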
[FIGURE 2 OMITTED]
[FIGURE 3 OMITTED]
3. RESULTS
After the training phase, the ART module returned a total of 56 distinct categories. Although this is higher than the actual number of emotional states, the surplus apparently arose from the noise inserted to create additional training samples. The neural network successfully identified the differences between the emotional states, regardless of the subject's age or breed. The only factor that created discrepancies was sound quality.
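Mapping the 56 machine-made categories back onto the six emotional states is the human operator's task described in Section 1; programmatically it reduces to something like the following sketch (purely illustrative, since the paper describes manual labeling):

    from collections import Counter

    def name_clusters(cluster_ids, observed_states):
        # Name each ART category after the emotional state most often
        # observed for its members, standing in for the human operator.
        votes = {}
        for c, state in zip(cluster_ids, observed_states):
            votes.setdefault(c, Counter())[state] += 1
        return {c: v.most_common(1)[0][0] for c, v in votes.items()}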
Although purring and meowing for food were always clearly different, we sometimes observed similarities between meowing for attention and meowing for food.
4. CONCLUSIONS
Our research shows that a system using neural networks to differentiate between the sounds corresponding to a cat's different emotional states can achieve acceptable results.
We noticed that the age, sex or breed of the subjects does not interfere with the outcome; what matters is sound quality. The only misclassifications we obtained were due to a large amount of noise.
We would also like to experiment with Kohonen's self-organizing feature maps (SOFM), in order to see whether using another type of neural network would considerably influence the results.
As future development, we intend to continue the experiment on a larger scale, using a library of sounds ten times the size of the one used in this project. At the same time, we would like to study the implications of this experiment for humans, and see whether the theory can also be applied to small babies.
5. REFERENCES
Schultz, J. (2006). Meow more than ever. Available from: www.adoptatscat.org/documents/Newsletters/2006%20Winter%20Newsletter.pdf Accessed: 2007-05-16
Krafft, M. (2002). Adaptive Resonance Theory. Available from: http://www.ifi.unizh.ch/~krafft/papers/2001/wayfinding/html/node97.html Accessed: 2007-01-16
Carpenter, G. & Grossberg, S. (1991). Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. In: Pattern Recognition by Self-Organizing Neural Networks, Carpenter, G. & Grossberg, S. (Eds.), pp. 21-144, MIT Press, ISBN 0-262-03176-0, Cambridge, Massachusetts
Campbell, P. J. (1997). Speaker Recognition: A Tutorial. Available from: www.cs.bgu.ac.il/~orlovm/storage/speaker-recognition.pdf Accessed: 2007-01-16