
Article information

  • Title: Challenges of natural language communication with machines.
  • Authors: Delic, V.; Secujski, M.; Jakovljevic, N.
  • Journal: DAAAM International Scientific Book
  • Print ISSN: 1726-9687
  • Year: 2013
  • Issue: January
  • Language: English
  • Publisher: DAAAM International Vienna
  • Keywords: Computational linguistics; Language processing; Machine learning; Natural language interfaces; Natural language processing; SMS (Short messaging service)

Challenges of natural language communication with machines.


Delic, V.; Secujski, M.; Jakovljevic, N. et al.


1. Introduction

As the most common way of communication between humans, speech has been considered a convenient medium for human-machine interaction for a long time. Firstly, speech is a natural interface and humans are already fluent in it. Furthermore, while communicating with machines using speech, humans are free to simultaneously perform other tasks (Schafer, 1994). It has even been suggested that the very invention of speech by humans was not related principally to their desire to express their thoughts (for that might have been done quite satisfactorily using bodily gesture), but rather to their desire to "talk with their hands full" (Paget, 1930). Throughout history, humans have continued to use the same communication interface not only between themselves, but also to address animals, which have been the principal technological aid to mankind for a long time. It is therefore quite natural that, since animal power was replaced with machines, humans have been interested in the development of technological means to extend the speech communication interface to machines as well.

The design of a machine which mimics human communication capabilities in terms of understanding spoken utterances and responding to them properly has been recognized as a scientific problem for centuries (Juang & Rabiner, 2005). Since the first system for "speech analysis and synthesis" proposed in the 1930s (Dudley, 1939; Dudley et al., 1939), there has been tremendous progress in the field. Systems aiming to understand speech utterances have progressed from simple machines that respond to small sets of sounds to sophisticated systems able to respond properly to fluently spoken natural language, taking into account the statistics of the natural language in question as well as the variability introduced by different communication channels or speaker characteristics. On the other hand, systems producing human speech have evolved from machines able to reproduce only individual speech sounds to systems able to produce sentences of natural language virtually indistinguishable from those produced by a human speaker. Research in the field of speech technology has been further accelerated by the rapid advent of powerful computing devices, leading to the emergence of a range of commercially available applications based on human-machine speech interaction, including personal assistants, dictation systems, information servers as well as aids to the disabled.

The full potential of speech as a human-machine interface can be reached only in the case of natural language interfaces, which, unlike directed dialogue interfaces, allow humans to communicate in the same conversational language they would use when talking to other humans (Minker & Bennacef, 2004). However, the development of such an interface is burdened with the incorporation of a large quantity of domain knowledge into a very complex model of understanding, so that any user input can be handled successfully. Consequently, human-machine speech interaction is closely tied to the area of natural language processing (NLP), i.e. the study of computer treatment of natural human languages, including a wide variety of linguistic theories, cognitive models, and engineering approaches. With the rapid development of the Internet, large quantities of textual and speech data have become available, which, together with the technological progress in the computer industry, enables new advances in natural language processing and the development of algorithms which may not have been computationally feasible until now. Research in the field of speech technology today focuses on a number of fields, among which the following are recognized as the most important:

* Spoken language understanding (SLU), aimed at the extraction of meaning from uttered words. When related to the conversion of a spoken utterance to a string of lexical words only, it is referred to as automatic speech recognition (ASR). Within multimodal communication systems, other input modalities including touch and image can be used together with speech (Fig. 1).

* Spoken language generation (SLG), aimed at the generation of a spoken utterance from the meaning represented according to an existing semantic model (in which case it also comprises the problem of composing a sentence that would convey a particular meaning) or from a readily available string of words (in which case it is referred to as text-to-speech synthesis or TTS). Within multimodal communication systems appropriate visual output can be generated as well (e.g. a talking head), enhancing the efficiency and naturalness of communication (Fig. 1).

* Human-machine dialogue management, aimed at enabling machines to support a dialogue similar to the one between humans, and related to the implementation of semantic models and dialogue processes. A dialogue strategy should be based on speech act theory, and should take into account contextual information and the knowledge of the interaction domain.

* Recognition and production of vocal emotion, aimed at modelling the links between human emotion and the features of human speech communication related to it. Namely, emotion appears to be conveyed through changes in pitch, loudness, timbre, speech rate and timing which are largely independent of linguistic and semantic information.

[FIGURE 1 OMITTED]

It should be kept in mind that all of the aforementioned fields are heavily language dependent, in most cases to such an extent that it is necessary to develop a great deal of speech and language resources and techniques independently for each language. However, models and algorithms used to treat corresponding problems across different languages are largely the same, which leads to the conclusion that there exist both global and language-dependent challenges of enabling fluent human-machine speech communication. Recently, great scientific and technological progress has been observed with regard to both global and language-specific challenges, the latter principally owing to the fact that speech and language resources have recently begun to appear for smaller languages as well (Delic et al., 2010).

2. Challenges of automatic speech recognition

The term automatic speech recognition (ASR) refers to the automatic identification of the lexical content in a spoken utterance. Research in this field has been conducted for over 60 years, during which many different paradigms were explored. The early ASR systems were based on acoustic-phonetic theories which explain how phonemes are acoustically realized in spoken utterances. It was assumed that phonemes can be characterized by a set of acoustic properties that distinguish them from one another. It was also assumed that coarticulation effects are straightforward and can easily be learned by a machine as well. In the recognition phase the first step was the segmentation of an utterance into stable acoustic regions and the assignment of possible (acoustically closest) phonemes to each segmented region, which resulted in a phoneme lattice. The second step was the determination of a valid word from the phoneme lattice, applying linguistic constraints such as a vocabulary or syntactic or semantic rules (Juang & Rabiner, 2005).

In the 1970s the stochastic paradigm was introduced and it became the main framework for further development of ASR in the next three decades. It was assumed that speech can be considered as a code transmitted over a noisy channel, and for that reason a number of algorithms from information theory were applied with some adaptations (Jurafsky et al., 2000). The basic premise is that a hidden Markov model (HMM) state sequence can be used to describe the dynamics of the phoneme sequence in a spoken utterance and the probabilistic nature of the correspondence between linguistic codes and speech waveforms. Speaker and channel variability were modelled by Gaussian mixture models (GMM). In this model, the goal of the recognition process is to find the most probable HMM state sequence, and information theory already offered fast solutions in the Viterbi and A* decoding algorithms (Huang et al., 2001). An efficient procedure to estimate HMM parameters exists as well (Baum et al., 1970). Additionally, the stochastic framework provides an elegant way to combine an acoustic model, which contains knowledge about acoustics, phonetics, channel and environment, with a language model, i.e. a system of knowledge about word constituents, order of words in a sequence etc.
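To make the decoding step concrete, the following is a minimal sketch of Viterbi search over HMM states in the log domain, assuming the emission log-likelihoods (e.g. computed by GMMs) are already available. The array names and shapes are assumptions for illustration only, not part of any system cited here.

```python
# Minimal Viterbi decoding sketch for an HMM-based recognizer (illustrative only).
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """log_trans[i, j]: log P(state j | state i); log_emit[t, j]: log p(o_t | state j);
    log_init[j]: log P(state j at t=0). Returns the most probable state sequence."""
    T, N = log_emit.shape
    delta = np.full((T, N), -np.inf)      # best log-score of any path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)     # back-pointers to the best predecessor state
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (N, N): predecessor i -> state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Backtrack from the best final state
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta[-1].argmax())
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

In a real recognizer, log_emit would come from the acoustic model scores and log_trans would encode both the phone topology and the language model constraints mentioned above.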

There are many reasons why this statistical approach was dominant for so many years. A competitive approach was the one based on artificial neural networks (ANN), but one of the issues it faced was the temporal nature of the speech signal. This can be overcome by using recurrent neural networks (Huang et al., 2001) or hybrid models which combine HMMs and ANNs, where the ANN is used to estimate HMM state emission probabilities (Bourlard & Morgan, 1993). However, in the 1990s it was difficult to train ANNs with a number of parameters matching that of GMMs, because ANN training algorithms are based on stochastic gradient descent. Additionally, the development of discriminative training algorithms for GMMs based on the maximum mutual information criterion and minimum phone/word error compensated for the discriminative nature of ANNs (He et al., 2008). However, ANNs have found their role in robust feature extraction (since they can obtain class-discriminative features) (Hermansky et al., 2000) and in language modelling (since they are efficient in probability smoothing by word context similarity) (Bengio et al., 2003).

One of the advantages of GMM-based ASR systems is the existence of efficient adaptation techniques, which transform features and/or parameters of the acoustic models to better match a given test environment. The standard adaptation technique is maximum likelihood linear regression (MLLR), which adapts Gaussian means to the test environment using maximum likelihood and reduces the word error rate significantly (Leggetter & Woodland, 1995). Moreover, adaptation techniques in feature space, such as cepstral mean and variance normalization, vocal tract length normalization (Jakovljevic et al., 2009) and feature-space MLLR (Gales, 1998), in conjunction with MLLR, have additionally reduced the word error rate.
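As a simple illustration of feature-space compensation, the snippet below sketches per-utterance cepstral mean and variance normalization; the array layout is an assumption made for the example.

```python
# Minimal sketch of cepstral mean and variance normalization (CMVN), one of the
# feature-space techniques mentioned above. The feature layout is an assumption.
import numpy as np

def cmvn(features, eps=1e-8):
    """features: (num_frames, num_cepstra) matrix, e.g. MFCCs of one utterance.
    Removes the per-utterance mean and scales to unit variance, reducing channel mismatch."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```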

ASR systems have developed significantly, but their performance is still about three times worse than that of humans in terms of word error rate (see Fig. 2). They are also far less robust to different acoustic environments (noise, reverberation, background talk, etc.), communication channels (far-field microphones, cell phones, etc.), speaker characteristics (speaking style, accents, emotional state, etc.) and language characteristics (dialects, vocabulary, topic domain, etc.).

Nowadays the amount of training data for resource-rich languages is not a matter of concern, and it is easy to obtain more data. However, a mere increase in the quantity of training data cannot improve the performance of the system significantly. In the experiments reported in (Evermann et al., 2005), the amount of data was increased by a factor of 5 and the model complexity by a factor of 6, but the relative improvement in recognition performance was only 15%. It is our belief that the next breakthrough in ASR can be achieved by applying machine learning approaches such as deep belief networks, graphical models and sparse representation.

2.1 Deep Belief Networks

Recently presented results suggest that further improvement in ASR can be achieved by neural networks, more precisely, using deep belief networks (DBN) (Deng et al., 2013). Deep belief networks are probabilistic generative models that are composed of multiple layers of stochastic latent variables (Hinton, 2009).

[FIGURE 2 OMITTED]

They can be viewed as multi-layer perceptrons (MLP) with many hidden layers, but the training procedure assumes learning one layer at a time, treating the values of the latent variables of the lower layer as data for training the higher layer. The last step in the training procedure is fine-tuning by the back-propagation algorithm. Such a structure is beneficial since many simple non-linearities in each hidden layer can generate a complicated non-linearity which transforms the data into a space where a linear classifier is sufficient.
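The sketch below illustrates the greedy layer-wise pretraining idea with binary restricted Boltzmann machines trained by one-step contrastive divergence (CD-1). All hyper-parameters and sizes are arbitrary assumptions for the example, not values from the cited work, and the final back-propagation fine-tuning step is omitted.

```python
# Illustrative sketch of greedy layer-wise DBN pretraining with CD-1 RBMs.
# Inputs are assumed to lie in [0, 1]; hyper-parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=10, lr=0.01):
    """One-step contrastive divergence for a binary RBM; returns (W, b_vis, b_hid)."""
    num_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((num_visible, num_hidden))
    b_vis = np.zeros(num_visible)
    b_hid = np.zeros(num_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_hid)                       # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
        p_v1 = sigmoid(h0 @ W.T + b_vis)                     # reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_hid)                     # negative phase
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

def pretrain_stack(data, layer_sizes):
    """Trains one RBM per layer, feeding each layer's hidden activations to the next."""
    layers, x = [], data
    for size in layer_sizes:
        W, b_vis, b_hid = train_rbm(x, size)
        layers.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)   # latent activations become data for the next layer
    return layers                    # fine-tuning by back-propagation would follow
```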

It has been shown that a DBN, with an appropriate training strategy and structure, can result in a model which is speaker, channel and language independent (Deng et al., 2013). The adaptation of a DBN is more difficult than the adaptation of a GMM, but the word error rate of a DBN without adaptation is far lower than that of the best GMM. Future work on DBNs should include the pursuit of more effective deep architectures and learning algorithms.

2.2 Graphical Models

Graphical models describe dependencies between a set of random variables with a graph, where each node represents a random variable and each edge a statistical dependency between variables (Bilmes, 2010). They provide a formal language for systematic model design and analysis, as well as efficient learning and inference techniques. By transforming an existing model into an appropriate graphical model, one obtains tractable and proven algorithms for learning and inference. For example, restricted Boltzmann machines, which form the basis of DBNs, can be represented as Markov random fields (Deng & Li, 2013).

The standard way to obtain model-based robust ASR is to model each source of sound independently. The observed audio signal is then treated as a result of nonlinear mixing of the source models. Graphical models provide a systematic framework for discovering and representing such a structure and for exploiting it during inference. This approach has resulted in a multi-talker ASR algorithm which can separate and recognize the speech of four concurrently talking speakers using a single microphone (Rennie et al., 2009). It is based on a tractable loopy belief propagation algorithm for iteratively decoding multiple speakers. It is interesting to note that the algorithms proposed in (Rennie et al., 2009) outperformed human listeners. A similar approach can be used for speech recognition in noisy conditions as well.

2.3 Sparse Representation

The term sparse representation denotes a representation of a signal as a linear combination of a small number of elementary signals called atoms (smaller than the dimension of the signal space), drawn from a dictionary whose total number of atoms is usually much larger than the dimension of the signal space. Sparse representation is usually used for denoising, but it can also be used for classification, or for both. The main idea is that a signal can be decomposed into atoms which model speech and atoms which model noise, and that only the speech part is used in a classification task. This idea has been tested in (Gemmeke et al., 2011) and it has outperformed the GMM model. Models which are described by examples instead of parameters are called exemplar-based. These models have not been widely used in the ASR community, and thus they constitute another direction for future research. Sparse representation can also be used in combination with DBNs (Yu et al., 2012). The results show that the model size can be reduced significantly (by 70-90%), with almost no increase in word error rate.
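The toy sketch below illustrates the decomposition principle with a simple orthogonal matching pursuit over a hypothetical dictionary whose atoms are labelled either "speech" or "noise", keeping only the speech part of the reconstruction. It is not the exemplar-based method of (Gemmeke et al., 2011); the dictionary, labels and sizes are invented for the example.

```python
# Toy sparse decomposition: approximate a feature vector y with a few dictionary atoms
# and keep only the atoms labelled as speech. Dictionary and labels are hypothetical.
import numpy as np

def omp(D, y, n_atoms):
    """Orthogonal matching pursuit: pick n_atoms columns of D that best explain y."""
    residual, selected = y.copy(), []
    for _ in range(n_atoms):
        correlations = np.abs(D.T @ residual)
        correlations[selected] = -np.inf                  # do not reselect an atom
        selected.append(int(correlations.argmax()))
        coeffs, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coeffs
    return selected, coeffs

def speech_part(D, atom_is_speech, y, n_atoms=5):
    """Reconstructs y using only the selected atoms labelled as speech."""
    selected, coeffs = omp(D, y, n_atoms)
    keep = [atom_is_speech[i] for i in selected]
    return D[:, selected][:, keep] @ coeffs[keep]
```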

3. Challenges of Text-To-Speech Synthesis

The technology of text-to-speech synthesis deals with the conversion of arbitrary text into human speech in a particular language. Bridging the gap between plain text and synthesised speech with all its typical features such as intelligibility and naturalness is a complicated task, spanning multiple linguistic domains from phonetics to discourse analysis. As there is no explicit information in plain text concerning phone durations, pitch contours or energy variations, these factors have to be recovered from the text in the specific prosodic or expressive framework of a given speaker. The dependency of these factors on linguistic factors has to be properly modelled in order to attain high naturalness of synthesised speech (Dutoit, 1997; Morton & Tatham, 2005). The recovery of prosodic features from text is an exceedingly language-dependent task referred to as high-level synthesis, and it includes natural language processing of the text and its conversion into a suitable data structure describing the speech signal to be produced (referred to as the utterance plan).

The necessary steps of high-level synthesis include expanding numbers, abbreviations and other non-orthographic expressions, as well as the resolution of morphological and syntactic ambiguities. A correct resolution of ambiguities is important because any error may easily lead to errors in the prosodic features of speech, impairing its naturalness. It should be kept in mind that the naturalness of synthetic speech is not merely a question of aesthetics, because incorrect intonation can mislead listeners or force them to temporarily focus their attention on lexical segmentation (identification of individual words in the input speech stream) instead of the actual meaning of the text. The largely language-independent low-level synthesis is related to the production of the actual speech signal, whether by concatenation of pre-recorded segments of speech or as an output of a statistical model of speech, as in the case of hidden Markov model (HMM) based synthesis. While concatenation-based techniques were the approach of choice for a majority of researchers and developers until recently, the popularity of synthesis methods based on statistical models has begun to increase, owing to their flexibility (ability to switch between speakers or speaking styles), smaller computational load and memory footprint (making them a more suitable option for environments such as portable devices), as well as speech integrity (the enhanced impression that the speech comes from a single speaker).
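As a toy illustration of the first of these steps, text normalization, the snippet below expands digits and a few abbreviations. The rules and lexicon are invented for the example; a real synthesis front end is far more elaborate and heavily language dependent.

```python
# Toy text normalization for a TTS front end: expand abbreviations and spell out digits.
# The abbreviation lexicon and rules are hypothetical placeholders.
import re

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}  # toy lexicon

def expand_number(match):
    # Naive digit-by-digit spelling; real systems handle ordinals, dates, currencies, ...
    digits = "zero one two three four five six seven eight nine".split()
    return " ".join(digits[int(d)] for d in match.group())

def normalize(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", expand_number, text)

print(normalize("Dr. Smith lives at 221 Baker St."))
# -> "Doctor Smith lives at two two one Baker Street"
```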

The focus of text-to-speech synthesis has recently shifted from intelligibility to naturalness, and some issues have not yet been addressed in a satisfactory way. Namely, current state-of-the-art speech synthesizers are still unable to produce speech which would be indistinguishable from human speech, and this constitutes one of the last frontiers of speech synthesis. The sources of expressiveness in human speech are not yet well understood, and whatever they may be, they are highly variable, which further complicates the task. Furthermore, rather than adding specific prosodic or expressive content to a "neutral" acoustic rendering of the sentence, rendering the utterance plan within a specific prosodic or expressive framework, thought of as a wrapper, should be considered (Morton & Tatham, 2005).

The study of introducing expression into synthesized speech is related to the understanding that speech is generally influenced by intrinsic phenomena which are physical, but can nevertheless be deliberately interfered with (partially negated or enhanced), in order to "depart from the listener's expectations" and thus convey a particular meaning or any other feature of expressive speech. For example, subglottal air pressure progressively decreases during speech; however, this does not prevent the speaker from controlling the fundamental frequency in order to convey a particular lexical accent or to give a particular word some prosodic prominence.

For that reason, the utterance specification has to be enriched with specific prosodic markup, which will reflect the changes in the expression and initiate appropriate events in the prosodic rendering of the utterance. The prosodic markup should also account for the (supposed) reaction of the speaker to the semantic or pragmatic content of the text, expressed through continuous changes in the prosodic framework including intonation, rhythm, as well as the precision of articulation. The connection between the prosodic markup and the actual prosodic rendering of the sentence is highly non-linear, and the efforts to model this relationship still fall short of producing speech which would be indistinguishable from human, principally because the gap in our understanding of how speech is produced by human beings has been underestimated.
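One widely used concrete form of such prosodic markup is the W3C Speech Synthesis Markup Language (SSML). The hypothetical snippet below, embedded in a Python string and not the markup scheme discussed in this chapter, shows how emphasis, pauses and pitch or rate changes can be requested explicitly within an utterance specification.

```python
# Hypothetical SSML fragment illustrating explicit prosodic markup in an utterance plan.
utterance_plan = """
<speak>
  It is <emphasis level="strong">not</emphasis> merely a question of aesthetics.
  <break time="300ms"/>
  <prosody rate="slow" pitch="+10%">Incorrect intonation can mislead listeners.</prosody>
</speak>
"""
```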

4. Towards Emotional Speech Recognition and Synthesis

People take the expression and recognition of emotion for granted, but it is actually a complex process that everybody learns from the day they are born. Communication through emotions represents a huge part of everyday communication between people, and emotions are present in almost any interaction. In the near future, it will be impossible to examine any speech recognition/understanding or speech synthesis system, or to build a facial and gesture tracking system, without analyzing one of the key elements of communication--emotion.

Emotions can be expressed through voice (speech emotions), face (facial expressions), and/or body (emotional body gestures). In their study, Ekman and Friesen (Ekman & Friesen, 1977) discussed six emotions: happiness, sadness, surprise, anger, fear, and disgust, which became known as the "basic" emotions (Ekman & Friesen, 2002), used in much related research since. Any complex emotion is considered to be a mixture of several basic ones.

The field of emotion recognition has shown tremendous potential in many areas, such as the commercial use of emotion recognition in voices in call center queuing systems (Petrushin, 2000). The use of emotion recognition technology has recently been brought under the spotlight in terms of its potential to support countering terrorism with technology. Ball (Ball, 2011) discusses enhancing border security with automatic emotion recognition. Another possible use of emotion recognition is as an aid to speech understanding (Nicholson et al., 1999). The authors stress that emotion in speech understanding is traditionally treated as "noise", but that a better approach would be to subtract emotions from speech and thus improve the performance of speech understanding systems.

A number of further applications have been proposed which might benefit from emotion recognition components (Hone & Bhadal, 2004), such as intelligent tutors which change the pace or content of a computer-based tutorial based on sensing the level of interest or puzzlement of the user (Lisetti & Schiano, 2000; Picard, 1997), entertainment applications such as games or interactive movies, where the action changes based on the emotional response of the user (Nakatsu et al., 1999), help systems which detect frustration or confusion and offer appropriate user feedback (Klein et al., 2002) and so on. Recent studies (Stankovic et al., 2012) have shown that emotion recognition systems which utilize only speech or only facial expressions do not represent a realistic way of communicating and expressing emotions. In everyday life, humans use both vision and hearing to recognize emotions, and thus bimodal approaches, which utilize both vision and hearing, present a more realistic and intuitive way of detecting emotional states. It seems that people rely on both "ear" and "eye" when detecting emotions, and that for recognizing some emotions we use sound signals, while some other emotions are more "visual" (Stankovic et al., 2012). Similarly, for expressing different emotions we employ different strategies--facial or gesture expressions, or emotional speech.

However, even though there is much work on facial expression and gesture recognition, as well as on emotional speech recognition, understanding, and synthesis, it seems that emotion recognition and synthesis is still an unsolved field. The lack of a standard, universally agreed method for detecting and conveying emotions perhaps stems from the fact that recognizing and expressing emotions comes so naturally to us humans that it is particularly difficult for us to tell what distinguishes one emotion from another.

In facial expression recognition, Ekman and Friesen (Ekman & Friesen, 1977) defined the facial action coding system by closely examining facial movements. Every emotional facial expression is a combination of the movements of several facial muscles, and each basic facial movement is represented as an action unit (AU). Thus, the presence or absence of certain AUs can tell us a lot about an expressed emotion. There are certain facial movements (AUs) that are present in almost all expressions of one emotion, and absent from the expressions of all other emotions.

For example, a happy expression is almost always expressed by pulling the lip corners, i.e. a smile. However, facial expression also depends on many factors (culture, temperament, etc.), so expressions differ from one subject to another. Due to the shift of facial expression research from the recognition of acted to more spontaneous expressions, the major obstacle in the future seems to be the lack of spontaneous expression databases. Studies have proven that human behaviour becomes unnatural the moment subjects know or suspect that they are being recorded, so it is yet to be discovered what kind of approach should be used in order to capture data with real-life spontaneous expressions in different illumination and occlusion conditions.
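As a toy illustration of the AU-based reasoning described above, the sketch below maps detected AUs to candidate basic emotions using commonly cited prototype combinations (e.g. AU6, cheek raiser, plus AU12, lip corner puller, for happiness). The table is illustrative only and by no means a complete or authoritative coding scheme.

```python
# Toy FACS-style lookup: presence of prototypical AU combinations suggests basic emotions.
# The prototype sets below are illustrative, not an authoritative coding table.
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller (smile)
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # raised brows + upper lid raiser + jaw drop
}

def candidate_emotions(detected_aus):
    """Returns emotions whose prototypical AU set is contained in the detected AUs."""
    return [emotion for emotion, aus in EMOTION_PROTOTYPES.items()
            if aus <= set(detected_aus)]

print(candidate_emotions([6, 12, 25]))   # -> ['happiness']
```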

On the other hand, in emotional speech recognition and synthesis, we still lack a standard method for capturing emotions and conveying them in speech. This is probably due to the fact that, unlike facial expressions, speech and language are influenced by more factors, such as culture, language group, education etc., making this field of emotion recognition and synthesis somewhat more challenging. While the task of emotional speech recognition is to recognize a particular emotion in human speech, the task of emotional speech synthesis is to convey it through synthesized speech. One of the main goals in this field is to detect and subtract emotional "noise" from speech, making speech recognition and comprehension easier, or to synthesize a particular emotion and add it to "flat" synthesized speech.

Many papers have studied emotions in different languages, but a future challenge is to address multi-cultural and multi-lingual evaluations. In different languages, different speech characteristics (features) are of different importance for recognizing or reproducing emotions. There are several feature sets, such as mel-frequency cepstral coefficients (MFCC), that generally show good results in speech and emotion recognition. Also, some speech features show similar "behaviour" in most of the examined languages, and represent a starting point for any research. As Pantic and Rothkrantz (Pantic & Rothkrantz, 2003) presented, emotional speech can be examined in most of the Indo-European languages by monitoring features such as pitch, intensity, speech rate, speech contour, etc., while in some tonal languages (Thai and Vietnamese), speech emotion is more easily studied using MFCC, fundamental frequency, and zero-crossing rate (Stankovic et al., 2012).
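The snippet below sketches how such acoustic features (MFCC, pitch, intensity, zero-crossing rate) might be extracted with the librosa library and pooled into a fixed-length utterance descriptor for an emotion classifier. The file name, sampling rate and parameter values are placeholders, not settings from the cited studies.

```python
# Minimal sketch of acoustic feature extraction for emotion analysis using librosa.
# File name and parameter values are hypothetical placeholders.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)            # hypothetical input file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, num_frames)
zcr = librosa.feature.zero_crossing_rate(y)                 # (1, num_frames)
energy = librosa.feature.rms(y=y)                           # (1, num_frames), intensity proxy
f0 = librosa.yin(y, fmin=65, fmax=500, sr=sr)               # fundamental frequency per frame

# A simple utterance-level descriptor: statistics of each feature stream, as commonly
# pooled before feeding a classifier.
descriptor = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [zcr.mean(), energy.mean(), np.nanmean(f0), np.nanstd(f0)],
])
```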

Recently, bimodal systems have become increasingly popular, due to their stress on naturalness. These systems consider both speech emotions and facial expressions, and thus audio and video influence one another. For example, while speaking, subjects are unable to express surprise in the same way as they would using facial gestures alone. This influence of emotional speech on facial movements makes it more difficult to recognize facial expressions in bimodal systems (Stankovic et al., 2012), simply because those facial expressions are less expressive, hence less informative, and more difficult to recognize. However, they also represent more realistic expressions.

It is clear that in the near future more systems will focus on employing bimodal information (speech and vision), because it represents a more natural and realistic research environment, which is surely one of the main goals in the field of emotion recognition and in engineering in general. Unfortunately, bimodal systems still lack a standard database, so it is particularly difficult to compare such systems. However, with several new bimodal systems introduced every year, this problem will soon be overcome and these systems will be employed in studies even more.

How exactly emotional body gestures and emotions are interrelated is still an unsolved question. Most of the research is focused on detecting and synthesising emotions in speech and facial expressions, yet our bodily movements reveal an equally significant portion of the emotional state as other cues of human-human interaction. For instance, talking over the phone with a person who speaks a language that we do not understand is almost impossible. On the other hand, communicating with that same person face-to-face will, due to the lack of a common language, lead us to employ gestures much more in order to compensate for the lack of speech. As research shows, depending on the culture, people use over 200 gestures in everyday interaction, proving that this is as important a field in emotion recognition and synthesis as facial expressions and speech are.

5. Challenges of the Representation of Meaning in Dialogue Systems

The essence of a system's advanced communicative competence lies in the ability to properly interpret the user's utterance, and to adaptively manage a natural and consistent dialogue. A human-machine dialogue is natural to the extent that the system is able to address various inherently present dialogue phenomena, such as ellipses, anaphora, ungrammaticalities, meta-language, context-dependent utterances, corrections and reformulations, mixed initiative, miscommunications, uncooperative user behaviour, etc. The dialogue is consistent to the extent that the system is able to dynamically capture and represent the meaning of the dialogue, and to evaluate the user's dialogue acts with respect to it. These tasks differ significantly from machine learning tasks such as speech recognition. At the conceptual level, they inevitably require contextual analysis, which raises the important research question of modelling the contextual information and the general knowledge of the interaction domain. In other words, an approach to meaning representation in dialogue systems should be analytically tractable and have explanatory power.

This methodological requirement implies that the currently dominant practice of applying statistical methods to language corpora in order to derive data-driven rules for pattern recognition does not suffice. It is fair to say that statistical approaches are quite prevalent today in the field of language processing (Chomsky, 2011), and often dogmatically anti-representational (Wilks, 2007). This trend is a consequence of the relative successes of statistical approaches--in comparison with the early approaches based on logic and formal rules--over the last two decades in some aspects of machine learning. The dogma is reflected in the assumption that systems may be trained to manage dialogues only by means of automated analysis of large corpora (i.e., recorded conversations).

However, the state of the art in the field shows that this assumption was much too strong. Although significant scientific effort has been devoted to it and a number of sophisticated prototype systems have been delivered, the requirement for a natural and consistent machine dialogue still remains an elusive ideal. The general criticism levelled at statistical approaches is that they are epistemic devices taking into account only the external dialogue behaviour (cf. Searle, 1993), ignoring the fact that language is a biologically innate ability involving different linguistic and mnemonic structures, and cognitive processes, which cannot be derived simply from language corpora (cf. Chomsky, 2011; Chomsky, 2000). We propose that taking into account the insights from behavioural and neuroimaging studies on various aspects of the human language processing system (e.g., attention, memory, etc.) is a promising research direction for further advancement of the field. The idea that the development of intelligent machines should be based on modelling people is not new (cf. Schank, 1980), but it is only now that the results of neuroimaging studies may shape the field of human-machine interaction.

As a case in point, recent work of Gnjatovic and colleagues introduces a cognitively-inspired approach to meaning representation in dialogue systems that integrates insights from behavioural and neuroimaging studies on working memory operations and language-impaired patients. The approach is computationally appropriate in the sense that it is generalizable to different interaction domains and languages, and scalable. For a detailed argument, the reader may consult (Gnjatovic et al., 2012; Gnjatovic & Delic, 2012).

At the level of strategic challenges, it is reasonable to expect that the research question of adaptive dialogue management will have one of the central roles in the emerging fields of social and assistive robotics, and in the development of companion technologies. The robots' capacity to engage in a natural language dialogue would significantly--if not crucially--contribute to establishing long-term social relations with robots. Future prospects of the field of adaptive dialogue management include many challenging research problems. We briefly state some of them, although the list is by no means complete.

(1) Enabling dialogue systems to manage multi-party dialogues in a dynamic and rich spatial context. In general, the users and the system share two interrelated contexts during the interaction--a verbal and a spatial context. Information about the spatial context is often essential for understanding and organizing communication (Bohus & Horvitz, 2010). This implies that the system should be aware of the surrounding environment (including the relevant interlocutors) in order to manage dialogue processes. For example, the system should be able to robustly process linguistic inputs that instantiate different encoding patterns of motion events (e.g., bipartite and tripartite spatial scene partitioning, etc., cf. Gnjatovic et al., 2012, 2013) and spatial perspectives (e.g., a user-centred frame of reference, etc., cf. Gnjatovic & Delic, 2013).

(2) Investigation of the role of emotion and trust in human-machine interaction. An aspect of this broad research direction is focused on the investigation of linguistic cues for early recognition of negative dialogue developments, and on the development of dialogue strategies for preventing and handling such developments. Research on emotions is essentially supported by corpora containing samples of emotional expressions. A methodological challenge here is how to produce an appropriate, realistic emotion corpus in a laboratory setting. Reference (Gnjatovic & Rosner, 2010) proposes a substantial refinement of the Wizard-of-Oz technique in order that a scenario designed to elicit affected behaviour in human-machine interaction could result in realistic and useful data. The proposed approach integrates two lines of research: taking into account technical requirements of a prospective spoken dialogue system, and psychological understanding of the role of the subject's motivation. The evaluation of the corpus reported in (Gnjatovic & Rosner, 2010) demonstrated that it contains recordings of overtly signalled emotional reactions whose range is indicative of the kind of emotional reactions that can be expected to occur in the interaction with spoken dialogue systems. Since the subjects were not restricted by predetermined linguistic constraints on the language to use, their utterances are also indicative of the way in which non-trained, non-technical users would probably like to converse with conversational agents.

Although in the last decade we have witnessed a rapid increase of research interest in affected user behaviour, research in this domain is primarily concentrated on the detection of emotional user behaviour. Less attention is devoted to another important research question--how to enable dialogue systems to overcome problems in the interaction related to affected user behaviour. Adaptive dialogue management is a promising research direction to address the latter question. Reference (Gnjatovic & Rosner, 2008) discusses the basic functionalities of the adaptive dialogue manager, including modelling contextual information (including the emotional state of the user), keeping track of the state of the interaction, and dynamically adapting both analytical and generative aspects of the system's behaviour according to the current state of the interaction. In other words, this can be formulated as: recognizing that a problem occurred in the interaction, providing support to the user in an appropriate form--tailored to a particular problem and to the user's individual needs--and trying to advance the interaction.

(3) Enabling dialogue systems to use reinforcement learning--e.g., by analyzing the history of interaction and the profile of the user--in order to dynamically adapt their dialogue strategies for a given user in a given situation (a minimal sketch is given after this list). This is particularly important for the development of long-term collaborative conversational agents.
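The sketch below illustrates point (3) with a tabular Q-learning update for choosing among a few dialogue strategies. The state encoding, action set and reward scheme are invented for the example (e.g. a positive reward on task success, a negative one if the user abandons the dialogue); they are not taken from the cited work.

```python
# Illustrative tabular Q-learning for adaptive dialogue strategy selection.
# States, actions and rewards are hypothetical assumptions for this sketch.
import random
from collections import defaultdict

ACTIONS = ["open_question", "directed_question", "confirm", "offer_help"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1        # learning rate, discount, exploration rate
q_table = defaultdict(float)                  # (state, action) -> estimated long-term value

def choose_action(state):
    """Epsilon-greedy strategy selection; 'state' could encode user profile and history."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                          # explore
    return max(ACTIONS, key=lambda a: q_table[(state, a)])     # exploit

def update(state, action, reward, next_state):
    """Standard Q-learning update after observing the outcome of one dialogue turn."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])
```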

We take these research problems to be of great importance for increasing the level of adaptivity of human-machine dialogues. Adaptive dialogue management is of primary importance for increasing the level of naturalness of human-machine interaction and, consequently, the level of acceptance of such interfaces by users. Although considerable research effort is already to be noticed in this field, its possibilities are by no means sufficiently explored. It is to be expected that further advancements in this field will be influenced both by technological and socio-cultural trends. Currently, there is an enthusiastic and (sometimes unduly) optimistic atmosphere in the scientific community with respect to the anticipated progress. This might be partially influenced by the fact that cognitive sciences and robotics are, at the moment, among the research fields that are prioritized in fundraising activities. However, the influence is mutual--progress in the field may have a strong influence on society, culture, economy, etc. This places even greater responsibility on the researchers. One of the most difficult challenges will certainly be to address the many ethical issues raised by this technology, including military use, unethical exploitation of human social drives (Bryson, 2000), giving the users a misleading impression of the system's level of expertise (Weizenbaum, 1993), etc. A great deal of the responsibility for unethical exploitation of scientific results lies with the researchers. In the words of Joseph Weizenbaum, computer scientists do not have the right to accuse politicians of leading their countries into wars--it would not be possible without computer scientists.

6. Acknowledgements

The presented study was sponsored by the Ministry of Education and Science of the Republic of Serbia under research grants TR32035, III44008 and OI178027. The responsibility for the content of this paper lies with the authors.

7. References

Schafer, R. W. (1994) Scientific Bases of Human-Machine Communication by Voice, in: Voice Communication Between Humans and Machines, Roe, David B., Wilpon, Jay G. (Eds.), National Academy Press, Washington D.C., USA, pp. 15-33

Paget, R. (1930) Human Speech, Harcourt, New York, USA

Juang, B.H. & Rabiner, L. R. (2005) Automatic Speech Recognition--A Brief History of the Technology Development, in: Encyclopedia of Language and Linguistics, Brown, Keith (Ed.), Elsevier, Amsterdam, the Netherlands

Dudley, H. (1939) The Vocoder, Bell Labs Record, Bell Labs, NJ, USA, Vol. 17, pp. 122-126

Dudley, H, Riesz, R. R., Watkins, S. A. (1939) A Synthetic Speaker, J. Franklin Institute, Philadelphia, PA, USA, Vol. 227, pp. 739-764

Minker, W. & Bennacef, S. (2004) Speech and Human-Machine Dialog, Kluwer, Dordrecht, the Netherlands

Delic, V., Secujski, M., Jakovljevic, N., Janev, M., Obradovic, R., Pekar, D. (2010) "Speech Technologies for Serbian and Kindred South Slavic Languages", in: Advances in Speech Recognition, Noam R. Shabtai (Ed.), SCIYO, ISBN 978-953-307-097-1, pp. 141-164

Jurafsky, D., Martin, J. H., Kehler, A., Vander Linden, K. & Ward, N. (2000) Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, Prentice-Hall Inc., Englewood Cliffs, New Jersey

Huang, X., Acero, A. & Hon, H.-W. (2001) Spoken language processing: a guide to theory, algorithm, and system development, Prentice Hall PTR, New Jersey

Baum, L. E., Petrie, T., Soules, G. & Weiss, N. (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, The Annals of Mathematical Statistics, Vol. 41, pp. 164-171

Bourlard, H. & Morgan, N. (1993) Continuous speech recognition by connectionist statistical methods, IEEE Transactions on Neural Networks, Vol. 4, pp. 893-909

He, X., Deng, L. & Chou, W. (2008) Discriminative learning in sequential pattern recognition, IEEE Signal Processing Magazine, Vol. 25, pp. 14-36

Hermansky, H., Ellis, D. P. & Sharma, S. (2000) Tandem connectionist feature extraction for conventional HMM systems, Proc. ICASSP'00, Vol. 3, pp. 1635-1638.

Bengio, Y., Ducharme, R., Vincent, P. & Janvin, C. (2003) A neural probabilistic language model, J. Mach. Learn. Res., Vol. 3, pp. 1137-1155

Leggetter, C. J. & Woodland, P. (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech & Language, Elsevier, Vol. 9, pp. 171-185

Jakovljevic, N., Secujski, M. & Delic, V. (2009) Vocal Tract Length normalization strategy based on maximum likelihood criterion, Proc. of EUROCON 2009, pp. 417-420

Gales, M. J. (1998) Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech & Language, Elsevier, Vol. 12, pp. 75-98

Evermann, G., Chan, H., Gales, M., Jia, B., Mrva, D., Woodland, P. & Yu, K. (2005) Training LVCSR systems on thousands of hours of data, Proc. of ICASSP 2005, Vol. 1, pp. 209-212

Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J. Gong, Y. & Acero, A. (2013) Recent advances in deep learning for speech research at Microsoft, Proc. of ICASSP 2013

Hinton, G. E. (2009) Deep belief networks, Scholarpedia, 4(5):5947

Bilmes, J. (2010) Dynamic graphical models, IEEE Signal Process. Mag., Vol. 33, pp. 29-42

Deng, L. & Li, X. (2013) Machine learning paradigms for speech recognition: An overview, IEEE Trans. on Audio, Speech, and Lang. Process., IEEE, Vol. 21, pp. 1060-1089

Rennie, S. J., Hershey, J. R. & Olsen, P. A. (2009) Single-channel speech separation and recognition using loopy belief propagation, Proc. ICASSP 2009, pp. 3845-3848

Rennie, S. J., Hershey, J. R. & Olsen, P. A. (2009) Hierarchical variational loopy belief propagation for multi-talker speech recognition, Proc. ASRU 2009, pp. 176-181

Gemmeke, J. F., Virtanen, T. & Hurmalainen, A. (2011) Exemplar-based sparse representations for noise robust automatic speech recognition, IEEE Trans. on Audio, Speech, and Lang. Process., Vol. 19, pp. 2067-2080

Yu, D., Seide, F., Li, G. & Deng, L. (2012) Exploiting sparseness in deep neural networks for large vocabulary speech recognition, Proc. of ICASSP 2012, 4409-4412

Dutoit, T. (1997) An Introduction to Text-to-Speech Synthesis, Kluwer, Dordrecht, the Netherlands

Morton, K., Tatham, M. (2005) Developments in Speech Synthesis, Wiley, Chichester, UK

Ekman, P., Friesen, W. V. (1977) "Manual for the facial action coding system," Consulting Psychologists Press, Palo Alto, USA

Ekman, P., Friesen, W. V., and Hager, J. C. (2002) "Facial action coding system: a human face," Salt Lake City, Utah, USA

Petrushin, V. (2000) "Emotion recognition agents in real world," Association for the Advancement of Artificial Intelligence (AAAI) Fall Symposium on Socially Intelligent Agents: Human in the Loop, November 3-5, 2000, North Falmouth, Massachusetts, US

Ball, L. (2011) "Enhancing border security with automatic emotion recognition," International Crime and Intelligence Analysis Conference 2011 (ICIAC11), November 2011, Manchester, UK

Nicholson, J., Takahashi, K., and Nakatsu, R. (1999) "Emotion recognition in speech using neural networks," Proceedings of the Sixth International Conference on Natural Information Processing (ICONIP'99), Perth, Australia, November 16-20, Vol. 2, pp. 495-501

Hone, K., Bhadal, A. (2004) "Affective agents to reduce user frustration: the role of agent gender," Human-computer interaction (HCI) 2004, Vol. 2, pp. 173-174.

Lisetti, C. L., Schiano, D. J. (2000) Automatic facial expression interpretation: where hci, artificial intelligence and cognitive science intersect, Pragmatics and Cognition, Vol. 8, No. 1, pp. 185-235

Picard, R. (1997) Affective Computing, The MIT Press, Cambridge, Massachusetts, USA

Nakatsu, R., Nicholson, J., and Tosa, N. (1999) Emotion recognition and its application to computer agents with spontaneous interactive capabilities, Proc. of the 7th ACM International Conference on Multimedia, October 30-November 5, 1999, Orlando, Florida, USA, pp. 343-351

Klein, J., Moon, Y., and Picard, R.W. (2002) "This computer responds to user frustration: Theory, design and results," Interacting with Computers, Vol. 14, pp. 119-140

Stankovic, I., Karnjanadecha, M., Delic, V. (2012) "Improvement of Thai speech emotion recognition using face feature analysis", International Review on Computers and Software (IRECOS), Vol. 7, No. 5

Pantic, M. & Rothkrantz, L. J. M. (2003) "Toward an affect-sensitive multimodal human-computer interaction," Proceedings of the IEEE, Vol. 91, No. 9, September 2003

Chomsky, N. (2011) Language and the Cognitive Science Revolution(s), Carleton University, April 8, 2011, http://chomsky.info/talks/20110408.htm

Wilks, Y. (2007) Is there progress on talking sensibly to computers? Science, Vol. 318, 927-8

Searle, J. R. (1993) The Failures of Computationalism, Think 2, pp. 68-73

Chomsky, N. (2000) New Horizons in the Study of Language and Mind, Cambridge University Press

Schank, R.C. (1980) Language and Memory. Cognitive Science, 4:243-284

Gnjatovic, M., Janev, M., Delic, V. (2012) Focus Tree: Modeling Attentional Information in Task-Oriented Human-Machine Interaction. Applied Intelligence 37(3), 305-320

Gnjatovic, M., Delic, V. (2012) A Cognitively-Inspired Method for Meaning Representation in Dialogue Systems. In: Proc. of the 3rd IEEE International Conference on Cognitive Infocommunications, Kosice, Slovakia, pp. 383-388

Bohus, D., Horvitz, E. (2010) On the Challenges and Opportunities of Physically Situated Dialog. In Proc. of the AAAI Fall Symposium on Dialog with Robots, Arlington, VA, 7 pages, no pagination

Gnjatovic, M., Tasevski, J., Nikolic, M., Miskovic, D., Borovac, B., Delic, V. (2012) Adaptive Multimodal Interaction with Industrial Robot. In: Proc. of the IEEE 10th Jubilee International Symposium on Intelligent Systems and Informatics (SISY 2012), Subotica, Serbia, pp. 329-333

Gnjatovic M., Tasevski J., Miskovic D., Nikolic M., Borovac B., Delic V. (2013) Linguistic Encoding of Motion Events in Robotic System. In Proc. of the 6. PSU-UNS International Conference on Engineering and Technology--ICET, Novi Sad, 5 pages, no pagination

Gnjatovic, M., Delic, V. (2013) Encoding of Spatial Perspectives in Human-Machine Interaction. In Proc. of the 15th International Conference SPECOM 2013, Plzen, Czech Republic, LNAI, vol. 8113, Springer, 8 pages, in press

Gnjatovic, M., Rosner, D. (2010) Inducing genuine emotions in simulated speech-based human-machine interaction: The NIMITEK corpus. IEEE Trans. Affect. Comput., Vol. 1, pp. 132-144

Gnjatovic, M., Rosner, D. (2008) Adaptive Dialogue Management in the NIMITEK Prototype System. In: Andre, E., Dybkjaer, L., Minker, W., Neumann, H., Pieraccini, R., Weber, M. (eds.) PIT 2008. LNCS (LNAI), vol. 5078, pp. 14-25. Springer, Heidelberg

Bryson, J.J. (2000) A Proposal for the Humanoid Agent-builders League (HAL). In the Proc. of The AISB 2000 Symposium on Artificial Intelligence, Ethics and (Quasi-)Human Rights, John Barnden (ed.), pp.1-6

Weizenbaum, J. (1993) Computer Power and Human Reason: From Judgement to Calculation, Penguin Books, Limited

Authors' data: Delic, V[lado] *; Secujski, M[ilan] *; Jakovljevic, N[iksa] *; Gnjatovic, M[ilan] *; Stankovic, I[gor] **, * Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovica 6, 21000 Novi Sad, Serbia, ** European Center for Virtual Reality, Brest National Engineering School, Parvis, vdelic@uns.ac.rs; secujski@uns.ac.rs; jakovnik@uns.ac.rs; milangnjatovic@yahoo.com; bizmut@neobee.net

DOI: 10.2507/daaam.scibook.2013.19