Human Responses to Machine-Generated Speech with Emotional Content

Ilves, Mirja

Ilves, Mirja (2013)

Tampere University Press Tampereen yliopisto

2013

Vuorovaikutteinen teknologia - Interactive Technology
Informaatiotieteiden yksikkö - School of Information Sciences

This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.

Date of public defence

2013-06-19

The permanent address of this publication is
https://urn.fi/URN:ISBN:978-951-44-9174-0

Abstract

Artificially produced verbal expressions of emotion activate the human emotional system

Speech is one of the most important and central channels through which people express emotions and build social relationships. In speech, emotions can be expressed both through the verbal content and through prosody, such as the pitch, loudness, and intonation of the voice. The aim of this doctoral research was to study how the verbal emotional content of speech alone activates the listener's emotional system. Speech synthesis, that is, artificial speech, offered a good opportunity for this, because the prosody of artificially produced speech can be precisely controlled. In addition, emotions have been recognized as an important part of human–computer interaction, and it has been suggested that developing the emotional intelligence of computers could improve the quality of interaction between humans and machines. Previous research, however, has not made clear how, for example, emotions expressed by a computer affect people. Can an emotion expressed by a computer, for instance, be contagious to a person, as it were? Different speech synthesizers each model the speaking voice in a different way and sound more or less human-like. It was therefore also possible to study how the human-likeness of the voice affects emotional reactions.

In the substudies of the thesis, participants were presented with various emotional words and sentences produced with different speech synthesis techniques. The purpose was to determine how verbal expressions of emotion affect people at the level of emotional experience and physiology (pupil size, electrical facial-muscle activity, and heart rate). The research also examined whether the content of speech affects how people rate the quality of the voice. The results showed that passive listening to artificially produced emotional speech caused changes in both people's emotional experiences and their physiological responses. In general, emotional expressions produced with a more human-like voice elicited stronger emotion ratings and emotion-related bodily reactions than those produced with a less human-like voice. The results also showed that the content of speech affected how people rated the quality of the voice. When the content of the message was positive, people rated the voice as more pleasant and clearer than when the content was negative or neutral.

The results demonstrate the importance of verbal expression in human communication. Although the emotional messages were spoken mainly in a completely monotonous tone of voice and without any interaction context between human and computer, they activated the human emotional system, causing changes in both emotional experiences and bodily reactions. The results thus suggest that artificially produced expressions of emotion can evoke emotional reactions in people similar to those evoked by emotional messages conveyed by another person.

The aim of the present thesis was to examine how people respond to synthetically produced lexical expressions of emotions. In speech, both the content of the spoken words and prosodic cues, such as pitch and speech rate, can convey emotion-related information. Speech synthesizers offer a good opportunity to study how the content of spoken words alone affects human emotions, because they allow precise control over prosodic cues. Synthetic speech can be generated using different techniques. Such speech can be purely machine-generated, or it can be based on shorter or longer samples of human speech. Depending on the synthesis technique, synthesizers can be classified according to the degree of human-likeness of the voice. Four speech synthesizers, each using a different speech-production technique, were employed in this study. This also enabled an examination of the effects of the human-likeness of synthetic voices on human emotions.

Three key reasons motivated this research. First, even though spoken language is one of the most important means for expressing and conveying emotion-related information, there is scant research on the effects of the lexical meaning of spoken words on human emotions. Thus, it is important to study how the lexical content of spoken words affects emotional responses in humans. Second, because emotions have been recognized as an important part of human–computer interaction (HCI), it is essential to examine how people respond to the emotional expressions of computers. Third, because interfaces that utilize speech synthesis have become increasingly popular, it is important to understand what kinds of emotional reactions synthetically produced speech induces in people.

This thesis summarizes five publications that investigated how the lexical emotion-related content of synthesized speech affected people’s emotion-related experiences and physiological responses in terms of facial-muscle and autonomic nervous system activity (i.e. pupil size and heart rate changes). In addition, the effects of emotional content on the perception of the quality of speech synthesis were studied.

First, the results showed that the emotional messages (i.e. sentences and words) produced by synthesized speech had significant effects on people’s physiology and the ratings of their emotional experiences. Thus, passive listening to verbal emotional material induced changes on both the subjective and physiological levels. Second, the results provided evidence that the human-likeness of synthetic voices matters with respect to emotions. Generally, more human-like synthesizers evoked stronger ratings for emotions than less human-like synthesizers. Further, comparisons between less and more human-like voices showed that only the more human-like synthesizers evoked significant emotion-related facial-muscle and pupil responses. Third, the results highlighted how the content of a message affected how people perceived and rated the speech synthesizers. When the content of the message was positive, the participants rated the voice as more pleasant and clearer than when the content was negative or neutral.

In summary, the results presented in this thesis suggest that synthesized lexical expressions of emotions can evoke emotions in people. This finding indicates the importance of language in human communication. Even though the spoken stimuli were generated by the monotonous voices of speech synthesizers and lacked interaction context, the stimuli activated the human emotional system. However, the features of the voice also matter when evoking emotions through computers. Finally, the results showed that the lexical content of the messages had such a strong effect on people that the impression of the voice quality was affected by the content of the spoken message. Previous research has suggested that interaction with computers is intrinsically social; consequently, people tend to use similar interaction rules both in HCI and in human–human interaction. Overall, this thesis finds that computers also evoke emotionality in their users. It seems that the emotional expressions produced by computers could evoke emotional responses in humans similar to those evoked by the emotional expressions of another human.
