The subtitle - “Using Voice Chips and Speech Recognition Chips to Explore Structures of Participation in Sociotechnical Scripts” - tells the story, partly. But there’s more in store.
If Things Can Talk, What Do They Say? If We Can Talk to Things, What Do We Say?
Introduction (The Gossip on Voice Chips)
This essay develops a frequently asked question (FAQ) list for Voice Chips. Like the questions in most FAQs, these questions are not actually frequently asked, but they might be, and like every FAQ, the attempt is to structure the accumulation of experiences in a sociotechnical project.
Voice Chips and their newer partners, speech recognition chips, are small low power silicon chips that synthesize voice, play prerecorded voice messages, or recognize voice commands. Although this functionality is not new, what makes voice chips unique is that they are small and cheap enough to be deployed in many, in fact almost any, product. Sprinkled throughout the technosocial landscape, their presence in products is a (not quite arbitrary) sampling mechanism, and enables us to compare very different products. So their secondary function, the concern of this essay, is as a simple instrument to slice through the history of our attempts to swap attributes with machines and be able to understand the nuances of complex sociotechnical systems – precisely because the systems are rendered in the form in which we can best recognize nuance: English, be it our own or the machines’.
These chips represent in-the-wild models of the interactions between humans and machines – as reductive as they are comic, but at least a manageable examination. They are caricatures of more complex human-machine interactive systems. We will ask: What is the structure of participation scripted with these new products and with increasingly ubiquitous information technologies? By examining the “structure of participation” (who addresses what, what addresses whom, who listens, what hears and who or what acts… and other forms of participation elaborated later) rather than focusing on the interaction between the device and the “user,” we pay attention to peripheral participation, the participation between users and around things; between users and things within systems. It is an approach to human (singular)-computer (singular) interaction that reconsiders interaction as a form of participation and escapes the simple dichotomy between social and technological.
The question we begin with is simply, when things can talk, what do they say? Our intent is to actually listen, and try to figure out whose voice it is and what it means. We then ask the complementary question: when we can talk to things (i.e., when there is speech recognition capacity embedded), what do we say? Who are we addressing? And what do we sound like? Are we polite, at least? And what is the appropriate way to talk to things (social norms)? Does it change language to be talking across the human/nonhuman divide, or how does it change us? Can we get some new insights into the old question, Is language uniquely human? We ask the voice chips these questions because they literally talk back, insisting on the scripts of participation that they were built with, reflecting the expectations and failures of our interactive technologies.
How have the things that voice chips say changed over the years since voice functionality was implemented? Does the “Chatty Cathy” of the sixties (a tape mechanism precursor to the voice chip) have anything different to say vis-à-vis the Barbie of the eighties or contemporary interactive toys? Or what is the relationship between novelty and familiarity, stability and instability, managed in these devices? Which things are given voices and which things are not, and why not? Why are they different from other talking hardware? What did the patents say they would say? And what did they actually say? What are the differences between these innovations as intellectual property and the novel devices as viable products? Exploring these questions tells us about the process of commodification of an ephemeral device, and explores the pattern of propagation of an innovation.
Where are the voices coming from? Who is hearing them, who isn’t? Whose accent do they have? Does the failure of voice chips in automobiles predict anything for the future of speech recognition chips inserted in other modes of transportation, and other places? Do they work in public or private places? What does any of this tell us about ubiquitous computing? Do these voices actually work? Does a voice chip reminder to not leave your things behind, watch your step, stand back, actually make you take your thing, watch your step, or stand back? How does the function of the product change the meaning of the voice? How does the voice change the product? How do nonhuman speech devices change language? And similarly: Now that we can talk to things, what do we say? What would we prefer to say? What would be the correct thing to say? What could we say? What does this tell us about the contingency of meaning?
Can You Capture Voice?
Voice is the icon of person. “To be given a voice” is shorthand for the fundamental units of democracy: voting, “being represented,” or participating. A device of sociality and therefore interaction, it is used to interpellate a subject (presumably a person) into society (Althusser 1971), or as a performative device to instantiate social agreements and identities (Butler 1993). We will trace how the responsive and ephemeral social device of voice interaction is commodified and sold back to us.
What’s So Special about Voice Chips?
Talking hardware has existed since before the time of Thomas Edison (who is generally credited with having invented the phonograph around 1877), when Alexander Graham Bell’s telephone learnt to talk. The proliferation of talking hardware since has brought about the recording industry, the broadcast industry and the multimedia industry. Our exposure to voices (and other communicative sounds) that emanate from inanimate objects has become a significant part of our daily interactions: from radios to the more recent talking elevators, answering machine messages and prerecorded music, television, automated phone menus, automatic teller machines, alarms and alerts, each of which, as we will show, speaks in a language or dialect that makes little distinction between music, sound effects and articulated words, and privileges the situational function of language over the semantic and interactive.
There are, however, distinctions to make between the voice chips, the concern of this essay, and noisy hardware more generally. The term “voice chips” refers colloquially to the Texas Instruments TSP50C04/06 and TSP50C13/14/19 synthesizers; the Motorola MC34018; or any other “speech synthesis chip implemented in C-MOS to reproduce various kinds of voices, and includes a digital/analog (D/A) converter, an ADPCM synthesizer, an ADPCM ROM that can be configured by the manufacturer to produce sound patterns simulating certain words, music or other effects” (quoted from the North American patent literature). The speech recognition chip is exemplified by the ISD-SR3000 Embedded Speech Recognition Engine.
The voice chip differs from other technologies of automated sound production in that it offers autonomous voices, as opposed to broadcast voices: that is, voices which are not necessarily associated with a performer, a brand, or any other preestablished identity. These chips present what we will call “local talk” in products that refer to themselves and don’t often make claims to another’s identity, or to the faithful reproduction of someone else’s voice. In fact, their sound quality has effectively limited this. The “I” in “I’m sorry, I could not process your request,” or the “I will transfer you now” voice of the automated operator (Pacific Bell voice mail system 1996, 1997, and AT&T automated customer help) claims agency by using the first person pronoun. Presumably, the machine is referring to itself when saying “I,” because it is not identifiably anyone else. (Benveniste showed how linguistic categories not only allow human subjects to refer to themselves, but actually create the parameters of human self-consciousness. “‘Ego’ is he who says ‘ego.’ That is where we see the foundation of ‘subjectivity’ which is determined by the linguistic status of ‘person.’ Consciousness of self is only possible if it is experienced by contrast. I use I only when I am speaking to someone who will be a you in my address” [Benveniste, 225]. Linguistic categories such as “I” rely wholly on the identity of the speaker for their meaning.)
Attributing agency to technologies is a strategy that has been used by theorists to better understand the social role of technologies (Latour 1988; Callon 1995). It is a strategy that dislodges the immediate polarization of techniques and society, a strategy that refuses reduction to a situation that is merely social or only technological. Bruno Latour bases his Actor Network Theory – a theory that regards things as well as people as actors in any sociotechnological assemblage – on the ability of humans and nonhumans to swap properties. He claims that “every activity implies a generalized principle of symmetry or, at least, offers an ambiguous mythology that disputes the unique position of humans.” Michel Callon and John Law (1982) have also explored nonhumans as agents, but their strategy starts with an indisputable agent (a white male scientist) and strips away his enabling network of humans and nonhumans to demonstrate that his agency, his ability to act as a white male scientist, is distributed throughout his network of people, places and instruments. The more traditional (default) theory of technological determinism rests on the assumption that technology has an agency apart from the people who design, implement or operate it, and hence can determine social outcomes. Voice chip products take these ideas literally and actually attribute, with little debate or contest, the human capacity of speech to technological devices. Voice chips humbly preempted the theory. (Latour published the book Science in Action in 1987. Nearly a decade earlier, on June 11, 1978, in Dallas, Texas Instruments Incorporated announced the launch of its speech synthesis monolithic integrated circuit in the new talking learning aid Speak & Spell™. The speech synthesizer IC accurately reproduced human speech from stored (a capacity of 200 seconds in dynamic ROM) or transmitted digital data, in a chip fabricated using the same process as that of the TI calculator MOS ICs.)
The voices of chips also differ from those of loudspeakers, TV/radio, and other broadcasting technologies in the social spaces they inhabit. Although radio and TV have become so portable that their voices can emanate from any vehicle, serving counter, or room, voice chip voices, by virtue of their peripheral relationship to the product, inhabit even more diverse social spaces. The identity of the voice that emanates from TV and radio reminds us that it is coming from elsewhere: “…for CBS News,” “It is 8 o’clock GMT; this is London.” And although Channel 9 is not a physical place, its resources and speech are organized around creating its identity, as an identifiable place on the dial. The voice chip that tells you “your keys are in the ignition” is not creating a Channel 9 identity, however. Its identity is “up for grabs,” not quite settled; it speaks from a position of a product in the social space of daily use.
Similarly, recording media and hardware refer to what they record. We know we are listening to someone when we listen to an Abba CD. And although it is the tape player in the car that produces the sound, we claim to be listening to the violin concerto itself. The tape player as a product does not itself have a voice; it never pretends to sing, speak or synthesize violin sounds itself. The recording industry and associated technologies, born at a very different historical moment from voice chips, came out of the performance tradition. (See M. Paton’s forthcoming paper in Social Studies of Science for a detailed examination of the initial construction of the virtues and values of the phonograph recording technology.) Its claim to represent someone, from the earliest promotions using opera singers, to contemporary megastars, has focused the technologies around “fidelity” issues. Additionally, telephones, telephonic systems and the telecommunications industry, motivated by the communication imperative, prioritize real-time voices passing to real-time ears over fidelity. Simply stated, it is an industry that puts technologies between people, things to communicate through, “overcoming the tyranny of distance” (Minneman 1991). Invisible distance and seamless technology reflect the recording industry’s ambition to “overcome the tyranny of time,” enabling people to duplicate the performance regardless of when or where it was originally performed. Voice chips and their inferior sound quality do not refer beyond themselves. Their position in a product becomes their position as a product.
How Are Voice Chips Distributed?
Voice chips provide the opportunity to add “voice functionality” to the whole consumer-based electronics industry. They are the integrated circuits that can record, play and store sounds, and more importantly, voice. They are the patented chips that play “Jingle Bells” in Hallmark greeting cards. (Hallmark first included voice chips in their cards in 1988. Five years later they introduced a recordable card on which you could record your own voice.) They are the voice in the car that reminds you, “Your lights are on” (Nissan Maxima 1986). They are the technology that makes dolls say “Meet me at the mall” (Barbie had a few statements when she was given a voice in the late 1980s, including: “Meet me at the mall,” “Math is hard,” “I like school – don’t you?”) and gives voices to products ranging from picture frames to pens. (Machina®, a San Francisco-based company, had on the market in 1997 several talking pens or “Pencorders,” a talking key ring, several talking photo frames and many “Cardcorders,” including “Autonotes.”) The well-sung virtues of integrated circuits (chips) are that they are cheap, tiny and require little power. Smaller than a baby’s fingernail, they have the force of a global industry behind them and an entire economic sector invested in expanding their application. Technically, they can be incorporated into any product without significant changes in their housing, their circuit design, power supply, or price. Wherever there is a flashing light, there could instead, or as well, be a voice chip.
Although most personal computers can record and play voice, the voice chip is different in that it is dedicated solely to that function. The same integrated circuit technology found in calculators and computers allows this tiny package to be placed ad hoc in consumer devices. Their development exploited the silicon chip manufacturing processes and its dedication to miniaturization. With sound storage capacities ranging from seconds of on-board memory to minutes and hours of recording time when configured with memory chips, they were conceived to enable voices in existing hardware, to be incorporated into products. They are the saccharin additive of consumer electronics. (Saccharin is perhaps the first product to be parasite-marketed, i.e., “This product uses saccharin,” which is similarly Intel Inside’s marketing strategy.) They were first mass marketed in 1978 by Texas Instruments, though they had existed in several forms before that, particularly in the vending industry. It was not until seven years later, in 1985, that the Special Interest Group in Computer-Human Interface (SIGCHI) of the Association for Computing Machinery (ACM) professional society broke off into their own conference from other more general computing conferences. This institutionalization formalized the discussion in design communities on the Human-Computer Interface as a site of scientific investigation that differs from earlier formulations of this interface, such as Engelbart’s human augmentation thesis or Turing’s standing-in-for ideal (Bardini 1997), but whose concerns for evaluating an interface tend toward task decomposition, with metrics of efficiency still dominating (Dourish 2001). This liminal zone where people and machine purportedly interact is where the voice chips were intended to reside. The voice chips arrived to mediate, even to negotiate, this boundary.
Voice chips promised to make hardware “user-friendly,” a phrase that defines the technical imagination of the time, by turning the person into an interchangeable standardized “user” and attributing a personality (i.e., friendliness) to the device. In this context the problem for designing user-friendly devices begins with the assumption that the hardware has agency in the interaction.
Writes Turkle: “Marginal objects, objects with no clear place, play important roles. On the lines between categories, they draw attention to how we have drawn the lines. Sometimes in doing so they incite us to reaffirm the lines, sometimes to call them into question, stimulating different distinctions” (Turkle 1984, 31).
Do Marginal Voices Have Any Say in the Market?
Finally, before listening to the voices themselves, I want to emphasize the peripheral relationship of the voice chip to the product. It is the position of the voice chip as marginal, not particularly intended to be the primary function of the product, that increases the present curiosity in it. The motor vehicle, for example, is not purchased primarily for its talking capacity, and pens that speak are still useful for writing. This marginality gives voice chips a mobility to become distributed throughout the product landscape and mark, like fluorescent dye, a social geography of product voices.
The chips are usually deployed – to borrow the economic sense of the term – for their marginal effects, to distinguish one product (e.g., an alarm system) from another, and give it some marginal advantage over a competing product. However, the chips are not evenly distributed throughout competitive markets (e.g., consumer electronics) in the manner one would expect for the propagation of a low-cost technical innovation driven by market structure alone.
Although consumer preferences are often claimed to have a causal determination on the appearance or disappearance of marginal benefits, it is difficult to see how the well-developed paths of product distribution have the capacity to communicate those “preferences” developed after the point of purchase. Lending the market ultimate causality (or agency) ignores the specific experience of conversing with products, the micro-interactions that enact the market phenomenon, and occludes the attribution of agency to the voice chip products, insomuch as these products speak for themselves. The voice chip products themselves have something to say, although their voices are usually ignored. In this essay we will not be examining voice chip products in the interactions of daily use, as contrapuntal to market descriptions - however, by recognizing the social assumptions that determine their physical design, we frame the imagined interactions and social worlds in which these products make sense.
The marginality of the product makes it difficult to study systematically. Neither of the two largest manufacturers of voice chips of various types (Motorola and Texas Instruments) keeps information on what products incorporate this technology, partly because the chips can be configured in many different ways – not necessarily as voice chips – and partly because products that talk are not a marketing category of general interest. This essay traces voice chips in two ways: first via the patent literature, and second through a more ad hoc method of searching catalogues, electronics, and toy and department stores, to compile a survey of products that have been available in the last six years (my voice chip collection was begun in June 1996). (A complete list of the collected products and patents is planned for http://www.cat.nyu.edu/neologue. This will be updated constantly.)
What is initially observable from the list of products and patents that contain voice chips is that there is no obvious systematic relationship between the products that include voice chips and the uses or purposes of those products. Except for children’s toys, no particular electronics market sector is more saturated with talkative products. These chips are distributed throughout diverse products. However, we can view the voices as representatives, as in a democratic republic where voices are counted. Just as in a republic, each citizen has a vote, but most choose not to exercise it; likewise, most products could incorporate voice chips but most do not. We will count what we can.
What Do Voice Chips Say?
A review of the patent literature yields a loose category scheme or typology, not by where the voice chips appeared (a technology sector approach that we will visit later), but by what they said. The patents themselves hold a tentative relationship to the products. For only two of the products on the market did I find the corresponding patents, the CPR device (Patent #4863385; Sept. 5, 1989) and the recordable pen (Patent #5313557; May 17, 1994). Though patents do not directly reflect the marketed products, they do represent a rather strange world of product generation, a humidicrib for viable and unfeasible proto-products. Patents track how products have been imagined and protected; while they do not by any means demonstrate market success, they do reflect a conviction of their worth, being invested in and protected. Patents are a step in the process of becoming owned, are therefore worth money, and thereby demonstrate how voice, a social technology, becomes property.
There were as of October 2001 only 163 North American patents that included a voice chip. (More recent years show a proportional increase.) In the context of the patent literature, the first thing to note is that this is a very small number – compared, that is, to the integrated circuit patent literature more generally. The question “Why not more?” we will return to later. The federal trademark office offers a suggestive list of speech-invoking names, including: who’s voice; provoice; primovox; ume voice; first voice; topvoice; voice power; truvoice; voiceplus; voicejoy; activevoice; vocalizer; speechpad; audiosignature. These monikers introduce how the voice is conceptualized in the realms of intellectual property, in a different form, claiming that these voices are premium (should be listened to?) in various ways. However, the voice chips themselves seem to fall into the following loose categories:
1. Translators, which range from reporting and alerting to alarming and threatening and include “interactive” instructional voices;
2. Transformers, which transform the voice;
3. Voice as Music, which makes speech indistinguishable from music or presents voice as sound effect;
4. Locating Voices, speaking from here to there about being here;
5. Expressive Voices, expressing love, regret, anger and affection;
6. Didactic Voices and Imitative Voices, mainly as in educational and whimsical children’s toys;
7. Dialogue Products, which explicitly intend to be in dialogue with the user, as opposed to delivering instructions to a passive listener.
Products and patents often exist in more than one of these categories; for instance, the Automatic Teller Machine will not only apologize (expressive) for being out of order but will also simply function to translate the words on the screen into speech. This said, the categories remain, for the most part, distinguishable and useful.
Translators form a large category: this is the voice that translates the language of buzzes and beeps into sentences – whether English, French, or Chinese. A translator is a chip that translates the universal flashing LED, the lingua franca of the piezoelectric squeal, the date code, the bar code, the telephone ringer adapter that translates that familiar ring, the tingling insistent trill of an incoming call, into “a well-known phrase of music” (Patent #5014301; May 7, 1991), an approach that has since become popular in cell phones, where this function is useful in differentiating whose phone is ringing, or the unrelated patent that translates the caller identification signal into a vocal announcement.
Within the translators there are distinct attitudes; for instance, the impassive reporting, almost a “voice of nature.” This is exemplified by the patent for the menstrual cycle meter. The voice reports the date and time of ovulation, in addition to stating the gender more likely to be conceived at a particular date or time during a woman’s fertility cycle. Another example is the patent for the “train defect detecting and enunciating system,” which “reports detected faults in English.” These chips speak with a “voice of reality,” reporting “fact” by the authority of the instrument that triggers them.
Another type of translator claims more urgency than those that simply report fact. These raise an alarm and expect a response. They are less factual, more contestable perhaps. Take the “Writing device with alarm” (Patent #4812968; Mar. 14, 1989), an “invention which relates to a writing device which can emit a warning sound – or appropriate verbal encouragement – in order to awaken a person who has fallen asleep while working or studying”; or the baby rail device which exclaims, “The infant is on the rail, please raise the rail”… and then if there is no subsequent response from an attendant caregiver, raises it automatically (Patent #4951032; Aug. 21, 1990). A product on the market that will politely tell you if there is water on the ground is pictured in figure 19.2. These voice chips ask for and direct the involvement of their human counterparts – they assume “interactive humans.”
These chips articulate not only simple commands, but series of instructions as well. The CPR device (Patent #4863385; Sept. 5, 1989) in figure 19.3 guides the listener through the resuscitation process. And finally, these chips translate menus of choices into questions. The car temperature monitor that asks the driver, “Would you like to change the temperature?” translates from the visual menu of choices, but in the process also takes over the initiating role. What is lost or gained in the translation generates many questions: Does translating from squeals to a more articulate alarm make it any more alarming? How do spoken instructions transform written instructions? We will try to address these questions later.
There is a notable set of aberrant but related patents that exist in this “alarming” category: “Alarm system for sensing and for vocally warning a person approaching a protected object” (Patent #5315285; May 24, 1994), “Alarm system for sensing and for vocally warning of an unauthorized approach towards a protected object or zone” (Patent #4987402; Jan. 22, 1991), and “Alarm system for sensing and for vocally warning a person to step back from a protected object” (Patent #5117217; May 26, 1992).
What seems almost like hair-splitting turns of phrase to get three separate patents has little technical consequence: the second patent has the extra functionality to detect authorized persons (or their official badge), and the third can, but need not, imply a different sensor – but each implies a different attitude. Although all patents are contestable, patent attorneys typically advise that you would not be able to successfully claim as separate patents an alarm system that warned at 15 feet and one that alerted at two feet. The “novel use” being patented here depends on the wording: the phrasing of the instruction that determines the arrangement of the sensor and alarm/voice chip. On the strength of a differently worded warning, the importance of the technically defined product description seems to have diminished. Perhaps ElectroAcoustic Novelties, the owner of the patents, has a linguist generating an alarm system for other phrases. These patents seem to be articulating the semantics of the technology. The intentionality of the system is its voice.
Transformers are distinct from patents that translate the voice. They translate in the other direction – not from the buzzes and squeals to spoken phrases, but from the human voice to a less particular voice. For instance: to assist the hearing impaired, a chip that transforms voices into a frequency range the listener can still hear (usually a higher frequency); or the “Electronic Music Device” effecting a “favorable musical tone.” “The voice tone color can be imparted with a musical effect, such as vibrato, or tone transformed” (Patent #5254805; Oct. 19, 1993).
Into this category fall children’s products like the “YakBak,” popular in the 1997-1999 seasons, which plays back a child’s voice with a variety of distortions; and the silicon-based megaphones that allow children to imitate technological effects, or sound like machines. These are voice masks, for putting on the accent of techno-dialect. The socializing voices broadcast on radio and TV, the voices of authority heard over public address systems, and the techno-personalities of androids and robots are practiced and performed by playing with these devices. This is also the category of voice chips that is concentrated in products for the hearing impaired or the otherwise disabled, and for children. These transforming devices act as if to integrate these marginalized social roles into a sociotechnical mainstream.
Speech as Music
Many of the patents that are granted specifically collapse any difference between music and speech. This contrasts with the careful attention given to the meaning of the words used in the alarm system family of the translators. An explicit example is the business card receptacle, which solves the problem of having business cards stapled onto letters – making them more difficult to read – and provides an “improved receptacle that actively draws attention to the receptacle and creates an interest in the recipient by use of audio signals, such as sounds, voice messages, speech, sound effects, musical melodies, tones or the like, to read and retain the enclosed object” (Patent #5275285; Jan. 4, 1994). Another example is the Einstein quiz game that alternately states, “Correct, you’re a genius!” or sounds bells and whistles when the player answers the question correctly. This interchangeability of speech and music is common in the patent literature presumably because there is no particular difference technically. In this way patents are designed to stake claims – the wider the claim the better. The lack of specificity and the deliberate vagueness of this genre of intellectual property law contrast with the carefulness of copyright law, the dominant institution for “owning” words.
Local Talk from a Distance
One would expect chips that afford miniaturization and inclusion in many low-power products to be designed to address their local audience, in contrast to booming public address systems or broadcast technologies. However, several of these voice chip voices recirculate on the already-established (human) voice highways, imagined to transmit information as you or I would. The oil spill detector (Patent #5481904; Jan. 9, 1996) that transmits via radio the GPS position of the accident, or “the cell phone-based automatic emergency vehicle location system” that reports the latitude and longitude into an automatically dialed cell phone (Patent #5555286; Sept. 10, 1996) – these are examples of a voice chip standing in for and exploiting the networks established for humans, transmitting as pretend humans. This class of products, local agents speaking to remote sites, is curious because the information can easily be transmitted efficiently as signals of other types. Why not just transmit the digital signal instead of translating it first into speech? The voice networks are more “public access,” more inclusive, if we count these products as part of our public, too. The counterexample, of voice chips acting as the local agent to perform centrally generated commands, is also common, as in the credit card-actuated telecommunication access network that includes a voice chip to interact locally with the customer while the actual processing is done at the main switchboard. Although the voice is generated locally, the decisions on what it will say (i.e., the interactions) are not.
The realm of expressiveness, often used to demarcate the boundaries between humanity and technology, is transgressed by voice chips. There are, of course, expressive voice chips ranging from a key ring that offers a choice of expletives, swear words and curses to the “portable parent” that plays stereotypical advice and parental orders to the array of Hallmark cards that wish you a very happy birthday, or say, “I love you.” These expressive applications also remind us of the complexities of interpreting talking cards. The meaning of these products is of course dependent on the details of the situation, rather than on the actual words being uttered: who sent the card, and when; or what traffic situation preceded the triggering of the key ring expletive.
These novelty devices lead into the most populous voice chip category: those intended for children. The toy department store Toys “R” Us currently has seven aisles of talking and sound-making products – approximately 45 different talking books alone, in addition to various educational toys, dolls and figures that speak in character. The voices are intended for the entire age range, from the earliest squeaking rattles for babies, to strategy games for children 14 years of age and up – for example, the “Talking Battle Ship,” in which you can “hear the Navy Commander announce the action” as well as “exciting battle sounds.” The categorization of the multitude of toys extends far beyond “expressive” types, from the encouraging voices inserted in educational toys (“Awesome!,” “No, try again” or “You’re rolling now”) such as the Phonics learning system, the Prestige Space Scholar, and Einstein’s trivia game, to the same recordable voice chips used for executive voice memo pads. Chips for children are placed in pens, balls, and YakBaks; then there is the multitude of imitative toys that emulate cute animals, nonfunctional power tools and many trademarked personae, from Tigger and Pooh to Disney’s recent animation characters Sampson and Delilah, Ariel the mermaid, and others.
This listing demonstrates a cultural phenomenon that enthusiastically embraces children interacting with machine voices, and articulates the specific didactic attitudes projected onto products. These technological socialization devices have already been subject to analysis, as in Sherry Turkle’s study of children’s attitudes towards “interactive” products. Turkle demonstrates how children enter into social relationships with their computers and computer games; thinking of them as alive, they get competitive and angry, they scold them, and even want revenge on them. She finds that they respond to the rationality of the computer by valuing in themselves what is most unlike it. That is, she raises the concern that they define themselves in opposition to the computer, dichotomizing their feeling and their thinking. Barbie, for instance, was taken very seriously for what she had to say about the polarized notions of gender she embodies. Since Barbie’s introduction in 1959 she has been given a voice three times (each with slightly different technology); her most controversial voice, in 1992, was censored for saying, “Math class is tough.” This controversy rests on the assumption that voice chips are social actors and do have determining power to affect attitudes – in this case a young Barbie player’s attitude to math.
Although Barbie is currently silent, a myriad of talking dolls remain, from Tamagotchi virtual pets, with their simple tweets, to crying dolls that ask to be fed, and an ever-increasing taxonomy of robotic dolls and creatures. The utility patent literature continues to award “new and novel” applications in this area. One of the “new” voice chip patents (Patent #5413516, May 9, 1995) is for a doll that squeals when you pull her hair (dolls that cry when they are wet or turned upside-down are technically differentiated by their simple response triggers). There is also a new doll patent that covers an “electronic speech control apparatus and methods and more particularly for… talking in a conversational manner on different subjects, deriving simulated emotions… methods for operating the same and applications in talking toys and the like” (Patent #5029214, July 2, 1991). The functional categories at work here are not linguistic, nor do they resemble other ways in which a voice has been transformed into a document – for example, as in the copyright of a radio show. It would, in other realms, be very difficult to get copyright on “talking in a conversational way.” In the material world the ownership of voice has been redefined.
This category encompasses many of the most recent voice chip products. It is the existence of these products that tests the nature of the communication we have with these technologies: do we, can we, converse with these products? This category draws from the other typologies but is distinguishable, for the most part, by the recording functionality that is the raison d’être of the product. The category includes those products that perform a more specific speech function that could not be alternatively represented by lights, beeps, or visual display, i.e., perhaps they are more communicative. This category includes the products that seem to hold dialogue.
The category’s range of products includes the shower radio (see figure 19.6) that reinterprets bathing as a time for productive work, an opportunity to capture notes and ideas on a voice chip, consistent with the theory that there is an ongoing expansion of the work environment into “private” life. It also includes both the recordable pen and its business-card-size counterpart, the memo pad. Both the pen and the pad have many versions on the market currently, and they seem to be becoming more and more populous. The YakBak is the parallel product for children, deploying the same technology with different graphics, and to radically different ends.
The growing popularity of this category compared to the others arouses a number of questions. Firstly, how do we understand why this category is popular? Is the popularity driven by consumers because these products are successful at what they do? And is what they do dialogue? Or is it that the cost and portability of the technology make it an affordable newtech symbol beyond what is attributable to its function alone? Is this category popular because it alone can be marketed as a work product? (Work and the products of work can be shown to take on meaning that transcends their use-value in commodity capitalism; see Susan Willis, Primer for Daily Life.) And then conversely, why are these devices not more popular? Why is it that only a few types of products become the voice sites? Pens, photo frames and memo pads are all documents of a sort, in contrast to switches or menu choices.
According to the patent literature, “the failure of the market place to find a need for voice capability on home appliances has discouraged the use of voice chips in other products” (Patent #5140632, Aug. 18, 1992: a telephone having a voice capability adapter), but lending the market agency for design assumptions is circular logic. This does express, however, the sentiment that many more products could have speech functionality than do.
Although miniaturization has made these products possible, the concept of embedding recording capability in products has been possible with other technologies. There has been no technical barrier to providing recording capability in cars, or in any of the larger products – a refrigerator, for instance – certainly since the existence of cheap magnetic recording technologies. Why is it that now we want consumer products that talk to us?
It is striking that the majority of talking products on the market currently are for conversing with oneself. Although deeply narcissistic, this demonstrates a commodification of self-talk that transforms the conceptualization of the self into subjectivity in relationship with our products. It suggests, without subtlety, that the relationship with these products is a relationship with the self. The constitution of personal and social identity by means of the acquisition of goods in the marketplace (Shields 1992) – the process of identifying products that provide the social roles we recognize and desire – cannot be excluded from the consideration of the social role of products.
Where Are the Voices Coming From?
The preceding typologies focus on what the voice chips say rather than where they say it. However, because voice chips are distributed throughout the product landscape, where they appear (and disappear) is also interesting to examine. Although a very detailed analysis could yield an interesting geography, it is beyond the scope of an essay intended to generate preliminary questions about why they say what they do where they do.
The automobile industry, a highly competitive, heavily patented industry that quickly incorporates cheap technical innovations (where they do not substantially alter the manufacturing process) is a place to expect the appearance of voice chips. Indeed, there was early incorporation of voice chips in automobiles. A 1985 luxury car, the Nissan Maxima, came with a voice chip as a standard feature in every vehicle. The voice chip said, “Your lights are on,” “Your keys are in the ignition” and “The door is ajar.” There were also visual displays that marked these circumstances, yet the unfastened seatbelt warning only beeped. By 1987, you could not get a Nissan Maxima with a voice chip, even on special request. In this case, the voice was silenced, but only for a time, reemerging with a very different role to play in the automobile.
By 1996, the voice chips reappeared in the alarm system of cars. Cadillac’s standard alarm system uses proximity detection to warn, “You are too close. Please move away.” In this 10-year period the voice shifted from notification to alarm, a trajectory from user-friendly to a distinctly unfriendly position. It is also interesting to note another extension of the action/reaction voice chip logic, if not the voice itself. The current Nissan model no longer notifies that the lights have been left on, it simply turns the lights off if the keys are taken out of the ignition. The courtesy of notification has been dispensed with, as well as the need for a response from the user. The outcome of leaving the lights on is already known, so the circuit will instead address that outcome. This indicates that when the results are exhaustively knowable, the need for interaction diminishes.
Of the seven patents specifically for vehicles, all bar one are intended for private and not public transportation. (Within the patent literature, what appeared in relation to transportation was: #5555286, a cellular phone-based automatic emergency vessel/vehicle location system, which translates a verbal rendition of latitude and longitude to a cell phone; #5509853, an automobile interior ventilator with voice activation, which queries the driver when the door closes and gives menu options; #5508685, a vehicle and device adapted to revive a fatigued driver, a voice reminder combined with spray device; #5428512, a sidelighting arrangement and method, with a voice warning of impending obstacles; #5045838, a method and system for protecting automotive appliances against theft; #5278763, navigation aids, presumably for application in transportation; #4491290, a train defect detecting and enunciation system.) However, in late 1996, voice chips began to appear in the quasi-private/public vehicles of New York’s Yellow Cabs. After debate about what ethnic accent should be ascribed to the voice that reminded you to “please fasten your seatbelt” and “please check for belongings that you may have left behind” (see the New York Times discussion), the prerecorded (68k-quality) voices of Placido Domingo and other celebrities won the identity contest, and have since proliferated into many well-known New York characters, from sports stars to Sesame Street’s Elmo. The voice chip in this quasi-public sphere adopted a broadcast voice, albeit one of poor quality, or a microbroadcast voice. Whether they are effective in increasing seatbelt wearing or reducing the number of items left in the cabs in any accent is less certain than the manner in which they articulate the social relations of the cab. The voice chips address only the passengers and assume that the drivers don’t hear them, although it is the drivers who bear the brunt of their monotony. Their usefulness delegates the human interaction of service and rests on the assumption that the chips are more reliable and consistent in repeating the same thing over and over, no matter the circumstance, and that the customer responds to Placido Domingo’s impassive, recorded reminder more than they would to a driver who may be able to bring some judgment to bear upon the situation. In the transformation of the passenger into a public audience (not unlike that of a radio station) the product or service itself is not attributed with the voice. Instead the voice becomes identified with a celebrity.
In the transportation sector alone we can see the voice chip develop from an anonymous to an identifiable voice, and from a polite notification to an alarm for deterring approach. Cars have struggled with the problem of talking to humans and seem to have exploited the nonhuman qualities of their speech – the things that the technology is better at doing, like faithful repetition or careful reproduction of the identity of another – rather than any particularly human attribute of their speech. (This is in contrast to the popular depiction of cars with voices on mainstream television. In programs such as My Mother the Car or Knight Rider, the voice was used to lend the car personality.) It is also notable that talking cars have not endured.
In the health industry, another social sector highly saturated with electronic product, the distribution of voice chips is almost exclusively on one side of the home/professional, expert/non-expert divide. Although in number there are more products made for hospitals and clinics than the home market, the placement of voice chips is inversely represented. In home products, from the menstrual cycle meter to the CPR device, electronic voices seem to play the role of the health professional or “expert.” In addition, the large number of products for the visually impaired are intended for patients and not professionals (a demographic with more spending power); see, for example, the addition of a sound indicator to the syringe-filling device “for home use,” which testifies that the user of this device is imagined at home, without the help of the professional for whom the product can stand in. Ironically, the most vocal equipment in this industry are the relaxation and stress reduction products, e.g., those by which you talk to yourself or are reassured and relaxed by the sounds of the ocean (see figure 19.7). The reassuring factuality of these technovoices focuses its attention on the lay audience. These are preliminary observations of the voices introduced into transportation and in the health and medical areas, and are cursory at best. But they demonstrate that for the voice to make sense, the technological relationship itself needs to make sense. The speech from devices is as culturally contingent as language.
There are many other areas in which the introduction of voice chips provides insight into what technological relationships make sense. Their incorporation into work products articulates the transformation and reorganization of work structure, particularly into “mobile” work (Zuboff 1984; see particularly “The Abstraction of Industrial Work”). They speak to a culture’s popular notions of where work gets done, a culture in which providing a product to take voice notes while in the shower makes sense. The voice chip population of areas of novelty products, children’s toys, and educational products, and of the safety, security and rescue products also maps the social relationships we engage in with our products. Conversely, where we don’t find voice chips, for example in biomedical equipment for health professionals, also maps the social relationships that the technologies play out – they stand in for experts with an authoritative voice one wouldn’t use on a colleague. However, to understand the dialogue we are having with these voices requires us to also examine how we listen.
Discussion: What Do the Voice Chips Actually Mean When They Speak? Do They Actually Work?
Voice Chips as Music
The preceding categories survey what voice chips say, where it is they say it, and to whom they say it. To understand what the voice chips are saying, however, means engaging strategies for listening that may not be automatic. Products, with or without voices, are well-camouflaged by what Clifford Geertz described as “the dulling sense of familiarity with which… our own ability to relate perceptively to one another is concealed from us.” Modes and strategies for listening that can help us hear these voice chips can be borrowed from music. Music, unlike machinery, is commonly understood as “culture,” or a cultural phenomenon, and its analysis looks very different from the analysis of technology. The structure of participation enables multiple listeners (vs. a “user”); the “use” of music is widely divergent; and interaction with it is more specifically understood as interpretation (we don’t speak of the task’s decomposition, efficiency or effectiveness). Perhaps the most glaring difference is the concept of improvisation, which is prevalent in theories of music, yet is unusual in the analysis of human-machine interaction. (The striking exception is Lucy Suchman’s work, which we will discuss in depth later.) Is it that improvisation is absent from our interaction with machines, or our models for designing interaction?
Our strategy here is to avoid the contested terms “reality,” “progress,” and “rational choice” that usually inform the analysis of technology - thus we can provide more emphasis on the interpretative experience. Additionally, some of the voice chip products themselves demonstrate an indifference to the distinction between speech and music, by blurring the distinction between words and beeps (see the “speech as music” category of products).
Music, like product, is also easily recognized as involved in the production of identity. That is, subcultures identify through and with music (Fabbri 1981). Where technological product is presented to the consumer, at what Cowan calls the “consumption junction,” we are at such an identity-producing site. (See also Laura Oswald, who describes the site for purchasing product as the staging of the subject in consumer culture.) For this reason it is difficult to ascribe any one particular meaning or mode of listening to the voice chips. In the wide spectrum of musical styles available, each piece of music can and does exist in widely different listening situations. This means that each listener has a variety of listening experiences and an extensive repertoire of modes of listening. The hearing person who listens to radio, TV, the cinema, goes shopping to piped music, eats in restaurants, or attends parties has built up competence in translating and using music impressions. This ability results not from formalized schooling but from the everyday listening process in the soundscape of the modern city. Stockfelt asserts that mass media music can be understood as something of a nonverbal lingua franca, without of course denying the other more specialized musical subcultures to which we may simultaneously belong. (Stockfelt supports her work with Tagg and Clarida’s studies on listeners’ responses to film and television title themes, which demonstrate common competence to adequately understand and contextually place different musical structures; listeners for the most part understand musical semiotic content in similar ways, across dissimilar cultural areas. See also Philip Tagg, Kojak: 50 Seconds of Television Music – Toward the Analysis of Affect in Popular Music.)
Listening modes are not, of course, limited to music, and nor for that matter is a musical experience limited to music. Even so, teasing out the musical modes of listening from listening modes that focus on the sound’s quality, its information-carrying aspect, or other nonverbal aesthetic modes is difficult. The “cultural work” of using unmusical sounds as music is not uncommon; for example, Chicago’s Speech Choir, John Cage’s 4’33”, The Symphony of Sirens (first performed in 1923 by Arseni Avraamov), and the sounds created with samplers, particularly for percussive effects. At the same time, the sirens, speech choirs, etc., do not lose their extramusical meaning as they become music. Conversely, using musical sounds for nonmusical ends is the conceit of many voice chip applications.
The two products in figures 19.8 and 19.9 demonstrate the confusion of musical listening vs. other modes of musical sound consumption. The Soother uses unmusical sounds for musical effect while the Funny Animal Piano uses musical sounds to respond to a toddler’s feet. The alignment of voice chips with music has interesting implications for their linguistic claims; if they produce meaningful speech, why don’t they differentiate between music and speech? (This applies in particular to the products that use speech and music interchangeably: the children’s applications, the bells and whistles that substitute for spoken encouragement, the alarm systems that use vocal warnings or sirens, and the pen, patent #4812068.) Is it that the social position of the product determines the meaning of the sounds and utterances? Indeed, if the speech they produce is linguistic, then when the voice of the alarm system warns us, are we altering the meaning of the sound, whether it resembles speech or siren? Or can we expand linguistic theories to accommodate all meaningful sounds that humans or machines make? These questions about how we understand the sounds that voice chips produce complicate the attribution of agency to these “things with voices.” Voice chips seem to frame sound as a prepackaged cultural product, the identity of which is located in the manufactured materiality. At the consumption junction these voices are heard in the buzz and squeal of products, but can we call it language?
Voice Chips as Speech
What do voice chips tell us about our understanding of language? The voice chips’ languages provide a picture of our on-the-ground, in-the-market operationalization of our explicit understanding of language. Even though some voice chips use music and speech indistinguishably, the words that they say cannot be overlooked. Voice chips talk and say actual words, but how do we understand these voices as communicative resources? Are they “speech acts” as defined by linguistic theorists?
Speech acts are used to categorize audible utterances that can be viewed as intending to communicate something, to make something happen, or to get someone to do something. (See J.L. Austin, How to Do Things with Words, the general point of which is to look not at how language is composed but at what it does.) To construe a noise or a mark as a linguistic communication involves construing its production as a speech act (as opposed to a sound that we decide is not communicative). Categories of speech acts are given next (examples quoted from voice chips).
1. Commissives: The speaker places him/herself under obligation to do something or carry something out; for example, in a telephone system, “I will transfer you to the operator”;
2. Declaratives: The speaker makes a declaration that brings about a new set of circumstances; for example, when your boss declares that you are fired, or when the car states, “The lights are on”;
3. Directives: The speaker tells the listener to do something for the speaker; “Please close the door,” “Move away from the car”;
4. Expressives are without specific function except to keep social interactions going smoothly, like “please” and “thank you,” or the more expressive “I love you.”
Each of these categories is performed by the voice chips examined in this essay, as are other verbs and verb phrases associated with the wider category of illocutionary acts: “to… state, assert, describe, warn, remark, comment, command, promise, order, request, criticize, apologize, censure, approve, welcome, express approval, and express regret.” (Searle uses this list to introduce his paper “What is a speech act?”) The category in which voice chips are least convincing is the declarative that requires the reliability or trustworthiness of the agent (human and nonhuman) to understand whether or not this thing is going to come about. We note that the declarative notification that your car will turn off the lights has been removed, and the car simply enacts the turning off of the lights. The voice chips also tend to inhabit the present tense, or the very recent past tense. Future tense is less common, perhaps because the autonomy of a system is held in check by the interactive scripts. And they also prefer the first person, which supports the idea that they are not referring beyond themselves.
Searle defines the “speech act” as an utterance (action) intended to have an effect on the hearer, with preconditions and effects. This has been criticized by other theorists who have pointed out that meaning is imparted by the work of an “interpretative community.” Stanley Fish’s essay “How to do things with Austin and Searle” analyzes Shakespeare’s Coriolanus as a speech-act play. When Coriolanus responds to his banishment from Rome by stating a defiant “I banish you,” the discrepancy in illocutionary force between the two performatives of banishment is obvious. The contrast between Rome, embodying the power of the state and community, and Coriolanus’s sincere wish to banish Rome (i.e., his intentionality) is illustrative. The limitation of speech act theory in explaining voice chips is that it ascribes the most intention to the least animate thing in the interaction. In its failure to elaborate on interpretation, it provides no place for information about the significance of any particular assertion, warning, or, more generally, any speech act. Voice chips amplify this problem because they can inhabit so many different situations yet repeat the same thing. Because the voice doesn’t change, all flexibility in understanding to accommodate the changing circumstances needs to be accounted for by the listener’s interpretation. The case of the Cadillac’s alarm voice illustrates this. During a demonstration of the Cadillac’s alarm system, the salesman instructed me to move away from the car and then approach it again. Despite coming as close as I could to the car, the voice did not sound. On hearing no voice, the demonstrator toggled the key fob switch. I approached again and the voice sounded. In the first approach, the voice chip’s silence was interpreted as “the alarm is not working or is not on.” In the second approach, the voice communicated, “Now the alarm is on and functioning.”
By staying in the proximity range of the alarm system, the voice answered several questions, despite simply repeating the same words: “move away…” What is the area range in which we are detected? Will the alarm keep repeating, or will it escalate its command? Although moving away from the car stopped the voice, we also came to understand the types of motions that it detected, the speed of approach, what happened when we physically shook the car, etc. The simple interaction with the car and its voice demonstrates the interpretative flexibility that transcends the directive of the words stated, and how, as hearers, we respond to the voice’s imperatives. So in asking how we understand the significance of speech performed by the voice chip, we are asking whether speech is abstractable. Broadcast voices and prerecorded voices, although abstracted onto technologies, still belong to an identity; however it is the combined sense of abstraction that connotes the identity of the voice as that of the car. This could be interpreted alternatively as an abstracted voice of authority performed by the car, or the abstraction of the car itself. In other words, is there a difference between talking with a voice chip and talking with something (human) with which we share capacities other than speech?
Is Speech Abstractable?
Speech in action, rather than in theory, is conversation. If we are to claim that we interact with voice chip speech, we need to examine the fundamental structure of conversation as the primary model for interaction. (“If certain stable forms appear to emerge or recur in talk, they should be understood as an orderliness wrested by the participants from interactional contingency, rather than as automatic products of standardized plans. Form, one might say, is also the distillate of action and interaction, not only its blueprint. If that is so, then the description of forms of behavior, forms of discourse… included, has to include interaction among their constitutive domains, and not just as the stage on which scripts written in the mind are played out” [Schegloff, 73].) One of the voice chip patents claims the rights for “electronic apparatus(es) for talking in a conversational manner on different subjects, deriving simulated emotions which are reflected in utterances of the apparatus.” Although the other voice chip products make no explicit claim to be conversing, they do claim to be “interactive.” (Patent #4517412, the card-actuated telecommunication network, is an example of this: “Local processor (11) controls a voice chip (15) coupled to telephone set (10), which interacts with the caller during the card verification process.”)
Lucy Suchman’s (1987) work, however, proves more appropriate to describing the interactive “speech” of voice chips. Her work focuses on the inherent uncertainty of intentional attributions in the everyday business of making sense via conversational interaction with another machine, the photocopier. Like voice chips, these machines are characterized in her account by the severe constraints on their access to the evidential resources on which human communication relies. She elaborates the resources for constructing shared understanding, collaboratively and in situ, rather than using an a priori system of rules for meaningful behavior.
Suchman shows that the listening process of situated language depends on the listener to achieve the shared understanding of successful communication. The listener attends to the speaker’s words and actions in order to understand. Although institutional settings can prescribe the type, distribution and content of talk (e.g., cross-examinations, lectures, formal debates, etc.), they can all still be analyzed as modifications to conversation’s basic structure. Suchman characterizes one form of interactional organization (or structure of participation) – in this case the interview – as a) the pre-allocation of turns: who speaks when and what form their participation takes; b) the prescription of the substantive content and direction of the interaction, or the agenda. Suchman explains that this interpolation of verbal nuances and the coherence that the structure represents is actually achieved moment by moment, as a local, collaborative, sequential accomplishment. The actual enactment of the interaction is an essentially local production, accomplished collaboratively in real time rather than born whole out of the speaker’s intent or cognitive plan. More generally she describes a system for situated communication, or conversation as “an organization designed to support local endogenous control over the development of topics or activities and to maximize accommodation of unforeseeable circumstances that arise, and resources for locating and remedying the communication troubles as part of its fundamental organization.”
Conversation with a Voice Chip?
Prerecorded voices on voice chips are ill-equipped to detect communication troubles, and although they are usually triggered by local inputs, the content of what is said does not change. They will repeat the same thing, or a set of prerecorded phrases, over the indefinite range of unpredictable circumstances. Although they localize control, they for the most part do not localize the direction of speech.
The applications that seem closest to Suchman’s characterization of conversations are the products that include “dialogue chips.” These chips hand over control of the content of talk to the listener, fulfilling Suchman’s characterization of conversational interaction in this respect. The listener literally controls the speaker and sets up a relationship with the device. Further, the dialogue chip products use the turn-taking of conversational collaboration, not as the alternation of contained segments of talk in which the speaker determines the unit’s boundaries, but in the manner illustrated by the joint production of a single sentence (Suchman 1987, 81, 125). Suchman uses the example of the joint production of a single sentence to demonstrate the fluid division of labor in speaking and listening. The “turn-taking system for conversation demonstrates how a system for communication that accommodates any participants, under any circumstances, may be systematic and orderly, while it must be essentially ad hoc” (Suchman 1987, 78). The alarm clocks that incorporate voice recording functions are a new example of how that control is extended over time, but remains very local.
The response to voice chips, like the applause at the end of a play, is not a response to the final line uttered, or the fact that it just stopped. “The relevance of an action… is conditional on any identifiable prior action or event, insofar as the previous action can be tied to the current action’s immediate local environment.” The conditional relevance does not allow us to predict a response from an action, but only to project that what comes next will be a response, and retrospectively, to take that status as a cue to how what comes next should be heard. The interpretability therefore relies on “liberal application of post hoc ergo propter hoc” (Suchman 1987, 83). The response that a listener can have to the voice of the train defect enunciation system is not only a response to the words uttered by the product. It also involves a complex series of judgments that include assessments of the information available and how to integrate this into what else the listener knows of the event at hand.
The understanding of talking products does not come so much from the words found at what is popularly conceived as the human-machine interface, but beyond this. The voice is a voice embedded in a network of local control, sequential ordering, interactional organization and intentional attribution. The recordable chips with which we can have a dialogue with ourselves, in which the control remains local, best demonstrate this. These products frame the understanding that we are talking with ourselves through our products.
Whereas dialogue is conversation with another agent, one who is somehow there, monologue is characterized as written speech, inner speech or rehearsed speech. Dialogue implies immediate unpremeditated utterances, whereas monologues are written speech lacking situational and expressive support and therefore require more explicit language. Questioning the abstraction of speech in voice chips does not demonstrate that speech is uniquely human. On the contrary, the stabilized voices of hardware-based speech are subject to reinterpretation, and rediscover the listener’s capacity, not the speaker’s incapacity. It may simply be viewed as a distinction between dialogue and monologue, neither of which is more or less human. Because we inhabit both sides of a dialogue, we can understand the voice chip’s position and compensate, so as to perform dialogue with ourselves.
Can We Summarize What They Said, and What Sort of Response This Suggests?
This essay has so far examined the unique position of voice chip products, differentiating them from the background noise of contemporary culture and other technological configurations that deliver speech. These hardware-bound voices are not broadcast and have no stable identity. The survey of what the voice chips say produces typologies that suggest further modes of investigation into how we understand and use these voices, where they appear, and what their voices mean. The short product life cycle of the consumer electronic devices they inhabit positions these products as the E. coli of sociotechnical relations and can demonstrate the formation of product identities and product voices in our shifting understanding of machine interaction. The appearance of voice chips in some types of products and not others, some social sectors and not others, is open to further investigation. Detailing these would reveal the voice chip’s oral history of the process by which the very ephemeral social device of speech becomes stabilized and entered into systems of exchange.
Now I will introduce a complementary examination of speech recognition chip sets, around which there is much more recent product development activity (measured by patents filed for novel devices that incorporate speech recognition). Although the voice chip’s applications may have peaked, the equivalent low power, distributed speech recognition function may be just beginning. Watching their development and deployment carefully, asking, “Now that we can talk to our products, what do we say?,” may allow us to hear the social scripts they presume. Can these provide evidence that symmetry between the ambitions for human and nonhuman attributes holds?
However, because we are more self-conscious about speaking than listening, this may be an instrument through which to observe our own roles in sociotechnical interaction. In order to prime this investigation, and because speech recognition chip sets are not yet (and may never be) widely available, the author hosted a competition to survey a range of applications. The competition was advertised on a large mailing list (12,000 members): the Viridian list owned and carefully managed by science fiction writer Bruce Sterling. The list is a forum for discussing technological futures with an emphasis on addressing environmental problems. Entrants were asked to propose speech recognition interfaces to an existing product (the prize was a voice note taker and the prototyping of the proposed device). Just under three hundred designs were submitted and will soon be available on the web site www.cat.nyu.edu/neologue. While these entries cannot be claimed to represent the conceptions of human-computer interaction distilled by the social forces of the market, manufacturing and advertising, they can be treated as evidence of technological desires and cultural expectations.
Now That We Can Talk to Our Devices, What Do We Say?
The most striking feature the competition entries demonstrated was the explicit intention to effect social change with technological change. This may or may not be peculiar to this list (which might be tested by hosting a similar competition in other contexts); however, this is consistent with a popular techno-determinism that attributes social change to technological change and under-represents the dominant forces of product innovation that can be attributed to sustaining and continuing a corporate entity (product innovation for corporate continuity – assessing the life expectancy of corporate products). This also contradicts other popular understandings and lay rationalizations that new products arise to address preexisting social “needs” or profit opportunities, to follow fashion, or to optimize existing applications.
We can summarize the trends illustrated by the proposed products and product interfaces, which are predominantly the desire for social and individual envisioning and regulation. This is apart from the ultimate (and theatrical) control fantasies that this particular type of interface engages (e.g., on saying “Showtime,” the lights dim and the television and VCR turn on [A.M.Dixon@shu.ac.uk]), or the suggestions that dispensed with buttons (e.g., the TV remote [A.M.Dixon@shu.ac.uk]) without explicating what words to use. Entries that did not explore what happens in the translation from finger-button to voice-button and the social (and observable) spectacle this makes did not render the sociotechnical relationship this investigation was trying to identify. There were also the applications that were similar to voice chips – with a similar interchangeable use of speech/buzz (e.g., the cookie container that recognized children’s footsteps to trigger singing, or the TV remote that called out “Polo” when it heard “Marco”).
In addition to self-observation, regulation and control, the applications took on moral, physical, emotional, and consumption monitoring and regulation, in such forms as:
A wallet that recognized words and dispensed consumption regulation;
A pocket device that recognized the phrase “now what am I supposed to do?” and responded “with a gentle reminder to adhere to the user’s selected ethical set” (regulation of consumption);
A coffee maker that recognized “good morning” (“when you respond, the chip analyzes your tone of voice [for sluggishness]” and “adjusts the strength of the coffee…,” thus automating the physical regulation on which Starbucks has so successfully capitalized);
A more extreme circumvention of one’s own self-judgment: a device that monitors bloodflow and when detecting stress whispers “‘relax,’ dims the lights a bit, and releases soothing aromatherapy” [Andre French: Afrench@iss.net];
And the very opposite of an alarm clock, a device that on hearing, “Why am I still up?” “…should cause every light and entertainment system in my house to shut off for 4 hours.”
An example of self-observation was a voice-triggered “nocturnologue,” which would record any sleeptalking.
These devices to regulate the self, presumably with the goal of social synchronization, do not necessarily imagine the devices as “companions” and attribute to them a more social performance, although there is a small subset that do. This subset of entries realizes the “technology-should-be-more-humanlike” expectation, which reflects a similar school of Human Computer Interface (HCI) designers working towards adaptive interfaces that can recognize and respond to different emotive states as an explicit strategy to be “user-friendly.” The best example is a comedic sidekick (Jerry Lewis) built into a watch and ready with smart rejoinders to recognized phrases (when it hears “nice hair,” the device says “cha cha cha”). This functionality would have to be described as reinforcing social performance. (This is a version of the gestural value of handheld and portable devices identified and described in a study involving the ethnographic examination of filmic depictions of the use of handhelds [Jeremijenko 1992].) This seems both similar to other identification relationships (cars, furniture, home), and different, insomuch as it is directly inserted into the conversation.
The promise of emotive interfaces that recognize and respond to how you are feeling (for example, work at the MIT Media Lab’s “Affective Computing” research group), if these imagined interfaces are any evidence, was demonstrated and expressed in words that describe an ambivalence toward, even resentment of, technological relationships: for example, being able to say “shut up” to your television set [Vaclav Barta] or to your telephone [Michael Butler] (not “turn off,” not “close/finish” or other ending command). Clearly, this complicates the sort of understanding we can develop about a person’s relationship to a purchased product – and purchasing is of course the predominant form of “feedback” that companies and designers get about products. These voices make audible a strongly polarized ambivalence. There was no suggestion of saying, “I love my TV,” to turn it on.
Another device was proposed for automated prayers: triggered by saying “pray for me,” it was customizable for different religious “preferences.” (It is peculiar to refer to a religious “preference” as if it were another consumption category – are religious and addictive behaviors subject to the same economic characterization?) Prayers suggested ranged from excerpts from Psalm 23, to those for “cynical hipster types [who] might want their in-dash prayer boxes to recite William S. Burroughs’ Thanksgiving Prayer (‘Thanks for Indians, to provide a modicum of challenge and danger… thanks for a nation of finks…,’ etc.), and some guilty white liberals (some Viridians, even) might want theirs to apologize for driving around in a vehicle spewing noxious fumes into the atmosphere” [Jon Lasser]. This is more than an interface that recognizes and responds appropriately to user emotional states; actually the entertainment is in delegating the emotionality or at least religiosity itself to the device.
This impulse is replicated in the delegation of care, social niceties and other arational and noncalculative tasks to the computational devices; for instance, a speech recognition chip that recognizes the sound of flatulence and politely apologizes to the room [Michael Butler], relieving any one person of the responsibility to bear the embarrassment. Another entry, as an extension of Tamagotchi-like automation care, suggested using a voice recognition chip to train a parrot. There were actually several other entries exploring information technology for animals, which seems to be evidence against a voice interface imagined as “humanizing” the computer, and more a demonstration that the ready treatment of animal noises as recognizable sounds imagines these as functionally equivalent in every way to English words. Speech recognition, reinterpreted as sound recognition.
Finally, perhaps the most interesting or novel constellation of projects is the set of designs that use the opportunity to script interactions as a form of propaganda – propaganda that is distributed (enacted) beyond traditional and corporate monopolized media channels. The portable ideologue could play the role of (even potentially look like) the […]
Another device, the BackTalk, is a portable billboard for one’s car. Triggered by simple spoken words, it uses (as the entry suggests) deep-set LEDs to display a message specifically to the driver behind one’s own car: “Thanks for letting me in,” “Baby on board,” or presumably any other bumper sticker expression. This is intended to influence others, and thus belongs in this category of the regulation (or at least influencing) of others.
These propaganda projects take very direct and explicit forms, including cell phones which, for example, cut out if they hear you say, “Yeah, I am on the cell phone,” “Yeah, I am in the village,” or “Dude,” or monitor for swear words, or take other efforts to silence loud or otherwise “inappropriate” private voices in public spaces.
This impulse for social observation is illustrated by a museum display designed to collect responses (what the entry calls clichés) so that it “will grow as an open-ended accretion or demonstration of the clichés uttered by thousands, tens of thousands, millions of art consumers.” This collection is itself the spectacle; the museum exhibit is rethought of as an instrument for the collection of comments.
Another suggestion was the “crowd morality barnacle,” which is a device intended to influence mass behavior – in the given example, a riot. The CMD is intended for distribution throughout a crowd and will respond to key riot phrases; for example, it might respond to “smash” with “be careful”; “burn” with “it might explode”; or “get them” with “where are the children?” [Dave Whitlock]. This is a different conception of regulation than the examples that illustrated the control of self.
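The script such a device presumes can be made concrete in a few lines. The following is only a minimal sketch, assuming the speech has already been turned into text by some recognizer; the names `RIOT_RESPONSES` and `respond` are hypothetical, and the three phrase pairs are the ones quoted from the entry, not a specification of the proposed device.

```python
import re

# Hypothetical trigger table: the three phrase pairs quoted in the entry.
RIOT_RESPONSES = {
    "smash": "be careful",
    "burn": "it might explode",
    "get them": "where are the children?",
}


def respond(transcript, table=RIOT_RESPONSES):
    """Return the canned reply for every trigger phrase heard in the transcript.

    Assumes `transcript` is text already produced by a speech recognizer.
    """
    heard = transcript.lower()
    replies = []
    for phrase, reply in table.items():
        # Match whole words only, so "burn" does not fire on "auburn".
        if re.search(r"\b" + re.escape(phrase) + r"\b", heard):
            replies.append(reply)
    return replies


if __name__ == "__main__":
    print(respond("Smash the windows!"))
```

The table makes the point of the preceding sections visible: whatever is said around the trigger phrase, the content of the reply does not change. The device localizes when it speaks, not what it says.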
To effect self-control, the designs went beyond turning electronic devices off or regulating the self with insistent and unrelenting reminders (e.g., correcting a habit of speech or cutting the “ums” out of the story) to quite novel punishment. These punitives enacted on the self included squirting water in one’s ear, triggering electric shocks, and dribbling water down one’s leg. There were few viable designs that offered a simple reward rather than punishment.
For affecting the social body, there were no physical punitives; the reward seems to have been the social behavior itself, or at least the evidence of it (as in the spectacle of clichés). This desire to see a social spectacle is repeated often, and I would like to argue that it is a recurrent theme in the networked context of information technology.
The final category of devices relies on double entendres and the multiple meanings of words, and demonstrates that speech interface cannot be understood as making the machine more human. Rather, it is clearly exploiting the different parsing, context sensitivity and repeatability of human-vs.-machine models of cognition. For example, to trigger the discreet recording of conversations, one entry describes a recorder that is triggered by “What’s up, amigo?” This deployment of an unusual (relative to the user and context of use – i.e., no one else is likely to say it) filler is used to initiate conversation and direct attention to the people being addressed, but is simultaneously being used for an instrumental purpose: as the “on” button. Likewise the “Don’t hurt me, just don’t hurt me” cell phone/GPS position locator/911 dialer proposal [Dave Whitlock], which uses the self-defense phrase to dial for help without alerting the presumed attacker, who is presumed to interpret the plea at face value – second-guessing a reasonable or “usual” response in a threatening situation. In these interactions the user is able to simultaneously employ multiple meanings of his or her words. Clearly the speech chip is here being used so that the words used to interact with the machine are understood to be different from the speech used to interact with humans.
It is also notable that there were categories of speech not explored by these interfaces. Consider the linguistic communication defined as a performative. A performative, such as “I do,” is a highly codified and stabilized utterance that communicates a future commitment or social contract (Butler 1993). Because it is a stabilized social technique, it would be technically pragmatic – the problem of unlimited variation of phrasing is solved. The absence of designs to address this sort of statement is curious, and worth further investigation.
The categories of interaction demonstrated by this brief survey of voice chips are not discontinuous or radically different from other contemporary consumer technologies. The observation of self (or one’s own property) is embodied in the consumer video camera market and surveillance systems; self-regulation has extended from alarm clocks once a day to alarming cell phones carried with you and ready for all alarming occasions; handhelds directly regulate sleep and activity; VCRs and TiVo capture, regulate (in order to extend) and meter out media program consumption.
Social observation is also embodied by surveillance systems, but although surveillance looms large in the popular imagination, it has not been used to see or envision the social mass, or one another. The problem of seeing the social body has remained an architectural problem, solved by spectacles of plazas and malls: public and quasi-public places. What the voice chip most clearly demonstrates is that it is this area in which there seems to be the most interest: being able to view mass behavior. The traditional broadcast (e.g., television) media had very little interest in rendering the public to itself, and as such the rise of phone-in and “reality television” genres suggests that even in the context of high production value broadcast media there is a cultural appetite to “see” each other, no matter how contrived. The collaborative filtering models, such as that popularized by the Amazon people-who-bought-this-book-also-bought-x button, show us each other’s behavior, to make it a shared experience – to see where others have been. Like the micro-casting of a speech recognition-triggered rear window car display, we see this desire expressed through the car, and the car’s peculiar access to the public space of freeways. This is a public space where the rules of communication between and among people are highly constrained (cf. the plaza). This is not the interactive experience of the self with the self, or the self with the machine, but the machine as a proxy for interacting with the social. This is a peculiar and interesting way to think about human-machine interaction.
The interactions we hear with voice chips do not disambiguate the buzzes and beeps used by speechless machines, but speech recognition products do reinforce the idea that we use speech for machines and speech for humans differently, and simultaneously. The other applications also re-imagine how we understand their functions. The products discussed do not exploit the mechanistic, logical and fully controllable functions of machines, but treat them as complicated multifarious social actors. There is a clearly stated desire to enlist these new technologies and product interfaces to promote explicit desired social transformations. We also see here the ambivalent relationship we have with and for our current technological devices.
This essay has explored why listening to voice chips and speech recognition chips might give us a way to examine human-machine interactions in situ. Much real complexity of social and technical interactions is lost in the tradition of examining them within controlled laboratory contexts, and ethnographic analysis can be too rich (though the theoretical perspective that has developed from ethnographic insights, which privileges the improvisational nature of real-world applications, enables us to focus on how speech and turn-taking are used to coordinate the interaction between machines and humans).
This initial analysis is presented in order to set up some preliminary ideas and interpretations, so that as (or if) speech recognition chips become more widely distributed, we can “tune in” to this particular historical moment and hear what it is we expect, want and bring to our human-machine interactions. There are few instruments that give us this viewpoint. Listening to our daily interactions with products can work to contest and complicate the dominant methods used to describe technological trends and patterns of product innovation: demographically driven mass market research and the capture of consumption behaviors at the point of purchase. The examination of speech recognition applications gives unique access to the assumptions, expectations and the imaginative work of products and the interactions they script.
Further examinations of voice chip and speech recognition products and patents can extend what has only just begun. In understanding how voice chips abstract speech, we can examine what we understand interaction to be, and hence how we design and frame interactions in products of daily use, reproducing our understanding of human technical relations. The products make obvious the design assumptions with which they are built, but further investigation of the details of their use will help to elaborate how these micro-interactions perform and realize actual social roles and social structures. A detailed use-analysis of any one of the products could provide further insight into this sort of investigation.
Voice chips also raise other questions. Because they slice through many social and economic sectors but are still in a manageable population of products, they can be used to illustrate the iterative and continuous process of technical change that is intimately involved in a technology’s sociality, in contrast to the radical discontinuities of technological change through discovery and paradigm shifts (Dosi 1982, 147-162; Clark 1985, 235-251). They realize a recombinant model of technological change. Furthermore, for the same reasons, they can be used to examine the changing social position of these products in relation to the configuration of power and work relations (Zuboff 1984), and the transformations of the market groups and users that these products presume.
Finally, in the tradition of Turkle’s examination of children’s understanding of their interactive machines, children’s products with voice chips can illustrate what childcare roles we delegate to machines, and articulate clearly the hardwired (per hardware, not neurons) expression of consumption identity of children.
For these reasons, this essay marks the beginning of a project to collect an ongoing database of products with voices or speech recognition that appear on the market, or receive patents. (As noted above, this list will be available at http://cat.nyu.edu/neologues and updated constantly.) It will include images and product literature and, when possible, an audio file recording of the voices. As a longer archive of product voices, this may prove a valuable resource for the examination of changing sociotechnical relations, even in the event of the products falling silent and voice chips and speech recognition being abandoned altogether.
The voices of the products reflect back the voices and interactions we have projected and programmed into them, returning them for our reinterpretation. One mode of interaction we have with the consumer products that exist and are imagined at the time of this essay is a dialogue with a monologue. Command and control scripts are more common than improvisational scripts, but other forms of interaction are being scripted. By literally listening to what hardware has to say, and what we say to it, we may better ground our assumptions of interaction in reflexive reinterpretation. Furthermore, we can see from this examination that these technologies can be seen as structures of participation, organizing often indistinguishable human-machine interactions and using them to extend the predictability of individuals and coordinate their interactions. We have an ongoing opportunity, even method, to hear and understand our technologies in terms of these structures of participation, in our own language, and to see these technologies as a distributed system of voices and ears.
Althusser, Louis (1971). “Ideology and Ideological State Apparatuses (Notes Towards an Investigation).” In Lenin and Philosophy and Other Essays. New York: Monthly Review Press.
Austin, J.L. (1962). How to Do Things with Words. Oxford: Oxford University Press.
Bardini, Thierry (1997). “Bridging the Gulfs: From Hypertext to Cyberspace.” Journal of Computer-Mediated Communication 3, no.2 (September, 1997). http://www.ascusc.org/jcmc/vol3/issue2/bardini.html.
Benveniste, Émile (translated by Mary Elizabeth Meek) (1971). “The Nature of Pronouns.” In Problems in General Linguistics. Coral Gables: University of Miami Press.
Butler, Judith (1993). Bodies that Matter. London: Routledge.
Callon, Michel (1995). “Four Models for the Dynamics of Science.” In Handbook of Science and Technology Studies, edited by Sheila Jasanoff, Gerald E. Markle, James C. Petersen and Trevor Pinch. Thousand Oaks, CA: Sage Publications.
—., and John Law (1982). “On Interests and their Transformations: Enrollment and Counter-Enrollment.” Social Studies of Science 12 (1982): 615-625.
Clark, Kim (1985). “The Interaction of Design Hierarchies and Market Concepts in Technological Evolution.” Research Policy 14 (1985): 235-251.
Cowan, Ruth Schwartz (1987). “The Consumption Junction: A Proposal for Research Strategies in the Sociology of Technology.” In The Social Construction of Technological Systems, edited by Wiebe E. Bijker, Thomas P. Hughes and Trevor Pinch. Cambridge, MA: The MIT Press.
Dosi, Giovanni (1982). “Technological Paradigms and Technological Trajectories: A Suggested Interpretation of the Determinants and Directions of Technical Change.” Research Policy 11, no. 3 (1982): 147-162.
Dourish, Paul (2001). Where the Action Is: A History of Embodied Interaction. Cambridge, MA: The MIT Press.
Fabbri, Franco (1981). “A Theory of Musical Genres: Two Applications.” In Popular Music Perspectives, edited by David Horn and Philip Tagg. Gothenburg and Exeter: International Association for the Study of Popular Music.
Fish, Stanley (1980). “How to Do Things with Austin and Searle.” In Is There a Text in this Class? The Authority of Interpretative Communities. Cambridge, MA: Harvard University Press.
Geertz, Clifford (1973). The Interpretation of Cultures. New York: Basic Books.
Jeremijenko, Natalie (1992). “TITLE.” Palo Alto, CA: Xerox PARC internal publication.
Latour, Bruno (1987). Science in Action. Cambridge, MA: Harvard University Press.
—. (writing as Jim Johnson) (1988). “Mixing Humans and Nonhumans Together: The Sociology of a Door-Closer.” Social Problems 35, no.3 (1988): 298-310.
Minneman, Scott (1991). The Social Construction of Engineering Reality. Ph.D. Thesis, Stanford Department of Mechanical Engineering Dissertation, Stanford, CA.
Oswald, Laura (1996). “The Place and Space of Consumption in a Material World.” Design Issues 12, no. 1 (Spring 1996).
Schegloff, E. (1982). “Discourse as an Interactional Achievement: Some Uses of `uh huh’ and Other Things that come Between Sentences.” In Georgetown University Round Table on Language and Linguistics: Analyzing Discourse Text and Talk, edited by Deborah Tannen. Washington, DC: Georgetown University Press.
Searle, J. (1972). “What is a Speech Act?” In Language and Social Context, edited by P.P. Giglioli. Baltimore: Penguin Books.
Shields, Rob (editor) (1992). Lifestyle Shopping: The Subject of Consumption. New York: Routledge.
Suchman, Lucy (1987). Plans and Situated Action: The Problem of Human-Machine Communication. Cambridge: Cambridge University Press.
Tagg, Philip (1979). Kojak – 50 Seconds of Television Music. Towards the Analysis of Affect in Popular Music. Göteborg, Sweden: Studies from the Department of Musicology, University of Gothenburg.
Turkle, Sherry (1984). The Second Self. New York: Simon and Schuster.
Willis, Susan (1991). Primer for Daily Life. New York: Routledge.
Zuboff, Shoshana (1984). In the Age of the Smart Machine: The Future of Work and Power. New York: Basic Books.