Natalie Jeremijenko asserts that machine speech should re-awaken us to "the peculiar structure of participation that we take for granted."
There are at least two different questions here. The first: what could we say to things (or they to us)? Or, in Lucy Suchman's terms, what new sociomaterial connections can we invent that don't reiterate old subject/object divides? The second: what do voice chips say (what do things say to us, and we to them)?
To explore the answer to the first question, I refer the reader to a series of speech recognition interfaces that I built for an exhibition - Neologues - at the Storefront for Art and Architecture (1999), a design project that motivated my fascination with voice chips.
This exhibition was a series of functioning devices based on voice chips and speech recognition. These included a light switch actuated by your voice rather than your finger. To toggle the switch, you had to put your hands on your temples and say the words "mind power," parodying the ambitions of the Human-Computer Interaction field. The light switch would toggle on. However, the light nearby would not go on. To operate that, you had to say the word "click" brightly. (Crisp plosives are easier to recognize.) This speech recognition chip made you, the human, perform like a switch. Observing one's own performance in this simple role was entertaining to most participants, or at least self-consciously silly.
There was an elevator plate you operated by saying "up" or "down." The only trick was that you had to say these in Spanish, which left most viewers going nowhere as they swallowed the normative function of the speech recognition chip. There was also a speech recognition interface for dogs: the device translated a dog's bark into a human voice that said, "I am your loyal servant" - challenging the human-computer interface and its privileging of human cognition with a dog-computer interface that dogs seem able to use without a user manual. Speech recognition works well on barks.
There was a functioning prototype of an adapted handgun, the safety latch of which unlatched only when it recognized the word "Diallo" (the young African immigrant who was shot at 41 times for pulling out his wallet). Not just a one-liner, this device explored how the particular history of a device might be embedded using a voice chip, and it has been proposed to the NYPD.
Another was a bomb that would detonate at the word "Bang," although - warned the note beside it - the operator would be held liable for any damage or injuries caused. The possibility of operating this not-quite-so-friendly user interface with such childlike ease dramatized the peculiar structure of participation that we take for granted. The entire interaction can be neatly scripted by corporations who stand to profit from it, in the same way that I scripted the interaction with the bomb. But it is the obsequious, obedient user, behaving in exactly the way intended, who is held responsible for pulling the trigger, liable for the entire sociotechnical system. Such is the fetish of agency.
I elaborate these examples because, in exploring other ways to script interactions, they embody the "sociomaterial connections" of the first question. These alternative designs and prototypes exploit the generative aspects of analysis. However, the project of this essay was to answer the second question: to make sense of the material culture in which we are currently immersed. With regard to this mess, Suchman is right. (Suchman is always right, it seems!) It is her analysis of interaction as the "contingent co-production of the sociomaterial world" that is right (even though I omitted the more structured interaction of the interview, which I thought more appropriate to apply to voice chip interaction). As soon as we read her careful work, we are struck with a deep recognition drawn from our own experience of interaction. Yet the voice chip implementers and patent filers do not seem to know her work - they have it wrong! The model of interaction embedded in the voice chip is a parody of our own.
Yet why do they persist? Why do they still appear and reappear? What cultural expectation recharges them; what reinspires designers, over and over, to deploy them? These voice chips treat voice as simply words, requiring no identity, judgment or intentionality, no responsiveness to the sequence of exchanges that situates meaning; they treat interaction as triggered action/reaction that can be implemented with a sensor or two; they use pre-scripted voices and triggering systems, and deploy them as human stand-ins. Although this is wrong (are you ever struck by the rightness of an utterance from a voice chip?), it is so obviously wrong that it is funny. But the point is not whether they are right or wrong. The point is: they are there, they persist, and they keep appearing.
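To make the caricature concrete: the model of interaction embedded in a typical voice chip amounts to little more than a lookup from sensor trigger to canned utterance. What follows is a minimal sketch of that logic in Python; the trigger names and phrases are illustrative assumptions, not any actual product's script.

    # A sketch of the interaction model embedded in a typical voice chip:
    # a sensor event triggers a prerecorded phrase, with no identity,
    # judgment, or responsiveness to the sequence of exchanges.
    # Trigger names and phrases here are illustrative assumptions.
    PRESCRIPTED = {
        "door_opened": "Welcome.",
        "floor_reached": "Third floor.",
        "step_detected": "Please mind the step.",
    }

    def voice_chip(trigger):
        # The same trigger always yields the same utterance, regardless
        # of who is present or what was said before.
        return PRESCRIPTED.get(trigger, "")

    for event in ["door_opened", "step_detected", "step_detected"]:
        print(voice_chip(event))

The whole apparatus fits in a lookup table; that is the point, and the parody.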
So when I claim that the voice chips are direct evidence of interaction, I mean that they are in the sense that a caricature is direct evidence: recognizable, descriptive, reductive and absurd - but not correct, and certainly not comprehensive. The voices in the whole array of products reviewed in the essay are, I think, very effective caricatures of what we expect from information technologies. And what they represent is exactly the idea that there are discrete components (e.g. voices) that assemble into "interactivity," or compose intelligence. They are exactly an embodiment of what Simon Penny refers to as the AI fallacy, and moreover they make it look silly. They parody the idea of pre-scripted interactivity precisely because they perform it; they parody linguistic theory because they fulfill its categories; they mock us all, incessantly. The voice chips have none of the glamour or scientific seriousness associated with sensor-effector apparatuses, Artificial Intelligence or User Interface Design. They provide a rare opportunity to examine a technology we produce that looks patently ridiculous and somewhat stupid. Few technologies under analysis have such a general aura of the unnecessary, of excess marginality, or such a peripheral relationship to "function." In general we prefer to think about sociomaterial technology through the lens of heroic, gleaming nuclear missiles, or complex neural nets, or simply important work technologies, rather than silly voice chips.
So the reductive move that both Suchman and Penny protest is in fact quite deliberate, even the raison d'être of the work. The singling out of voice chips from other things with interactive properties, like texts and graphics (Suchman), or my effort to delineate them from technologies with much richer cultural contexts and histories, like the broadcast and recording industries, is, as Penny correctly points out, not watertight. But this singling out explicitly provides the opportunity to play a game. The game is called: let's pretend voice chips are interactive; let's take them at face value; let's take them seriously; let's pretend that they are interesting to listen to; let's put aside our well-developed coping skills for tuning out elevators that incessantly state the obvious and escalators telling us to mind our step - as if they care. Let's instead play this game and seriously listen to voice chips - as if they were voices with histories and futures and stakes and agency, as if they were the voice of our collective investment in technological material culture, the mirror of our desires.
Okay, now walk into a shopping mall, or go about your daily activity, and actively listen to these amputated voices. We start to realize that these voices are an alarm sounding; we start hearing other things in them... We listen for character and we hear a command-and-control logic; we hear the control we have relinquished in trivial but crucial ways (when we think of the mass); we can hear the simplification of interaction that the designers intend; we can hear the voice of (from) the other side. Then the experience of voice chips actually does become enriched, because in the interactive co-production of conversation we make up for the errors they enact - we compensate, just as Lucy Suchman suggests. If we keep playing, perhaps we can question the very future of our technologies, without the glare, glamour and glimmer of complex systems.
Penny and Suchman are two of the most coherent and cogent theorists of the mess of technosocial interaction, and voice chips ratify their work. Voice chips also demonstrate their own limited repertoire of interaction scripts; if they are to emerge as a genre of interaction, there must be, or should be, alternative structures of participation.
======
Neologues: Lightswitch Interface Instructions
To operate this light switch, place your hands on your temples and clearly say "mind power." This will activate the switch (i.e. it will toggle) but will not turn on the light. Other uses of "mind power," such as computer control through EEGs, share this concrete command functionality without the capacity for nuanced verbal control.
Neologues: Light Interface Instructions
To operate this light, say "click" brightly. This configures/scripts the user to perform as if he or she were a switch, like many "interactive" technologies.
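A minimal sketch of the scripted exchange these two light interfaces enact, assuming some speech recognizer that yields one recognized phrase at a time. Only the vocabulary ("mind power," "click") comes from the instructions above; the recognizer stand-in and state handling are illustrative assumptions.

    # Sketch of the Lightswitch/Light interaction logic. Only the
    # vocabulary ("mind power", "click") comes from the instructions;
    # everything else is an assumption for illustration.
    switch_toggled = False
    light_on = False

    def on_phrase(phrase):
        # Each recognized phrase maps to exactly one scripted action:
        # the user must perform as if he or she were the switch.
        global switch_toggled, light_on
        if phrase == "mind power":
            switch_toggled = not switch_toggled  # the switch toggles...
            # ...but, by design, the light does not respond to it.
        elif phrase == "click":
            light_on = not light_on  # crisp plosives recognize easily

    for heard in ["mind power", "click"]:
        on_phrase(heard)
        print("switch:", switch_toggled, "light:", light_on)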
Neologues: Elevator Interface Instructions
This elevator recognizes "up" and "down" in Spanish only. There is no English-language override, leaving many people stuck.
Neologues: Dog Translator Instructions
The appropriate dog growl is translated into human speech that says, "I am your loyal servant." This addresses abstract reasoning capacities in dogs and, in so doing, defies human-centric views of interaction.
Neologues: Bang Interface
Tele-operation of a bomb scripts user interaction as if the user were responsible. Although he or she did not design the interaction, nor place the bomb, and can only obediently follow instructions, it is the user who is held liable. This echoes the problematic technocorporate "the person who pulls the trigger" logic: while corporations profit from and script the interactions for obedient users, the user is made responsible for choices that are not entirely his or her own.
Bone Transducer Interface
A "located information interface" for delivering information on office hours and availability. The interface requires physical contact between the head (the resonating chambers therein) and the 1" diameter plate, coupling high-fidelity sound that cannot be otherwise be overheard. This transduction technology, elsewhere used in sound-compromised environments (e.g. bite interface in scuba diving) is adapted to provide a private audio environment in a semi-public context. In this case, it is embedded in the wall and positioned at kneeling height, to frame the act of actually receiving information.
Dumb PowerMeter
A domestic power-consumption meter with speech recognition. The meter displays nothing until the person guesses the first significant digit of the reading. This interaction depends on the user having an idea, being able to make an educated guess, and caring enough to know, rather than delegating to the smart appliance the job of knowing and displaying the power consumption, all the time, to no one.
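A minimal sketch of the Dumb PowerMeter's withhold-until-guessed logic. The console prompt stands in for the speech recognizer, and the wattage is an assumed reading, not a real sensor interface; only the guess-before-display behavior comes from the description above.

    # Sketch of the Dumb PowerMeter logic: display nothing until the
    # user correctly guesses the first significant digit. input() here
    # stands in for the speech recognizer; the wattage is an assumed
    # reading for illustration.
    def first_significant_digit(watts):
        for ch in "%g" % watts:
            if ch.isdigit() and ch != "0":
                return ch
        return "0"

    def dumb_power_meter(current_watts):
        while True:
            guess = input("First digit of your current consumption? ").strip()
            if guess == first_significant_digit(current_watts):
                print("Current consumption: %s W" % current_watts)
                return
            # Wrong guess: the meter stays blank. An educated guess,
            # not a glance, is the price of the reading.

    dumb_power_meter(438.0)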