Simon Penny adds object-context to the talking machines of Natalie Jeremijenko's essay.
Talking greeting cards, talking ashtrays, talking toilet roll holders: these are some of the more alarming products of the information economy, all the more alarming for their utter triviality and transience. Natalie Jeremijenko's analysis of voice chip products (and secondarily, of voice recognition products) attempts a sociology of machines integrated into the human social circuit. This integration occurs on the basis that the machines engage, in some sense, in speech and speech acts. As she points out, this speech, lacking any but the most rudimentary sentience, confounds theories of speech and language.
In this radical and interdisciplinary exercise she draws on various theoretical resources, including the speech act theory of Searle and the studies of situated cognition by Suchman that have been so influential in HCI and post-AI. Appropriately, her analysis is grounded in Latour's Actor Network Theory, which speaks in terms of human/animal/object hybrids and confounds conventional notions of human agency and mastery (voice chips give some of these non-human actors a "voice"). Drawing on such diverse theoretical resources is an essential part of any interdisciplinary study. It is a richly productive strategy, and one that will reveal contradictions and unsuspected voids; resolving them can be as engrossing as the field to which the theories are applied. Jeremijenko does not go far in this direction, nor does she pretend to - she is very clear about the preliminary nature of the research. It is a scaffold: a collection of field research, some tentative categorization, some relevant theoretical correlation, some hypotheses.
Her premise is that voice-capable consumer technological widgets warrant consideration as a class of their own. Initial questions are therefore concerned with boundary conditions. What places (most of) these widgets in a separate class from other machine speech systems, such as automated telephone reception systems or desktop computer text-to-speech software? She observes that a defining quality of these devices is that the voice is not interpreted as being a message from a person separated by time or space from the listener: it is the voice of the device. It is not "recording" and it is not telematically facilitated conversation (phone) or address (radio and TV). Apart from her reference to talking elevators and car alarms, the separate class is defined by three characteristics: their physical manifestation as handheld commodities; their utterly rudimentary behavior; and the fact that they (seem to) speak as agents and not conduits. But while the voice may be synthesized, the system, like any interactive system, runs code devised by a human designer, in which behavior is "recorded" and which is a representation of the desire of the designer. The limited capability of these talking devices relegates them to the station of bacteria or algae in the taxonomy of artificial life. They are hardly "autonomous agents."
Jeremijenko does not seem to be making the claim that the solid state nature of the chips themselves is particularly significant, except inasmuch as it lends robustness, portability and low power consumption to the devices. But the chip itself does not sound, does not trigger itself, does not power itself. Her use of the term "voice chip" seems to stand metonymically for a larger technological complex in which the voice chip is embedded.
The fact that they are handheld alerts me to the importance of their physically instantiated nature. They are meant to be held, to be carried, to be spoken into or listened to. They take part in the complex choreography of embodied relation with the world. I am cautious of her isolation of the capacity of speech synthesis from other aspects of the devices. Is she committing what we might call the "Artificial Intelligence Fallacy": reductively isolating a component of sentience as being primary and disregarding the remainder? In the early history of AI, the question of sensor and effector integration with the world was cast aside, either because it was too hard, or because it was not deemed to be of prime significance in the quest for "intelligence."
If the decision-making capability of these talking devices is rudimentary, then so is their integration with the outside world. Their knowledge of the world tends to be limited to a set of one-bit signals from immediately local finger-pressed buttons. Many of the hypothetical devices referred to in the latter part of the paper depend on unspecified and often technologically non-trivial sensor arrays. More than the use of synthetic or sampled voice as output, it is the sensor arrays and the integration and processing of sensor data that generate the semblance of sentience. A textual display, combined with such sensor processing, is unlikely to be deemed significantly more stupid than the same device with voice output.
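The poverty of this world-integration can be sketched in a few lines. The following is a purely hypothetical illustration (the event names and phrases are invented here, not drawn from any actual product): the device's entire "knowledge of the world" is a handful of one-bit input events, and its entire "behavior" is a fixed mapping from those events to canned utterances.

```python
from typing import Optional

# Hypothetical talking widget reduced to its essentials: one-bit input
# events mapped to canned utterances. Nothing here senses, decides, or
# integrates anything beyond a button press.
PHRASES = {
    "lid_opened": "Hello!",
    "button_pressed": "Have a nice day.",
    "low_battery": "Please replace my batteries.",
}

def respond(event: str) -> Optional[str]:
    """Return the canned utterance for a one-bit input event, if any."""
    return PHRASES.get(event)
```

Whatever semblance of sentience such a device projects lies in the voice itself, not in this trivial mapping.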
As she observes, these devices use machine speech mostly to replace an alert, an LED blink, piezo bleep or text display. Technologically, voice output is simply a display technology. It gives the impression of human sentience, in the same way that a photograph gives a recognizable representation of a face. People who talk to photographs are considered loonies (yet people do look at photographs, and discern human qualities).
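The interchangeability of voice with an LED blink or a piezo bleep can be made concrete in a short sketch (again hypothetical; all names are invented for illustration). The device logic calls a generic "display," and voice is simply one backend among others.

```python
from typing import Callable

# Three interchangeable "display" backends for the same alert.
def led_blink(msg: str) -> str:
    return f"[LED blinks] ({msg})"

def piezo_bleep(msg: str) -> str:
    return f"[bleep-bleep] ({msg})"

def voice_chip(msg: str) -> str:
    return f'[synthesized voice] "{msg}"'

def alert(msg: str, display: Callable[[str], str]) -> str:
    # The device logic is indifferent to the output modality:
    # voice is just one display technology among others.
    return display(msg)
```

Swapping `voice_chip` for `led_blink` changes nothing in the device's logic, only the impression of sentience it gives.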
The goals of this study seem to be substantially sociological, though it is sociology "in the expanded field" to be sure, and is informed by deep technological fluency. A sociological analysis of these devices would be augmented by a discussion of the kinds of activities and behaviors they facilitate or which emerge around them, and the kinds of social activity they distort, suppress or prevent. Like the contemporary youth culture of "texting," the phenomenon of the talking widget calls for an in-depth study, perhaps in the manner of Carolyn Marvin's fundamental text "When Old Technologies Were New." Other relevant current literature lies in the "socially intelligent agent" field, such as the work of Kerstin Dautenhahn, Phoebe Sengers, and others.
In her discussion of recording devices, she notes that since the voice spoken by the machine is the user's voice, interaction constitutes a dialogue in which "we can understand the voice chip's position." But while this is clearly true, it substantially complicates the issue of who or what is actually speaking. It is no longer a device speaking as itself. This erodes the premise upon which her classification is based. What then distinguishes these recording devices, on the level of social function, from handheld tape recorders?
She compares the chips (or the chip-enhanced widgets), which have no significant cultural history, with the entire techno-cultural history of the music recording industry. The comparison is therefore lopsided, but it again suggests that further historical inquiry might be instructive. What of the sociology of automated telephone systems and talking "point of sale" devices? Of dictaphones and even stenography? Talking toys, as she notes, have a long history. In fact, Edison's first application for his wax roll phonograph technology was a talking doll. The autonomous machine voice, in the form of a talking toy, predates any other application of stored voice technology, including the entire recorded music industry.