Article

‘All possible sounds’: speech, music, and the emergence of machine listening


Abstract

“Machine listening” is one common term for a fast-growing interdisciplinary field of science and engineering that “uses signal processing and machine learning to extract useful information from sound”. This article contributes to the critical literature on machine listening by presenting some of its history as a field. From the 1940s to the 1990s, work on artificial intelligence and audio developed along two streams. There was work on speech recognition/understanding, and work in computer music. In the early 1990s, another stream began to emerge. At institutions such as MIT Media Lab and Stanford’s CCRMA, researchers started turning towards “more fundamental problems of audition”. Propelled by work being done by and alongside musicians, speech and music would increasingly be understood by computer scientists as particular sounds within a broader “auditory scene”. Researchers began to develop machine listening systems for a more diverse range of sounds and classification tasks: often in the service of speech recognition, but also increasingly for their own sake. The soundscape itself was becoming an object of computational concern. Today, the ambition is “to cover all possible sounds”. That is the aspiration with which we must now contend politically, and which this article sets out to historicise and understand.


... Figure 6 shows a CASA technique proposed by Zeremdini, Ben Messaoud & Bouzid (2015). It is used in the processing of sound signals in conjunction with machine learning to extract required information, often referred to as "machine listening" (Parker & Dockray, 2023). ...
Article
Full-text available
Multichannel speech enhancement (MCSE) is crucial for improving the robustness and accuracy of automatic speech recognition (ASR) systems. Due to the importance of ASR systems, extensive research has been conducted in MCSE, leading to rapid advancements in methods, models, and datasets. Previous reviews point to the lack of a systematic literature review of MCSE for ASR systems. This systematic literature review aims to (1) perform a comprehensive review of the existing approaches in MCSE for ASR, (2) analyze the performance of MCSE and ASR across various techniques, models, noise data, and environments, and (3) discuss the challenges, limitations, and future research directions in this area. We conducted keyword searches on several electronic databases, including Google Scholar, IEEE Xplore, ScienceDirect, SpringerLink, ACM Digital Library, and ISI Web of Knowledge, to identify relevant journal and conference articles. We selected 240 articles from the initial search results based on inclusion criteria and were left with 35 experimental articles after applying exclusion criteria. Through backward snowballing and quality assessment, the final tally was 40 articles: 23 journal articles and 17 conference articles. The review shows an increasing trend in MCSE for ASR, with word error rate (WER), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI) as the most common performance measures. One major issue we found is the limited generality and comparability of MCSE works, which makes it difficult to arrive at unified solutions to noise in speech recognition. This systematic literature review has extensively examined MCSE and ASR techniques. Key findings include identifying MCSE methods that improve ASR performance across various models, techniques, noise conditions, and environments.
We also identify several key areas researchers can explore in the future due to their promising potential.
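The review above scores ASR systems primarily with word error rate (WER). As a minimal illustrative sketch (not drawn from any of the reviewed works), WER is the word-level Levenshtein distance between hypothesis and reference, divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason reviews such as this one also report perceptual measures like PESQ and STOI.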
Conference Paper
Full-text available
Sleep is an essential part of health. One factor that affects sleep quality is soundscape, the acoustic environment as perceived. In the SleepSound project, we aim to collect field data and harness machine listening for an AI-supported assessment of soundscape quality for healthy sleep. Why is this important? Our initial literature review reveals that little is known about Hong Kong's domestic acoustic environment, and practically no research has been published on people's perception of their own nighttime soundscape or how it may affect sleep. A case in point is the Noise Control Ordinance (1989), which regulates noise from, e.g., construction sites but leaves much of the neighbourhood environment open to interpretation. Without a deeper understanding of soundscape, solving inevitable conflicts might be left to arbitrary judgements. Given this situation, our project aims to develop methods to chart the nighttime soundscape and its impact on sleep. Restorative sleep is important for everyone and crucial for vulnerable individuals, e.g. those with a medical condition. The present research project builds on our recent study in a nighttime hospital ward, where patients wore sleep trackers to detect disturbances, and soundscape audio was captured. SleepSound will venture further by focusing on the context of domestic bedrooms for normally healthy residents in Hong Kong.
Article
This article critiques the anthropocentric tendencies in machine listening practices and narratives, developing alternative concepts and methods to explore the more-than-human potential of these technologies through the framework of sonic fiction. Situating machine listening within the contemporary soundscape of dataveillance, the research examines post-anthropocentric threads that emerge at the intersection of datafication, subjectivation and animalisation. Theory and practice interweave in the composition of a music piece, The Spiral, enabling generative feedback between concept, sensation and technique. Specifically, the research investigates the figure of a mollusc bio-sensor between science fact and fable, as the (im)possible locus of musicality. This emergent methodology also offers new insights for other sound art and music practices aiming to pluralise what listening might be.
Article
This article reflects on the “flat” history of timbre space, tracking its emergence as a technical inscription in psychoacoustic experiments and its rise to become a dominant conceptual metaphor in timbre studies. Drawing on Bruno Latour's notion of “immutable mobiles,” the author shows how the idea of a multidimensional timbre space has been propagated through the circulation of diagrams, which make perceptual data on listeners accessible to remote viewers. After surveying laboratory tools and techniques required for the production of these diagrams, the article considers how models of timbre space have been built into new technologies for music composition, performance, and listening, as well as into audio classification schemes and metadata formatting standards like MPEG-7. Mapping connections between psychoacoustic discourses and design practices, the article sheds light on the technoscientific origins of timbre space, examining its articulation to research labs at Bell, CCRMA, and IRCAM, and interrogating its role in determining what counts as sound knowledge.
Article
Full-text available
Background Over the past decade, antiretroviral therapy (ART) regimens that include integrase strand inhibitors (INSTIs) have become the most commonly used for people with HIV starting ART. Although trials and observational studies have compared virological failure on INSTI-based with other regimens, few data are available on mortality in people with HIV treated with INSTIs in routine care. Therefore, we compared all-cause mortality between different INSTI-based and non-INSTI-based regimens in adults with HIV starting ART from 2013 to 2018. Methods This cohort study used data on people with HIV in Europe and North America from the Antiretroviral Therapy Cohort Collaboration (ART-CC) and UK Collaborative HIV Cohort (UK CHIC). We studied the most common third antiretroviral drugs (additional to nucleoside reverse transcriptase inhibitor) used from 2013 to 2018: rilpivirine, darunavir, raltegravir, elvitegravir, dolutegravir, efavirenz, and others. Adjusted hazard ratios (aHRs; adjusted for clinical and demographic characteristics, comorbid conditions, and other drugs in the regimen) for mortality were estimated using Cox models stratified by ART start year and cohort, with multiple imputation of missing data. Findings 62 500 ART-naive people with HIV starting ART (12 422 [19·9%] women; median age 38 [IQR 30–48]) were included in the study. 1243 (2·0%) died during 188 952 person-years of follow-up (median 3·0 years [IQR 1·6–4·4]). There was little evidence that mortality rates differed between regimens with dolutegravir, elvitegravir, rilpivirine, darunavir, or efavirenz as the third drug. However, mortality was higher for raltegravir compared with dolutegravir (aHR 1·49, 95% CI 1·15–1·94), elvitegravir (1·86, 1·43–2·42), rilpivirine (1·99, 1·49–2·66), darunavir (1·62, 1·33–1·98), and efavirenz (2·12, 1·60–2·81) regimens. Results were similar for analyses making different assumptions about missing data and consistent across the time periods 2013–15 and 2016–18. 
Rates of virological suppression were higher for dolutegravir than other third drugs. Interpretation This large study of patients starting ART since the introduction of INSTIs found little evidence that mortality rates differed between most first-line ART regimens; however, raltegravir-based regimens were associated with higher mortality. Although unmeasured confounding cannot be excluded as an explanation for our findings, virological benefits of first-line INSTIs-based ART might not translate to differences in mortality. Funding US National Institute on Alcohol Abuse and Alcoholism and UK Medical Research Council.
Article
Full-text available
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
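The pipeline described above discretizes frame-level speech features into pseudo-text units and (in the published systems) collapses consecutive repeated units before language modelling. A toy sketch of those two steps, with a synthetic codebook and synthetic features standing in for a real speech encoder:

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame to the index of its nearest codebook centroid."""
    # features: (T, D) frame representations; codebook: (K, D) unit centroids
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1).tolist()

def deduplicate(units):
    """Collapse runs of repeated units into single pseudo-text tokens."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 2))                       # K=4 hypothetical units
features = np.repeat(codebook[[0, 2, 2, 1]], 3, axis=0)  # 12 frames sitting on centroids
print(deduplicate(quantize(features, codebook)))         # pseudo-text: [0, 2, 1]
```

The resulting unit sequence is what the generative language model is trained on; a unit-to-waveform decoder then inverts the process for generation.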
Article
Full-text available
At the time of writing this article, the world has registered more than 2 million deaths from the COVID-19 epidemic since the outbreak of the coronavirus, now officially known as SARS-CoV-2. However, tremendous efforts have been made worldwide to counter-steer and control the epidemic, by now labelled a pandemic. In this contribution, we provide an overview of the potential for computer audition (CA), i.e., the use of speech and sound analysis by artificial intelligence, to help in this scenario. We first survey which types of related or contextually significant phenomena can be automatically assessed from speech or sound. These include the automatic recognition and monitoring of COVID-19 directly or of its symptoms, such as breathing, dry and wet coughing or sneezing sounds, speech under cold, eating behaviour, sleepiness, or pain, to name but a few. Then, we consider potential use-cases for exploitation. These include risk assessment and diagnosis based on symptom histograms and their development over time, as well as monitoring of spread, social distancing and its effects, treatment and recovery, and patient well-being. We then guide through the challenges that need to be faced for real-life usage, and the limitations, also in comparison with non-audio solutions. We conclude that CA appears ready for the implementation of (pre-)diagnosis and monitoring tools and, more generally, offers rich and significant, yet so far untapped, potential in the fight against the spread of COVID-19.
Article
Full-text available
This paper traces the infrastructural politics of automated music mastering to reveal how contemporary iterations of artificial intelligence (AI) shape cultural production. The paper examines the emergence of LANDR, an online platform that offers automated music mastering, built on top of supervised machine learning branded as artificial intelligence. Increasingly, machine learning will become an integral part of signal processing for sounds and images, shaping the way media cultures sound, look, and feel. While LANDR is a product of the so-called ‘big bang’ in machine learning, it could not exist without specific conditions: specific kinds of commensurable data, as well as specific aesthetic and industrial conditions. Mastering, in turn, has become an indispensable but understudied part of music circulation as an infrastructural practice. Here we analyze the intersecting histories of machine learning and mastering, as well as LANDR’s failure at automating other domains of audio engineering. By doing so, we critique the discourse of AI’s inevitability and show the ways in which machine learning must frame or reframe cultural and aesthetic practices in order to automate them, in service of digital distribution, recognition, and recommendation infrastructures.
Article
Full-text available
Data science is not simply a method but an organising idea. Commitment to the new paradigm overrides concerns caused by collateral damage, and only a counterculture can constitute an effective critique. Understanding data science requires an appreciation of what algorithms actually do; in particular, how machine learning learns. The resulting ‘insight through opacity’ drives the observable problems of algorithmic discrimination and the evasion of due process. But attempts to stem the tide have not grasped the nature of data science as both metaphysical and machinic. Data science strongly echoes the neoplatonism that informed the early science of Copernicus and Galileo. It appears to reveal a hidden mathematical order in the world that is superior to our direct experience. The new symmetry of these orderings is more compelling than the actual results. Data science does not only make possible a new way of knowing but acts directly on it; by converting predictions to pre-emptions, it becomes a machinic metaphysics. The people enrolled in this apparatus risk an abstraction of accountability and the production of ‘thoughtlessness’. Susceptibility to data science can be contested through critiques of science, especially standpoint theory, which opposes the ‘view from nowhere’ without abandoning the empirical methods. But a counterculture of data science must be material as well as discursive. Karen Barad’s idea of agential realism can reconfigure data science to produce both non-dualistic philosophy and participatory agency. An example of relevant praxis points to the real possibility of ‘machine learning for the people’.
Article
Full-text available
The realm of the voice and the realm of the affective often share the distinction of the ineffable. Over the past 5-10 years, there has been a proliferation of scientific research and commercial products focused on the measurement of affect in the voice, attempting to codify and quantify that which previously had been understood as beyond language. Following similar work regarding the digital detection of facial expressions of emotion, this form of signal capture monitors data “below the surface,” deriving information about the subject’s intentions, objectives, or emotions by monitoring the voice signal for parameters such as timing, volume, pitch changes, and timbral fluctuation. Products claim to detect the mood, personality, truthfulness, confidence, mental health, and investability quotient of a speaker, based on the acoustic component of their voice. This software is being used in a range of applications, from targeted surveillance, mental health diagnoses, and benefits administration to credit management. A study of code, schematics, and patents reveals how this software imagines human subjectivity, and how such recognition is molded by, and in service of, the risk economy; revealing an evolution from truth-telling, to diagnostic, to predictive forms of listening.
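The acoustic parameters this software monitors (timing, volume, pitch changes, timbral fluctuation) are, at base, ordinary signal measurements. A minimal sketch of two of them, using a synthetic 200 Hz tone in place of a real voice recording; nothing here reproduces the proprietary products discussed above:

```python
import numpy as np

def rms_volume(frame):
    """Root-mean-square energy, a crude proxy for loudness."""
    return float(np.sqrt(np.mean(frame ** 2)))

def pitch_autocorr(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency from the strongest autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lags covering plausible pitch periods
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(2000) / sr                      # 25 full periods of a 200 Hz tone
tone = np.sin(2 * np.pi * 200.0 * t)
print(round(rms_volume(tone), 3), pitch_autocorr(tone, sr))  # 0.707 200.0
```

Tracking how such values fluctuate over time is the raw material from which the systems described above derive their far more contestable claims about mood, truthfulness, or investability.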
Book
Drawing on cultural theory and interviews with fans, cast members and producers, this book places the reality TV trend within a broader social context, tracing its relationship to the development of a digitally enhanced, surveillance-based interactive economy and to a savvy mistrust of mediated reality in general. Surveying several successful reality TV formats, the book links the rehabilitation of 'Big Brother' to the increasingly important economic role played by the work of being watched. The author enlists critical social theory to examine how the appeal of 'the real' is deployed as a pervasive but false promise of democratization.
Book
How are the new electronic technologies transforming business here and abroad — indeed, the entire world economy — and what new strategies must business develop to meet the challenges of this transformation? Economist, writer, and communications executive Maurice Estabrooks provides a readable, comprehensive survey of how businesses are using microchips, computers, and telecommunications to reshape the entire world of work — its cultures, organization, and economic systems. With insight and impeccable scholarship he provides concrete evidence of the emergence of artificially intelligent, cybernetic, network-based entities that are creating new linkages between businesses, markets, and technology itself — linkages that will profoundly affect the way businesses create and implement their corporate survival and growth strategies in the future. Drawing on the work of economic theorist Joseph Schumpeter, Estabrooks shows how Schumpeterian dynamics have played a key role in the breakup of AT&T and the Bell System, and in the deregulation of telecommunications, broadcasting, banking, finance, and other economically critical industries. What has emerged, he maintains, is an increasingly integrated, global information- and software-based services economy. Optical fibers, satellites, and wireless communications systems have already made possible the development of electronic superhighways, but in doing so they have also initiated a massive redistribution of economic power and wealth throughout the world, the implications of which are only now being understood. Historical, analytical, descriptive, Estabrooks' book will speak not only to academics and others who observe world transformations from relatively theoretical perspectives, but also to corporate and other executives whose organizations, and certainly their personal work lives, will be changed dramatically by the developments he describes in practical day-to-day situations.
Chapter
Mind design is the endeavor to understand mind (thinking, intellect) in terms of its design (how it is built, how it works). Unlike traditional empirical psychology, it is more oriented toward the "how" than the "what." An experiment in mind design is more likely to be an attempt to build something and make it work—as in artificial intelligence—than to observe or analyze what already exists. Mind design is psychology by reverse engineering. When Mind Design was first published in 1981, it became a classic in the then-nascent fields of cognitive science and AI. This second edition retains four landmark essays from the first, adding to them one earlier milestone (Turing's "Computing Machinery and Intelligence") and eleven more recent articles about connectionism, dynamical systems, and symbolic versus nonsymbolic models. The contributors are divided about evenly between philosophers and scientists. Yet all are "philosophical" in that they address fundamental issues and concepts; and all are "scientific" in that they are technically sophisticated and concerned with concrete empirical research. Contributors Rodney A. Brooks, Paul M. Churchland, Andy Clark, Daniel C. Dennett, Hubert L. Dreyfus, Jerry A. Fodor, Joseph Garon, John Haugeland, Marvin Minsky, Allen Newell, Zenon W. Pylyshyn, William Ramsey, Jay F. Rosenberg, David E. Rumelhart, John R. Searle, Herbert A. Simon, Paul Smolensky, Stephen Stich, A.M. Turing, Timothy van Gelder
Article
There is a gap in existing critical scholarship that engages with the ways in which current “machine listening” or voice analytics/biometric systems intersect with the technical specificities of machine learning. This article examines the sociotechnical assemblage of machine learning techniques, practices, and cultures that underlie these technologies. After engaging with various practitioners working in companies that develop machine listening systems, including CEOs, machine learning engineers, data scientists, and business analysts, I bring attention to the centrality of “learnability” as a malleable conceptual framework that bends according to various “ground-truthing” practices in formalizing certain listening-based prediction tasks for machine learning. In response, I introduce a process I call Ground Truth Tracings to examine the various ontological translations that occur in training a machine to “learn to listen.” Ultimately, by further examining this notion of learnability through the aperture of power, I take insights acquired through my fieldwork in the machine listening industry and propose a strategically reductive heuristic through which the epistemological and ethical soundness of machine learning, writ large, can be contemplated.
Article
Acoustic gunshot detection systems (AGDS) have been emerging as a technological solution to the growing problem of gun violence around the world. We examine a particularly prominent AGDS technology called Respond™ developed by publicly traded US company ShotSpotter Inc. (NASDAQ: SSTI) to better understand the sociotechnical logics that inform its operation. Drawing from frameworks provided by science and technology studies and sound studies, we ask, “What are the broader conditions that allow for a successful AGDS as it is imagined by ShotSpotter?” At a time in which the accuracy and reliability of ShotSpotter’s AGDS are being seriously questioned through the numerous reports of false positives that reached as high as 99 percent in certain cities it was deployed in, it is imperative to interrogate what exactly is “false” in these false positive reports and how the company operates despite them. In this paper, we trace ShotSpotter and its artificial intelligence/machine learning AGDS technologies as they exist across various patents, promotional materials, financial documents, and public statements to not only better understand the ways in which they translate sound into “crime,” or space into “crime scene,” but also to bring attention to how ShotSpotter translates itself across its different audiences.
Book
An accessible explanation of the technologies that enable such popular voice-interactive applications as Alexa, Siri, and Google Assistant. Have you talked to a machine lately? Asked Alexa to play a song, asked Siri to call a friend, asked Google Assistant to make a shopping list? This volume in the MIT Press Essential Knowledge series offers a nontechnical and accessible explanation of the technologies that enable these popular devices. Roberto Pieraccini, drawing on more than thirty years of experience at companies including Bell Labs, IBM, and Google, describes the developments in such fields as artificial intelligence, machine learning, speech recognition, and natural language understanding that allow us to outsource tasks to our ubiquitous virtual assistants. Pieraccini describes the software components that enable spoken communication between humans and computers, and explains why it's so difficult to build machines that understand humans. He explains speech recognition technology; problems in extracting meaning from utterances in order to execute a request; language and speech generation; the dialog manager module; and interactions with social assistants and robots. Finally, he considers the next big challenge in the development of virtual assistants: building in more intelligence—enabling them to do more than communicate in natural language and endowing them with the capacity to know us better, predict our needs more accurately, and perform complex tasks with ease.
Article
By looking at the politics of classification within machine learning systems, this article demonstrates why the automated interpretation of images is an inherently social and political project. We begin by asking what work images do in computer vision systems, and what is meant by the claim that computers can “recognize” an image? Next, we look at the method for introducing images into computer systems and look at how taxonomies order the foundational concepts that will determine how a system interprets the world. Then we turn to the question of labeling: how humans tell computers which words will relate to a given image. What is at stake in the way AI systems use these labels to classify humans, including by race, gender, emotions, ability, sexuality, and personality? Finally, we turn to the purposes that computer vision is meant to serve in our society—the judgments, choices, and consequences of providing computers with these capacities. Methodologically, we call this an archeology of datasets: studying the material layers of training images and labels, cataloguing the principles and values by which taxonomies are constructed, and analyzing how these taxonomies create the parameters of intelligibility for an AI system. By doing this, we can critically engage with the underlying politics and values of a system, and analyze which normative patterns of life are assumed, supported, and reproduced.
Article
While automatized content identification of audio data, in critical discourse analysis, is bound to the symbolic order of monitoring, control, surveillance, censorship and copyright protection, the very tools and algorithms which have been developed for such purposes can be turned into instruments of knowledge production in the scientific sense. Audio content identification is not simply an extension of cultural taxonomies to machine listening, but an operation with its own eigen knowledge. Audio content identification is not simply a continuation of analog techniques for monitoring sonic objects. From a media-epistemological perspective, new forms of audio content identification open different orders of the sonic archive. What is practiced in the online domain has been preceded by experimental investigations of archival storage. The real l'archive, though, consists of the technological (hardware) and mathematical (software) criteria defining content identification. A media archaeology of audio content identification reveals the technological l'archive governing such forms of enunciation.
Book
Tracing efforts to control unwanted sound—the noise of industry, city traffic, gramophones and radios, and aircraft—from the late nineteenth to the late twentieth century. Since the late nineteenth century, the sounds of technology have been the subject of complaints, regulation, and legislation. By the early 1900s, antinoise leagues in Western Europe and North America had formed to fight noise from factories, steam trains, automobiles, and gramophones, with campaigns featuring conferences, exhibitions, and “silence weeks.” And, as Karin Bijsterveld points out in Mechanical Sound, public discussion of noise has never died down and continues today. In this book, Bijsterveld examines the persistence of noise on the public agenda, looking at four episodes of noise and the public response to it in Europe and the United States between 1875 and 1975: industrial noise, traffic noise, noise from neighborhood radios and gramophones, and aircraft noise. She also looks at a twentieth-century counterpoint to complaints about noise: the celebration of mechanical sound in avant-garde music composed between the two world wars. Bijsterveld argues that the rise of noise from new technology combined with overlapping noise regulations created what she calls a “paradox of control.” Experts and politicians promised to control some noise, but left other noise problems up to citizens. Aircraft noise, for example, measured in formulas understandable only by specialists, was subject to public regulation; the sounds of noisy neighborhoods were the responsibility of residents themselves. In addition, Bijsterveld notes, the spatial character of anti-noise interventions that impose zones and draw maps, despite the ability of sound to cross borders and boundaries, has helped keep noise a public problem. We have tried to create islands of silence, she writes, yet we have left a sea of sounds to be fiercely discussed.
Book
At a ceremony announcing the completion of the first draft of the human genome in 2000, President Bill Clinton declared, “I believe one of the great truths to emerge from this triumphant expedition inside the human genome is that in genetic terms, all human beings, regardless of race, are more than 99.9 percent the same.” Yet despite this declaration of unity, biomedical research has focused increasingly on mapping that 0.1 percent of difference, particularly as it relates to race. This trend is exemplified by the drug BiDil. This drug was originally touted as a groundbreaking therapy to treat heart failure in black patients and help underserved populations. However, the book reveals a far more complex story. At the most basic level, BiDil became racial through legal maneuvering and commercial pressure as much as through medical understandings of how the drug worked. The book broadly examines the legal and commercial imperatives driving the expanding role of race in biomedicine, even as scientific advances in genomics could render the issue irrelevant. It surveys the distinct politics informing the use of race in medicine and the very real health disparities caused by racism and social injustice that are now being cast as a mere function of genetic difference.
Article
In addition to the recent proliferation of approaches, programs, and research centers devoted to ethical data and Artificial Intelligence, it is becoming increasingly clear that we need to directly address the political question. Ethics, while crucial, comprise only an indirect response to recent concerns about the political uses and misuses of data mining, AI, and automated processes. If we are concerned about the impact of digital media on democracy, it will be important to consider what it might mean to foster democratic arrangements for the collection and use of data, and for the institutions that perform these tasks. This essay considers what it might mean to supplement ethical concerns with political ones. It argues for the importance of considering the tensions between civic life and the wholesale commercialization of news, information, and entertainment platforms—and how these are exacerbated by the dominant economic model of data-driven hyper-customization.
Book
The Closed World offers a radically new alternative to the canonical histories of computers and cognitive science. Arguing that we can make sense of computers as tools only when we simultaneously grasp their roles as metaphors and political icons, Paul Edwards shows how Cold War social and cultural contexts shaped emerging computer technology—and were transformed, in turn, by information machines. The Closed World explores three apparently disparate histories—the history of American global power, the history of computing machines, and the history of subjectivity in science and culture—through the lens of the American political imagination. In the process, it reveals intimate links between the military projects of the Cold War, the evolution of digital computers, and the origins of cybernetics, cognitive psychology, and artificial intelligence. Edwards begins by describing the emergence of a "closed-world discourse" of global surveillance and control through high-technology military power. The Cold War political goal of "containment" led to the SAGE continental air defense system, Rand Corporation studies of nuclear strategy, and the advanced technologies of the Vietnam War. These and other centralized, computerized military command and control projects—for containing world-scale conflicts—helped closed-world discourse dominate Cold War political decisions. Their apotheosis was the Reagan-era plan for a "Star Wars" space-based ballistic missile defense. Edwards then shows how these military projects helped computers become axial metaphors in psychological theory. Analyzing the Macy Conferences on cybernetics, the Harvard Psycho-Acoustic Laboratory, and the early history of artificial intelligence, he describes the formation of a "cyborg discourse." By constructing both human minds and artificial intelligences as information machines, cyborg discourse assisted in integrating people into the hyper-complex technological systems of the closed world. 
Finally, Edwards explores the cyborg as political identity in science fiction—from the disembodied, panoptic AI of 2001: A Space Odyssey, to the mechanical robots of Star Wars and the engineered biological androids of Blade Runner—where Information Age culture and subjectivity were both reflected and constructed. Inside Technology series
Thesis
By turning to the expertise of computer scientists and engineers, they seek to build "machine listening" prototypes for psychiatric assessment: technologies that use a microphone to capture sound and artificial intelligence (AI) to analyze it. While their studies are premised on the notion that AI can listen beyond the human by attending to sounds of speech that have psychopathological significance supposedly set aside from linguistic meaning and human difference, in order to gather and classify the data necessary for building their technologies, researchers must rely on the very components of language that they seek to overcome: its interactional, sociocultural dimensions. I show how the connections between spoken utterances and inner states that researchers design their systems to make "autonomously" depend upon a tightly managed but oftentimes hidden infrastructure of human labor, including the labor of research subjects.
Article
This article considers machine methods used in the collection, processing, and application of vocal recordings for speaker identification and speech recognition between 1908 and 1970. The first phonographic archives featured collections of "vocal portraits" that prompted international investigations into the essential features of human voices for individual identification. Visual records of speech later found the same applications, but as "voiceprint identification" via sound spectrography began to achieve legal and commercial success in the 1960s, the procedure attracted more widespread scientific attention, which ultimately discredited both its accuracy and its rationale. At the same time, spectrogram collections spurred a new application-speech recognition by machine. The changing status of the speech spectrogram, from a record of unique features of individual voices to a model of fundamental invariants in speech sounds, was rooted in the demands of automated processing and a corresponding shift from the sound archive to the acoustic database.
Article
This article examines the figuration of the home automation device Amazon Echo and its digital assistant Alexa. While most readings of gender and digital assistants choose to foreground the figure of the housewife, I argue that Alexa is instead figured on domestic servants. I examine commercials, Amazon customer reviews, and reviews from tech commentators to make the case that the Echo is modeled on an idealized image of domestic service. It is my contention that this vision functions in various ways to reproduce a relation between device/user that mimics the relation between servant/master in nineteenth- and twentieth-century American homes. Significantly, however, the Echo departs from this historical parallel through its aesthetic coding as a native-speaking, educated, white woman. This aestheticization is problematic insofar as it decontextualizes and depoliticizes the historic reality of domestic service. Further, this figuration misrepresents the direction of power between user and devices in a way that makes contending with issues such as surveillance and digital labor increasingly difficult.
Book
Auditory Scene Analysis addresses the problem of hearing complex auditory environments, using a series of creative analogies to describe the process required of the human auditory system as it analyzes mixtures of sounds to recover descriptions of individual sounds. In a unified and comprehensive way, Bregman establishes a theoretical framework that integrates his findings with an unusually wide range of previous research in psychoacoustics, speech perception, music theory and composition, and computer modeling. Bradford Books imprint
Article
The power exercised by technology companies is attracting the attention of policymakers, regulatory bodies and the general public. This power can be categorized in several ways, ranging from the “soft power” of technology companies to influence public policy agendas to the “market power” they may wield to exclude equally efficient competitors from the marketplace. This Article is concerned with the “data power” exercised by technology companies occupying strategic positions in the digital ecosystem. This data power is a multifaceted power that may overlap with economic (market) power but primarily entails the power to profile and the power to influence opinion formation. While the current legal framework for data protection and privacy in the EU imposes constraints on personal data processing by technology companies, it ostensibly does so without regard to whether or not they have “data power.” This Article probes this assumption. It argues that although this legal framework does not explicitly impose additional legal responsibilities on entities with “data power,” it provides a clear normative indication to do so. The volume and variety of data and the reach of data-processing operations seem to be relevant when assessing both the extent of obligations on technology companies and the impact of data processing on individual rights. The Article suggests that this finding provides the normative foundation for the imposition of a “special responsibility” on such firms, analogous to the “special responsibility” imposed by competition law on dominant companies with market power. What such a “special responsibility” might entail in practice will be briefly outlined and relevant questions for future research will be identified.
Conference Paper
Computer vision and other biometrics data science applications have commenced a new project of profiling people. Rather than using 'transaction generated information', these systems measure the 'real world' and produce an assessment of the 'world state' - in this case an assessment of some individual trait. Instead of using proxies or scores to evaluate people, they increasingly deploy a logic of revealing the truth about reality and the people within it. While these profiling knowledge claims are sometimes tentative, they increasingly suggest that only through computation can these excesses of reality be captured and understood. This article explores the bases of those claims in the systems of measurement, representation, and classification deployed in computer vision. It asks if there is something new in this type of knowledge claim, sketches an account of a new form of computational empiricism being operationalised, and questions what kind of human subject is being constructed by these technological systems and practices. Finally, the article explores legal mechanisms for contesting the emergence of computational empiricism as the dominant knowledge platform for understanding the world and the people within it.
Book
An examination of more than sixty years of successes and failures in developing technologies that allow computers to understand human spoken language. Stanley Kubrick's 1968 film 2001: A Space Odyssey famously featured HAL, a computer with the ability to hold lengthy conversations with his fellow space travelers. More than forty years later, we have advanced computer technology that Kubrick never imagined, but we do not have computers that talk and understand speech as HAL did. Is it a failure of our technology that we have not gotten much further than an automated voice that tells us to “say or press 1”? Or is there something fundamental in human language and speech that we do not yet understand deeply enough to be able to replicate in a computer? In The Voice in the Machine, Roberto Pieraccini examines six decades of work in science and technology to develop computers that can interact with humans using speech and the industry that has arisen around the quest for these technologies. He shows that although the computers today that understand speech may not have HAL's capacity for conversation, they have capabilities that make them usable in many applications today and are on a fast track of improvement and innovation. Pieraccini describes the evolution of speech recognition and speech understanding processes from waveform methods to artificial intelligence approaches to statistical learning and modeling of human speech based on a rigorous mathematical model—specifically, Hidden Markov Models (HMM). He details the development of dialog systems, the ability to produce speech, and the process of bringing talking machines to the market. Finally, he asks a question that only the future can answer: will we end up with HAL-like computers or something completely unexpected?
Book
Musicians begin formal training by acquiring a body of musical concepts commonly known as musicianship. These concepts underlie the musical skills of listening, performance, and composition. Like humans, computer music programs can benefit from a systematic foundation of musical knowledge. This book explores the technology of implementing musical processes such as segmentation, pattern processing, and interactive improvisation in computer programs. It shows how the resulting applications can be used to accomplish tasks ranging from the solution of simple musical problems to the live performance of interactive compositions and the design of musically responsive installations and Web sites. Machine Musicianship is both a programming tutorial and an exploration of the foundational concepts of musical analysis, performance, and composition. The theoretical foundations are derived from the fields of music theory, computer music, music cognition, and artificial intelligence. The book will be of interest to practitioners of those fields, as well as to performers and composers. The concepts are programmed using C++ and Max. The accompanying CD-ROM includes working versions of the examples, as well as source code and a hypertext document showing how the code leads to the program's musical functionality.
Article
We are often told that data are the new oil. But unlike oil, data are not a substance found in nature. They must be appropriated. The capture and processing of social data unfolds through a process we call data relations, which ensures the "natural" conversion of daily life into a data stream. The result is nothing less than a new social order, based on continuous tracking, and offering unprecedented new opportunities for social discrimination and behavioral influence. We propose that this process is best understood through the history of colonialism. Thus, data relations enact a new form of data colonialism, normalizing the exploitation of human beings through data, just as historic colonialism appropriated territory and resources and ruled subjects for profit. Data colonialism paves the way for a new stage of capitalism whose outlines we only glimpse: the capitalization of life without limit.
Chapter
Will you be replaced by a machine? Can music express things beyond words? This chapter discusses developing interactive computational systems that have degrees of autonomy, subjectivity, and uniqueness rather than repeatability. Interactions with these systems in musical performance produce a kind of virtual sociality that both draws from and challenges traditional notions of human interactivity and sociality, making common cause with a more general production of a hybrid. This leads into the question of whether computers can evince agency. The chapter concludes that what is learned from computer improvisation is more about people and the environment than about machines.
Book
Human and Machine Hearing is the first book to comprehensively describe how human hearing works and how to build machines to analyze sounds in the same way that people do. Drawing on over thirty-five years of experience in analyzing hearing and building systems, Richard F. Lyon explains how we can now build machines with close-to-human abilities in speech, music, and other sound-understanding domains. He explains human hearing in terms of engineering concepts, and describes how to incorporate those concepts into machines for a wide range of modern applications. The details of this approach are presented at an accessible level, to bring a diverse range of readers, from neuroscience to engineering, to a common technical understanding. The description of hearing as signal-processing algorithms is supported by corresponding open-source code, for which the book serves as motivating documentation. The author is a leading practitioner in applying hearing science to modern problems such as speech and music recognition. The book presents updated versions of his widely used hearing models, supports them with well-explained open-source code, and ranges widely, leveraging ideas from machine vision in combination with hearing science.
Book
How can we engineer systems capable of "cocktail party" listening? Human listeners are able to perceptually segregate one sound source from an acoustic mixture, such as a single voice from a mixture of other voices and music at a busy cocktail party. How can we engineer "machine listening" systems that achieve this perceptual feat? Albert Bregman's book Auditory Scene Analysis, published in 1990, drew an analogy between the perception of auditory scenes and visual scenes, and described a coherent framework for understanding the perceptual organization of sound. His account has stimulated much interest in computational studies of hearing. Such studies are motivated in part by the demand for practical sound separation systems, which have many applications including noise-robust automatic speech recognition, hearing prostheses, and automatic music transcription. This emerging field has become known as computational auditory scene analysis (CASA). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications provides a comprehensive and coherent account of the state of the art in CASA, in terms of the underlying principles, the algorithms and system architectures that are employed, and the potential applications of this exciting new technology. With a Foreword by Bregman, its chapters are written by leading researchers and cover a wide range of topics, including: estimation of multiple fundamental frequencies; feature-based and model-based approaches to CASA; sound separation based on spatial location; processing for reverberant environments; segregation of speech and musical signals; automatic speech recognition in noisy environments; and neural and perceptual modeling of auditory organization. The text is written at a level that will be accessible to graduate students and researchers from related science and engineering disciplines. The extensive bibliography accompanying each chapter will also make this book a valuable reference source.
A web site accompanying the text, http://www.casabook.org, features software tools and sound demonstrations. © 2006 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Article
This paper offers a theoretical treatment of the main problems that arise in mechanical speech recognition, based on the conclusions reached in experiments on the perception and recognition of speech sounds and on experimental results already obtained with a mechanical recognizer. In the first part of the paper, the problems of primary or acoustic recognition are dealt with; they include the “gating” problem, the choice of recognition units, and the acoustic recognition of different classes of speech sound—vowels, plosive consonants, fricative consonants and periodic continuants. The second part discusses the use of language statistics in mechanical recognition.
Chapter
This article discusses autoexperiments, field notes, and laboratory tests on the hardware and software of cochlear implants. Electroacoustic devices resist seeing-through. Yet in the case of cochlear implants, the desires of early users, the conflicting demands of mainstream medicine and economics, and the mediated features of electrical listening, the politics attendant upon communication can be found embedded in the design of electroacoustic objects. Many bioethicists have taken up the Deaf culture or linguistic minority critique of implantation, which situates this technology in the long history of eugenicist attempts to promote oralism through the medical eradication of deafness and through pedagogical bans on sign language. Despite its prominence in disability studies, bioethics, and science fiction, however, the cochlear implant has inspired little research in science and technology studies (STS).
Conference Paper
We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
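The baseline system described here rests on standard components: mel-frequency cepstral coefficients as features and Gaussian mixture models as classifiers. As a minimal illustration (not the paper's actual code), MFCCs can be computed from first principles with NumPy alone. The frame length, hop size, and filter counts below are illustrative defaults, not the TUT baseline's settings; in a full system, a Gaussian mixture model (e.g. scikit-learn's GaussianMixture) would then be fitted to the features of each acoustic scene class, and a test recording assigned to the class whose model gives the highest likelihood.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising slope of triangle i
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling slope of triangle i
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """MFCC matrix of shape (n_frames, n_coeffs) for a 1-D signal."""
    # Frame the signal with a Hann window and compute the power spectrum.
    frames = np.array([signal[i:i + n_fft]
                       for i in range(0, len(signal) - n_fft, hop)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel-band energies, then a DCT-II to decorrelate them.
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                  (2 * n + 1) / (2.0 * n_filters)))
    return energies @ dct.T

# One second of a 440 Hz tone at 16 kHz, purely as a smoke test.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
feats = mfcc(sig, 16000)
```

This sketch omits refinements a real baseline would include (pre-emphasis, delta features, per-channel normalization), but it shows the pipeline the abstract presupposes: windowed power spectra, mel-band integration, log compression, and cepstral decorrelation.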
Book
Max Weber (1864-1920) was one of the most prolific and influential sociologists of the twentieth century. This classic collection draws together his key papers. This edition contains a new preface by Professor Bryan S. Turner. © 2009 H.H. Gerth and C. Wright Mills for selection; preface by Bryan Turner.