Acoustic data, or recorded human speech, is often used in data science for direct applications such as automatic speech recognition. However, acoustic data is a rich source of information about human language and has the potential to contribute to research beyond these specific applications. 


Why use acoustic data?

Spoken language data is a rich source of information, not only about the content of the speech (i.e., the words) but also about the speaker's attitude, feelings, and intentions.

For example, the sentence “She ate the pie.” can have multiple meanings depending on where in the sentence there is emphasis:

  1. SHE ate the pie.

  2. She ate the PIE.

(1) emphasizes the person eating the pie, and would be a response to a question like: Who ate the pie? She ate the pie. In contrast, (2) emphasizes the pie, perhaps in response to a question like: What did she eat? She ate the pie. In writing, one might use italics or capital letters to mark emphasis. In many cases, however, this emphasis is not marked in writing, and there is no way to know which meaning was intended without further context.

Yet the differences between (1) and (2) are immediately apparent in a recording. The figure below shows the two sentences. The top panel, in blue, shows the waveform (the recorded speech), and the middle panel shows a visual representation of the speech called a spectrogram. For example, on the left, the amplitude (the height of the speech waves in the waveform) differs much more between the beginning and the end of the sentence. On the right, in contrast, the difference in the height of the waveform is much smaller.



In acoustic phonetics, we can convert these visual differences into numbers, creating data that can be used in machine learning and other applications. This information can be analyzed on its own or merged with other forms of data to enrich text and other datasets with information about emphasis, emotion, intention, or speaking style. 
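
As a rough illustration of turning a waveform into numbers, the sketch below measures how much louder the first half of a signal is than the second half, using root-mean-square (RMS) amplitude. It is a minimal sketch under stated assumptions: the two signals are synthetic tones standing in for real recordings, and the sample rate, amplitudes, and function names are all invented for this example rather than taken from any particular tool.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this illustration)


def synth(duration_s, start_amp, end_amp, freq_hz=150.0):
    """Synthesize a tone whose amplitude changes linearly over time.

    This is a stand-in for a real recording, not real speech.
    """
    n = int(duration_s * SAMPLE_RATE)
    samples = []
    for i in range(n):
        amp = start_amp + (end_amp - start_amp) * i / (n - 1)
        samples.append(amp * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
    return samples


def rms(samples):
    """Root-mean-square amplitude: a single number summarizing loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def start_end_rms_difference(samples):
    """RMS of the first half minus RMS of the second half."""
    half = len(samples) // 2
    return rms(samples[:half]) - rms(samples[half:])


# Hypothetical signals: one with a large amplitude drop, one with a small drop.
large_contrast = synth(1.0, start_amp=0.9, end_amp=0.2)
small_contrast = synth(1.0, start_amp=0.5, end_amp=0.4)

diff_large = start_end_rms_difference(large_contrast)
diff_small = start_end_rms_difference(small_contrast)

print(f"large-contrast signal: {diff_large:.3f}")
print(f"small-contrast signal: {diff_small:.3f}")
```

With real recordings, the samples would come from an audio file rather than from `synth`, and the resulting numbers could be stored alongside a transcript as one more column of data.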


What are some sources of acoustic data?

Collecting spoken data is becoming more and more straightforward! Even phone and desktop computer microphones are now high enough in quality that their recordings can be used for fairly sophisticated analysis.

For example, through websites and apps, it is possible to collect large amounts of data quickly by having people record into their phones through short surveys and crowdsourcing techniques. In addition, more and more recorded speech is becoming available on popular social media and sharing sites. These new avenues for obtaining data have dramatically increased the amount of acoustic data available to researchers in a wide range of domains.


How can you work with acoustic data?

Some (free, open source) useful tools for working with acoustic data include:

Forced Alignment: Forced alignment is a technique for preprocessing acoustic data. A forced aligner takes transcripts and audio files and lines them up, which identifies the part of the recording that corresponds to each word. Two useful forced alignment tools are the Montreal Forced Aligner and the Penn Forced Aligner.
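
To make the idea concrete, the sketch below shows one way the result of forced alignment might be represented and queried once it has been loaded into Python. It is only an illustration under stated assumptions: the time stamps are invented, the list-of-tuples layout is a simplification rather than the output format of any specific aligner, and `word_at` and `word_durations` are hypothetical helper names.

```python
from bisect import bisect_right

# Hypothetical alignment for "She ate the pie": (start_s, end_s, word).
# These times are invented for illustration, not measured from a recording.
alignment = [
    (0.00, 0.35, "she"),
    (0.35, 0.60, "ate"),
    (0.60, 0.72, "the"),
    (0.72, 1.10, "pie"),
]


def word_at(time_s, intervals):
    """Return the word being spoken at time_s, or None if there is none.

    Assumes the intervals are sorted by start time and do not overlap.
    """
    starts = [start for start, _, _ in intervals]
    idx = bisect_right(starts, time_s) - 1
    if idx < 0:
        return None
    start, end, word = intervals[idx]
    return word if time_s < end else None


def word_durations(intervals):
    """Map each word to its duration in seconds, rounded to 2 decimals."""
    return {word: round(end - start, 2) for start, end, word in intervals}


print(word_at(0.40, alignment))
print(word_durations(alignment))
```

Once each word has a start and end time, acoustic measurements such as amplitude or pitch can be computed per word and merged with the text.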

Analyzing acoustic data: Praat is the most common software for analyzing acoustic information and has many ready-to-use tools. For those who prefer working in Python, several packages, including Parselmouth, provide access to Praat from Python.


Interested in working with acoustic data but not sure how? Sign up for a consultation at the D-Lab or send me an email!



Emily Grabowski

I am a PhD student in Linguistics. My research interests include understanding how our speech production and speech perception systems constrain linguistic variation, especially as it applies to the larynx. I am also interested in integrating theoretical representations of language with speech. I approach this using a broad variety of tools/methodologies, including theoretical work, experiments, and modeling. Current projects include developing a computational tool to expedite analysis of pitch and an online perception experiment on the relationship between pitch and perceived duration.