FAQs
-
What is DARLA?
DARLA is a web application providing two main functionalities for vowel extraction from speech: completely automated and semi-automated.
The completely automated system transcribes the input speech data using automatic speech recognition (ASR), and then runs it through forced alignment and formant extraction.
The semi-automated system is a FAVE-based approach that aligns and extracts vowel formants from speech with manual transcriptions.
-
What output does the system return?
We e-mail a set of files to you:
- a vowel plot showing the means of the stressed vowels, after filtering out grammatical function words,
- a spreadsheet with both unnormalized and Lobanov-normalized formant measurements (pre-filtered to remove grammatical function words and tokens with high bandwidths for F1 or F2),
- the same spreadsheet formatted for convenient uploading to NORM,
- the alignments, and
- the transcriptions.
If you are using the completely automated feature, you also have the option of editing the transcriptions online, and seamlessly re-running your task through the system.
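For reference, the Lobanov normalization used in the spreadsheet z-scores each formant within a speaker. Here is a minimal sketch of that computation; the column names are hypothetical and may not match DARLA's exact spreadsheet headers:

    import pandas as pd

    # Minimal sketch of Lobanov normalization: z-score each formant within a
    # speaker. Column names ("speaker", "F1", "F2") are illustrative only.
    def lobanov(df):
        out = df.copy()
        for formant in ("F1", "F2"):
            by_speaker = out.groupby("speaker")[formant]
            out[formant + "_lob"] = (
                (out[formant] - by_speaker.transform("mean"))
                / by_speaker.transform("std")
            )
        return out

    tokens = pd.DataFrame({
        "speaker": ["s1"] * 4,
        "vowel":   ["IY", "AA", "UW", "AE"],
        "F1":      [310.0, 750.0, 320.0, 680.0],
        "F2":      [2200.0, 1200.0, 900.0, 1750.0],
    })
    print(lobanov(tokens))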
-
What ASR is used for the completely automated feature?
We give you two options: our in-house ASR, and YouTube's closed captioning.
- The in-house ASR is a research system built with the CMU Sphinx toolkit. It is the fastest method and works reasonably well for some kinds of recordings. However, the results often have a fair number of transcription errors, since we don't have the resources to match the commercial state of the art.
For this reason, we have added a convenient option for manual corrections: the initial results e-mailed to you include a link to an online playback tool that divides your recording into 20-second audio chunks, each paired with its ASR transcription, so you can correct them and resubmit directly to DARLA.
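As an illustration of the chunking idea (this is not DARLA's actual code), splitting a recording into 20-second pieces might look like the following sketch, assuming the pydub library:

    from pydub import AudioSegment  # assumed third-party dependency

    # Illustrative sketch only: divide a recording into 20-second chunks of
    # the kind the playback tool presents for correction.
    def split_into_chunks(path, chunk_ms=20_000):
        audio = AudioSegment.from_file(path)  # pydub durations are in ms
        for i, start in enumerate(range(0, len(audio), chunk_ms)):
            audio[start:start + chunk_ms].export("chunk_%03d.wav" % i,
                                                 format="wav")

    split_into_chunks("interview.wav")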
- The other option uses Google's state-of-the-art ASR through YouTube's Closed Captioning. Your audio is uploaded to YouTube (to a private, secure account), and the transcriptions are extracted.
One drawback is that YouTube often takes 5+ hours to produce captions. (The transcriptions are typically quite good, so it's worth the wait.) Note that you don't have to upload anything to YouTube or convert your audio into a video yourself; DARLA takes care of that for you! See the system page for details.
-
How can I evaluate the accuracy of the ASR transcriptions?
Our online transcription evaluation tool uses the weighted Levenshtein distance algorithm to compute transcription error rates for words, phonemes, and stressed vowels. Simply upload the ASR transcription and the manual transcription in plaintext format.
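As a rough sketch of the underlying computation (our tool's exact edit weights aren't shown here), a token-level Levenshtein distance yields an error rate like this:

    # Minimal sketch of a weighted, token-level Levenshtein distance used to
    # compute a word error rate (WER). DARLA's tool also reports phoneme- and
    # stressed-vowel-level rates; the weights below are illustrative defaults.
    def levenshtein(ref, hyp, sub=1, ins=1, dele=1):
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            d[i][0] = i * dele
        for j in range(1, len(hyp) + 1):
            d[0][j] = j * ins
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else sub
                d[i][j] = min(d[i - 1][j] + dele,      # deletion
                              d[i][j - 1] + ins,       # insertion
                              d[i - 1][j - 1] + cost)  # substitution/match
        return d[len(ref)][len(hyp)]

    ref = "she had her dark suit".split()
    hyp = "she had a dark suit".split()
    print(levenshtein(ref, hyp) / len(ref))  # word error rate: 0.2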
-
What do you use for alignment and extraction?
We use the same methods as FAVE. Our forced alignment is done with the ProsodyLab Aligner, and formant measurement with FAVE-extract. We also use the R vowels package for plotting.
-
Why completely automate vowel extraction?
In the last few years, sociolinguists have begun using semi-automated speech processing methods such as Penn's FAVE program to extract vowel formants. These systems have accelerated the pace of linguistic research, but require significant human effort to manually create sentence-level transcriptions.
We believe that sociolinguistics is on the brink of an even more transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. This technology would make it possible to quickly extract pronunciation features from hours of recordings, including YouTube and vast audio archives. DARLA takes a step in this direction.
-
How do we deal with ASR errors?
While ASR is far from perfect, we believe sociophoneticians do not need to wait for years to take advantage of speech recognition. Unlike applications like dictation software where accurate word recognition is the primary goal, sociophonetics typically focuses on a much narrower objective: extracting a representative vowel space for speakers, based on stressed vowel tokens. For example, it would usually not be crucial to know that the stressed vowel in the word "turning" was extracted from "turn it" rather than "turning", or that "tack" was wrongly transcribed as "sack". Such differences will have little effect on the speaker's vowel space for many sociophonetic questions.
It turns out that most ASR errors affect the identity of the words but not the identity of the vowels (especially stressed vowels), which makes ASR well suited to automated vowel analysis. Of course, there will be some vowel errors, but their effect is diluted when the data contain hundreds or thousands of vowel tokens.
Important contrasts like "cot" versus "caught" tend to be handled by ASR's modeling of grammatical plausibility (using a language model). The system would be unlikely to transcribe "I caught the ball" as "I cot the ball" since the latter would be improbable under an English language model.
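Here is a toy illustration of that effect; the bigram probabilities below are invented for the example, not taken from any real model:

    import math

    # Toy bigram language model with invented probabilities, illustrating why
    # "I cot the ball" scores far worse than "I caught the ball".
    bigram_logprob = {
        ("i", "caught"):   math.log(1e-4),
        ("caught", "the"): math.log(1e-1),
        ("i", "cot"):      math.log(1e-8),  # "cot" as a verb is vanishingly rare
        ("cot", "the"):    math.log(1e-6),
        ("the", "ball"):   math.log(1e-2),
    }

    def score(words):
        return sum(bigram_logprob[(a, b)] for a, b in zip(words, words[1:]))

    print(score(["i", "caught", "the", "ball"]))  # much higher (less negative)
    print(score(["i", "cot", "the", "ball"]))     # heavily penalized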
Finally, since DARLA reports the phonetic environment around each vowel (e.g., obstruent+vowel+nasal consonant), researchers can examine contrasts like pin/pen versus pit/pet.
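For example, using those environments to separate pre-nasal from pre-oral tokens might look like the following sketch, where the column names ("vowel", "fol_seg") are illustrative rather than DARLA's exact headers:

    import pandas as pd

    # Hypothetical sketch: use the reported phonetic environments to separate
    # pre-nasal tokens (pin/pen) from pre-oral ones (pit/pet).
    tokens = pd.read_csv("formants.csv")
    nasals = {"M", "N", "NG"}
    ih = tokens[tokens["vowel"] == "IH"]
    prenasal = ih[ih["fol_seg"].isin(nasals)]
    preoral = ih[~ih["fol_seg"].isin(nasals)]
    print(prenasal[["F1", "F2"]].mean())
    print(preoral[["F1", "F2"]].mean())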
-
Sounds great! What's the catch with the completely automated system?
DARLA's completely automated system cannot provide perfect transcriptions. The automatic transcriptions typically contain a large number of errors, especially in spontaneous conversational speech, even with YouTube's ASR. Current technology simply cannot match the accuracy of manual human transcription.
DARLA's completely automated approach to sociophonetics may help open the way toward large-scale audio analysis, but there is a tradeoff. As in many other sciences, automated processing necessitates error reporting in the measurements, not just in the statistical modeling. We find that the system can be useful for extracting a representative vowel space for sociophonetic purposes, as long as error levels are considered and reported. In other words, fast large-scale data analysis requires a higher tolerance for noise in the data. If you need greater accuracy, please use our semi-automated system instead.
-
What is the semi-automated functionality? How does it differ from FAVE and other such tools?
This is designed for research that requires accurate human transcription. Our semi-automated system relies heavily on code from ProsodyLab Aligner and FAVE, but wraps it in a different interface and provides more output features.
DARLA allows you to upload your transcriptions in various formats: as a plaintext file, or as a TextGrid with a pair of boundaries around each transcribed sentence (the "Boundaries" option). You can also upload manually aligned/corrected TextGrids for formant extraction only. Another option is to use the completely automated system to generate ASR transcriptions, and then correct them using our online tool.
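To make the "Boundaries" option concrete, the sketch below writes a minimal one-tier TextGrid (in Praat's short text format) with a single pair of boundaries around one transcribed sentence; the tier name and timestamps are made up for illustration:

    # Hypothetical sketch: write a one-tier Praat TextGrid (short text format)
    # with a pair of boundaries around a single transcribed sentence.
    def write_boundaries_textgrid(path, total_dur, start, end, sentence):
        intervals = [(0.0, start, ""),        # untranscribed lead-in
                     (start, end, sentence),  # the sentence between the boundaries
                     (end, total_dur, "")]    # untranscribed remainder
        lines = ['File type = "ooTextFile"', 'Object class = "TextGrid"', "",
                 "0", str(total_dur),         # xmin, xmax of the whole grid
                 "<exists>", "1",             # tiers exist; one tier
                 '"IntervalTier"', '"sentence"',
                 "0", str(total_dur), str(len(intervals))]
        for xmin, xmax, text in intervals:
            lines += [str(xmin), str(xmax), '"%s"' % text]
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")

    write_boundaries_textgrid("interview.TextGrid", 60.0, 12.5, 15.2,
                              "she had her dark suit in greasy wash water")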
Some steps that require manual intervention in FAVE, like creating pronunciations of words that are not in the dictionary, are automated.
For research requiring perfect transcriptions, we recommend either the "Boundaries" option or using YouTube's closed captioning ASR as a first pass and then correcting it online with our playback tool. The plaintext method works just as well, but you will need to delete noises, laughter, the interviewer's voice, etc. With the Boundaries method, such deletions aren't necessary, since you are simply putting boundaries around the parts of the recording that you want.
-
What about noise or multiple voices in the recording?
The system cannot currently handle recordings with multiple speakers in an automated way (though this is an idea for future work).
When DARLA processes a recording with noise, laughter, loud breaths, background voices, music, etc., the ASR transcriptions or alignments are likely to be incorrect. If your recording would require a great deal of pre-cleaning, you might want to consider manually transcribing with the semi-automated method rather than the completely automated one.
However, it is easy to manually delete extraneous sounds and voices in Praat (select the noise and press Cmd+X or Ctrl+X).
If your recording includes an interviewer who doesn't have a microphone, this quiet background voice can confuse the ASR and aligner. The best solution is to delete the interviewer's voice (see above), but here are some other options, with a scripted alternative sketched after the list:
- Try "smoothing out" the amplitude of all voices on the recording: Load your file in Audacity, then click Effects > Compressor. That feature is a dynamic range adjuster which tries to make all voices approximately the same (pull the slider all the way to the left for the strongest effect).
- Try reducing the amplitude of the whole recording (in Audacity, click Effects > Amplify) so that the quieter voice effectively disappears. This may help remove quiet background voices that would be a problem for the aligner.
- You can also try increasing the amplitude of all voices so that the ASR transcription can "hear" all of them clearly: In Audacity, click Effects > Amplify.
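If you prefer scripting these adjustments in a batch, here is a rough analogue of the Audacity steps above, sketched with the pydub library (an assumption on our part; the parameter values are illustrative starting points, not tuned recommendations):

    from pydub import AudioSegment  # assumed third-party dependency
    from pydub.effects import compress_dynamic_range

    audio = AudioSegment.from_file("interview.wav")

    # Rough equivalent of Effects > Compressor: narrow the dynamic range so
    # the louder and quieter voices end up closer in level.
    evened = compress_dynamic_range(audio, threshold=-25.0, ratio=6.0)

    # Rough equivalent of Effects > Amplify: apply a uniform gain change
    # (negative dB to push a faint background voice below notice, positive
    # to make everything louder for the ASR).
    quieter = audio.apply_gain(-10.0)
    louder = audio.apply_gain(+6.0)

    evened.export("interview_compressed.wav", format="wav")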
-
Is the code for DARLA public? Can I contribute?
See the question "What do you use for alignment and extraction?" above for the alignment, extraction, and plotting code written by other researchers that we are building upon.
Our code for the rest of the system -- the web interface, the ASR, the YouTube features, online correction and evaluation of transcripts, handling different file formats, etc. -- is currently not public because it is under active development. We also think that most users prefer the convenience of the web interface rather than installing and wrangling with several programs on their computers.
If you are interested in contributing a new feature or modifying existing functionality, please e-mail us! We are excited to collaborate on related linguistics and computer science research, as well as on the software development front.