DARLA: Automated Vowel Extraction

What is DARLA?

DARLA is a web application providing two main functionalities for vowel extraction from speech: completely automated and semi-automated.

The completely automated system transcribes the input speech data using automatic speech recognition (ASR), and then runs it through forced alignment and formant extraction. There is a new option to use a system called Bed Word, which provides industry-quality transcriptions.

The semi-automated system is a forced-alignment approach that aligns and extracts vowel formants from speech with manual transcriptions.

What output does the system return?

We e-mail you a set of files: a vowel plot showing the mean of each vowel in your data (including both stressed and unstressed vowels); a spreadsheet with both unnormalized and Lobanov-normalized formant measurements; the same spreadsheet formatted for convenient uploading to NORM; the alignments; and the transcriptions. You can choose whether to filter out stop words (from this list) and high-bandwidth vowels.
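For readers unfamiliar with Lobanov normalization, here is a minimal Python sketch of the idea: each formant is converted to a z-score using that speaker's own mean and standard deviation for that formant. This is an illustration only, not DARLA's code; the function name and sample values are made up, and the use of the sample standard deviation is an assumption (implementations differ on this point).

```python
from statistics import mean, stdev


def lobanov_normalize(values):
    """Z-score one speaker's measurements of a single formant.

    Assumption: the sample standard deviation is used; some tools
    use the population standard deviation instead.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up F1 values (Hz) for one speaker, for illustration only.
f1_hz = [300.0, 500.0, 700.0]
f1_norm = lobanov_normalize(f1_hz)
print(f1_norm)
```

The same function would be applied separately to each formant (F1, F2, ...) and separately to each speaker.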

We recommend removing the unstressed vowels from your spreadsheet output before plotting it in NORM.
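One way to do this outside a spreadsheet program is sketched below in Python. The column names (vowel, stress, word, F1, F2) and the sample rows are assumptions made for illustration; check the header row of your own output file. The sketch relies on the CMU dictionary convention that a stress value of 0 marks an unstressed vowel, 1 primary stress, and 2 secondary stress.

```python
import csv
import io

# Invented sample data; the column names are hypothetical.
SAMPLE = """vowel,stress,word,F1,F2
AE,1,CAT,700,1800
AH,0,ABOUT,550,1400
AH,2,UNDERSTAND,600,1350
"""


def drop_unstressed(rows, stress_column="stress"):
    """Keep rows whose stress value is not 0 (0 = unstressed)."""
    return [row for row in rows if row[stress_column] != "0"]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
stressed = drop_unstressed(rows)
print([row["word"] for row in stressed])
```

This version keeps vowels with secondary stress; whether to keep them is a decision for your own analysis.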

We also recommend that you manually organize your DARLA output into word classes like LOT, THOUGHT, NORTH, FORCE, and so on (e.g. following Wells' 1982 lexical sets and/or the vowel classes defined in ANAE word lists), rather than simply analyzing the data according to the vowel categories that come out of the CMU dictionary, which we use for alignment and extraction. The CMU dictionary (as used by both FAVE and DARLA) occasionally lacks the level of pronunciation detail that sociolinguists need for some subtle distinctions, mergers, etc.
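As a rough illustration of this kind of regrouping, the Python sketch below relabels tokens by word using a small hand-made lookup table and falls back to the CMU vowel label when a word is not listed. The table, the function name, and the sample tokens are hypothetical; in practice you would build the lookup from the word lists you are following.

```python
# Hypothetical word-to-lexical-set table; extend it from your own word lists.
WORD_TO_SET = {
    "COT": "LOT",
    "CAUGHT": "THOUGHT",
    "HORSE": "NORTH",
    "HOARSE": "FORCE",
}


def lexical_set(word, cmu_vowel):
    """Return the lexical set for a word, or the CMU vowel if unlisted."""
    return WORD_TO_SET.get(word.upper(), cmu_vowel)


# Sample (word, CMU vowel) tokens, for illustration only.
tokens = [("horse", "AO"), ("hoarse", "AO"), ("caught", "AO"), ("dog", "AO")]
labels = [lexical_set(word, vowel) for word, vowel in tokens]
print(labels)
```

Words missing from the table keep their CMU label, which makes it easy to spot tokens that still need to be classified by hand.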

What ASR is used for the completely automated feature?

For the Bed Word system, we use a third-party company called Deepgram, which has its own state-of-the-art model.

How can I evaluate the accuracy of the ASR transcriptions?

Our online transcription evaluation tool uses the weighted Levenshtein distance algorithm to compute transcription error rates for words, phonemes, and stressed vowels. Simply upload the ASR transcription and the manual transcription in plaintext format.
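To show what a weighted Levenshtein comparison involves, here is a minimal Python sketch that computes a word-level error rate between a manual and an ASR transcription. It is not the code behind our tool: the function name, the cost parameters, and their default values of 1.0 are assumptions chosen for illustration.

```python
def weighted_levenshtein(ref, hyp, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Minimum total cost of edits turning the sequence ref into hyp."""
    # prev[j] holds the cost of turning the first i-1 items of ref
    # into the first j items of hyp.
    prev = [j * ins_cost for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [i * del_cost]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + del_cost,                          # delete r
                curr[j - 1] + ins_cost,                      # insert h
                prev[j - 1] + (0.0 if r == h else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Invented example transcriptions, split into words.
manual = "the cat sat on the mat".split()
asr = "the cat sat on a mat today".split()

distance = weighted_levenshtein(manual, asr)
error_rate = distance / len(manual)
print(distance, round(error_rate, 3))
```

The same function works on sequences of phonemes or stressed vowels instead of words, and changing the three costs changes how heavily each type of error counts.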

What do you use for alignment and extraction?

We use the same methods as FAVE. Our forced alignment is done with the Montreal Forced Aligner, and formant measurement with FAVE-extract. We also use the R vowels package for plotting.

Here is a recent paper that tested DARLA's alignment system: MacKenzie, Laurel, and Danielle Turton (2020). Assessing the accuracy of existing forced alignment software on varieties of British English. Linguistics Vanguard. https://doi.org/10.1515/lingvan-2018-0061

Which pronunciation dictionary do you use?

We use the CMU pronouncing dictionary, with some manual edits. You can check the current version of the dictionary here. Note: We recommend processing your data in terms of lexical sets like LOT, THOUGHT, NORTH, FORCE, etc., rather than simply depending on the vowel sets as defined by the CMU dictionary (like AA, AO, etc.). Like FAVE, DARLA depends on the CMU dictionary to assign vowel symbols for each given word in your transcription, and we occasionally find places in that dictionary where it doesn’t have the level of fine-grained pronunciation detail needed for various splits and mergers of interest to sociolinguists.

Why completely automate vowel extraction?

In recent years, sociolinguists have begun using semi-automated speech processing methods such as Penn's FAVE program to extract vowel formants. These systems have accelerated the pace of linguistic research, but require significant human effort to manually create sentence-level transcriptions.

We believe that sociolinguistics is on the brink of an even more transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. This technology would make it possible to quickly extract pronunciation features from hours of recordings, including YouTube and vast audio archives. DARLA takes a step in this direction.

Sounds great! What's the catch with the completely automated system?

DARLA's completely automated system cannot provide perfect transcriptions. This approach to sociophonetics may help open the way toward large-scale audio analysis, but there is a tradeoff. As in many other sciences, automated processing requires reporting error in the measurements, not just in the statistical modeling. We find that the system can be useful for extracting a representative vowel space for sociophonetic purposes, as long as error levels are considered and reported. In other words, fast large-scale data analysis requires a higher tolerance of noise in the data. If you need greater accuracy, please use our semi-automated system instead.

Does DARLA use my audio data for other services? How is it stored?

When you use any of DARLA's in-house services, your files are uploaded to the DARLA servers, where they are processed and later deleted. We will not use your data or your email address for any other purpose.

For Bed Word, since we use Deepgram as a third-party service, we upload your audio data to Deepgram's servers. We do not provide your email address to Deepgram. Per their privacy notice, Deepgram retains the right to hold records of uploaded audio data but will not use it for any purpose without permission from the audio owner. As with any academic research, it is important to consult with your university's Institutional Review Board when using third-party data services.

What is the semi-automated functionality? How does it differ from FAVE and other such tools?

This is designed for research that requires accurate human transcription. Our semi-automated system relies heavily on code from the Montreal Forced Aligner and FAVE, but wraps it in a different interface and provides more output features.

DARLA allows you to upload your transcriptions in various formats: as a plaintext file, or as a TextGrid with a pair of boundaries around each transcribed sentence (the "Boundaries" option). You can also upload manually aligned/corrected TextGrids for formant extraction only. Another option is to use the completely automated system to generate ASR transcriptions, and then correct them using our online tool.

Some steps that require manual intervention in FAVE, like creating pronunciations of words that are not in the dictionary, are automated.

For research requiring perfect transcriptions:

Use the completely automated ASR option as a first pass and then correct the transcriptions online with our playback tool, OR
If you can spend the time to produce manual transcriptions, use the semi-automated Boundaries option. The semi-automated plaintext method works just as well, but you will need to delete noises, laughter, the interviewer's voice, etc. With the Boundaries method, such deletions aren't necessary since you are simply putting boundaries around the parts of the recording that you want.

What about noise or multiple voices in the recording?

The Alignment and Vowel Extraction system cannot currently handle transcriptions with multiple speakers in an automated way. However, Bed Word can remove interviewer audio.

When DARLA processes a recording with noise, laughter, loud breaths, background voices, music, etc., the ASR transcriptions or alignments are likely to be incorrect. If your recording would require a great deal of pre-cleaning, you might want to consider manually transcribing with the semi-automated method rather than the completely automated one.

However, it is easy to manually delete extraneous sounds and voices in Praat (select the noise and press Cmd+X or Ctrl+X).

If your recording includes an interviewer who doesn't have a microphone, this quiet background voice can cause confusion for the ASR and aligner. The best solution is to delete the interviewer voice (see above), but here are some other options:

Try "smoothing out" the amplitude of all voices on the recording: Load your file in Audacity, then click Effects > Compressor. That feature is a dynamic range adjuster which tries to make all voices approximately the same (pull the slider all the way to the left for the strongest effect).
Try reducing the amplitude of the whole recording (in Audacity, click Effects > Amplify) so that the quieter voice becomes effectively inaudible.
You can also try increasing the amplitude of all voices so that the ASR transcription can "hear" all of them clearly: In Audacity, click Effects > Amplify.

Is the code for DARLA public? Can I contribute?

See this question for links to the alignment, extraction, and plotting code written by other researchers that we are building upon.

Our code for the rest of the system -- the web interface, the ASR, online correction and evaluation of transcripts, handling different file formats, etc. -- is currently not public because it is under active development. We also think that most users prefer the convenience of the web interface rather than installing and wrangling with several programs on their computers.

If you are interested in contributing a new feature or modifying an existing functionality, please e-mail us! We are excited to collaborate on related linguistics and computer science research, as well as the software development front.
