What is DARLA?
DARLA is a web application providing two main functionalities for vowel extraction from speech: completely automated and semi-automated.
The completely automated system transcribes the input speech data using automatic speech recognition (ASR), and then runs it through forced alignment and formant extraction.
The semi-automated system is a forced-alignment approach that aligns and extracts vowel formants from speech with manual transcriptions.
What output does the system return?
We e-mail you a set of files: a vowel plot showing the means of all vowels in your data (including both stressed and unstressed vowels); a spreadsheet with both unnormalized and Lobanov-normalized formant measurements; the same spreadsheet formatted for convenient uploading to NORM; the alignments; and the transcriptions. You can choose whether to filter out stop words (from this list) and high-bandwidth vowels.
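Lobanov normalization rescales each formant to a z-score within a single speaker, so that vowel spaces from different speakers become comparable. The sketch below illustrates the idea for one speaker and one formant. It is not DARLA's implementation: the function name, the example values, and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev


def lobanov(values):
    """Z-score one speaker's measurements for a single formant (e.g. F1 in Hz)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; an assumption of this sketch
    return [(v - mu) / sigma for v in values]


# Hypothetical F1 measurements (Hz) for one speaker.
f1_hz = [300.0, 500.0, 700.0]
print(lobanov(f1_hz))
```

In practice the same function would be applied separately to F1 and F2 for every speaker in the data.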
We recommend removing the unstressed vowels from your spreadsheet output before plotting it in NORM.
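One way to drop unstressed vowels is to use the stress digit that the CMU dictionary attaches to each vowel symbol (1 for primary stress, 2 for secondary stress, 0 for no stress). The sketch below assumes each row carries a vowel label such as "AE1". The field names "word" and "vowel" are hypothetical and may not match the columns in the DARLA spreadsheet.

```python
def is_stressed(vowel_label, keep_secondary=False):
    """Return True if an ARPABET vowel label such as 'AE1' carries stress."""
    stress = vowel_label[-1]
    return stress == "1" or (keep_secondary and stress == "2")


# Hypothetical rows; real spreadsheets will have more columns.
rows = [
    {"word": "ABOUT", "vowel": "AH0"},
    {"word": "ABOUT", "vowel": "AW1"},
    {"word": "TACK", "vowel": "AE1"},
]

stressed_rows = [row for row in rows if is_stressed(row["vowel"])]
print([row["vowel"] for row in stressed_rows])
```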
We also recommend that you manually organize your DARLA output into word classes like LOT, THOUGHT, NORTH, FORCE, and so on (e.g. following Wells' 1982 lexical sets and/or the vowel classes defined in ANAE word lists), rather than simply analyzing the data according to the vowel categories that come out of the CMU dictionary, which we use for alignment and extraction. The CMU dictionary (as used by both FAVE and DARLA) occasionally lacks the level of pronunciation detail that sociolinguists need for some subtle distinctions, mergers, and so on.
What ASR is used for the completely automated feature?
We use a model trained on over 400 hours of speech data with the CMU Sphinx toolkit.
How can I evaluate the accuracy of the ASR transcriptions?
Our online transcription evaluation tool uses the weighted Levenshtein distance algorithm to compute transcription error rates for words, phonemes, and stressed vowels. Simply upload the ASR transcription and the manual transcription in plaintext format.
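As an illustration of the underlying idea, here is a minimal sketch of a weighted Levenshtein distance and the error rate derived from it. The function names and the default costs of 1.0 are assumptions, and the weights used by the online tool are not stated here. The same functions work on lists of words, phonemes, or stressed vowels.

```python
def weighted_levenshtein(ref, hyp, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Minimum total cost of edits that turn the sequence ref into hyp."""
    prev = [j * ins_cost for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [i * del_cost]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + del_cost,                          # delete r
                curr[j - 1] + ins_cost,                      # insert h
                prev[j - 1] + (0.0 if r == h else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, **costs):
    """Edit cost divided by the length of the reference (manual) transcription."""
    if not ref:
        raise ValueError("reference transcription is empty")
    return weighted_levenshtein(ref, hyp, **costs) / len(ref)


manual = "i caught the ball".split()
asr = "i cot the ball".split()
print(error_rate(manual, asr))  # 0.25: one substitution over four reference words
```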
What do you use for alignment and extraction?
Which pronunciation dictionary do you use?
We use the CMU pronouncing dictionary, with some manual edits. You can check the current version of the dictionary here. Note: We recommend processing your data in terms of lexical sets like LOT, THOUGHT, NORTH, FORCE, etc., rather than simply depending on the vowel sets as defined by the CMU dictionary (like AA, AO, etc.). Like FAVE, DARLA depends on the CMU dictionary to assign vowel symbols for each given word in your transcription, and we occasionally find places in that dictionary where it doesn’t have the level of fine-grained pronunciation detail needed for various splits and mergers of interest to sociolinguists.
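A minimal sketch of how such a regrouping might look: start from a default mapping of dictionary vowel symbols to lexical sets, then override it for specific words where the dictionary symbol is too coarse. Both tables below are small hypothetical examples, not a complete or authoritative classification, and they are not part of DARLA.

```python
# Default lexical set for a CMU vowel symbol (stress digit removed). Hypothetical, incomplete.
DEFAULT_SET = {"AA": "LOT", "AO": "THOUGHT"}

# Word-specific overrides keyed by (word, vowel symbol). Hypothetical, incomplete.
WORD_OVERRIDES = {
    ("NORTH", "AO"): "NORTH",
    ("HORSE", "AO"): "NORTH",
    ("FORCE", "AO"): "FORCE",
    ("HOARSE", "AO"): "FORCE",
}


def lexical_set(word, vowel_label):
    """Assign a lexical set to one vowel token, falling back to the bare CMU symbol."""
    base = vowel_label.rstrip("012")  # strip the stress digit, e.g. 'AO1' -> 'AO'
    override = WORD_OVERRIDES.get((word.upper(), base))
    if override is not None:
        return override
    return DEFAULT_SET.get(base, base)


for word, vowel in [("cot", "AA1"), ("caught", "AO1"), ("north", "AO1"), ("force", "AO1")]:
    print(word, vowel, lexical_set(word, vowel))
```

Keying the overrides on both the word and the vowel symbol keeps other vowels in a polysyllabic word from being relabeled by accident.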
Why completely automate vowel extraction?
In the last few years, sociolinguists have begun using semi-automated speech processing methods such as Penn's FAVE program to extract vowel formants. These systems have accelerated the pace of linguistic research, but require significant human effort to manually create sentence-level transcriptions.
We believe that sociolinguistics is on the brink of an even more transformative technology: large-scale, completely automated vowel extraction without any need for human transcription. This technology would make it possible to quickly extract pronunciation features from hours of recordings, including YouTube and vast audio archives. DARLA takes a step in this direction.
How do we deal with ASR errors?
While ASR is far from perfect, we believe sociophoneticians do not need to wait for years to take advantage of speech recognition. Unlike applications like dictation software where accurate word recognition is the primary goal, sociophonetics typically focuses on a much narrower objective: extracting a representative vowel space for speakers, based on stressed vowel tokens. For example, it would usually not be crucial to know that the stressed vowel in the word "turning" was extracted from "turn it" rather than "turning", or that "tack" was wrongly transcribed as "sack". Such differences will have little effect on the speaker's vowel space for many sociophonetic questions.
It turns out that most ASR errors affect the identity of the words but not the identity of the vowels (especially stressed vowels), which makes ASR an ideal technology for automated vowel analysis. Of course, there will be instances of vowel error, but their effect is reduced when the data contain hundreds or thousands of vowel tokens.
Important contrasts like "cot" versus "caught" tend to be handled by ASR's modeling of grammatical plausibility (using a language model). The system would be unlikely to transcribe "I caught the ball" as "I cot the ball" since the latter would be improbable under an English language model.
Finally, since DARLA shows probabilities for the phonetic environment around each vowel (e.g., obstruent+vowel+nasal consonant), researchers can examine contrasts like pin/pen versus pit/pet.
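To make the notion of a phonetic environment concrete, the sketch below labels the segments on either side of each vowel in a CMU-style phone sequence. It is an illustration only: the class sets, function names, and output shape are assumptions and do not reproduce DARLA's output, which reports probabilities.

```python
NASALS = {"M", "N", "NG"}
OBSTRUENTS = {
    "P", "B", "T", "D", "K", "G", "CH", "JH",
    "F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH",
}


def is_vowel(phone):
    """CMU-style vowel labels end in a stress digit (0, 1, or 2)."""
    return phone[-1] in "012"


def segment_class(phone):
    """Coarse class of a neighboring segment; None marks a word boundary."""
    if phone is None:
        return "boundary"
    if phone in NASALS:
        return "nasal"
    if phone in OBSTRUENTS:
        return "obstruent"
    if is_vowel(phone):
        return "vowel"
    return "other"  # liquids and glides


def environments(phones):
    """Return (vowel, preceding class, following class) for each vowel in a word."""
    result = []
    for i, phone in enumerate(phones):
        if is_vowel(phone):
            before = phones[i - 1] if i > 0 else None
            after = phones[i + 1] if i + 1 < len(phones) else None
            result.append((phone, segment_class(before), segment_class(after)))
    return result


print(environments(["P", "IH1", "N"]))  # pin: obstruent + vowel + nasal
print(environments(["P", "IH1", "T"]))  # pit: obstruent + vowel + obstruent
```

Grouping tokens by the following class would separate the pre-nasal vowels of pin/pen from the pre-obstruent vowels of pit/pet.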
Sounds great! What's the catch with the completely automated system?
DARLA's completely automated system cannot provide perfect transcriptions. The automatic transcriptions typically contain a very large number of errors, especially in free speech data, even using Google's ASR. Current technology cannot match the accuracy of human manual transcriptions.
DARLA's completely automated approach to sociophonetics may help open the way toward large-scale audio analysis, but there is a tradeoff. As with many other sciences, automated processing necessitates error-reporting in the measurements, not just in the statistical modeling. We find that the system can be useful for extracting a representative vowel space for sociophonetic purposes, as long as error levels are considered and reported. In other words, fast large-scale data analysis requires a higher tolerance of noise in the data. If you need greater accuracy, please use our semi-automated system instead.
What is the semi-automated functionality? How does it differ from FAVE and other such tools?
The semi-automated functionality is designed for research that requires accurate human transcription. It relies heavily on code from the Montreal Forced Aligner and FAVE, but wraps it in a different interface and provides more output features.
DARLA allows you to upload your transcriptions in various formats: as a plaintext file, or as a TextGrid with a pair of boundaries around each transcribed sentence (the "Boundaries" option). You can also upload manually aligned/corrected TextGrids for formant extraction only. Another option is to use the completely automated system to generate ASR transcriptions, and then correct them using our online tool.
Some steps that require manual intervention in FAVE, like creating pronunciations of words that are not in the dictionary, are automated.
For research requiring perfect transcriptions:
- Use the completely automated ASR option as a first pass and then correct the transcriptions online with our playback tool, OR
- If you can spend the time to produce manual transcriptions, use the semi-automated Boundaries option. The semi-automated plaintext method works just as well, but you will need to delete noises, laughter, the interviewer's voice, etc. With the Boundaries method, such deletions aren't necessary since you are simply putting boundaries around the parts of the recording that you want.
What about noise or multiple voices in the recording?
The system cannot currently handle recordings with multiple speakers in an automated way (though this is an idea for future work).
When DARLA processes a recording with noise, laughter, loud breaths, background voices, music, etc., the ASR transcriptions or alignments are likely to be incorrect. If your recording would require a great deal of pre-cleaning, you might want to consider manually transcribing with the semi-automated method rather than the completely automated one.
However, it is easy to manually delete extraneous sounds and voices in Praat (select the noise and press Cmd+X or Ctrl+X).
If your recording includes an interviewer who doesn't have a microphone, this quiet background voice can cause confusion for the ASR and aligner. The best solution is to delete the interviewer voice (see above), but here are some other options:
- Try "smoothing out" the amplitude of all voices on the recording: Load your file in Audacity, then click Effect > Compressor. That feature is a dynamic range adjuster which tries to make all voices approximately the same volume (pull the slider all the way to the left for the strongest effect).
- Try reducing the amplitude of the whole recording (in Audacity, click Effect > Amplify) so that the quieter voice becomes inaudible. This may help remove quiet background voices that would be a problem for the aligner.
- You can also try increasing the amplitude of all voices so that the ASR transcription can "hear" all of them clearly: In Audacity, click Effect > Amplify.
Is the code for DARLA public? Can I contribute?
See this question for links to the alignment, extraction, and plotting code written by other researchers that we are building upon.
Our code for the rest of the system -- the web interface, the ASR, online correction and evaluation of transcripts, handling different file formats, etc. -- is currently not public because it is under active development. We also think that most users prefer the convenience of the web interface to installing and wrangling several programs on their computers.
If you are interested in contributing a new feature or modifying existing functionality, please e-mail us! We are excited to collaborate on related linguistics and computer science research, as well as on the software development front.