Completely Automated Alignment and Vowel Extraction
Our automated system takes uploaded audio files and returns ASR transcriptions, alignments, and vowel formant measurements.
We recommend reading the discussion of the completely automated system's functionality and limitations before you begin.
-
Audio with transcriptions from our automatic speech recognition system
This option transcribes your audio with an ASR system built on the CMU Sphinx framework, then runs the result through automated alignment with ProsodyLab Aligner and vowel extraction with FAVE-Extract.
You can also edit the transcripts produced by the speech recognizer and rerun the analysis.
Audio with transcriptions provided by YouTube closed captioning
Speech recognition developed by Google for YouTube captioning is much more accurate than the research-level ASR that we developed for DARLA.
Upload your audio at the link below, and you will receive an email from us with some information. Wait a few hours for YouTube to process the captions, then return to DARLA and enter the codes sent to your email. DARLA will automatically extract the YouTube captions (if available) and run your job through forced alignment and extraction.
The quality of the transcriptions is usually better than that of our built-in ASR, but reliability can be an issue: YouTube does not generate transcriptions for all uploaded videos, and its spam detector sometimes rejects video uploads.
Unlike our in-house ASR system, YouTube does not transcribe the audio immediately. If you use this option, expect to wait 5 or more hours after the initial upload while YouTube processes the closed captions.
-
ASR evaluation
Automated data analysis requires a higher tolerance for potential noise in the alignment and formant extraction results. You can estimate this noise using our transcription evaluation tool, which takes a manual transcription of your recording along with the ASR transcription of the same recording, and uses weighted Levenshtein distance to compute error rates for words, phonemes, and stressed vowels.
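To give a sense of how an error rate can be derived from a weighted Levenshtein distance, here is a minimal sketch in Python. It is an illustration only, not the code behind our evaluation tool: the function names `weighted_levenshtein` and `error_rate` are hypothetical, and the default costs of 1.0 for substitutions, insertions, and deletions are assumptions, since the weights used by the tool are not described here. The same functions work on lists of words, phonemes, or stressed vowels.

```python
from typing import Sequence


def weighted_levenshtein(ref: Sequence[str], hyp: Sequence[str],
                         sub_cost: float = 1.0,
                         ins_cost: float = 1.0,
                         del_cost: float = 1.0) -> float:
    """Minimum total cost of edits that turn `ref` into `hyp`.

    The cost values are illustrative assumptions, not the weights
    used by the evaluation tool.
    """
    # prev[j] = cost of turning the first i-1 reference tokens
    # into the first j hypothesis tokens (starts with i-1 = 0).
    prev = [j * ins_cost for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [i * del_cost]  # delete all i reference tokens
        for j, h in enumerate(hyp, start=1):
            match_or_sub = prev[j - 1] + (0.0 if r == h else sub_cost)
            deletion = prev[j] + del_cost       # drop reference token r
            insertion = curr[j - 1] + ins_cost  # add hypothesis token h
            curr.append(min(match_or_sub, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str], **costs: float) -> float:
    """Edit cost normalized by the length of the manual (reference) transcript."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return weighted_levenshtein(ref, hyp, **costs) / len(ref)


if __name__ == "__main__":
    manual = "the cat sat on the mat".split()
    asr = "the cat sat on mat".split()
    print(error_rate(manual, asr))
```

Normalizing by the length of the manual transcription follows the usual convention for word error rate, so a rate above 1.0 is possible when the ASR output contains many extra tokens.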