Transcription and Synchronization Products and Services
- CastingWords
- CastingWords is a pay-per-transcript service. You upload an audio file (for example, an MP3 file extracted from your movie file), and CastingWords sends back a transcript within 2 to 3 days. The charge is $0.75 per minute, so if you whittle a 50-minute lecture down to 40 worthwhile minutes of discussion, the charge would be only $30. If you have videotaped the lecture or event, you will still need to synchronize the transcript to the movie after you receive it. But transcription is often the most time-consuming and error-prone part of captioning, so having a service handle it could be a major benefit.
- Caption Mic
- Caption Mic is an "echoing" system that uses optimized speech recognition software. A typical scenario using Caption Mic's software with their ccSatellite system would go something like this: The instructor wears a wireless headset that sends a signal to an in-class box, which relays the voice to a captioner at a remote location. The captioner echoes the instructor's words, adding punctuation verbally. On the captioner's end, an optimized laptop running the Caption Mic software generates a timed transcript using speech recognition. On the fly, the transcript is made available via a per-event URL, which any hearing-impaired person in the audience can open on a personal laptop or which can be projected onto a screen for the entire audience. After the event, Caption Mic generates a timed transcript in SAMI or QText format. These files can be used for distribution via Windows Media Player, QuickTime, or NCAM's CCforFlash player or component.
- Automatic Sync
- If you supply the transcript, Automatic Sync can return a time-stamped (that is, synchronized) file within about five minutes. Output can be supplied to accommodate a number of media players: DFXP for NCAM's CCforFlash player (or component), SAMI and ASX for Windows Media Player, and SMIL and QText for QuickTime and RealPlayer. Only the audio track needs to be submitted. The synchronization is achieved via automated matching facilitated by speech recognition. If the original media needs to be transcribed, turnaround is 2 to 3 business days.
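To make "timed transcript" concrete, the following is a minimal sketch of how caption text paired with start times might be written out as a SAMI file. It is not the output of Caption Mic or Automatic Sync. The function name `to_sami`, the sample cues, and the `ENUSCC` class name are assumptions chosen for illustration, and a production file would normally carry fuller style information.

```python
from html import escape


def to_sami(cues, title="Captions"):
    """Render (start_ms, text) pairs as a minimal SAMI document.

    `cues` is a hypothetical input format: a list of tuples holding the
    start time in milliseconds and the caption text shown from that time.
    """
    lines = [
        "<SAMI>",
        "<HEAD>",
        f"<TITLE>{escape(title, quote=False)}</TITLE>",
        '<STYLE TYPE="text/css">',
        "<!--",
        # Class name is an illustrative assumption for US English captions.
        ".ENUSCC { Name: English; lang: en-US; SAMIType: CC; }",
        "-->",
        "</STYLE>",
        "</HEAD>",
        "<BODY>",
    ]
    # SAMI expresses each caption's start time in milliseconds.
    for start_ms, text in sorted(cues):
        caption = escape(text, quote=False)
        lines.append(
            f"<SYNC Start={int(start_ms)}><P Class=ENUSCC>{caption}</P></SYNC>"
        )
    lines += ["</BODY>", "</SAMI>"]
    return "\n".join(lines)


# Made-up sample cues, for illustration only.
cues = [(0, "Welcome to the lecture."), (2500, "Today we discuss captioning.")]
sami_text = to_sami(cues)
print(sami_text)
```

The same list of cues could be rendered into the other formats mentioned above (DFXP, QText, SMIL) by swapping out the rendering function, which is why a timed transcript is a useful intermediate product.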
Automating Transcription Using Dragon Naturally Speaking Speech-to-Text Software
One area of research in our project involves experimentation with speech recognition software as a possible solution for the generation of transcripts (and, possibly, for real-time captioning). In the Windows world, Dragon Naturally Speaking is the primary speech-to-text package, and it improves with each version. However, Dragon is not designed to recognize voices generically; it must be trained to recognize your voice. (IBM's ViaVoice, which runs on Windows or Mac, and iListen, for Mac, also require training for efficient voice recognition.)
On this page we discuss procedures for training Dragon without access to the original speaker. This would be necessary in cases such as the transcription of archived materials. We also compare the accuracy of untrained automated transcription with that of trained recognition.
In a nutshell, without being trained for a specific voice, Dragon does not seem to be a practical tool for transcription, at least in our initial evaluation. More experimentation is needed on this front. On the other hand, under ideal conditions, Dragon can approach 99 percent accuracy using a specifically trained voice profile, making it highly useful for after-the-fact dictation of lectures and, possibly, even for real-time captioning.
Using Dragon, Untrained, with Prerecorded Materials (video or audio): General Procedure and Tips
First, Dragon must have a voice profile created for the "device" used as input. It will take "dictation" in the form of WAV files (Microsoft's audio file format). This is the setup we used. Alternatively, if you are fortunate enough to have access to the speaker you wish to transcribe, she can create a voice profile, and Dragon can be set to listen to the audio stream using that custom profile. The education license for Dragon Preferred, the software license and version we used, allows multiple voice profiles to be created, so a single installation of Dragon can be used by many different speakers.
Normally, when training Dragon, a microphone is attached to the computer, and the speaker reads some simple text to create a voice profile. The process takes around five minutes. To train Dragon for "dictation," on the other hand, a new user is created using a clip of recorded audio of a pre-established text. (Dragon has a compiled-in library of literary texts it uses for "dictation" training.) This mode of training is more time-consuming. Dragon wants around 15 minutes of audio from the dictation device (a WAV file in our case), and it takes Dragon up to an hour and a half to process the recording into a voice profile.
We trained Dragon's WAV profile against a speaker different from the one whose video we were going to transcribe. The test here was to determine the accuracy of transcription from a known device (the WAV file) but an unknown speaker.
The video we used was a "talking-head," single-speaker recording.
We then captured the audio from the video source. Obviously, this step is unnecessary if you are transcribing an audio source that is already in WAV format. We played the video and used Audacity to record the audio output as 16-bit sound at a sample rate of 22050 Hz, mono. These are the recommended quality and rate settings for Dragon dictation. Audacity then saved the recording as a WAV file, and the WAV file was fed to Dragon to generate the initial transcript.
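Before handing a recording to Dragon, it can be worth confirming that the WAV file really has the settings described above (16-bit, 22050 Hz, mono). The following is a minimal sketch using Python's standard `wave` module; it is not part of our original procedure, and the helper names (`wav_settings`, `mismatches`, `write_silence`) are assumptions made for illustration.

```python
import os
import tempfile
import wave

# Settings recommended above for Dragon dictation: mono, 16-bit, 22050 Hz.
RECOMMENDED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 22050}


def wav_settings(path):
    """Read the channel count, sample width, and sample rate of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }


def mismatches(path):
    """Return {setting: (actual, recommended)} for each setting that differs."""
    actual = wav_settings(path)
    return {
        key: (actual[key], wanted)
        for key, wanted in RECOMMENDED.items()
        if actual[key] != wanted
    }


def write_silence(path, seconds=1, channels=1, sample_width=2, frame_rate=22050):
    """Write a silent WAV file; used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(frame_rate)
        wav.writeframes(b"\x00" * (seconds * frame_rate * channels * sample_width))


# Demonstration with a generated file standing in for a real recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path, channels=2, frame_rate=44100)
print(mismatches(demo_path))
```

An empty result means the file matches the recommended settings. Any entries that are reported show the actual value first and the recommended value second, which indicates what to change in Audacity before exporting again.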
Multiple-Speaker Situations
For voice recognition transcription, multiple-speaker situations present enormous problems. It is possible that we might get decent initial transcription results if the initial device voice profile were created using similarly accented speech and all of the speakers were mic-ed so that volume and clarity were very similar. This needs to be tested, but we are skeptical.
Another possibility for multiple speakers is to have an "audio transcriptionist." This person would have a well-trained voice profile. She would repeat the speakers' words verbatim, possibly introducing each speaker. It is likely such a setup would be cheaper and potentially more accurate than using a hired transcriptionist with a stenography machine (minimum $80 per hour, typically $120 per hour, with a two-hour minimum).
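To put the stenographer rates quoted above in perspective, the arithmetic can be sketched as follows. This is a rough illustration only: the function names are hypothetical, hours are assumed to be billed without rounding, and the $0.75 per-minute figure is the pay-per-transcript rate mentioned earlier on this page. We have no cost figure for an audio transcriptionist, so none is modeled.

```python
def stenographer_cost(event_hours, hourly_rate, minimum_hours=2):
    """Cost of a hired stenographer, applying the two-hour minimum."""
    billable_hours = max(event_hours, minimum_hours)
    return billable_hours * hourly_rate


def per_minute_service_cost(minutes, rate_per_minute=0.75):
    """Cost of a pay-per-transcript service billed by the minute."""
    return minutes * rate_per_minute


# A 50-minute lecture, as in the example earlier on this page.
lecture_minutes = 50
low = stenographer_cost(lecture_minutes / 60, hourly_rate=80)
typical = stenographer_cost(lecture_minutes / 60, hourly_rate=120)
service = per_minute_service_cost(lecture_minutes)

print(f"Stenographer: ${low:.2f} to ${typical:.2f}")
print(f"Per-minute service: ${service:.2f}")
```

Because of the two-hour minimum, a short lecture is billed as if it ran two hours, which is where most of the difference comes from. The stenographer, however, delivers text in real time, while the per-minute service returns a transcript days later, so the two figures are not strictly comparable.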
Dictation Recognition Using a Voice Profile Created Independent of the Speaker
After creating a generic voice profile, we fed Dragon WAV audio recorded from a movie clip. The clip, downloaded from the archives section at the Law, Health Policy & Disability Center at the University of Iowa, was chosen because the speaker speaks slowly and clearly.
The image below gives a visual representation of the difference between the text Dragon dictated and the actual spoken text. (The final captioned movie is available in the Resources sidebar on this web page.) Dragon's mistakes are grayed out. The corrections are in green.
Of the 667 words in the passage, approximately 120 were misrecognized, which represents an accuracy rate of a little over 80%. However, correction time was problematic because some of the errors were difficult to locate. Recognition failed so radically in places that tracking down errors greatly increased editing time.
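The accuracy figure above is simple arithmetic: the share of words that were not misrecognized. A minimal sketch follows; the function name `word_accuracy` is an assumption for illustration, and the error count is the approximate hand count reported above.

```python
def word_accuracy(total_words, misrecognized):
    """Fraction of words recognized correctly."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1 - misrecognized / total_words


# Counts from the untrained-profile test described above.
untrained = word_accuracy(total_words=667, misrecognized=120)
print(f"{untrained:.1%}")  # prints 82.0%
```

Because the error count is approximate, the result should be read as "a little over 80 percent" rather than as a precise measurement.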
The lesson is that, with current software at least, the speech patterns of an unfamiliar speaker cannot be recognized very accurately. Accuracy would likely improve if the user repeatedly trained Dragon by making corrections to the transcription via the Dragon interface. (Dragon improves voice profiles when you correct within the Dragon editing environment.) However, there is currently no way to train Dragon on another person's voice unless the speaker is available and submits to the time-consuming procedure of dictation training. With many archived materials, the person performing the transcription will have no access to the original speaker. Experimentation needs to be done on how best to optimize Dragon for untrained voices. IBM ViaVoice and iListen may perform better with untrained voices, but testing is required to determine this.
Dictation Recognition Using a Voice Profile that Matches the Speaker
The image below gives a visual representation of the difference between the text Dragon dictated and the actual spoken text in a situation where the speaker had trained Dragon on a particular device. Dragon's mistakes are grayed out. The corrections are in green. This particular text selection was dictated with autopunctuation on. The autopunctuation feature seems a little underdeveloped: it appears to be triggered only by pauses in speaking and does not seem to pay attention to grammar.
There are approximately 22 misrecognized words in a passage of 476, which yields roughly 95% accuracy. In editing, most of the errors were very easy to spot and correct within Dragon. With further training and better mic-ing and recording, Dragon would probably achieve results approaching 99%. The sticking point is the pace and clarity of the presenter's speech. Instructors who mumble or rush will not get good results, and it is almost certain that over the course of a one-hour class or lecture, accuracy will vary as the speaker shifts topics, changes modes of delivery, or simply becomes tired.
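When a corrected transcript exists alongside Dragon's raw output, misrecognized words could also be counted automatically rather than by eye. The following is a minimal sketch using Python's standard `difflib` module. It is one possible approach, not the method used to produce the figures on this page; the function name and the sample sentences are made up for illustration, and punctuation is not normalized.

```python
import difflib


def count_misrecognized(reference, hypothesis):
    """Count word-level differences between a corrected transcript
    (reference) and the recognizer's raw output (hypothesis).

    Returns (errors, total_reference_words).
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A changed span counts as the larger of the two word runs, so
            # substitutions, omissions, and insertions are all included.
            errors += max(i2 - i1, j2 - j1)
    return errors, len(ref_words)


# Made-up sentences: one substituted word and one dropped word.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumps over lazy dog"
errors, total = count_misrecognized(reference, hypothesis)
print(errors, total)
```

The alignment `difflib` produces is not guaranteed to yield the smallest possible edit count, so the result is best treated as an estimate. Dividing the error count by the total and subtracting from one gives the same kind of accuracy percentage quoted above.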