A relatively pain-free way to have accessible, captioned video in your web pages.
— brought to you by the friendly people at The Ohio State University Web Accessibility Center
With a little bit of work, some free online tools, and code and utilities available from this web page, you can provide your students, staff, and other users within and outside the university access to web video that is usable by everyone, including people with disabilities.
On this page, we cover some effective methods for captioning and embedding YouTube video in your web pages. We also describe and link to a tool for converting YouTube captions into formats suitable for use in other video players.
YouTube may not be the appropriate host for some materials over which the author wants or needs to maintain strict control due to intellectual property or privacy concerns. Also, YouTube videos are limited to 10 minutes. So, if you have longer material, you will need to segment the longer video into parts of 10 minutes or less. But if you want your video shared freely with the world, it is hard to conceive of a better venue than YouTube.
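As a rough aid for planning the split, the sketch below computes cut points for a long video so that no part exceeds the limit. The function name and the default of 600 seconds are our own illustrative choices; in practice you would nudge each cut to a natural pause in your video editor.

```python
def segment_bounds(total_seconds, max_seconds=600):
    """Return (start, end) pairs in seconds, each at most max_seconds long."""
    if total_seconds <= 0 or max_seconds <= 0:
        raise ValueError("durations must be positive")
    bounds = []
    start = 0
    while start < total_seconds:
        # The last part is simply whatever remains.
        end = min(start + max_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


# A 25-minute video becomes three parts: 10, 10, and 5 minutes.
print(segment_bounds(25 * 60))
```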
If you decide to use YouTube for your video, there are a couple of things you can do to ensure that the video is accessible to people with disabilities, who may have a difficult time operating the YouTube video player controls or who are deaf or hard-of-hearing.
For people with motor disabilities or who rely on screen readers to access web content, you can embed the video in your web pages in a manner that guarantees the playback controls are usable. For the deaf and hard-of-hearing, you can provide captions.
Of course, captioning benefits people beyond your deaf and hard-of-hearing users. YouTube supports subtitles (synchronized transcripts in languages different from the original audio) using the same mechanism as captions. Subtitles can facilitate communication with people across the globe, as well as help students trying to learn a foreign language.
Captions can also help users in noisy venues (the gym, a café) or places where there is enforced quiet (libraries and computer labs, for instance). And they can help both users with cognitive disabilities and non-native speakers by presenting content in multiple “modes,” both aural and textual.
Finally, it is worth pointing out that The Ohio State University has a Web Accessibility Policy that requires all video that can be accessed by the general public to have synchronized captions. (Regarding non-public media, the policy states that video with a known, limited, and secured viewership, such as video for courses requiring registration, internal OSU video demonstrations or staff-educational materials, etc., must be captioned in a timely manner, on request. Of course, we would prefer all video to be captioned.)
So, not only are you doing a service to your viewers by captioning and providing accessible controls, you are conforming to university policy.
Probably the most time-consuming part of captioning is getting an accurate transcript of the audio. The other parts of the process—editing the transcript and synchronizing it with the video—can also be difficult and labor intensive, though, as we will see, synchronizing the caption has gotten significantly easier since YouTube introduced automatic caption timing in the Fall of 2009.
Though captioning can be time-consuming, after you have captioned one or two videos a natural workflow and pacing develops, and you will gain proficiency and figure out which techniques work best for your particular situation.
You may also decide that parts of your process should be farmed out to staff, student employees, or pay-for services. In addition to covering the basics of the process, below we also try to outline what we think are good practices to follow and provide suggestions on software and mention a few helpful services.
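In outline, the steps in captioning YouTube video are: get an accurate transcript of the audio, break (“chunk”) the transcript into caption-sized pieces, synchronize the chunked transcript with the video, and upload the timed caption file to YouTube.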
To produce good captions for your video, you will need an accurate transcription of the spoken audio of your video.
There are many ways to get a good transcript. One way is to use a transcription service. One transcription service we have heard good things about is Casting Words. A high-quality transcription with a six-day turn-around will cost $1.50 per minute—$90 an hour.
Another way is to do the transcription yourself. This laborious process can be helped along with a couple of software programs.
One program that may help is Express Scribe, which is free and works in Windows and Mac. Express Scribe offers you a lot of control over audio playback: you can use simple keystrokes to pause, play, and rewind in short increments. The Express Scribe player can be minimized and “pinned” so that it is always visible and its keyboard shortcuts are always available.
Express Scribe requires audio in either WAV or MP3 format, so if you want to use it, you will need to extract the audio from your video. You can do that with the free online service Media Convert or with various video and audio programs, such as QuickTime Pro ($30). (If your video camera produces MOV files as output, QuickTime Pro is certainly worth having.) VLC Media Player is a free program for Mac, Windows, and Linux that can play back and convert to and from a wide variety of formats. It includes the ability to export WAV and other formats.
A speech recognition program, which turns speech into text, may help with transcription as well. If you are using a recent version of Windows (Windows Vista and later), you have available a very good speech recognition program called Windows Speech Recognition. Earlier versions of Windows have speech recognition programs, but they are a bit clunky and not very accurate. The excellent speech recognition in Windows 7 may help significantly with transcription. It is equivalent in quality to commercial speech recognition programs, such as Dragon NaturallySpeaking (Windows) and MacSpeech Dictate (Mac).
All speech recognition programs, including Windows Speech Recognition, need to be trained to recognize your voice. Training typically takes only a half-hour or so. Once you have trained the speech recognition, transcription is a process of listening and using a headset mic to echo back the spoken audio.
In our work transcribing using speech recognition, we have found that producing an accurate transcript typically takes between three and four times as long as the original audio. So, a half hour of video will take you roughly an hour and a half to two hours to transcribe accurately, once you get proficient using speech recognition with Express Scribe. This may seem like a long time. But try typing out a transcript manually, and you will see that 3:1 is not that bad. If you have lots of specialized vocabulary or names, especially non-Western names, in your audio, speech recognition will go more slowly, and you will need to train the software to accurately transcribe the unusual words.
Whatever tools you use to get your transcription, take your time and produce a highly accurate transcription. Accurate transcription and good synchronization are the cornerstones of quality captioning.
Chunking a transcript involves breaking it into lengths appropriate to be displayed in one pop-on caption. A chunk can be one or two lines of transcript. With more than that, the captions start to display poorly and become harder to read.
For readability, keep each line fewer than 42 characters long.
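A quick script can flag chunks that break these two rules before you move on to synchronization. The following is a minimal sketch, not part of any tool mentioned on this page. It assumes the transcript is plain text with a blank line between chunks; the function names are hypothetical.

```python
MAX_LINE_CHARS = 42   # lines should stay under this length
MAX_CHUNK_LINES = 2   # a chunk is one or two lines


def split_chunks(transcript):
    """Split a transcript into chunks; assumes blank lines separate chunks."""
    chunks, current = [], []
    for line in transcript.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def find_problems(transcript):
    """Return (chunk_number, message) pairs for chunks that break the rules."""
    problems = []
    for number, chunk in enumerate(split_chunks(transcript), start=1):
        if len(chunk) > MAX_CHUNK_LINES:
            problems.append((number, f"{len(chunk)} lines in one chunk"))
        for line in chunk:
            if len(line) >= MAX_LINE_CHARS:
                problems.append((number, f"{len(line)} characters: {line!r}"))
    return problems


sample = ">> BOB [shouting]: Take your stinking paws off me,\nyou damn dirty ape.\n"
for number, message in find_problems(sample):
    print(f"chunk {number}: {message}")
```

An empty result means every chunk fits; anything reported is a candidate for re-breaking at a more natural phrase boundary.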
There are also conventions for identifying speakers, for describing “sound effects” such as background sounds, and for indicating how something is being spoken. For example, a sound effect might be something like “leaves rustle outside” or “music plays,” and a speaker cue might be “Bob [shouting].” Another common cue is to surround music with music notes.
The Media Access Group at WGBH has a Captioning FAQ that provides some conventions for how to caption, though it is geared more for closed captioning for video and TV. The Described and Captioned Media Program (DCMP) has excellent materials on caption style, chunking, line division, and other conventions in their Captioning Key pages. In our example below, we follow YouTube's recommendations for preparing a transcript file, which appear to blend a number of conventions.
The example below has some examples of good and bad chunking and shows how you can introduce a speaker and insert a “sound effect.” Note that line breaks should occur at logical places, so that each line is as semantically complete as possible to make for easy reading. Also try to get chunks to “feel” complete—for example, in the table below, we decide to break the chunks in the song so that they match the singer's phrasing.
Good chunking:
>> VERBAL KINT: The greatest trick
the devil ever pulled
was convincing the world
he didn't exist.
Bad chunking:
>> VERBAL KINT: The greatest trick the
devil ever pulled was
convincing the world he
didn't exist.
Good chunking:
Take your stinking paws off me,
you damn dirty ape.
Bad chunking:
Take your stinking
paws off me, you
damn dirty ape.
Good chunking:
>> RICK ASTLEY [crooning]:
♪ Never gonna give you up,
Never gonna let you down... ♪
♪ Never gonna run around and desert you. ♪
Bad chunking:
>> RICK ASTLEY [crooning]:
♪ Never gonna give you up, ♪
♪ Never gonna let you down,
Never gonna run around and desert you. ♪
DCMP also has very good information on audio description (AD), what they call video description, in their Description Key pages. An audio description is an audio-only track that runs synchronously with the main video audio and describes visual content, so that people who cannot see the video have the necessary context for understanding what is going on.
In addition to the DCMP materials on audio description, Joe Clark has developed a set of standard techniques in audio description. Some things that might be described to enhance and clarify comprehension are:
In general, try to speak descriptions when there is a pause in the primary audio track, but speak over the primary track when required to add to the understanding of the video. The narrator's voice should be able to be easily distinguished from the primary audio.
If you need audio description for your videos, you will need to record the audio as a separate track and merge it with your video in your video editing software. This is possible even with free and low-budget software, such as Windows Live Movie Maker (Windows only), Apple iMovie (Mac only), and Apple QuickTime Pro (Windows and Mac).
In the sections below we discuss two ways to synchronize your chunked transcript with your YouTube video to create a timed caption track. One method is to let YouTube's Automatic Timing facility attempt to automatically perform the synchronization. This may not always work. The audio in your video may be low quality or, for whatever reason, YouTube simply may not be able to produce adequately synchronized timings. Therefore we give another method, using an online service from Accessify called Easy YouTube Caption Creator.
In Fall 2009, Google began incorporating into YouTube the Automatic Speech Recognition engine that helps power the transcription service in Google Voice. The first phase of this introduced Automatic Timing, which automatically synchronizes your transcript with your YouTube video. In the first quarter of 2010, YouTube rolled out Automatic Transcription, which generates a transcript of your video and thereby makes it possible to automate the entire captioning process.
In our experience, the Automatic Transcription facility is not capable in most circumstances of producing an accurate transcript. In cases where the voices in your audio are well recorded and the speakers speak very clearly, YouTube will likely produce a transcript that can be manually corrected, and you may save yourself some effort compared to producing a transcript from scratch. For the majority of cases, however, the machine-generated transcription will not be very good, and you are better off using only YouTube's Automatic Timing facility.
Here are the steps to use YouTube's Automatic Timing to synchronize your chunked transcript:
Once the Caption Track has uploaded and finished processing (synchronizing), make sure the checkbox next to it is selected and save your changes. That is all there is to it.
Here are some examples that demonstrate usages of both Automatic Timing and Automatic Transcription. The titles in the video playlist below describe how the captions were made:
You will notice that Automatic Timing very accurately synchronizes Paul Schindler's voice with the transcript. In our experiment, with mediocre audio and multiple speakers, alignment is mostly on target, except for a few instances in which Emily's voice is not matched properly. The Automatic Transcription of Paul Schindler is quite good. Though not perfect, it produces a result that might be edited and corrected. This is in stark contrast to our mediocre audio example, which produces an unusable caption track.
If you cannot use YouTube's Automatic Timing to synchronize your chunked transcript, you will need to do it manually. The manual synchronization process involves playing back the video and marking the times at which each transcript chunk occurs within the video.
MAGpie is a tried and true, stand-alone Java application that you install on your Windows machine (the program does not currently support the Mac). MovCaptioner ($25) is made for Mac only and is one of the best products for that platform, outputting caption files in many formats, including SubRip. Both MAGpie and MovCaptioner allow you to either input caption lines as they play or import a chunked transcript. MovCaptioner has the advantage of playing back the video in short snippets (one to 11 seconds) to facilitate transcribing.
YouTubeCC and CaptionTube are online applications. Both use a model for timing in which you type each caption chunk in individually, similar to modes available in both MovCaptioner and MAGpie. We find this method cumbersome, but it may work well for you.
The service we recommend is Accessify's Easy YouTube Caption Creator. Like MAGpie and MovCaptioner, it allows you to import the transcript, already chunked and fully prepared. You play back the video within the web application and set timings using a keystroke, which is simple and straightforward.
Here are the steps:
Note that Easy YouTube Caption Creator treats each line of the transcript as a separate chunk. So, if you have multiple-line chunks, join each one into a single line before pasting the transcript into Caption Creator.
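Joining chunks by hand is tedious for a long transcript. Here is a minimal sketch of one way to automate it, assuming your chunked transcript uses a blank line between chunks (that separator is our assumption, and the function name is hypothetical).

```python
def join_chunks(transcript):
    """Collapse each blank-line-separated chunk into a single line."""
    joined, current = [], []
    for line in transcript.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line ends the current chunk.
            joined.append(" ".join(current))
            current = []
    if current:
        joined.append(" ".join(current))
    return "\n".join(joined)


chunked = """Take your stinking paws off me,
you damn dirty ape.

>> BOB: Fine.
"""
print(join_chunks(chunked))
```

The output has one line per chunk, ready to paste into the web application.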
Pressing the “a” key sets a time for each caption chunk. When you have worked your way through the entire video, you can copy the timed-text output that Caption Creator has made for you. Paste it into a text file on your computer and name the file with the .sbv extension: my_caption_file.sbv, for example.
Finally, you must upload your timed caption file to YouTube. YouTube makes it simple to upload your caption file and associate it with your video.
YouTube associates your caption file with the video. When you reload your video in YouTube, you will see that it has captions.
You can upload more than one caption track. You can use this feature to upload tracks in another language—in which case you have created a “subtitle.” If you upload more than one track, note that the end user will need to be able to access the Flash controls in order to change the caption/subtitle track. Thus, in terms of accessibility, it may make sense to have more than one copy of the video on YouTube and associate just a single subtitle with each instance.
Having captions in YouTube is wonderful. But it would be even better if we could use the timed captions in YouTube for other services or for hosting our own video. The problem, of course, is that not all video players or services accept SubViewer-formatted captions.
We have written a YouTube caption converter that will convert YouTube's SubViewer format to SubRip, W3C Timed Text Markup Language (DFXP), and QT Text. W3C Timed Text is used in a number of players, including Adobe's video component for Flash and the popular JW Player. And QT Text is the format used to caption QuickTime MOV files.
Now you can download your timed captions from YouTube, convert them, and re-purpose them for use elsewhere.
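To give a feel for what such a conversion involves, here is a minimal sketch of the SubViewer-to-SubRip case. It is not the code behind our converter. It assumes each SubViewer cue is a timing line of the form H:MM:SS.mmm,H:MM:SS.mmm followed by one or more text lines, with cues separated by blank lines. SubRip adds a cue number, pads the hour to two digits, uses a comma before the milliseconds, and separates the two times with an arrow.

```python
import re

# Matches a SubViewer timing line such as 0:00:00.599,0:00:04.160
CUE_TIMES = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)


def sbv_to_srt(sbv_text):
    """Convert SubViewer (.sbv) caption text to SubRip (.srt) text."""
    cues = []
    for block in re.split(r"\n\s*\n", sbv_text.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        match = CUE_TIMES.match(lines[0].strip())
        if not match:
            continue  # skip blocks that do not start with a timing line
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        start = f"{int(h1):02d}:{m1}:{s1},{ms1}"
        end = f"{int(h2):02d}:{m2}:{s2},{ms2}"
        text = "\n".join(lines[1:])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n" if cues else ""


sbv = """0:00:00.599,0:00:04.160
>> ALICE: Hi, my name is Alice.

0:00:04.160,0:00:06.770
>> JOHN: And I am John.
"""
print(sbv_to_srt(sbv))
```

The other target formats follow the same pattern: parse each cue once, then write the times and text in the layout the target player expects.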
The YouTube video player is used everywhere. YouTube is a boon for discovering and broadcasting video on the web and is widely used in education. The player itself is implemented in Flash, which allows for high quality video and sound and attractive interface controls. As discussed above, the Flash-based YouTube player allows for captioning and subtitling of video, which is a great benefit for many reasons.
One problem with Flash, however, is that in many browsers it is not accessible to the keyboard alone. Another problem is that screen reader programs for the visually impaired cannot always accurately discern the function of controls implemented in Flash, and some screen readers cannot access Flash controls at all.
For example, in MS Windows, no browser other than Internet Explorer can move focus to a Flash movie using the keyboard alone: a user must hover over the movie with the mouse and click. Tabbing into the movie is not possible. Once focus is inside the Flash controls of the YouTube player, keyboard users of all browsers other than IE are “trapped,” tabbing through the player controls perpetually, unable to reach any other part of the web page.
The problem is similarly difficult for screen reader users. A portion of experienced users of screen readers will know that they must turn off their screen reader's regular page browsing mode and go into a “pass-through” mode to be able to read the buttons in the YouTube player, but even then the only usable buttons in the Flash-based YouTube player are Play and Mute. And if you happen to be using VoiceOver, the screen reader in Mac OS X, Flash controls are inaccessible.
Thus, for keyboard- or screen reader-reliant users YouTube can present a difficult situation. (For more information on this problem, see our write-up on Flash accessibility in JW Player Controls.)
Suffice it to say that HTML-based controls are preferable for accessibility. All browsers can access HTML-based controls and there is no need for mode-switching. This is where our Accessible Controls for the YouTube Embedded Video Player can come in handy:
Embedding a YouTube video in a web page is a simple matter of copying the “embed” code from YouTube. However, what you get on your web page has accessibility problems, as outlined above. Also, the embed code itself is pretty ugly and hard to edit if you want to change any of the parameters. And if you want to encapsulate the embed so that it includes HTML controls to help with accessibility, or add a play list, you will be dropping in even more hard-to-maintain code.
By contrast, using the Accessible Controls is simple and maintainable. Including accessible YouTube in your web pages requires adding a couple of lines in the head and inserting one or more div elements with a class of “ytplayerbox” in the body.
The following code goes in the head of your document.
The Accessible Controls for the YouTube Embedded Video Player makes the process of embedding controls and a play list very simple and easy to maintain. The code below demonstrates how you would add a video with accessible controls and a play list totalling five videos.
<!-- This is where the player, buttons, and (optional) play list get rendered -->
<div class="ytplayerbox">
  <!-- specify 'normal' for YouTube VGA aspect ratio (480 x 360) and 'wide' for YouTube HD (640 x 360) -->
  <span class="ytplayeraspect: normal"> </span>
  <!-- list video titles and identifiers here, play list rendered only if more than one movie -->
  <span class="ytmovieurl: XtFlYB56TZk">Interview with my daughter, Eva</span>
  <span class="ytmovieurl: QRS8MkLhQmM">YouTube Captions and Subtitles</span>
  <span class="ytmovieurl: _Tp6hgAEUiQ">Easy YouTube caption Creator</span>
  <span class="ytmovieurl: yvFbP82cYcs">Creating captions with CaptionTube</span>
  <span class="ytmovieurl: meCIER_s7Ng">Closed Captions</span>
</div>
As the code shows, the Accessible Controls get inserted wherever you put a div with class="ytplayerbox". The aspect ratio of the player can be set to “normal,” which renders the video at 480 by 360 pixels (the old, standard YouTube VGA-like ratio), or to “wide,” which renders the video at 640 by 360 pixels (the YouTube “HD,” “letterbox” ratio). That is done by using a special class on a span element, “ytplayeraspect: [wide or normal]”.
You then tell the Accessible Controls how many movies you want. If you specify one, there is no play list area rendered. More than one will generate a play list for you. Specify each movie's title and YouTube identification code. The contents of the span become the title for the video (which, obviously, should be the title from YouTube, or something close to it). Put the identifier in the class, “ytmovieurl: [YouTube identifier for a video]”.
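If you maintain many pages or long play lists, you may prefer to generate this markup rather than type it. The sketch below is one hypothetical way to do so. It only assembles the div and span pattern described above from a list of identifier and title pairs; the function name is our own and is not part of the Accessible Controls.

```python
from html import escape


def player_markup(videos, aspect="normal"):
    """Build ytplayerbox markup from (youtube_id, title) pairs.

    aspect must be 'normal' (480 x 360) or 'wide' (640 x 360).
    """
    if aspect not in ("normal", "wide"):
        raise ValueError("aspect must be 'normal' or 'wide'")
    lines = [
        '<div class="ytplayerbox">',
        f'  <span class="ytplayeraspect: {aspect}"> </span>',
    ]
    for video_id, title in videos:
        # Escape both values so stray characters cannot break the markup.
        lines.append(
            f'  <span class="ytmovieurl: {escape(video_id)}">{escape(title)}</span>'
        )
    lines.append("</div>")
    return "\n".join(lines)


print(player_markup(
    [("XtFlYB56TZk", "Interview with my daughter, Eva"),
     ("QRS8MkLhQmM", "YouTube Captions and Subtitles")],
    aspect="wide",
))
```

Passing a single pair produces a player with no play list, matching the behavior described above.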
And that's it! Below are some examples showing the Accessible Controls in action.
You can have many instances of the player on the same web page. Below is an example that shows the player controls set to display video at the YouTube wide-angle, “HD” aspect ratio of 640 by 360 pixels.
Notice that if we have only one video, the play list does not display.