Tuning TTS output with ActivePrompts
Vocalizer supports tuning synthesis through Nuance ActivePrompts. ActivePrompts are created with Nuance Vocalizer Studio (a graphical TTS tuning environment) and are stored in an ActivePrompt database for run-time use. Nuance Vocalizer Studio is a separate product. For information, please contact your Nuance representative.
There are two types of ActivePrompts:
- Recorded ActivePrompts are digital audio recordings that are indexed by an ActivePrompt database. The recordings are stored as individual audio files on a web server or file system. This indexing enables context-sensitive expansion of static or dynamic input text into a sequence of pre-recorded audio files, making Vocalizer a powerful prompt concatenation engine for recording-only or mixed TTS and recording applications.
- Tuned ActivePrompts are stored in an ActivePrompt database as synthesizer instructions that make input text fragments be spoken in a particular way. An application developer creates these instructions in Nuance Vocalizer Studio by adjusting tuning parameters and listening to various versions of a prompt, then freezing the prompt. These synthesizer instructions are much smaller than the audio that will be produced.
At runtime, all ActivePrompts can be used in two different ways:
- Explicit insertion, using the Nuance <prompt> extension to SSML or the native <ESC>\prompt=prompt\ control sequence.
- Implicit matching, where ActivePrompts are automatically used whenever the input text matches the ActivePrompt text. For implicit matching, there are two sub-modes:
- Automatic mode, where implicit matches are automatically enabled across all the text in all speak requests.
- Normal mode, where the Nuance ssft-domaintype extension to SSML or the native <ESC>\domain=domain\ control sequence is used to enable implicit matches for specific regions within the input text.
- For recorded ActivePrompt databases, automatic matching can be further restricted so it is only done within a text normalization block (<ESC>\tn\ control sequence or SSML <say-as> element) for a specific type. For example, a recorded ActivePrompt database for spelling might only be used for text wrapped in <ESC>\tn=spell\ or SSML <say-as interpret-as="spell">.
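As a rough sketch, the two usage styles can be combined in a single SSML document. The prompt name greeting/welcome, the domain name banking, and the attribute syntax of the Nuance <prompt> extension are illustrative assumptions, not documented values:

```xml
<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <!-- Explicit insertion: play a hypothetical ActivePrompt named
       "greeting/welcome" via the Nuance <prompt> SSML extension
       (the attribute name is an assumption) -->
  <prompt name="greeting/welcome"/>
  <!-- Implicit matching, normal mode: enable a hypothetical "banking"
       ActivePrompt database for this sentence only -->
  <s ssft-domaintype="banking">Your account balance is twelve dollars.</s>
</speak>
```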
Installing ActivePrompts
Applications use ActivePrompts by loading them into the system and then referencing them.
The available ActivePrompt databases are found in a voice-specific directory under the Vocalizer installation directory, for example, VOCALIZER_SDK/cpr_enu_tom/. The file suffix is .dat. See the Release Notes for each voice for a list of available databases.
The recordings are found relative to the URI or path used to load the ActivePrompt database. For example, if the ActivePrompt database http://myserver/apdb_rp_tom_alphanum.dat contains a prompt named alphanum/f.alpha0 and the database specifies a file suffix of .ulaw for 8000 Hz and .wav for 22050 Hz, the recording file must be http://myserver/alphanum/f.alpha0.ulaw for the 8000 Hz version and http://myserver/alphanum/f.alpha0.wav for the 22050 Hz version.
Store ActivePrompt databases on a web server or in a file system, with the recordings underneath. For example, store recordings in VOCALIZER_SDK/cpr_enu_tom/domain, where domain corresponds to the ActivePrompt database name.
To load the ActivePrompt databases for runtime use, use the SSML <lexicon> tag or the <default_activeprompt_dbs> XML configuration file parameter. You can load any number of ActivePrompt databases at runtime. The load order determines the precedence, with more recently loaded ActivePrompt databases taking precedence over previously loaded databases. At runtime, Vocalizer only consults ActivePrompt databases that match the current synthesis voice.
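For example, a per-request load with the SSML <lexicon> element might look like the following sketch; the exact type attribute value is an assumption based on the content-type string used in the configuration file examples later in this document:

```xml
<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US">
  <!-- Load a recorded ActivePrompt database for this request (URI from the
       earlier example; the type value mirrors the configuration file syntax) -->
  <lexicon uri="http://myserver/apdb_rp_tom_alphanum.dat"
           type="application/x-vocalizer-activeprompt-db"/>
  The part code is 8jihpey3wy
</speak>
```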
Prompt concatenation engine
The Vocalizer prompt concatenation engine leverages recorded ActivePrompts to support near-flawless playback of static and dynamic input text by concatenating recordings rather than using full TTS. This includes support for recording-only or mixed TTS and recording output, and for creating custom voices for recording-only playback.
Many voice applications are built by manually specifying carrier prompt recordings using SSML <audio>, then using an application library to expand dynamic content like alphanumeric sequences, dates, times, cardinal numbers, and telephone numbers to sequences of SSML <audio> elements. However, Vocalizer’s prompt concatenation engine gives better sounding results with the following advantages:
- Application developers don’t need to purchase, create, or maintain libraries for expanding dynamic content like alphanumeric sequences, dates, times, cardinal numbers, and telephone numbers. Instead, the application can just specify plain input text for Vocalizer to expand, then create an ActivePrompt database that defines the necessary recordings.
- ActivePrompts support context-sensitive rules, including prompts that start and/or end on a sentence boundary, on a phrase boundary, on a sentence or phrase boundary, with a specific punctuation symbol, or are phrase internal. For playing back dynamic content, even recording just three variations of each prompt (phrase initial, phrase final, and phrase internal) gives a huge quality boost, producing very natural sounding output.
- Some Vocalizer voices include predefined ActivePrompt databases and recordings for a variety of dynamic types, along with recording scripts that allow easily re-recording those in a different voice. These optionally support phrase initial, phrase final, and phrase internal recording variations for very high quality output as described above. See the Release Notes for each voice to see where this feature is offered, and for the details.
- For static prompts, application developers can choose between specifying plain input text (avoids tediously specifying recording file names), SSML <audio> (recording file names), SSML <prompt> (ActivePrompt names), or using a mixed approach.
- Providing plain input text for all the static and dynamic prompts makes it easy to create rapid application prototypes and to follow rapid application development (RAD) models such as Agile or Extreme Programming, because it uses Vocalizer text-to-speech for all the prompts at the beginning of the project, then adds ActivePrompt databases and recordings later on as required, independent of the application code.
- Vocalizer produces a single audio stream for all the content rather than relying on rapid fetching and concatenation of individual recording files by another system component such as a telephony service. This ensures the recordings are contiguous, avoiding the extra gaps that some telephony services introduce, which slow playback.
- This solution is extensible to the wide variety of languages and dynamic data types supported by Vocalizer, rather than requiring special linguistic knowledge and major code updates for each new language or data type.

The first step for using Vocalizer for prompt concatenation is to define the set of recordings, then enter them into Nuance Vocalizer Studio to create an ActivePrompt database. Each prompt needs the following:
- Logical prompt name, which the run-time engine transforms to a recording file name by appending a recording file suffix (such as .wav) and then using it as a URI relative to the ActivePrompt database. For example, if the ActivePrompt database http://myserver/apdb_rp_tom_alphanum.dat contains a prompt named alphanum/f.alpha0 and the database specifies a file suffix of .wav, the recording file must be http://myserver/alphanum/f.alpha0.wav.
- Input text matched by the prompt. Vocalizer does its matching using a normalized form of the input text (converts it to lowercase, normalizes spaces, expands abbreviations, and so on) so there is flexibility for differences between the run-time and prompt text. It is best to think of this as word-by-word matching after expanding dynamic types like dates, times, and numbers to their word sequence (such as 110 to “one hundred ten”).
- Boundary conditions, one for each side of the input text. This can be one of: sentence boundary, phrase boundary, sentence or phrase boundary, a specific punctuation symbol, phrase internal, or a wildcard (anything).
Some Vocalizer voices include predefined ActivePrompt databases and recordings for a variety of dynamic types, along with recording scripts. When possible, it is best to rely on those ActivePrompt databases (optionally re-recording them with the application voice talent) rather than re-creating those databases from scratch.
For other languages or voices, carefully consider the set of recordings required to speak the application's static and dynamic content. For the static content, consider each of the carrier phrases, which are typically listed in user interface documents and straightforward to define. Dynamic content is a bit more challenging: it requires knowing the language-specific output word sequences, then determining what variations to record for better sounding output.
For example, a basic recording set for digits playback could be one recording for each of the numbers 0 through 9, using wildcard boundary conditions. While that would produce understandable output, it would not sound natural. Much better output could be obtained by recording three variations of each number 0 through 9: one for phrase initial contexts (left boundary condition of sentence or phrase), one for phrase-medial contexts (wildcard boundary conditions), and one for phrase-final contexts (right boundary condition of sentence or phrase). Even better output could be obtained by recording digit pairs in those three contexts, so that a digit sequence like “0 2 3 7” after a carrier phrase would be played with one phrase-internal recording for “0 2”, then one phrase final recording for “3 7”.
Of course, this involves cost-versus-benefit decisions and may require experimentation to determine the lowest cost solution with a target quality level. The Nuance predefined ActivePrompt databases use both of these techniques and come with recording scripts, so even for new languages and types they provide a good reference point for making these decisions.
For flexibility, it is best to create a separate ActivePrompt database for each dynamic type (for example, one for alphanumeric sequences and another for dates) so applications can selectively enable them; otherwise, the databases may conflict with each other. Separate databases can still use the same prompt names and recordings, which is desirable because it improves run-time Internet fetch cache performance.
For additional flexibility, Vocalizer’s ActivePrompt run-time engine supports fallback when a recording is missing. This makes it possible to build a sophisticated ActivePrompt database with features like multiple prompt variations, while initially recording only a more basic prompt set. Vocalizer is of course also a text-to-speech engine, so if it fails to find any match it automatically falls back to text-to-speech output (except for recording-only custom voices, as described below).
The list of ActivePrompts used at run-time is available within the Vocalizer call log, a log that reports information for application tuning and capacity planning purposes. This is often helpful during ActivePrompt development and testing.

Once the ActivePrompt databases are defined, the next step is obtaining a set of recordings. When using a Nuance predefined ActivePrompt database, this can mean using the Nuance recordings as-is or re-recording them. Custom ActivePrompt databases will always need to be recorded.
As described above, Vocalizer supports fallback if a recording is missing, allowing a smaller set of base recordings to be made for a specific deployment rather than always having to record the full set of recordings specified by the ActivePrompt database.
Vocalizer supports inserting headerless, WAV format, AU format, or NIST SPHERE format audio files that contain mulaw, alaw, or linear 16-bit PCM samples, and the recording’s sampling rate must match the current Vocalizer voice.

Vocalizer makes it easy to define custom voices for recording-only playback. This is done by simply choosing or creating a set of ActivePrompt databases, choosing or creating a set of recordings, choosing a custom voice name, and running a Vocalizer tool to add the new voice. These voices are referred to as “CPR-only” (concatenated prompt recording only) voices, and can be used with alternative Vocalizer CPR-only licenses instead of requiring full text-to-speech licenses.
The new voice must be based on an existing installed Vocalizer voice, such as defining a custom Maureen voice based on the Nuance Tom voice. The custom voice re-uses the Vocalizer data of the template voice for input text processing.
Custom recording-only voices do not support fallback to text-to-speech. Instead, if any portion of the input text cannot be satisfied by the current recordings, Vocalizer logs an error and the current speak request fails. The logged error specifies the portions of the input text that cannot be satisfied, using the normalized input text: it is quite readable, but may not exactly match the original input text, and it is romanized for Asian languages. If text-to-speech fallback is required, use a Nuance-provided voice and simply provide ActivePrompt databases and recordings made by the application voice talent.

Applications use ActivePrompts by loading them into the system (see Load ActivePrompt databases) and then referencing the ActivePrompts. Those references can be explicit references using the ActivePrompt names (the Nuance SSML <prompt> extension or the native <ESC>\prompt\ control sequence), or they can be implicit references where the Vocalizer engine automatically searches the ActivePrompt database for each synthesis request, substituting ActivePrompts whenever the normalized input text matches an ActivePrompt’s normalized input text and boundary constraints.
Implicit ActivePrompt references can be further controlled by configuring each ActivePrompt database for either fully automatic mode (ActivePrompt database is always consulted) or normal mode (ActivePrompt database is consulted only when explicitly enabled by SSML ssft-domaintype or the native <ESC>\domain\ control sequence).
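In the XML configuration file, the mode is carried in the content-type string, as shown in the loading examples later in this section. The following sketch contrasts the two modes; the mode=normal value and the second database file name are assumptions mirroring the documented mode=automatic example:

```xml
<activeprompt_dbs>
  <!-- Fully automatic mode: this database is always consulted -->
  <activeprompt_db content-type="application/x-vocalizer-activeprompt-db;mode=automatic">
    file:///C:/lex/apdb_tp_serena_bet5f22_unknown.dat
  </activeprompt_db>
  <!-- Normal mode (assumed mode=normal value): consulted only when enabled by
       ssft-domaintype or the native <ESC>\domain\ control sequence;
       the file name below is a hypothetical example -->
  <activeprompt_db content-type="application/x-vocalizer-activeprompt-db;mode=normal">
    file:///C:/lex/apdb_rp_tom_dates.dat
  </activeprompt_db>
</activeprompt_dbs>
```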
For implicit ActivePrompt matching when using Vocalizer 5.5 or later voice packs, Vocalizer can optionally match against both the original input text (original orthography) and normalized input text instead of the default of matching just against the normalized input text. This option can be useful for migrating applications across major Vocalizer language updates, because those updates may change the normalization rules, resulting in failed matches for older ActivePrompt databases that rely on the affected rules.
However, it is better to just recompile and retest ActivePrompt databases before deploying language updates rather than relying on matching against the original input text. Original input text matches are less powerful for information like numbers, dates, and times that involve heavy text normalization, and that matching adds additional CPU and ActivePrompt database storage overhead. Matching against the original input text is controlled by the Nuance Vocalizer Studio settings when creating ActivePrompt databases.
For dynamic content like alphanumeric sequences, dates, times, cardinal numbers, and telephone numbers, it is best to use implicit ActivePrompt references in normal mode. This mode selectively enables the proper database for each dynamic data type, avoiding conflicts. The input text then consists of an SSML ssft-domaintype attribute or a native <ESC>\domain\ control sequence to enable the desired types, followed by the carrier phrase and the text to speak. For example, the following SSML speaks a pre-recorded carrier phrase (part_code_intro.wav) with the dynamic portion wrapped in a <say-as> element that explicitly specifies the spell:alphanumeric type:
<s ssft-domaintype="spell:alphanumeric">
<audio src="part_code_intro.wav">The part code is</audio>
<say-as interpret-as="spell:alphanumeric">8jihpey3wy</say-as></s>
Note: Some XML-based application development environments block the use of Nuance SSML extensions like ssft-domaintype. For those environments, set the escape_sequence parameter in the Vocalizer configuration file so you can use a sequence like "\!" instead of the <ESC> character (the <ESC> character is not allowed in XML documents), then use the native <ESC>\domain\ control sequence such as "\!\domain=spell:alphanumeric\". See Defining an alternative escape sequence.
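With the escape sequence configured as "\!", the earlier example might be rewritten as follows. This is a sketch: the exact scoping of the inline \domain\ control sequence within an SSML sentence is not documented here and should be verified.

```xml
<!-- Sketch: the inline control sequence replaces the ssft-domaintype
     attribute; its placement relative to the text is an assumption -->
<s>\!\domain=spell:alphanumeric\
  <audio src="part_code_intro.wav">The part code is</audio>
  <say-as interpret-as="spell:alphanumeric">8jihpey3wy</say-as></s>
```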
This XML can of course be generated in many alternative ways. For example, in a VoiceXML environment, the VoiceXML expr attribute can be used to specify an ECMAScript variable that contains the dynamic portion, and an ActivePrompt can be used for the carrier phrase:
<s ssft-domaintype="spell:alphanumeric">The part code is
<say-as interpret-as="spell:alphanumeric"><value expr="partcode"/>
</say-as></s>
For all dynamic content, it is important to make sure the input format is compatible with Vocalizer. For common data types that are supported by Vocalizer, this is best done by wrapping the dynamic portion within the matching Vocalizer SSML <say-as> or the native <ESC>\tn\ control sequence, checking the Vocalizer Language Supplement to ensure the input format is compatible.
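For example, a date might be wrapped as follows; the date domain name and the MM/DD/YYYY input format are illustrative assumptions that should be checked against the Vocalizer Language Supplement for the target language:

```xml
<!-- Sketch: enable a hypothetical "date" ActivePrompt database and mark the
     dynamic portion with the matching <say-as> type -->
<s ssft-domaintype="date">Your appointment is on
  <say-as interpret-as="date">11/21/2024</say-as>.</s>
```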
Load ActivePrompt databases
Use the SSML <lexicon> element or the <default_activeprompt_dbs> XML configuration file parameter to load ActivePrompt databases for runtime use. You can load any number of ActivePrompt databases at runtime. The load order determines the precedence, with more recently loaded ActivePrompt databases having precedence over previously loaded databases. At runtime, Vocalizer only consults ActivePrompt databases that match the current synthesis voice.
For recorded ActivePrompt databases, the recordings are found relative to the URI or file path used to load the ActivePrompt database. For example, if the ActivePrompt database http://myserver/apdb_rp_tom_alphanum.dat contains a prompt named alphanum/f.alpha0 and the database specifies a file suffix of .wav, the recording file must be http://myserver/alphanum/f.alpha0.wav.
Sample: Load an ActivePrompt database
The TTS user configuration file is automatically loaded at runtime when the following environment variable is set:
VOCALIZER_USERCFG=C:\Lex\tts_config.xml
The following examples illustrate how to load an ActivePrompt database, using either a file:// URL or a plain file path.

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="ttsconfig.xsl"?>
<ttsconfig version="5.7.0" xmlns="http://www.nuance.com/nvn57/ttsconfig">
  <activeprompt_dbs>
    <activeprompt_db content-type="application/x-vocalizer-activeprompt-db;mode=automatic">
      file:///C:/lex/apdb_tp_serena_bet5f22_unknown.dat
    </activeprompt_db>
  </activeprompt_dbs>
</ttsconfig>

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="ttsconfig.xsl"?>
<ttsconfig version="5.7.0" xmlns="http://www.nuance.com/nvn57/ttsconfig">
  <activeprompt_dbs>
    <activeprompt_db content-type="application/x-vocalizer-activeprompt-db;mode=automatic">
      C:/lex/apdb2_tp_serena_bet5f22_unknown.dat
    </activeprompt_db>
  </activeprompt_dbs>
</ttsconfig>