Unlock the Power of Your Voice™
Speaker-specific speech recognition can be more accurate . . .
Create highly accurate speech user profiles from dictation, conversation, or
video speech "behind the scenes," with no separate speaker enrollment.
- Pricing, custom programming, and business systems integration
- Feature summary: SpeechMax™ / SpeechServers™ / SweetSpeech™
- Software development milestones
- Nonadaptive speech recognition and business systems integration
- Dictation with handheld microphone; transcription audio playback with foot control
- Document creation for programmable keypad, voice, or bar code
- U.S. patents and patents pending
- SpeechMax™ session file editor and server-based speech recognition
The primary emphasis in speech recognition and other speech and language
processing has been on working with data from large numbers of speakers.
For example, an interactive voice response (IVR) model for telephony often
represents a speaker-independent speech recognition (SISR) model based upon
the voices of thousands of speakers. Developers use these models to create
voice user interfaces (VUI) for automated business directory assistance for
telephone numbers, hotel or airline reservations, insurance benefits and
claims inquiries, stock quotes, and credit card balances.
These interfaces work best for simple tasks or limited vocabularies.
Typically, the VUI limits the acceptable verbal response with menu-driven
choices to improve accuracy. For example, a prompt may request "yes" or
"no," or "benefits," "claims," or "representative." In many cases, because
of limited recognition accuracy, prompts request that users enter digits
with the telephone keypad.
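As a rough illustration (not drawn from any particular IVR product), the
Python sketch below shows why a menu-driven VUI can stay accurate: the
application only has to map a recognizer hypothesis onto a handful of
allowed responses, falling back to keypad entry otherwise. The menu entries
and routing names are hypothetical.

```python
# Minimal sketch: constraining an IVR turn to a small menu grammar.
# The menu options and routing targets are illustrative.

MENU = {
    "benefits": "route_benefits",
    "claims": "route_claims",
    "representative": "route_agent",
}

def handle_turn(hypothesis: str) -> str:
    """Map a recognizer hypothesis onto one of the allowed menu choices.

    Restricting valid answers to a short list is what lets a
    speaker-independent model stay accurate on this task.
    """
    word = hypothesis.strip().lower()
    if word in MENU:
        return MENU[word]
    # Out-of-grammar input: fall back to keypad entry, as many IVRs do.
    return "prompt_keypad_fallback"

if __name__ == "__main__":
    for utt in ("Claims", "benefits", "uh, what?"):
        print(utt, "->", handle_turn(utt))
```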
Some software uses speaker-independent speech recognition (SISR) to deliver
voice mail as text. Voice commands for "smart" homes or cars typically use
speaker-independent models. Another speaker-independent application is
audio search, which uses speech recognition to find specific phrases in
audio files recorded by a call center. Reviewing text is faster than
listening to recorded speech. Machine translation typically uses models
based upon the translations of many translators. Natural language
understanding searches for web content based upon key concepts usually
derived from text from many sources.
In speech recognition for dictation (also known as large vocabulary speech
recognition), companies typically provide speaker-adaptive models that
start with a speaker-independent model. The user "adapts" this
speaker-independent model to the speaker's voice through enrollment and
corrective training. Speaker-adaptive speech recognition (SASR) is the most
popular method for business and professional speech recognition for
dictation.
Some speech and language technology necessarily references individual
speech characteristics. Speaker recognition includes speaker authentication
and related voice biometrics applications. Software attempts to match a
speech sample from a known source with speech from an unidentified source.
Text-to-speech is necessarily individualized in that the typically
human-sounding synthetic voice font represents a particular voice. In some
cases, applications support text to speech imitating the speech of
well-known, real-life speakers.
Some research has supported making speech
recognition more accurate using speaker-specific data. Two Microsoft Asia
researchers explained in 2004 that the traditional method to obtain ideal
speaker-dependent speech recognition (SDSR) is to adapt an SISR system to the
specific speaker model with a small training set. The authors noted that
SASR may outperform both SISR and SDSR, but that "the SDSR systems were trained
on waveforms of at most 2 hours long. But how about increasing the
training data to massive amounts?"
For this purpose, we propose using massive
amounts of speaker-specific training data recorded in one’s daily life.
We call this
Massively Speaker-Specific Recognition (MSSR) . . . . Initial results
show that by changing the focus to MSSR, word error rates can drop very
significantly. In comparison with speaker-adaptive speech recognition
system, MSSR also performs better since model parameters can be tuned to be
suitable to one particular individual . . . . With the increasing popularity
of household and professional voice/video-recording equipments and voice
recording capability built into personal devices such as mobiles, it is
relatively convenient to get massive amounts of speaker-specific speech in
daily conversations . . . While automatic transcription may not be
very accurate, it is relatively affordable to have some speech transcribed.
Shi & E. Chang, "Studies in Massively Speaker-Specific Speech Recognition"
See below for information on version compatibility, operating system
requirements, and other details.
Feature summary:
- Includes Model Builder for acoustic/language model and lexicon training
  using audio-linked session files and text
- Audio-linked training session file = .TRS (training session file); file
  management functionality included
- Individual, small-group, or large-group models; group models represent a
  form of SISR
- Supports nonadaptive speech recognition without speaker microphone
  enrollment or corrective training
- Phonetics editor reduces reliance upon expensive lexical expertise;
  text-to-speech phonetic pronunciation generator and other enhancements
  minimize the need for lexical knowledge
- State tying/untying to support training/updates
- Automatic linguistic questions generation
- "Standard" speech recognition software features
- Regular expression application for forward/reverse formatting
- New user and train user functions
- User-specific tools for formatting text
- User/high-level adjustment of processing parameters; individualized speed
  vs. accuracy settings
- Accumulator/combiner techniques to create acoustic model updates
- Requires at least 6-8 hours of audio-aligned verbatim text; use more
  audio-aligned training data if available; greater acoustic model
  saturation available than with SASR
- Generally returns word audio tags after 4 hours or less of training data
- Use minimum 8 kHz/8-bit audio for telephony, 16 kHz/16-bit for other
  audio (see the sketch after this list)
- Train the acoustic model with recorded "everyday" speech: dictation,
  conversational speech, cell phone or landline calls, or video; best
  results with good-quality audio and minimal device, background, or
  channel noise
- Profile creation produces an accessible SharedProfile™; use the data for
  other speech and language processing: voice commands, speaker
  recognition, text to speech, phoneme generation, machine translation,
  audio mining, and natural language understanding
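As a small illustration of the audio requirements above, the following
Python sketch checks a WAV file against the stated minimums (8 kHz/8-bit
for telephony, 16 kHz/16-bit otherwise) using only the standard library;
the function name and telephony flag are illustrative.

```python
import wave

# Sketch: verify a recording meets the minimum capture settings noted above
# (8 kHz / 8-bit for telephony, 16 kHz / 16-bit for other audio).

def meets_minimum(path: str, telephony: bool = False) -> bool:
    min_rate = 8000 if telephony else 16000
    min_bytes = 1 if telephony else 2          # sample width in bytes
    with wave.open(path, "rb") as w:
        return (w.getframerate() >= min_rate
                and w.getsampwidth() >= min_bytes)

# Example: meets_minimum("dictation.wav")  # expects 16 kHz / 16-bit
```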
Single Speaker and Small-Group User Profiles
The process uses the session file editor in the pre-automation, training
phase to generate training data for the speech user profile, and in the
automation phase, to edit server-based or real-time speech recognition.
In the pre-automation phase, the process presegments dictation audio into
an untranscribed session file (.SES) before manual transcription (MT).
This simple workflow results in creation of an untranscribed session file
(USF).
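A minimal sketch of how a presegmentation step like this could work,
assuming a simple energy-based pause detector over a mono 16-bit WAV file;
the thresholds and the approach itself are illustrative stand-ins, not the
product's actual SpeechSplitter™ algorithm.

```python
import wave, array, math

# Illustrative stand-in for presegmentation: split a mono 16-bit WAV into
# utterances at silent pauses. Thresholds are assumptions for the sketch.

def split_utterances(path, frame_ms=30, silence_rms=300, min_pause_frames=10):
    with wave.open(path, "rb") as w:
        rate, n = w.getframerate(), w.getnframes()
        samples = array.array("h", w.readframes(n))
    frame_len = int(rate * frame_ms / 1000)
    segments, start, silent_run = [], None, 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / max(len(frame), 1))
        if rms >= silence_rms:
            if start is None:          # speech begins: open a segment
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:  # long pause ends the utterance
                segments.append((start / rate, i / rate))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start / rate, len(samples) / rate))
    return segments   # list of (start_sec, end_sec) utterance boundaries
```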
The transcriptionist loads the segmented audio into the session file editor
and plays back the audio segment by segment or continuously using a
transcriptionist foot control or keyboard shortcuts. The result is a
transcribed session file (TSF) with audio-linked text. The transcriptionist
uses this session file to create formatted, corrected text for distribution
as a report or document, and simultaneously generates a verbatim session
file (.SES) with audio-linked text for speech user profile training.
The operator submits this and other session files to create the speech user
acoustic model. The process similarly generates data from day-to-day
transcription to create the language model (a representation of the
speaker's pattern of word use) and lexicon (a dictionary of the speaker's
pronunciations).
The software assists with training process file management and user profile
management.
The shared speech user profile is not limited to the acoustic model,
language model, and lexicon, and may include other data. For example, the
profile may include audio or text segmentation data or text formatting
preferences.
The process uses these preferences for forward formatting of raw speech
recognition engine text into a format acceptable to the dictating speaker,
or for reverse formatting of final transcribed text into a speech
recognition engine compatible format.
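A minimal sketch of the forward-formatting idea, assuming regex-based
rewrite rules; the specific rules below are invented speaker preferences,
not the product's shipped rule set.

```python
import re

# Sketch of forward formatting: rewrite raw engine output into the
# dictating speaker's preferred document form. Rules are illustrative.

FORWARD_RULES = [
    (re.compile(r"\bperiod\b", re.I), "."),
    (re.compile(r"\bnew paragraph\b", re.I), "\n\n"),
    (re.compile(r"\bmilligrams\b", re.I), "mg"),
]

def forward_format(raw: str) -> str:
    for pattern, replacement in FORWARD_RULES:
        raw = pattern.sub(replacement, raw)
    return re.sub(r"\s+([.,])", r"\1", raw)   # tidy space before punctuation

# Reverse formatting for training would apply the inverse mapping,
# e.g. "mg" -> "milligrams", so text matches what was actually spoken.
print(forward_format("the dose was five milligrams period new paragraph"))
```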
The process may also tune segmentation algorithms to an individual
speaker's speech. The process can use the same settings for presegmentation
of audio before manual transcription and for segmentation immediately
before speech-to-text decoding.
Toolkit also supports creation of small-group or large-group speaker profiles.
Off-the-shelf Dragon and other SASR programs are designed for use by a
single speaker and a single channel. They may be modified for
multispeaker, multichannel settings whereby each speaker is isolated with
use of a separate microphone or recording device. They do not provide
for training and update of a speaker-dependent model in cases, such as
depositions, trials, or regular business meetings, where two or more
speakers share the same microphone or other recording device.
- Small-group profiles are intended for use by a specific small group
- Use the same nonadaptive techniques as for a single speaker
- Best group results with 2 speakers that meet regularly over a long period
- Use training data from a multispeaker single-channel or multichannel source
- May transcribe initial training data with MT or SR; convert to .SES format
  to create speech user training data
- Create individual-specific and small-group profiles from the same recordings
- Use for corporate board meetings, repeated videos, or police surveillance
Multispeaker techniques are designed primarily
for situations where the same speakers are repeatedly present, such as a lengthy
legal proceeding, sequential meetings of a corporation's board of directors, or
long-term law enforcement or national security surveillance of a small group.
The small group speech user profile represents the range of the speech
characteristics of the small group.
High-Level DIY Kit for Speaker-Specific Speech Recognition and Other Speech
and Language Processing
The toolkit is designed as a do-it-yourself (DIY) product for government,
hospitals, law firms, and other organizations with high transcription
volumes. The software also supports a "software as a service" (SaaS) model
for transcription or other companies. These companies could offer speech
user profile development, profile hosting, and remote SR transcription.
A specialized member of the transcription team or information services
personnel could serve as the primary operator/supervisor for creating the
speech user profiles.
High-Level Lexical Tools
The software provides a variety of methods to help a novice unfamiliar with
lexical representation to create and train a preferably speaker-dependent
speech user profile.
1. One approach is support for word models (templates). This form of
speaker modeling does not require entry of a phonetic pronunciation into
the lexicon, only the word itself. It may be used for same-language
speech-to-text conversion or translation.
2. Methods also include a phonetics editor that uses a text-to-speech
engine for assistance in creating lexical pronunciations. The method
permits an operator to enter a pronunciation based upon a simplified
phonetic scheme, play back and listen to text-to-speech audio based upon
the entered pronunciation, and determine whether the entered pronunciation
accurately represents the dictated speech. In one approach, the lexicon may
be created with a simplified representation of individual speech sounds of
the kind grade school speech teachers use to teach phonetics and proper
pronunciation.
3. The software also includes a novel technique to generate
speaker-dependent questions for state tying in automated linguistic
questions generation. This results in selection of a different parameter
range and incremental values for linguistic questions generation, and
automatic selection of a set of best questions among candidates using a
maximum likelihood based distance measure (see the sketch after this
list). This avoids time-consuming generation of separate acoustic models,
testing word error rate for each, and selecting a questions file based
upon the lowest word error rate or other performance criteria.
4. Both manually and automatically generated questions may be used for
state tying. Where both are available, in one approach, the state-tying
methods will use the questions that maximize the likelihood of acoustic
similarity with reference to data-sparse triphones or other word subunits.
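The sketch below illustrates the maximum-likelihood selection idea from
items 3 and 4, assuming single-Gaussian, one-dimensional state statistics
for brevity: each candidate question is scored by the likelihood gain of
splitting a pool of triphone states, so questions can be ranked without
building and WER-testing a full acoustic model per question set. The data
structures are toy stand-ins.

```python
import math

# Hedged sketch of likelihood-based question selection for state tying.
# Each state carries toy sufficient statistics: frame count, mean, variance.

def pool_loglik(states):
    """Log likelihood of modeling the pooled state data with one Gaussian."""
    n = sum(s["count"] for s in states)
    if n == 0:
        return 0.0
    mean = sum(s["count"] * s["mean"] for s in states) / n
    # pooled variance via E[x^2] - mean^2
    ex2 = sum(s["count"] * (s["var"] + s["mean"] ** 2) for s in states) / n
    var = max(ex2 - mean ** 2, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def question_gain(states, answers):
    """Likelihood gain from splitting the pool by one yes/no question."""
    yes = [s for s, a in zip(states, answers) if a]
    no = [s for s, a in zip(states, answers) if not a]
    return pool_loglik(yes) + pool_loglik(no) - pool_loglik(states)

# Rank candidate questions by gain and keep the best ones, instead of
# building and WER-testing a separate acoustic model per question set.
```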
Use of Accumulator and Combiner Techniques
The process supports accumulator and combiner techniques with distributed
computing for acoustic model updates. With this approach, states may be
untied when updating the acoustic model. In the speaker-adaptive prior art,
the acoustic model may be unsaturated in that there is a substantial lack
of triphone or other word subunit modeling based upon the end-user
speaker's voice and style, especially at the outset of use. Use of
accumulator and combiner techniques may improve accuracy in
speaker-dependent systems where there is a relative abundance of trained
triphone states (compared to speaker-adaptive systems) and relative
acoustic model saturation when the end-user begins use of speech
recognition.
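A minimal sketch of the accumulator/combiner pattern for one Gaussian
state, assuming each distributed worker accumulates occupancy, sum, and
sum-of-squares statistics over its share of the training audio; the
product's actual distributed implementation is not described here.

```python
# Sketch of accumulator/combiner updates for one Gaussian state.
# Workers accumulate sufficient statistics in parallel; a combiner merges
# them and re-estimates the model parameters.

def accumulate(frames):
    """Per-worker pass: occupancy count, sum, and sum of squares."""
    n = len(frames)
    s1 = sum(frames)
    s2 = sum(x * x for x in frames)
    return n, s1, s2

def combine(accumulators):
    """Merge partial statistics and re-estimate mean and variance."""
    n = sum(a[0] for a in accumulators)
    s1 = sum(a[1] for a in accumulators)
    s2 = sum(a[2] for a in accumulators)
    mean = s1 / n
    var = s2 / n - mean ** 2
    return mean, var

# Example: two workers process disjoint chunks of the same state's frames.
a1 = accumulate([1.0, 1.2, 0.9])
a2 = accumulate([1.1, 1.3])
print(combine([a1, a2]))   # same result as a single pass over all frames
```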
Support for Accessible, Interchangeable Speech Data
Accessibility to interchangeable speech data can also promote improvement
in related speech and language processing. For example, if there is a
misrecognition using voice activation (command and control) or interactive
voice response for telephony, the operator can listen to the speech
associated with the text command. The user may correct the result and
submit a training session file to update the speaker model. The toolkit
also potentially supports corrective updates for other automated speech and
language processing.
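Illustratively, a corrective update might be represented as below; the
record fields and queue are assumptions made for the sketch, not the
product's actual session-file schema.

```python
from dataclasses import dataclass

# Hypothetical shape of a corrective update: pair the utterance audio with
# the corrected verbatim text and queue it for a later profile update.

@dataclass
class Correction:
    audio_path: str       # utterance audio from the misrecognized command
    recognized: str       # what the engine produced
    verbatim: str         # what the user actually said

training_queue: list[Correction] = []

def submit_correction(audio_path: str, recognized: str, verbatim: str) -> None:
    """Store the correction; a later batch job updates the speaker model."""
    training_queue.append(Correction(audio_path, recognized, verbatim))

submit_correction("cmd_0042.wav", "turn off the lights", "turn on the lights")
```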
This is in contrast to "closed" speech systems, where the operation and
management of non-dictation, speech-related applications is typically
transparent to the end user. In these systems, the end user cannot view and
correct results for continued improvement of the speech user profile.
The speech engine may be used with user and file management software for
back-end, server-based speech recognition. Plugins for the SpeechMax™ HTML
session file editor support real-time interactive speech recognition with
the system and local, client-based transcription of an audio file.
SpeechServers™ SAPI 5.x uses the same install kit as SweetSpeech™ but is
separately licensed.
Word error rate reduction of about 1% with the nonadaptive approach
compared to the adaptive model (relative error reduction of nonadaptive SR
up to 12.3% compared to adaptive SR).
Value Proposition: The company's nonadaptive (NASR) approach directly
creates a user profile without use of a speaker-independent profile. In the
company's approach, a nonadaptive speech user profile can be directly
created from dictation and transcription. A speaker-adaptive (SASR) user
profile, by contrast, modifies a speaker-independent speech user profile.
Creation of a speaker-independent (SISR) profile requires substantial
expenditure. Availability of a speaker-independent model saves time for the
speaker using adaptive SR; however, based upon the research cited above,
there is no evidence that this extra, intermediate SASR step improves
speaker accuracy compared to NASR.
A speaker-independent (SI) speech user profile (model) is typically based
on the speech of hundreds or thousands of speakers that must be recorded
and evaluated. Speaker-adaptive speech recognition (SASR) creates a hybrid
user profile by modifying (adapting) the SI model to more closely resemble
a speaker's speech.
Creation of a speaker-independent model requires upfront spending to create
and test a SISR model for any major dialect (for example, UK English,
American English, Australian English, Indian English, or Southern American
English). The high cost of creating this intermediate model may be one
reason why SR is not available for all languages or dialects in many
countries. In some cases, development of speaker-independent models might
be impractical or impossible, e.g., for the speech-impaired.
The research cited above indicates higher accuracy for a nonadaptive system
with large training volumes compared to speaker-adaptive or
speaker-independent software. Researchers found improved accuracy compared
to SASR and SISR with training data greater than 15 hours and less than 8
hours. At minimum word error rates (WERs), average accuracy levels were as
follows: 93.03% for nonadaptive speaker-specific SR (NASR), 92.05% for
speaker-adaptive SR (SASR), and 85.42% for speaker-independent speech
recognition (SISR). As measured by relative error rate (RER) reduction,
NASR accuracy was 52.2% higher compared to SISR and 12.3% higher compared
to SASR, with training sets over 15 hours. With training sets of less than
8 hours, NASR outperformed SISR and SASR by respective RER reductions of
45.1% and 8.6%. At less than 8 hours of speech, nonadaptive SR had accuracy
comparable to speaker-adaptive SR using more than 15 hours of speech.
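These figures are internally consistent: converting the reported accuracies
to word error rates and applying the standard relative-error-reduction
formula reproduces the cited percentages.

```python
# Reproduce the cited relative error rate (RER) reductions from the
# reported average accuracy levels.

def rer(baseline_acc, new_acc):
    """RER reduction (%) = (baseline WER - new WER) / baseline WER."""
    wer_base, wer_new = 100 - baseline_acc, 100 - new_acc
    return 100 * (wer_base - wer_new) / wer_base

print(rer(85.42, 93.03))   # NASR vs SISR: ~52.2%
print(rer(92.05, 93.03))   # NASR vs SASR: ~12.3%
```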
Researchers indicate that one reason for the better performance of NASR is
the early saturation of the SASR acoustic model compared to NASR. The
researchers found that the inflection point was at about 1,000 sentences
(1.3 hours): "After more than 1,000 sentences being added, the SASR does
not response [respond] with the same significant improvement as before."
In contrast, the researchers found, "The performance of [nonadaptive
speaker-specific SR] still has significant improvement even when the amount
of training sentences exceeds 5,000."
This may underestimate the potential for improved NASR accuracy compared to
SASR. As indicated above, the research notes the rapid saturation of a
speaker-adaptive model, with an inflection point at 1.3 hours. If the study
had doubled or tripled the data collection for nonadaptive SR compared to
adaptive SR, the accuracy difference might have increased substantially
beyond the approximately 1% improvement seen in the study.
In addition, the system may prove especially beneficial in noisy
environments such as homes, cars, factories, call centers, plane or tank
cockpits, or outdoors. The software models background or channel noise
directly rather than using approximation techniques based upon a SISR
model. Noise modeling characteristics may significantly impact development
of voice commands and other speech and language processing for "smart"
homes, cars, or other environments.
In summary, this research indicates that speaker adaptation of a
speaker-independent SR (SISR) model provides no significant benefit in
terms of increased accuracy compared to a nonadaptive approach using large
amounts of speaker-specific training data.
Relative error rate reduction with single-speaker training:
- Use a minimum of 6-8 hours of transcribed speech to create the model
- Model parameters tuned to actual speech and "noise"
- Improved accuracy and reduced WER compared to SASR
- Adaptive SR saturates with large amounts of training data and is less
  responsive to new training compared to nonadaptive SR
- No significant saturation with nonadaptive SR (NASR)
- Can effectively saturate and train NASR with more speaker-specific data
Increased accuracy of the nonadaptive speaker-specific model in the chart
below refers to percentage relative error reduction (RER) compared to SASR
and SISR. At < 8 hours and > 15 hours of training data, RER was 8.6% and
12.3%, respectively, compared to SASR. Research indicates that NASR had an
approximately 1% lower word error rate (WER) compared to SASR.
Increased accuracy in the bar graph below from the Microsoft research
refers to an 8.6% error reduction for nonadaptive recognition compared to
the adaptive technique with less than 8 hours of training data, and a 12.3%
error reduction with greater than 15 hours of training. Word error rate (%)
is about 1% lower for the nonadaptive speaker-specific model across
different levels of speaker training data. The authors estimate a word
error rate of about 7% for training data of 12,000 sentences with the
nonadaptive technique, and 8% for the adaptive model with the same level of
training data. Similarly, there is an improvement in relative error
reduction of 45.1% and 52.2% for the nonadaptive approach compared to
speaker-independent SR at less than 8 hours and more than 15 hours of
training data, respectively.
1. Cost reduction. The system creates the speech user profile directly from
dictation audio and manual transcription without reliance upon an
intermediate model. Speaker-adaptive SR imposes a significant upfront
development cost of creating a speaker-independent model that is then used
to create a speaker-dependent speech user profile.
The company's system supports creation of the NASR speaker model directly
from day-to-day dictation, conversational speech, video, or other
speaker-specific data without the expense of creating a speaker-independent
model or adaptive algorithms.

2. More representative enrollment data.
The researchers note that the "adaptive" process involves speaker
enrollment and general training (script reading). The scripted process can
result in recording speech that differs dramatically in lexicon, syntax,
and style from ordinary, day-to-day dictation or conversational speech. The
process itself may introduce error. Nonadaptive technology avoids this
pitfall with its use of day-to-day dictation and conversational speech.
3. Improved word error rate. According to the review, speaker-specific
nonadaptive SR (NASR) has a word error rate (WER) about 1% lower than the
adaptive approach (SASR). This indicates slightly improved accuracy for
NASR compared to SASR.
4. Direct modeling of channel or background noise. The authors note that
mainstream adaptive SR uses MAP, MLLR, or other mathematical approximations
to model the speech user profile. This includes modeling of background
noise rather than actual raw data. The nonadaptive approach directly models
user speech and noise. This may have important implications for development
of "smart" technology in noisy environments, such as voice-controlled homes
or cars.
5. Small-group models. Off-the-shelf speaker-adaptive SR does not support
creation of small-group models. Company software supports creation of
small-group and individual user profiles for groups that meet frequently,
e.g., boards of directors, parties engaged in lengthy courtroom litigation,
and groups under police surveillance. The nonadaptive small-group profile
would represent the range of speaker characteristics for the group.
Efficiently Training NASR
The company has developed special tools for training from day-to-day
speech. It is envisioned that specialized "speech trainer" members of the
transcription or information services team could supervise or perform these
functions. Steps in the process include:
1. Dictation audio segmentation with SpeechSplitter™ creates an
untranscribed session file (.SES) that is transcribed to create
audio-linked text in a transcribed session file (.SES). The process uses
the verbatim audio-linked text to train the speech user profile. This
seamless change in the dictation transcription process can produce a vast
amount of useful training data.
2. The process can use a split/merge technique to associate verbatim text
with utterance audio for previously transcribed text. This permits the
operator to rapidly link audio from prior dictation with previously created
verbatim text (see the sketch after this list). The process does not
require extensive transcription expertise.
3. The process can potentially transform previously transcribed speech
recognition session files to the common .SES format and use the session
files for training the nonadaptive SR profile.
4. The toolkit provides additional tools to assist with speech user profile
creation. For example, tools help reduce reliance upon expensive lexical
expertise: an automatic text-to-speech (TTS) pronunciation generator can
assist with creation of the speech user profile lexicon, and automatic
linguistic questions generation reduces the need for lexical expertise in
determining likely search algorithms during speech-to-text decoding.
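A minimal sketch of the split/merge pairing from step 2, assuming utterance
boundaries (from segmentation) and operator-delimited verbatim text spans;
the structures are illustrative, and the actual .SES session-file format is
not described here.

```python
# Sketch: walk previously transcribed verbatim text and utterance audio in
# parallel, pairing each audio segment with its text span.

def pair_segments(utterance_bounds, text_spans):
    """Zip audio (start, end) boundaries with verbatim text spans.

    The operator merges or splits spans until the counts match, then
    each pair becomes one audio-linked training segment.
    """
    if len(utterance_bounds) != len(text_spans):
        raise ValueError("split or merge spans until counts match")
    return [
        {"audio": bounds, "verbatim": text}
        for bounds, text in zip(utterance_bounds, text_spans)
    ]

session = pair_segments(
    [(0.0, 2.4), (2.4, 5.1)],
    ["The patient is a 54-year-old male.", "He presents with chest pain."],
)
print(session)
```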
Potential Users of Individual-Specific Speech and Language Processing
Some potential immediate beneficiaries of speaker-specific technology
include the following:
1. This "do-it-yourself" (DIY)
approach may interest transcription companies. . Software opens
possibility of income capture from creation of SR speech user profiles.
In one approach, a transcription company could offer "software as a service"
(SaaS) . The company could create speech user profiles from dictation
and transcription, host the SR software and databases for speech
recognition. The approach could also potentially benefit companies,
government, or individuals with high manual transcription volumes.
2. In some developing countries there is limited or no availability of
speech recognition in the country's language or dialect(s). In this
setting, custom creation of speech user profiles could represent a
profitable transcription business activity for government or business
users. Even in the U.S. there are persons (e.g., foreign-born speakers)
whose acoustic profiles do not match well with those of the native
population. Custom creation of speech user profiles in this setting could
prove beneficial as well.
3. Speaker-specific technology
can potentially benefit the speech-impaired.
The following videos illustrate creation of, and transcription with, a
nonadaptive speaker profile.
Demo #1A . . . utterance (phrase) segmentation to create an untranscribed
session file from dictation . . . The audio file is "split" to create a
segmented untranscribed session file. The transcriptionist manually
transcribes in the session file editor to create a transcribed session file
(TSF) using PlaySpeech™. The demo also shows realignment of segment
boundary markers to include "period" within the larger segment.
Demo #1B . . . utterance (phrase) segmentation to create an untranscribed
session file from dictation . . . The audio file is "split" to create a
segmented untranscribed session file. The transcriptionist imports
previously transcribed text, sequentially listens to each untranscribed
utterance (audio playable by clicking the audio-linked text), and
sequentially delimits each utterance by toggling the play audio control.
The result is a transcribed session file as above. The segmented
transcribed session file can be used as a training session file.
Demo #2 . . . Server-based transcription using speech-to-text
conversion . . . The speech user profile was created with the speech and
language processing toolkit. The video first shows text immediately after
speech-to-text conversion (raw speech engine decoding). This is followed by
regular expressions algorithms that search and match text strings.
Conversion rules may reflect locale and speaker or institutional
preferences. The user loaded the post-formatting transcribed session file
(TSF) into the SpeechMax™ session file editor to play back audio and make
any needed corrections.
Contact the company to obtain more information about version compatibility
with different SR and OS versions.