Unlock the Power of Your Voice™

Speaker-specific speech recognition can be more accurate . . .

Create highly accurate speech user profiles from dictation, conversation, or video speech

Profile created "behind the scenes," no separate speaker enrollment


Nonadaptive Speech Recognition

Company software also supports workflow design, business systems integration, call center operations, telephone dictation, digital dictation with a handheld microphone, manual transcription audio playback with foot control, audio conversion, and macro creation for programmable keypad, voice, or bar code.

U.S. patents and patents pending

Overview

SweetSpeech™ integrates with the SpeechMax™ session file editor and SpeechServers™ server-based speech recognition.

The primary emphasis in speech recognition and other speech and language processing has been on working with data from large numbers of speakers. 

For example, an interactive voice response (IVR) recognition model for telephony is often a speaker-independent speech recognition (SISR) model based upon the voices of thousands of speakers.  Developers use these models to create voice user interfaces (VUIs) for automated business directory assistance for telephone numbers, hotel or airline reservations, insurance benefits and claims inquiries, stock quotes, and credit card balances.

These interfaces work best for simple tasks or limited vocabularies.  Typically, the VUI limits the appropriate verbal response with menu-driven choices to improve accuracy.  For example, a prompt may request "yes" or "no," or "benefits," "claims," or "representative."  In many cases, because of limited recognition accuracy, prompts request that users enter digits on the telephone keypad.  
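
To illustrate how a menu-driven VUI constrains recognition, the sketch below maps a small set of allowed responses to call-routing actions; the prompt vocabulary and action names are hypothetical examples, not part of any product described here.

```python
# Illustrative only: a menu-driven VUI constrains recognition to a few phrases.
# The allowed responses and routing actions below are hypothetical examples.

ALLOWED_RESPONSES = {
    "benefits": "route_to_benefits",
    "claims": "route_to_claims",
    "representative": "route_to_agent",
}

def route(recognized_text: str) -> str:
    """Map a recognized utterance to a call-routing action, or re-prompt."""
    word = recognized_text.strip().lower()
    # Restricting the expected vocabulary to three words keeps accuracy high
    # even with a speaker-independent model; unknown input falls back to keypad.
    return ALLOWED_RESPONSES.get(word, "reprompt_or_fall_back_to_keypad")

print(route("Claims"))  # -> route_to_claims
```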

Some software uses speaker-independent speech recognition (SISR) to deliver voice mail as text.  Voice commands for "smart" homes or cars typically use speaker-independent models.  Another speaker-independent application is audio search, which uses speech recognition to find specific phrases in audio files recorded by a call center.  Reviewing text is faster than listening to recorded speech.  Machine translation typically uses models based upon the translations of many translating authors.  Natural language understanding searches web content based upon key concepts usually derived from text from many different sources.   

In speech recognition for dictation (also known as large-vocabulary speech recognition), companies typically provide speaker-adaptive models that start with a speaker-independent model.  The user "adapts" this speaker-independent model to the speaker's voice through enrollment and corrective training.  Speaker-adaptive speech recognition (SASR) is the most popular method for business and professional speech recognition for dictation.  

Some speech and language technology necessarily references individual speech characteristics.  Speaker recognition includes speaker authentication and related voice biometrics applications, in which software attempts to match a speech sample from a known source with speech from an unidentified source.  Text-to-speech is necessarily individualized in that the typically human-sounding synthetic voice font represents a particular voice.  In some cases, applications support text-to-speech that imitates the speech of well-known, real-life speakers. 

Some research has supported making speech recognition more accurate using speaker-specific data.  Two Microsoft Asia researchers explained in 2004 that  the traditional method to obtain ideal speaker-dependent speech recognition (SDSR) is to adapt an SISR system to the specific speaker model with a small training set.  The authors noted that SASR may outperform both SISR and SDSR, but that "the SDSR systems were trained on waveforms of at most 2 hours long.  But how about increasing the training data to massive amounts?"

For this purpose, we propose using massive amounts of speaker-specific training data recorded in one’s daily life.  We call this Massively Speaker-Specific Recognition (MSSR) . . . . Initial results show that by changing the focus to MSSR, word error rates can drop very significantly.  In comparison with speaker-adaptive speech recognition system, MSSR also performs better since model parameters can be tuned to be suitable to one particular individual . . . . With the increasing popularity of household and professional voice/video-recording equipments and voice recording capability built into personal devices such as mobiles, it is relatively convenient to get massive amounts of speaker-specific speech in daily conversations . . . While  automatic transcription may not be very accurate, it is relatively affordable to have some speech transcribed.   Y. Shi & E. Chang, "Studies in Massively Speaker-Specific Speech Recognition" (IEEE 2004)

For more information on version compatibility, operating system requirements, and other dependencies, click here

SweetSpeech™ General Functions

  • Includes Model Builder for acoustic/language model and lexicon
  • Unicode compatible
  • Submit training audio-linked session files, text, and lexical data
  • Audio-linked session file = .TRS (training session file)
  • User management functionality included
  • Create individual, small-group, or large group models
  • Group models represent form of SISR
     
  • System supports nonadaptive speech recognition
  • No speaker microphone enrollment or corrective adaptation
     
  • Phonetics editor reduces reliance upon expensive lexical expertise
  • Text to speech phonetic pronunciation generator to help create lexicon
     
  • Other enhancements to minimize need for lexical knowledge
  • State tying/untying to support training/updates
  • Automatic linguistic questions generation
     
  • "Standard" speech recognition software features available
  • Regular expressions for forward/reverse text formatting
  • Wizards for new user and train user
  • User-specific tools for formatting text (e.g., dates, currencies)
  • End-user/high-level adjustment of processing parameters
  • Individualize speed vs. accuracy settings (e.g., beam pruning)
     
  • Use accumulator/combiner techniques to create acoustic model
  • Requires at least 6-8 hours of audio-aligned verbatim text
  • Supports more training audio-aligned data if available
  • Higher saturation available than with SASR
  • Generally returns word audio tags after 4 hours or less training data
     
  • Use minimum 8 kHz/8-bit for telephony
  • Use minimum 16 kHz/16-bit for other audio (see the audio-check sketch after this list)
  • Create acoustic model with recorded "everyday" speech
  • Use dictation, conversational speech, cell phone or landline, or video
  • Better results with good quality audio
  • E.g., minimal device, background, or channel noise
     
  • SR profile creates an accessible SharedProfile
  • Use data for other speech and language processing
  • E.g., voice commands, speaker recognition, text to speech, phoneme generation, machine translation, audio mining, and natural language understanding
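
As a simple illustration of the audio minimums listed above, the following sketch checks a WAV file against the 8 kHz/8-bit (telephony) or 16 kHz/16-bit thresholds using Python's standard wave module; the function name and workflow are illustrative only, not part of the toolkit.

```python
# Minimal sketch (not part of the toolkit): check a WAV file against the
# minimum audio specifications listed above before submitting it for training.
import wave

def meets_training_minimum(path: str, telephony: bool = False) -> bool:
    """Return True if the file meets 8 kHz/8-bit (telephony) or 16 kHz/16-bit."""
    min_rate = 8000 if telephony else 16000
    min_bytes_per_sample = 1 if telephony else 2   # 8-bit vs 16-bit samples
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() >= min_rate and
                wav.getsampwidth() >= min_bytes_per_sample)

# Example: reject low-quality audio before profile training.
# print(meets_training_minimum("dictation_001.wav"))
```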

Single Speaker and Small-Group User Profiles

Process uses the SpeechMax™ session file editor in the pre-automation, training phase to generate training data for the speech user profile, and in the automation phase, to edit server-based or real-time speech recognition. 

In the pre-automation phase, SpeechServers™ presegments dictation audio into an untranscribed session file (USF, .SES format) before manual transcription (MT). 

The transcriptionist loads the segmented audio into the SpeechMax™ session file editor and plays back audio segment by segment or continuously using a foot control or keyboard shortcuts.  The result is a transcribed session file (TSF) with audio-linked text.  The transcriptionist uses this session file to create formatted, corrected text for distribution as a report or document, and simultaneously generates a verbatim audio-linked session file (.SES) for speech user profile training. 

The operator submits this and other session files to SweetSpeech™ to create the speech user acoustic model.  The process similarly generates data from day-to-day transcription to create the language model (a representation of the speaker's pattern of word use) and lexicon (a dictionary of the speaker's pronunciations).
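
The sketch below shows, in simplified form, the kind of audio-linked training data this workflow produces: utterance segments with start/end times and verbatim text. The field names and JSON serialization are assumptions for illustration; they do not depict the actual .SES format.

```python
# Simplified sketch of audio-linked training data as described above.
# Field names are illustrative; the actual .SES format is not shown here.
from dataclasses import dataclass, asdict
import json

@dataclass
class Segment:
    start_ms: int        # utterance start within the dictation audio
    end_ms: int          # utterance end
    verbatim_text: str   # exact words spoken, used for acoustic training

@dataclass
class TrainingSessionFile:
    audio_path: str
    segments: list

session = TrainingSessionFile(
    audio_path="dictation_001.wav",
    segments=[
        Segment(0, 3200, "patient is a 54 year old male"),
        Segment(3200, 6900, "presenting with chest pain"),
    ],
)

# Serialize for submission to model training (illustrative only).
print(json.dumps(asdict(session), indent=2))
```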

SpeechServers™ assists with training process file management and user creation.  

Shared speech user profile is not limited to acoustic model, language model, and lexicon, and may include other data. For example, the profile may include audio or text segmentation data or text formatting preferences. 

The process uses these preferences for forward formatting of raw speech recognition engine text into a format acceptable to the dictating speaker, or for reverse formatting of final transcribed text into a speech recognition engine-compatible format. 
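
A minimal sketch of forward formatting with regular expressions appears below; the date and currency patterns are invented examples of speaker preferences, not the product's actual rules.

```python
# Minimal sketch: forward formatting of raw speech engine text with regular
# expressions. The patterns and speaker preferences below are illustrative.
import re

def forward_format(raw_text: str) -> str:
    text = raw_text
    # "january 5 2026" -> "January 5, 2026" (speaker prefers month day, year)
    text = re.sub(
        r"\b(january|february|march|april|may|june|july|august|september|"
        r"october|november|december) (\d{1,2}) (\d{4})\b",
        lambda m: f"{m.group(1).capitalize()} {m.group(2)}, {m.group(3)}",
        text,
    )
    # "45 dollars" -> "$45"
    text = re.sub(r"\b(\d+) dollars\b", r"$\1", text)
    return text

print(forward_format("seen on january 5 2026 copay of 45 dollars"))
# -> "seen on January 5, 2026 copay of $45"
```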

The process may also tune segmentation algorithms to the individual speaker's speech, and can use the same settings for presegmentation of audio before manual transcription and for segmentation immediately before speech-to-text decoding.

Value Proposition:  Toolkit also supports creation of small-group or large-group speaker profiles. 

Off-the-shelf Dragon and other SASR programs are designed for use by a single speaker and a single channel.  They may be modified for multispeaker, multichannel settings whereby each speaker is isolated with use of a separate microphone or recording device.  They do not provide for training and update of a speaker-dependent model in cases, such as depositions, trials, or regular business meetings, where two or more speakers share the same microphone or other recording device. 

  • Small-group profiles intended for use for specific small group
  • Uses same nonadaptive techniques as for single speaker
  • Best group results if > 2 speakers that meet regularly over a long period
  • Use training data from multispeaker single-channel or multi-channel source
  • May transcribe initial training data with MT or SR
  • Convert to .SES format to create speech user training data
  • Create both individual-specific and small-group profiles from data
  • Use for corporate board meetings, repeated videos, or police surveillance

Multispeaker techniques are designed primarily for situations where the same speakers are repeatedly present, such as a lengthy legal proceeding, sequential meetings of a corporation's board of directors, or long-term law enforcement or national security surveillance of a small group.  The small group speech user profile represents the range of the speech characteristics of the small group.

High-Level DIY Kit for Speaker-Specific Speech Recognition and Other Speech and Language Processing

Toolkit is designed as a do-it-yourself (DIY) product for government, hospitals, law firms, and other organizations with high transcription volumes.  Software also supports a "software as a service" (SaaS) model for transcription or other companies.  These companies could offer speech user profile development, profile hosting, and remote SR transcription. 

Specialized member of the transcription team or information services personnel could serve as the primary operator/supervisor for creating the speech user profiles. 

High Level Lexical Tools Supported

The software provides a variety of methods to help a novice unfamiliar with lexical representation create and train a speaker model, preferably a speaker-dependent one.

1. One approach is support for word models (templates). This form of speaker modeling does not require entry of a phonetic pronunciation into the lexicon, but only the word itself. It may be used for same-language speech-to-text conversion or translation.

2.  Methods include a phonetics editor that uses a text-to-speech engine for assistance in creating lexical pronunciations. The operator enters a pronunciation based upon a simplified phonetic scheme, plays back text-to-speech audio based upon the entered pronunciation, and determines whether the entered pronunciation accurately represents the dictated speech (see the sketch after this list). In one approach, the lexicon may be created with the simplified representation of individual speech sounds used by grade-school speech teachers to teach phonetics and proper pronunciation.

3. Software also includes a novel technique to generate speaker-dependent questions for state tying in automated linguistic questions generation. This results in selection of a different parameter range and incremental values for linguistic questions generation, and automatic selection of the best questions using a maximum likelihood-based distance measure. This avoids time-consuming generation of separate acoustic models, testing word error rate for each, and selecting a questions file based upon the lowest word error rate or other performance consideration.

4. Both manual and automatically generated questions may be used for state tying. Where both manually and automatically generated questions are available, in one approach, the state-tying methods will use questions that maximize likelihood of acoustic similarity with reference to data sparse triphones or other word subunits.
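
The sketch below illustrates the playback-and-confirm idea behind the phonetics editor described in item 2, using the open-source pyttsx3 text-to-speech library as a stand-in for the toolkit's TTS engine; the simplified respelling scheme and lexicon structure are assumptions for illustration.

```python
# Sketch only: verify a simplified phonetic respelling by ear before adding it
# to the lexicon. pyttsx3 stands in for the toolkit's TTS engine; the
# respelling scheme shown is illustrative.
import pyttsx3

def audition_pronunciation(respelling: str) -> None:
    """Speak the operator's simplified respelling (e.g., 'noo MOH nyuh')."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 130)     # slow playback for easier review
    engine.say(respelling)
    engine.runAndWait()

lexicon = {}
word, respelling = "pneumonia", "noo MOH nyuh"
audition_pronunciation(respelling)
# Operator confirms the playback matches the dictated speech before saving.
if input(f"Does '{respelling}' sound right for '{word}'? (y/n) ") == "y":
    lexicon[word] = respelling
```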

Use Accumulator and Combiner Techniques 

The process supports accumulator and combiner techniques with distributed computing for acoustic model updates. With this approach, states may be untied when updating the acoustic model. In the speaker-adaptive prior art, the acoustic model may be unsaturated in that there is a substantial lack of triphone or other word-subunit modeling based upon the end-user speaker's voice and style, especially at the outset of use.  Use of accumulator and combiner techniques may improve accuracy in speaker-dependent systems, where there is a relative abundance of trained triphone states (compared to speaker-adaptive systems) and relative acoustic model saturation when the end user begins use of speech recognition.
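
A simplified sketch of the accumulator/combiner idea follows: each computing node accumulates sufficient statistics for its share of the training audio, and a combiner merges them before model parameters are re-estimated. A single diagonal Gaussian stands in for a full HMM acoustic model; this is an illustration, not the toolkit's implementation.

```python
# Simplified accumulator/combiner sketch for a distributed model update.
# One diagonal Gaussian stands in for a full HMM acoustic model.
import numpy as np

def accumulate(features: np.ndarray) -> dict:
    """Per-node accumulator: count, sum, and sum of squares of feature frames."""
    return {"n": len(features),
            "sum": features.sum(axis=0),
            "sumsq": (features ** 2).sum(axis=0)}

def combine(accumulators: list) -> dict:
    """Combiner: add statistics from all nodes."""
    return {"n": sum(a["n"] for a in accumulators),
            "sum": sum(a["sum"] for a in accumulators),
            "sumsq": sum(a["sumsq"] for a in accumulators)}

def reestimate(acc: dict) -> tuple:
    """Re-estimate a diagonal Gaussian (mean, variance) from combined stats."""
    mean = acc["sum"] / acc["n"]
    var = acc["sumsq"] / acc["n"] - mean ** 2
    return mean, var

# Two "nodes", each with 100 frames of 13-dimensional features (e.g., MFCCs).
rng = np.random.default_rng(0)
node_stats = [accumulate(rng.normal(size=(100, 13))) for _ in range(2)]
mean, var = reestimate(combine(node_stats))
```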

SharedProfile™ for Accessible, Interchangeable Speech Data

Accessibility to interchangeable speech data can also promote improvement in related speech and language processing. 

For example, if there is a misrecognition using voice activation (command and control) or interactive voice response for telephony, the operator can listen to the speech associated with the text command.  The user may correct the result and submit a training session file to update the speaker model. The toolkit also potentially supports corrective updates for other automated speech and language processing.   

This is in contrast to "closed" speech systems.  There the operation and management of non-dictation, speech-related applications is typically transparent to the end user.  In these systems, end user cannot view and correct results for continued improvement of the speech user profile.
 
The speech engine may be used with the SpeechServers™ user and file management for back-end, server-based speech recognition.   Plugins for the SpeechMax™ HTML session file editor support real-time interactive speech recognition with the system and local, client-based transcription of an audio file. 

SpeechServers™ SAPI 5.x uses the same install kit as SweetSpeech™ but is separately licensed.

Word error rate reduction of about 1% (absolute) with the nonadaptive approach compared to the adaptive model (relative error reduction of nonadaptive SR up to 12.3% compared to adaptive SR)

Value Proposition:  The company's nonadaptive (NASR) approach creates a speech user profile directly from dictation and transcription data, without use of a speaker-independent profile.

A speaker-adaptive (SASR) user profile modifies a speaker-independent speech user profile, and creation of a speaker-independent (SISR) profile requires substantial expenditure.  Availability of a speaker-independent model saves time for the speaker using adaptive SR, but based upon the research cited above, there is no evidence that this extra, intermediate step improves accuracy compared to NASR.  

A speaker-independent (SI) speech user profile (model) is typically based on the speech of hundreds or thousands of speakers that must be recorded and evaluated.  Speaker-adaptive speech recognition (SASR) creates a hybrid user profile by modifying (adapting) the SI model to more closely resemble a speaker's speech.

Creation of a speaker-independent model requires upfront spending to create and test an SISR model for any major dialect (for example, UK English, American English, Australian English, Indian English, or Southern American English).  The high cost of creating this intermediate model may be one reason why SR is not available in many countries for all languages or dialects. 

In some cases, development of speaker-independent models might be impractical or impossible, e.g., for the speech-impaired.

Research cited above indicates higher accuracy for a nonadaptive system with large training volumes compared to speaker-adaptive or speaker-independent software.  Researchers found improved accuracy compared to SASR and SISR both with less than 8 hours and with greater than 15 hours of training data.  

Based upon minimum word error rates (WERs), average accuracy levels were as follows:  93.03% for nonadaptive speaker-specific SR (NASR), 92.05% for speaker-adaptive SR (SASR), and 85.42% for speaker-independent speech recognition (SISR).   As measured by relative error rate (RER) reduction, NASR accuracy was 52.2% higher than SISR and 12.3% higher than SASR for training sets over 15 hours. 

For training sets of less than 8 hours, NASR outperformed SISR and SASR by respective RER reductions of 45.1% and 8.6%.  With less than 8 hours of speech, nonadaptive SR had accuracy comparable to speaker-adaptive SR using more than 15 hours of speech.
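
The relative error reduction figures above follow directly from the reported accuracies, assuming WER = 100% minus accuracy and RER = (baseline WER − new WER) / baseline WER, as the short check below shows.

```python
# Worked check of the figures above, assuming
# WER = 100% - accuracy and RER = (baseline WER - new WER) / baseline WER.
wer_nasr = 100 - 93.03   # 6.97%
wer_sasr = 100 - 92.05   # 7.95%
wer_sisr = 100 - 85.42   # 14.58%

rer_vs_sisr = (wer_sisr - wer_nasr) / wer_sisr   # ~0.522 -> 52.2%
rer_vs_sasr = (wer_sasr - wer_nasr) / wer_sasr   # ~0.123 -> 12.3%

print(f"NASR vs SISR: {rer_vs_sisr:.1%}, NASR vs SASR: {rer_vs_sasr:.1%}")
```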

Researchers indicate that one reason for the better performance of NASR is the early saturation of the SASR acoustic model compared to NASR.  The researchers found that the inflection point was at about 1,000 sentences (1.3 hours): "After more than 1,000 sentences being added, the SASR does not response [respond] with the same significant improvement as before."  In contrast, the researchers found, "The performance of [nonadaptive speaker-specific SR] still has significant improvement even when the amount of training sentences exceeds 5,000."   

This may underestimate the potential for improved NASR accuracy compared to SASR.  As indicated above, the research notes the rapid saturation of a speaker-adaptive model, with an inflection point at 1.3 hours.  If the study had doubled or tripled the data collection for nonadaptive SR compared to adaptive SR, the accuracy difference might have increased substantially beyond the approximately 1% improvement seen in the study. 

In addition, system may prove especially beneficial in noisy environments such as homes, cars, factories, call centers, plane or tank cockpits, or outdoors.  Software models background or channel noise directly rather than using approximation techniques based upon SISR model. Noise modeling characteristics may significantly impact development of voice commands and other speech and language processing for "smart" homes or cars or other environments.


In summary, this research indicates that speaker adaptation of speaker-independent SR (SISR) model provides no significant benefit in terms of increased accuracy compared to nonadaptive approach using large amounts of speaker-specific training data. 

  • Increased relative error rate reduction with single-speaker training
  • Use minimum of 6-8 hours transcribed speech to create model
  • Model parameters tuned to actual speech and "noise"
  • Improved accuracy and reduced WER compared to SASR
  • Speaker adaptive SR saturates with large amounts of training data
  • Less responsive to new training compared to nonadaptive SR
  • No significant saturation with nonadaptive SR (NASR)
  • Can effectively saturate and train NASR with more speaker-specific data

Increased accuracy of nonadaptive speaker-specific model in chart below refers to percentage relative error reduction (RER) compared to SASR and SISR.  At < 8 hours and > 15 hours, RER was 8.6% and 12.3% compared to SASR.   Research indicates that NASR had approximately 1% lower word error rate (WER) compared to SASR. 

Increased accuracy in bar graph below from Microsoft research refers to 8.6% error reduction of nonadaptive recognition compared to adaptive technique with less than 8 hours of training data, and 12.3% error reduction with greater than 15 hours of training.  Word error rate (%) is about 1% lower for nonadaptive speaker-specific model for different levels of speaker training data.  Authors estimate word error rate is about 7% for training data of 12,000 sentences for nonadaptive technique, and 8% for adaptive model with same level of training data.  Similarly, there is improvement in relative error reduction of 45% and 52.2% for nonadaptive approach compared to speaker-independent model. 

Benefits:  

1.  Cost reduction by creating the speech user profile directly from dictation audio and manual transcription without reliance upon an intermediate model.  Speaker-adaptive SR imposes a significant upfront development cost of creating a speaker-independent model that is then used to create a speaker-dependent speech user profile.   The company's system supports creation of an NASR speaker model directly from day-to-day dictation, conversational speech, video, or other speaker-specific data without the expense of creating a speaker-independent model or adaptive algorithms.  

2.  Speaker enrollment data more representative.  The researchers note that the "adaptive" process involves speaker enrollment and general training (script reading).  The scripted process can result in recording speech that differs dramatically in lexicon, syntax, and style from ordinary, day-to-day dictation or conversational speech.  The process itself may introduce error.  Nonadaptive technology avoids this pitfall with use of day-to-day, nonscripted speech.

3. Improved word error rate.  According to the research, speaker-specific nonadaptive SR (NASR) has a word error rate (WER) about 1% lower than the adaptive approach (SASR).  This indicates slightly improved accuracy for NASR compared to SASR. 

4.  Direct modeling of channel or background noise.  The authors note that mainstream adaptive SR uses MAP, MLLR, or other mathematical approximations to model the speech user profile, including background noise, rather than modeling the actual raw data.   The nonadaptive approach directly models the user's speech and noise. This may have important implications for development of "smart" technology controlled by voice commands in noisy environments, such as homes or cars.        

5.  Mainstream speaker-adaptive SR does not support creation of small-group models.  Company software also supports creation of small-group and individual user profiles for groups that meet frequently, e.g., board of directors, parties engaged in lengthy courtroom litigation, and groups under police surveillance.  The nonadaptive small-group profile would represent the range of speaker characteristics for the group. 

Efficiently Training NASR Speech Engine

Company has developed special tools for training SweetSpeech™ models from day-to-day speech.  It is envisioned that specialized "speech trainer" members of the transcription or information services team could supervise or perform  these functions.  Steps in the process include:   

1.  Dictation audio segmentation with SpeechSplitter™ creates an untranscribed session file (.SES) that is transcribed to create audio-linked text in a transcribed session file (.SES).  The process uses the verbatim audio-linked text to train the speech user profile.  This seamless change in the dictation transcription process can produce a vast amount of useful training data (see the segmentation sketch after this list). 

2.  The process can use a split/merge technique to associate verbatim text with utterance audio for previously transcribed text.  This technique uses SpeechMax™ to rapidly link audio from prior dictation with previously created verbatim text.  The process does not require extensive transcription expertise.

3.  SpeechMax™ can potentially transform previously transcribed speech recognition session files to common .SES format and use session files for training nonadaptive SR profile. 

4.  The toolkit provides additional tools to assist with speech user profile creation and reduce reliance upon expensive lexical expertise.  For example, an automatic text-to-speech (TTS) pronunciation generator can assist with creation of the speech user profile lexicon, and automatic linguistic questions generation reduces the need for lexical expertise in configuring the search during speech-to-text decoding.
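
As referenced in step 1, the sketch below illustrates the general idea of energy-based utterance segmentation of a dictation file. The frame-energy approach, thresholds, and assumption of 16-bit mono WAV input are illustrative; this is not the SpeechSplitter™ implementation.

```python
# Minimal sketch of energy-based utterance segmentation, the general idea
# behind presegmenting dictation audio before transcription. Assumes 16-bit
# mono PCM WAV input; thresholds are illustrative, not SpeechSplitter's.
import wave
import numpy as np

def split_utterances(path, frame_ms=30, silence_db=-35.0, min_silence_frames=10):
    """Return (start_sec, end_sec) spans separated by sustained silence."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    frame_len = int(rate * frame_ms / 1000)
    spans, start, end, silent_run = [], None, None, 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len].astype(np.float64)
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-9
        db = 20 * np.log10(rms / 32768.0)
        if db > silence_db:                       # speech frame
            if start is None:
                start = i
            end = i + frame_len
            silent_run = 0
        elif start is not None:                   # silence inside an utterance
            silent_run += 1
            if silent_run >= min_silence_frames:  # long pause closes the utterance
                spans.append((start / rate, end / rate))
                start, end, silent_run = None, None, 0
    if start is not None:
        spans.append((start / rate, end / rate))
    return spans

# Each span becomes one segment of an untranscribed session file.
# print(split_utterances("dictation_001.wav"))
```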

Potential Users of Individual-Specific Speech and Language Models  

Some potential immediate beneficiaries of speaker-specific technology include the following:

1. This "do-it-yourself" (DIY) approach may interest transcription companies.  .  Software opens possibility of income capture from creation of SR speech user profiles.  In one approach, a transcription company could offer "software as a service" (SaaS) .  The company could create speech user profiles from dictation and transcription, host the SR software and databases for speech recognition.  The approach could also potentially benefit companies, government, or individuals with high manual transcription volumes.  

2. In some developing countries there is limited or no availability of speech recognition in that country's language or dialect(s).  In this setting, custom creation of speech user profiles could represent a profitable transcription business activity for government or business users.  Even in the U.S. there are persons (e.g., foreign born) whose acoustic profile does not match well with the native population.  Custom creation of speech user profiles in this setting could prove beneficial as well. 

3. Speaker-specific technology can potentially benefit the speech-impaired.   

The following videos illustrate creation of and transcription with nonadaptive speaker profile.    

Demo #1A  . . . SpeechSplitter™ utterance (phrase) segmentation to create untranscribed session file from dictation . . .  The audio file is "split" to create a segmented untranscribed session file.  The transcriptionist manually transcribes in SpeechMax™ to create a transcribed session file (TSF) using PlaySpeech™.   The demo also shows realignment of segment boundary markers to include "period" within a larger segment.   Flash WMP MP4    

Demo #1B . . . SpeechSplitter™ utterance (phrase) segmentation to create untranscribed session file from dictation . . .  Audio file is "split" to create segmented untranscribed session file.  Transcriptionist imports previously transcribed text, sequentially listens to each untranscribed utterance using PlaySpeech™ (audio playable by clicking audio-linked text), and sequentially delimits each utterance by toggling play audio control.  The result is a transcribed session file as above. The segmented transcribed session file can be used as a training session file.  Flash WMP MP4   

Demo #2  . . . Server-based transcription using SweetSpeech™ speech-to-text conversion . . . The speech user profile was created with the SweetSpeech™ speech and language processing toolkit.  The video first shows text immediately after speech-to-text conversion (raw speech engine decoding), followed by regular expression algorithms that search and match text strings.  Conversion rules may reflect locale and speaker or institutional preferences.   The user then loads the post-formatting transcribed session file (TSF) into SpeechMax™ to play back audio and make any needed corrections.  Flash WMP MP4

Use About to obtain more information about version compatibility with different SR and OS.  
 

Price, terms, specifications, and availability are subject to change without notice. Custom Speech USA, Inc. trademarks are indicated.   Other marks are the property of their respective owners. Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.)