More than a word processor™

For dictation, transcription, speech recognition and other speech and language processing

Supports analysis, conversion, and processing of other text, audio, and image pattern recognition using application programming interface (API) and software development kit (SDK)

Reduce search time for pattern recognition errors with advanced text compare



Company software supports workflow and business systems integration, workflow design, call center, telephone dictation, digital dictation with handheld microphone, manual transcription audio playback with foot control, audio conversion, and macro creation for programmable keypad, voice, or bar code

U.S. patents and patents pending 


Session file content represents the sum total of SpeechMax™ document and annotation (comment) data from audio, text, or image data processing.

Software has HTML display and XML storage of proprietary session file content. 

Full Edition supports audio record for dictation, audio playback with transcriptionist foot pedal or hotkeys, real-time speech recognition, desktop autotranscribe audio file with speech recognition, edit server-based speech recognition, multilingual,  multiwindow document display, and text and audio annotation (comments).   It integrates with SpeechServers™ server-based speech recognition and SweetSpeech™ speech engine and model builder. 

Reader Edition displays session file text, audio, and image content.  Document license-embedded SessionLock™ is designed to prevent unauthorized editing.

Microsoft Word runs SpeechMax™ Microsoft Office Toolbar Add-In.  Add-In has Next/Previous toolbar arrows for user navigation to Word bookmarks.  Add-In also has Import/Export toolbar functions that support annotation and phrase migration wizards.  These wizards support data migration to/from SpeechMax™ session file to Word. 

SpeechMax™ is available with SpeechProfessional™ combo package. 

For more information on version compatibility, operating system requirements, and other dependencies, click here

SpeechMax™ General Functions

  • Document window supports read/write text, audio, or image
  • Read/write includes .TXT, .RTF, .HTML, or .SES (proprietary session file)
  • Copy/paste graphics into document window
  • Content stored as XML
  • Single, double, or multiple window viewing available
  • Tile documents horizontally, vertically, or cascade
  • Enter text by keyboard, barcode, or speech recognition
  • SweetSpeech™  real-time or server-based (local autotranscribe) SR
  • Plugins (add-ins) for third-party software
  • Includes SR, text to speech, and other speech and language processing
  • Continuous or segmental (utterance) time-stamped audio playback
  • Select text, playback audio with audio-tagged document text
  • Use keyboard hotkeys or foot control to start/stop playback
  • Text compare across document (like word processor)
  • Represents text compare of text strings with no reference to source data
  • Text compare by phrase (synchronized text compare)
  • Use to compare SR audio-aligned text
  • Use different SR SDKs to convert output to CSUSA proprietary .SES format
  • Create equal number of segments (phrases) in .SES session files
  • Each phrase (utterance) arises from same audio
  • Each respective phrase has same start/duration time
  • Once synchronized, text in segments (phrase) compared
  • Differences highlighted, matches clear
  • Differences indicate higher error risk
  • Matches indicate greater reliability
  • Consolidated text compare color codes (highlights) degree of differences
  • Visualized in single window
  • E.g., with 3 texts, clear = all match, pink = 2 differ, red = all differ
  • Any type of text can be phrase compared if synchronized
  • With equal numbers of segments, session files are "synchronized"
  • Phrase compare supported even if underlying source data different
  • For example, phrase compare synchronized SR text and translation text
  • DataInSync™ = bounded synchronized data input for pattern recognition
  • Input data may include text, audio, image, volumes, and spaces
  • Vertical placeholders delimit bounded input and output data
  • Optional display with/without vertical placeholders (delimiters) 
  • Create text and/or audio annotation (comment) to selected document text
  • Unlimited annotations (multilevel data) to selected text
  • Unlimited users (multiuser collaborative) may annotate same text
  • Text annotation may include comment or question
  • Text annotation may include hyperlink, command line
  • Command line may launch program, e.g., open media player or web page
  • Use annotation window sound recorder to record audio comment  
  • Load audio file or use text to speech to create audio annotation
  • Switch (transpose) document/annotation content
  • Selectively move annotation text into document to replace session file text
  • Selectively move annotation audio into document to replace audio
  • Divide session file segments into two or more groups
  • Sort (scramble)/unsort (unscramble) group session file content
  • Session lock converts document/annotation content to read only
  • Data migration of document phrase or annotation
  • Application programming interface (API)
  • Web services and file management available through workflow manager
  • Web-based Help files

For complete description of functionality, see Web Help
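
The synchronization rules listed above for audio-derived session files (equal number of segments, each respective phrase with the same start/duration time) can be sketched as a small data structure.  This is an illustration only: the names `Segment` and `is_synchronized` and the field layout are assumptions, and do not reflect the actual .SES format or the SpeechMax™ API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One audio-tagged phrase (utterance). Field names are hypothetical."""
    start_ms: int     # start time of the phrase audio
    duration_ms: int  # duration of the phrase audio
    text: str         # text associated with the phrase


def is_synchronized(a: List[Segment], b: List[Segment]) -> bool:
    """True if both sessions have the same number of segments and each
    respective segment has the same start/duration time."""
    if len(a) != len(b):
        return False
    return all(
        x.start_ms == y.start_ms and x.duration_ms == y.duration_ms
        for x, y in zip(a, b)
    )


# Two engines segmented identically; a third produced one long segment.
engine_a = [Segment(0, 1200, "I pledge allegiance"), Segment(1200, 900, "to the flag")]
engine_b = [Segment(0, 1200, "I pledge a legions"), Segment(1200, 900, "to the flag")]
engine_c = [Segment(0, 2100, "I pledge allegiance to the flag")]

print(is_synchronized(engine_a, engine_b))  # same boundaries, text may differ
print(is_synchronized(engine_a, engine_c))  # different segment count
```

The text of each segment is deliberately ignored by the check: synchronization concerns segment boundaries, and differences in text are what the later phrase compare highlights.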

Document Window

Screen shot shows toolbars and main document window.  Purple vertical markers represent placeholders delimiting phrases in the Pledge of Allegiance. 

Main/Annotation Window

First screen shot shows main document window and annotation window.   Annotation window includes sound recorder and text box.  Main window has form created using annotations.  Audio annotation (blue highlighting) supports text and/or audio entry.  Text annotation (purple highlighting, not shown) supports text entry only.  Text "Streeter" in annotation window represents last name.  It may be moved to first blank after "Full Name" in the document window form. 

Last name "Streeter" entered into text box may be entered into the first blank of the "Full Name" field (last name first), as shown in the second screen shot above.  The remainder of the form has been completed with annotation window data.  The underlining is automatically removed as annotation text is moved into document window. 

Multiwindow Tiled Display

Screen shot shows tiled horizontal display of 3 session files representing original English text and 2 translations.   Vertical placeholders delimit the text in all 3 document windows.   The translations have the same number of segments delimited by the same punctuation.  The last synchronized segment is highlighted in all three windows.  Software supports horizontal, vertical, and cascade tiling.

Special Editing Functions For Training Speech User Profile

In manual dictation, the final text distributed in a letter, document, or report may include grammatical or factual corrections to the dictation, deletion of extraneous remarks or redundant material, and inclusion of non-dictated elements such as headers, footers, tables, graphs, and images.  This is different from "verbatim text," which represents what the speaker actually said. 

Verbatim text, along with its audio, is required for accurate training of a speech user profile.   In addition, a speech user profile requires creation of a lexicon for the speaker, including the lexical pronunciation of words in the speaker's documents.

Interface is designed to provide a wide range of functionality related to speech and language processing, including both completing the final (distribution) report and generating verbatim text for speech user profile training. 

Program integrates transcription and supports a common interface for manual transcription, SR editing, and saving SR training data.  As another example, software supports selective exclusion of audio-aligned text unsuitable for training due to poor diction, increased background noise, or other factors. 

In contrast to standard word processors, it also includes easy-to-use optional tools to support selective exclusion of undictated text, such as headers, footers, plus nondictated elements, such as tables, graphs, and images.  Much of this can be performed with use of macros or keyboard shortcuts.

In many cases, interfaces do not provide a session file for review of SR command and control recognition errors.  For example, voice activation programs typically activate a feature of a software program.  If the voice command fails to achieve its desired result, the end user does not know whether this was due to a misrecognition, and has no opportunity to update and train the speech user profile.

It is intended that a specialized transcriptionist or SR editor or IS personnel would supervise the transcription/SR editing team and handle more difficult issues.  It is anticipated that some of the tools described would be used rarely, if at all, by the typical transcriptionist or SR editor, e.g., remap word or phoneme audio tag or display sound waveform with wave analysis tool. 

Special editing functions include: 

  • Exclude utterance with poor quality audio from training data
  • Exclude headers, footers, and other nondictated text from user profile
  • Substitute verbatim for nonverbatim text for training user profile
  • Uses annotation tab to create/substitute verbatim for nonverbatim text
  • Results in creation of two session files, one verbatim, one final text+++
  • Display audio-linked transcription as words or as phonemes
  • Remap word or phoneme audio tag (automatically) after correction
  • Create lexical pronunciation (phonemes) for new word
  • Check if word in speech user lexicon with search function
  • Database indicates if word in database and lexical representation
  • If in lexicon, editor can view phoneme (phonetic) representation
  • If not in lexicon, user creates pronunciation as out of vocabulary word***
  • Also create database of key concepts associated to original audio and text
  • Key concepts may include key words
  • May also reference paraphrases, translation, or phonetic pronunciation
  • Tool useful in creating language model
  • Associate certain words to category, e.g., color, date, or name
  • Speech engine knows to limit word search based upon category
  • Use text annotation (comment) to enter key concept
  • Enter text annotation with keyboard, SR, or barcode
  • May also enter audio annotation for manual transcription
  • Editor may reverse/forward format dates, measures, and currencies
  • Reverse formatted text is in speech engine compatible format
  • Forward formatted text represents normal document display
  • Export paired document text and audio to train user profile
  • Export paired annotation text and audio to train user profile
  • User profile may represent single speaker or group user profile
  • Utterance tag means that selecting any word plays back entire phrase
  • Separate word tags mean that user can select word and play back word audio
  • Assign word tags to phrase-tagged (utterance) audio
  • Process described as "subsegmental utterance splitter"
  • Uses speech engine functionality to create word tags
  • Subsegmental splitting begins with utterance text
  • Speech engine automatically creates separate word audio tags
  • Display sound waveform with speech wave analysis tool
  • Waveform analysis available as tab off annotation window
  • Change utterance boundaries in speech wave window
  • Use to split or merge audio-linked document window text

Value Proposition: Generation of two session files for each SR or manual transcription helps generate data for training SR speech user profile, but also assists with quality assurance (QA) or litigation-related searches for dictation audio+++

Problem:  How does QA or a legal team find particular audio segment without playing back the audio file?

Solution:  Audio-linked text for both manual transcription and SR makes it easier to quickly find the audio in question.

In addition to facilitating rapid speech user training, the availability of two session files helps find audio for QA or litigation purposes. 

Speech recognition routinely segments audio prior to speech-to-text decoding step.  CSUSA enhancements segment dictation audio prior to manual transcription.  This produces a transcribed session file with audio-tagged text by phrase. 

The frequent lack of audio-aligned text generated by manual transcription makes it difficult for a transcriptionist, QA personnel, or other parties to quickly find the specific audio associated to a particular word or phrase.  Instead, the entire audio file must be played back and searched.  The lack of audio-aligned segmented text also makes it difficult to sequentially track document changes and translation while associating the audio to the proper text segment.

With the session file editor, user can find word or phrase in question by reviewing verbatim text (what the speaker said) or distribution text (that may differ from verbatim). 

Unlike standard manual transcription where the QA personnel or litigation team must listen to audio file during word search, process supports viewing text first.  With the text found, user can click text and playback audio.  Further, audio phrase may easily be tracked to subsequent edits or translations.

Audio-linked verbatim (training) text may differ from distribution text and assist finding speech that is not referenced in distribution text.

Value Proposition: Create phonetic text without linguistic expert***

Problem:  How does a manual transcriptionist, SR editor, or other member of the transcription team or IS personnel create phonetic pronunciation with no background in linguistics? 

Solution: Personnel can use text to speech phonetics pronunciation generator to generate phonetic pronunciation for lexicon.  Company experience is that consistency of use is more important than phonemic representation chosen.  This same tool is used to create model builder lexicon.

This tool enables the SR editor, transcriptionist, or model builder personnel to create phonetic pronunciation for words without resort to linguistics expert.

The process does not have to be repeated for every user after initial lexicon creation for appropriate language dialect.  Company experience is that phoneme representation is very similar for different speakers speaking the same dialect of a given language. 

For example, phoneme representation for American English speakers is relatively uniform, but different for Canadian, British, Australian, or South African English.  It is possible to customize lexicon to account for regional variations in dialect or nonnative speakers.

It is envisioned that the speech editor involved in lexicon creation would be a specialized member of the transcription team or other IS personnel with interest or training in speech and language. 

An out-of-vocabulary (OOV) word cannot be recognized because it is not in the lexicon used by the speech user profile.  User can create phonetic spelling (phonemes) for OOV words and add the word and lexical spelling to the lexicon. 

If lexical expertise is available, expert can listen to audio and supply word phonemes. 

If lexical expert is unavailable, speech editor can use text to speech (TTS) pronunciation generator.  This generator uses a simplified system of phonemes equivalent to phonetic representation used in primary school "phonics."  Actual text representation of phonemes is felt to be less important than consistency in associating speech sounds to given phoneme text.  

The process involves these steps: 

First, operator listens to speech audio. 

Second, operator selects appropriate language in TTS generator.  This automatically calls up appropriate phonetic set available with Microsoft text to speech. 

Third, operator adjusts phonemes in TTS generator text box.  

Fourth, operator plays back synthetic (TTS) speech from phonemes and compares to original speech audio.  If there is a good match with audio, editor enters word and associated phonemes into lexicon.  If it is not a good match, editor adjusts phonemes until a good result is obtained. 

Good result implies good match of synthetic, TTS audio with original spoken audio for the OOV word.
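
The lexicon side of the four steps above can be sketched as follows.  This is a minimal illustration under stated assumptions: the lexicon is modeled as an in-memory dictionary, the phoneme symbols are placeholders rather than the Microsoft text to speech phonetic set, and the operator's listening judgment (steps one and four) is represented by a hypothetical callback `is_good_match`.  None of the function names come from the product.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical in-memory lexicon: word -> list of phoneme symbols.
Lexicon = Dict[str, List[str]]


def lookup(lexicon: Lexicon, word: str) -> Optional[List[str]]:
    """Return the phonemes for a word, or None if it is out of vocabulary."""
    return lexicon.get(word.lower())


def first_accepted(
    candidates: Iterable[List[str]],
    is_good_match: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Try candidate pronunciations in order until the operator accepts one."""
    for phonemes in candidates:
        if is_good_match(phonemes):
            return phonemes
    return None


def add_oov_word(lexicon: Lexicon, word: str, phonemes: List[str]) -> bool:
    """Add word and pronunciation if not already present. True if added."""
    key = word.lower()
    if key in lexicon:
        return False
    lexicon[key] = list(phonemes)
    return True


lexicon: Lexicon = {"hat": ["h", "a", "t"]}
word = "flat"
candidates = [["f", "a", "t"], ["f", "l", "a", "t"]]

# Stand-in for the operator comparing TTS playback with the original audio.
accepted = first_accepted(candidates, lambda p: p == ["f", "l", "a", "t"])

if lookup(lexicon, word) is None and accepted is not None:
    add_oov_word(lexicon, word, accepted)

print(lexicon[word])
```

Storing one agreed symbol set per word reflects the point made above that consistency in associating speech sounds to phoneme text matters more than the particular representation chosen.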

Text Compare By Document/By Phrase

Workflow proceeds to session file editor for preparation of final text.  One important task is text compare to find and correct errors.

Software supports text compare across the document.  Comparison looks at text strings only without regard to source data.  This is how a typical text processor performs text compare.   

Text comparison by phrase is also available.  Software synchronizes document and compares text to the same audio utterance or other audio. 

Without synchronization, it is more difficult for an operator to quickly determine whether words or phrases arise from the same utterance audio.  This is particularly true for longer documents or documents with lower accuracy.  

1.  For SR text compare, software converts Dragon (Nuance), Microsoft Windows, IBM, or other SR session files to CSUSA session file format (.SES). 

2.  Operator selects text compare across document or by phrase (utterance). 

3.  SR session files are typically "asynchronized" with different start/duration time stamps for each respective audio-aligned segment.  If operator selects by phrase option, DataInSync™ feature resegments and retags the audio for each session file. 

4.  This results in an identical number of session file segments for each SR engine.  Respective segments each have the same start/duration times. 

5.  These synchronized files are text compared segment by segment (utterance by utterance). 

6.  Speech editor highlights differences by text compare; these represent potential SR errors.  Editor tabs to differences, listens to audio, and corrects SR errors as required. 

7.  Editor also corrects obvious speaker errors, e.g., speaker said right when meant left.  Speech editor may insert comments or questions about text and send to speaker as annotation (comment). 

8.  Editor saves audio-linked verbatim text (what speaker said) as training session file (.TRS).  There is minimal audio to review with highly accurate SR systems since there are few differences. 
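
Step 5 of the workflow above (segment-by-segment comparison of synchronized files) can be sketched as below.  This is a simplified illustration that assumes each session file has already been reduced to a list of segment texts; the function name `phrase_compare` is hypothetical, and exact string equality is used where the product may normalize text differently.

```python
from typing import List, Tuple


def phrase_compare(a: List[str], b: List[str]) -> List[Tuple[int, str, str]]:
    """Compare two synchronized texts segment by segment.

    Returns (segment index, text from a, text from b) for each segment
    whose text differs. Raises ValueError if segment counts differ.
    """
    if len(a) != len(b):
        raise ValueError("texts are not synchronized: segment counts differ")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


engine_a = ["the roof is", "well maintained", "on the right side"]
engine_b = ["the roof is", "well-maintained", "on the right side"]

differences = phrase_compare(engine_a, engine_b)
for index, text_a, text_b in differences:
    print(f"segment {index}: {text_a!r} vs {text_b!r}")
```

Only the segments returned here would need audio review in step 6; matching segments are left clear.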

Synchronization = Session File with Same Number of Segments 

Software can phrase compare texts from any source or processing method as long as they have the same number of segments.  The delimiters vary depending upon the type of file.  For dictation, manual transcription, and speech recognition, the delimiters are generally utterance boundaries.  For translation, it can be punctuation such as period, comma, colon, or semicolon.  
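
For translation, where punctuation serves as the delimiter, the segment-count requirement can be sketched as follows.  This is an assumption-laden illustration: the regular expression, the function names, and the sample sentences are all hypothetical, and the product's own delimiter handling may differ.

```python
import re
from typing import List

# Split after a period, comma, colon, or semicolon followed by whitespace.
_DELIMITER = re.compile(r"(?<=[.,:;])\s+")


def segment_by_punctuation(text: str) -> List[str]:
    """Split text into segments that each end at a punctuation delimiter."""
    return [s for s in _DELIMITER.split(text.strip()) if s]


def can_phrase_compare(*texts: str) -> bool:
    """True if all texts yield the same number of segments."""
    counts = {len(segment_by_punctuation(t)) for t in texts}
    return len(counts) == 1


english = "I pledge allegiance to the flag, and to the republic; one nation."
spanish = "Prometo lealtad a la bandera, y a la república; una nación."

print(segment_by_punctuation(english))
print(can_phrase_compare(english, spanish))
```

The check looks only at segment counts, in line with the statement above that texts from any source can be phrase compared as long as they have the same number of segments.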

Text Compare and Traditional Multiengine, Multipass Techniques and "Voting" Using Confidence Scores

In speech recognition for dictation, confidence scores are assigned to alternative hypotheses as to what the speaker said during speech-to-text decoding.  The display text in the graphic user interface in a speech recognition system for dictation represents a "best result" based upon the highest-scoring hypothesis.  Alternative hypotheses may be listed in a drop-down window or other dialog for user reference and typically may be selected for substitution in the display text. 

Accuracy may be improved by exploiting differences in the nature of errors made by multiple speech recognition systems through an automatic rescoring or "voting" process that selects the output with the best score to identify the correct word.  See, e.g., Jonathan G. Fiscus, "A Post-Processing System to Yield Reduced Word Error Rates:  Recognizer Output Voting Error Reduction (ROVER)," pp. 347-354, 0-7805-3695-4/97, Copyright 1997 IEEE. 

ROVER and similar traditional multipass techniques represent an attempt to improve the accuracy of speech recognition text by aggregating the most-likely accurate results from different speech recognition engines.  Text compare is a process designed to alert the reviewing speech recognition editor or speaker of likely errors based upon differences in output, not to improve the accuracy of output text.

Company process can also use ROVER techniques to aggregate text and then text compare two or more aggregations. 
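
The contrast between voting and text compare can be shown with a toy example.  This sketch is not ROVER itself: ROVER aligns recognizer outputs before voting and can weigh confidence scores, whereas the code below assumes the words are already aligned and uses a simple majority vote.  The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def vote(words: Sequence[str]) -> str:
    """Voting: pick the most common word among the engines' outputs."""
    return Counter(words).most_common(1)[0][0]


def flag(words: Sequence[str]) -> bool:
    """Text compare: report whether the engines disagree at this position."""
    return len(set(words)) > 1


# One tuple per word position, one entry per speech engine (pre-aligned).
aligned = [("the", "the", "the"), ("hat", "that", "hat"), ("is", "is", "is")]

voted: List[str] = [vote(w) for w in aligned]
flagged: List[int] = [i for i, w in enumerate(aligned) if flag(w)]

print(voted)    # a single merged text, chosen automatically
print(flagged)  # positions sent to a human editor for audio review
```

Voting produces one output text and hides the disagreement; text compare leaves the text as is and points the editor to the position where the engines differ.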

Segmentation is one of several conversion variables that may differ between speech engines.  Others include language model, acoustic model, lexicon, and speed vs. accuracy settings.  Absence of differences between systems that use different conversion techniques increases confidence that matches represent the correct result.  This is particularly true the greater the number of compared speech engines.

Text Compare Display Options

Process may perform text compare in the multiwindow session file editor between:

1.  Two or more viewable buffered read/write windows

2.  Viewable session and "hidden" session, display differences as popup

3.  Different texts displayed as popup dialog when cursor passed over text

4.  Consolidated text compare

Screen shots are provided:

1.  Two or more viewable buffered read/write windows

Using "dual-engine" comparison with two visualized buffered document windows, speech editor can text compare with both sets of text visualized.  Operator selects phrase compare or text compare (across document) by clicking on the "compare phrases" or "compare document" button.

a.  Phrase Compare

The first example represents phrase compare where segments are synchronized and equal in number.  Overall accuracy is about the same in both texts.  Dropdown menu indicates the difference in the other (lower) panel.  The window reflects an old (SpeechMax+) version of the software. 

b.  Text Compare Across Document.

In the second example (newer version of software), user has selected conventional text compare across the entire document.  Top window text has about 90% accuracy, bottom window has just under 100% accuracy.  Dropdown menu selection is play difference (audio) for the first highlighted difference.  

In both sessions in this dual-window display, relatively few differences suggest both are highly accurate.

Transcriptionist should only have to listen to about 10% of speech recognition audio with these highly accurate texts.  Reduction in transcriptionist review time is about 90%.  Example is consistent with conclusion that increasing accuracy will reduce text differences, and that differences will approach zero as accuracy nears 100%.

2.  Viewable session and "hidden" session, display differences as popup

Demo #3 . . . Single-window dual-engine comparison using server-based SweetSpeech™ and Dragon NaturallySpeaking . . . User sequentially opens Dragon and Custom Speech USA™ session files, clicks compare documents toolbar button to highlight differences, plays differences using menu dropdown, makes changes, increases leading/trailing playback to listen to word "of," and copies/pastes "well-maintained" from Dragon to format text, and enters new lines. Operator saves final distribution report as .txt. Small cap "l" was transcribed by both engines and capitalized as "L" for report distribution. Since user did not create a separate verbatim annotation, the final text is automatically saved as the verbatim text using Instant Verbatim™ feature.  Flash WMP MP4 (for other text compare video demos see below after paragraph 12)

3.  Different texts displayed as popup dialog when cursor passed over text (not shown)

4.  Consolidated text compare, e.g., Text A vs. Text B, Text B vs. Text C, Text A vs. Text C ==> color coding. 

SR editor can also use consolidated text compare using 3 or more SR systems in Best Guess mode. 

Color code indicates degree of differences, for example, clear = no difference, pink = difference between 2, or red = all 3 differ.

With a large number of sessions, a high number of matches indicates increased likelihood of reliability; text is virtually all clear.  A large number of differences indicates increased likelihood of error; text is mostly highlighted.
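
The three-text color code described above can be sketched as a count of distinct results per synchronized segment.  This is an illustration only: the function name `color_code` and the sample segments are assumptions, and the mapping covers exactly three texts as in the example.

```python
from typing import List


def color_code(a: str, b: str, c: str) -> str:
    """Map one synchronized segment from three texts to a highlight color.

    1 distinct text  -> clear (all match)
    2 distinct texts -> pink  (one text differs from the other two)
    3 distinct texts -> red   (all differ)
    """
    distinct = len({a, b, c})
    return {1: "clear", 2: "pink", 3: "red"}[distinct]


text_a = ["the patient", "wore a hat", "today"]
text_b = ["the patient", "wore a flat", "to day"]
text_c = ["the patient", "wore that", "today"]

colors: List[str] = [color_code(*segment) for segment in zip(text_a, text_b, text_c)]
print(colors)  # ['clear', 'red', 'pink']
```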

In the color-coded Best Guess example below, consolidated text compare highlights about 55% of text (representing differences) with about 45% representing nonhighlighted text (nondifferences).   Accuracy is given as 75.6%.  Original word error rate is 24.4%.  

As indicated in next section, one approach is for speech editor to review audio for all highlighted text and make corrections as needed.  In addition, editor should review reliability index data for each (nonhighlighted) word.  If reliability index indicates that there is no history of identical misrecognition, speech editor does not need to review audio for that word. Word may be reviewed by dictating speaker.

Value Proposition: Focus on "differences" audio and reduce audio review. 

Problem:  A major SR company advertises up to 98% accuracy.  If editor reviews 10 hours of audio with this accuracy, 9 hours, 48 minutes represents correctly recognized speech, and 12 minutes represents errors. At 90% SR accuracy, editor reviews 10 hours of audio to find 1 hour of misrecognitions. With increasing accuracy of SR and other pattern recognition, an emerging issue is efficiently finding the increasingly few pattern recognition errors.
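
The arithmetic in the problem statement can be checked with a short calculation.  The sketch assumes, as the text does, that the share of audio containing errors equals the error rate; the function names are hypothetical.

```python
from typing import Tuple


def split_audio_minutes(total_minutes: float, accuracy_percent: float) -> Tuple[float, float]:
    """Return (minutes correctly recognized, minutes containing errors)."""
    error = total_minutes * (100 - accuracy_percent) / 100
    return total_minutes - error, error


def as_hours_minutes(minutes: float) -> str:
    """Format a number of minutes as hours and minutes."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours} h {mins} min"


for accuracy in (98, 90):
    correct, errors = split_audio_minutes(600, accuracy)  # 10 hours of audio
    print(f"{accuracy}%: correct {as_hours_minutes(correct)}, errors {as_hours_minutes(errors)}")
# 98%: correct 9 h 48 min, errors 0 h 12 min
# 90%: correct 9 h 0 min, errors 1 h 0 min
```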



1.  Synchronizes output using text compare (phrases).  Using output from two or more SR programs, automated wave analysis identifies identical start/end SR session file text arising from same audio.  Retagging and resegmenting algorithm creates an identical number of synchronized segments in each session file.  Operator may text compare by segment (usually short phrase) or use traditional text compare across entire document.

2.  Significant (optional) reduction in audio review time with highly accurate SR.  Text compare supports an expected 80% reduction in speech recognition editor audio review time for 90% accurate SR.  This assumes that system can rely primarily upon dictating speaker review, and does not require complete review by speech editor as well.  In this setting, it is expected that audio review time using text compare approaches zero as SR accuracy approaches 100%.  The exception to this approach is where reliability index database indicates prior "no difference" due to identical misrecognition for this word.

Bar graph shows expected decrease in audio review time when reviewing text with highlighted differences.  Speech editor corrects most or all errors for dictating speaker using this technique.  With highly accurate speech recognition, text returned to speaker after speech editor review generally will have about the same error rate as manual transcription.   As described above, when there are no misrecognitions and no differences for review, speech editor review time drops to virtually zero.  Recognition errors are also not highlighted when speech engines make the same mistake (identical misrecognition).

Bar graph reflects comparison of output from two different speech engines.  It is assumed, for purposes of this graph, that each speech engine misrecognizes different words from those misrecognized by the other speech engine.  Total audio review time with text compare depends upon how much audio the speech editor must review.

Consider the case where each speech engine misrecognizes the same word in a sentence or phrase as the other engine, but differently (e.g., user said "ball" and first engine transcribed "hall" and the second "wall").  Only audio for a single word (single audio tag) need be reviewed.  However, if the misrecognitions involve words in different locations in the sentence, then word audio for two different locations (two different audio tags) is involved.  This would increase audio review time of editor.  Because graph reflects assumption that errors occur in different parts of the sentence, estimate of decrease in audio review time may underestimate time savings.

Same logic applies to evaluating differences between three or more texts, as well as to nonspeech audio, text, or image pattern recognition, e.g., speaker identification, machine translation, computer-aided diagnosis (CAD) for medical purposes, and other pattern recognition.
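
The expected review-time figures above can be reproduced with a simple model.  This is a sketch under the assumptions stated in the text: each engine misrecognizes different words from the other engines, so every error shows up as a highlighted difference, and review time is proportional to the share of highlighted text.  Identical misrecognitions are ignored.  The function names are hypothetical.

```python
from typing import Sequence


def review_percent(accuracies_percent: Sequence[float]) -> float:
    """Share of audio to review: sum of engine error rates, capped at 100."""
    errors = sum(100 - a for a in accuracies_percent)
    return min(100, errors)


def reduction_percent(accuracies_percent: Sequence[float]) -> float:
    """Reduction in audio review time compared with reviewing all audio."""
    return 100 - review_percent(accuracies_percent)


print(reduction_percent([90, 90]))    # 80: two 90% accurate engines
print(reduction_percent([90, 100]))   # 90: one 90% engine, one near 100%
print(reduction_percent([100, 100]))  # 100: no differences to review
```

If the engines misrecognize the same word in different ways, the differences overlap and less audio is highlighted, which is why this estimate may understate the time savings.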

3. Helps dictating speakers who self-correct.  Speakers can use this process in "preview" mode to detect potential SR errors before undertaking their own review.  They can also use process in "review" mode to check that potential misrecognitions have been reviewed.  In addition, if other party has not reviewed reliability index for a word, speaker should run check on nondifference text.

4.  Compare text output using different SR technologies.  SR may use hidden Markov models with Gaussian mixtures, neural networks, adaptive, nonadaptive, speaker dependent, speaker independent, or other techniques. Synchronize text output "by phrase"  from speech engines using different conversion variables and compare relative recognition accuracy.

7.  Other audio, text, or image pattern recognition software may output text description of results. Company session file editor can display synchronized differences in results for review by human editor.   Software also supports text compare of manual processing of text results from delimited source data, e.g.,  two or more transcriptionists transcribing presegmented dictation audio into audio-linked transcribed session file. Similarly, speech editors and other editors dealing with pattern recognition can use reliability index to rule out history of prior identical misrecognition.

8.  Software can compare synchronized session files text results from virtually any source or processing method.  Synchronization requires same number of segments, not identical data content.  

9. Nontext results can also be displayed and synchronized using main document windows and/or annotation feature that can open files, websites, or run programs (e.g., media player).

10. Limitations of text compare:  Process does not identify every error.  Two or more speech engines sometimes misrecognize the same audio, as mentioned.  With an identical misrecognition, no difference is detected.  For example, speaker says "hat" and SR engines A and B both transcribe as "that".  However, text compare highlights nonidentical misrecognitions.  For example, speaker says "hat" and SR engine A transcribes "that" and SR engine B transcribes "flat". 
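
The "hat" example above can be expressed as a small classification.  The sketch is for illustration: in practice the software does not know what the speaker said, so the `spoken` argument exists only to label the outcome, and the function name is hypothetical.

```python
def classify(spoken: str, engine_a: str, engine_b: str) -> str:
    """Label what text compare shows for one word from two SR engines."""
    if engine_a != engine_b:
        return "difference (highlighted)"
    if engine_a == spoken:
        return "match, correct (not highlighted)"
    return "identical misrecognition (not highlighted)"


print(classify("hat", "that", "that"))  # both engines make the same error
print(classify("hat", "that", "flat"))  # engines make different errors
print(classify("hat", "hat", "hat"))    # both engines are correct
```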

Company experience is that identical or nonidentical misrecognitions occur less frequently with improving SR accuracy.   If goal using text comparison is to return text with the same or lower error rate compared to gold standard manual transcription (estimated about 5% or less), depending upon speaker average dictation accuracy, review by speech editor may not be needed in all cases.

11.  Matches as indicator of reliability.  Process may also use "nondifferences" mode to identify likely accurate SR training data, subject to problems with identical misrecognition.  Potentially, process may use nondifferences for unsupervised (nonhand labeled) SR speech user profile training. 
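A rough Python sketch of the "nondifferences" idea follows. The function name `nondifferences`, the audio file names, and the sample phrases are assumptions for illustration only. The sketch collects audio/text pairs where two engines agree, which are candidates for training data but may still contain identical misrecognitions.

```python
def nondifferences(audio_ids, output_a, output_b):
    """Return (audio_id, text) pairs where both engines produced the same phrase."""
    if not (len(audio_ids) == len(output_a) == len(output_b)):
        raise ValueError("inputs must have the same number of segments")
    return [
        (audio_id, a.strip())
        for audio_id, a, b in zip(audio_ids, output_a, output_b)
        if a.strip().lower() == b.strip().lower()
    ]


# Hypothetical synchronized output from two engines for three utterances.
audio_ids = ["utt001.wav", "utt002.wav", "utt003.wav"]
engine_a = ["patient is stable", "no acute distress", "follow up in two weeks"]
engine_b = ["patient is stable", "no cute distress", "follow up in two weeks"]

candidates = nondifferences(audio_ids, engine_a, engine_b)
print(candidates)  # the first and third utterances match; the second does not
```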

12. There are other uses for text compare in speech and language processing, including training, synchronized translation, and use of knowledge base, as described in video demos.

The video demos show text compare for SR, transcription, education, and other uses.

Demo #3 . . . Single-window dual-engine comparison using server-based SweetSpeech™ and Dragon NaturallySpeaking . . . User sequentially opens Dragon and Custom Speech USA™ session files, clicks compare documents toolbar button to highlight differences, plays differences using menu dropdown, makes changes, increases leading/trailing playback to listen to word "of," and copies/pastes "well-maintained" from Dragon to format text, and enters new lines. Operator saves final distribution report as .txt. Small cap "l" was transcribed by both engines and capitalized as "L" for report distribution. Since user did not create a separate verbatim annotation, the final text is automatically saved as the verbatim text using Instant Verbatim™ feature.  Flash WMP MP4  

Demo #4 . . . Double-window dual-engine comparison of Demo #3 session files . . . Operator selects toolbar window icon to horizontally display  audio-aligned text.  Operator specifically references option of play entire phrase (including difference) as opposed to playing difference only.  Flash  WMP MP4

Demo #5 . . . Double-window SpeechMax™ VerbatiMAX™ comparison of uncorrected speech recognition transcribed session file (TSF) with verbatim text .TXT . . . Any difference represents an error.  Text comparison reduces transcriptionist review time.  This supports batch, automated correction of transcribed session file (TSF) to verbatim transcribed session file for speech user profile training.  Flash WMP MP4 

Demo #6 . . . Training student medical transcriptionists . . . Video shows application of text compare in training and education.  In one approach, a medical transcription (MT) instructor can create a verbatim transcribed session file text key and compare to synchronized student output.  Text comparison identifies errors in spelling and punctuation and generates accuracy rate.  Synchronized audio makes training more efficient.  Operator need only click on text to hear associated audio.  Multilingual SpeechMax™ supports medical transcriptionist training in various languages. Color-coded Best Guess represents consolidated text compare.  This composite provides the instructor with a quick, visual estimate of whether other students made similar or different errors.   Software determines individual student accuracy levels automatically.  Flash WMP MP4
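The accuracy calculation in this kind of training exercise might be sketched as follows. The formula actually used by the software is not stated here, so this Python example assumes a simple measure: the share of words in the instructor's key that the student reproduced exactly, phrase by phrase. The function names and sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def matched_words(key_phrase, student_phrase):
    """Count key words reproduced exactly (spelling, case, and punctuation)."""
    key_words = key_phrase.split()
    student_words = student_phrase.split()
    matcher = SequenceMatcher(None, key_words, student_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched, len(key_words)


def accuracy_rate(key, student):
    """Accuracy of synchronized student output against the verbatim key."""
    if len(key) != len(student):
        raise ValueError("key and student output must have the same number of segments")
    matched = total = 0
    for key_phrase, student_phrase in zip(key, student):
        m, n = matched_words(key_phrase, student_phrase)
        matched += m
        total += n
    return matched / total if total else 1.0


key = ["The patient has hypertension.", "Follow up in two weeks."]
student = ["The patient has hypertention.", "Follow up in two weeks."]

print(f"{accuracy_rate(key, student):.1%}")  # one misspelled word out of nine
```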

Demo #7A . . . Teaching elementary school phonics . . . Video shows proof-of-concept prototype software for elementary school phonetics training ("phonics").  Local elementary school teacher requested video to demonstrate proof of concept to local school board.  The "r" phonics video shows how teacher uses synchronized text compare (by phrase).  Accuracy levels are determined automatically.  Video also shows how teacher can customize training with web-based resources.  Video displays how teacher has made web resources available using software text annotations linked to specific document text. Other options (not shown) exist for synchronizing text and reviewing the pronunciation of same text by different students.  Flash WMP MP4

Demo #21 . . . Synchronized English, French, and Spanish translation . . . Video shows operator tiling document windows horizontally and synchronizing French translation with English source.  Operator opens third document window for previously delimited Spanish translation.  Operator synchronizes Spanish translation with both English and French translations and synchronizes French and Spanish translations.  Operator selects each English segment sequentially and confirms synchronized highlighting of French and Spanish segments.

Delimiting output with punctuation common to different languages (e.g., periods, commas, colons, or semicolons) simplifies potential automation of the text compare process.  It would not be difficult to compare accuracy of each translation (not shown).  Operator can obtain identically delimited French and Spanish translations from different manual or automatic translation sources.  Operator may then phrase compare the synchronized French translations with one another, and do the same for the Spanish texts. 
Flash WMP MP4
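Delimiting by shared punctuation, as described for the translation demo, can be sketched in Python. This is an illustration under assumptions, not the product's delimiting logic: the `delimit` function, the regular expression, and the sample sentences are hypothetical.

```python
import re

# Split on whitespace that follows a period, comma, colon, or semicolon.
DELIMITER = re.compile(r"(?<=[.,;:])\s+")


def delimit(text):
    """Return the text as a list of punctuation-delimited segments."""
    return [segment for segment in DELIMITER.split(text.strip()) if segment]


english = delimit("Dear Sir, thank you for your letter. We accept the offer.")
french = delimit("Monsieur, merci pour votre lettre. Nous acceptons l'offre.")
spanish = delimit("Estimado señor, gracias por su carta. Aceptamos la oferta.")

# Synchronization requires the same number of segments in each language.
synchronized = len(english) == len(french) == len(spanish)
print(synchronized)

# Selecting an English segment identifies the aligned French and Spanish segments.
for en, fr, es in zip(english, french, spanish):
    print(en, "|", fr, "|", es)
```

Translations that reorder or merge clauses would produce different segment counts and would need manual adjustment before comparison.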

Demo #22 . . . Law firm knowledge base with color-coded frequency of previous word use . . . Video  shows use of SpeechMax™ software with knowledge base for creation of a nondisclosure agreement and efficient document assembly.  In example, user selects phrases from various "fill-in-the-blank" alternatives.  User text compares alternative form selections.  User can also use color-coded composite Best Session™ to visualize the variability in selection choice for each field.  Color coding shows agreement between source documents in knowledge base.  Redder indicates considerable difference.  Pink indicates minimal difference and clear no difference.   Knowledge base may be utilized by law firm, business, or other organization. Flash MP4
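The color coding described for the knowledge base could be approximated as follows. The thresholds and the measure of disagreement are assumptions made for this Python sketch. The text above states only that redder means considerable difference, pink minimal difference, and clear no difference.

```python
from collections import Counter


def field_color(selections, pink_max=0.34):
    """Map disagreement among source documents for one field to a color name.

    pink_max is a hypothetical threshold, not a documented product setting.
    """
    counts = Counter(selection.strip().lower() for selection in selections)
    most_common_count = counts.most_common(1)[0][1]
    disagreement = 1 - most_common_count / len(selections)
    if disagreement == 0:
        return "clear"
    return "pink" if disagreement <= pink_max else "red"


# Hypothetical "term of agreement" field drawn from three source documents.
print(field_color(["two years", "two years", "two years"]))     # clear
print(field_color(["two years", "two years", "three years"]))   # pink
print(field_color(["one year", "two years", "three years"]))    # red
```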

Value Proposition:  Text Compare for Other Audio, Text, or Image Pattern Recognition Output

SpeechMax session file text output may represent human or automated processing of delimited text, audio, or image data.  For example, the image below shows an audio wave.  If it is associated to delimited text description, it is a session file.  Synchronization requires same number of segments regardless of underlying source data.  If these segments represent text output, text compare can be applied. 

Other potential sources of pattern recognition data include:  text, e.g., optical character recognition, automatic document classification; audio, e.g., music, traffic noise and soundscapes, baby cries, and dog barks characterization; or image, e.g., computer-aided medical diagnosis, facial profiling, retinal or iris scanning, handwriting, or fingerprinting.

As with SR text compare, steps include convert to .SES format, synchronize data files, compare text output (if available), and identify differences and matches.  If output data is corrected, user may obtain training data for pattern recognition models.  Matches may be used for supervised ("hand-labeled") or unsupervised training.    

With image source data, text output results may be displayed in document window and text compared.  Image may be displayed in document window or using annotation feature to link to file, webpage, or video. 

For example, computer-aided diagnosis (CAD) for mammography can generate mammogram session file with suspicious areas marked for review and described by text.  These may be viewed by reference to a hyperlink.  Process may compare text output from > 2 CAD systems and/or human reviewers for each suspicious area.  Decision as to final report may be made by single radiologist in review.  Process can use data from surgical or biopsy specimen to improve CAD pattern recognition models.

Selective Modification of Session File Text and Audio

Session file audio may be modified globally or selectively.

Global modification may be used to create a new speech user profile based on a synthetic (text-to-speech) voice font or other method.

Session file editor has document window (upper) and annotation (comment) window (lower).  Operator may globally or selectively edit document window speech recognition session file text using text or audio annotations.

Value proposition:  Forms creation and completion with document and annotation window

Problem #1:  Structured dictation forms are common, but off-the-shelf SR software often requires extensive knowledge of SR application or SDK to create. How can secretaries, receptionists, transcriptionists, or data entry personnel quickly create forms for text, voice, or bar code entry?

Solution:  CSUSA software allows user to create document form fields for text or audio entry (blue) ("audio annotation") or text only (purple) ("text annotation").  User selects text or other document character (e.g., underlining) and clicks toolbar annotation control.  This creates colored highlighting for text or audio entry (blue highlighting) or text only (purple highlighting).

Software also supports form creation by creating a word token within a segment of an empty session file (see below).

Software also supports creation of an unlimited number of "child" annotations to a form field that enable the form creator to provide additional information to help the user to complete the form. This may represent text or audio information, as well as reference a webpage or playback of an instructional video.   These "child" annotations can also reflect additional information that is required for form completion. 

Text annotations may include comments or questions, but can also open a web page or open a program.  This is done by preceding the URL or path name with <EXEC>.  Audio annotations may include audio recorded with annotation window sound recorder or uploaded audio file. 
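How an editor might interpret the <EXEC> prefix can be sketched in Python. Only the prefix itself comes from the description above. The function name, the returned dictionary, the URL, and the file path are hypothetical, and the sketch classifies the annotation without opening anything.

```python
EXEC_TAG = "<EXEC>"


def parse_text_annotation(text):
    """Classify a text annotation as a plain comment or an <EXEC> target."""
    stripped = text.strip()
    if stripped.startswith(EXEC_TAG):
        target = stripped[len(EXEC_TAG):].strip()
        # Assumption: anything that is not an http(s) URL is treated as a path.
        is_url = target.lower().startswith(("http://", "https://"))
        return {"action": "open", "kind": "url" if is_url else "path", "target": target}
    return {"action": "display", "text": stripped}


print(parse_text_annotation("Enter last name first."))
print(parse_text_annotation("<EXEC>https://example.com/form-help"))
print(parse_text_annotation(r"<EXEC>C:\Training\intro.wmv"))
```

A real implementation would hand the target to the operating system's default handler. That step is omitted here so the example has no side effects.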

This enables a person completing the form to view instructional text, a website, or video.  User can also play back audio for instructions or testing (e.g., for language or hearing examination).  

To complete form, user enters text into document (upper) or annotation (lower) window. User can enter annotation text with keyboard, SR, or barcode.  User can record annotation audio with sound recorder, upload audio file, or create audio with text to speech (TTS).  Software transpose (swap) or move (replace) feature uses copy and paste to insert annotation text or audio, or both, into document window. 

First screen shot shows employee information form.  Last name "Streeter" is entered into annotation text box.  It may be transposed or moved into first blank of the "Full Name" field (last name first).  Depending on the option selected, the document text (in this case underlining) is automatically transposed to annotation text field or replaced by copied text.  Second screen shot shows completed form.
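The difference between transpose (swap) and move (replace) can be shown with a small Python sketch. The `FormField` class and function names are hypothetical. The sketch also assumes that, after a move, the annotation keeps its text, because the feature is described as using copy and paste.

```python
from dataclasses import dataclass


@dataclass
class FormField:
    document_text: str          # text shown in the document (upper) window
    annotation_text: str = ""   # text entered in the annotation (lower) window


def transpose(field):
    """Swap document and annotation text."""
    field.document_text, field.annotation_text = field.annotation_text, field.document_text


def move(field):
    """Replace document text with a copy of the annotation text."""
    field.document_text = field.annotation_text


swapped = FormField(document_text="________", annotation_text="Streeter")
transpose(swapped)
print(swapped)   # underlining is now in the annotation text field

replaced = FormField(document_text="________", annotation_text="Streeter")
move(replaced)
print(replaced)  # underlining is gone; "Streeter" appears in both places
```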

Text tokens are commonly used to support insertion of dynamic text, such as the current date or time, into various text fields.  Office personnel may also create form using text token inside an empty session file segment. 

This segment is demarcated by vertical placeholders.  User may enter text into the segment representing form field name and color highlighting to indicate data entry site.  Text token may include one or more words or continuous, adjacent complex text for a URL.  Vertical placeholders may be hidden in the final text.  Person completing the form navigates to segments with tab key.  See Demo #11 below.


1.  Convenient, easy tool for office staff to create forms.

2.  Flexible data entry.  System supports text entry by keyboard, SR, or barcode.  Audio entry by record or upload of audio file. 

3.  AnnotationTrain™.  Use audio text annotation pairs to train speech user profile.

4.  Microsoft Office Toolbar Add-In.  This Add-In supports migration of phrase and annotation data to/from session file editor and Microsoft Word, for example.

The following demos show different uses of the annotation (comment) functionality.

Demo #8 . . . Audio annotation . . .  Video shows opening of MT Desk website for medical transcriptionist in training who is learning about prostate cancer treatment. This functionality can be used to launch video player or any other program. Flash WMP MP4

Demo #9 . . . Audio annotation . . .  Video shows use of spell check supplemented by audio annotation pronunciation of medical term for student medical transcriptionist. Flash WMP MP4

Demo #10 . . . Audio annotation . . . Video shows use of Employee information form and data migration to Microsoft Word.  Audio annotation (blue highlighting) supports text and/or audio entry.  Text annotation (purple highlighting) supports text only entry.   Flash WMP

Demo #11 . . . Audio/text annotation . . .  Video shows form creation.  Form creation involves entry of field name and creation of audio annotation token within otherwise empty session file segment.  Flash WMP

Demo #12 . . . Audio/text annotation . . .  Video deals with The Talking Form™ and audio prompt creation for form user. Flash WMP MP4

Demo #13A  . . . Real-time dictation using Dragon speech recognition into Microsoft Word and correction in SpeechMax™ . . .  Video shows SpeechMax™ Microsoft Office Toolbar Add-In with data transfer to/from Microsoft Word and SpeechMax™.  Using the Add-In, a speaker can dictate into Microsoft Word with Dragon speech recognition, transfer text and audio data to SpeechMax™ for transcription, and migrate data back to Microsoft Word.  User can also enter data into SpeechMax™ form and migrate to Microsoft Word.  Software can convert Dragon session file or other compatible file into .SES for session file training.  Alternatively, user could dictate audio file for later presegmentation and manual transcription in SpeechMax™, as well as dictate audio or use SR to enter text into form fields in SpeechMax™.  CSUSA software can use data to create .SES session file for speech user profile training. WMP MP4 

Demo #13B . . . Update Continuity of Care Record (CCR) with manual transcription in SpeechMax™ form fields  . . .  Proof of concept video illustrates SpeechMax™ Microsoft Office Toolbar Add-In with transfer of XML data between Continuity of Care Record (CCR), Microsoft Word, and SpeechMax™ form created with annotation (comment) features.  Add-In supports download or upload of XML data to/from Microsoft Word and CCR.   Video demonstrates download of data into Word using Add-In and data modification based upon written or dictated information.  If dictation, transcriptionist may play back dictated audio using SpeechMax™, PlayBax™, or other software and transcribe into Word.  Alternatively, user may dictate into Microsoft Office using speech recognition.  Data may be transferred to SpeechMax™ and modified using dictation and transcription, speech recognition, keyboard, or bar code.  Modified data may be transferred to Word and uploaded directly into CCR using the Add-In.  Alternatively, operator may modify in Word before upload. Software can convert Dragon session or other compatible file into .SES for session file training. WMP MP4  

Demo #22 . . . Audio/text annotation . . .  Video shows use of multilevel annotations for multilevel knowledge base for creation of a nondisclosure agreement and efficient document assembly with SpeechMax™.  In example, user selects phrases from various "fill-in-the-blank" alternatives.  User text compares alternative form selections.  User can also use color-coded composite Best Session™ to visualize the variability in selection choice for each field. Color coding shows agreement between source documents in knowledge base. Redder indicates considerable difference.  Pink indicates minimal difference and clear no difference.   Knowledge base may be utilized by law firm, business, or other organization. Flash MP4

Value proposition:  Correction of a first speaker's text by a second speaker

Problem #2:  How does an SR editor or other second speaker, using Dragon or other off-the-shelf desktop SR software, voice correct a first speaker's session file text while the first speaker's user profile is open in the buffered read/write SR window, and save the changes without corrupting the first speaker's speech user profile?

Solution:  Dragon and other desktop users can use a SpeechMax Dragon plugin and dictate changes as a text annotation, correct the original document, and use the audio and text data to train both the correcting speaker's and initial speaker's voice profile.  Supports SR editor voice correction of text, as well as creation and voice correction of multispeaker collaborative document with no speech profile corruption.

  • Annotation sound recorder records SR audio
  • Text entered into text annotation window and moved into document
  • Document window (top), annotation window (bottom) (see below)
  • Speaker A SR text in document window, can playback audio
  • Speaker B creates text annotation using SR for text in document window
  • Annotation window creates text and saves audio as audio file
  • Speaker A text corrected with new text and speaker A audio
  • Speaker B speech user profile trained with new text and new audio

Example of voice correction with SR dictation into annotation window

  • Annotation recorder records 2nd speaker audio
  • Text entered into text annotation window
  • Document window (top), annotation window (bottom) (see below)
  • Supports annotation correction of Speaker A SR text
  • Speaker A had dictated "Alan Smith" but was recognized as "Adam Smith"
  • Speaker A profile open in buffered document window
  • Speaker B correctionist plays back 1st speaker dictation
  • Speaker B corrects using SR  ("Alan Smith") ( see screen shot below)

Document Window (top) with vertical purple utterance markers separating text and Adam Smith text highlighted in blue, Annotation Window (bottom) with sound recorder, annotation text window with Alan Smith text, SPEAKER B highlighted under Annotation Name column as party making annotation


1.  Train both speech user profiles.  After second speaker correction of first speaker SR text, the system can train both speech user profiles using the verbatim text "Alan Smith" and the respective audio dictated by first speaker and second speaker.  There is no corruption of either speech user profile. 

2.  Supports sequential voice creation/correction of multispeaker documents on a local area network or document emailed as attachment to other party.
  Unlimited number of annotations (comments) per dictated text are supported.  These comments may represent voice correction. 

Process may use similar document/annotation concept for web-based, real-time multispeaker collaborative documents.  Utilize first window to view text/playback audio.  Use second window to dictate SR correction or change.  Copy/paste text from second window into first.  Use corrected text to train Speaker A profile.  Use Speaker B audio and text dictated into second window to train/update Speaker B profile.
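The way a single correction can yield training data for two profiles, without mixing one speaker's audio into the other's profile, might be modeled as below. This Python sketch is illustrative only: the `TrainingPair` type, the function name, and the audio file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPair:
    profile: str      # speech user profile that receives this pair
    audio_file: str   # audio spoken by that same speaker
    text: str         # verbatim text


def pairs_after_correction(verbatim_text, speaker_a_audio, speaker_b_audio):
    """Pair the corrected text with each speaker's own audio only."""
    return [
        TrainingPair("Speaker A", speaker_a_audio, verbatim_text),
        TrainingPair("Speaker B", speaker_b_audio, verbatim_text),
    ]


# Speaker A dictated "Alan Smith" (misrecognized as "Adam Smith");
# Speaker B dictated the correction into the annotation window.
for pair in pairs_after_correction("Alan Smith", "a_utt014.wav", "b_annotation_001.wav"):
    print(pair)
```

Because each profile only ever sees audio from its own speaker, neither profile is corrupted by the other speaker's voice.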

Value Proposition:  Selective correction or change of session file audio       

Problem #3:  Is there a convenient way to selectively modify audio tags to improve or change quality or the type of audio tagged to the text?

Solution:  With CSUSA system, user can also selectively modify the audio tag of audio-linked SR with no change in document text. Process supports selective replacement of some document audio or replacement of all document audio tags.  This relies upon copy/paste functionality.  Process requires creating and summating time stamp offsets of the session file to maintain alignment unless original audio is same length as the replacement audio. 
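The time stamp bookkeeping mentioned above can be sketched in Python. The session file's actual time stamp format is not described here, so the `Segment` type, the millisecond fields, and the function name are assumptions. The sketch shows why offsets must be summed: each replacement of a different length shifts every later segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    start_ms: int
    end_ms: int


def replace_audio(segments, new_durations):
    """Return re-aligned segments after replacing audio for some of them.

    new_durations maps a segment index to the length (ms) of its replacement audio.
    """
    realigned = []
    offset = 0  # running sum of length differences from earlier replacements
    for index, segment in enumerate(segments):
        old_duration = segment.end_ms - segment.start_ms
        new_duration = new_durations.get(index, old_duration)
        start = segment.start_ms + offset
        realigned.append(Segment(segment.text, start, start + new_duration))
        offset += new_duration - old_duration
    return realigned


original = [
    Segment("Please", 0, 1000),
    Segment("sign", 1000, 1800),
    Segment("here", 1800, 2500),
]

# Replace the 800 ms audio for "sign" with a 1200 ms recording.
for segment in replace_audio(original, {1: 1200}):
    print(segment)
```

If the replacement audio has the same length as the original, the offset stays at zero and the later time stamps are unchanged. The document text is untouched in either case.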


1.  Create form or other text with SR, replace audio tag representing original SR dictation with audio content.  User completing form selects form field text and hears relevant audio information.  Potential benefit where necessary to hear audio, e.g., testing hearing or recognition of speech or music.   

2. Improve audio quality of SR audio-tag when intended for playback.  eReader audio book may be read by professional voice talent and transcribed with SR.  Voice talent may also dictate audio which is presegmented and transcribed manually.  Some audio may be poor quality and require redo by voice talent. 

Similar need exists where business or office supports call-in listening to audio from dictated reports.  Process may selectively replace ("splice in") phrase audio for one or more segments or replace audio for individually-tagged words. 

Process can use annotation text and audio pairs to train speech user profile. 

3.  Associate foreign language audio tag to source text dictated by SR.  This creates a learning tool for foreign language.  For example, user clicks on English in document window and hears foreign translation as audio tag.  Similar techniques can be created for other language instruction.

4. Substitute text to speech synthetic speech for human-dictated SR audio tag.  Use text to speech audio and text to create acoustic model.  Language model could represent real or artificial language.  This is one approach to creating speech user profile for robots that are expected to respond to the conversational speech or voice commands of other robots.  Optionally, mechanically generated speech may reflect changing playback parameters for human speech using voice enhancement.  

5. Substitute musical recording for SR text audio tag.  Style sheets can be customized.  Create singalong, karaoke, or other entertainment.  Apply techniques also to hymns, folksongs, or other traditional music. 

Demo #19 . . . Singalong/karaoke:  Stairway to Heaven (Led Zeppelin) . . .  Demo shows playback of audio (song) and highlighted text and slider bar in SpeechMax™.  Flash WMP MP4

"Stairway to Heaven," produced by Jimmy Page, executive producer Peter Grant, © 1971 Atlantic Recording Corporation for the United States and WEA International Inc. for the world outside of the United States. 

6.  Create speech user profile with synthetic speech. Use text to speech to create audio file for each SR audio-linked text segment and upload as audio annotation.  Switch (transpose) document/annotation audio, thereby creating audio-tagged text with synthetic speech.  Use to create speech user profile with an acoustic model and lexicon reflecting a synthetic voice, and a language model reflecting human speech.  This would support development of a robot using SR that follows voice commands from the synthetic, audible speech of a talking robot.    

Protect Confidential Session Data with ScrambledSpeech™

Value Proposition:  Speech or text redaction limits access to confidential information such as name or address, division limits amount of data transferred to any individual outside processing node, and scramble disorders content making it more difficult to understand. 

Transcription or SR editing is often processed out-of-office at remote sites.  Federal and state government, businesses, and professional offices would often like to outsource data processing to cheaper centers, but in ways that limit disclosure of confidential information.  The following is designed for  security requirements as might apply to routine business or professional dictation. 

  • Limit disclosure of content during manual transcription or SR editing
  • Redact word or phrase SR audio and text  (SpeechCensor™)
  • Divide session file segments into > 2 groups
  • Send divisions to different processing nodes
  • Option to scramble (sort) segments within division (ScrambledSpeech™)
  • Maintains intelligible unit (utterance) for manual transcription or editing SR
  • Utterance = phrase or short sentence audio
  • Balance obscuring content vs. needs for efficient transcription

One division of the scrambled session file is seen in the window below.


1.  Limit knowledge of whole for minimal security level protection.  Solution is not designed for data requiring high level of security.
2.  Maintains intelligible unit (utterance) for manual transcription or SR correction. Easier to understand and transcribe or correct content compared to scramble at word level.  While data is obscured by scramble, utterance phrase or sentence should be intelligible to manual transcriptionist or SR speech editor for rapid transcription or editing.

3.  Simple process to reverse after edited document is returned.  Reverse redact, divide, and scramble after document return (see menu above).

4.  Potential application to TV captioning or court reporting by remote court reporter.  User may segment and distribute single segment to different captionist or court reporter.  Process may also stream files.  Input is untranscribed session file segment (audio utterance), and transcription results in transcribed session file with audio-linked text.   
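The divide, scramble, merge, and unscramble steps can be sketched as a reversible Python routine. This is a minimal illustration under assumptions, not the ScrambledSpeech™ implementation: the function names, the round-robin division, and the key structure are hypothetical. The key is kept by the sender so that processing nodes see only their own utterances, in scrambled order.

```python
import random


def divide_and_scramble(segments, n_divisions, seed=None):
    """Scramble utterance order and deal the utterances into divisions.

    Returns the divisions plus a key mapping (division, position) to the
    original segment index. The key stays with the sender.
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)
    divisions = [[] for _ in range(n_divisions)]
    key = {}
    for n, original_index in enumerate(order):
        d = n % n_divisions
        key[(d, len(divisions[d]))] = original_index
        divisions[d].append(segments[original_index])
    return divisions, key


def merge_and_unscramble(divisions, key):
    """Restore original order from returned divisions using the sender's key."""
    restored = [None] * len(key)
    for (d, position), original_index in key.items():
        restored[original_index] = divisions[d][position]
    return restored


# Hypothetical untranscribed utterances, identified here by audio file name.
utterances = ["utt1.wav", "utt2.wav", "utt3.wav", "utt4.wav", "utt5.wav"]
divisions, key = divide_and_scramble(utterances, n_divisions=2, seed=7)

# Each node transcribes its own division; order within a division is preserved.
transcribed = [[name.replace(".wav", " text") for name in d] for d in divisions]
print(merge_and_unscramble(transcribed, key))
```

Scrambling at the utterance level keeps each phrase intelligible to the transcriptionist or editor, while no single node receives the whole document in order.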

Demo #23 . . . Divide, scramble, merge, and unscramble . . . This video shows divide/scramble untranscribed audio session file and merge/unscramble transcribed session file. Flash WMP MP4

Demo #24 . . . Selective redaction . . . This video shows transcribed session file in SpeechMax™ and selective redaction of patient name from transcribed session file (TSF). Flash WMP MP4

Demo #25 . . . Selective redaction and export of redacted audio  . . . Video shows transcribed session file in SpeechMax™, selective redaction of patient name from transcribed session file (TSF), playback of exported audio in PlayBax™ transcription software controlled with foot pedal, and transcription of redacted audio file in Word. Flash WMP MP4

Price, terms, specifications, and availability are subject to change without notice. Custom Speech USA, Inc. trademarks are indicated.   Other marks are the property of their respective owners. Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.)