More than a word processor™

For dictation, transcription, speech recognition and other speech and language processing

Supports analysis, conversion, and processing of text, audio, and image data for other pattern recognition applications

Reduce search time for pattern recognition errors with advanced text compare

U.S. patents and patents pending 

Full Edition SpeechMax™ supports audio recording for dictation, audio playback with transcriptionist foot pedal or hotkeys, real-time speech recognition, desktop autotranscription of audio files with speech recognition, editing of server-based speech recognition, multilingual support, multiwindow document display, and text and audio annotation (comments).

Session file content represents the sum total of SpeechMax™ document and annotation (comment) data from audio, text, or image data processing.

Software provides HTML display and XML storage of proprietary session file content.

Reader Edition displays session file text, audio, and image content.  Document license-embedded SessionLock™ is designed to prevent unauthorized editing.

Consolidated text compare with color

The SpeechMax™ Microsoft Office Toolbar Add-In runs in Microsoft Word.  Add-In has "Next/Previous" toolbar arrows for user navigation to Word bookmarks.  Add-In also has Import/Export toolbar functions that support annotation and phrase migration wizards.  These wizards support data migration between the SpeechMax™ session file and Word.

SpeechMax™ is available with SpeechProfessional™ combo package. 

Use About to obtain more information about software version compatibility with different SR engines and operating systems.

SpeechMax™ General Functions

  • Document window supports read/write text, audio, or image
  • Read/write includes .TXT, .RTF, .HTML, or .SES (proprietary session file)
  • Enter text by keyboard, barcode, or speech recognition
  • Record speaker audio with integrated annotation window sound recorder
  • Continuous or segmental audio playback
  • Use keyboard hot keys or foot control to start/stop playback
  • Real time dictation into speech recognition
  • Copy/paste graphics into document window
  • Standard text compare and text compare by phrase 
  • Select text, playback audio with audio-tagged document text
  • Single, double, or multiple window viewing
  • Tile documents horizontally, vertically, or cascade
  • Create text and/or audio annotation (comment) to selected document text
  • Unlimited annotations (multilevel data) to selected text
  • Unlimited users (multiuser collaborative) may annotate same text
  • Text annotation may include comment or question
  • Text annotation may include hyperlink, command line
  • Command line may launch program, e.g., open media player or web page
  • Use annotation window sound recorder to record audio comment  
  • Load audio file or use text to speech to create audio annotation
  • Switch (transpose) document/annotation content
  • Selectively move annotation text into document to replace session file text
  • Selectively move annotation audio into document to replace audio
  • Divide session file segments into two or more groups
  • Sort (scramble)/unsort (unscramble) group session file content
  • Session lock converts document/annotation content to read only
  • Data migration document phrase or annotation
  • Application programming interface (API)
  • Plugins (add-ins) for third-party software
  • Web services and file management available through workflow manager
  • Web-based Help files
  • Text compare across document (like word processor)
  • Represents text compare of text strings with no reference to source data
  • Text compare by phrase (synchronized text compare)
  • Used to compare SR audio-aligned text
  • Use different SR SDKs to convert to CSUSA proprietary .SES format
  • Create equal number of segments (phrases) in .SES session files
  • Each phrase (utterance) arises from same audio
  • Each respective phrase has same start/duration time
  • Once synchronized, text in segments (phrase) compared
  • Differences highlighted, matches clear
  • Differences indicate higher error risk
  • Matches indicate greater reliability
  • Any type of text can be phrase compared if synchronized
  • With equal numbers of segments, session files are "synchronized"
  • Phrase compare supported even if underlying source data different
  • For example, phrase compare synchronized SR text and translation text
  • DataInSync = bounded synchronized data input for pattern recognition
  • Input data may include text, audio, image, volumes, and spaces
  • Vertical placeholders delimit bounded input and output data
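The phrase compare concept in the list above can be sketched as follows (illustrative Python only; .SES is a proprietary format, so the segment fields start/duration/text are hypothetical stand-ins):

```python
# Illustrative sketch only: .SES is proprietary, so segments are modeled
# here as plain dicts with hypothetical 'start'/'duration'/'text' keys.

def phrase_compare(session_a, session_b):
    """Compare two synchronized session files phrase by phrase.

    Returns (index, text_a, text_b) tuples for segments whose text differs.
    Raises if the files are not synchronized (unequal counts or timings).
    """
    if len(session_a) != len(session_b):
        raise ValueError("not synchronized: segment counts differ")
    differences = []
    for i, (a, b) in enumerate(zip(session_a, session_b)):
        if (a["start"], a["duration"]) != (b["start"], b["duration"]):
            raise ValueError(f"not synchronized: timing mismatch at segment {i}")
        if a["text"].strip().lower() != b["text"].strip().lower():
            differences.append((i, a["text"], b["text"]))
    return differences

engine_a = [{"start": 0.0, "duration": 1.2, "text": "I pledge allegiance"},
            {"start": 1.2, "duration": 0.9, "text": "to the flag"}]
engine_b = [{"start": 0.0, "duration": 1.2, "text": "I pledge allegiance"},
            {"start": 1.2, "duration": 0.9, "text": "to the flack"}]
print(phrase_compare(engine_a, engine_b))  # → [(1, 'to the flag', 'to the flack')]
```

Differences (highlighted in the product) flag higher error risk; an empty result means every synchronized phrase matched.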

Document Window

Screen shot shows toolbars and main document window.  Purple vertical markers represent placeholders delimiting phrases in the Pledge of Allegiance. 

Main/Annotation Window

Screen shot shows main document window and annotation window.  Annotation window includes sound recorder and text box.  Main window has a form created using annotations.  Audio annotation (blue highlighting) supports text and/or audio entry.  Text annotation (purple highlighting, not shown) supports text-only entry.  Text "Streeter" in annotation window represents a last name.  It may be moved to the first blank after "Full Name" in the document window form.

Multiwindow Tiled Display

Screen shot shows tiled horizontal display of 3 session files representing original English text and 2 translations.   Vertical placeholders are noted delimiting text in the material in 3 document windows.   The translations have the same number of segments delimited by the same punctuation.  The last synchronized segment is highlighted in all three windows. Software supports horizontal, vertical, and cascade tiling.

Text Compare By Document/By Phrase

Software supports text comparison across the document, as with word processor text compare.  Comparison looks at text strings only, without regard to source data.
Text comparison by phrase synchronizes the documents and compares text aligned to the same audio.

1.  For SR text compare, software converts Dragon (Nuance), Microsoft Windows, IBM, or other SR session files to CSUSA session file format (.SES). 

2.  Operator selects text compare across document or by phrase (utterance). 

3.  SR session files are typically "asynchronized," with different start/duration time stamps for each respective audio-aligned segment.  If operator selects the by phrase option, the DataInSync™ feature resegments and retags the audio for each session file.

4.  This results in an identical number of session file segments for each SR engine, and respective segments have the same start/duration times.

5.  These synchronized files are text compared segment by segment (utterance by utterance). 

6.  Speech editor highlights differences by text compare, representing potential SR errors.  Editor tabs to differences, listens to audio, and corrects SR errors as required.

7.  Editor also corrects obvious speaker errors, e.g., speaker said "right" when "left" was meant.  Speech editor may insert comments or questions about text and send to speaker as annotation (comment).

8.  Editor saves audio-linked verbatim text (what speaker said) as training session file (.TRS).  There is minimal audio to review with highly accurate SR systems since there are few differences. 
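Steps 3 and 4 (DataInSync™ resegmentation) might be sketched as below.  This is a hypothetical illustration that assumes word-level time tags are available from each engine, which real SR SDKs may not expose:

```python
# Hypothetical sketch of DataInSync™-style resegmentation. Re-cutting both
# session files at the union of their utterance boundaries yields an equal
# number of segments with identical start/duration times.

def resegment(words, cuts):
    """Bucket (time, word) pairs into segments delimited by the cut points."""
    segments = []
    for lo, hi in zip(cuts, cuts[1:]):
        text = " ".join(w for t, w in words if lo <= t < hi)
        segments.append({"start": lo, "duration": round(hi - lo, 3), "text": text})
    return segments

def synchronize(words_a, cuts_a, words_b, cuts_b):
    """Re-cut both session files at the merged set of utterance boundaries."""
    common = sorted(set(cuts_a) | set(cuts_b))
    return resegment(words_a, common), resegment(words_b, common)

# Engine A cut the audio at 2.0 s, engine B at 1.0 s; the union gives both
# files three segments with the same timings.
words_a = [(0.1, "hello"), (1.5, "there"), (2.5, "world")]
words_b = [(0.2, "hello"), (1.4, "their"), (2.6, "world")]
sync_a, sync_b = synchronize(words_a, [0.0, 2.0, 4.0], words_b, [0.0, 1.0, 4.0])
```

After this step both files are "synchronized" and can be text compared segment by segment, as in step 5.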

Synchronization = Session File with Same Number of Segments 

Software can phrase compare texts from any source or processing method as long as they have the same number of segments.  The delimiters vary depending upon the type of file.  For dictation, manual transcription, and speech recognition, the delimiters are generally utterance boundaries.  For translation, it can be punctuation such as period, comma, colon, or semicolon.  
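For translation, the delimiter-based synchronization described above can be sketched as (illustrative Python; the punctuation set is an example, not the product's actual rule set):

```python
import re

# Sketch of delimiter-based synchronization for translation: split source and
# translation at punctuation common to both languages so the files end up
# with an equal number of segments and can be phrase compared.

def segment_by_punctuation(text):
    """Split at periods, commas, colons, and semicolons; discard empty pieces."""
    return [piece.strip() for piece in re.split(r"[.,;:]", text) if piece.strip()]

english = "I came, I saw, I conquered."
latin = "Veni, vidi, vici."
en_segments = segment_by_punctuation(english)
la_segments = segment_by_punctuation(latin)
# Equal segment counts mean the two texts are "synchronized" for phrase compare.
print(len(en_segments) == len(la_segments))  # → True
```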

Text Compare Display Options

Process may perform text compare in the multiwindow session file editor between:

1.  Two or more viewable buffered read/write windows
2.   Viewable session and "hidden" session, display differences as popup
3.  Different texts displayed as popup dialog when cursor passed over text
4.  Consolidated text compare

1.  Two or more viewable buffered read/write windows

Using "dual-engine" comparison with two visible buffered document windows, speech editor can text compare with both sets of text displayed.  Operator selects phrase compare or text compare (across document) by clicking the "compare phrases" or "compare document" button.

a.  Phrase Compare.  The first example represents phrase compare where segments are synchronized and equal in number.  Overall accuracy is about the same in both texts.  Dropdown menu indicates the difference in the other (lower) panel.  The window reflects an old (SpeechMax+) version of the software. 

b.  Text Compare Across Document.

In the second example (newer version of software), user has selected conventional text compare across the entire document.  Top window text has about 90% accuracy, bottom window has just under 100% accuracy.  Dropdown menu selection is play difference (audio) for the first highlighted difference.  

In both sessions in this dual-window display, relatively few differences suggest both are highly accurate.

Transcriptionist should only have to listen to about 10% of speech recognition audio with these highly accurate texts.  Reduction in transcriptionist review time is about 90%. Example is consistent with conclusion that increasing accuracy will reduce text differences, and that differences will approach zero as accuracy nears 100%. 

2.  Viewable session and "hidden" session, display differences as popup

Demo #3 . . . Single-window dual-engine comparison using server-based SweetSpeech™ and Dragon NaturallySpeaking . . . User sequentially opens Dragon and Custom Speech USA™ session files, clicks compare documents toolbar button to highlight differences, plays differences using menu dropdown, makes changes, increases leading/trailing playback to listen to word "of," copies/pastes "well-maintained" from Dragon to format text, and enters new lines. Operator saves final distribution report as .txt. Small cap "l" was transcribed by both engines and capitalized as "L" for report distribution. Since user did not create a separate verbatim annotation, the final text is automatically saved as the verbatim text using Instant Verbatim™ feature.

3.  Different texts displayed as popup dialog when cursor passed over text (not shown)

4.  Consolidated text compare, e.g., Text A vs. Text B, Text B vs. Text C, Text A vs. Text C ==> color coding. 

SR editor can also use consolidated text compare using 3 or more SR systems in Best Guess mode. 

Color code to indicate degree of differences, for example, clear = no difference, pink = difference between 2, or red = all 3 differ.
With a large number of sessions, a high number of matches indicates increased likelihood of reliability; text is virtually all clear.  A large number of differences indicates increased likelihood of error; text is mostly highlighted.

In the color-coded Best Guess example above, consolidated text compare highlights about 55% of text (representing differences), leaving about 45% nonhighlighted (nondifferences).  There is a potential reduction in audio review time of 45%.
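The color-coding logic described above might look like this in outline (illustrative Python; segment lists stand in for synchronized session files):

```python
# Sketch of consolidated three-way text compare with the color scale described
# above: clear = all agree, pink = two of three agree, red = all three differ.

def color_code(a, b, c):
    distinct = len({a.strip().lower(), b.strip().lower(), c.strip().lower()})
    return {1: "clear", 2: "pink", 3: "red"}[distinct]

def consolidated_compare(engine_a, engine_b, engine_c):
    """Color code each synchronized segment across three engines."""
    return [color_code(a, b, c) for a, b, c in zip(engine_a, engine_b, engine_c)]

codes = consolidated_compare(
    ["to the flag", "of the united states"],
    ["to the flag", "of the united state"],
    ["to the flack", "of the united stakes"])
print(codes)  # → ['pink', 'red']
```

A mostly "clear" result suggests reliable text; runs of "pink" and "red" direct the editor to the riskiest audio.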

Value Proposition: Focus on "differences" audio and reduce audio review. 

A major SR company advertises up to 98% accuracy.  If editor reviews 10 hours of audio with this accuracy, 9 hours, 48 minutes represents correctly recognized speech, and 12 minutes represents errors. At 90% SR accuracy, editor reviews 10 hours of audio to find 1 hour of misrecognitions. With increasing accuracy of SR and other pattern recognition, an emerging issue is efficiently finding the increasingly few pattern recognition errors.
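The review-time arithmetic above can be expressed as a small helper (assuming, as the example does, that error audio scales linearly with the misrecognition rate):

```python
# The arithmetic above as a small helper, assuming error audio is
# proportional to the misrecognition rate.

def error_audio_minutes(total_hours, accuracy):
    """Minutes of audio expected to contain misrecognized speech."""
    return round(total_hours * 60 * (1 - accuracy), 2)

print(error_audio_minutes(10, 0.98))  # → 12.0 (12 minutes of errors in 10 hours)
print(error_audio_minutes(10, 0.90))  # → 60.0 (1 hour of errors in 10 hours)
```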


  • With 90% SR accuracy, at least 80% decreased review time expected

  • User tabs to highlighted differences, skips matches

  • Selects text, plays back audio, and corrects as needed

  • User may open both documents to view simultaneously if needed

  • Expected final transcription accuracy about same as gold standard MT

  • Decreased audio review time as accuracy increases

  • Significant productivity gains with highly accurate systems

  • Zero audio review time if both SR systems 100% accurate


1.  Large SR transcription volumes suggest large potential savings in speech editor review time with very accurate SR. 

2.  Helps dictating speakers who self-correct.  Speakers can use this process in "preview" mode to detect potential SR errors before undertaking their own review.  They can also use process in "review" mode to determine that potential misrecognitions have been reviewed.    

3.  Create speech user profiles with unsupervised training.  Matches indicate likely accurate text.  Use for SR speech user profile training without manual audio review.  Text compare the results of a computer-generated speech user profile against a manually created standard or another computer-generated speaker profile.

4.  Compare text output using different SR technologies.  SR may use hidden Markov models with Gaussian mixtures, neural networks, adaptive, nonadaptive, speaker-dependent, speaker-independent, or other techniques.  Synchronize text output "by phrase" from speech engines using different conversion variables and compare relative recognition accuracy.

5.  There are other uses for text compare in speech and language processing, including training, synchronized translation, and use of knowledge base, as described in video demos. 

Demo #6 . . . Training student medical transcriptionists . . . Video shows potential application of text compare of student interpretation, conversion, or analysis of bounded input data in training and education.  In one approach, a medical transcription (MT) instructor can create a verbatim transcribed session file text key and compare it to synchronized student output.  Text comparison identifies errors in spelling and punctuation and generates an accuracy rate.  Synchronized audio makes training more efficient.  Operator need only click on text to hear associated audio.  Multilingual SpeechMax™ supports medical transcriptionist training in various languages.  Color-coded Best Guess represents consolidated text compare.  This composite provides the instructor with a quick, visual estimate of whether other students made similar or different errors.  Software determines individual student accuracy levels automatically.

Demo #7A . . . Teaching elementary school phonics . . . Video shows proof-of-concept prototype software for elementary school phonetics training ("phonics").  Local elementary school teacher requested video to demonstrate proof of concept to local school board.  The "r" phonics video shows how teacher uses synchronized text compare (by phrase).  Accuracy levels are determined automatically.  Video also shows how teacher can customize training with web-based resources, made available as software text annotations attached to specific document text.

If the same panel of phonetic sounds is used for multiple students, teacher can potentially (not shown) synchronize phrases.  To do so, operator can load each session file separately into the multiwindow session file editor.  Once synchronized, teacher could sequentially listen to each child's pronunciation.  In this way, teacher can compare varied pronunciations for each of the phonics phrases.

Demo #21 . . . Synchronized English, French, and Spanish translation . . . Video shows operator tiling document windows horizontally and synchronizing French translation with English source.  Operator opens third document window for previously delimited Spanish translation.  Operator synchronizes Spanish translation with both English and French translations and synchronizes French and Spanish translations.  Operator selects each English segment sequentially and confirms synchronized highlighting of French and Spanish segments.

Delimiting output with punctuation common to different languages (e.g., periods, commas, colons, or semicolons) simplifies potential automation of the text compare process.  To compare accuracy of each translation (not shown), operator can obtain identically delimited French and Spanish translations from different manual or automatic translation sources.  Operator may then phrase compare the synchronized French translations, and do the same for the Spanish texts.  Initial translations are shown in the screen shot below.

Demo #22 . . . Law firm knowledge base with color-coded frequency of previous word use . . . Video shows use of SpeechMax™ software with knowledge base for creation of a nondisclosure agreement and efficient document assembly.  In example, user selects phrases from various "fill-in-the-blank" alternatives.  User text compares alternative form selections.  User can also use color-coded composite Best Session™ to visualize the variability in selection choice for each field.  Color coding shows agreement between source documents in knowledge base: red indicates considerable difference, pink minimal difference, and clear no difference.  Knowledge base may be utilized by law firm, business, or other organization.

6.  Limitations of Text Compare:  Process does not identify every error.  Two or more speech engines sometimes misrecognize the same audio.  With an identical misrecognition, no difference is detected.  For example, speaker says "hat" and SR engines A and B both transcribe "that".  However, text compare highlights nonidentical misrecognitions.  For example, speaker says "hat" and SR engine A transcribes "that" while SR engine B transcribes "flat".

Company experience is that identical or nonidentical misrecognitions occur less frequently with improving SR accuracy.  Goal using text comparison is to return text with the same or lower error rate compared to gold standard manual transcription (usually estimated at about 5% or less).

7.  Matches as Indicator of Reliability.  Process may also use "nondifferences" mode to identify likely accurate SR training data.  Potentially, process may use nondifferences for unsupervised (non-hand-labeled) SR speech user profile training.  This differs from supervised or hand-labeled data entry, where human reviewer supervises and manually checks and corrects data submitted for training.

In unsupervised mode, developer can use text comparison to help "computers train computers." A high number of matches with other session files indicates increased likelihood of reliability.  Operator can use export audio text pairs functionality to save unvalidated results of matched transcription to train SR user profile. 
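The export of matched audio-text pairs might be sketched as (illustrative Python; field names are hypothetical, as real session files use the proprietary .SES format):

```python
# Sketch of the "nondifferences" export: keep segments where two engines
# agree and emit the audio/text pairs as unvalidated training data.

def export_matched_pairs(session_a, session_b):
    """Return (start, duration, text) for synchronized segments that match."""
    return [(a["start"], a["duration"], a["text"])
            for a, b in zip(session_a, session_b)
            if a["text"].strip().lower() == b["text"].strip().lower()]

engine_a = [{"start": 0.0, "duration": 1.2, "text": "I pledge allegiance"},
            {"start": 1.2, "duration": 0.9, "text": "to the flag"}]
engine_b = [{"start": 0.0, "duration": 1.2, "text": "I pledge allegiance"},
            {"start": 1.2, "duration": 0.9, "text": "to the flack"}]
print(export_matched_pairs(engine_a, engine_b))
# → [(0.0, 1.2, 'I pledge allegiance')]
```

Only the matched segment is exported for profile training; the differing segment is excluded as a potential error.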

8.  SpeechMax™ supports text compare of text results from pattern recognition.  

Use Text Compare with Other Audio, Text, or Image Pattern Recognition 

SpeechMax™ session file text output may represent human or automated processing of delimited text, audio, or image data.  For example, the image below shows an audio wave.  If it is associated with a delimited text description, it is a session file.

Other potential sources of pattern recognition data include:  text, e.g., optical character recognition, automatic document classification; audio, e.g., music, traffic noise and soundscapes, baby cries, and dog bark characterization; or image, e.g., computer-aided medical diagnosis, facial profiling, retinal or iris scanning, handwriting, or fingerprinting.

As with SR text compare, steps include: convert to .SES format, synchronize data files, compare text output (if available), and identify differences and matches.  Determine if differences represent errors.  Assume matches are likely correct.  If output data is corrected, user may obtain training data for pattern recognition models.

With image source data, text output results may be displayed in document window and text compared.  Image may be displayed in document window or using annotation feature to link to file, webpage, or video. 

For example, computer-aided diagnosis (CAD) for mammography can generate a mammogram session file with suspicious areas marked for review and described by text.  These may be viewed by reference to a hyperlink.  Process may compare text output from two or more CAD systems and/or human reviewers for each suspicious area.  Decision as to final report may be made by a single radiologist in review.  Process can use data from surgical or biopsy specimens to improve CAD pattern recognition models.

Selective Modification of Session File Text and Audio

SpeechMax™ session file editor has document window (upper) and annotation (comment) window (lower).  Operator may selectively edit document window speech recognition session file text or audio using text or audio annotations.

Once user has entered annotations, user may switch (transpose) document/annotation content, or selectively replace (move) annotation text or audio into document window to modify session file text or replace audio.  The switch and replace functions use the copy paste functions of the underlying operating system.
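The selective replace operation can be sketched as (illustrative Python; the dict-based structures are stand-ins, not the actual .SES layout or API):

```python
# Sketch of the replace operation: move annotation text (or audio) into the
# document segment it annotates, leaving the other fields untouched. This
# illustrates the copy/paste-style behavior, not the actual .SES API.

def replace_from_annotation(document, index, annotation, field):
    """Selectively replace one document segment's 'text' or 'audio' field."""
    document[index] = dict(document[index], **{field: annotation[field]})
    return document

doc = [{"start": 0.0, "duration": 1.0, "text": "Adam Smith", "audio": "seg0.wav"}]
note = {"text": "Alan Smith", "audio": "note0.wav"}
replace_from_annotation(doc, 0, note, "text")
print(doc[0]["text"], doc[0]["audio"])  # → Alan Smith seg0.wav
```

Passing `"audio"` instead of `"text"` would replace the audio tag while keeping the document text, mirroring the two replace functions described above.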

Process may use annotation text audio pairs to train the speech user profile.  Annotation training differs from general session training, which uses the document window audio-linked verbatim training session file (.TRS) to train the speech user profile.

Value Proposition: Session file editor supports voice correction of another speaker's session file, i.e., selective modification of text previously dictated with speech recognition. 

This enables one or more other parties to voice correct the speech recognition text of another speaker and to use the modified text and audio to train the speech user profiles of both speakers.  This is not supported by Dragon NaturallySpeaking or other off-the-shelf desktop speech recognition software.

With desktop speech recognition, Speaker B can open another speech user profile in a buffered document window and listen to the audio-linked text created by Speaker A.  Speaker A may represent a dictating lawyer, physician, or other dictating speaker.  Speaker B may represent a speech editor or another speaker attempting to collaborate on a document with Speaker A.

If Speaker B attempts to voice correct Speaker A text, saving the Speaker B voice with the corrected text will corrupt the Speaker A speaker profile. If Speaker B opens his/her profile and corrects the text, there is no training of Speaker A profile.  This limits voice correction by a speech editor or use of desktop SR to create multispeaker collaborative documents and train speech user profiles. 

With SpeechMax™, a second speaker can open the .SES session file dictated by a first speaker in the main document window (upper) and use the annotation (comment) window (lower) to mark up the session file of the first speaker with voice.  Selective modification of the document window text of the first speaker uses OS copy/paste functionality.

This supports selective modification of document text and audio tags.  This also allows a second author/speaker to make corrections with voice without damage to either the first or second speech user profile.  This assists transcriptionist or court reporter voice correction of documents, as well as multiple speakers creating and editing collaborative documents.

  • Annotation sound recorder records SR audio; text is entered in text annotation window

  • Document window (top), annotation (comment) window (bottom)

  • Supports annotation correction of Speaker A SR text

  • Speaker A had dictated "Alan Smith" but was recognized as "Adam Smith"

  • Speaker A profile open in buffered document window

  • Speaker B correctionist plays back 1st speaker dictation

  • Speaker B corrects using SR ("Alan Smith") (see screen shot below)

Dictates using SR into Annotation window

Document Window (top) with vertical purple utterance markers separating text and Adam Smith text highlighted in blue, Annotation Window (bottom) with sound recorder, annotation text window with Alan Smith text, SPEAKER B highlighted under Annotation Name column as party making annotation


1.  Train both speech user profiles.  After second speaker correction of first speaker SR text, the system can train both speech user profiles using the verbatim text "Alan Smith" and the respective audio dictated by first speaker and second speaker.  There is no corruption of either speech user profile. 

2.  Supports sequential voice creation/correction of multispeaker documents on a local area network, or documents emailed as attachments to other parties.  Unlimited number of annotations (comments) per dictated text are supported.  These comments may represent voice corrections.

May potentially use similar document/annotation concept for web-based, real-time multispeaker collaborative documents.  Utilize first window to view text/playback audio.  Use second window to dictate SR correction or change.  Copy/paste text from second window into first.  Use corrected text to train Speaker A profile.  Use Speaker B audio and text dictated into second window to train/update Speaker B profile. Potentially second window may reside in first browser or in a second, newly opened browser.

Value Proposition:   Software supports selective modification of audio tag of document window text.

Mainstream adaptive SR correction and user training focuses on text correction of audio-linked text. There is no convenient way to selectively modify SR audio tag with human or synthetic voice or other audio.  

With CSUSA system, user can also selectively modify the audio tag of audio-linked SR text.  This relies upon OS copy/paste functionality.  Process requires creating and summing time stamp offsets of the session file to maintain alignment unless the original audio is the same length as the replacement audio.
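The offset bookkeeping can be sketched as (illustrative Python with hypothetical segment fields; real audio splicing would also rewrite the audio stream itself):

```python
# Sketch of the offset bookkeeping: replacing a segment's audio with audio of
# a different length shifts every later segment's start time by the offset,
# keeping the audio-text alignment intact. Structures are illustrative.

def replace_audio_tag(segments, index, new_duration):
    """Replace the audio duration at `index` and shift later start times."""
    offset = new_duration - segments[index]["duration"]
    segments[index] = dict(segments[index], duration=new_duration)
    for seg in segments[index + 1:]:
        seg["start"] = round(seg["start"] + offset, 3)
    return segments

segs = [{"start": 0.0, "duration": 1.0, "text": "full name"},
        {"start": 1.0, "duration": 2.0, "text": "date of birth"},
        {"start": 3.0, "duration": 1.5, "text": "address"}]
replace_audio_tag(segs, 0, 1.5)  # splice in a 1.5 s replacement for segment 0
print([s["start"] for s in segs])  # → [0.0, 1.5, 3.5]
```

If the replacement audio has the same length as the original, the offset is zero and no later time stamps change.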

One use for this feature is modification of audio instructions associated to a form field in SpeechMax™.

SpeechMax™ supports simplified form creation with data fields entered in the main document window as text by keyboard or SR.  User can create complex forms without use of expensive software development kits or knowledge of advanced scripting.  Forms creation uses text and audio annotation (comment) features. 

Forms creator can enter text and audio comments.  Forms creation can also create text annotations to form fields that link to websites or run programs (such as media player).  These can supply additional information to person who is completing the form.

As a result, party completing the form can access text, spoken instructions, website, or other information relevant to form completion.  

Operator can also selectively modify document window text audio tag using audio annotation. 


1.  Create form or other text with SR, replace audio tag representing original SR dictation with audio content.  User completing form selects form field text and hears relevant audio information.  Potential benefit where necessary to hear audio, e.g., testing hearing or recognition of speech or music.   

2.  Train profile for real or artificial language with synthetic (robotic) speech
Substitute synthetic voice for SR human voice tag.  This creates session file text with audio-linked synthetic voice.  This can be used to create speech recognition for a real or artificial language with a synthetic voice font.  Speech user profile would consist of model based upon the synthetic voice.  This is one approach to creating speech user profile for robots. 

3. Improve audio quality of SR audio-tag when intended for playback.  Electronic audio book may be read by professional voice talent and transcribed with SR.  Voice talent may also dictate audio which is presegmented and transcribed manually.  Some audio may be poor quality and require redo by voice talent.  Process may selectively replace ("splice in") phrase audio for one or more segments or replace audio for individually-tagged words.

4.  Associate foreign language audio tag to source text dictated by SR
This creates a learning tool for foreign language.  For example, user clicks on English in document window and hears foreign translation as audio tag. 

5.  Associate song to lyrics.  Dictate song lyrics, transcribe with SR or manual transcription, delete spoken audio tags, and substitute song for spoken word.   (The following demo was created differently, but shows the same result.) 

Demo #19 . . . Singalong/karaoke:  Stairway to Heaven (Led Zeppelin) . . . Demo shows playback of audio (song) with highlighted text and slider bar in SpeechMax™.

"Stairway to Heaven," produced by Jimmy Page, executive producer Peter Grant, © 1971 Atlantic Recording Corporation for the United States and WEA International Inc. for the world outside of the United States. 

Using the top slider bar, a user may "drag" the playback point to another point within the same segment, or to a point in another segment.  Software also supports finding a corresponding location in text and resuming listening at a new time stamp using the slider bar.  A slider bar may also be associated with each separate window.  SpeechMax™ synchronization techniques support dynamic display of synchronized text in two or more windows.  One window may play the song and text; one or more other windows may display only translated song lyrics.
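The slider-to-text mapping can be sketched as (illustrative Python; timings are stand-ins for session file audio tags):

```python
# Sketch of slider-bar resume: map a playback time stamp to the segment that
# contains it so the display can highlight the corresponding text.

def segment_at(segments, timestamp):
    """Index of the segment whose [start, start + duration) holds timestamp."""
    for i, seg in enumerate(segments):
        if seg["start"] <= timestamp < seg["start"] + seg["duration"]:
            return i
    return None  # time stamp falls outside the recording

lyrics = [{"start": 0.0, "duration": 4.0, "text": "There's a lady who's sure"},
          {"start": 4.0, "duration": 5.0, "text": "all that glitters is gold"}]
print(segment_at(lyrics, 6.5))  # → 1 (second lyric line is highlighted)
```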

Protect Confidential Session Data with ScrambledSpeech™

Value Proposition:  Speech or text redaction limits access to confidential information such as name or address, division limits amount of data transferred to any individual outside processing node, and scramble disorders content making it more difficult to understand. 

Transcription or SR editing is often processed out-of-office at remote sites.  Federal and state government, businesses, and professional offices would often like to outsource data processing to cheaper centers, but in ways that limit disclosure of confidential information.  The following is designed for security requirements as might apply to routine business or professional dictation.

  • Limit disclosure of content during manual transcription or SR editing

  • Redact word or phrase SR audio and text  (SpeechCensor™)

  • Divide session file segments into two or more groups

  • Send divisions to different processing nodes

  • Option to scramble (sort) segments within division (ScrambledSpeech™)

  • Maintains intelligible unit (utterance) for transcription or SR editing

  • Utterance = phrase or short sentence

  • Balancing obscuring content vs. efficient transcription

One division of the scrambled session file is seen in the window below.
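The divide/scramble/unscramble round trip might be sketched as (illustrative Python; grouping and shuffle details are assumptions, not the product's actual algorithm):

```python
import random

# Sketch of the divide/scramble/unscramble round trip: deal segments into
# groups for different processing nodes, shuffle each group with a key known
# only to the owner, then merge and restore the original order on return.

def divide(segments, n_groups):
    """Deal segments round-robin into groups, remembering original positions."""
    groups = [[] for _ in range(n_groups)]
    for i, seg in enumerate(segments):
        groups[i % n_groups].append((i, seg))
    return groups

def scramble(group, key):
    """Shuffle one group deterministically so the owner can later reverse it."""
    rng = random.Random(key)
    shuffled = list(group)
    rng.shuffle(shuffled)
    return shuffled

def merge_unscramble(groups):
    """Recombine returned groups and restore the original segment order."""
    flat = [item for group in groups for item in group]
    return [seg for _, seg in sorted(flat, key=lambda item: item[0])]

utterances = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
scrambled = [scramble(g, key=k) for k, g in enumerate(divide(utterances, 2))]
print(merge_unscramble(scrambled))  # → original order restored
```

Each node sees only its own shuffled division, yet every utterance remains an intelligible unit for transcription or editing.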


1.  Limit knowledge of whole for minimal security level protection.  Solution is not designed for data requiring a high level of security.
2.  Maintains intelligible unit (utterance) for manual transcription or SR correction.  While data is obscured by scramble, utterance phrase or sentence should be intelligible to manual transcriptionist or SR speech editor.

3.  Simple process to reverse after returned document 
Reverse redact/divide/scramble after document return (see menu above).

4.  Potential application to TV captioning or court reporting by remote court reporter.  May segment and distribute single segments to different captionists or court reporters.  May stream untranscribed session file segments (audio utterances); transcription results in transcribed session file with audio-linked text.
Demo #23 . . . Divide, scramble, merge, and unscramble . . . This video shows divide/scramble of an untranscribed audio session file and merge/unscramble of transcribed session files.

Price, terms, specifications, and availability are subject to change without notice. Custom Speech USA, Inc. trademarks are indicated.   Other marks are the property of their respective owners. Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.)