U.S. patents and patent pending

Improve speech recognition accuracy, find and correct speech recognition errors more quickly, and protect document confidentiality

In 1998, the company began developing Windows-based dictation, transcription, and speech recognition software in Crown Point, IN.  The company has served as an authorized reseller for Microsoft, IBM, Dragon (Nuance), AT&T, Sony, and Olympus products.  The company last developed software for the Microsoft Windows Vista operating system.  It closed its business and development office in 2007 and has since operated as a virtual company providing primarily consulting services.  To obtain more information, click the contact button above and complete the form, or contact info@customspeechusa.com.

Workflow Management and Dictation and Transcription

The workflow manager and file management system support dictation, manual transcription, speech recognition, text to speech, audio file conversion, machine translation, audio mining, telephone dictation, speech and language processing, and other tasks.  A software development kit (SDK) and application programming interface (API) support integration with third-party software and hardware.

Command! Call Center combines call management and telephone dictation.  PhoneCop™ supports call management with caller ID, two-way call recording, and call transfer (call forwarding).  The CallDictate™ software module is for telephone dictation, voice mail, and informational announcements.  CallStation™ is a PC-based solution.  The company has also marketed a software development kit for telephony-related workflow.

Dictation and transcription software includes acWAVE™ audio file conversion for Sony and Olympus handheld recorders and other audio sources.  CustomMike™ is a driver for the Philips handheld microphone.  MacroBLASTER™ is a macro editor for a programmable keypad supporting voice, barcode, or keyboard commands.  PlayBax™ runs with the Infinity transcriptionist foot pedal and Spectra headset for manual transcription and provides a small-footprint transcription window.

Speech Recognition Processing


. . . . More than a word processor™

SpeechMax™ is a speech-oriented, multilingual, multiwindow HTML session file editor.  The software stores session file data in a proprietary (.SES) XML format.  Read/write supports .TXT, .RTF, .HTML, and .SES.  Users can enter text with keyboard, speech recognition, programmable keypad, or barcode.  Single, double, or multiple window viewing is available.  Tile documents horizontally, vertically, or in cascade.  Create one or more audio and/or text annotations (comments) associated with specific document text.  A text annotation may include simple text, as well as a hyperlink or command line (which can open a media player, open a web page, or perform another function).  The company has marketed the session file editor primarily for training speech recognition user profiles and editing speech recognition text.  As of 2007, the software supported editing of Dragon (Nuance), Microsoft Windows, and this company's SweetSpeech™ software.

Software includes application programming interface (API) and software development kit (SDK).  Use to develop software add-on to create, review, or edit session file from speech and nonspeech audio, text, or image data.  In 2007 company implemented SpeechMax™ Microsoft Word Toolbar Add-In.  Among other functions, Add-In has Next/Previous toolbar arrows to navigate to Word bookmarks, plus Import/Export toolbar functions that support annotation and phrase migration wizards.  These wizards support, for example, data migration to/from a customizable form in SpeechMax™ to a Word document. 

The screen shot shows SpeechMax™ toolbars and the main document window.  Vertical purple markers represent placeholders indicating very short pauses between speech utterances.  An utterance may consist of a word, phrase, or short sentence.

The annotation feature is located below the main document window.  A dictating speaker, transcriptionist, speech recognition editor, or other user can annotate document window text with text and/or audio comments.  Office staff, speaker, transcriptionist, or other user can also use the annotation feature to create a customized fill-in-the-blank form in the main document window, such as an employee information form (see below).

Location and order of "blanks" is completely customizable.  Different users can enter text and/or record voice data in the order preferred by a particular user, e.g., a dictating speaker.  Type of data entry is customizable too so that one user can enter data by voice and another by text.

The last name "Streeter" has been entered at the bottom in the annotation window in the first screen shot.  It is associated with the first blank of the "Full Name" field (last name first).  As shown in the second screen shot, the user entered other data as annotations and uploaded it to the main document window.  The process automatically removes underlining as annotation text moves into the document window.

Demo #10 demonstrates use of the employee information form and data migration to Microsoft Word.  Audio annotation (blue highlighting) supports text and/or audio entry.  Text annotation (purple highlighting) supports text-only entry.  The form includes instructions created as an audio annotation for playback by the user. Flash MP4

Demo #11 shows form creation.  Form creation generally involves entry of a field name and creation of an audio annotation within an otherwise empty session file segment.  The video also shows text entry with manual transcription or speech recognition. Flash WMP MP4

Demo #12 discusses The Talking Form™ and audio prompt creation.  A user can listen to instructions, for example, on how to complete a form.  Flash WMP MP4

Demo #13A illustrates the SpeechMax™ Microsoft Office Toolbar Add-In with data transfer between Microsoft Word and SpeechMax™.  The software enables a speaker to dictate into Microsoft Word with Dragon speech recognition, transfer text and audio data to SpeechMax™ for correction, and migrate final text into Microsoft Word or another program. WMP MP4

Demo #13B represents a proof-of-concept workflow.  It illustrates the SpeechMax™ Microsoft Office Toolbar Add-In with transfer of XML data between a Continuity of Care Record (CCR), Microsoft Word, and SpeechMax™.  The Add-In supports download/upload of XML data between Microsoft Word and the CCR.  The video demonstrates downloading data into Word using the Add-In and modifying data based upon written or dictated information.  If the speaker dictates for manual transcription, a transcriptionist may play back dictated audio using PlayBax™ and transcribe into Word.  Alternatively, the user may dictate into Microsoft Office using speech recognition.  Data may be transferred to SpeechMax™.  The same or a different user can further modify it with dictation/transcription, speech recognition, keyboard, or barcode.  The Add-In supports transferring modified data back to Word with upload directly into the CCR.  Alternatively, an operator may modify the data in Word before upload. WMP MP4

Software includes a standard spellchecker.  It also includes WordCheck™, the company's advanced text compare for error detection in pattern recognition output.  The process synchronizes output from different speech recognizers.  Each synchronized text phrase transcribed by two or more speech engines reflects the same speech.  Synchronization also implies the same number of segments without regard to data type or processing method (e.g., manual transcription versus speech recognition).  A user can, for example, open one or more main document windows to view synchronized text from two or more speech engines, two or more manual transcriptionists, or both.  The concept of synchronization also applies to other pattern recognition, such as the translation of English into French and Spanish texts, or other audio, text, or image processing.
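The synchronization idea can be sketched in a few lines of Python. This is an illustrative model only, assuming a simple list of (audio tag, text) phrase pairs; the names and data layout are hypothetical and do not represent the proprietary .SES format.

```python
# Illustrative sketch of phrase-level synchronization between two speech
# engine outputs. Each output is modeled as a list of (audio_tag, text)
# tuples; synchronized outputs must have the same number of segments,
# with each pair of segments reflecting the same stretch of speech.

def synchronize(engine_a, engine_b):
    """Pair up phrase segments from two engines transcribing the same audio."""
    if len(engine_a) != len(engine_b):
        raise ValueError("outputs are not synchronized: segment counts differ")
    return [
        {"audio": tag_a, "text_a": text_a, "text_b": text_b}
        for (tag_a, text_a), (_, text_b) in zip(engine_a, engine_b)
    ]

a = [("00:01", "the patient"), ("00:03", "has underlying disease")]
b = [("00:01", "the patient"), ("00:03", "has underlining disease")]
pairs = synchronize(a, b)   # two synchronized phrase pairs, second one differing
```

The same pairing logic would apply whether a segment came from a speech engine or a manual transcriptionist, which is why synchronization is defined without regard to processing method.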

As indicated below, the vertical pink markers separate synchronized phrases arising from the same source text.  The last sentence in each text has been selected (highlighted).

In the company approach, during the first phase of speech recognition training, the speaker dictates using a handheld microphone or other device.  Dictation audio is segmented into audio utterances and manually transcribed phrase by phrase.  This results in a session file with audio-linked text.  It may be used to train Dragon or Microsoft adaptive speech engines, or nonadaptive SweetSpeech™ speech recognition.  SweetSpeech™ also provides high-level ("do-it-yourself") tools for transcriptionists, speech editors, or speakers to create lexicon, language model, and acoustic model data for speech recognition.

Once recognition reaches high accuracy, the speaker transitions out of the "preautomation" (manual transcription) training phase into "full automation" with speech recognition.  The speaker has the option of using real-time speech recognition (with immediate visualization of text on screen) or speech recognition with delayed visualization (until the speech engine has processed the entire audio file).

SpeechServers™, for example, transcribes the audio using one or more speech engines for server-based recognition that returns audio-tagged text after processing of the entire file.  Text may be reviewed by a speech editor and/or the speaker with the SpeechMax™ session file editor.

If the process uses two or more speech recognition programs, WordCheck™ text compare detects differences in speech engine output.  Generally, matches (agreement) between two or more programs represent correct text.  See J. G. Fiscus, "A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER)," pp. 347-54 (1997 IEEE).

The exception to the rule is an identical misrecognition.  In this case, two or more speech engines misrecognize the same audio in the same wrong way, so the incorrect text is identical across engines.  Examples from company experience include two or more speech engines recognizing "underlining" when the speaker said "underlying", or recognizing "are" when the speaker said "of", as in "of the spine".  In these cases, because the incorrect text is a match, there is no detectable text difference.
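The matches-are-probably-correct heuristic, and its blind spot, can be shown with a minimal sketch. The function name is an assumption for illustration, not the shipped WordCheck™ API.

```python
# Illustrative WordCheck-style difference detection over synchronized
# phrases. Matching phrases are treated as probably correct; differing
# phrases are flagged for speech editor review. Limitation shown below:
# an identical misrecognition produces a match and is NOT flagged.

def flag_differences(phrases_a, phrases_b):
    """Return indices of synchronized phrases where the two engines disagree."""
    return [i for i, (a, b) in enumerate(zip(phrases_a, phrases_b)) if a != b]

a = ["degenerative changes", "underlining the spine", "no fracture"]
b = ["degenerative change",  "underlining the spine", "no fracture"]

# Phrase 0 differs and is flagged for review; phrase 1 is an identical
# misrecognition of spoken "underlying" and slips through as a match.
flagged = flag_differences(a, b)   # [0]
```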

Protocols for speech editor review include: (1) the speech editor reviews all audio and text, makes corrections, and returns text to the speaker for final review and approval; (2) the editor reviews only text differences and their associated audio (as a difference indicates that one or both texts are incorrectly recognized); or (3) some variation of the above.  Based upon company experience, highly accurate speech recognition returns text with about the same error rate as gold standard manual transcription.  For critical documents, it is recommended that the dictating speaker review the entire document and make changes and/or request additional editing where appropriate before sign-off.

In one approach, WordCheck™ synchronized text compare identifies differences between speech engine outputs and highlights them.  Nonhighlighted text indicates no detected text difference.

How can a speech editor and/or speaker use text compare?  Different approaches exist.  For example:

With the SecondLook™ approach, the session file editor plays back ALL session file audio (representing differences and matches), reviews the text, and makes corrections.  To make certain that he or she has reviewed all differences (which represent increased risk of misrecognition), the editor takes a quick "second look" at highlighted differences in the original documents before sending the text off for review and signature by the dictating speaker.

In FirstLook™ mode, the editor plays back audio for DIFFERENCES only (where one or both engines have misrecognized text).  The editor selects or enters correct text into the document and returns the edited text to the speaker for final review.  This dramatically shortens editor review time with highly accurate recognition, as indicated in the graph below.  The speaker reviews the entire document before approving it for distribution.  The speaker can correct remaining errors (usually minimal or none with highly accurate recognition) or send the document back for rework.

The organization can use time savings from SpeechMax™ to support other activities.  For example, a speaker who previously had to self-correct can spend less time correcting errors (as a speech editor has corrected most or all) and more time on his or her primary work, e.g., a doctor who treats patients.  The same applies to other dictating speakers, such as lawyers, engineers, or business personnel.

The bar graph below shows estimated time savings using text comparison.  The graph assumes that the speech editor reviews DIFFERENCES in FirstLook™ mode only.  At the theoretical limit of 100% accuracy in all texts, there are no differences, and speech editor review time drops to zero.  The dictating speaker carefully reviews all text before signoff primarily to detect errors, including identical misrecognition (where two or more speech engines make the same mistake).

At 90% accuracy, the speech editor generally will identify some differences requiring review.  At this accuracy level, there is an 80% decrease in speech editor audio review time.  The dictating speaker still reviews the final text.  With highly accurate speech engines, the speaker has few if any errors to correct.

As indicated, different speech engines sometimes make the same recognition error.  These and other errors become less frequent with improved accuracy.  The process can use the SDK and API to create a database reliability index to track previous identical misrecognitions.  This index can alert the editor and/or speaker if previously identically misrecognized text (e.g., "underlining" instead of "underlying") appears in transcribed text.
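A reliability index of this kind reduces, at its core, to a lookup of previously logged identical misrecognitions. The sketch below is a hypothetical illustration: the function name is assumed, and an in-memory dict stands in for whatever database the SDK/API would actually target.

```python
# Sketch of a reliability index: a lookup of words that two or more
# engines have previously misrecognized identically, used to alert the
# editor and/or speaker. Examples drawn from the text above.

KNOWN_IDENTICAL_MISRECOGNITIONS = {
    "underlining": "underlying",   # both engines output "underlining"
    "are": "of",                   # e.g., "are the spine" for "of the spine"
}

def alert_suspect_words(transcribed_words):
    """Return (word, likely_intended) pairs worth flagging for review."""
    return [
        (w, KNOWN_IDENTICAL_MISRECOGNITIONS[w])
        for w in transcribed_words
        if w in KNOWN_IDENTICAL_MISRECOGNITIONS
    ]

alerts = alert_suspect_words("underlining disc disease".split())
```

Flagged words carry no guarantee of error; they simply tell the reviewer that this text has been identically misrecognized before and deserves a listen.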

As indicated in the following graphic, text comparison can potentially reduce speech editor audio review time significantly.

Total audio review time with text compare depends upon how much audio the speech editor must review.  The bar graph shows the expected decrease in audio review time when reviewing text with highlighted differences.  The speech editor corrects detected errors.  With highly accurate speech recognition, text returned to the speaker after speech editor review generally will have the same, or about the same, error rate as manual transcription.  As described above, when there are no misrecognitions and no differences for review, speech editor review time drops to approximately zero.  No recognition errors are highlighted when the speech engines make the same mistake (identical misrecognition).

It is assumed, for purposes of this graph, that each speech engine misrecognizes different words from those misrecognized by the other speech engine.  This assumption tends to increase expected speech editor review time and reduce the projected time savings.  First, consider the case where each speech engine misrecognizes the same word in a sentence, but differently (e.g., the user said "ball", the first engine transcribes "hall", and the second "wall").  The speech editor need only review text and audio for a single word.  However, if the misrecognitions involve words in different locations in the sentence, then word audio for two different locations (two different audio tags) must be reviewed, increasing editor audio review time.  Because misrecognition errors may occur in the same word (position) in a sentence, the graph would tend to underestimate potential time savings.
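A rough back-of-the-envelope model makes the review-time claims concrete. This is only a sketch under the graph's own assumption that the two engines err on different words (errors independent), with an assumed function name: the editor reviews a phrase only when the outputs differ, so with per-engine accuracy a the expected differing fraction is about (1 - a) + (1 - a) - (1 - a)².

```python
# Expected fraction of synchronized phrases flagged as differences,
# assuming each engine's errors fall on different, independent words
# (the graph's assumption). The editor reviews only flagged phrases.

def review_fraction(accuracy_a, accuracy_b):
    """Fraction of phrases the editor must review (union of error rates)."""
    err_a, err_b = 1 - accuracy_a, 1 - accuracy_b
    return err_a + err_b - err_a * err_b

# With both engines at 90% accuracy, about 19% of phrases need review,
# i.e., roughly the 80% reduction in audio review time cited above.
# At 100% accuracy there are no differences and review time is zero.
frac_90 = review_fraction(0.9, 0.9)
frac_100 = review_fraction(1.0, 1.0)
```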

Same logic applies to evaluating differences between three or more texts, as well as to output text from other audio, text, or image pattern recognition, e.g., speaker identification, machine translation, computer-aided diagnosis (CAD) for medical examinations, and other pattern recognition.

ScrambledSpeech™ feature reorders speech utterances (phrases) before manual transcription or speech recognition editing.  This is designed to limit understanding of document content as a whole by any one transcriptionist or session file editor.  Process can send all rearranged audio phrases to a single transcriptionist, several reordered phrases to several transcriptionists, or a single phrase to each of many transcriptionists.  After transcription, the transcribed phrases are rearranged into their original sequence and text distributed as a final report or other document.  Similarly, process can scramble and divide multiple speech recognition text-tagged utterances among different speech recognition editors, rearrange the reviewed and corrected session file segments into proper order, and return session file to speaker for review and approval. 
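The reordering step described above can be sketched as a shuffle that remembers original positions. This is a conceptual illustration only (function names assumed, not the actual ScrambledSpeech™ implementation).

```python
# Conceptual sketch of ScrambledSpeech-style reordering: shuffle
# utterances before distribution to transcriptionists, keep each
# utterance's original position, then restore order after transcription.
import random

def scramble(utterances, seed=0):
    """Return utterances in shuffled order, each tagged with its original index."""
    tagged = list(enumerate(utterances))
    random.Random(seed).shuffle(tagged)   # fixed seed for reproducibility here
    return tagged

def unscramble(tagged_transcripts):
    """Restore transcribed phrases to their original dictation order."""
    return [text for _, text in sorted(tagged_transcripts)]

original = ["phrase one", "phrase two", "phrase three"]
scrambled = scramble(original)
# ... each (index, audio) item could go to a different transcriptionist ...
restored = unscramble(scrambled)   # back in original order
```

Because any single transcriptionist sees only out-of-order fragments, no one reviewer reconstructs the whole document, which is the confidentiality point of the feature.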

SpeechMax™ supports, in summary, speech recognition for dictation, speech editor and/or speaker correction of speech engine errors, easy creation of fill-in-the-blank forms with audio or text prompts, data migration between a SpeechMax™ session file and Microsoft Word, and other manual and/or pattern recognition processing of different data.  It also supports My AV Text™ multimedia presentations (text, audio, video, graphics, and internet content).  For example, a user can link document text to a particular website page or run a program (e.g., a video player).  Other features are covered on the SpeechMax™ and About pages.


The SweetSpeech™ speech engine and tool kit support nonadaptive speaker-specific speech recognition.  Speech recognition is server-based and returns text after complete transcription of the audio file.  The toolkit includes high-level tools to help a transcriptionist, speech editor, or other party create and update the speaker lexicon (with little or no reliance upon expensive technical expertise), as well as the speaker's acoustic and language models.  Tools include a proprietary text-to-speech phoneme generator to help create an accurate speaker-specific lexicon, speech engine automatic lexical question generation, and accumulator and combiner techniques to train and update the speech engine with audio-tagged verbatim text generated by SpeechMax™.  For example, using these high-level tools, it is anticipated that many speech recognition editors and/or manual transcriptionists could be trained to generate the speech and language data for speech recognition and other automatic speech and language processing tasks.

Microsoft research also indicates that a nonadaptive speaker-specific approach is potentially more accurate than mainstream speaker-adaptive technology.

Increased accuracy in the bar graph from Microsoft research refers to an 8.6% error reduction for nonadaptive recognition compared to the adaptive technique with less than 8 hours of training data, and a 12.3% error reduction with greater than 15 hours of training.  Word error rate (%) is about 1% lower for the nonadaptive speaker-specific model across different levels of speaker training data.  The authors estimate a word error rate of about 7% with training data of 12,000 sentences for the nonadaptive technique, and 8% for the adaptive model with the same level of training data.  Similarly, there is an improvement in relative error reduction of 45% and 52.2% for the nonadaptive approach compared to a speaker-independent model.
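The arithmetic behind a relative error reduction figure is straightforward and worth making explicit (function name is illustrative). Using the authors' estimates of roughly 8% word error rate for the adaptive model and 7% for the nonadaptive model at 12,000 training sentences, the relative reduction works out to about 12.5%, consistent with the reported 12.3%.

```python
# Relative error reduction compares two word error rates (WERs):
# the fractional drop in errors versus the baseline system.

def relative_error_reduction(wer_baseline, wer_improved):
    """Fractional reduction in word error rate relative to the baseline."""
    return (wer_baseline - wer_improved) / wer_baseline

# Adaptive ~8% WER vs. nonadaptive ~7% WER at 12,000 training sentences.
reduction = relative_error_reduction(0.08, 0.07)   # ~0.125, i.e., ~12.5%
```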


SpeechServers™ has provided server-based speech recognition with return of the session file after complete transcription of the audio.  It also has supported back-end user enrollment with an audio file and manually transcribed audio-linked text.  As a result, the user does not have to read text displayed on screen and record with a microphone or handheld recorder.  Repetitive iterative training of misrecognized words has been used for Dragon and Microsoft speaker-adaptive systems.  Software supports accumulator/combiner enrollment and speech user profile updates for the company's proprietary nonadaptive speech engine SweetSpeech™.  A dialog from an early SpeechServers™ version is shown below and was implemented for use with speaker-adaptive systems.


Price, terms, specifications, and availability are subject to change without notice.  Custom Speech USA, Inc. trademarks are indicated.  Other marks are the property of their respective owners.  Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.).