U.S. patents and patent pending

Company started speech recognition software development in 1998 in its Crown Point, IN office, about one hour southeast of Chicago, and now operates out of Terre Haute, IN. 

Company has received multiple U.S. and foreign patents.  Articles published in the U.S. and overseas have evaluated company software.

Customers have included health care organizations, doctors, lawyers, law enforcement, solo transcriptionists, transcription companies, and large and small businesses in the U.S. and overseas.


"We selected their SpeechServers™ running with Dragon NaturallySpeaking . . . .  we recommend it to anyone looking for cost-effective software for server-based speech recognition."  Robert Duffy, Programmer, Columbia University Medical Center, New York, NY

"After implementation of web transfer server and web services for transcription, there was no significant interruption in service for the doctors or transcriptionists.  The typing ladies think the new server is wonderful . . . . "   Ian Yates, Medical I.T. Ltd., Queensland, Australia

"Since installing your product, acWAVE™ we have processed over 1,000,000 conversions (on one server alone) without a problem. . . . Thank you for such a great product."  Scott D. Stuckey, ASP Product Manager, Voice Systems, Inc. Tampa, FL



Various publications have reviewed company software.  Software was described in a 2010 compilation edited by William Meisel, editor of leading industry trade journal Speech Strategy News.  Book includes articles written by contributors from Nuance, Microsoft, IBM, Custom Speech USA, and other companies.


Description of error correction using text compare of dual output from server-based speech recognition
TechnoLawyer review from 2002 describes error spotting using dual-engine text compare.  Error correction is followed by SpeechServers™ iterative training of the speech user profile using the SpeechTrainer™ service.  With this feature, if a word is incorrectly recognized, audio is replayed by the server and retranscribed until it is accurate, or for a preset number of iterations. 


SpeechMax™ speech-oriented HTML session file editor for reviewing and correcting output from manual and/or automatic processing of audio, text, or image data

SweetSpeech™ nonadaptive speech recognition and toolkit for training speech user profile for speech recognition and other speech and language processing

SpeechServers™ server-based speech recognition, as well as other speech and language processing

workflow manager for dictation, transcription, speech recognition, text to speech, audio file conversion, machine translation, audio mining, telephone dictation, and other speech and language processing.  Company software has integrated with Dragon (Nuance), Microsoft, and IBM speech recognition, ATT text to speech, and other speech and language programs.

Dictation and transcription utilities include
acWAVE™ for audio file conversion, CustomMike™ driver for handheld microphone, MacroBLASTER™ macro editor for programmable keypad for voice and barcode command, and PlayBax™ for foot pedal transcription playback for manual transcription. 

SpeechMax™ . . . . More than a word processor™

SpeechMax™ is a one-of-a-kind, speech-oriented, multilingual, and multiwindow HTML document editor. 

Initially designed for editing speech recognition output, this versatile software now supports synchronizing, reviewing, and correcting output from manual or automatic processing of audio, text, or image data.  This includes real-time dictation with speech recognition, desktop or server-based speech recognition, text to speech, machine translation, and other speech and language processing.  Interface can text compare two or more speech recognition session files, as well as data processed by two or more manual transcriptionists.  Each document window supports one or more annotation windows for text or audio comments aligned to document window input audio, text, or image.  Interface supports aligning text, audio, or image content or a video segment to specific document text. 

Other features:
--Single, dual-window, or multiwindow display
--Synchronize display text flow for right-to-left and left-to-right text
--Tile multiple windows horizontally, vertically, or in cascade
--Synchronize audio-text utterances across two or more session files
--Synchronization ==> identical session file segment numbers
--Text differences highlighted for rapid error detection
--Tab navigate to synchronized segment
--Playback audio, correct utterance text
--Create composite, best-guess session file from two or more files
--Highlight differences between best-guess and other files

WordCheck text compare for error detection
General practice standards and regulations emphasize the importance of detecting spelling and misrecognition errors.   See, e.g., The Joint Commission, Division of Health Care Improvement, "Transcription Translates to Patient Risk," Quick Safety (April 2015).  

A word processor spellchecker detects spelling errors only, not misrecognition.  WordCheck text compares output from two or more recognition systems.  It highlights text differences in red to alert the reviewer (e.g., primary speaker or speech editor) to a discrepancy and the need to carefully review the text.  If the reviewing party is the speaker, he or she usually will remember what was dictated.  If the reviewing party is the editor, he or she will typically need to review the text after listening to the audio.   SpeechMax™ API also supports output difference detection for pattern recognition of other audio, text, or image data using corresponding third-party developer toolkits.
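As an illustration only (the WordCheck™ internals are not public), the core dual-engine comparison idea can be sketched with a standard sequence diff: word positions where the two engines agree pass through, and positions where they disagree are flagged for review.

```python
import difflib

def flag_differences(words_a, words_b):
    """Compare two recognition outputs word by word and mark spans
    where the engines disagree (candidate errors for review)."""
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            flagged.extend((word, False) for word in words_a[i1:i2])
        else:
            # Disagreement: present both engines' candidates, highlighted.
            flagged.append(("/".join(words_a[i1:i2] + words_b[j1:j2]), True))
    return flagged

engine_a = "the patient has a mouse in the chest".split()
engine_b = "the patient has a douse in the chest".split()
print(flag_differences(engine_a, engine_b))
```

In a real interface the flagged spans would be rendered in red, as described above, rather than printed.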

Identical misrecognition not highlighted
A Nuance® case study reports up to 98% accuracy for physician users of Dragon® Medical.  In one published comparison ("Voice Recognition Software Dictation Test"), MacSpeech Dictate had one error in 124 words; other recognition systems had substantially lower accuracy.  Reasons for speech recognition errors are poorly understood; see "Which Words are Hard to Recognize? Prosodic, Lexical, and Disfluency Factors that Increase Speech Recognition Error Rates."

Two or more speech engines sometimes make an identical mistake.  For example, the speaker says "that" but both engines transcribe "hat".  When this occurs, the text compare process does not detect an output difference.  Identical misrecognitions occur less frequently with highly accurate speech recognition.  Text editor or speaker review of recognized text helps maintain a high level of accuracy. 

Why WordCheck™ for speech recognition is needed:
Improved computer processing supports faster and more reliable speech recognition and other pattern recognition.  This has led to growing use of pattern recognition, artificial intelligence, and big data analytics.  As indicated, speech recognition never makes a spelling error, so standard spell check cannot identify recognition errors.   Many programs assign numerical confidence scores to results to help users identify possible recognition errors.  However, company has found that end users have trouble interpreting and understanding confidence scores; that highlighting output text differences in red as an indicator of possible error is more easily understood by the typical end user; and that comparison of recognition output can save significant time in finding recognition mistakes. 

What the text compare interface looks like--how text compare can improve or enhance the process
The screenshot below shows transcribed dictation for a chest x-ray report.  Users can choose to view a single text that is color-coded to indicate the number of texts with different recognition.  Users can also select vertical, horizontal, or tiled display.  Speech utterances (phrases separated by a short pause) are optionally delimited by vertical purple lines in the dual-document view below.  Utterance (audio phrase) boundaries are established by SpeechMax™ analysis of the utterance waveform.  Synchronization (normalization) results in display of utterance text from two or more speech engines arising from the same audio.  Synchronized utterances each have the same audio start/stop times. 
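Since synchronized utterances share the same audio start/stop times, pairing them is straightforward to sketch.  The session-file format below (a list of start, stop, text tuples) is an assumption for illustration, not the actual SpeechMax™ format.

```python
def synchronize(session_a, session_b):
    """Pair utterances from two session files by identical audio
    start/stop times; flag pairs whose recognized text differs."""
    index_b = {(start, stop): text for start, stop, text in session_b}
    pairs = []
    for start, stop, text_a in session_a:
        text_b = index_b.get((start, stop))
        pairs.append(((start, stop), text_a, text_b, text_a != text_b))
    return pairs

# Two engines' output for the same two utterances of audio.
engine1 = [(0.0, 1.2, "chest x-ray shows"), (1.2, 2.5, "no acute disease")]
engine2 = [(0.0, 1.2, "chest x-ray shows"), (1.2, 2.5, "no cute disease")]
for segment, t1, t2, differs in synchronize(engine1, engine2):
    print(segment, t1, "|", t2, "DIFF" if differs else "")
```

The editor would then tab only to the pairs flagged as differing.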

In one approach, editor reviews text and listens to audio representing only differences.  Editor immediately can tab to text differences in red, playback audio word or phrase, select correct output for final version, and so on until complete.  Editor returns copy to dictating speaker for review.  Speaker has option to make changes and approve for final distribution, or return to editor for additional editing.  Annotation (comment) windows linked to specific document text enable the speaker to indicate the specific document text changes requested.  After any additional editing, process can format, copy into template, or otherwise prepare document for distribution. 

Sometimes errors corrected by the speaker will represent identical misrecognitions, where both speech engines mistranscribe (misrecognize) the same word the same way.  There are no identical misrecognitions in the two texts below.  More commonly, two speech engines mistranscribe the same word differently so that both are incorrect (e.g., one engine transcribes "mouse" and the other "douse" for "house").  The text differences in red indicate that there is a difference for the editor to review.  Where the speech engines each misrecognize the same word differently in a single, 10-word sentence, the editor reviews about 10% (1 x 10%) of the audio, since the difference in each text reflects the same speech audio.  Where the speech engines each misrecognize a different word in a 10-word sentence, the editor reviews about 20% (2 x 10%) of the word audio tags--the two different audio tags played back by the editor.

The editor needs to review only the same, single word audio tag when both engines misrecognize the same word in different ways.  When the speaker reviews the final text, he or she can correct any errors not corrected by the speech editor.  The speaker can make further edits to the final copy before it is distributed.  
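The review-fraction arithmetic above can be made explicit.  The numbers below are the illustrative ones from the text (a 10-word sentence), not measured figures.

```python
def review_fraction(total_words, differing_positions):
    """Fraction of word audio tags the editor replays: one tag per
    word position where the two engines' outputs differ."""
    return differing_positions / total_words

# Both engines wrong on the same word ("mouse"/"douse" for "house"):
# one differing position in 10 words -> 10% of audio reviewed.
assert review_fraction(10, 1) == 0.10

# Each engine misrecognizes a different word: two differing
# positions -> 20% of word audio tags reviewed.
assert review_fraction(10, 2) == 0.20
```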

The software can be used in different ways.  For example: 

In FirstLook mode, editor synchronizes, text compares, tabs to differences, reviews audio and text, corrects as needed, and returns report to dictating speaker.  

In SecondLook mode, editor reviews all audio and makes corrections as needed to a single session file text without use of normalization or text compare.  After completion, editor normalizes and text compares the original session files, compares them to the previously corrected session file to see if all necessary changes were made, makes any additional needed changes, and returns report to dictating speaker. 

Other variations are supported.

Text edit time without error-spotting
American Health Information Management Association (AHIMA) position paper (October 2003) indicates that it takes a speech recognition editor 2-3 minutes to listen to and edit one minute of speech recognition text.  With 1 hour (60 minutes) of dictation audio, it can take up to 180 minutes to edit the document using traditional speech recognition editor techniques.

Fewer text differences => more editor time savings
Hours or minutes saved by using text compare to correct errors depend upon speech editor efficiency with the software and upon speech recognition accuracy.    As speech engines become more accurate, time savings increase.  With 90% accuracy in both speech engines, there is an estimated reduction of up to 80% in the speech editor's audio review time using text compare.   Speech editor can reduce review time to as little as 36 minutes--a time savings of up to 144 minutes (.8 x 180).  Speaker saves time because the speech editor has edited most, if not all, errors--leaving speaker with few, if any, corrections to make.  With highly accurate systems and an expected relatively low rate of identical error, speech recognition using text compare to guide the editor to potential errors appears to have an acceptable risk rate.  Where organizations require doctors and other dictating speakers to self-correct their dictation, or where speech editors correct text for speaker review, supplementary use of WordCheck by the dictating speaker or speech editor can enhance error detection.   
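The time-savings arithmetic above works out as follows (using the AHIMA up-to-3-minutes-per-audio-minute figure and the estimated 80% review-time reduction):

```python
def edit_minutes(audio_minutes, minutes_per_audio_minute=3):
    """Traditional speech-editor time: up to 3 minutes of editing
    per minute of dictation audio (AHIMA figure)."""
    return audio_minutes * minutes_per_audio_minute

def minutes_saved(audio_minutes, percent_reduction):
    """Minutes saved when text compare cuts audio review time."""
    return edit_minutes(audio_minutes) * percent_reduction // 100

baseline = edit_minutes(60)       # 180 minutes for 1 hour of audio
saved = minutes_saved(60, 80)     # 144 minutes saved (.8 x 180)
remaining = baseline - saved      # 36 minutes of review remain
```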

* Expected audio review time refers to the expected percentage decrease in speech editor audio review time when reviewing text with differences.  Speech editor corrects most or all errors for the dictating speaker.  Text returned to the speaker generally will have about the same error rate as manual transcription with highly accurate speech recognition.  When there are no recognition errors, speech editor review time drops to zero, with no differences detected for review.  Recognition errors are not highlighted when the speech engines make the same mistake (identical misrecognition).  Bar graph shows decrease in audio review time when comparing output from two different speech engines.  It is assumed that each speech engine misrecognizes different words from those misrecognized by the other speech engine. 

More efficient correction makes organization more efficient 
Some employers make speech recognition speakers self-correct.  This reduces or eliminates back-end editing costs.  There are trade-offs, because self-correction limits a speaker's ability to do other work.   Some physicians, for example, say that they could spend more time with patients, perform more surgeries, or read more x-rays or EKGs if back-end editors made corrections.  Other speech recognition users such as physician assistants, nurses, lawyers, law enforcement, or government workers who must self-correct similarly would have more time for other tasks if they spent less time finding and correcting recognition errors.

A technology whose time has come
With improved software and faster computers in recent years, speech recognition has become highly accurate for many dictating speakers.  Speech recognition accuracy of up to 98% is obtainable for some speakers.  Occasionally there is 100% accuracy.  Even with 90% accurate recognition, there are significant speech editor productivity gains compared to use with less accurate earlier recognition technology (see bar graph above). 

Using WordCheck™ with other pattern recognition or manual processing
Company's patented process finds mistakes using the principle that output differences by two pattern recognition processes indicate a mistake by one or both.  Same logic applies to differences between three or more texts, as well as nonspeech audio, text, or image pattern recognition or manually-generated, synchronized output.  Synchronized text compare may also be applied to:
--Audio mining
--Speaker identification
--Machine translation
--Natural language understanding
--Facial recognition
--Fusion biometrics (e.g., facial + speaker recognition)
--Computer-aided diagnosis (CAD) for medical purposes
--Manually-generated transcription or translation with audio-tagged text (as may be useful for education)

Novel ScrambledSpeech™ feature supports comparison of segments of audio-aligned text transcribed from the same audio and randomly reordered to limit transcriptionist or editor knowledge of document content. 
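The reorder-and-restore idea behind this feature can be sketched as follows.  This is an illustrative sketch, not the actual ScrambledSpeech™ implementation: segments are shuffled before distribution to editors, and the permutation is kept so the corrected segments can be reassembled in order.

```python
import random

def scramble(utterances, seed=42):
    """Randomly reorder audio-aligned segments so no single editor
    sees the document in order; keep the permutation as a key."""
    order = list(range(len(utterances)))
    random.Random(seed).shuffle(order)
    scrambled = [utterances[i] for i in order]
    return scrambled, order

def unscramble(scrambled, order):
    """Restore original document order using the saved permutation."""
    restored = [None] * len(order)
    for position, original_index in enumerate(order):
        restored[original_index] = scrambled[position]
    return restored

segments = ["utt1", "utt2", "utt3", "utt4"]
mixed, key = scramble(segments)
assert unscramble(mixed, key) == segments
```

In practice, limiting how many consecutive segments any one editor receives is what limits knowledge of overall document content.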

SpeechMax™ Text Comparison--Summary

Navigate to errors quickly with SpeechMax™
--Interface synchronizes audio-tagged text
--Recognition differences indicate recognition error
--Text compare identifies recognition differences
--Editor navigates to text differences
--Correctionist (editor) reviews audio, corrects if required
--No highlighting if identical misrecognition 
--Speech editor may review only differences
--Speaker expected to review entire text

Other features
--Voice correction by second speaker of first speaker text
--Train both original speaker and correcting speaker
--Train multiple speakers
--Use for web-based, multispeaker collaborative document
--My AV Notebook™=voice-oriented electronic scrapbook
--Use for speech-oriented multimedia, karaoke, lecture 
--Edit document text and audio tags to improve audio quality
--ScrambledSpeech™ control reorders speech utterances and associated text, limits number of utterances sent to any one speech editor, and thereby protects confidentiality by limiting correctionist knowledge of document content as a whole during correction

For additional information, see About


Nonadaptive speech recognition user profile training
--SpeechSplitter™ segments audio before transcription
--Text manually transcribed from audio
--Need 6-8+ hours of text/audio to create individual profile 
--Accuracy nonadaptive > adaptive recognition, see below.
--Create acoustic and language models, lexicon
--Also create small group model for meetings or videos
--Best result if same small-group speakers over time
--Use data/models for other speech/language processing

Includes speech engine and do-it-yourself (DIY) toolkit
--Create user profile tuned to speaker's speaking style
--DIY tools for novice with minimal/no lexical expertise
--Use text to speech phonetic generator for lexicon
--Speech engine automatic lexical questions generator
--Use for virtually all languages or dialects
--Create profiles for speech impaired or medical diagnostics
--Create profile for robot speaker or imaginary languages

One-of-a-kind high-level tools for nonexperts
This do-it-yourself toolkit includes easy-to-use techniques for transcriptionists, local IT personnel, and other users to create the acoustic model, language model, and lexicon for nonadaptive speech recognition, as well as speaker-independent and speaker-dependent software.  Among other aids, it includes a text to speech (TTS) phonetic generator for phonemes.  User listens to speech, enters phoneme spellings for words and subwords into the TTS phonetic generator, plays back synthetic speech generated from the phonemes, compares the sound to the recorded speech, and modifies the phonemes if required until the synthetic speech best matches the recorded speech.  Experience indicates that consistent phoneme use is more important than specific text representation.  Software supports creation of speech recognition profiles based upon user-specific data, data from small groups (e.g., a board of directors or speakers in a video), or large groups (to create a speaker-independent group model).  Use SharedProfile™ data to create models for other automatic speech and language processing such as voice commands, text to speech, machine translation, speaker ID, and natural language processing.   High-level tools, such as the automatic pronunciation generator and automatic questions generation, reduce the need for expensive lexical expertise when creating speech user profiles. 

Higher accuracy with nonadaptive speech recognition
With nonadaptive techniques, models more accurately reflect single-speaker speech and background and channel noise.  Microsoft research (below) indicates that "massively speaker specific" nonadaptive models have higher accuracy than industry-dominant speaker adaptive technique. 
SweetSpeech™ model builder also supports custom, personalized speech user profile creation for conversation or for medical diagnostics for Parkinsonism, autism, patients with cleft lip, and other diseases affecting speech.

Increased accuracy in the bar graph below refers to 8.6% relative error reduction for nonadaptive recognition compared to the adaptive technique with less than 8 hours of training data, and 12.3% relative error reduction with more than 15 hours of training.  Word error rate (%) is about 1 percentage point lower for the nonadaptive speaker-specific model across different levels of speaker training data: an estimated 7% word error rate with training data of 12,000 sentences for the nonadaptive technique, versus 8% for the adaptive model with the same level of training data.  Similarly, relative error reduction is 45% and 52.2% for the nonadaptive approach compared to the speaker-independent model.
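The relative-error-reduction figures follow directly from the word error rates.  For example, the roughly 1-point gap cited above (8% adaptive vs. 7% nonadaptive) corresponds to about the 12% relative reduction reported:

```python
def relative_error_reduction(baseline_wer, new_wer):
    """Percentage of baseline errors removed: 100 * (old - new) / old."""
    return 100 * (baseline_wer - new_wer) / baseline_wer

# ~8% WER adaptive vs ~7% WER nonadaptive -> 12.5% relative reduction,
# in line with the 12.3% figure cited for >15 hours of training data.
print(round(relative_error_reduction(8.0, 7.0), 1))
```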

For additional information, see About


Server-based speech recognition
--Dragon, IBM, Microsoft, SAPI 5.x, SweetSpeech™
--TransWaveX™ returns text or audio-linked text
--CompleteProfile™ enrolls speaker with audio/text
--SpeechTrainer™ supports iterative corrective training
--Server transcribes audio, compares text to verbatim
--Continues until achieves target accuracy or limit iterations
--Create SweetSpeech™ single speaker or group profile  
--Integrates with Command!™ workflow system (including  CallDictate™ telephony server)
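The SpeechTrainer™-style transcribe/compare/retrain loop described above can be sketched as follows.  Here `transcribe` and `retrain` are hypothetical stand-ins for engine calls, with a simulated engine that improves after each retraining pass.

```python
def accuracy(hypothesis, verbatim):
    """Crude word-level accuracy against the verbatim transcript."""
    h, v = hypothesis.split(), verbatim.split()
    matches = sum(1 for a, b in zip(h, v) if a == b)
    return matches / max(len(v), 1)

def corrective_training(transcribe, retrain, audio, verbatim,
                        target=0.98, max_iterations=5):
    """Iterative corrective training: transcribe audio, compare text
    to verbatim, retrain, and repeat until the target accuracy is
    reached or the iteration limit is hit."""
    for iteration in range(1, max_iterations + 1):
        text = transcribe(audio)
        if accuracy(text, verbatim) >= target:
            return iteration, text
        retrain(audio, verbatim)
    return max_iterations, transcribe(audio)

# Simulated engine whose output improves after each retraining pass.
state = {"outputs": ["the hat is red", "the cat is red", "the cat is red"]}
def transcribe(audio): return state["outputs"][0]
def retrain(audio, verbatim): state["outputs"].pop(0)

iterations, final = corrective_training(transcribe, retrain,
                                        b"...", "the cat is red")
```

The loop stops on the second pass here, once the retranscribed text matches the verbatim transcript.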

For additional software product information, see About

Price, terms, specifications, and availability are subject to change without notice. Custom Speech USA, Inc. trademarks are indicated.   Other marks are the property of their respective owners. Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.).