[New site is under construction.  Currently, it displays nested content within the older site.  The legacy site header is displayed on the Home page and other pages with the new header below.  Limited content from the old site is displayed on the current Home page.  Company has created new Contact, About, SpeechMax™, SpeechServers™, and SweetSpeech™ pages.  Clicking on new header hotspots will take you to a newly redesigned page.  Support, Downloads, Resellers, Search, and Translate pages remain the same.]

Software that enhances the user experience and improves productivity, accuracy, and confidentiality.

Custom Speech USA, Inc. (CSUSA) is an integrator, software developer, and reseller of a wide range of speech solutions and development tools.


In 1998, CSUSA provided installation and training services for desktop users of Dragon NaturallySpeaking speech recognition (SR).  

Company later developed software for business systems integration, workflow, workflow designer, digital and telephone dictation, manual transcription (MT), audio conversion, and macro editor.  Speech and language processing includes real-time and server-based speech recognition (SR) and text to speech (TTS). 

Company became a certified Nuance (Dragon) reseller and a Microsoft Certified partner.   It has used Microsoft, Dragon, IBM ViaVoice speaker-adaptive speech recognition (SASR), AT&T NaturalVoices, and ProNexus (telephony) software development tools. 

CSUSA software integrates with Microsoft Word and other popular, mass-marketed products, including Philips dictation microphones, Sony and Olympus handheld recorders, VEC Electronics (Infinity) transcriptionist foot control and headset, and X-Keys (P.I. Engineering) programmable keypad. 

CSUSA's approach has enabled customers to use boxed SR software that they already owned, or to purchase less expensive runtimes from CSUSA.

Company has worked with over 75 market partner resellers in the U.S. and overseas. 

Customers have included transcription companies, physicians and hospitals, law firms, software developers, business, insurance, schools and universities, law enforcement, and government. 

For selected customer list click here.


CSUSA has received positive comments from many customers, including:

"We selected their SpeechServers™ running with Dragon NaturallySpeaking . . . .  we recommend it to anyone looking for cost-effective software for server-based  speech recognition."  Robert Duffy, Programmer, Columbia University Medical Center, New York, NY

"After implementation of web transfer server and web services for transcription, there was no significant interruption in service for the doctors or transcriptionists.  The typing ladies think the new server is wonderful . . . . "
Ian Yates, Medical I.T. Ltd., Queensland, Australia

"Since installing your product, acWAVE™ we have processed over 1,000,000 conversions (on one server alone) without a problem. . . . Thank you for such a great product."  Scott D. Stuckey, ASP Product Manager, Voice Systems, Inc. Tampa, FL


Various publications have reviewed CSUSA software, including Speech Strategy News, Law Office Computing, Law Technology News, TechnoLawyer, Proceedings of Australasian Technology Workshop (2005), and Speech in the User Interface:  Lessons from Experience (2010).

Lessons from Experience 

  • Document window supports read/write text, audio, or image
  • Representative training datasets important for accurate SR
  • Reading enrollment script often unrepresentative of daily speech
  • Dictation often discarded as useless byproduct of transcription process
  • SR systems segment and time stamp the same dictation audio differently
  • Other speech to text conversion variables (e.g., acoustic model) differ
  • Different SR text for same audio indicates higher risk of misrecognition
  • SR text matches indicate increased reliability
  • At 90% SR accuracy, misrecognitions may occur every sentence
  • SR editors often listen to mostly correctly recognized speech
  • SR confidence scores can be misleading as to accuracy
  • Especially true when comparing output from different SR systems
  • SR misrecognizes, adds, or omits words, but never misspells them
  • With improved SR accuracy, need ways to quickly "spot" potential errors
  • Similar issues with other text, audio, and image pattern recognition
  • Benefits from availability of comparison of different-system outputs
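
The point about 90% accuracy can be made concrete with simple arithmetic. The sketch below is illustrative only: the 15-word average sentence length is a hypothetical figure, and word errors are treated as independent, which real SR errors generally are not.

```python
def expected_errors(words_per_sentence: int, accuracy: float) -> float:
    """Expected number of misrecognized words in one sentence."""
    return words_per_sentence * (1.0 - accuracy)


def prob_any_error(words_per_sentence: int, accuracy: float) -> float:
    """Chance a sentence contains at least one error (independence assumed)."""
    return 1.0 - accuracy ** words_per_sentence


if __name__ == "__main__":
    words = 15  # hypothetical average sentence length, not a CSUSA figure
    for acc in (0.90, 0.95, 0.99):
        print(f"accuracy {acc:.0%}: "
              f"{expected_errors(words, acc):.2f} expected errors per sentence, "
              f"{prob_any_error(words, acc):.0%} of sentences affected")
```

Under these assumptions, a 90% accurate system averages more than one error per sentence, which is why an editor who listens to everything spends most of the time on correctly recognized speech.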

Company also developed software to support languages and dialects underserved by mainstream SR, increase SR accuracy in potentially "noisy" environments such as home living room or car, and increase low SR accuracy for meetings, interviews, or videos.

Value Proposition--Selected Features        

1.  Reduce SR editor audio review time by up to 90% with 90% accurate SR.  Use text compare to highlight differences between the output of two SR engines and direct the editor to potential errors, rather than having the editor listen to audio for text that is likely accurate . . . Time savings from decreased audio review increase with more accurate SR.  With more accurate SR and fewer errors, there are fewer differences and less audio for SR editor to review.  Expected document accuracy is about the same as from gold standard manual transcription.

2.  Improve SR accuracy with nonadaptive speaker-specific user profile compared to conventional adaptive SR . . . . Use company text compare and other tools to create highly accurate individual speech user profiles from dictation, everyday conversational speech, meetings, video audio, and other speech for individuals or small groups.  Improved SR accuracy enhances the speaker experience and reduces expense of human review.  

3.  Selectively modify SR session file document text or audio tags; this lets speaker B voice correct speaker A's session file text and supports multi-speaker collaborative document creation . . . Company's document window and annotation (comment) desktop features permit one or more users to make an unlimited number of text or audio annotations to document text to modify session content without corruption of speaker A or speaker B profile.

4.  Protect confidentiality of automatically or manually processed session file data: redact confidential text or speech, divide session file segments into 2 or more groups, and scramble session file segments within each group before outside processing . . . Offsite and remote storage and processing of electronic data raise issues about individual privacy and confidentiality.  Selective redaction, division, and scramble techniques provide an approach that balances privacy concerns and the need for efficient transcription.

Major Speech and Language Software

CSUSA software integrates with Dragon, IBM, and Microsoft Windows SASR and AT&T NaturalVoices.  Company speech recognition and text to speech software are Microsoft Windows SAPI 5.x compliant.  Potential exists for integration with other SR and TTS SAPI 5.x compliant programs.

For more information on version and operating system requirements, click here

SpeechMax™ is a multilingual, multiwindow HTML desktop speech-oriented session file editor for real-time and server-based SR and other speech and language processing.  It supports standard formatting options, spell-check, macros (for word expansion or other purposes), "undo" and "redo," disaster recovery, and custom style sheets.  It is an alternative to the user interface provided by speech technology vendors. 

Text compare differences between SR software indicate higher risk of misrecognition.  Matches indicate likelihood of greater reliability.  User can open two or more windows to compare documents.  Software improves voice correction of another speaker's SR in collaborative documents and helps protect privacy and confidentiality with remote, off-site SR correction.  Software reads/writes proprietary session file format (.SES).

Software supports:

  • Dictation record, audio playback, and manual transcription
  • Speaker-side real-time dictation and desktop server-side SR
  • Speaker or editor correction of audio-aligned SR text
  • Main document window with text and annotation (comment) window
  • Text compare across document or synchronized by phrase (utterance)
  • Correct errors and save verbatim text session file for SR training
  • Manually transcribe presegmented dictation to create SR training data
  • Text compare of results from other audio, text, or image pattern recognition

A 2002 review described TurboTranscribe™ and VerbatiMAX™ text compare functionality using Dragon NaturallySpeaking and IBM ViaVoice. 

Thanks to Custom Speech USA's SpeechMax, the days of having to choose between Dragon NaturallySpeaking and IBM ViaVoice have come to a close. Operating on the two-heads-are-better-than-one principle, a companion application called SpeechServers runs your dictation through both NaturallySpeaking and ViaVoice. SpeechMax then compares the two results, and enables you to correct the text far more rapidly than you could when using NaturallySpeaking or ViaVoice alone. SpeechMax can display a split screen containing the transcription from each program, and you can quickly select the text from Dragon or IBM that is correct for the final version. To speed up the correction process, SpeechMax's TurboTranscribe technology highlights the likely errors, and enables you to playback specific portions of the original audio by simply selecting text. (Using the optional VerbatiMAX technology, you can also compare manually transcribed text to speech recognition text to generate verbatim text for automated speech training.)  After correcting the text, you can send it and the accompanying audio back to SpeechServers for automated, repetitive training to further improve recognition accuracy. . . . See J. Pascoe, "Two Speech Recognition Programs Are Better Than One," TL Newswire (June 5, 2002)

An early diagram shows "dual-engine" text compare workflow: 

Later enhancements include (1) SpeechMax™ multidocument text compare, (2) SpeechMax™ text compare of manual or automated pattern recognition processing of synchronized source audio, text, or image data, (3) SpeechServers™ dictation audio presegmentation before manual transcription, and (4) SweetSpeech™ speaker-specific, nonadaptive speech recognition and speech and language processing toolkit.

Value proposition (SpeechMax™): 

This software is a session file processor for audio, text, and image processing.  Among other benefits, it provides significant value in three ways:  

1.  It can synchronize text from different speech recognition systems.  After synchronization, it can identify differences from the same audio (increased risk of misrecognition) and matches (increased likelihood of reliability).  This helps operator detect potential SR errors more rapidly.  It also helps the process more quickly generate data sets to train speech recognition, sometimes without need for human supervision.  Similar techniques apply to other speech and language processing, as well as other audio, text, or image pattern recognition.

2.  Software supports selective modification of SR text or audio tags.  This enables a second speaker to voice correct the SR text created by a first speaker within a speech-generated multicollaborative document.  The process uses the 1st and 2nd speaker's audio and corrected text to train the respective speech profiles for both users without corruption of the user profiles.   

3.  Software provides privacy and confidentiality protection for remote, offsite editing of SR document with speech and text redaction, document division, and scrambled session file data.
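
The first point above, comparing synchronized output from two SR systems, can be sketched in a few lines. This is a minimal illustration, not the SpeechMax™ implementation: the `Segment` fields, the normalization rule, and the sample phrases are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One audio-linked phrase (utterance); field names are hypothetical."""
    start: float  # seconds into the audio
    end: float
    text: str


def normalize(text: str) -> str:
    """Ignore case and edge punctuation so only word differences count."""
    return " ".join(w.strip(".,;:!?").lower() for w in text.split())


def compare_segments(a: list[Segment], b: list[Segment]) -> tuple[list[int], list[int]]:
    """Return (matching indexes, differing indexes) for two synchronized files."""
    if len(a) != len(b):
        raise ValueError("synchronized session files need the same number of segments")
    matches, differences = [], []
    for i, (seg_a, seg_b) in enumerate(zip(a, b)):
        same = normalize(seg_a.text) == normalize(seg_b.text)
        (matches if same else differences).append(i)
    return matches, differences


def review_fraction(segments: list[Segment], flagged: list[int]) -> float:
    """Share of total audio duration that falls in flagged segments."""
    total = sum(s.end - s.start for s in segments)
    flagged_time = sum(segments[i].end - segments[i].start for i in flagged)
    return flagged_time / total if total else 0.0


if __name__ == "__main__":
    engine_a = [Segment(0.0, 2.0, "The patient was seen today"),
                Segment(2.0, 5.0, "for chest pain"),
                Segment(5.0, 6.0, "and discharged.")]
    engine_b = [Segment(0.0, 2.0, "the patient was seen today"),
                Segment(2.0, 5.0, "for chest pane"),
                Segment(5.0, 6.0, "and discharged")]
    same, differ = compare_segments(engine_a, engine_b)
    print("review these segments:", differ)
    print("audio to review:", review_fraction(engine_a, differ))
```

In this sketch the editor would listen only to the flagged segments, and the matched segments are the candidates for training data that needs no manual verification.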

Synchronized Text Compare

1.  Using output from two or more SR programs, automated wave analysis identifies identical start/end SR session file text arising from same audio.  Retagging and resegmenting algorithm creates an identical number of synchronized segments in each session file.  Operator may text compare by segment (usually short phrase) or use traditional text compare across entire document.

2. By tabbing to differences, listening to audio, and correcting text, speech editor can more rapidly correct output.  This typically has about the same error rate as gold standard manual transcription (about 5% or less).  Text compare supports an expected 80% reduction in speech recognition editor audio review time for 90% accurate SR.  It is expected that audio review time using text compare approaches zero as SR accuracy approaches 100%, resulting in significant time savings.

3.  Text compare error-spotting technology is also supported for other audio, text, or image pattern recognition.  It also supports text compare of manual processing of delimited source data.

4.  Software can compare synchronized session file text results from virtually any source or processing method.  Synchronization requires same number of segments, not identical data content.

5. Nontext results can also be displayed and synchronized using main document windows and/or annotation feature that can open files, websites, or run programs (e.g., media player).

6. Matched text is more likely accurate and can be used for "unsupervised" training of a speech user profile without manual verification.  The process is supported, for example, whether SR uses hidden Markov models, Gaussian mixtures, neural networks, or other SR methodology.
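
The resegmenting step described in this list can be illustrated with a simple heuristic. CSUSA's wave analysis and retagging algorithm are not specified here, so the sketch below substitutes an assumed approach: keep only the segment end times that both engines agree on within a tolerance, then regroup each engine's words by midpoint. The `Word` type, the 0.05-second tolerance, and the sample timings are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float


def common_boundaries(ends_a, ends_b, tol=0.05):
    """Segment end times from engine A that engine B also reports within tol."""
    return [t for t in ends_a if any(abs(t - u) <= tol for u in ends_b)]


def resegment(words, boundaries):
    """Regroup words into exactly len(boundaries) segments by word midpoint."""
    segments = [[] for _ in boundaries]
    for w in words:
        midpoint = (w.start + w.end) / 2
        # Words past the last shared boundary fall into the final segment.
        index = min(bisect_right(boundaries, midpoint), len(boundaries) - 1)
        segments[index].append(w.text)
    return [" ".join(s) for s in segments]


if __name__ == "__main__":
    words_a = [Word("the", 0.0, 0.4), Word("patient", 0.4, 1.0),
               Word("was", 1.0, 1.3), Word("seen", 1.3, 2.0),
               Word("for", 2.2, 2.5), Word("chest", 2.5, 3.0),
               Word("pain", 3.0, 3.5)]
    words_b = [Word("the", 0.0, 0.3), Word("patient", 0.3, 1.1),
               Word("was", 1.1, 1.4), Word("seen", 1.4, 2.02),
               Word("for", 2.2, 2.6), Word("chest", 2.6, 3.0),
               Word("pane", 3.0, 3.48)]
    shared = common_boundaries([1.0, 2.0, 3.5], [2.02, 3.48])
    print(resegment(words_a, shared))
    print(resegment(words_b, shared))
```

Both outputs end up with the same number of segments even though the engines originally split the audio differently, which is the precondition for comparing by phrase rather than across the whole document.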

For more information on text compare, click here

Selective modification of session file text and audio tags, including voice correction by second speaker of first speaker's speech recognition

7.  Program supports selective modification of speech recognition text or audio tags.  This feature supports voice correction in collaborative documents and  training user profiles with modified text audio annotation pairs.  Process has other applications, including creation of speech user profiles for robotic speech using synthetic speech.

For more information on multispeaker correction in a collaborative document and other topics, click here

Privacy and confidentiality protection with speech and text redaction and session file division and scramble

8.  Software limits access to content during the correction and editing process.  It uses selective audio text redaction, division of session files into two or more groups, and scrambling of segments within the different groups.  After processing, the censored, redacted material is typically restored along with merging and unscrambling session file segments.  The process strikes a balance between privacy concerns and maintaining efficient speech editing of SR and other transcription.
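
The redact, divide, scramble, and restore cycle can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's method: segments are plain strings, confidential items are matched by a hypothetical word list, and the placeholder tags and the in-house `key` structure are invented for the example.

```python
import random


def redact(segments, confidential):
    """Swap confidential words for numbered tags; return text plus a restore map."""
    removed = {}
    redacted = []
    for text in segments:
        words = text.split()
        for j, word in enumerate(words):
            if word.strip(".,;:").lower() in confidential:
                tag = f"[REDACTED-{len(removed)}]"
                removed[tag] = word
                words[j] = tag
        redacted.append(" ".join(words))
    return redacted, removed


def divide_and_scramble(segments, n_groups, seed):
    """Deal segments into groups and shuffle each group.

    Returns (outgoing, key): outgoing holds only text and is what leaves the
    site; key maps (group, position) to the original index and stays in-house.
    """
    rng = random.Random(seed)
    indexed = list(enumerate(segments))
    outgoing, key = [], {}
    for g in range(n_groups):
        members = indexed[g::n_groups]
        rng.shuffle(members)
        key.update({(g, p): idx for p, (idx, _) in enumerate(members)})
        outgoing.append([text for _, text in members])
    return outgoing, key


def restore(processed, key, removed):
    """Merge groups, unscramble segment order, and put redacted words back."""
    merged = [""] * len(key)
    for (g, p), idx in key.items():
        merged[idx] = processed[g][p]
    for tag, word in removed.items():
        merged = [s.replace(tag, word) for s in merged]
    return merged


if __name__ == "__main__":
    dictation = ["Patient John Smith was seen today.",
                 "He reports chest pain.",
                 "Plan is follow-up in two weeks.",
                 "Dictated by Dr. Jones."]
    safe, removed = redact(dictation, {"john", "smith", "jones"})
    outgoing, key = divide_and_scramble(safe, n_groups=2, seed=7)
    for group in outgoing:
        print(group)
    print(restore(outgoing, key, removed))
```

Each outside editor sees only one group, in shuffled order and with names removed, while the map needed to reassemble the document never leaves the site.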

For more information on privacy and confidentiality protection, click here

Other productivity enhancing features

9. Session file editor implements other productivity-enhancing features.  These include synchronized speech/text for speech editor playback; automatic reassignment of word audio tags after session file correction; rapid creation of verbatim text with verbatim annotation tab; creation of separate audio-tagged final (distribution) and verbatim text (training) session files; and rapid identification of document audio without having to play the entire audio file.

10. Software provides a single, speech-oriented graphical user interface to process, compare, or report synchronized results from a variety of source input generated by computers, humans, or both. 

For more general information on the product, click here

SpeechServers™ supports server-based SR so that a speech engine can be centralized and output used at several PCs. Desktop audio file autotranscribe is also available.  Software outputs audio-linked text in original Dragon or other proprietary format.  It also optionally converts to CSUSA session file format (.SES).  Software can also output text  (.TXT).

Software supports:

  • Speech user profile enrollment for Dragon and IBM SR
  • Repetitive, iterative training for Dragon and IBM SR

  • Server-based SR for Dragon, IBM, Microsoft, and SweetSpeech™
  • Presegmentation of dictation audio before manual transcription (MT)
  • MT transcribes segmented audio to create transcribed session file

  • Verbatim transcribed session file for SweetSpeech™ user profile training 

  • Create SweetSpeech™ profile for single speaker (speaker dependent)
  • Create SweetSpeech™ small-group or large-group speaker models

Value proposition (SpeechServers™)  

1.  Software supports automated speech user profile training for adaptive and nonadaptive SR, server-based transcription, and audio file presegmentation.

2.  For adaptive SR, CommandProfile™ service enrolls and creates speech user profile with verbatim text and audio file that is characteristic of speaker's real speech, such as day-to-day dictation audio and the transcribed text.

3.  SpeechTrainer™ service provides repetitive, iterative corrective adaptive training.  It transcribes audio file, compares output with verbatim text, corrects text through correction window, and retranscribes until the output is correct or the iteration limit is reached.

4.  SpeechSplitter™ service presegments dictation speech into an audio-linked untranscribed session file (USF).  Manual transcription results in audio-linked verbatim, training transcribed session file (TSF). Same process applies to other speech, such as recorded conversation, video speech, or professional voice talent reading for audio book. 

5. SweetSpeech™ nonadaptive SR uses verbatim audio-linked TSF or other audio-linked pairs to train or update the speech user profile.

6.  Servers provide speech recognition output in the form of audio-linked session files or simple text file.  The audio-linked session file may be in a manufacturer-specific format (such as .DRA for older Dragon) or the common session file format developed by CSUSA (.SES).

7.  System is Microsoft Windows SAPI 5.x compliant but is otherwise independent of SR speech to text conversion techniques.  For example, system can process speaker-adaptive (SA) or nonadaptive SR.   
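
The repetitive training loop in item 3 (transcribe, compare with verbatim text, correct, retranscribe until correct or the limit is reached) can be sketched with a toy engine. Everything here is hypothetical: `ToyRecognizer` is a stand-in that models misrecognition as a fixed word-confusion table and is not a Dragon, IBM, or SpeechTrainer™ interface.

```python
class ToyRecognizer:
    """Stand-in for an adaptive SR engine (illustrative only)."""

    def __init__(self, confusions):
        # spoken word -> word the engine currently outputs instead
        self.confusions = dict(confusions)

    def transcribe(self, audio):
        # In this toy model, 'audio' is simply the list of spoken words.
        return [self.confusions.get(word, word) for word in audio]

    def correct(self, spoken, wrong):
        """Corrective training on one word: drop the learned confusion."""
        if self.confusions.get(spoken) == wrong:
            del self.confusions[spoken]


def train_until_correct(engine, audio, verbatim, max_iterations=5):
    """Return (corrective passes used, final output).

    Each pass transcribes, compares with the verbatim text, and corrects the
    first mismatch before transcribing again.
    """
    for passes_done in range(max_iterations):
        output = engine.transcribe(audio)
        errors = [(spoken, got)
                  for spoken, got, want in zip(audio, output, verbatim)
                  if got != want]
        if not errors:
            return passes_done, output
        engine.correct(*errors[0])
    return max_iterations, engine.transcribe(audio)


if __name__ == "__main__":
    spoken = ["patient", "seen", "for", "chest", "pain"]
    engine = ToyRecognizer({"seen": "scene", "pain": "pane"})
    print(train_until_correct(engine, spoken, verbatim=spoken))
```

The iteration cap matters because a real engine may never reproduce the verbatim text exactly; the loop stops either way and returns whatever the engine produces at that point.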

For more general information on the server-based processing, click here

SweetSpeech™ is a speech engine and model builder for speaker-dependent nonadaptive SR.  It also supports speaker-dependent small-group profiles for meetings, legal proceedings, or videos.  It is designed as a "do-it-yourself" (DIY) toolkit for transcription companies, business, law, health care, and government.  Companies and other users can create speaker models from day-to-day transcription or other speech data.

The speech and language data associated with these models is accessible to toolkit user.  It is available for use with other speech and language processing, such as voice commands, interactive voice response for telephony, speaker ID, text to speech, phoneme generation, machine translation, audio mining, or natural language understanding.  Software is Unicode compatible and supports unilingual, bilingual, or multilingual speech user profiles.

Software supports:

  • Nonadaptive (NASR) approach compared to mainstream SASR
  • Profile creation without MAP, MLLR, or similar mathematical approximation

  • Creation of single-user profile with speaker's dictation or other speech

  • Creation of small-group or large-group user profile
  • Profile based on conversational, dictation, meeting, or interview speech

  • Profile tuned to speaker speaking style, word use, and pronunciation

  • Microphone, handheld recorder, telephone/cell, video, or other speech
  • Profile reflecting recording device, background, or channel noise

Value proposition (SweetSpeech™):

1. Software includes automatic TTS pronunciation generator for phonetics editor, automatic linguistics questions generation, and other tools for automating production of speech user profiles and reducing need for expensive lexical expertise. 

2. No microphone speaker enrollment or corrective adaptation required.

3. With this system, user can create nonadaptive SR user profile from dictation, conversational, or other speech.  Process may generate training datasets from manual transcription or extraction from transcribed SR.

4. Automatic tying/untying states for creation/updates to speech user profile using accumulator and combiner techniques. 

5. A 2004 research article evaluated nonadaptive single-speaker SR user profile with less than 8 and over 15 hours of training data.  This resulted in improved accuracy of about 1% compared to adaptive SR.  It also showed higher relative error reduction of nonadaptive SR compared to speaker-independent and adaptive SR, and less rapid saturation of acoustic model compared to speaker-adaptive system. 

For discussion of this article's findings, click here 

6.  Company software supports creation of nonadaptive small group profiles for meetings, interviews, or video transcription.  Intended use is for group of 2 or more speakers that meet frequently over longer periods of time.   Group profile represents range of speech characteristics of different speakers.  In addition, software supports creation of separate speaker-dependent speech user profiles for each speaker. 

For more information on the software features, click here


Price, terms, specifications, and availability are subject to change without notice. Custom Speech USA, Inc. trademarks are indicated.   Other marks are the property of their respective owners. Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.)