Practical Solutions for Common Tasks

CSUSA, which has been awarded multiple U.S. and foreign patents, provides products and services for audio, speech, text, image, and multimedia processing for the home or office.  Major product categories include desktop utilities, session file editor, workflow, speech and language processing toolkit, PC solutions, software suites, and third-party products for speech recognition (SR) and other functions.

CSUSA software includes the SpeechMax™ session file editor; SpeechServers™ server-side SR transcription, user training, and audio presegmentation; and the SweetSpeech™ speech and language processing toolkit, including nonadaptive SR and a model builder.

CSUSA Command! software supports workflow, telephone dictation, and call management. 

Desktop utilities include the acWAVE audio conversion tool for Sony and Olympus handheld recorders and other audio; the CustomMike driver for Philips handheld and other microphones; the MacroBLASTER™ macro editor for programmable keypad, voice commands, and barcode; the PathPerfect™ workflow diagram editor; the PlayBax™ transcriptionist audio playback tool for the Infinity foot control and similar devices; the TTSVoice™ voice for text to speech; and desktop SR autotranscribe for Microsoft Vista.

See Contact to obtain specific product information.

Various speech-related and other publications have reviewed company products.  Most recently, Speech in the User Interface: Lessons from Experience (2010), a collection of articles edited by speech-industry consultant and Speech Strategy News (Tarzana, CA) editor William Meisel, included a review of company software.

Company has shown its software at various trade shows, meetings, and other venues, including ABA Law Practice Management Section, ABA TECHSHOW, American Association for Medical Transcription (AAMT), Annual Conference of the New Hampshire Medical Group Management Association, Annual Neuroimaging Symposium at the Barrow Neurological Institute, IBM Annual Conference for Speech Recognition Dealers, Legal Tech Chicago, Medical Transcription Industry Association (MTIA), Northern Illinois Physicians for Connectivity, Northwestern University Medical Informatics, OPEN MRI 2000 Conference, and Radiological Society of North America Annual Meeting.

Dragon, IBM, and Windows SR Compatibility/OS Support

Company developed its software using Microsoft Windows.  Current software is compatible with Windows XP, with limited testing on Vista or later Windows OS.  Company uses Microsoft SAPI 5.x for SweetSpeech™, Microsoft (Vista), Dragon (Nuance), and IBM speech recognition, and Microsoft and AT&T Natural Voices text to speech.  As of 2007, IBM no longer supported ViaVoice speech recognition.

Server-based and/or real-time speech recognition software supported includes Dragon Professional, Medical, and Legal 10, Dragon Preferred 10, IBM ViaVoice Professional 10, Windows Vista SR, and this company's nonadaptive SweetSpeech™.  System also supports SAPI 5.x text to speech, including Microsoft and AT&T Natural Voices.  System may run with Dragon Professional, Medical, or Legal v. 8.10.000.285 or higher, but only version 10.x is supported.  System may run with IBM Professional v. 8.x or higher, but only IBM USB Pro 10.x is supported.

Company developed the latest version of SpeechMax™ software with Microsoft .NET 2.0.  It tested this and other company software with Windows XP.  Limited testing with early Windows Vista indicated device driver issues.

Partial Customer List

AlphaBest, Tarzana, CA
American Business Systems, Keller, TX
Anthurium Solutions, Inc., Boston, MA
Arkansas Public Defender, Little Rock, AR
Associates in Oncology/Hematology, Rockville, MD
AudioEye, Inc., Tucson, AZ
Automated Business Products, Salt Lake City, UT
Avon Lake Police Department, Avon Lake, OH
BEA Systems, San Jose, CA (acquired by Oracle)
Billings Clinic, Billings, MT
Cisco Systems, Inc., San Jose, CA
City of Bloomington, Bloomington, IN
Coconino Sheriff's Office, Coconino County, Flagstaff, AZ
Columbia University Medical Center, NY, NY
DialAmerica Marketing, Mahwah, NJ
Diversified Software Systems, Morgan Hill, CA
Dyviniak Word Processing Services, Covina, CA
eScription, Needham, MA (acquired by Nuance)
FlowServe Corporation, Irving, TX
InfoComm Development Authority of Singapore
Los Angeles County Office of the Medical Examiner, Los Angeles, CA
M3 Medical Management Services, Chicago, IL
MD Notes, Sewickley, PA
Montserrat Day Hospitals, Spring Hill, Queensland, Australia
Neurology Center, Fairfax, VA
Northwest Regional Pathologists, Bellingham, WA
Open High-Field MRI and CT, Westchester, Larchmont, NY
Orthopaedic Specialists of NW Indiana, Munster, IN
Page Weavers Internet Development, Sacramento, CA
Pathology Center, Bellingham, WA
Pay-Tel Communications, Greensboro, NC
Procter & Gamble, Cincinnati, OH
QBE Insurance Group, Sydney, Australia
Reading Ware, LLC, San Diego, CA
Roberts & Schaefer, Chicago, IL
Town of Munster, Munster, IN
Vee Technologies, Bangalore, India
Villanova University, Villanova, PA
Whiteco Industries Legal Department, Merrillville, IN

Company began as a "speech trainer" for Dragon desktop users.  It later became a reseller, integrator, and developer.

More information is included below about software development milestones, a summary of software features by name, and web video demos.

Products and Services

Solutions . . . That Work For You . . .
Session File Editor--SpeechMax™
   Full Edition
   Reader/Viewer/Player
   Add-In for Microsoft Office
   Session File Processor
Desktop Utilities
   Audio Conversion
   Text to Speech
   Transcription Playback
   Sound Recorder
   Macro Editor
   Workflow Diagram Editor
   Autotranscribe for Microsoft Vista
Suites
   SpeechProfessional™
   EnterpriseSpeech™
Workflow
   General
   Web Services
   Browser-Based Work Queue
   Server-Based
      SpeechServers™ (SR)
      Text to Speech
      Audio Conversion
      Telephone Dictation
      Call Management
Other
   SweetSpeech™ (SR engine/toolkit)
   Software Development Kits
   PC Solutions
   Third-Party Products
   Services

Reseller/Integrator/Developer

CSUSA has worked as an integrator, software developer, and reseller of speech solutions and development tools.

Company software supports digital and telephone dictation, manual transcription, client-side and server-side speech recognition, voice commands, text to speech, audio mining, translation, speaker identification, natural language understanding, audio conversion, other speech and language processing, and workflow management.

Services have included custom programming for advanced speech processing, custom telephone solutions (including VoIP), and business systems integration.

CSUSA SpeechMax™ session file editor and SAPI 5.x-compliant SpeechServers™ support Dragon, IBM, and Windows adaptive SR.  Company created software for use as an add-on to boxed, off-the-shelf software that the buyer may already own, or with runtime licenses.

With SpeechMax™ session file editor, the physician, lawyer, or other speaker can dictate and view the transcribed SR text in real time.  The speaker, or a speech editor, can correct the errors in the session file editor and save the corrected text for training the speech user profile.

Alternatively, the speaker can record dictation and send it to SpeechServers™ for speech-to-text conversion.  The speaker can then correct the transcribed session file or, as with real-time speech recognition, send it for correction by a speech editor using SpeechMax™.

SpeechMax™ has a unique document window architecture with text and audio annotation (comment) windows.  Software supports easy, rapid forms creation by a secretary or other end user; multispeaker, collaborative documents; embedded session file AV multimedia; and privacy protection.

CSUSA has sold its products in the U.S. and overseas to large and small businesses directly or through over 75 resellers. 

Customers represent a variety of fields, industries, and backgrounds--transcription, health care, law, insurance, software development, education, law enforcement, government, and others. 

CSUSA has shown its software at  medical, legal, and transcription trade shows and meetings.

Various publications have reviewed its software.

Beginning in 1998, CSUSA introduced various enhancements for Dragon and other third-party speaker-adaptive speech recognition (SASR), including: 

(1) pretraining the speech user profile using dictation audio and transcription with early versions of the SpeechMax™ session file editor and SpeechServers™ to reduce speaker microphone enrollment time essentially to zero,

(2) repetitive, iterative server-based training of the adaptive speech user profile with SpeechServers™, which uses automation to make SR more accurate and reduce speaker or speech editor correction time, and

(3) speaker and speech editor automatic error detection with dual- or multi-speech-engine text compare in the SpeechMax™ session file editor, reducing speaker and speech editor search time for SR errors by 80% or more for 90%-accurate SR.  The approach applies to speaker-adaptive SR, nonadaptive SR (NASR), speaker-independent SR (SISR), and other SR.  It also applies to text compare of other manual or automatic processing of audio, text, or image data.

SR text compare uses waveform analysis to resegment and retag audio-linked text to synchronize output from different SASR, NASR, SISR, or other SR engines.  This creates synchronized session files with the same number of text segments for efficient comparison.
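As a rough illustration, the Python sketch below regroups each engine's word-level audio tags at shared utterance boundaries derived from the audio waveform, so that two engines' outputs end up with the same segment count.  The Word class, the retag_to_boundaries function, and the millisecond values are hypothetical stand-ins; the actual .SES processing is proprietary.

```
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_ms: int   # audio tag: word start time
    end_ms: int     # audio tag: word end time

def retag_to_boundaries(words, boundaries):
    """Regroup audio-linked words into segments cut at shared utterance
    boundaries (e.g., silence points found by waveform analysis)."""
    segments = [[] for _ in range(len(boundaries) - 1)]
    for w in words:
        mid = (w.start_ms + w.end_ms) // 2
        for i in range(len(segments)):
            if boundaries[i] <= mid < boundaries[i + 1]:
                segments[i].append(w.text)
                break
    return [" ".join(s) for s in segments]

# Hypothetical shared boundaries (ms) and word tags from two engines:
boundaries = [0, 1000, 4000]
dragon = [Word("the", 100, 300), Word("patient", 320, 900),
          Word("is", 1100, 1250), Word("improving", 1300, 2100)]
ibm = [Word("a", 120, 310), Word("patient", 330, 910),
       Word("is", 1090, 1260), Word("improving", 1310, 2120)]
print(retag_to_boundaries(dragon, boundaries))  # ['the patient', 'is improving']
print(retag_to_boundaries(ibm, boundaries))     # ['a patient', 'is improving']
```

Both outputs now have the same number of segments, so segment-by-segment text compare is straightforward.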

CSUSA has also created enhancements that support selective modification of SpeechMax™ session file editor document window SR text and audio tags.

Enhancements also help protect individual privacy and confidentiality.  Special tools redact (censor), divide, and scramble session file speech content before remote editing or other processing.   

CSUSA also developed SweetSpeech, a proprietary, speaker-dependent, nonadaptive speech recognition (NASR) engine and model builder toolkit.  This innovative software enables a transcription company, business, law firm, hospital, law enforcement, or government to:

(1) create SR speech user profiles from dictation and manual transcription, thereby utilizing day-to-day dictation audio that was formerly discarded as a useless byproduct of the transcription process, and

(2) use data from this interchangeable SR profile also for voice commands, interactive voice response (for telephony), speaker identification, machine translation, audio mining, text to speech, phoneme generation, and natural language understanding.

CSUSA submitted its first U.S. patent application in 1998.   

Software Development Milestones

SpeechMax™  SpeechServers  SweetSpeech™

1998: Implemented early version of SpeechServers™ server-based SR for Dragon NaturallySpeaking using SDK for real-time SR.  Enrolled speech user with audio file and no microphone enrollment.  Used early version of SpeechMax™ that synchronized dictated speech with highlighted text.  Software also included early version of comparison of SR text to verbatim manually transcribed text.  This determined speaker SR accuracy and whether the speaker should progress to the full automation phase with server-based SR and speech editor correction.
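Accuracy scoring of this kind is conventionally computed as word-level edit distance between the SR hypothesis and the verbatim text.  The sketch below is a generic Python illustration of that idea, not CSUSA's implementation:

```
def word_accuracy(sr_text: str, verbatim: str) -> float:
    """Accuracy = 1 - (word-level Levenshtein distance / reference length)."""
    hyp, ref = sr_text.lower().split(), verbatim.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_accuracy("the patient has a fever", "the patient had a fever"))  # 0.8
```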

1999: IBM ViaVoice did not accept audio file enrollment.  Company developed microphone emulation to direct an audio file to IBM and enroll the speaker.

1999: Created text compare software with a dual-engine approach using early version of SpeechMax™.  Designed to indicate likely SR errors and reduce speech editor audio review time.  Used Dragon and IBM speech engines.

1999: Implemented post-enrollment, server-based repetitive, iterative training for speaker-dependent speaker-adaptive systems with early version of SpeechServers™.  Software transcribes audio, compares the transcription to verbatim text, and corrects until the system correctly transcribes the audio for a preset number of iterations.
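The control loop can be pictured roughly as follows; the engine wrapper and its transcribe/correct methods are hypothetical stand-ins for the Dragon/IBM SDK calls the real automation drives:

```
class StubEngine:
    """Toy stand-in for a real SDK wrapper: each correction fixes one word."""
    def __init__(self, initial: str):
        self.output = initial.split()

    def transcribe(self, audio) -> str:
        return " ".join(self.output)

    def correct(self, audio, hypothesis: str, verbatim: str) -> None:
        ref = verbatim.split()
        for i, w in enumerate(self.output):
            if i < len(ref) and w != ref[i]:
                self.output[i] = ref[i]   # adapt profile toward verbatim text
                break

def iterative_training(engine, audio, verbatim, clean_passes_needed=2, max_cycles=10):
    """Transcribe, compare to verbatim, correct; stop after a preset number
    of consecutive fully correct passes or a maximum cycle count."""
    streak = 0
    for _ in range(max_cycles):
        hypothesis = engine.transcribe(audio)
        if hypothesis == verbatim:
            streak += 1
            if streak >= clean_passes_needed:
                return True
        else:
            streak = 0
            engine.correct(audio, hypothesis, verbatim)
    return False

engine = StubEngine("the patient has chest pane and fever")
print(iterative_training(engine, None, "the patient has chest pain and fever"))  # True
```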

1999: Used early version of SpeechMax™ to exclude unintelligible or unusable audio and text from the data used to train the speech user profile.

2000: Made additional refinements to dual-engine text display and speech user training.  Company also completed early versions of Command! workflow software.

2001: Utilized start/end-word text comparison technique to synchronize audio-aligned text from different speech engines.
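A minimal sketch of the start/end-word idea, purely illustrative (the shipped technique also worked with audio tag times): two segments that begin and end with the same words are treated as covering the same audio span.

```
def same_span(seg_a: str, seg_b: str) -> bool:
    """Treat two engines' segments as the same audio span when their
    first and last words match."""
    a, b = seg_a.split(), seg_b.split()
    return bool(a) and bool(b) and a[0] == b[0] and a[-1] == b[-1]

print(same_span("patient denies chest pain", "patient denies chess pain"))  # True
print(same_span("patient denies chest pain", "the patient denies pain"))    # False
```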

2003:  Extended text comparison to form dictation.

2004: Introduced "second generation session file" concept to promote more efficient synchronization of output from different SR systems.  Supports conversion of third-party SR session files to proprietary CSUSA session file format (originally .CSF, now .SES) using Dragon, IBM, or other proprietary SDK software.  Process results in the same audio-tagging for all converted third-party SR output and display in a common SpeechMax™ reader.

2004: After conversion to company format, software supports retagging start/duration times of session file segments.  This results in an equal number of segments in each session file.  Each segment has the same start/duration times as the corresponding segment in another session file transcribed from the same audio.  This supports synchronized comparison of text from different speech engines arising from the same audio.

2004: Implemented automatic audio-retagging of corrected word(s) and corresponding phrase audio to maintain audio-text alignment.

2004: Created efficient SpeechMax™ techniques for separately saving both audio-linked distribution (final) text and verbatim text (what the speaker said, including uncorrected errors or extraneous statements).  The latter is used for speech user profile training.  Updated to support saving both audio-linked verbatim (training) text and audio-linked distribution (final) text for quality assurance and other purposes.
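Conceptually, each audio-linked segment carries both texts.  One possible shape for such a record, with illustrative field names rather than the actual .SES schema, is:

```
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    start_ms: int
    end_ms: int
    final_text: str                # distribution text for the report
    verbatim_text: Optional[str]   # what the speaker actually said, if different

    def training_text(self) -> str:
        # Verbatim text, when present, replaces final text for profile training.
        return self.verbatim_text if self.verbatim_text is not None else self.final_text

seg = Segment(0, 2400, "The patient is recovering.", "the patient is uh recovering")
print(seg.training_text())  # the patient is uh recovering
```

The same verbatim-or-final rule underlies the verbatim annotation and training session file (.TRS) features described below.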

2005: Developed multiwindow, multilingual SpeechMax™ desktop graphical user interface for synchronized comparison of text output from SR and other pattern recognition.  Session file editor supports opening one or more windows; read/write text, audio, and image; and an annotation window for entry of text or audio associated to the document.  Text annotation may include a command line to run a program such as a media player or to access a website to review source data or output.

2005: Implemented session file support for opening and modifying a nearly unlimited number of files of different types in the editor.  Further, as long as session files have an equal number of segments, they may be synchronized and displayed.  For example, the process may synchronize and display sequential output from SR in the source language, machine translation to foreign-language text, and text-to-speech conversion of the translation.

2005: Introduced SpeechServers™ server-side and desktop segmentation of dictation audio before manual transcription.  Process creates a presegmented dictation audio untranscribed session file (USF).  Transcriptionist plays back USF audio and manually transcribes text to create an audio-linked transcribed session file (TSF) for creation of final (distributable) text and audio-linked verbatim text for speech user profile training.

2005: SpeechMax™ redesigned to support use of a single session file editor for human processing of data and review of computer results; supports use of a speech-oriented, common application for reporting synchronized source data from a variety of source inputs.

2005: Introduced SpeechMax™ verbatim annotation where there is a discrepancy between "final" and "verbatim" text.  Verbatim annotation creates text for training session file (.TRS) that is automatically substituted for final text.  If there is no discrepancy, final text is automatically saved as verbatim text for creation of training session file (.TRS).

2005: SpeechMax™ tools developed to mark poor quality audio or nondictated text and exclude it from speech user profile training.

2005: SpeechMax™ supports text compare of synchronized (same-segment) output results from manual or automated processing of audio, text, image, or other data.

2005: Methods developed using SpeechMax™ for efficient processing of speech from two or more speakers during a meeting, legal proceeding, interview, or video.  Process develops speech user data from manual transcription and/or SR to create a small-group user profile and individual profiles for two or more speakers.

2005: Introduced SpeechMax™ "best guess" ("best result") session file for SR and other pattern recognition.  Composite session indicates the single most likely result based upon synchronized computer and/or human (manual) results for audio, text, or image data processing.  Consolidated text compare using three texts, for example, results in the following color-coded display: nonhighlighted text (clear) is likely correct due to lack of differences, pink highlighting indicates differences between 2/3 files and increased risk of error, and red highlighting indicates differences between 3/3 sources and greater error risk.  User can determine visually from the color-coded display where there are differences and increased need to verify reliability with audio playback.  Process does not use confidence scores of results to determine the color-coded best result.
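For a single synchronized segment across three engines, the voting and color scheme described above can be sketched as follows (illustrative Python, not the product code; per the Best Session™ entry below, a designated "index" engine supplies the output when all three differ):

```
from collections import Counter

def best_result(seg_a: str, seg_b: str, seg_c: str):
    """Pick the most common output for one synchronized segment and a
    risk color: clear (3/3 agree), pink (2/3 agree), red (all differ)."""
    votes = Counter([seg_a, seg_b, seg_c])
    text, count = votes.most_common(1)[0]
    if count == 3:
        color = "clear"
    elif count == 2:
        color = "pink"
    else:
        text, color = seg_a, "red"   # all differ: fall back to the index engine
    return text, color

print(best_result("chest pain", "chest pain", "chess pain"))  # ('chest pain', 'pink')
```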

2005:  Process may use "best result" audio-linked matches (clear) for unsupervised  pattern recognition training.  For example, process may submit audio-linked text to speech engine model builder to create SR speech user profile without manual review.

2005: Completed development of SweetSpeech™ speaker-specific SR engine and toolkit.  Company software enables transcription companies and businesses to develop a highly accurate speaker-specific SR user profile.  This usually takes as little as 6-8 hours of manual transcription of a speaker's segmented audio.  Training process can also use preexisting speaker-independent and other SR software with second generation technology to create audio-linked text for the speaker-specific user profile.  Use SpeechMax™ text compare to determine relative accuracy of different potential models.

2005: SweetSpeech™ uses actual speech and actual background or channel noise to create the speech user profile.  Software does not begin with creation of a speaker-independent (SI) model from hundreds or thousands of voices, does not adapt a speaker-independent model to speaker-dependent with speaker enrollment and correction, and does not model changes in the SI model based upon estimation of voice characteristics.  Potential applications include creation of SR for voice commands in "smart" environments where it is difficult to model unique background or channel noise (e.g., airplane cockpit, car, or noisy home).  Literature supports the concept of high recognition accuracy for speaker-specific SR systems.

2005: Introduced specialized SweetSpeech™ tools for training SR models.  Tools were developed to assist nonexperts with determination of lexical pronunciation using a text-to-speech phoneme generator.  Software also provides automated linguistic questions file generation for state tying in cases of data sparsity.  Speech user profile creation and update training use accumulator and combiner technology to train systems.  State untying developed for training updates to the user profile.
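Accumulator/combiner training in general splits model estimation into per-batch sufficient statistics that are merged before the update.  The toy Python sketch below shows the principle for a single mean/variance estimate; real acoustic model accumulators hold per-state Gaussian statistics, and this is not the SweetSpeech™ API.

```
def accumulate(frames):
    """Sufficient statistics for one batch: count, sum, sum of squares."""
    n = len(frames)
    s = sum(frames)
    s2 = sum(x * x for x in frames)
    return (n, s, s2)

def combine(accumulators):
    """Merge batch statistics, then derive the model update (mean, variance)."""
    n = sum(a[0] for a in accumulators)
    s = sum(a[1] for a in accumulators)
    s2 = sum(a[2] for a in accumulators)
    mean = s / n
    return mean, s2 / n - mean * mean

acc1 = accumulate([1.0, 2.0, 3.0])
acc2 = accumulate([4.0, 5.0])
print(combine([acc1, acc2]))                              # (3.0, 2.0)
print(combine([accumulate([1.0, 2.0, 3.0, 4.0, 5.0])]))   # same result in one batch
```

Because merged statistics give the same update as training on all data at once, batches can be accumulated on separate machines or at different times and combined later.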

2005: SweetSpeech™ tools created to use training session file data (.TRS) to create a speech user profile for SR with interchangeable speech user data for SR and other speech and language processing, such as voice commands, call center telephony, speaker ID for voice biometrics, personalized text to speech, phoneme generation, audio mining, machine translation, and natural language processing for automated content extraction.

2005: Applied SpeechMax™ text comparison and synchronization techniques to manual transcription and/or automated SR of multispeaker, single-channel audio.  This represents speech recorded with a single microphone for a meeting (e.g., board of directors), interview, or group video.  Manual transcriptionist processes untranscribed session file (USF) to create transcribed session file (TSF), associating each segment to the speech of a particular speaker.  This results in creation of verbatim audio-aligned TSF data training sets for speech user profile training for an individual speaker and a multispeaker, small-group speech user profile.  Use multispeaker profile and each individual speaker profile to transcribe group audio and text compare to determine likely correct output.

2006: Introduced SpeechMax™ selective modification of document SR text or audio tags with text or audio using annotation window.  This enables speaker B to correct the transcribed text of speaker A and train both speakers' user profiles.  Applies to simple or complex documents, including session files with text, graphics, sound, or embedded audiovisual content (e.g., lectures, sales presentations, or electronic audio books).  Second speech user can correct the text dictated by a first user with the original session file open in the buffered read/write document window with no corruption of either speech user profile.  System can use verbatim corrected text and first and second speaker audio to train the profiles of the respective speakers ("annotation training").

2007: SpeechMax™ session file lock introduced to preserve document integrity and create read-only, portable session file.

2007: Implemented SpeechMax™ session file editor divide-and-scramble content feature and audio/text redaction to protect confidentiality and privacy; session lock protects data integrity.  Optional divide feature separates session file segments into two or more groups that may be sent to two or more processing nodes; division limits any single transcriptionist's, SR facility's, or SR editor's knowledge of the entire content.  Segments of each division may be scrambled (reordered) to obscure meaning and content.  Each segment generally represents an utterance, so the manual transcriptionist or speech editor has sufficient audio context to assist with transcription or correction.  Redaction is available for speech recognition audio and text for names, addresses, and other personal identifying information.  Workflow unscrambles corrected segments, reinserts redacted content (if applicable), and reassembles to create the final, distributable document.
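The scramble step only needs a recorded permutation to be reversible.  A minimal Python sketch, illustrative rather than the shipped workflow:

```
import random

def scramble(segments, seed):
    """Reorder segments with a seeded shuffle; return the order as the key."""
    order = list(range(len(segments)))
    random.Random(seed).shuffle(order)
    return [segments[i] for i in order], order

def unscramble(scrambled, order):
    """Invert the permutation recorded by scramble()."""
    restored = [None] * len(scrambled)
    for position, original_index in enumerate(order):
        restored[original_index] = scrambled[position]
    return restored

segments = ["utterance 1", "utterance 2", "utterance 3", "utterance 4"]
mixed, key = scramble(segments, seed=42)
assert unscramble(mixed, key) == segments  # round-trips after remote transcription
```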

Summary of Features

Major Features SpeechMax™  . . .

Company has designed this multiwindow, multilingual desktop text editor for manual or automatic speech and language processing.  The software supports advanced text comparison by an editor of synchronized output text created by manual transcription or speech recognition, to more rapidly detect likely errors made by the manual transcriptionist and/or speech recognition.  The techniques were originally developed for text comparison of transcribed dictation.  Given the increasing volume of data processed manually or by computer, company has extended text comparison to include conversion, interpretation, or analysis of other synchronized text, audio, or image pattern recognition source data processed by manual or automatic means or both.

General

Speech-Oriented Desktop Graphical User Interface . . . Microsoft Windows-compatible multiwindow, multilingual (Unicode) session file editor; can read/write .RTF, .TXT, .HTML, audio, image, and .SES proprietary session files.  Software supports one or more main document windows with text and audio annotation (comment) windows.  Text annotation supports creation of one or more hyperlinks linked to specific document text.  Text annotation may also represent a command to run a program, such as a video player.  Program supports real-time and server-based speech recognition, plus other speech and language processing.  It is designed for use as a common interface for dictation, speech recognition, voice commands, speaker recognition, translation, text to speech, audio mining, natural language understanding, and phoneme generation.  It also supports comparison of synchronized results from other text, image, and audio pattern recognition, including medical imaging.  Easy-to-use tools are available for training a shared speech and language user profile.  A transcriptionist or other specialized member of the transcription team can serve as a "speech trainer" to help create the profile.  Application programming interface (API) and developer tools are available.

Web Help

1. Data Synchronization and Text Comparison

Company text comparison techniques rely upon synchronization of delimited, bounded input data and comparison of the delimited output data resulting from the same input data.  This differs from standard text comparison, which compares text without reference to source data.
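The contrast can be sketched in a few lines of Python: because synchronization guarantees equal segment counts over the same delimited input, outputs are compared segment by segment, and every difference stays anchored to its source audio rather than floating in a global diff.  The function and data are illustrative only.

```
def compare_synchronized(file_a, file_b):
    """file_a and file_b are lists of segment texts produced from the SAME
    delimited input; synchronization guarantees equal length."""
    assert len(file_a) == len(file_b)
    return [(i, a, b) for i, (a, b) in enumerate(zip(file_a, file_b)) if a != b]

a = ["the patient", "denies chest pain", "will follow up"]
b = ["the patient", "denies chess pain", "will follow up"]
print(compare_synchronized(a, b))  # [(1, 'denies chest pain', 'denies chess pain')]
```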

PlaySpeech™ . . . Selection of speech recognition word or phrase and synchronized playback of selected text, also supports playback of a selected utterance of an untranscribed session file (representing segmented audio only) or a manually transcribed session file

VerbatiMAX™ . . . Compare manually transcribed verbatim text file with speech recognition session text to reduce search time for recognition errors and create audio-linked training text

Word Check™ . . . Text comparison of audio-aligned output from two or more speech engines and/or audio-aligned manually transcribed text, identifies likely speech recognition or manual transcription errors, and reduces transcriptionist search time, for continuous narrative dictation and structured dictation into form (also can text compare output processed automatically or manually from machine translation or other text, audio, or image pattern recognition, displays text in relation to aligned source data input)

SpeechLocate™ . . . Technique for locating identical audio segment using text comparison by matching start and end words, also use for audio mining

Second Source™ . . . Text compare document containing “boilerplate” and other standard language manually or automatically transcribed from previous audio file or continuous speech with similar document containing standard language created from new audio file or real-time speech

Second Generation Session File™ . . . Creation of a second session file having audio-synchronized text after readjustment of text audio tag information based on analysis of audio file wave, for example, process can realign asynchronized transcribed session file segments from Dragon NaturallySpeaking and IBM ViaVoice speech recognition processing of same audio, Dragon and IBM "second generation" session files have equal number of synchronized segments in retagged session files, "second generation" file represents a form of standardized output that can be used to compare other speech recognition or create a portable session file for speech recognition or other audio-linked text that can be opened in a common user interface
 
SpeechParts™ . . . Dividing techniques based upon a predetermined boundary division may result in audio tags for utterances, for words, or for word subunits
 
tcReliability™ . . . Word mapping tool to formulate statistical reliability that a text compare matched phrase represents correct transcription and, based on this statistical reliability, have transcriptionist determine the need to evaluate a matched word or phrase with audio playback (not currently available)

DataInSync™ . . . Synchronized results from text, audio, or image pattern recognition processed by manual or automatic means, indicates that each output session file arising from the same bounded input data has the same number of segments

Best Session™ . . . Software synchronizes output from two or more session files to create a most-likely "best result" representing most common output, or selected index engine output if all different, color-coded best result highlights differences between speech recognition texts, for example, for a given word or phrase transcribed from same audio, no highlighting of best result word or phrase indicates no difference between 3 speech recognition texts and low risk of error, pink highlighting indicates differences in 2/3 texts and moderate risk of error, and red highlighting indicates differences in 3/3 texts and high risk of error, editor reviewer may quickly determine visually where likely errors are by looking at highlighted best result text without having to review each individual text and may selectively tab to likely errors, listen to audio, and correct, best result composite may also represent session file output from manual or automatic processing of text, audio, or image input data

2. Creation of Final (Distribution) and Verbatim (Training) Text

Session file editor includes standard tools for text editing, plus additional specialized tools for creating a shared speech user profile for speech and language processing.  Some of these specialized tools are described below and further on the website.

SpeechSplitter™ . . . SweetSpeech™ feature, runs as server or desktop SpeechMax™ SR plugin, creates segmented, untranscribed session file (USF) from audio file, manual transcription converts it to audio-aligned transcribed session file as report and/or training session file for acoustic model

Instant Verbatim™ . . . Optional simultaneous creation of final (distribution) and verbatim text with document transcription, add verbatim training text as annotation if there is a difference between final distribution text and verbatim (e.g., where speaker made a mistake or extraneous comments that are not included in final text)

SpeechTracker™  . . . .  Identifies corrected session file text segment and automatically reassigns adjusted word audio tags to final and verbatim text

Best Speech Only™ . . .  Operator may exclude poorly audible speech, speech with considerable background or channel noise, or "ahs" and "uhms" from training session file, similarly, operator may also exclude text for headers, footers, and other nondictated text from training material

Text Split Plus™ . . . Alternative method of creating verbatim and/or distribution text, begins with SpeechSplitter™ untranscribed session file, may import text and open in SpeechMax™ document window, playback untranscribed session file and associate phrase text to each audio utterance to create audio-aligned text for electronic audio book or other audio-aligned text

SessionTraining™ . . . Training session file from document window representing delimited output data and delimited text, audio, or image input data; with speech recognition or manual transcription, session training file represents document window audio-tagged verbatim text 

3. Annotations, Forms, and Collaborative and Complex Documents

Operator may use text processing, data synchronization, and text comparison tools described above with more complicated documents.

Annotations (General) . . .  Both audio and text annotations (comments) are supported, one or more users may enter text associated to specific document text or space in document, text annotations may include messages, hyperlinks, or run program command

The Talking Form™ . . . Forms and templates  "made easy," generate audio prompts for user (audio annotations), enter text with keyboard, barcode, or speech recognition, includes integrated sound recorder for dictation and transcription audio playback, more than one speaker may complete form, use dictated speech and text for speech user profile training, use speech recognition to correct speech recognition text of another speaker, no corruption of either speech user profile, use corrections to train both speech user profiles, also selectively modify speech recognition document text audio-tags, associate form field name or other document text to one or more multilevel hyperlinks, may use annotations to provide knowledge base or information to form user, associate document text or form field name to text annotation to launch media player or other programs with command line

My AV Notebook™  . . . Software for "do-it-yourself" presentations, create single- or multi-speaker multimedia, session file may include nondictated headers and footers, speech-aligned text, synchronized lyrics and music, photos, graphics, video, or animation, may associate one or more hyperlinks or command line to specific document text, launch websites, video player, or other programs, interactivity with mouse, keyboard, barcode, and voice commands, synchronized real-time highlighting audio book text with audio, user can switch between listening and reading, elapsed time display makes it easy to return to audio book to read or play, use same techniques for illustrated lectures and classroom teaching aids, sales presentations, speeches, electronic scrapbooks, singalongs and karaoke, use speech text data to create text to speech voice font for professional voice talent, actors, politicians, and other speakers

SelectiveEdit™  . . . Selectively modify session file document audio tags or text through annotation window, concatenate annotation audio to train speech user profile, speaker B may edit speaker A speech recognition with no corruption of either speech user profile, switch tags functionality (transpose document and annotation text or audio), train user profile speaker A with corrected text and speaker B profile with speaker B's speech and associated text 

AnnotationTraining™ . . . Generate annotation training by selecting text audio annotation pairs by annotation ID and training speech user profile of one or more users, different from general session training, which uses document verbatim text and playback audio to train the speech user profile

4. Other Applications

Data input may consist of speech, text, handwriting, fingerprints, medical and other digital images, music, and other data that may be processed using pattern recognition.   There is a need for a single application to support conversion, interpretation, analysis, comparison, and reporting of synchronized speech, text, audio, or image data that has been processed manually, automatically, or by both.  Note that sequential reports based upon same data may be synchronized. 

Medical Imaging . . . Training a representational model for a computer-aided diagnosis program having pattern recognition capabilities based upon human evaluation of bounded data sets

• Computer-aided diagnosis (CAD) for Mammography . . . Use mammogram reporting session file to evaluate bounded CAD-determined suspicious areas by two or more radiologists

Other Pattern Recognition . . . Data input from diverse sources including text, audio, and image, real-time and recorded, human and mechanically-generated audio, single-speaker and multi-speaker supported

For example, two or more computer-aided diagnosis programs for mammography may analyze the same delimited, bounded suspicious areas and grade each according to likelihood of cancer.  A breast radiologist may review and dictate a final report in English synchronized to the initial data input.  This may be followed by machine translation into Spanish synchronized to the English and corrected by an office editor.  All the session files may have equal segments, such that a SpeechMax™ user may load all the session files, select one file, tab through the selected file, and sequentially highlight each segment in that file along with the synchronized segments of the other files.  Regardless of data type or processing, session files with equal segment numbers may be synchronized.

5. Privacy and Confidentiality, Document Integrity

ScrambledSpeech™ . . . Dividing and scrambling session file content prior to distributing for processing to limit any single human or automated processing node’s knowledge of document content

SpeechCensor™ . . . Selective redaction of audio and text

SessionLock™ . . . Locked session file, read-only

Major Features SweetSpeech™ . . .

Speech recognition for dictation is a form of pattern recognition.  Speaker-independent speech recognition uses audio, text, and pronunciation data from many different speakers to create a multispeaker speech user profile.  For many years, leading speech recognition software has supported speaker-dependent user profile creation using speaker-adaptive techniques.  Examples include Dragon NaturallySpeaking, Philips SpeechMagic, Microsoft Windows speech recognition, and IBM ViaVoice (no longer supported by IBM).

Speaker adaptation uses microphone enrollment by reading one or more scripts to create a new user and correction of day-to-day speech recognition errors to further train (adapt) the new user.  The new speech user profile is more speaker-dependent than the original model.  The adaptation process uses MAP (Maximum A Posteriori) or MLLR (Maximum Likelihood Linear Regression) or other estimation techniques to create an acoustic model approximating the characteristics of the speaker’s speech.  It is an estimation that does not model the actual speaker's speech or actual background or channel noise. 

The resulting speech user profile includes data about the statistical probability of basic speech sounds following and preceding others (acoustic model), the order or combination of particular words (language model), and text representation of speaker word pronunciation (lexicon).  Off-the-shelf speech recognition generally makes user profile data transparent (invisible) and inaccessible to the speaker or developer.  These users cannot use this data to create other models for other speech and language processing, such as voice commands, speaker recognition, or text to speech.  For a snapshot in time of industry software and practices, see Speech Strategy News sample issue newsletter.

Nonadaptive speaker-specific speech recognition creates speech user acoustic and language models and lexicon based only on the speaker's speech.  Research indicates  that speaker-dependent speaker-specific speech recognition using just the speaker's speech data is more accurate than speaker-adaptive speaker-dependent or speaker-independent models.  In addition, with more open access to speech user profile data, speakers and developers could use the data and models for other speech and language processing, such as voice commands, interactive voice response for telephony, speaker identification, text to speech, and other speech and language processing.   

Company provides an alternative with SweetSpeech™ toolkit and speech recognition engine with:

Just My Speech™ . . . Use this company’s toolkit and text editing software to create data sets for speaker-specific speech user profile from microphone, telephone, or mobile device recorded speech, company do-it-yourself tools are designed to assist software developer, local information or transcription services, or advanced end user in creating a user profile based upon the speech of a single speaker, including speaker with poor accuracy with speaker-adaptive software despite training and correction, speaker with no speech recognition available for language or dialect, or speaker with an unusual accent or speech impediment 

Multiple Speech Recognition Options . . . May also create custom group user profiles for small group of speakers meeting over time with speech recorded using a single microphone or other recording device (multispeaker, single channel), e.g., corporate board meetings, lengthy depositions or trials, groups subject to law enforcement or national security surveillance, software also supports creation of large group multispeaker speaker-independent speech recognition

SharedProfile™ . . .  Includes acoustic and language models, lexicon, formatting preferences, and audio segmentation as shared data for speech recognition, voice commands, interactive voice response for telephony (IVR), speaker recognition, audio mining, text to speech, phonetic generation, machine translation, and natural language processing, users and developers can use the shared models as "interchangeable parts" for other speech and language processing

Model Builder . . . Software includes tools for creating personalized acoustic model, language model, and lexicon, Unicode compatible software supports speech and language processing in many languages

SpeechScape™ (Acoustic Model) . . . Build, train, and update one or more acoustic models with audio and verbatim text, use speaker's speech and text for speaker-specific training, utilize data from business or professional dictation, YouTube, Facebook, or other web audio, or cell phone or other telephony, use accumulator and combiner techniques to create and modify acoustic model, automatically generate linguistic questions file in cases of data sparsity, software includes tying and untying states for creation and combining one or more accumulator files, model creation generally requires 6-8 hours of good quality audio data, recommend minimum 8 kHz/8-bit for telephony, 16 kHz/16-bit for other audio, may also use software to create small group and large group models, create and store acoustic model locally or on the cloud, use acoustic model also for voice commands, interactive voice response for telephony, speaker recognition, audio mining, text to speech, and other speech and language processing

WordContext™ (Language Model) . . . Add text samples, build model, revise model based upon use of different or additional samples, use trigram or other N-gram modeling for speech recognition, may also create custom user profile for bilingual or multilingual speaker who dictates in two or more languages simultaneously, use language model tool for gisting translation (providing a rough idea of what speaker says), also use language model categories tool for semantic processing for natural language understanding and key concept search

WordPronounce™ (Lexicon Model) . . . Add or remove words to lexicon, lexicon associates each word in language model to phonetic pronunciation, audio pronunciation generator uses text-to-speech to assist in generating lexical (phonetic) pronunciation, use for speech recognition and other speech and language processing  

Segmentation Speech and Text . . . May use same speech segmentation parameters for speech recognition, voice commands,  interactive voice response for telephony, speaker recognition and audio mining, and same text segmentation rules for text to speech, machine translation, and natural language processing, may use common segmentation parameters for two or more instances of manual and/or automated processing of same dictation audio, common segmentation results in synchronized data reflecting equal number of input dictated phrases or sentences (utterances) and output text segments (consisting of audio-aligned text)

Formatting . . . Unformatted speech recognition engine text “forward” formatted in postprocessing step before speaker or transcriptionist views text (e.g., converts transcription "fourth of July, eighteen sixty five" to "July 4, 1865"), formatted text is “reverse” formatted when text submitted for training acoustic model, toolkit supports end user selection of preferred formatting for dates, measures, weights, currencies, telephone numbers, and other text
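As a concrete illustration of forward formatting, the Python sketch below converts the spoken-date example above using a tiny lookup table.  Real rule sets are far larger and configurable; this is not the toolkit's actual rule engine, and reverse formatting would apply the mapping in the opposite direction before training.

```
import re

ORDINALS = {"first": "1", "second": "2", "third": "3", "fourth": "4"}
HUNDREDS = {"eighteen": 18, "nineteen": 19}
TENS = {"sixty": 60, "seventy": 70}
UNITS = {"five": 5, "six": 6}

def forward_format(raw: str) -> str:
    """Convert a spoken-form date like 'fourth of July, eighteen sixty five'
    to written form 'July 4, 1865'; return input unchanged if no rule fires."""
    m = re.fullmatch(r"(\w+) of (\w+), (\w+) (\w+) (\w+)", raw)
    if not m:
        return raw
    day, month, c, t, u = m.groups()
    if day in ORDINALS and c in HUNDREDS and t in TENS and u in UNITS:
        year = HUNDREDS[c] * 100 + TENS[t] + UNITS[u]
        return f"{month.capitalize()} {ORDINALS[day]}, {year}"
    return raw

print(forward_format("fourth of July, eighteen sixty five"))  # July 4, 1865
```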

User Management and other Tools . . . Speech and language processing toolkit includes tools for user management and user settings, plus new user and train new user wizards

SweetSpeech™ SAPI 5.x Text to Speech (TTS) and Speech Recognition (SR) Plugins for SpeechMax™ . . . Supports SpeechMax™ desktop processing with Microsoft or AT&T Natural Voices text to speech, or Microsoft and SweetSpeech™ speech recognition, and with other SAPI 5.x-compatible software

SpeechSplitter™ . . . SweetSpeech™ feature, runs as remote server function or as desktop SpeechMax™ speech recognition plugin, see discussion above under SpeechMax™

Web Help 

Major Features SpeechServers . . .

Software supports Microsoft SAPI 5.x server-based speech recognition with SweetSpeech™, Microsoft Windows speech recognition, Dragon NaturallySpeaking, and IBM ViaVoice (early versions only, no longer supported by IBM).  Potentially supports other SAPI 5.x speech recognition.  Desktop autotranscribe is also available for the SweetSpeech™, Microsoft, and Dragon engines.

CompleteProfile™ . . . Pretrain speech user profile with verbatim text and audio file to eliminate traditional microphone enrollment for speaker-adaptive speaker-dependent systems

StealthSpeech™ . . . Use microphone emulation to enroll speech users with audio file instead of traditional speaker enrollment where system requires some form of microphone enrollment

SpeechTrainer™ . . . Automate post-enrollment corrective adaptation to more quickly improve accuracy by transcribing audio, selecting differences between transcribed output and provided verbatim text, applying corrections, and repeating until appropriate conditions are met (target accuracy, unable to correct further, maximum number of cycles) (supported with early versions of Dragon NaturallySpeaking and IBM ViaVoice only)

TransWaveX™ . . . Dictation audio with text output only

SaveSession™ . . . Dictation audio with audio-linked text (session file) output, use audio-linked verbatim text for speech user training

Web Help

Demo #1A shows SpeechSplitter™ utterance (phrase) segmentation to create untranscribed session file (USF) from dictation.  Transcriptionist manually transcribes in SpeechMax™ to create transcribed session file (TSF) using PlaySpeech™ functionality.  Demo also shows realignment of a segment boundary marker to include audio for "period" with the larger adjacent utterance.  Flash WMP
 
Demo #1B shows SpeechSplitter™ utterance (phrase) segmentation to create untranscribed session file from dictation.  This demo shows a case where the transcriptionist would like to work with fewer vertical utterance markers.  Transcriptionist imports previously transcribed text, sequentially listens to each untranscribed utterance using PlaySpeech™, and sequentially delimits each utterance by toggling the play audio control.  The result is a transcribed session file as above.  Flash WMP

Demo #2 shows server-based transcription using prototype SweetSpeech™ speech recognition.  In-house staff created speech user profile with SweetSpeech™ speech and language processing toolkit.  Video first shows text immediately after speech-to-text conversion (raw speech engine decoding).  This is followed by regular-expression algorithms that search and match text strings.  Conversion rules may reflect speaker or institutional preferences.  Speech user profile typically reflects these preferences.  User loaded post-formatting transcribed session file (TSF) into SpeechMax™ to play back audio and make any needed corrections.  Flash WMP

Demo #26 Click here for Flash or WMP video demo of dictation with Olympus handheld recorder and remote Dragon server-based speech recognition.

Demo #27 Click here for WMP video demo using Windows Vista speech recognition and desktop autotranscribe.

Demo #3 shows single-window dual-engine comparison using server-based SweetSpeech™ and Dragon NaturallySpeaking.  User sequentially opens Dragon and Custom Speech USA™ session files, clicks compare documents toolbar button to highlight differences, plays differences using menu dropdown, makes changes, increases leading/trailing playback to listen to the word "of," copies/pastes "well-maintained" from Dragon to format text, and enters new lines.  Operator saves final distribution report as .TXT.  Lowercase "l" was transcribed by both engines and capitalized as "L" for report distribution.  Since user did not create a separate verbatim annotation, the final text is automatically saved as the verbatim text with Instant Verbatim™ feature.  Flash WMP

Demo #4 shows double-window dual-engine comparison of Demo #3 session files.  Operator selects toolbar window icon to horizontally display  audio-aligned text.  Operator specifically references option of play entire phrase (including difference) as opposed to playing difference only.  Flash  WMP

Demo #5  shows double-window SpeechMax™ VerbatiMAX™ comparison of uncorrected speech recognition transcribed session file (TSF) with verbatim text .TXT.  Any difference represents an error.  Text comparison reduces transcriptionist review time.  This supports batch, automated correction of transcribed session file (TSF) to verbatim transcribed session file for speech user profile training.  Flash WMP  

Demo #6 shows one potential application of text comparison of student interpretation, conversion, or analysis of bounded input data in training and education.  In one approach, a medical transcription (MT) instructor can create a verbatim transcribed session file text key and compare it to student output.  Text comparison identifies errors in spelling and punctuation and generates an accuracy rate.  With audio-tagged text, instructor and student do not have to search/replay any part of the original audio file.  Multilingual SpeechMax™ supports medical transcriptionist training in multiple languages.  Best Session™ best result composite display provides the instructor with a quick, visual estimate of the errors made by two or more students.  Accuracy levels are determined automatically.  Flash WMP

Demo #7A shows proof-of-concept prototype software for elementary school phonetics training ("phonics").  Local elementary school teacher requested development of prototype to show local school board as proof of concept.  The video for "r" phonics shows how teacher can customize training with web-based resources and use text comparison.  Teacher can customize training to child's needs and locate and make available web resources using multilevel text annotations to specific document text.  Accuracy levels are determined automatically.  Flash WMP

Demo #7B shows proof-of-concept prototype for language training related to teaching chants for Old Testament Hebrew.  Prototype shows how pronunciation (chant) can be customized and personalized by instructor.  Prototype uses Hebrew-English translation available from World ORT.  English is read from left to right.  Hebrew is read from right to left.  Software synchronizes English, Hebrew, and instructor chant.  Instructor can provide informational audio or text comments to help student.  Software can access one or more related web sources to assist student.  Flash

Navigating the Bible II © 2000 World ORT, London, UK. US contact World ORT, New York, NY.       
Demo #8 shows opening the MT Desk website for a medical transcriptionist in training learning about prostate cancer treatment.  This functionality can be used to launch a video player or any other program.  Flash WMP

Demo #9 shows use of spell check supplemented by audio annotation pronunciation of medical term for student medical transcriptionist. Flash WMP

Demo #10 shows use of an employee information form and data migration to Microsoft Word.  Audio annotation (blue highlighting) supports text and/or audio entry.  Text annotation (purple highlighting) supports text-only entry.  Flash WMP

Demo #11 shows form creation.  Form creation generally involves entry of field name and creation of audio annotation within otherwise empty session file segment.  Flash WMP

Demo #12 discusses The Talking Form™ and audio prompt creation for form user.  Flash WMP

Demo #13A illustrates SpeechMax™ Microsoft Office Toolbar Add-In with data transfer to/from Microsoft Word and SpeechMax™.  Using the software, a speaker can dictate into Microsoft Word with Dragon speech recognition, transfer text and audio data to SpeechMax™ for transcription, and migrate data back to Microsoft Word.  User can also enter data into SpeechMax™ and migrate it to Microsoft Word.  WMP

Demo #13B illustrates SpeechMax™ Microsoft Office Toolbar Add-In with transfer of XML data among the Continuity of Care Record (CCR), Microsoft Word, and SpeechMax™.  Add-In supports download/upload of XML data to/from Microsoft Word and CCR.  Demo represents proof-of-concept workflow.  Video demonstrates downloading data into Word using the Add-In and modifying data based upon written or dictated information.  If dictated, transcriptionist may play back dictated audio using SpeechMax™, PlayBax™, or other software and transcribe into Word.  Alternatively, user may dictate into Microsoft Office using speech recognition.  Data may be transferred to SpeechMax™ and modified using dictation/transcription, speech recognition, keyboard, or barcode.  Modified data may be transferred to Word and uploaded directly into CCR using the Add-In.  Alternatively, operator may modify in Word before upload.  WMP

The first 7 examples demonstrate how user can create complex presentations with speech, audio, dictated and nondictated text, and graphics for an electronic audio book, electronic scrapbook, presentation on segmenting dictation, lecture on geography of Ireland, presentation of the Gettysburg Address, sales presentation, and language instruction for introductory German.  Only the final example (singalong or karaoke) shows a completed presentation.
 
The last example shows a short AV presentation using "Stairway to Heaven" by Led Zeppelin.  In the last presentation, note highlighted, audio-synchronized text, document window elapsed time display, and slider bar available for user adjustment of play location.  Click on presentation hyperlinks to see web page images; Flash or WMP video is also available for all except the segmenting dictation lecture.

Electronic audio book:  Romeo and Juliet Demo #14A Flash WMP
Electronic scrapbook:  One Fabulous Vacation Demo #14B Flash WMP
Lecture #1:  Segmenting Dictation   Also see related  Demo #1A Flash WMP
Lecture #2:  Geography of Ireland  Demo #15 Flash WMP
Speech: Gettysburg Address  Demo #16 Flash WMP
Sales presentation:  Pet Palace  Demo #17 Flash WMP
Language instruction:  Introductory German Demo #18 Flash WMP
Singalong/karaoke:  Stairway to Heaven (Led Zeppelin)  Demo #19 Flash WMP

"Stairway to Heaven," produced by Jimmy Page, executive producer Peter Grant, © 1971 Atlantic Recording Corporation for the United States and WEA International Inc. for the world outside of the United States.

Demo #20 video shows operator copying and pasting English (source) text into web-based machine translator.  User delimits the Spanish (target) translation output with vertical placeholders.  Operator clicks the synchronize session tags button.  Clicking in an English source text segment highlights the corresponding translated segments.  Flash WMP

Demo #21 video shows operator tiling document windows horizontally and synchronizing French translation with English source.  Operator opens third document window for previously delimited Spanish translation.  Operator synchronizes Spanish translation with both English and French translations and synchronizes French and Spanish translations.  Operator selects each English segment sequentially and confirms synchronized highlighting of French and Spanish segments.  Flash WMP 

To compare accuracy of translation, operator may repeat the process by substituting identically delimited French and Spanish translations from a different manual or automatic translation source.  Thereafter, operator may text compare against the initial translations or another standard.  Initial translations are shown in the screen shot below.

Demo #22 shows use of multilevel annotations for a multilevel knowledge base for creation of a nondisclosure agreement and efficient document assembly with SpeechMax™.  In example, user selects phrases from various "fill-in-the-blank" alternatives.  User text compares alternative form selections.  User can also use color-coded composite Best Session™ to visualize the variability in selection choice for each field.  Color coding shows agreement between source documents in knowledge base: red indicates considerable difference, pink minimal difference, and clear no difference.  Knowledge base may be utilized by law firm, business, or other organization.  Flash

Demo #23 video shows divide/scramble untranscribed audio session file and merge/unscramble transcribed session files. Flash WMP

Demo #24 video shows transcribed session file in SpeechMax™ and selective redaction of patient name from transcribed session file (TSF). Flash WMP

Demo #25 shows transcribed session file in SpeechMax™, selective redaction of patient name from transcribed session file (TSF), playback of export in PlayBax™ transcription software controlled with foot pedal, and transcription of redacted audio file in Word.  Flash WMP

Price, terms, specifications, and availability are subject to change without notice.  Custom Speech USA, Inc. trademarks are indicated.  Other marks are the property of their respective owners.  Dragon and NaturallySpeaking® are licensed trademarks of Nuance® (Nuance Communications, Inc.).