Practical Solutions for
Awarded multiple U.S. and foreign patents, CSUSA provides products and services for audio, speech, text, image, and multimedia processing for the home or office. Major categories include desktop utilities, session file editor, workflow, speech and language processing toolkit, PC solutions, software suites, and third-party products for speech recognition (SR) and other functions.
CSUSA software includes a session file editor; server-end SR transcription, user training, and audio presegmentation; and a speech and language processing toolkit including nonadaptive SR and a model builder. Additional software supports workflow, telephone dictation, and call management.
Desktop utilities include acWAVE™
audio conversion tool for Sony and Olympus handheld recorders and other audio, CustomMike™
driver for Philips handheld and other microphones,
MacroBLASTER™ macro editor for programmable keypad, voice commands, and barcode,
PathPerfect™ workflow diagram editor, PlayBax™ transcriptionist audio playback
tool for Infinity foot control and similar devices, TTSVoice™ voice for
text to speech, and desktop SR autotranscribe for Microsoft Vista.
See the company website to obtain specific product information.
Various speech-related and other publications have reviewed company products.
Most recently, Speech in the User Interface: Lessons from Experience (2010), a collection of
articles edited by speech-industry consultant and Speech Strategy News (Tarzana, CA)
editor William Meisel, included a review of company software.
Company has shown its software at various trade shows, meetings, and other venues, including the Practice Management Section of ABA TECHSHOW, the American Association for Medical Transcription (AAMT), the Annual Conference of the New Hampshire Medical Group Management Association, the Annual Neuroimaging Symposium at the Barrow Institute, the IBM Annual Conference for Speech Recognition Dealers, LegalTech Chicago, the Medical Transcription Industry Association (MTIA), Northern Illinois Physicians for Connectivity, Northwestern University Medical Informatics, the OPEN MRI 2000 Conference, and the Radiological Society of North America Annual Meeting.
Dragon, IBM, and Windows SR
Company developed its software using Microsoft Windows. Software is compatible with Windows XP, with limited testing with Vista or later Windows OS. Company uses Microsoft SAPI 5.x with Microsoft (Vista), Dragon (Nuance), and IBM speech recognition, and Microsoft and AT&T Natural Voices text to speech. As of this writing, IBM no longer supports ViaVoice speech recognition.
Server-based and/or real-time speech recognition
software supported includes Dragon Professional, Medical, and Legal 10, Dragon Preferred 10,
IBM ViaVoice Professional 10,
Windows Vista SR, and this company's nonadaptive SweetSpeech™. System also supports SAPI 5.x text to speech, including Microsoft and AT&T Natural Voices. System may run with Dragon Professional, Medical, or Legal v. 8.10.000.285 or higher, but only version 10.x is supported.
System may run with IBM Professional v.8.x or higher, but only IBM USB Pro
10.x is supported.
Company developed its software with Microsoft .NET 2.0. It tested this and other company software with Windows XP. Limited testing with early Windows Vista indicated device driver issues.
Partial Customer List
AlphaBest, Tarzana, CA
American Business Systems, Keller, TX
Anthurium Solutions, Inc., Boston, MA
Arkansas Public Defender, Little Rock, AR
Associates in Oncology/Hematology, Rockville, MD
AudioEye, Inc., Tucson, AZ
Automated Business Products, Salt Lake City, UT
BEA Systems, San Jose, CA (acquired by Oracle)
Billings Clinic, Billings, MT
Avon Lake Police Department, Avon Lake, OH
Cisco Systems, Inc., San Jose, CA
City of Bloomington, Bloomington, IN
Coconino Sheriff's Office, Coconino County, Flagstaff, AZ
Columbia University Medical Center, NY, NY
DialAmerica Marketing, Mahwah, NJ
Diversified Software Systems, Morgan Hill, CA
Dyviniak Word Processing Services, Covina, CA
eScription, Needham, MA (acquired by Nuance)
FlowServe Corporation, Irving, TX
InfoComm Development Authority of Singapore
Los Angeles County Office of the Medical Examiner, Los Angeles, CA
M3 Medical Management Services, Chicago, IL
MD Notes, Sewickley, PA
Montserrat Day Hospitals, Spring Hill, Queensland, Australia
Neurology Center, Fairfax, VA
Northwest Regional Pathologists, Bellingham, WA
Open High-Field MRI and CT, Westchester, Larchmont, NY
Orthopaedic Specialists of NW Indiana, Munster, IN
Page Weavers Internet Development, Sacramento, CA
Pathology Center, Bellingham, WA
Pay-Tel Communications, Greensboro, NC
Procter & Gamble, Cincinnati, OH
QBE Insurance Group, Sydney, Australia
Reading Ware, LLC, San Diego, CA
Roberts & Schaefer, Chicago, IL
Town of Munster, Munster, IN
Vee Technologies, Bangalore, India
Villanova University, Villanova, PA
Whiteco Industries Legal Department, Merrillville, IN
Company began as a "speech trainer" for Dragon desktop users. Company later became a reseller, integrator, and developer. Information is included below about software development milestones, a summary of software features by name, and web video demos.
Products and Services
CSUSA has worked as an integrator, software developer, and reseller of speech solutions and supports digital and telephone dictation, manual transcription, client-side and server-side speech recognition, voice commands, text to speech, audio mining, translation, speaker identification, natural language understanding, audio conversion, and other speech and language processing. Services have included custom programming for advanced speech processing and custom telephone solutions, including VoIP and business systems integration. Its session file editor and SAPI 5.x compliant software support Dragon, IBM, and Windows adaptive SR. Company created software for use as an add-on to boxed, off-the-shelf software that the buyer may already own, or with runtime licenses.
With the session file editor, the physician, lawyer, or other speaker can dictate and view the transcribed SR text in real time. The speaker, or a speech editor, can correct the errors in the session file editor and save the corrected text for training the speech user profile. Alternatively, the speaker can record dictation and send it for transcription by speech-to-text conversion by SpeechServers™. The speaker can also correct the transcribed session file or, as with real-time speech recognition, send it for correction by a speech editor. The editor has a unique document window architecture with text and audio annotation (comment) windows. Software supports easy, rapid forms creation by a secretary or other end user, multispeaker collaborative documents, embedded session file AV multimedia, and privacy protection.
CSUSA has sold its products in the U.S. and overseas to large and small businesses, directly or through resellers. Customers represent a variety of fields, industries, and backgrounds--transcription, health care, law, insurance, software development, education, law enforcement, government, and others. Company has shown its software at medical, legal, and transcription trade shows and meetings. Various publications have reviewed its software.
Beginning in 1998, CSUSA introduced various enhancements for Dragon and other third-party speaker-adaptive speech recognition (SASR), including:
(1) pretraining the speech user profile using dictation audio and transcription with early versions of the session file editor and SpeechServers™ to reduce speaker microphone enrollment time essentially to zero,
(2) repetitive, iterative server-based training of the speech user profile to improve SASR accuracy, which uses automation to make SR more accurate and reduce speaker or speech editor correction time, and
(3) speaker and speech editor auto error detection with dual- or multi-speech engine text compare to reduce speaker and speech editor search time for SR errors--80% or more for 90% accurate SR--using session files.
Approach applies to speaker-adaptive SR, nonadaptive SR (NASR), speaker-independent SR (SISR), and other SR. It also applies to text compare applied to other manual or automatic processing of audio, text, or image data.
SR text compare uses wave form analysis to resegment and retag audio-linked text to synchronize output from different SASR, NASR, SISR, or other SR. This creates synchronized session files with the same number of text segments for efficient comparison. CSUSA has also created enhancements that support selective modification of SpeechMax™ session file editor document window SR text and audio tags. Enhancements also help protect individual privacy and confidentiality. Special tools redact (censor), divide, and scramble session file speech content before remote editing or other processing.
CSUSA also developed a proprietary, speaker-dependent, nonadaptive speech engine and model builder toolkit. This innovative software enables a transcription company, business, law firm, hospital, law enforcement agency, or government to:
(1) create SR speech user profiles from dictation and manual transcription, thereby utilizing day-to-day dictation audio that was formerly discarded as a useless byproduct of the transcription process, and
(2) use data from this interchangeable SR profile also for voice commands, interactive voice response (for telephony), speaker identification, machine translation, audio mining, text to speech, phoneme generation, and natural language understanding.
CSUSA submitted its first patent application in 1998.
Software Development Milestones
1998: Implemented early version of SpeechServers™ server-based SR for Dragon NaturallySpeaking using SDK for real-time SR. Enrolled speech user with audio file and no microphone enrollment. Used early software that synchronized dictated speech with highlighted text. Software also included early version of comparison of SR text to verbatim transcribed manual text. This determined speaker SR accuracy and whether speaker should progress to full automation phase with server-based SR and speech editor correction. IBM ViaVoice did not accept audio file enrollment. Company developed microphone emulation to direct audio file to IBM and enroll speaker.
1999: Created text compare software using a dual-engine approach to indicate likely SR errors and reduce speech editor audio review time. Used Dragon and IBM speech engines.
1999: Implemented post-enrollment, server-based repetitive, iterative training
for speaker-dependent speaker-adaptive systems with early version of
SpeechServers™. Software transcribes audio, compares
transcription to verbatim text, and corrects until system correctly transcribes
audio for preset number of iterations.
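The repetitive, iterative training loop described above can be sketched as follows; `transcribe` and `train` are hypothetical stand-ins for the speech engine's SDK calls (assumptions for illustration, not CSUSA's actual API):

```python
# Hedged sketch of repetitive, iterative server-based training: transcribe,
# compare against verbatim text, correct and retrain, and stop once the
# system transcribes the audio correctly for a preset number of iterations.

def iterative_training(profile, audio, verbatim_text,
                       transcribe, train, required_passes=3, max_iters=20):
    """Retrain `profile` until it transcribes `audio` as `verbatim_text`
    for `required_passes` consecutive iterations."""
    consecutive_correct = 0
    for _ in range(max_iters):
        hypothesis = transcribe(profile, audio)
        if hypothesis == verbatim_text:
            consecutive_correct += 1
            if consecutive_correct >= required_passes:
                return profile          # converged
        else:
            consecutive_correct = 0
            profile = train(profile, audio, verbatim_text)  # correct and adapt
    return profile                      # give up after max_iters
```

The loop structure, not the engine internals, is the point: training repeats only while the comparison still finds differences.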
1999: Used early tools to exclude unintelligible or unusable audio and text from training data used to train the speech user profile.
2000: Additional refinements made in display of dual-engine text and speech user training. Company also completed early versions of Command! workflow software.
2001: Utilized start-end word text comparison technique to synchronize different speech engines' audio-aligned text.
2003: Extended text comparison to form dictation.
2004: Introduced "second generation session file" concept to promote more efficient synchronization of output from different SR systems. Supports conversion of third-party SR session files to proprietary CSUSA session file format (originally .CSF, now .SES) using Dragon, IBM, or other proprietary SDK software. Process results in the same audio-tagging for all converted third-party SR output and display in a common interface. After conversion to company format, software supports retagging start/duration times of session file segments. This results in an equal number of segments in each session file. Each segment has the same start/duration times as the same segment in another session file transcribed using the same audio. This supports comparison of synchronized text from different speech engines arising from the same audio.
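A minimal sketch of the retagging step, assuming word-level tags as `(start_ms, end_ms, text)` tuples and a precomputed set of segment boundaries common to all engines; the actual .SES format and waveform analysis are proprietary:

```python
# Hedged sketch of "second generation" retagging: regroup each engine's
# word-level audio tags into common, identically-timed segments so every
# converted session file ends up with the same number of segments.

def retag(words, boundaries):
    """words: list of (start_ms, end_ms, text); boundaries: list of
    (seg_start_ms, seg_end_ms) shared by all engines. Returns one
    audio-tagged text segment per boundary."""
    segments = []
    for seg_start, seg_end in boundaries:
        # keep the words whose audio falls entirely inside this segment
        seg_words = [w for (s, e, w) in words if s >= seg_start and e <= seg_end]
        segments.append((seg_start, seg_end, " ".join(seg_words)))
    return segments
```

Because every engine's output is retagged against the same boundaries, the resulting session files can be compared segment by segment.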
2004: Automatic audio-retagging of corrected word(s) and corresponding phrase audio to maintain audio-text alignment.
2004: Created efficient SpeechMax™ techniques for separately saving both audio-linked distribution (final) text and verbatim text (what the speaker said, including uncorrected errors or extraneous statements). The latter is used for speech user profile training; saving both audio-linked verbatim (training) text and audio-linked distribution (final) text also supports quality assurance and other uses.
2005: Developed multiwindow, multilingual SpeechMax™ desktop graphical user interface for synchronized comparison of text output from SR and other pattern recognition. Session file editor supports opening one or more windows; read/write text, audio, and image; and annotation windows for entry of text or audio associated to the document. Text annotation may include a command line to run a program such as a media player or access a website to review source data.
2005: Session file support implemented for opening and modifying a nearly unlimited number of files with different file types in the editor. Further, as long as session files have an equal number of segments, they may be synchronized and displayed. For example, process may synchronize and display sequential output from SR in the source language, machine translation to foreign text, and text to speech conversion of the translation.
2005: Implemented desktop segmentation of dictation audio before manual transcription. Process creates presegmented dictation audio untranscribed session file (USF). Transcriptionist plays back USF audio and manually transcribes text to create audio-linked transcribed session file (TSF) for creation of final (distributable) text and audio-linked verbatim text for speech user profile training.
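Presegmentation of dictation into utterances is typically done by silence detection. The following is a minimal sketch under that assumption; the frame size and thresholds are illustrative, not the product's actual parameters:

```python
# Hedged sketch of presegmenting dictation audio into utterance spans by
# silence detection, yielding the (start_ms, end_ms) spans that an
# "untranscribed session file" would carry.

def presegment(samples, rate, frame_ms=20, silence_level=500, min_silence_frames=15):
    """samples: PCM amplitudes (ints); rate: samples per second.
    Returns list of (start_ms, end_ms) utterance spans."""
    frame_len = rate * frame_ms // 1000
    spans, start, silent_run = [], None, 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        loud = max((abs(s) for s in frame), default=0) >= silence_level
        t_ms = i * 1000 // rate
        if loud:
            if start is None:
                start = t_ms              # utterance begins at first loud frame
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:  # long pause ends the utterance
                spans.append((start, t_ms - (silent_run - 1) * frame_ms))
                start, silent_run = None, 0
    if start is not None:                 # audio ended mid-utterance
        spans.append((start, len(samples) * 1000 // rate))
    return spans
```

Each span then becomes one playable segment for the transcriptionist, keeping the transcribed text audio-linked.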
2005: SpeechMax™ redesigned to support use of a single session file editor for human processing of data and review of computer results; supports use of a speech-oriented, common application for reporting synchronized source data from a variety of sources.
2005: Verbatim annotation entered where there is a discrepancy between "final" and "verbatim" text. Verbatim annotation creates text for a training session file (.TRS) that is automatically substituted for final text. If there is no discrepancy, final text is automatically saved as verbatim text for creation of the training session file (.TRS).
2005: Tools developed to mark poor quality audio or nondictated text and exclude it from speech user profile training.
2005: SpeechMax™ supports text compare of synchronized (same segment) output results from manual or automated processing of data, audio, text, or other input.
2005: Methods developed using SpeechMax™ for efficient processing of speech from two or more group speakers during a meeting, legal proceeding, interview, or video. Process develops speech user data from manual transcription and/or SR to create small-group user profile and individual profiles for > 2 speakers.
2005: Introduced composite "best guess" ("best result") session file for SR and other pattern recognition. Composite session indicates the single most likely result based upon synchronized computer and/or human (manual) results for audio, text, or image data processing. Consolidated text compare using three texts, for example, results in the following color-coded results: nonhighlighted text (clear) is likely correct due to lack of differences, pink highlighting indicates differences between 2/3 files and increased risk of error, and red highlighting indicates differences between 3/3 sources and greater error risk. User can determine visually from the color-coded display where there are differences and increased need to verify reliability with audio playback. Process does not use confidence scores of results to determine the color-coded best result.
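The color-coded composite can be sketched as simple per-segment voting across three synchronized texts; the voting rule and data layout are illustrative assumptions, with the colors mirroring the description above:

```python
# Hedged sketch of the "best result" composite: for each synchronized
# segment, take the majority text across three engines and flag the level
# of disagreement (clear / pink / red).
from collections import Counter

def best_result(texts_a, texts_b, texts_c):
    """Each argument is a list of segment texts, one per synchronized
    segment. Returns a list of (best_text, color) pairs."""
    composite = []
    for seg in zip(texts_a, texts_b, texts_c):
        counts = Counter(seg)
        text, votes = counts.most_common(1)[0]
        if votes == 3:
            color = "clear"   # all three agree: likely correct
        elif votes == 2:
            color = "pink"    # 2/3 agree: increased risk of error
        else:
            color = "red"     # all differ: highest risk, verify with audio
        composite.append((text, color))
    return composite
```

Note that, as the text states, this uses only agreement between outputs, not engine confidence scores.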
2005: Process may use "best result" audio-linked
matches (clear) for
unsupervised pattern recognition training. For example, process may submit
audio-linked text to speech engine model builder to create SR speech user
profile without manual review.
2005: Completed development of SweetSpeech™ speaker-specific SR engine and toolkit. Company software enables transcription companies and businesses to develop a highly accurate speaker-specific SR user profile. This usually takes as little as 6-8 hours of manually transcribed segmented audio from a speaker. Training process can also use preexisting speaker-independent and other SR software with second generation technology to create audio-linked text for the speaker-specific user profile. Use text compare to determine relative accuracy of different potential models. SweetSpeech™ uses actual speech and background or channel noise to create the speech user profile.
Software does not begin with creation of speaker-independent (SI) model
from hundreds or thousands of voices, does not adapt speaker-independent model
to speaker-dependent with speaker enrollment and correction, and does not model
changes in SI
model based upon estimation of voice characteristics. Potential applications include creation of SR for voice
commands in "smart" environments where it is difficult to model for unique
background or channel noise (e.g., airplane cockpit, car, or noisy home). Literature
supports concept of high recognition accuracy of speaker-specific SR systems.
2005: Introduced specialized
SweetSpeech™ tools for training
SR models. Tools were developed to assist nonexperts with
determination of lexical pronunciation using text to speech phoneme generator.
Software also provides for automated linguistics question file generation for
state tying in cases of data sparsity.
Speech user profile creation and update training use accumulator
and combiner technology to train systems. State untying developed for
training updates to user profile.
Tools created to use training session file data (.TRS) to create a speech user profile for SR with interchangeable speech user data for SR and other speech and language processing, such as voice commands, call center telephony, speaker ID for voice biometrics, personalized text to speech, phoneme generation, audio mining, machine translation, and natural language processing for automated content analysis.
Applied SpeechMax™ text comparison and synchronization techniques to manual
transcription and/or automated SR of multispeaker, single
channel audio. This represents speech recorded with single microphone for meeting (e.g., board of
directors), interview, or group video. Manual transcriptionist processes untranscribed session file
(USF) to create transcribed session file (TSF) associating each
segment to the speech of a particular speaker. This results
in creation of verbatim
audio-aligned TSF data training sets for speech user profile training for an
individual speaker and multispeaker, small-group speech user profile. Use multispeaker profile and each individual speaker profile to transcribe group
audio and text compare to determine likely correct output.
Implemented SpeechMax™ selective modification of document SR text or audio tags with text or audio using the annotation window. This enables speaker B to correct the transcribed text of speaker A and train both speakers' user profiles. Applies to simple or complex documents, including session files with text, graphics, sound, or embedded audiovisual content (e.g., lectures, sales presentations, or electronic audio books). Second speech user can correct the text dictated by a first user with the original session file open in the buffered read/write document window with no corruption of either speech user profile. System can use verbatim corrected text and first and second speaker audio to train profiles of the respective speakers ("annotation training"). A session file lock was introduced to preserve document integrity and create a read-only, portable session file.
Implemented SpeechMax™ session file editor divide and scramble content features and audio/text redaction to protect confidentiality and privacy; session file lock protects data integrity. Optional divide feature separates session file segments into two or more groups that may be sent to two or more processing nodes; division limits any single transcriptionist's, SR facility's, or speech editor's knowledge of the entire content. Segments of each division may be scrambled (reordered) to obscure meaning and content. Each segment generally represents an utterance, so the manual transcriptionist or speech editor has sufficient audio context to assist with transcription or correction. Redaction is available for speech-recognition audio and text for names, addresses, and other personal identifying information. Workflow unscrambles corrected segments, reinserts redacted content (if applicable), and reassembles to create the final, distributable document.
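The divide-scramble-reassemble workflow can be sketched as follows, assuming a session file is simply a list of segments; the round-robin split and function names are illustrative, and redaction is omitted:

```python
# Hedged sketch of divide and scramble for privacy: split segments into
# groups for separate processing nodes, shuffle each group while recording
# the original indices, then invert the scramble after correction.
import random

def scramble(segments, n_groups=2, seed=42):
    """Returns groups, each a shuffled list of (original_index, segment);
    the stored indices let reassemble() restore document order."""
    rng = random.Random(seed)
    indexed = list(enumerate(segments))
    groups = [indexed[i::n_groups] for i in range(n_groups)]  # round-robin split
    for g in groups:
        rng.shuffle(g)            # reorder to obscure meaning and content
    return groups

def reassemble(groups, total):
    """Invert the scramble: place each corrected segment back by index."""
    restored = [None] * total
    for g in groups:
        for idx, seg in g:
            restored[idx] = seg
    return restored
```

Because each node sees only a shuffled subset of utterance-length segments, no single transcriptionist or editor can reconstruct the full document.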
Company has designed this multiwindow, multilingual desktop text editor for
manual or automatic speech and language processing. The software supports
advanced text comparison by editor of synchronized output text created by manual
transcription or speech recognition to more rapidly detect likely errors made by
manual transcriptionist and/or speech recognition. The techniques were
originally developed for text comparison of transcribed dictation. Given the
increasing processing of data manually or by computers, company has extended
text comparison to include conversion, interpretation, or analysis of other
synchronized text, audio, or image pattern recognition source data processed by
manual or automatic means or both.
Speech-Oriented Desktop Graphical User Interface
Microsoft Windows compatible multiwindow, multilingual (Unicode) session file editor; can read/write .RTF, .TXT, .HTML, audio, image, and .SES proprietary session files. Software supports one or more main document windows with text and
audio annotation (comment) windows. Text annotation supports creation of one or
more hyperlinks linked to specific document text. Text annotation may also
represent command to run program, such as video player. Program supports
real-time and server-based speech recognition, plus other speech and language
processing. It is designed for use as a common interface for dictation, speech
recognition, voice commands, speaker recognition, translation, text to speech,
audio mining, natural language understanding, and phoneme generation. It also
supports comparison of synchronized results from other text, image, and audio
pattern recognition, including medical imaging. Easy-to-use tools are available
for training a shared speech and language user profile. A transcriptionist or
other specialized member of the transcriptionist team can serve as a "speech
trainer" to help create the profile. Application programming interface (API)
and developer tools are available.
Data Synchronization and Text Comparison
Company text comparison techniques rely upon synchronization of delimited
bounded input data and comparison of delimited output data resulting from same
input data. This differs from standard text comparison that compares text
without reference to source data.
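The distinction can be illustrated with a small sketch: because both outputs are delimited by the same input segments, each detected difference stays linked to its source segment (and hence its audio), which a flat document-level diff would lose. The segment lists are illustrative:

```python
# Hedged sketch of synchronized, segment-wise comparison: both texts are
# delimited by the same input segments, so every mismatch maps directly
# back to one audio segment for playback.

def compare_synchronized(segs_a, segs_b):
    """Both lists hold one text per shared input segment. Returns indices
    of segments whose texts differ, so an editor can tab straight to the
    linked audio."""
    return [i for i, (a, b) in enumerate(zip(segs_a, segs_b)) if a != b]
```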
. . . Selection of speech recognition word or phrase and synchronized playback of selected text; also supports playback of a selected utterance of an untranscribed session file (representing segmented audio only) or a manually transcribed session file
. . . Compare manually transcribed verbatim text file with speech recognition
session text to reduce search time for recognition errors and create
audio-linked training text
. . . Text comparison of audio-aligned output from two or more speech engines
and/or audio-aligned manually transcribed text, identifies likely speech
recognition or manual transcription errors, and reduces transcriptionist search
time, for continuous narrative dictation and structured dictation into form
(also can text compare output processed automatically or manually from machine
translation or other text, audio, or image pattern recognition, displays text in
relation to aligned source data input)
. . . Technique for locating identical audio segment using text comparison by matching start and end words, also use for audio mining
. . . Text compare document containing “boilerplate” and other standard language
manually or automatically transcribed from previous audio file or continuous
speech with similar document containing standard language created from new audio
file or real-time speech
Second Generation Session File™
. . . Creation of a second session file having audio-synchronized text after
readjustment of text audio tag information based on analysis of audio file wave,
for example, process can realign asynchronized transcribed session file segments
from Dragon NaturallySpeaking and IBM ViaVoice speech recognition processing of
same audio, Dragon and IBM "second generation" session files have equal number
of synchronized segments in retagged session files, "second generation" file
represents a form of standardized output that can be used to compare other
speech recognition or create a portable session file for speech recognition or
other audio-linked text that can be opened in a common user interface
SpeechParts™ . . .
Dividing techniques based upon a predetermined boundary division may result in
audio tags for utterances, for words, or for word subunits
tcReliability™ . . .
Word mapping tool to formulate statistical reliability that a text compare matched phrase represents correct transcription and, based on this statistical reliability, have transcriptionist determine the need to evaluate a matched word or phrase with audio playback (not currently available)
. . . Synchronized results from text, audio, or image pattern recognition
processed by manual or automatic means, indicates that each output session file
arising from the same bounded input data has the same number of segments
. . . software synchronizes output from two or more session files to create a
most-likely "best result" representing most common output, or selected index
engine output if all different, color coded best result highlights differences
between speech recognition texts, for example, for a given word or phrase
transcribed from same audio, no highlighting of best result word or phrase
indicates no difference between 3 speech recognition texts and low risk of
error, pink highlighting indicates differences in 2/3 texts and moderate risk of
error, and red highlighting indicates differences in 3/3 texts and high risk of
error, editor reviewer may quickly determine visually where likely errors are by
looking at highlighted best result text without having to review each individual
text and may selectively tab to likely errors, listen to audio and correct, best
result composite may also represent session file output from manual or automatic
processing of text, audio, or image input data
Creation of Final (Distribution) and Verbatim (Training) Text
Session editor includes standard tools for text editing, plus additional,
specialized tools for creating shared speech user profile for speech and
language processing. Some of these specialized tools are described below and further on the company website.
. . . SweetSpeech™ feature, runs as server or desktop application, creates segmented, untranscribed session file (USF) from audio file, manual transcription converts it to audio-aligned transcribed session file as report and/or training session file for acoustic model
. . . Optional simultaneous creation of final (distribution) and verbatim text
with document transcription, add training verbatim text as annotation if
difference between final distribution text and verbatim (e.g., where speaker
made mistake or extraneous comments that are not included in final text)
. . . Identifies corrected session file text segment and automatically
reassigns adjusted word audio tags to final and verbatim text
Best Speech Only™
. . . Operator may exclude poorly audible speech, speech with considerable
background or channel noise, or "ahs" and "uhms" from training session file,
similarly, operator may also exclude text for headers, footers, and other
nondictated text from training material
Text Split Plus™ . . .
Alternative method of creating verbatim and/or distribution text, begins with
SpeechSplitter™ untranscribed session file, may import text and open in
document window, playback untranscribed session file and associate
phrase text to each audio utterance to create audio-aligned text for electronic
audio book or other audio-aligned text
. . . Training session file from document window representing delimited output
data and delimited text, audio, or image input data; with speech recognition or
manual transcription, session training file represents document window
audio-tagged verbatim text
3. Annotations, Forms, and Collaborative and Complex Documents
Operator may use text processing, data synchronization, and text comparison
tools described above with more complicated documents.
Annotations (General) . . .
audio and text annotations (comments) are supported, one or more users may enter
text associated to specific document text or space in document, text annotations
may include messages, hyperlinks, or run program command
The Talking Form™
. . . Forms and templates "made easy," generate audio prompts for user (audio
annotations), enter text with keyboard, barcode, or speech recognition, includes
integrated sound recorder for dictation and transcription audio playback, more
than one speaker may complete form, use dictated speech and text for speech user
profile training, use speech recognition to correct speech recognition text of
another speaker, no corruption of either speech user profile, use corrections to
train both speech user profiles, also selectively modify speech recognition
document text audio-tags, associate form field name or other document text to
one or more multilevel hyperlinks, may use annotations to provide knowledge base
or information to form user, associate document text or form field name to text
annotation to launch media player or other programs with command line
My AV Notebook™
. . . Software for "do-it-yourself" presentations, create single- or
multi-speaker multimedia, session file may include nondictated headers and
footers, speech-aligned text, synchronized lyrics and music, photos, graphics,
video, or animation, may associate one or more hyperlinks or command line to
specific document text, launch websites, video player, or other programs,
interactivity with mouse, keyboard, barcode, and voice commands, synchronized
real-time highlighting audio book text with audio, user can switch between
listening and reading, elapsed time display makes it easy to return to audio
book to read or play, use same techniques for illustrated lectures and classroom
teaching aids, sales presentations, speeches, electronic scrapbooks, singalongs
and karaoke, use speech text data to create text to speech voice font for
professional voice talent, actors, politicians, and other speakers
. . . Selectively modify session file document audio tags or text through
annotation window, concatenate annotation audio to train speech user profile,
speaker B may edit speaker A speech recognition with no corruption of either
speech user profile, switch tags functionality (transpose document and
annotation text or audio), train user profile speaker A with corrected text and
speaker B profile with speaker B's speech and associated text
. . . Generate annotation training by selecting text audio annotation pairs by annotation ID and training speech user profile of one or more users; differs from general session training, which uses document verbatim text and playback audio to train the speech user profile
Source data input may consist of speech, text, handwriting, fingerprints, medical and other digital images, music, and other data that may be processed using pattern recognition. There is a need for a single application to support conversion, interpretation, analysis, comparison, and reporting of synchronized speech, text, audio, or image data that has been processed manually, automatically, or by both. Note that sequential reports based upon the same data may be synchronized.
Medical Imaging . . .
Training a representational model for a computer-aided diagnosis program having
pattern recognition capabilities based upon human evaluation of bounded data
• Computer-aided diagnosis (CAD) for Mammography . . .
Create mammogram reporting session file to evaluate bounded CAD-determined suspicious areas by two or more radiologists
Other Pattern Recognition . . .
input from diverse sources including text, audio, and image, real-time and
recorded, human and mechanically-generated audio, single-speaker and
multispeaker. For example, two or more computer-aided diagnosis programs for mammography may
analyze the same delimited bounded suspicious areas and grade each according to
likelihood of cancer. A breast radiologist may review and dictate a final report
in English synchronized to the initial data input. This may be followed by machine
translation into Spanish synchronized to the English and corrected by an office
editor. All the session files may have equal segments, such that the user
may load all the session files, select one file, tab through the selected file,
and highlight sequentially each segment in that file and the synchronized segments
of the other files. Regardless of data type or processing, session files with equal
segment numbers may be synchronized.
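The equal-segment synchronization described above can be sketched as follows; the `SessionFile` class and the segment contents are hypothetical illustrations, not the actual session file format.

```python
# Hypothetical sketch: session files with equal segment counts are
# synchronized by index, regardless of data type or how each was produced.
class SessionFile:
    def __init__(self, name, segments):
        self.name = name
        self.segments = segments  # one entry per aligned segment

def synchronized_view(files, index):
    """Return the segment at the same position in every loaded file."""
    counts = {len(f.segments) for f in files}
    if len(counts) != 1:
        raise ValueError("session files must have equal segment counts")
    return {f.name: f.segments[index] for f in files}

english = SessionFile("english", ["The mass is benign.", "Follow up in one year."])
spanish = SessionFile("spanish", ["La masa es benigna.", "Seguimiento en un año."])

# Selecting segment 0 highlights the corresponding segment in each file.
view = synchronized_view([english, spanish], 0)
```

Any number of session files can be loaded this way; tabbing simply advances the shared index.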
Privacy and Confidentiality, Document Integrity
. . . Dividing and scrambling session file content prior to distributing for
processing to limit any single human or automated processing node's knowledge of
the complete dictation
. . . Selective redaction of audio and text
. . . Locked session file, read-only
. . .
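A minimal sketch of the divide/scramble idea above, assuming segments are shuffled with a secret key before distribution and restored afterward; the function names and key handling are hypothetical.

```python
import random

def scramble(segments, key):
    """Shuffle segment order deterministically with a secret key so that
    no single processing node receives the dictation in order."""
    order = list(range(len(segments)))
    random.Random(key).shuffle(order)
    return [segments[i] for i in order], order

def unscramble(scrambled, order):
    """Restore the original order once the pieces come back processed."""
    restored = [None] * len(scrambled)
    for position, original_index in enumerate(order):
        restored[original_index] = scrambled[position]
    return restored

segments = ["utt01.wav", "utt02.wav", "utt03.wav", "utt04.wav"]
mixed, order = scramble(segments, key=42)
assert unscramble(mixed, order) == segments
```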
Speech recognition for dictation is a form of pattern
recognition. Speaker-independent speech recognition uses
audio, text, and pronunciation data from many different
speakers to create a multispeaker speech user profile.
For many years, leading speech recognition programs have
supported speaker-dependent user profile creation using
speaker-adaptive techniques. Examples include
Dragon NaturallySpeaking, Philips SpeechMagic, Microsoft
Windows speech recognition, and IBM ViaVoice (no longer
supported by IBM).
Speaker adaptation uses microphone enrollment, reading one or
more scripts to create a new user, and correction of
day-to-day speech recognition errors to further train
(adapt) the new user. The new speech user profile
is more speaker-dependent than the original model.
The adaptation process uses MAP (Maximum A Posteriori)
or MLLR (Maximum Likelihood Linear Regression) or other
estimation techniques to create an acoustic model
approximating the characteristics of the speaker’s
speech. It is an estimation that does not model
the actual speaker's speech or actual background noise.
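As one illustration, the MAP update of a single Gaussian mean shows why the result is an approximation: the adapted mean is a weighted blend of the speaker-independent prior and the speaker's observed data, not a model built from the speaker's speech alone. The code below is a simplified scalar sketch, not the actual implementation of any product mentioned here.

```python
# Simplified MAP update for one Gaussian mean (scalar case). tau controls
# how strongly the speaker-independent prior is trusted; posteriors are the
# per-frame occupation probabilities (gamma) of this Gaussian.
def map_update_mean(prior_mean, frames, posteriors, tau=10.0):
    occupancy = sum(posteriors)                         # sum_t gamma_t
    weighted_sum = sum(g * x for g, x in zip(posteriors, frames))
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)

# With only three frames of data, the adapted mean stays close to the
# prior -- an estimate of the speaker's speech, not a model built from it.
adapted = map_update_mean(prior_mean=0.0,
                          frames=[1.0, 1.2, 0.8],
                          posteriors=[1.0, 1.0, 1.0])
```

With more speaker data, the occupancy term dominates and the estimate moves toward the speaker's own statistics.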
The resulting speech user profile includes data about
the statistical probability of basic speech sounds
following and preceding others (acoustic model), the
order or combination of particular words (language
model), and text representation of speaker word
pronunciation (lexicon). Off-the-shelf speech
recognition generally makes user profile data
transparent (invisible) and inaccessible to the speaker
or developer. These users cannot use this data to
create other models for other speech and language
processing, such as voice commands, speaker recognition,
or text to speech. For a snapshot in time of
industry software and practices, see a
Speech Strategy News
sample newsletter issue.
Nonadaptive speaker-specific speech recognition creates
speech user acoustic and language models and lexicon
based only on the speaker's speech. Research
indicates that speaker-dependent speaker-specific
speech recognition using just the speaker's speech data
is more accurate than
speaker-adaptive speaker-dependent or
speaker-independent models. In addition, with more
open access to speech user profile data, speakers and
developers could use the data and models for other
speech and language processing, such as voice commands,
interactive voice response for telephony, speaker
identification, text to speech, and other speech and
language processing. The company provides an alternative: the
SweetSpeech™ toolkit and speech recognition engine
Just My Speech™ . . . Use this company’s toolkit and text editing software to
create data sets for speaker-specific speech user
profile from microphone, telephone, or mobile device
recorded speech; the company's do-it-yourself tools are
designed to assist software developers, local information
or transcription services, or advanced end users in
creating a user profile based upon the speech of a
single speaker, including a speaker with poor accuracy
with speaker-adaptive software despite training and
correction, a speaker with no speech recognition available
for a language or dialect, or a speaker with an unusual
accent or speech impediment
Multiple Speech Recognition Options . . .
also create custom group user profiles for a small group
of speakers meeting over time with speech recorded using
a single microphone or other recording device (multispeaker,
single channel), e.g., corporate board meetings, lengthy
depositions or trials, or groups subject to law enforcement
or national security surveillance; software also
supports creation of large group multispeaker
speaker-independent speech recognition models
SharedProfile™ . . . Includes
acoustic and language models, lexicon, formatting
preferences, and audio segmentation as shared data for
speech recognition, voice commands, interactive voice
response for telephony (IVR), speaker recognition, audio
mining, text to speech, phonetic generation, machine
translation, and natural language processing, users and
developers can use the shared models as "interchangeable
parts" for other speech and language processing
Model Builder . . .
Software includes tools for creating personalized
acoustic model, language model, and lexicon,
Unicode compatible software supports speech and language
processing in many languages
SpeechScape™ (Acoustic Model) . . . Build,
train, and update one or more acoustic models with audio
and verbatim text, use speaker's speech and text for
speaker-specific training, utilize data from business or
professional dictation, YouTube, Facebook, or other web
audio, or cell phone or other telephony, use accumulator
and combiner techniques to create and modify acoustic
model, automatically generate linguistic questions file
in cases of data sparsity, software includes tying and
untying states for creation and combining one or more
accumulator files, model creation generally requires 6-8
hours of good quality audio data, recommended minimum 8
kHz/8-bit for telephony, 16 kHz/16-bit for other audio,
may also use software to create small group and large
group models, create and store acoustic model locally or
on the cloud, use acoustic model also
for voice commands, interactive voice response for
telephony, speaker recognition, audio mining, text to
speech, and other speech and language processing
WordContext™ (Language Model) . . . Add text
samples, build model, revise model based upon use of
different or additional samples, use trigram or other
N-gram modeling for speech recognition, may also create
custom user profile for bilingual or multilingual
speaker who dictates in two or more languages
simultaneously, use language model tool for
gisting translation (providing a rough idea of what
speaker says), use language model also with
categories tool for semantic processing for natural
language understanding and key concept search
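A minimal sketch of the trigram modeling mentioned above, estimating P(word | two previous words) by relative frequency; real engines add smoothing and backoff, and the function names here are illustrative.

```python
from collections import defaultdict

# Illustrative trigram counts: estimate P(w3 | w1, w2) by relative
# frequency. A production language model adds smoothing and backoff.
def train_trigrams(tokens):
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1
    return trigram_counts, context_counts

def trigram_prob(trigram_counts, context_counts, w1, w2, w3):
    if context_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]

tokens = "the patient is stable the patient is improving".split()
trigrams, contexts = train_trigrams(tokens)
# After "the patient", only "is" was ever observed in this tiny sample.
p = trigram_prob(trigrams, contexts, "the", "patient", "is")
```

Revising the model with different or additional text samples simply means recounting over the new corpus.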
(Lexicon Model) . . . Add or remove lexicon words,
lexicon associates each word in language model to
phonetic pronunciation, audio pronunciation generator
uses text-to-speech to assist in generating lexical
(phonetic) pronunciation, use for speech recognition and
other speech and language processing
Speech and Text Segmentation
. . . May use same speech segmentation parameters for speech
recognition, voice commands, interactive voice
response for telephony, speaker recognition and audio
mining, and same text segmentation rules for text to
speech, machine translation, and natural language
processing, may use common segmentation parameters for
two or more instances of manual and/or automated
processing of same dictation audio, common segmentation
results in synchronized data reflecting equal number of
input dictated phrases or sentences (utterances) and
output text segments (consisting of audio-aligned text)
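The common-segmentation idea can be sketched with a simple pause-based segmenter; the energy threshold and minimum pause length stand in for the shared segmentation parameters, and the whole function is a hypothetical illustration.

```python
# Hypothetical pause-based utterance segmenter: a run of low-energy frames
# at least min_pause_frames long ends the current utterance. Sharing these
# parameters across passes yields equal segment counts for synchronization.
def segment_by_pause(energies, threshold=0.1, min_pause_frames=3):
    segments, current, silent = [], [], 0
    for energy in energies:
        if energy < threshold:
            silent += 1
            if silent >= min_pause_frames and current:
                segments.append(current)
                current = []
        else:
            silent = 0
            current.append(energy)
    if current:
        segments.append(current)
    return segments

# Two utterances separated by a three-frame pause.
utterances = segment_by_pause([0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8])
```

Because any two passes over the same audio with the same parameters find the same pauses, their outputs have equal segment counts and can be synchronized segment by segment.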
Formatting . . . Unformatted speech
recognition engine text “forward” formatted in
postprocessing step before speaker or transcriptionist
views text (e.g., converts transcription "fourth of
July, eighteen sixty five" to "July 4, 1865"), formatted
text is “reverse” formatted when text is submitted for
training the acoustic model, toolkit supports end user
selection of preferred formatting for dates, measures,
weights, currencies, telephone numbers, and other text
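A hedged sketch of the "forward" formatting step: rule-based rewriting of raw engine output into the preferred written form. The two rules below reproduce the document's own example but are not the toolkit's actual rule set.

```python
import re

# Illustrative "forward" formatting rules; these cover only the document's
# example and are hypothetical stand-ins for a full rule set.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
SPOKEN_YEARS = {"eighteen sixty five": "1865"}

def forward_format(raw_text):
    # "fourth of July" -> "July 4"
    def day_month(match):
        return f"{match.group(2)} {ORDINALS[match.group(1)]}"
    text = re.sub(r"\b(first|second|third|fourth) of (\w+)", day_month, raw_text)
    # spelled-out year -> digits
    for spoken, written in SPOKEN_YEARS.items():
        text = text.replace(spoken, written)
    return text

formatted = forward_format("fourth of July, eighteen sixty five")
# "July 4, 1865"
```

Reverse formatting for acoustic model training would apply the inverse mapping, expanding written forms back to spoken words.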
User Management and other
Tools . . . Speech and language processing
toolkit includes tools for user management and user
settings, plus new user and train new user wizards
SweetSpeech™ SAPI 5.x Text to
Speech (TTS) and Speech Recognition (SR) Plugins for
SpeechMax™ . . . Supports
desktop processing with Microsoft or AT&T Natural Voices
text to speech, or Microsoft and other SAPI 5.x
compatible speech recognition software,
. . .
This feature runs as a remote server function or as a
desktop speech recognition plugin; see discussion below under
SpeechServers™ . .
Software supports Microsoft SAPI 5.x server-based speech
recognition with SweetSpeech™, Microsoft Windows speech
recognition, Dragon NaturallySpeaking, and IBM ViaVoice
(early versions only, no longer supported by IBM).
Potentially supports other SAPI 5.x speech recognition.
Desktop server-based autotranscribe is also available for
SweetSpeech™, Microsoft, and Dragon engines.
. . . Pretrain speech user profile with verbatim text
and audio file to
eliminate traditional microphone enrollment for
speaker-adaptive speaker-dependent systems
. . . Use microphone emulation to enroll speech users
with audio file
instead of traditional speaker enrollment where system
requires some form of microphone enrollment
SpeechTrainer™ . . .
Automate post-enrollment corrective adaptation to more
quickly improve accuracy by transcribing audio,
selecting differences between transcribed output and
provided verbatim text, applying corrections, and
process repeated until appropriate conditions are met
(target accuracy, unable to correct further, maximum
number of cycles) (supported with early versions of Dragon
NaturallySpeaking and IBM ViaVoice only)
TransWaveX™ . . .
Dictation audio with text output only
SaveSession™ . . .
Dictation audio with audio-linked text (session file)
output, use audio-linked verbatim text for speech user profile training
Demo #1A shows
utterance (phrase) segmentation to create untranscribed
session file (USF) from dictation. Transcriptionist
manually transcribes in
SpeechMax™ to create transcribed session file (TSF)
using PlaySpeech™ functionality. Demo
also shows realignment of a segment boundary marker to
include audio for "period" with the larger adjacent segment.
Demo #1B shows
utterance (phrase) segmentation to create untranscribed
session file from dictation. This demo shows a case
where the transcriptionist would like to work with fewer
vertical utterance markers. Transcriptionist
imports previously transcribed text, sequentially
listens to each untranscribed utterance,
and sequentially delimits each utterance by toggling the
play audio control. The result is a transcribed
session file as above.
Demo #2 shows
server-based transcription using prototype
speech recognition. In-house staff created
speech user profile with
language processing toolkit. Video
first shows text immediately after speech-to-text conversion
(raw speech engine decoding). This is followed by
application of regular expression algorithms to search and match text
strings. Conversion rules may reflect
speaker or institutional preferences. Speech user
profile typically reflects these preferences. User loaded
the post-formatting transcribed session file (TSF) into
SpeechMax™ to play back audio and make any needed corrections.
Click here for
Flash or WMP video demo of
dictation with Olympus handheld recorder and remote Dragon server-based speech
recognition. Click here for
WMP video demo using Vista Windows speech recognition and desktop autotranscribe.
Demo #3 shows
single-window dual-engine comparison using server-based
Dragon NaturallySpeaking. User sequentially
opens Dragon and Custom Speech USA™ session
files, clicks compare documents toolbar button to
highlight differences, plays differences using menu
dropdown, makes changes, increases leading/trailing
playback to listen to word "of," and copies/pastes
"well-maintained" from Dragon to format text, and enters new lines.
Operator saves final distribution report as .TXT. Lowercase
"l" was transcribed by both engines and capitalized
as "L" for report distribution. Since user did
not create a separate verbatim annotation, the final
text is automatically saved as the verbatim text with
Instant Verbatim™ feature.
Demo #4 shows
double-window dual-engine comparison of Demo #3
session files. Operator selects toolbar window
icon to horizontally display
audio-aligned text. Operator specifically
references option of play entire phrase (including
difference) as opposed to playing difference only.
Demo #5 shows
comparison of uncorrected speech recognition
transcribed session file (TSF) with verbatim text .TXT.
Any difference represents an error.
Text comparison reduces
transcriptionist review time. This supports batch, automated
correction of transcribed session file (TSF) to
verbatim transcribed session file for speech
user profile training.
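The comparison step can be sketched with a standard diff: align the transcribed words against the verbatim key, count matches, and report an accuracy rate. This uses Python's difflib and is an illustration, not the product's actual algorithm.

```python
import difflib

# Align transcribed words against the verbatim key; every unmatched
# reference word counts as an error, yielding an accuracy rate without
# replaying any audio.
def accuracy_rate(verbatim, transcribed):
    reference = verbatim.split()
    hypothesis = transcribed.split()
    matcher = difflib.SequenceMatcher(None, reference, hypothesis)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference) if reference else 1.0

perfect = accuracy_rate("the mass is benign", "the mass is benign")   # 1.0
one_err = accuracy_rate("the mass is benign", "the mass his benign")  # 0.75
```

Batch correction would then replace each mismatched span in the transcribed session file with the corresponding verbatim words.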
Demo #6 shows
one potential application of text comparison in training
and education: comparing student interpretation, conversion,
or analysis of bounded input data. In one
approach, a medical transcription (MT) instructor can create a
verbatim transcribed session file text key and compare to
student output. Text comparison
identifies errors in spelling and punctuation and
generates accuracy rate. With audio-tagged text, instructor and
student do not have to search or replay any part of the original audio
file. Supports multilingual
medical transcriptionist training in
multiple languages. The
best result composite display provides the
instructor with a quick, visual estimate of the
errors made by two or more students. Accuracy levels
are determined automatically.
Another demo shows proof-of-concept prototype software for elementary school phonetics training ("phonics").
Local elementary school teacher requested
development of a prototype to show the local school board
as proof of concept. The video for "r" phonics shows how teacher
can customize training with web-based resources and
use text comparison. Teacher can customize
training to child's needs and locate and make available
web resources using multilevel text
annotations linked to specific document text.
Accuracy levels are determined automatically.
Another video shows a proof-of-concept prototype for language training
related to teaching chants for Old Testament Hebrew.
Prototype shows how pronunciation (chant) can be
customized and personalized by instructor.
Prototype uses Hebrew English translation available
from World ORT. English is read from left to
right. Hebrew is read from right to left.
Software synchronizes English, Hebrew, and
instructor chant. Instructor can provide
informational audio or text comments to help
student. Software can access one or more related web
sources to assist the student.
Navigating the Bible II
© 2000 World ORT, London, UK. US contact World ORT, New York.
Demo #8 shows
opening the MT Desk website for a medical
transcriptionist in training learning about prostate
cancer treatment. This functionality can be used to
launch video player or any other program.
Demo #9 shows use of spell check supplemented by audio
annotation pronunciation of a medical term for a student.
Demo #10 shows use of an Employee Information form and data
migration to Microsoft Word. Audio
annotation (blue highlighting) supports text and/or audio entry. Text
annotation (purple highlighting) supports text-only entry.
Demo #11 shows
form creation. Form creation generally involves
entry of field name and creation of audio annotation
within otherwise empty session file segment.
Demo #12 discusses
The Talking Form™
and audio prompt creation for form user.
Office Toolbar Add-In with data
transfer to/from Microsoft Word and
SpeechMax™. Using the software, speaker can
dictate into Microsoft Word with Dragon
speech recognition, transfer text and audio to
SpeechMax™ for transcription, and migrate data
back to Microsoft Word. User can also enter data in
SpeechMax™ and migrate to Microsoft Word.
Office Toolbar Add-In with
transfer XML data to/from Continuity of Care Record (CCR)
to/from Microsoft Word and to/from SpeechMax™. Add-In supports download/upload
XML data to/from Microsoft Word and CCR. Demo
represents proof-of-concept workflow. Video
demonstrates downloading data into Word using the Add-In
and modifying data based upon written or dictated
information. If dictated, transcriptionist
may play back dictated audio using SpeechMax™
or other software and transcribe into Word.
Alternatively, user may dictate into Microsoft Office
using speech recognition. Data may be transferred to
SpeechMax™ and modified using
dictation/transcription, speech recognition,
keyboard, or bar code. Modified data may be
transferred to Word and uploaded directly into CCR
using the Add-In. Alternatively, operator may
modify in Word before upload.
The first 7 examples demonstrate how user can
create complex presentations with speech, audio,
dictated and nondictated text, and graphics for
electronic audio book, electronic scrapbook,
presentation on segmenting dictation, lecture on
geography of Ireland, presentation of the Gettysburg Address,
sales presentation, and language instruction for
introductory German. Only the final example (singalong
or karaoke) shows a completed presentation.
The last example shows short AV presentation using
"Stairway to Heaven" by Led Zeppelin. In the last
presentation, note highlighted, audio-synchronized text,
document window elapsed time display, and slider bar
available for user adjustment of play location. Click on
presentation hyperlinks to see web page images; Flash or
WMP video also available for all except the segmenting dictation lecture.
Electronic audio book: Romeo and Juliet
Electronic scrapbook: One Fabulous Vacation Demo #14B
Lecture #1: Segmenting Dictation
Also see related
Lecture #2: Geography of Ireland
Speech: Gettysburg Address
Sales presentation: Pet Palace
Language instruction: Introductory German
Demo #18 Flash
Singalong/karaoke: Stairway to Heaven (Led Zeppelin)
"Stairway to Heaven," produced by Jimmy Page, executive
producer Peter Grant,
© 1971 Atlantic Recording
Corporation for the United States
and WEA International Inc. for the
world outside of the United States.
The video shows operator copying and pasting English
(source) text into web-based machine translator.
User delimits translation Spanish (target) text
output with vertical placeholders. Operator
clicks the synchronize session tags button.
Clicking in an English source text segment highlights
the corresponding translated segments.
The video shows operator tiling document windows
horizontally and synchronizing French translation
with English source. Operator opens third
document window for previously delimited Spanish
translation. Operator synchronizes Spanish
translation with both English and French
translations and synchronizes French and Spanish
translations. Operator selects each English
segment sequentially and confirms synchronized
highlighting of French and Spanish segments.
To compare accuracy of translation, operator may
repeat process by substituting identically delimited
French and Spanish translations from different
manual or automatic translation source.
Thereafter, operator may text compare against the initial
English or French translations or another standard. Initial
translations are shown in the screen shot below.
The demo shows use of multilevel annotations as a multilevel
knowledge base for creation of a nondisclosure agreement
and efficient document assembly with
SpeechMax™. In the example,
user selects phrases from various
"fill-in-the-blank" alternatives. User text
compares alternative form
selections. User can also use color-coded
composite Best Session™
to visualize the variability in selection
choice for each field.
Color coding shows agreement between
source documents in the knowledge base: red indicates
considerable difference, pink minimal difference, and clear no difference.
Knowledge base may be utilized by law firm,
business, or other organization.
The video shows divide/scramble of an untranscribed audio
session file and merge/unscramble of the transcribed session file.
The video shows a transcribed session file, redaction of
patient name from the transcribed session
file (TSF), playback of the export in
transcription software controlled with a foot pedal,
and transcription of the redacted audio file in Word.