More than a word processor™
For dictation, transcription, speech recognition, and other speech and language processing
Supports analysis, conversion, and processing of other text, audio, and image patterns
Reduce search time for pattern recognition errors with advanced text compare
U.S. patents and patents pending
SpeechMax™ supports audio recording for dictation, audio playback with transcriptionist foot
pedal or hotkeys, real-time speech recognition, desktop autotranscription of audio
files with speech recognition, editing of server-based speech recognition,
multilingual support, multiwindow document display, and text and audio annotation.
Session file content represents the sum of
document and annotation (comment) data from audio, text,
or image data processing.
Software has HTML display and XML storage of
proprietary session file content.
Reader Edition displays
session file text, audio, and image content.
Document license-embedded SessionLock™ is designed to
prevent unauthorized editing.
Consolidated text compare with color coding
SpeechMax™ runs a Microsoft Office Toolbar
Add-In in Word. The Add-In has "Next/Previous"
toolbar arrows for user navigation to Word bookmarks.
The Add-In also has Import/Export toolbar functions that
support annotation and phrase migration wizards. These
wizards support data migration between the SpeechMax™
session file and Word.
Available with the SpeechProfessional™ combo package.
Obtain more information about version compatibility of the
software with different SR engines and operating systems.
- Document window supports read/write text,
audio, or image
- Annotation window supports read/write text,
audio, or image
- Read/write includes .TXT, .RTF,
.HTML, or .SES (proprietary session file)
- Enter text by keyboard, barcode, or speech recognition
- Record speaker audio with integrated
annotation window sound recorder
- Continuous or segmental audio
- Use keyboard hot keys or foot control
to start/stop playback
- Real-time dictation with speech recognition
- Copy/paste graphics into document
- Standard text compare and text
compare by phrase
- Select text and play back audio with
audio-tagged document text
- Single, double, or multiple window display
- Tile documents horizontally,
vertically, or cascade
- Create text and/or audio annotation
(comment) to selected document text
- Unlimited annotations (multilevel
data) to selected text
- Unlimited users (multiuser
collaborative) may annotate same text
- Annotation may include comment or question
- Text annotation may include
hyperlink, command line
- Command line may launch program,
e.g., open media player or web page
- Use annotation window sound recorder
to record audio comment
- Load audio file or use text-to-speech
to create audio annotation
- Switch (transpose) document/annotation content
- Selectively move annotation text into
document to replace session file text
- Selectively move annotation audio
into document to replace audio
- Divide session file segments into two or more groups
- Sort (scramble)/unsort (unscramble)
group session file content
- Session lock converts
document/annotation content to read only
- Data migration of document phrases or annotations
- Application programming interface
- Plugins (add-ins) for third-party software
- Web services and file management
available through workflow manager
- Web-based Help files
Text compare across document (like word processor)
- Represents text compare of text strings with no
reference to source data
Text compare by phrase (synchronized text compare)
- Used to compare SR audio-aligned text
- Software uses different SR SDKs to convert output to CSUSA proprietary format
- Resegmentation produces an equal number of segments (phrases) in each .SES session
- Each phrase (utterance) arises from the same audio
- Each respective phrase has the same start/duration times
- Once synchronized, text in segments (phrases) is compared
- Differences highlighted, matches clear
- Differences indicate higher error risk
- Matches indicate greater reliability
- Any type of text can be
phrase compared if synchronized
- With equal numbers of
segments, session files are "synchronized"
- Phrase compare supported
even if underlying source data different
- For example, phrase
compare synchronized SR text and translation text
Session file = bounded synchronized data input for pattern recognition
- Input data may include text, audio, image, volumes, and spaces
- Placeholders delimit the bounded input
Screen shot shows toolbars and main document
window. Purple vertical markers represent placeholders delimiting phrases
in the Pledge of Allegiance.
Screen shot shows main document window and annotation
window. Annotation window includes sound recorder and text box.
Main window shows a form created using annotations. Audio annotation (blue
highlighting) supports text and/or audio entry.
Text annotation (purple highlighting, not shown) supports text only entry.
Text "Streeter" in annotation window represents last name. It may be moved
to first blank after "Full Name" in the document window form.
Multiwindow Tiled Display
Screen shot shows tiled horizontal display of 3 session
files representing original English text and 2 translations.
Vertical placeholders delimit the text in all 3 document
windows. The translations have the same number of segments delimited
by the same punctuation. The last synchronized segment is highlighted in
all three windows. Software supports horizontal, vertical, and cascade tiling.
Text Compare By Document/By Phrase
Software supports text comparison across the document, as with
word processor text compare. This comparison looks at text strings only, without
regard to source data. Text comparison by phrase synchronizes documents and
compares text aligned to the same audio.
For SR text compare:
1. Software converts Dragon (Nuance), Microsoft Windows, IBM, or other SR session files to CSUSA session file format.
2. Operator selects text compare across document or by phrase (utterance).
3. SR session files are typically "asynchronized," with different start/duration time stamps for each respective audio-aligned segment. If operator selects the by-phrase option, software resegments and retags the audio for each engine.
4. This results in an identical number of session file segments for each SR engine. Same-numbered segments each have the same start/duration times.
5. These synchronized files are text compared segment by segment (utterance by utterance).
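As an illustration, the segment-by-segment comparison in step 5 can be sketched as follows. The session file model (a list of segments with `text`, `start`, and `duration` fields) is an assumption for illustration, not the proprietary .SES format:

```python
# Illustrative sketch: phrase compare of two synchronized session files.
# Field names and structure are assumptions, not the actual .SES format.

def phrase_compare(session_a, session_b):
    """Compare two synchronized sessions segment by segment.

    Each session is a list of dicts with 'text', 'start', and 'duration'.
    Returns a list of (index, text_a, text_b, match) tuples.
    """
    if len(session_a) != len(session_b):
        raise ValueError("Sessions not synchronized (unequal segment counts)")
    results = []
    for i, (seg_a, seg_b) in enumerate(zip(session_a, session_b)):
        # Normalize whitespace and case so trivial formatting differences
        # are not flagged as recognition differences.
        text_a = " ".join(seg_a["text"].split()).lower()
        text_b = " ".join(seg_b["text"].split()).lower()
        results.append((i, seg_a["text"], seg_b["text"], text_a == text_b))
    return results

# Example: two engines' output for the same three utterances.
engine_a = [{"text": "I pledge allegiance", "start": 0.0, "duration": 1.5},
            {"text": "to the flag", "start": 1.5, "duration": 1.0},
            {"text": "of the United States", "start": 2.5, "duration": 1.4}]
engine_b = [{"text": "I pledge allegiance", "start": 0.0, "duration": 1.5},
            {"text": "to the flat", "start": 1.5, "duration": 1.0},
            {"text": "of the United States", "start": 2.5, "duration": 1.4}]

differences = [r for r in phrase_compare(engine_a, engine_b) if not r[3]]
print(differences)  # only segment 1 ("flag" vs "flat") is flagged
```

An editor would then tab only to the flagged segments and listen to their audio, skipping the matches.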
Text compare highlights differences representing potential SR
errors. The speech editor tabs to differences, listens to audio,
and corrects SR errors as required. The editor also corrects
obvious speaker errors, e.g., speaker said right when meant left.
The speech editor may insert comments or questions about text and
send to speaker as annotation (comment). The editor saves
audio-linked verbatim text (what speaker said) as a training
session file (.TRS). There is minimal audio to review with highly
accurate SR systems since there are few differences.
Synchronization = Session File with Same Number of Segments
Software can phrase compare texts from any source or processing method as long as
they have the same number of segments. The
delimiters vary depending upon the type of file.
For dictation, manual transcription, and speech
recognition, the delimiters are generally utterance
boundaries. For translation, it can be punctuation
such as period, comma, colon, or semicolon.
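A minimal sketch of punctuation-based delimiting, assuming plain text input (the actual delimiter handling and segmentation logic in the software are not specified here):

```python
import re

# Illustrative sketch: split source and translation text into segments at
# punctuation common to both languages, then check that the two files are
# "synchronized" (equal segment counts). The regex is an assumption.
DELIMITERS = re.compile(r"[.,:;]")

def segment_by_punctuation(text):
    """Split text at periods, commas, colons, and semicolons."""
    segments = [s.strip() for s in DELIMITERS.split(text)]
    return [s for s in segments if s]  # drop empty trailing pieces

english = "I came, I saw, I conquered."
french = "Je suis venu, j'ai vu, j'ai vaincu."

seg_en = segment_by_punctuation(english)
seg_fr = segment_by_punctuation(french)

# Equal segment counts => the files can be phrase compared.
print(len(seg_en) == len(seg_fr))  # True: both have 3 segments
```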
Text Compare Display Options
Process may perform text compare in the multiwindow session file editor using:
1. Two or more viewable buffered document windows
2. Viewable session and "hidden" session, displaying differences as popup
3. Different texts displayed as popup dialog when cursor passed over text
4. Consolidated text compare
1. Two or more viewable buffered document windows
Using "dual-engine" comparison with two
visualized buffered document windows, the speech editor can text compare with both
sets of text visualized. Operator selects phrase compare or text compare
(across document) by clicking on "compare phrases" or "compare document".
a. Phrase Compare. The first example represents phrase compare where segments
are synchronized and equal in number. Overall accuracy is about the same
in both texts. Dropdown menu indicates the difference in the other (lower)
panel. The window reflects an old (SpeechMax+) version of the software.
b. Text Compare Across Document
In the second example (newer version of
software), user has selected conventional text compare
across the entire document. Top window text has
about 90% accuracy, bottom window has just under 100%
accuracy. Dropdown menu selection is play
difference (audio) for the first highlighted difference.
With both sessions in this dual-window display,
relatively few differences suggest
both are highly accurate.
Transcriptionist should only have to listen to
about 10% of speech recognition audio with these
highly accurate texts. Reduction in
transcriptionist review time is about 90%. Example
is consistent with conclusion that increasing
accuracy will reduce text differences, and that
differences will approach zero as accuracy nears 100%.
2. Viewable session and
"hidden" session, display differences as popup
Demo #3 . . .
Single-window dual-engine comparison using server-based
Dragon NaturallySpeaking . . . User sequentially
opens Dragon and Custom Speech USA™ session
files, clicks compare documents toolbar button to
highlight differences, plays differences using menu
dropdown, makes changes, increases leading/trailing
playback to listen to word "of," and copies/pastes
"well-maintained" from Dragon to format text, and enters new lines.
Operator saves final distribution report as .txt. Small cap
"l" was transcribed by both engines and capitalized
as "L" for report distribution. Since user did
not create a separate verbatim annotation, the final
text is automatically saved as the verbatim text by the
Instant Verbatim™ feature.
3. Different texts displayed as popup
dialog when cursor passed over text
4. Consolidated text compare,
e.g., Text A vs. Text B, Text B vs. Text C, Text A
vs. Text C ==> color coding.
The SR editor can also use consolidated text compare
with 3 or more SR systems in Best Guess mode.
Color coding indicates degree of
difference, for example: clear = no
difference, pink = difference between 2,
red = all 3 differ.
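The color-coding scheme can be sketched as follows; the three-way agreement test and the simple majority-vote "best guess" below are illustrative assumptions, not the software's actual algorithm:

```python
from collections import Counter

# Illustrative sketch: consolidated text compare of three SR outputs for one
# segment, assigning a color per the scheme described above.

def color_code(text_a, text_b, text_c):
    """Return 'clear', 'pink', or 'red' for three versions of one segment."""
    normalized = [" ".join(t.split()).lower() for t in (text_a, text_b, text_c)]
    distinct = len(set(normalized))
    if distinct == 1:
        return "clear"   # all three agree: no difference
    elif distinct == 2:
        return "pink"    # two agree, one differs
    else:
        return "red"     # all three differ

def best_guess(texts):
    """Majority vote across engines: the most common version wins."""
    return Counter(texts).most_common(1)[0][0]

print(color_code("the patient is stable", "the patient is stable",
                 "the patient is stable"))  # clear
print(color_code("the patient is stable", "the patient is stable",
                 "the patient is able"))    # pink
print(color_code("the patient is stable", "the patient is able",
                 "the patient his table"))  # red
```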
With a large number of sessions, a high
number of matches indicates increased
likelihood of reliability; text is
virtually all clear. A large
number of differences indicates
increased likelihood of error; text is largely highlighted.
In the color-coded Best Guess example
above, consolidated text compare highlights about
55% of text (representing differences), leaving about
45% nonhighlighted text (nondifferences). There is a
potential reduction in audio review time of 45%.
Focus on "differences" audio and reduce audio review.
A major SR company advertises up to
98% accuracy. If an editor reviews 10 hours of
audio at this accuracy, 9 hours, 48
minutes represents correctly recognized speech,
and 12 minutes represents errors. With 90%
SR accuracy, an editor reviews 10 hours of audio
to find 1 hour of misrecognitions. With increasing accuracy of SR
and other pattern recognition, an emerging issue is efficiently finding
the increasingly few pattern recognition errors.
With highly accurate SR, text compare offers at least 80% decreased review time.
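The review-time arithmetic above can be checked in a couple of lines (the 10-hour figure and accuracy levels come from the text):

```python
# Quick check of the review-time arithmetic: minutes of misrecognized
# audio in a body of dictation at a given recognition accuracy.

def error_minutes(total_hours, accuracy):
    """Minutes of misrecognized audio in total_hours at a given accuracy."""
    return total_hours * 60 * (1 - accuracy)

print(round(error_minutes(10, 0.98), 1))  # 12.0 minutes (9 h 48 min correct)
print(round(error_minutes(10, 0.90), 1))  # 60.0 minutes (1 hour of errors)
```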
User tabs to
highlighted differences, skips matches
Selects text, plays back audio, and corrects as needed
User may open both documents to view simultaneously if
transcription accuracy is about the same as gold standard MT
Decreased audio review time as accuracy increases
Significant productivity gains with highly accurate SR
Zero audio review time if both SR systems 100% accurate
1. Large SR transcription volumes suggest large
potential savings in speech editor review time with very
accurate SR.
2. Helps dictating speakers who self-correct.
Speakers can use
this process in "preview" mode to detect potential
misrecognitions before undertaking their own review. They can
also use process in "review" mode to determine that potential
misrecognitions have been reviewed.
3. Create speech user profiles with unsupervised
training. Matches indicate likely
accurate text. Use for SR speech user profile
training without manual audio review. Text compare
results of a computer-generated speech user profile
against a manually-created standard or other
computer-generated speaker profile.
4. Compare text output using different SR technologies. SR may
use hidden Markov models with Gaussian mixtures, neural
networks, adaptive, nonadaptive, speaker dependent, speaker
independent, or other techniques. Synchronize text
output "by phrase" from speech engines
using different conversion variables and compare
relative recognition accuracy.
5. There are other
uses for text compare in speech and language processing,
including training, synchronized translation, and use of
knowledge base, as described in video demos.
Demo #6 . . .
medical transcriptionists . . .
Video shows potential application of text compare of student
interpretation, conversion, or analysis of bounded
input data in training and education. In one
approach, a medical transcription (MT) instructor can create a
verbatim transcribed session file text key and compare to
synchronized student output. Text comparison
identifies errors in spelling and punctuation and
generates accuracy rate. Synchronized audio
makes training more efficient. Operator need
only click on text to hear associated audio. Multilingual
medical transcriptionist training results can be shown in a
consolidated text compare. This composite provides the
instructor with a quick, visual estimate of whether
other students made similar or different errors. Software
determines individual student accuracy levels automatically.
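One way an accuracy rate like this could be computed is word-level sequence matching against the verbatim key; the difflib-based scoring below is an assumption for illustration, not the software's actual algorithm:

```python
import difflib

# Illustrative sketch: score a student transcript against the instructor's
# verbatim session file text key.

def accuracy_rate(key_text, student_text):
    """Percent of key words matched by the student transcript."""
    key_words = key_text.split()
    student_words = student_text.split()
    matcher = difflib.SequenceMatcher(None, key_words, student_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(key_words)

key = "The patient denies chest pain , shortness of breath , or palpitations ."
student = "The patient denies chest pain , shortness of breathe or palpitations ."

# Prints the student's accuracy percentage against the key.
print(round(accuracy_rate(key, student), 1))
```

With synchronized audio, the instructor could click the low-scoring segments and hear the student's associated dictation directly.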
. . .
elementary school phonics . . .
Video shows proof-of-concept prototype software for elementary school phonetics training ("phonics").
Local elementary school teacher requested
video to demonstrate proof of concept to local school board. The
video shows how teacher
uses synchronized text compare (by phrase). Accuracy
levels are determined automatically. Video also
shows how teacher can customize training with web-based resources.
Video displays how teacher has made web
resources available using software text
annotations linked to specific document text.
If the same panel of phonetic sounds is used for multiple
students, teacher can potentially (not shown)
synchronize phrases. To do
so, operator can load each session file separately
into the multiwindow session file editor. Once
synchronized, teacher could sequentially listen to
each child's pronunciation. In this way,
teacher can compare varied pronunciations for each
of the phonics phrases.
Demo #21 . . .
English, French, and Spanish translation . . .
Video shows operator tiling
document windows horizontally and synchronizing
French translation with English source.
Operator opens third document window for previously
delimited Spanish translation. Operator
synchronizes Spanish translation with both English
and French translations and synchronizes French and
Spanish translations. Operator selects each
English segment sequentially and confirms
synchronized highlighting of French and Spanish segments.
Delimiting output with punctuation common to
different languages (e.g., periods, commas, colons,
or semicolons) simplifies potential automation of
the text compare process.
To compare accuracy of each
translation (not shown), operator can obtain
identically delimited French and Spanish
translations from different manual or automatic
translation sources. Operator may phrase
compare, respectively, synchronized French
translations and same for Spanish texts.
Initial translations are shown in the screen shot.
Demo #22 . . .
Law firm knowledge
base with color-coded frequency of previous word
use . . .
Video shows use of
knowledge base for creation of a nondisclosure agreement
and efficient document assembly. In example,
user selects phrases from various
"fill-in-the-blank" alternatives. User text
compares alternative form
selections. User can also use color-coded
composite Best Session™
to visualize the variability in selection
choice for each field.
Color coding shows agreement between
source documents in knowledge base. Red indicates considerable difference, pink
indicates minimal difference, and clear indicates no difference.
Knowledge base may be utilized by law firm,
business, or other organization.
6. Limitations of Text
Compare: Process does not identify every error. Two or
more speech engines sometimes misrecognize the
same audio. With an identical misrecognition, no
difference is detected. For example, speaker says
"hat" and SR engines A and B both transcribe as "that".
However, text compare highlights nonidentical
misrecognitions. For example, speaker says "hat,"
SR engine A transcribes "that," and SR engine B
transcribes a different error; the nonidentical results are highlighted.
Company experience is that identical or nonidentical
misrecognitions occur less frequently with improving SR
accuracy. Goal using text comparison is to
return text with the same or lower error rate
compared to gold standard manual transcription (usually
estimated about 5% or less).
7. Matches as Indicator of
Reliability: Process may also use "nondifferences" mode to identify
likely accurate SR training data.
Potentially, process may use nondifferences for
unsupervised (nonhand labeled) SR speech user profile
training. This differs from supervised or hand-labeled
data entry where human reviewer supervises and manually checks
and corrects data submitted for training. In
unsupervised mode, developer can use text comparison to help "computers train
computers." A high number of matches with other
session files indicates increased likelihood of
reliability. Operator can use export audio text
pairs functionality to save unvalidated results of matched
transcription to train SR models. Software also
supports text compare of text results from other pattern recognition.
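A sketch of the export-matched-pairs idea, assuming session files modeled as lists of segments with `audio` and `text` fields (the real export format is not specified here):

```python
# Illustrative sketch of the "nondifferences" mode: keep only segments where
# two SR outputs agree, and export those audio/text pairs as likely reliable,
# unsupervised training data. Data shapes are assumptions.

def export_matched_pairs(session_a, session_b):
    """Return (audio_ref, text) pairs for segments where both engines agree."""
    pairs = []
    for seg_a, seg_b in zip(session_a, session_b):
        if seg_a["text"].strip().lower() == seg_b["text"].strip().lower():
            # Agreement suggests (but does not guarantee) a correct transcript.
            pairs.append((seg_a["audio"], seg_a["text"]))
    return pairs

engine_a = [{"audio": "utt_000.wav", "text": "please advise"},
            {"audio": "utt_001.wav", "text": "the right knee"}]
engine_b = [{"audio": "utt_000.wav", "text": "please advise"},
            {"audio": "utt_001.wav", "text": "the right need"}]

print(export_matched_pairs(engine_a, engine_b))
# [('utt_000.wav', 'please advise')]
```

Only the agreed segment is exported; the disputed one would go to a human reviewer or be discarded.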
Use Text Compare with Other Audio, Text,
or Image Pattern Recognition
Session file text output may represent
human or automated processing of
delimited text, audio, or image data.
For example, the image below shows an audio wave.
If it is associated to delimited text description, it is
a session file.
Other potential sources of pattern recognition data
include text, e.g., optical character recognition, automatic
document classification; audio, e.g., music, traffic
noise and soundscapes, baby cries, and dog bark
characterization; or image, e.g., computer-aided
medical diagnosis, facial profiling, retinal or iris
scanning, handwriting, or fingerprinting.
As with SR
text compare, steps include convert to .SES format, synchronize data files, compare text
output (if available), and identify differences
and matches. Determine if differences represent
errors. Assume matches represent likely correct output.
If output data is corrected, user may obtain training
data for pattern recognition models.
With image source data,
text output results may be displayed in
document window and text compared. Image
may be displayed in document window or
linked using the annotation feature to a
file, webpage, or video.
For example, computer-aided diagnosis (CAD) for
mammography can generate mammogram
session file with suspicious areas
marked for review and described by text.
These may be viewed by reference to a
hyperlink. Process may compare text output from
> 2 CAD systems and/or human
reviewers for each suspicious area.
Decision as to final report may be made
by single radiologist in review.
Process can use data from surgical or
biopsy specimen to improve CAD pattern recognition.
Selective Modification of Session
File Text and Audio
The session file editor has document window (upper) and
annotation (comment) window (lower).
Operator may selectively edit document window speech
recognition session file text or audio
using text or audio annotations.
Once the user has entered annotations, user may switch
(transpose) document/annotation content, or selectively
replace (move) annotation text or audio into the document window to
modify session file text or replace audio. The
switch and replace functions use the copy/paste
functions of the underlying operating system.
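The switch (transpose) and replace operations can be sketched on a simplified segment model (field names are assumptions; the .SES structure is proprietary):

```python
# Illustrative sketch of replace and switch on an assumed data model.

def replace_from_annotation(segment, annotation):
    """Selectively move annotation text and/or audio into the document segment."""
    if annotation.get("text"):
        segment["text"] = annotation["text"]
    if annotation.get("audio"):
        segment["audio"] = annotation["audio"]
    return segment

def switch(segment, annotation):
    """Switch (transpose) document and annotation text content."""
    segment["text"], annotation["text"] = annotation["text"], segment["text"]
    return segment, annotation

# Example: move a correction ("Alan Smith") from the annotation window
# into the document segment that was misrecognized as "Adam Smith".
segment = {"text": "Adam Smith", "audio": "speaker_a_003.wav"}
annotation = {"text": "Alan Smith", "audio": "speaker_b_001.wav"}
replace_from_annotation(segment, annotation)
print(segment["text"])  # Alan Smith
```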
Process may use annotation text/audio pairs
to train the speech user profile. Annotation training differs from general session
training that uses document window audio-linked verbatim training
session file (.TRS) to train the speech user profile.
Session file editor supports voice correction of another
speaker's session file, i.e., selective modification of
text previously dictated with speech recognition.
This enables one or more other parties to voice
correct the speech recognition text of another speaker
and to use the modified text and audio to train the
speech user profiles of both speakers. This is not
supported by Dragon NaturallySpeaking or other
off-the-shelf desktop speech recognition software.
With desktop speech recognition, Speaker B can open another speech user profile in a
buffered document window and listen to the audio-linked
text created by Speaker A. Speaker A may represent
a dictating lawyer, physician, or other dictating
speaker. Speaker B may represent
a speech editor or another speaker attempting to
collaborate on a document with Speaker A.
If Speaker B attempts to voice correct
Speaker A's text, saving the Speaker B voice with
the corrected text will corrupt the Speaker A speaker
profile. If Speaker B opens his/her profile and corrects
the text, there is no training of Speaker A profile.
This limits voice correction by a speech editor or use of desktop
SR to create multispeaker collaborative documents and
train speech user profiles.
With SpeechMax™, a second speaker can open the .SES session file dictated by a first speaker in
the main document window (upper) and use the annotation (comment) window (lower)
to mark up the session file of the first speaker with voice. The second speaker
selectively modifies document window text of the first speaker using OS copy/paste functionality.
This supports selective modification of
document text and audio tags. This also allows second
author/speaker to make corrections with voice without damage to
1st or 2nd speech user profile. This
assists transcriptionist or court reporter voice
correction of documents, as well as multispeaker
creation and editing of collaborative documents.
Screen shot shows annotation correction of Speaker A SR text.
Speaker A had dictated "Alan Smith," but it was
recognized as "Adam Smith." With the Speaker A profile open in a
buffered window, the Speaker B correctionist plays back the first
speaker's audio, then dictates the correction using SR ("Alan Smith")
into the annotation window sound recorder. Document Window (top) shows
vertical purple utterance markers separating text, with the
"Adam Smith" text highlighted in blue. Annotation Window
(bottom) shows the sound recorder and annotation text window
with the "Alan Smith" text; SPEAKER B is highlighted under the
Annotation Name column as the party making the annotation.
1. Train both speech user profiles.
After second speaker correction of
first speaker SR text, the system can train both speech
user profiles using the verbatim text "Alan Smith" and
the respective audio dictated by first speaker and
second speaker. There is no corruption of either
speech user profile.
2. Supports sequential voice creation/correction of
multispeaker documents on a local area network or of a
document emailed as an attachment to another party.
Unlimited number of annotations (comments) per dictated
text are supported. These may represent questions or comments about the text.
May potentially use similar document/annotation concept for web-based, real-time multispeaker collaborative documents. Utilize
first window to view text/playback audio. Use
second window to dictate SR correction or change.
Copy/paste text from second window into first. Use
corrected text to train Speaker A profile. Use
Speaker B audio and text dictated into second window to
train/update Speaker B profile. Potentially the second
window may reside in the first browser or in a second, newly opened browser.
Software supports selective modification
of audio tag of document window text.
Mainstream adaptive SR correction and user training
focuses on text
correction of audio-linked text. There is no
convenient way to selectively modify SR audio tag with
human or synthetic voice or other audio.
With this system, user can
also selectively modify the audio tag of audio-linked SR text. This relies upon OS
copy/paste functionality. Process requires
creating and summating time stamp offsets of the session
file to maintain alignment unless original audio is same
length as the replacement audio.
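The offset bookkeeping can be sketched as follows, assuming segments carry `start`/`duration` times in seconds (an illustrative model, not the .SES format):

```python
# Illustrative sketch: when a replacement audio clip is a different length
# than the original, every later segment's start time must be shifted by the
# cumulative length difference to keep text and audio aligned.

def replace_segment_audio(segments, index, new_duration):
    """Replace one segment's audio duration and re-align later segments."""
    offset = new_duration - segments[index]["duration"]
    segments[index]["duration"] = new_duration
    if offset != 0:
        for seg in segments[index + 1:]:
            seg["start"] += offset  # summate the time stamp offset
    return segments

segments = [{"start": 0.0, "duration": 2.0},
            {"start": 2.0, "duration": 1.5},
            {"start": 3.5, "duration": 1.0}]

# Replace segment 0's audio with a 2.5 s clip; later segments shift by +0.5 s.
replace_segment_audio(segments, 0, 2.5)
print([seg["start"] for seg in segments])  # [0.0, 2.5, 4.0]
```

If the replacement clip is the same length as the original, the offset is zero and no re-alignment is needed, as the text notes.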
One use for this feature is modification
of audio instructions associated to a form field.
SpeechMax™ supports simplified form creation with data fields
entered in the main document window as text by keyboard
or SR. User can create complex forms
without use of expensive software development kits or
knowledge of advanced
scripting. Forms creation uses text and audio
annotation (comment) features.
Forms creator can enter text and audio
comments. Forms creator can also create text annotations
to form fields that link to
websites or run programs (such as media player).
These can supply additional information to person who is
completing the form.
As a result, the
party completing the form can access
text, spoken instructions, website, or other
information relevant to form completion.
Operator can also selectively
modify document window text audio tag using audio annotations.
1. Create form or other text with SR, replace audio tag
representing original SR dictation with audio
content. User completing form selects form field text
and hears relevant audio information. Potential
benefit where necessary to hear audio, e.g., testing
hearing or recognition of speech or music.
2. Train profile for real or artificial language with
synthetic (robotic) speech.
Substitute synthetic voice for SR human voice tag.
This creates session file text with audio-linked
synthetic voice. This can be used to create speech
recognition for a real or artificial language with a
synthetic voice font. Speech user profile would
consist of model based upon the synthetic voice.
This is one approach to creating a speech user profile for an artificial language.
3. Improve audio quality of SR audio-tag when intended
for playback. Electronic audio book may be read by
professional voice talent and transcribed with SR.
Voice talent may also dictate audio which is presegmented and transcribed manually. Some audio
may be poor quality and require redo by voice talent. Process may selectively replace
("splice in") phrase audio for one or more segments or
replace audio for individually-tagged words.
4. Associate foreign language audio tag to source
text dictated by SR
This creates a learning tool for foreign language.
For example, user clicks on English in
document window and
hears foreign translation as audio tag.
5. Associate song to lyrics. Dictate song lyrics,
transcribe with SR or manual transcription, delete
spoken audio tags, and substitute song for spoken word. (The
following demo was created differently, but shows the same concept.)
. . .
Singalong/karaoke: Stairway to Heaven (Led Zeppelin)
. . .
Demo shows playback of audio (song)
with highlighted text and slider bar.
"Stairway to Heaven," produced by Jimmy Page, executive
producer Peter Grant,
© 1971 Atlantic Recording
Corporation for the United States
and WEA International Inc. for the
world outside of the United States.
Using the top slider bar, a user may "drag" playback
point to another point within the same segment, or to a
point in another segment. Software also finds the corresponding location
in text and resumes listening at the new time stamp using the
slider bar. A slider bar may also be associated with
each separate window.
The same synchronization techniques support dynamic display of
synchronized text in two or more windows. One
window may play the song and text, one or more other
windows may display only translated song lyrics.
Protect Confidential Session Data
Redaction limits access to confidential information such
as name or address; division limits the amount of data
transferred to any individual outside the processing node;
and scramble disorders content, making it more difficult to determine overall meaning.
Transcription or SR editing
is often processed out-of-office at remote sites.
Federal and state government, businesses, and professional offices would often
like to outsource data processing to cheaper centers, but in ways that limit
disclosure of confidential information. The following is designed for
security requirements as might apply to routine business or professional dictation.
To obscure content during manual transcription or SR editing:
Redact word or phrase SR audio and text
Divide session file segments into
> 2 groups
Send divisions to different processing nodes
Option to scramble (sort) segments within division
Maintains intelligible unit (utterance) for
transcription or editing SR
Utterance = phrase or short sentence
Balancing obscuring content vs. efficient transcription
One division of the scrambled session file is seen in the window below.
1. Limits knowledge of the whole document. Designed
for minimal security level protection. Solution is
not designed for data requiring a high level of security.
2. Maintains intelligible unit (utterance) for
manual transcription or SR correction. While
data is obscured by scramble, utterance phrase or
sentence should be intelligible to manual
transcriptionist or SR speech editor.
3. Simple process to reverse
redact/divide/scramble after document return (see demo).
Potential application to TV captioning or court reporting by
remote court reporter. May segment and
distribute single segments to different captionists or
court reporters. May stream
untranscribed session file
segments (audio utterances); transcription results in
transcribed session file with audio-linked text.
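A sketch of the divide/scramble and merge/unscramble round trip, assuming segment order is restored from retained indices (how the software actually tracks and protects order is not specified here):

```python
import random

# Illustrative sketch of divide/scramble and merge/unscramble. Segment
# indices are kept alongside each utterance (they would be withheld from the
# remote transcriptionist in a real deployment) so original order can be
# restored after transcription.

def divide(segments, n_groups):
    """Deal indexed segments round-robin into n_groups divisions."""
    groups = [[] for _ in range(n_groups)]
    for i, seg in enumerate(segments):
        groups[i % n_groups].append((i, seg))
    return groups

def scramble(group, seed):
    """Shuffle a division's segments (order obscured, utterances intact)."""
    shuffled = group[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

def merge_unscramble(groups):
    """Recombine all divisions and restore original segment order."""
    combined = [pair for group in groups for pair in group]
    return [seg for _, seg in sorted(combined)]

utterances = ["seg0", "seg1", "seg2", "seg3", "seg4", "seg5"]
groups = [scramble(g, seed=42) for g in divide(utterances, 2)]
print(merge_unscramble(groups))
# ['seg0', 'seg1', 'seg2', 'seg3', 'seg4', 'seg5']
```

Each utterance stays intelligible for transcription, but no single node sees the whole document in order.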
Demo #23 . . .
Divide, scramble, merge, and unscramble . . .
Video shows divide/scramble of untranscribed audio
session file and merge/unscramble of transcribed session file.