Speech Recognition: Is It Ready for Docs?

In two years, there will be virtually no need for transcription. And, therefore, no need for medical transcriptionists (MTs). Just kidding. I figured that would be a sure way to get your attention. Particularly if you’re a transcriptionist.

OK, now that your heart is pumping faster, your fists are tightly clenched and your eyes are riveted on this page, let’s do a realistic appraisal of the state of technology for speech recognition. Are we still looking at the seemingly eternal three- to five-year rolling window that has been forecast ever since the early 1980s? Is the timeframe even shorter? Maybe there’s even some stuff going on now that you might want to get into.

First, a couple of important definitions. We will use the term “front-end” speech recognition to describe systems where the dictator speaks into a high-quality microphone attached to a PC, the recognized words are displayed right after they are spoken, and we expect the dictator to correct any and all misrecognitions. “Back-end” recognition will refer to a different scenario: the dictator speaks into a telephone, the voice gets recorded on a digital dictation server, the voice file then gets run through a recognition engine to produce a draft, and both the draft text and the synchronized voice file are sent to an editor to correct and finalize. Both of these approaches can work with portable digital voice recorders (including Pocket PCs), which will certainly become increasingly popular over the next few years, but we won’t complicate this relatively brief article by dealing with that issue too.
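
For the technically inclined, here’s a bare-bones sketch of that back-end flow in code. Every function name is a placeholder I made up to mirror the steps; no vendor’s actual software looks like this.

    # Hypothetical sketch of the back-end flow: record, recognize on a server,
    # then route the draft plus the synchronized audio to an editor.
    # Every name here is a made-up placeholder, not a real product API.

    def record_to_dictation_server(phone_audio: bytes, dictator_id: str) -> bytes:
        return phone_audio  # in reality, stored as a voice file on the dictation server

    def recognition_engine(voice_file: bytes, dictator_id: str) -> str:
        return "<draft text from the dictator's acoustic and language models>"

    def send_to_editor(draft: str, voice_file: bytes) -> str:
        return draft  # the editor corrects against the synchronized audio and finalizes

    def backend_workflow(phone_audio: bytes, dictator_id: str) -> str:
        voice_file = record_to_dictation_server(phone_audio, dictator_id)
        draft = recognition_engine(voice_file, dictator_id)
        return send_to_editor(draft, voice_file)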

Front-end speech recognition is the Holy Grail…at least for healthcare professionals who don’t make their living in the dictation/transcription industry. Doctors handle the whole nine yards. They dictate, correct and authenticate. All at one sitting. No transcription. No labor cost. No delay. Of course, it might be a little tricky for the doc to get the demographics attached and to make sure the report gets appropriately into the text repository, but those challenges can be overcome. We all know how flexible and resourceful physicians are when it comes to dictation procedures. And those same observational skills make us see how innocent and remorseful O. J. Simpson is.

However, there’s another challenge presented by front-end recognition, and this one is not so easily overcome. It takes weeks, if not months, for dictators to achieve a satisfactory 97+ percent accuracy rate. And, during that period, there’s a tremendous amount of time taken away from patient care. Fortunately, most physicians don’t care about being inefficient and wasting time. Yeah, right. So, that’s a wide and shaky bridge to cross. And many docs fall off along the way.

But that’s just one of the obstacles. While getting there is surely not half the fun, being there isn’t that great either. Even an impressive error rate of just 3 percent means about one out of every 33 words is misrecognized. So the dictator must make one correction, on average, every three lines. This can add a total of maybe 20 to 30 minutes onto a single day, enough time to see two or three additional patients, which could easily pay for an MT, with some golf money left over.
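
If you want to check my math, here’s a quick back-of-the-envelope calculation. The daily dictation volume, words per line and seconds per correction are illustrative assumptions, not measured figures.

    # Back-of-the-envelope check on the 3 percent figure.
    # Volume and timing numbers are illustrative assumptions, not measurements.
    error_rate = 0.03                 # i.e., 97 percent accuracy
    words_per_line = 10               # assumed line length
    words_dictated_per_day = 4000     # assumed daily dictation volume
    seconds_per_correction = 15       # assumed time to stop, fix and resume

    errors_per_day = words_dictated_per_day * error_rate          # about 120
    lines_between_errors = (1 / error_rate) / words_per_line      # about 3.3
    minutes_correcting = errors_per_day * seconds_per_correction / 60

    print(f"One error roughly every {lines_between_errors:.1f} lines")
    print(f"About {minutes_correcting:.0f} minutes a day spent correcting")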

We can, therefore, conclude that front-end recognition is totally worthless. Well, that’s a bit of an overstatement for what is an extremely impressive technology. But it really is just not ready today for the vast majority of physicians. And it won’t be ready tomorrow either. I should mention, at this time, that front-end recognition does work pretty well for radiology and pathology, due to the simplicity of those vocabularies. (Please don’t tell the radiologists and pathologists I said that.) But in this article, we’re focusing on general clinical dictation.

So let’s move around to the back end. The game plan here is to not change physician dictation behavior in any way whatsoever. If they dictate into a digital dictation system over a telephone, just leave well enough alone. The goal is simply to substantially reduce the cost of transcription, via dramatic improvements in productivity. I used the words “substantially” and “dramatic” because if we just experience incremental improvements, it won’t be worth making the changeover. All of us love innovation; we hate change.

There are two components to the formula for making back-end recognition worth doing. First, the accuracy rate has to get really high, probably over 95 percent. And second, the editing tools have to be super slick, because otherwise it can take longer to correct even the remaining 5 percent of errors than to transcribe from scratch. In which case, we can all agree that it would be rather stupid to go there.
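
Here’s the break-even logic in rough numbers. The per-word timings below are nothing more than assumptions picked to illustrate the point; real values vary by editor and by report.

    # Rough break-even: editing a recognized draft versus transcribing from scratch.
    # Per-word timings are assumptions chosen only to illustrate the point.

    def editing_minutes(words, accuracy, sec_per_verified_word=0.3, sec_per_fix=6.0):
        # The editor still reads and verifies every word, then fixes each error.
        errors = words * (1 - accuracy)
        return (words * sec_per_verified_word + errors * sec_per_fix) / 60

    def transcribing_minutes(words, sec_per_typed_word=0.7):
        return words * sec_per_typed_word / 60

    words = 300  # an assumed short report
    for accuracy in (0.90, 0.95, 0.98):
        print(f"{accuracy:.0%} draft: edit {editing_minutes(words, accuracy):.1f} min "
              f"vs type {transcribing_minutes(words):.1f} min")

With these made-up numbers, a 90 percent draft actually loses to straight transcription, and the win only shows up somewhere north of 95 percent, which is exactly the point.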

Telephones have lousy microphones, and the phone lines cut off both the high end and the low end of the frequency spectrum. The bottom line is a crappy voice file. It’s fine for conversation and even for digital dictation, but it’s acoustically inferior for purposes of speech recognition. Just as important, the dictator is highly unlikely to behave in a way that is conducive to enhancing accuracy. In a front-end situation, the dictator sees which words are misrecognized and has an incentive to fine-tune the way those words are dictated. In other words, the system is training the dictator, even while the dictator is training the system. Not so in the back-end scenario.
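
If you’d like to see what the phone line does to the signal, here’s a small sketch using SciPy to band-limit audio to the roughly 300 to 3,400 Hz telephone passband. The filter order and exact cutoffs are illustrative choices, not a model of any particular phone network.

    # Simulate the narrow telephone passband (roughly 300-3,400 Hz) that strips
    # away the low and high frequencies a recognizer would love to have.
    import numpy as np
    from scipy.signal import butter, lfilter

    def telephone_bandlimit(audio, fs=16000):
        # 4th-order Butterworth band-pass; cutoffs approximate the analog phone channel.
        b, a = butter(4, [300, 3400], btype="bandpass", fs=fs)
        return lfilter(b, a, audio)

    fs = 16000
    t = np.arange(fs) / fs
    # A synthetic signal with energy below, inside and above the phone band.
    signal = (np.sin(2 * np.pi * 150 * t)
              + np.sin(2 * np.pi * 1000 * t)
              + np.sin(2 * np.pi * 6000 * t))
    narrowband = telephone_bandlimit(signal, fs)  # mostly the 1 kHz component survives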

Because all the processing is going on in a back-end server, the dictators never see the raw recognition results. They may even believe that conventional transcription continues. If they routinely do not dictate punctuation, then there’s no punctuation fed to the recognizer. So either the software or the editor will have to add it in. Similar issue for “ahs” and “ums,” which need to be removed by the software, or else the editor will have to delete whatever word the recognizer whimsically decides to insert for the sound. Coupled with the crappy acoustics, this is not a situation that leads to optimism about the outcome. So maybe we should just forget the whole thing.
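
Before we do forget it, here’s a cartoon of the kind of clean-up the software has to do, a toy filter that drops the filler sounds from a draft. Real post-processing is statistical rather than a hand-written pattern like this.

    # Toy illustration of stripping filler sounds from a recognized draft.
    # Real systems handle this statistically; this regex is only a cartoon.
    import re

    FILLERS = re.compile(r"\b(?:ah|uh|um|er)\b[,.]?\s*", flags=re.IGNORECASE)

    def strip_fillers(draft: str) -> str:
        return re.sub(r"\s{2,}", " ", FILLERS.sub("", draft)).strip()

    print(strip_fillers("The patient um presents with uh chest pain ah radiating to the left arm."))
    # -> "The patient presents with chest pain radiating to the left arm."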

But human ingenuity coupled with brute force can accomplish amazing things. And, believe it or not, this challenge is being met and conquered. It does require, however, an immense amount of customization of each dictator’s “acoustic model” and “language model.” The acoustic model represents how the individual dictator pronounces all the possible sounds (phonemes) that get put together in various ways to produce words. The language model lists the words the dictator uses, and it charts how the dictator puts these words together to form sentences. Actually, it uses “n-grams” to reflect the probability that a particular word will appear, based on the words around it, but we won’t get real technical here.
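
To take a little of the mystery out of the language model, here’s a tiny bigram version built from a dictator’s past reports. Production systems use far larger n-grams plus statistical smoothing, so treat this strictly as a teaching sketch.

    # A tiny bigram language model: count which word follows which in past
    # reports, then turn the counts into probabilities.
    from collections import defaultdict

    def train_bigrams(reports):
        counts = defaultdict(lambda: defaultdict(int))
        for report in reports:
            words = report.lower().split()
            for prev, nxt in zip(words, words[1:]):
                counts[prev][nxt] += 1
        return counts

    def prob(counts, prev, nxt):
        total = sum(counts[prev].values())
        return counts[prev][nxt] / total if total else 0.0

    reports = [
        "patient denies chest pain",
        "patient reports chest pain on exertion",
        "no chest wall tenderness",
    ]
    counts = train_bigrams(reports)
    print(prob(counts, "chest", "pain"))  # 2 of the 3 times "chest" appears, "pain" follows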

Anyway, the speech recognition engine combines the contributions from the dictator’s acoustic and language models to come up with its best assessment of what has been said. Building those individualized models currently requires hours of dictations and thousands of reports—for each dictator—coupled with intensive analysis, but future advances will naturally compress the magnitude of this up-front investment.
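
In standard textbook terms, the engine picks the word sequence whose acoustic score times language-model score is highest. A log-domain sketch, with scores that are entirely made up:

    # Textbook decoding rule: choose the hypothesis W that maximizes
    # P(audio | W) * P(W), i.e., acoustic score times language-model score.
    # The scores below are invented purely to show the mechanics (log domain).
    hypotheses = {
        # hypothesis: (acoustic log-prob, language-model log-prob)
        "hyper tension well controlled": (-41.0, -12.5),
        "hypertension well controlled":  (-42.0, -9.0),
    }

    def total_score(acoustic_logp, lm_logp, lm_weight=1.0):
        return acoustic_logp + lm_weight * lm_logp

    best = max(hypotheses, key=lambda w: total_score(*hypotheses[w]))
    print(best)  # the language model tips the decision toward "hypertension"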

Back-end recognition does have one big advantage: no time pressure. Because there’s no need to display the words on a screen as they’re being spoken, we can take several minutes to run a one-minute dictation through even more than one language model. We want to get the draft text as accurate as possible, so that minimal editing is required. And time is not of the essence.
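
Because nothing has to happen while the doctor is still talking, the server can keep the recognizer’s top few guesses and rescore them with a second, richer language model. A sketch of that second pass, with invented scores:

    # Offline luxury: keep several first-pass hypotheses and rescore them with
    # a second, bigger language model. All scores below are invented.

    def rescore(nbest, second_pass_lm, lm_weight=1.0):
        # nbest: list of (hypothesis, first-pass log score)
        # second_pass_lm: hypothesis -> additional log score from the bigger model
        return max(nbest, key=lambda item: item[1] + lm_weight * second_pass_lm.get(item[0], -100.0))

    nbest = [
        ("the patient was given lasix", -55.0),
        ("the patient was given latex", -54.5),
    ]
    second_pass_lm = {
        "the patient was given lasix": -8.0,
        "the patient was given latex": -14.0,
    }
    print(rescore(nbest, second_pass_lm)[0])  # the slower second pass rescues "lasix"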

We are in the embryonic days of this technology’s implementation. But the organism is beginning to take form. And it’s pretty cool looking. With impressive accuracy and those slick editing tools I mentioned, productivity can be at least doubled. And it will keep getting better until it gets asymptotic at about four times the speed of transcription, which will bring it to approximately real time. That will take 98 percent to 99 percent accuracy, which is maybe three to five years away, but eventually attainable by leveraging the high degree of individual customization that is the all-important key to this approach.

Every time a speech-recognized draft is edited, the models for that dictator are improved and the future accuracy increases a bit. After doing this for lots of dictators for a couple of years, the accuracy may get high enough so that many docs will be willing to edit their own reports. To make that palatable, the draft will have to approach the accuracy that would be provided by an MT.
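
That feedback loop looks something like this in code. Only the language-model side is sketched here; real adaptation updates the acoustic model as well.

    # Feedback loop: every corrected report updates that dictator's counts,
    # so tomorrow's draft is a little more accurate.
    from collections import defaultdict

    dictator_bigrams = defaultdict(lambda: defaultdict(int))

    def learn_from_correction(corrected_report):
        words = corrected_report.lower().split()
        for prev, nxt in zip(words, words[1:]):
            dictator_bigrams[prev][nxt] += 1

    learn_from_correction("lungs are clear to auscultation bilaterally")
    learn_from_correction("heart regular rate and rhythm no murmurs")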

Now I am well aware that this leaves the dictators hanging out there without someone to correct their crummy grammar and syntax. But given the benefits of an immediately available authenticated report, with no transcription expense, there may come a time when health care executives forcefully ask the docs to handle it themselves…if the accuracy is high enough to make this a reasonable request. That’s certainly not right around the corner, but it will surely occur early in this millennium.

In the meantime, the world is safe both for conventional transcription and for speech-recognition editing. And it’s safe for docs to keep doing what they’re doing in this arena. As Larry Weed, father of the problem-oriented medical record, says, “If physicians were in charge of airports, there would be no radar. Just intensive-care units all around the periphery.” In any case, while the demand for transcription continues to increase, the productivity advances projected in this article will enable us to keep pace with the increasing demand. It’s a brave new world that has such wonderful technology in it.

About The Author

Joe Weber is executive vice president of DVI and has more than 35 years’ experience in health care administration, research, consulting and marketing. He can be reached at joeweber@alum.mit.edu.