Vol. 15 • Issue 4 • Page 25
Physician and MT Screening for Speech Recognition
Is this the year speech recognition technology begins to live up to some of the promise that has long surrounded it?
This may well be the year of speech recognition in health care, the year this technology finally begins to live up to some of the promise that has long surrounded it. In years past, speech recognition consistently fell short of expectations, and many of the institutions that were among its earliest adopters also became some of its harshest critics. For the early adopters, the hype confronted reality and, more often than not, lost. Although speech recognition has been heralded for many years as a way of reducing the cost of transcription, only very recently have dramatic improvements in the quality of speech recognition solutions enabled these systems to live up to some of those expectations.
Now, as we enter an era in which speech recognition actually works well enough for a mass audience, it is important to realize that a gap still remains between the hype and the reality. The hospitals and medical centers that plan to implement speech recognition will need to understand this gap and navigate it if they are to use speech recognition effectively to reduce transcription costs.
Front-end vs. Back-end
In health care, two varieties of speech recognition solutions are on the market today: “front-end” and “back-end.” Front-end speech recognition solutions directly involve the end user whose speech is being recognized, typically the physician. A physician dictates into a PC, and the speech recognition engine produces text in real time as the physician speaks. The physician is able to edit the results of the speech recognition immediately and produce a final document without the involvement of an MT. Thus, front-end speech recognition holds the promise of eliminating transcription costs entirely.
Presently, however, the reality is that few physicians are suitable candidates for front-end speech recognition because it simply does not fit well into their workflow. Among the few who do find PC-based dictation feasible, many choose not to edit their own documents because they find it too time-consuming. Thus, today, the percentage of physicians who will use front-end speech recognition and edit their own documents is very small, a fact that should not be overlooked when comparing front-end speech recognition to back-end solutions.
Back-end speech recognition solutions do not require any direct physician involvement. Physicians dictate as they always have, via telephones, PCs, digital handhelds, etc. Speech recognition takes place after the dictation is complete, in batch mode on a server. The results are then passed on to an MT for editing. Back-end speech recognition holds the promise of turning MTs into editors, thereby increasing their productivity and reducing transcription costs. Presently, however, the reality is that while back-end speech recognition works and can indeed reduce transcription costs, not all physicians are suitable candidates for it, and not all MTs will become more productive as editors.
Given these realities, hospitals and medical centers that charge blindly into speech recognition with high expectations will likely come away frustrated by the experience. Successful deployments of speech recognition, on the other hand, will be those in which health care administrators recognize the need to screen physicians and MTs. For physicians, this screening will determine those suitable for front-end recognition, candidates suitable for back-end speech recognition, and those who should continue to use the traditional manual transcription model. For MTs, the screening will determine MTs who can become more productive as editors with speech recognition, and those who should continue performing traditional transcription.
Before investing real money in a speech recognition solution, ask your prospective vendor whether they will screen your physicians and provide you with an analysis of how many of them are suitable candidates for speech recognition, front-end or back-end. Not all vendors have this capability, but working with one that does can make the difference between a successful deployment of speech recognition and a costly failure.
Physician screening for speech recognition uses a physician’s existing dictations and transcribed reports to determine that physician’s suitability for speech recognition. There is no disruption in the workflow for the physician. In most cases, physicians are not even aware that they have been screened. Typically, anywhere from 2 to 5 hours of audio per physician (along with the transcribed reports corresponding to those audio files) is sufficient for the screening to take place.
With this requisite audio data for each physician in hand, screening physicians for speech recognition will yield a goldmine of data that will be key in determining the costs and benefits of speech recognition. Consider the results of a screening at one hospital, shown in Figure 1. Here, 46 physicians were screened and assigned a score from 0 to 10 that indicates their suitability for back-end speech recognition. A higher score corresponds to higher speech recognition accuracy and means that the physician is a better candidate for speech recognition.
For back-end speech recognition, how these scores translate into actual speech recognition accuracy is less important than how a specific score translates into productivity when MTs become editors and begin editing the results of speech recognition. In the case study of Figure 1, we ranked physicians according to their score into three categories: (1) black: excellent candidates for back-end speech recognition, for whom there will be clear cost savings; (2) gray: average candidates for back-end speech recognition, for whom there may be a potential for cost savings; and (3) white: poor candidates for back-end speech recognition, for whom there are unlikely to be any cost savings. All of these categories are dynamic: given the proper incentives and the proper feedback from the system, physicians can be taught to modify their dictation styles to become better candidates for speech recognition. Consequently, while this screening process shows the potential for cost savings in the absence of any behavior modification on the part of physicians, a feedback loop that tells the poor or average candidates how to alter their dictation styles can propel many of those in the lower categories into the higher ones and thus increase the potential for cost savings.
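The three-way ranking described above can be sketched in a few lines of Python. This is only an illustration: the article does not state where the boundaries between the black, gray, and white categories fall, so the cutoffs below (7 and 4) and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical cutoffs: the article does not say where the category
# boundaries fall on the 0-10 scale.
EXCELLENT_MIN = 7.0
AVERAGE_MIN = 4.0


def categorize(score: float) -> str:
    """Map a 0-10 back-end suitability score to black, gray, or white."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= EXCELLENT_MIN:
        return "black"  # excellent candidate: clear cost savings expected
    if score >= AVERAGE_MIN:
        return "gray"   # average candidate: cost savings possible
    return "white"      # poor candidate: cost savings unlikely


def group_by_category(scores: dict[str, float]) -> dict[str, list[str]]:
    """Group physician identifiers by the category of their score."""
    groups = defaultdict(list)
    for physician, score in scores.items():
        groups[categorize(score)].append(physician)
    return dict(groups)
```

Because the categories are dynamic, a physician whose score rises after feedback on dictation style would simply land in a higher group the next time the scores are run through `group_by_category`.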
For front-end speech recognition, using the same audio and text data, candidates are ranked according to factors that differ from those used for back-end speech recognition. While the primary concern with back-end speech recognition is the anticipated productivity gain of turning an MT into an editor, the primary concern with front-end speech recognition is how much a physician would have to adapt his or her dictation style to achieve very high accuracy. In practice, only some of the best candidates for back-end speech recognition will make reasonably good candidates for front-end speech recognition once this data is combined with two subjective factors: a) is the physician willing to adjust the present workflow to adapt to front-end speech recognition, and b) is the physician willing to edit his or her own work? If the answer to either of these questions is no, then this physician is best suited for back-end and not front-end speech recognition.
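One way to picture how the screening score and the two subjective questions combine is the routing sketch below. It is a simplification under stated assumptions: the back-end score stands in for the separate front-end ranking factors (which the article does not enumerate), and the thresholds, class, and function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds on the 0-10 screening score.
FRONT_END_MIN = 8.0  # only the best back-end candidates are considered
BACK_END_MIN = 4.0   # below this, stay with traditional transcription


@dataclass
class PhysicianProfile:
    backend_score: float        # 0-10 score from the screening
    will_change_workflow: bool  # question (a): willing to adjust workflow?
    will_self_edit: bool        # question (b): willing to edit own work?


def recommend_workflow(p: PhysicianProfile) -> str:
    """Return 'front-end', 'back-end', or 'traditional' for one physician."""
    if (
        p.backend_score >= FRONT_END_MIN
        and p.will_change_workflow
        and p.will_self_edit
    ):
        return "front-end"
    # A "no" to either question sends an otherwise strong candidate to back-end.
    if p.backend_score >= BACK_END_MIN:
        return "back-end"
    return "traditional"
```

Under these assumed thresholds, a physician scoring 9 who declines to self-edit is routed to back-end recognition, which matches the rule stated above.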
Screening MTs for speech recognition is typically overlooked as a necessary ingredient in a successful speech recognition deployment. For MTs, a screening process usually requires a 3- to 4-hour session in which MTs are taught the basics of editing speech recognition output and measured for their acuity as editors. Scores can be assigned and MTs analyzed for their potential as editors in much the same way as the scores for physicians can be used to determine which physicians’ dictations should be sent through the speech recognition workflow. Subjective factors may also play a role here: some MTs will resist change and prefer to continue working according to the traditional model.
The key point in all of this is that not all physicians’ dictations are suited for speech recognition and not all MTs are suited for editing speech recognition output. Even the most successful deployments of speech recognition will continue to require some portion of the work to go through conventional transcription. Consequently, it is not necessary that every MT be turned into an editor and show clear productivity gains. What is critical is identifying the portion of dictation volume in your institution that should go through speech recognition, the MTs who should become part of the new speech recognition workflow, and the MTs who should remain part of the conventional transcription workflow. And all of this should be done before you purchase a speech recognition solution, not after. This will ensure that your total spend on speech recognition is consistent with the projected savings. Institutions that work with vendors to perform this physician and MT screening in advance will be the clear winners in the new era of speech recognition in health care. The rest will find themselves sinking a lot of money into a technology that creates more frustration than relief and repeats the mistakes of speech recognition deployments in years past.
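As a rough illustration of checking spend against projected savings, the sketch below estimates monthly savings from the dictation volume routed through speech recognition and the time it would take to recoup the purchase. The cost model (a per-line transcription cost versus a lower per-line editing cost) and every name and figure in it are assumptions made for the example, not data from any screening.

```python
from typing import Optional


def projected_monthly_savings(
    routed_lines: float,
    transcription_cost_per_line: float,
    editing_cost_per_line: float,
) -> float:
    """Savings on the volume routed through speech recognition.

    Assumes (hypothetically) that cost is charged per line and that
    editing recognized text costs less per line than typing it.
    """
    return routed_lines * (transcription_cost_per_line - editing_cost_per_line)


def payback_months(total_spend: float, monthly_savings: float) -> Optional[float]:
    """Months needed to recoup the spend, or None if there are no savings."""
    if monthly_savings <= 0:
        return None
    return total_spend / monthly_savings


# Illustrative figures only.
savings = projected_monthly_savings(
    routed_lines=100_000,
    transcription_cost_per_line=0.12,
    editing_cost_per_line=0.08,
)
months = payback_months(total_spend=40_000, monthly_savings=savings)
```

Only the volume from physicians and MTs who passed screening belongs in `routed_lines`; counting the institution's entire dictation volume would overstate the savings.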
Dr. Harjinder Sandhu is the chief technology officer and co-founder of MedRemote, a technology company that provides solutions for speech recognition as well as traditional dictation and transcription systems.