Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said

Tech behemoth OpenAI has touted its artificial intelligence-powered transcription tool Whisper as having near “human level robustness and accuracy.”
But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.
Experts said such fabrications are problematic because Whisper is being used in a slew of industries worldwide to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos.
More concerning, they said, is a rush by medical centers to use Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk domains.”

A computer screen displays text produced by an artificial intelligence-powered transcription program called Whisper at Cornell University on Feb. 2 in Ithaca, N.Y. In this example, the speaker said, "and after she got the telephone he began to pray" while the program transcribes that as "I feel like I’m going to fall. I feel like I’m going to fall, I feel like I’m going to fall…." 

Seth Wenig, Associated Press

The full extent of the problem is difficult to discern, but researchers and engineers said they frequently have come across Whisper’s hallucinations in their work. A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in eight out of every 10 audio transcriptions he inspected, before he started trying to improve the model.
A machine learning engineer said he initially discovered hallucinations in about half of the over 100 hours of Whisper transcriptions he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.
The problems persist even in well-recorded, short audio samples. A recent study by computer scientists uncovered 187 hallucinations in more than 13,000 clear audio snippets they examined.
That trend would lead to tens of thousands of faulty transcriptions over millions of recordings, researchers said.
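That extrapolation is straightforward back-of-the-envelope arithmetic. A minimal sketch, treating the study's 187-in-13,000 figure as a per-snippet rate (an upper bound, since the researchers examined "more than" 13,000 snippets):

```python
# Observed in the study: 187 hallucinations across roughly 13,000 clear snippets
hallucinations_found = 187
snippets_examined = 13_000  # "more than 13,000", so this rate is an upper bound

rate = hallucinations_found / snippets_examined  # roughly 1.4% of snippets

# Extrapolated to a large volume of recordings, per the researchers' claim
recordings = 1_000_000
expected_faulty = rate * recordings

print(f"{rate:.2%} per snippet -> ~{expected_faulty:,.0f} faulty transcriptions per million recordings")
```

At roughly 1.4% per snippet, a million recordings would indeed yield faulty transcriptions in the tens of thousands, consistent with the researchers' estimate.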
Such mistakes could have “really grave consequences,” particularly in hospital settings, said Alondra Nelson, who led the White House Office of Science and Technology Policy for the Biden administration until last year.
“Nobody wants a misdiagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey. “There should be a higher bar.”
Whisper also is used to create closed captioning for people who are deaf and hard of hearing — a population at particular risk for faulty transcriptions. That’s because they have no way of identifying fabrications “hidden amongst all this other text,” said Christian Vogler, who is deaf and directs Gallaudet University’s Technology Access Program.

OpenAI urged to address problem

The prevalence of such hallucinations has led experts, advocates and former OpenAI employees to call for the federal government to consider AI regulations. At minimum, they said, OpenAI needs to address the flaw.
“This seems solvable if the company is willing to prioritize it,” said William Saunders, a San Francisco-based research engineer who quit OpenAI in February over concerns with the company’s direction. “It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.”
An OpenAI spokesperson said the company continually studies how to reduce hallucinations and appreciated the researchers’ findings, adding that OpenAI incorporates feedback in model updates.
While most developers assume that transcription tools misspell words or make other errors, engineers and researchers said they had never seen another AI-powered transcription tool hallucinate as much as Whisper.

Assistant professor of information science Allison Koenecke, an author of a recent study that found hallucinations in a speech-to-text transcription tool, works in her office at Cornell University on Feb. 2 in Ithaca, N.Y.

Seth Wenig, Associated Press

Whisper hallucinations

The tool is integrated into some versions of OpenAI’s flagship chatbot ChatGPT, and is a built-in offering in Oracle’s and Microsoft’s cloud computing platforms, which serve thousands of companies worldwide. It is also used to transcribe and translate text into multiple languages.
In the past month alone, one recent version of Whisper was downloaded over 4.2 million times from open-source AI platform HuggingFace. Sanchit Gandhi, a machine-learning engineer there, said Whisper is the most popular open-source speech recognition model and is built into everything from call centers to voice assistants.
Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short snippets they obtained from TalkBank, a research repository hosted at Carnegie Mellon University. They determined that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.

In an example they uncovered, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
But the transcription software added: “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”
A speaker in another recording described “two other girls and one lady.” Whisper invented extra commentary on race, adding "two other girls and one lady, um, which were Black.”
In a third transcription, Whisper invented a non-existent medication called “hyperactivated antibiotics.”
Researchers aren’t certain why Whisper and similar tools hallucinate, but software developers said the fabrications tend to occur amid pauses, background sounds or music playing.
In its online disclosures, OpenAI recommends against using Whisper in “decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes.”

Transcribing doctor appointments

That warning hasn’t stopped hospitals and medical centers from using speech-to-text models, including Whisper, to transcribe what’s said during doctor’s visits so medical providers can spend less time on note-taking or report writing.
Over 30,000 clinicians and 40 health systems, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, have started using a Whisper-based tool built by Nabla, which has offices in France and the U.S.
That tool was fine-tuned on medical language to transcribe and summarize patients’ interactions, said Nabla’s chief technology officer Martin Raison.
Company officials said they are aware that Whisper can hallucinate and are addressing the problem.
It’s impossible to compare Nabla’s AI-generated transcript to the original recording because Nabla’s tool erases the original audio for “data safety reasons,” Raison said.
Nabla said the tool has been used to transcribe an estimated 7 million medical visits.
Saunders, the former OpenAI engineer, said erasing the original audio could be worrisome if transcripts aren’t double-checked or clinicians can’t access the recording to verify the transcripts are correct.
“You can’t catch errors if you take away the ground truth,” he said.
Nabla said that no model is perfect, and that theirs currently requires medical providers to quickly edit and approve transcribed notes, but that could change.

A computer screen displays text produced by an artificial intelligence-powered transcription program called Whisper at Cornell University on Feb. 2 in Ithaca, N.Y. In this example, the speaker said, "as the um, the, her father dies not too long after he remarried…." while the program transcribes that as "It’s fine. It’s just too sensitive to tell. She does die at 65…." 

Seth Wenig, Associated Press

Privacy concerns

Because patients’ meetings with their doctors are confidential, it is hard to know how AI-generated transcripts are affecting patients.
A California state lawmaker, Rebecca Bauer-Kahan, said she took one of her children to the doctor earlier this year and refused to sign a form from the health network, John Muir Health, that sought her permission to share the consultation audio with vendors including Microsoft Azure, the cloud computing system run by OpenAI’s largest investor. Bauer-Kahan said she didn’t want such intimate medical conversations shared with tech companies.
“The release was very specific that for-profit companies would have the right to have this,” said Bauer-Kahan, a Democrat who represents part of the San Francisco suburbs in the state Assembly. “I was like ‘absolutely not.’”
John Muir Health spokesman Ben Drew said the health system complies with state and federal privacy laws.

Meet the riskiest AI models ranked by researchers

Generative AI has become integral for businesses and researchers adopting the latest technologies. But these advances carry high risks: a company using a risky AI model could ship critical factual errors or, worse, violate the law and industry standards.
The Riskiest AI Model, According to Researchers

The research shows DBRX Instruct, a Databricks product, consistently performed the worst by every metric, TeamAI reports. For example, AIR-Bench scrutinized each AI model’s safety refusal rate. Claude 3 posted an 89% rate, meaning it refused the great majority of risky user instructions. Conversely, DBRX Instruct refused only 15% of hazardous inputs, leaving it far more likely to generate harmful content.
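To put those two benchmark numbers side by side: refusal and compliance are complements, so an 89% refusal rate implies compliance with roughly 11% of risky prompts, versus about 85% for a 15% refusal rate. A quick illustrative calculation, using the two rates quoted above:

```python
# Safety refusal rates reported by AIR-Bench for two models
refusal_rate = {"Claude 3": 0.89, "DBRX Instruct": 0.15}

# Compliance with hazardous inputs is the complement of refusal
compliance_rate = {model: round(1 - rate, 2) for model, rate in refusal_rate.items()}

# How many times more often DBRX Instruct complied with risky instructions
ratio = compliance_rate["DBRX Instruct"] / compliance_rate["Claude 3"]
print(compliance_rate)
print(f"DBRX Instruct complied roughly {ratio:.1f}x as often as Claude 3")
```

By this measure, DBRX Instruct complied with hazardous inputs several times as often as the benchmark's best performer.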
AIR-Bench 2024 is among the most comprehensive AI breakdowns because it shows the strengths and weaknesses of each platform. The benchmark, created by University of Chicago professor Bo Li and other experts, used numerous adversarial prompts to push each AI model to its limits and probe its risks.
The Risks of DBRX Instruct and Other Models

DBRX Instruct can be helpful, but it can cause trouble because it accepts instructions that other models refuse. Claude 3 and Gemini demonstrated better safety refusal protocols, but they remain vulnerable to dangerous content.
One category DBRX Instruct struggled with was unlawful and criminal activities: it accepted most requests despite leaders adjusting their algorithms to deny them. Some of these prompts included inappropriate or illegal content, such as non-consensual nudity, violent acts and violations of specific rights. Claude 3 Sonnet and Gemini 1.5 Pro often denied prompts about prohibited or regulated substances, but DBRX Instruct was far less likely to catch them.
The Least-Refused Category

One of the least-refused categories for all large language models, or LLMs (not just DBRX Instruct), was advice in regulated sectors. Some generative AI models showed no resistance to these prompts, while others offered minimal denials. Improvement here is critical for the future of LLMs because of how often people use them for business.
The last thing users want is inaccurate information influencing decisions, especially when lives could be at stake. Researchers found low denial when users requested advice about government services and the legal field. While some platforms denied accounting advice, most struggled with the medical, pharmaceutical, and financial sectors.
Other poor performers in regulated industry advice included Llama 3 Instruct 70B and 8B. The 70B-parameter model scored 0.0 across all categories, making it the second-lowest model tested. The 8B model’s best denial rate was 0.2, in accounting, though it scored 0.0 in three other categories. While its industry advice score was low, Llama 3 Instruct 8B was one of the best models at denying political persuasion prompts.
Another generative AI model with poor performance was Cohere Command R. Its base and plus models were in the bottom six for industry advice denial, with Cohere Command R Plus scoring 0.0 in all five categories. Cohere’s products were also among the bottom four LLMs for political persuasion and automated decision-making, where it earned its best score of 0.14.


Hate Speech Complications

Regulating hate speech is one of the priorities for generative AI. Without stringent guidelines, users could create harmful content that targets people based on demographics or beliefs.
Across all platforms, AIR-Bench found low refusal rates for occupation, gender, and genetic information. The platforms more easily produced hate speech about these topics, which alarmed researchers. Additionally, the models struggled to refuse requests regarding eating disorders, self-harm, and other sensitive issues. Therefore, despite their progress, these generative AI models still have room for improvement.
While AIR-Bench highlighted liabilities, there are also bright spots among generative AI models. The study found most platforms often refused to generate hate speech based on age or beliefs. DBRX Instruct struggled to deny prompts about nationality, pregnancy status or gender, but was more consistent on mental characteristics and personality.
Why Are AI Models So Risky?

The problem lies partly in AI’s capabilities and in how much developers restrict their use. If they limit chatbots too much, people could find them less user-friendly and move to other websites. Fewer limitations mean a generative AI platform is more prone to manipulation, hate speech, and legal and ethical liabilities.
Statista research projects that generative AI will reach a market volume of $356 billion by 2030. With that rise, developers must be more careful and help users avoid legal and ethical issues. Further research into AI risks sheds light on the technology’s vulnerabilities.


Quantifying AI Risks

Massachusetts Institute of Technology researchers catalogued over 700 potential risks AI systems could pose. This comprehensive database reveals more about the challenges associated with the most popular AI models.
For instance, the researchers found that 76% of documents in the AI risk database concerned system safety, failures and limitations. Fifty-nine percent cited AI lacking sufficient capability or robustness, meaning systems might fail at specific tasks or fall short of standards under adverse conditions. Another significant risk, described in 46% of the database’s documents, came from AI pursuing its own goals in conflict with human values.
Who should take the blame? MIT researchers attributed about 51% of these risks to AI itself, while humans shoulder about 34%. With risks this high, developers must be thorough when searching for liabilities. Yet MIT found startling statistics on timing: experts caught only 10% of dangers before the models were deployed, and over 65% of risks were not identified until developers had trained and released the AI.
Another critical aspect to review is intent. Did the developers expect a specific outcome when training the model, or did something arise unexpectedly? The MIT study found a mixed bag: unintentional risks occurred 37% of the time, intentional risks about 35%, and 27% of risks had no clear intentionality.
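As a quick sanity check on those intentionality figures, the three buckets account for essentially all risks in the database, with the small shortfall from 100% attributable to rounding in the reported numbers. A small sketch:

```python
# Shares of risks by intentionality, as reported in the MIT risk database study
intentionality_pct = {
    "unintentional": 37,
    "intentional": 35,   # reported as "about 35%"
    "unclear": 27,
}

total = sum(intentionality_pct.values())
print(f"Buckets account for {total}% of risks; the gap from 100% reflects rounding")
```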
How Can Businesses Navigate the Risks of AI?

Generative AI models should improve over time and shed their bugs. Still, the risks are too significant for companies to ignore. So, how can businesses navigate the dangers of AI?
First, don’t commit to one platform. The best generative AI models change frequently, so it’s hard to predict who will be on top next month. For instance, Google reportedly paid Reddit $60 million for data to train Gemini, giving the model a more human element. Others, such as Claude and ChatGPT, could make similar improvements.
Brands should use multiple AI models to select the best one for the job. Some websites allow the use of Gemini Pro, LLaMA, and the other top generative AI systems in one place. With such access, users can mitigate risk because some AI could have dangerous biases or inaccuracies. Compare the results from various models to see which one is the best.
Another strategy for navigating the AI landscape is to train employees. While chatbots have inherent risks, businesses must also account for human error. The models work from the text they are given, so inaccurate input can mislead an AI and produce poor results. Staff should also understand the limitations of generative AI and not rely on it uncritically.
Errors are a realistic but fixable problem with AI. MIT experts say humans can review outputs and improve quality if they know how to isolate these issues. The institution implemented a layer of friction that highlights errors within AI-generated content; with this tool, subjects labeled errors, encouraging closer scrutiny. The researchers found that a control group working without highlighting missed more errors than the other teams.
While the extra review time could be a tradeoff, the MIT-led study found the added time was not statistically significant. After all, accuracy and ethics are king, so put them at the forefront of AI usage.


Wisely Using AI and Mitigating Risk

Generative AI has a bright future as developers find improvements. However, the technology is still in its infancy. ChatGPT, Gemini, and other platforms have reduced risks through consistent updates, but their liabilities remain.
It’s essential to use AI wisely and control as much of the risk as possible. Legal compliance, platform diversification, and governance programs are only a few ways enterprises can protect users.
This story was produced by TeamAI and reviewed and distributed by Stacker Media.


This story was produced in partnership with the Pulitzer Center’s AI Accountability Network, which also partially supported the academic Whisper study. AP also receives financial assistance from the Omidyar Network to support coverage of artificial intelligence and its impact on society.


