Jacqueline Ntaka, [email protected]
ARTIFICIAL intelligence has quickly become a staple of modern information seeking, yet the reliability of large language models (LLMs) such as ChatGPT remains an ongoing concern. Despite their sophistication, these systems still produce errors — sometimes subtle, sometimes glaring.
As their influence spreads across journalism, education, health, and research, understanding their limitations is essential for responsible use.
One of the most documented issues is the phenomenon of “hallucination”, where the model confidently generates inaccurate or entirely fabricated information. OpenAI acknowledges that hallucinations remain a fundamental challenge for all language models, noting that standard training approaches encourage guessing rather than expressing uncertainty.
Their analysis highlights cases where chatbots invent academic details — such as incorrect dissertation titles or birthdays — while presenting them with authority. This tendency stems partly from evaluation methods that penalise a model for saying “I don’t know”, rewarding confident speculation instead.
Recent research indicates that these inaccuracies are not diminishing as quickly as expected. In fact, some newer models exhibit higher hallucination rates than older versions. OpenAI's internal tests revealed that its o4-mini model hallucinated up to 79 percent of the time on certain factual tasks, a significantly higher rate than earlier models. This raises important questions about the balance between more “advanced” reasoning capabilities and factual reliability. As models become more complex, their errors can also become more convincing, making them harder for users to detect.
Studies in specialised domains further illuminate the issue. Medical research published in 2025 found that while ChatGPT performed well at identifying diseases, drugs, and genetic information, it struggled with symptom identification, scoring as low as 49 percent. Similarly, a comparative study on LLMs used in systematic reviews showed extremely low precision rates when generating academic references, with GPT-3.5 scoring 9,4 percent and GPT-4 only slightly higher at 13,4 percent. These findings highlight that even where models sound credible, users cannot assume accuracy, particularly in technical or evidence-based fields.
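To make the precision figures concrete, here is a minimal Python sketch. It assumes precision means the share of model-generated references that match a real, verifiable publication. The function name and the reference identifiers are made up for illustration and are not data from the study.

```python
def reference_precision(generated, verified):
    """Return the fraction of generated references found in the verified set."""
    if not generated:
        raise ValueError("no references were generated")
    verified_set = set(verified)
    matches = sum(1 for ref in generated if ref in verified_set)
    return matches / len(generated)


# Hypothetical example: a model produces 20 references, 3 of which are real.
generated_refs = [f"ref-{i}" for i in range(20)]
real_refs = ["ref-0", "ref-7", "ref-13"]

precision = reference_precision(generated_refs, real_refs)
print(f"Precision: {precision:.0%}")
```

On this reading, a precision of 13,4 percent would mean that roughly 13 out of every 100 references the model produced could be matched to a real publication.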
Given these challenges, the question becomes: how can users effectively work around ChatGPT’s inaccuracies? The first strategy is verification. Treat AI-generated information as a starting point, not an authoritative source. Cross-check factual statements against trusted references, especially when dealing with health, law, finance, or academic material. The model’s confident tone should not be mistaken for correctness, as multiple studies have shown that confidence and accuracy are not reliably correlated.
The second strategy is to ask the model for sources — but verify those sources independently. ChatGPT may fabricate citations, so users should confirm that referenced studies, articles, or books truly exist. This step is essential in academic and journalistic work where integrity of evidence is paramount.
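For readers who handle many citations at once, the checking step can be partly organised in code. The Python sketch below is an illustration only: it assumes you already hold a list of titles confirmed in a trusted source such as a library catalogue, and the function names and placeholder titles are hypothetical. Anything it marks as unconfirmed still needs a manual search before it is used or discarded.

```python
def normalise(title):
    """Lower-case a title and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def triage_citations(cited_titles, trusted_titles):
    """Split AI-supplied titles into those found in a trusted list and those not found."""
    trusted = {normalise(t) for t in trusted_titles}
    confirmed, unconfirmed = [], []
    for title in cited_titles:
        if normalise(title) in trusted:
            confirmed.append(title)
        else:
            unconfirmed.append(title)
    return confirmed, unconfirmed


# Placeholder titles, not real publications.
from_chatbot = ["Example Study A", "example  study b", "Example Study C"]
from_catalogue = ["Example Study A", "Example Study B"]

confirmed, unconfirmed = triage_citations(from_chatbot, from_catalogue)
print("Confirmed:", confirmed)
print("Check by hand:", unconfirmed)
```

An exact title match is a deliberately strict test. A title missing from the trusted list is not proof of fabrication, only a signal that the source has not yet been confirmed.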
A third tactic is to break complex questions into smaller, simpler components. Research indicates that LLMs perform better when the ambiguity of user queries is reduced. Asking targeted, specific questions reduces the scope for hallucination and increases the likelihood of receiving a factual response.
Finally, users should not hesitate to ask the model to express uncertainty. When prompted directly, for example with “How confident are you in this answer?”, the model often adjusts its tone and may highlight areas where verification is advisable. This aligns with emerging best-practice guidelines that call for AI systems to reflect uncertainty rather than default to confident assertion.
Jacqueline Ntaka is the CEO of Mviyo Technologies, a local tech company that provides custom software development, mobile applications and data analytics solutions. She can be contacted on [email protected]



