AI Chatbots vs. Dentists: A Dental Education Study
Source: bmcmededuc.biomedcentral.com
AI Chatbots in Dental Education
This study evaluated how seven AI chatbots (ChatGPT-4, ChatGPT-3.5, ChatGPT 01-Preview, ChatGPT 01-Mini, Microsoft Bing, Claude, and Google Gemini) performed on multiple-choice questions about prosthetic dentistry from the Turkish Dental Specialization Mock Exam (DUSDATA TR). The study also explored whether these chatbots could answer with accuracy similar to that of general practitioners.
Ten multiple-choice questions on prosthetic dentistry were taken from a private educational institution's preparatory exam. Two groups were created: general practitioners (Human Group, N=657) and the AI chatbots. Each question was manually entered into the chatbots, and their answers were noted. Correct answers were marked as “1,” and incorrect ones as “0.” The consistency and accuracy of chatbot answers were analyzed using Fisher’s exact test and Cochran’s Q test. Statistical significance was set at p<0.05.
A statistically significant difference was observed in the accuracy rates of chatbot answers (p<0.05). ChatGPT-3.5, ChatGPT-4, and Google Gemini failed to correctly answer questions 2, 5, 7, 8, and 9, whereas Microsoft Bing missed questions 5, 7, 8, and 10. Notably, none of the chatbots correctly answered question 7. General practitioners demonstrated higher accuracy than most chatbots on question 10 (80.3%) and question 9 (44.4%). Chatbot answers were consistent over time despite variations in accuracy (p>0.05); however, Bing gave the most incorrect answers.
AI in Dentistry
The study indicates that AI chatbot performance varies significantly and is inconsistent across exam questions related to prosthetic dentistry; therefore, further improvement is needed before these tools are implemented in dental education. Artificial intelligence (AI) technology and its dental applications have grown significantly in recent years. Natural Language Processing (NLP) and machine learning are widely used in dentistry as AI advances. Large Language Models (LLMs) are among NLP's most notable developments. LLMs are sophisticated deep-learning models trained on large datasets to predict linguistic relationships and generate context-aware answers.
AI is rapidly entering healthcare and is becoming more important in dentistry because it can improve accuracy and efficiency. While still in its early stages, it is a vital part of modern life, influencing nearly every sector. Researchers are incorporating AI at all levels in medicine to improve patient care. Each dental specialty can benefit from AI through better care, diagnosis, and time savings in clinical and administrative tasks. More research is needed in dental education, as the accuracy and reliability of current AI systems have yet to be definitively proven.
Chatbot Models
Several LLM-based programs function as AI-driven chatbot models, including ChatGPT-4, ChatGPT 01-Preview, ChatGPT 01-Mini, Microsoft Bing, Claude 3.5 Sonnet, and Google Gemini. These chatbots are designed to mimic human-like text conversation, improving user interaction through clear and natural answers. OpenAI's ChatGPT is based on Generative Pretrained Transformer (GPT) models, specifically GPT-3.5 and GPT-4. These models use reinforcement learning from human feedback and a transformer architecture to improve response quality.
GPT-4, released by OpenAI, is reportedly more reliable and creative and handles complex commands better than its predecessors. Similarly, Google's Gemini operates as a conversational AI model for interactive applications. Microsoft launched its Bing Chat AI chatbot, which uses the GPT-4 language model to improve user engagement and information retrieval. ChatGPT is one of the most widely used chatbot models. Claude AI, developed by Anthropic PBC, has gained a growing base of monthly active users and claims to offer greater precision than ChatGPT. These AI chatbots facilitate discussions and provide information on a wide range of subjects, including healthcare.
Study Objectives
AI-powered chatbot models allow users and AI programs to exchange questions and answers interactively. While studies have shown the potential of these systems for education, their overall performance is still uncertain. There is limited research in dentistry on AI chatbots, particularly regarding how accurately and reliably they answer dental specialization exam questions. This study evaluated AI chatbot performance in prosthetic dentistry using ten questions from a simulated Turkish Dental Specialization Exam.
This exam was selected because of its comprehensive structure, which ensures general practitioners have the knowledge and skills needed to practice safely and effectively in their field. Thus, this study aimed to determine if different AI chatbots are as competent as general practitioners and to compare their performance. Because only a few studies have addressed this topic, the findings should fill a significant gap regarding the reliability of AI technologies in educational assessments. The study's null hypothesis was that different AI chatbots could answer ten multiple-choice questions about prosthetic dentistry from the Dental Specialization Exam with the same level of accuracy as general practitioners, with no significant differences between the models.
Methodology
This study analyzed multiple-choice questions from DUSDATA TR, a private institution in Türkiye that provides preparatory training for the Dental Specialization Examination (DUS). Because DUSDATA TR operates within a closed system, its questions are not publicly available. The DUS includes ten prosthetic dentistry questions. Questions prepared and administered by the private preparatory course were reviewed, and items containing figures were excluded so that the selected questions primarily assessed knowledge and comprehension according to Bloom's taxonomy.
From the remaining items, the knowledge-level questions answered by the largest number of participants were selected. The participants included 657 general dentists who took the DUS exam between 2020 and 2023. Anonymized DUS examination results from DUSDATA TR provided the human group's question-based success rates. No personal, demographic, or identifiable information was collected. Only anonymized response data were recorded in an electronic database. The relevant ethics committee confirmed that informed consent was unnecessary because the study involved no intervention and had no consequences for incorrect answers. All procedures followed the ethical principles of the Declaration of Helsinki.
Two main groups were involved: the Human Group (N=657), consisting of practicing general dentists who had taken the DUS, and the AI chatbot group, which included ChatGPT-4, ChatGPT-3.5, ChatGPT 01-Preview, ChatGPT 01-Mini, Microsoft Bing, Claude, and Google Gemini. No data were available on the dentists' educational background or clinical experience. Responses were coded “1” (correct) or “0” (incorrect) and recorded anonymously. Each question was manually entered into the AI chatbot models without additional training data, prompt engineering, or contextual guidance. To ensure a standardized evaluation, all chatbot interactions were conducted under identical conditions using the same question wording.
Prior studies indicate that large language models, especially ChatGPT, may produce inconsistent results when asked the same question at different times or in repeated trials. Furthermore, models like Google's chatbot have been observed to generate multiple response drafts for one query, which could cause variability. Each chatbot was given the entire set of questions three times at different intervals to assess temporal consistency. The outputs from each round were recorded, and consistency was assessed by comparing the three sessions. This repeated measures approach helped identify stable versus variable response patterns within each model.
All questions were presented three times to maintain consistency and minimize variability. Each chatbot received the questions in order, and the chat window was refreshed before each entry so that previous responses could not influence the next. Generated answers were copied into a separate spreadsheet for analysis. Descriptive statistics, including frequencies and percentages, were calculated, and data analysis was conducted using IBM SPSS Statistics. Fisher's exact test was used to compare the chatbot models' ability to answer the multiple-choice questions, and Cochran's Q test was used to analyze the repeatability of the categorical variables. All statistical tests were performed at a 95% confidence level, and p-values less than 0.05 were considered statistically significant.
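As an illustration only, the following Python sketch shows how the study's binary coding and per-question accuracy comparison could be reproduced outside SPSS; the counts in the table are hypothetical placeholders rather than values from the study. SciPy's fisher_exact handles the 2x2 case shown here (one chatbot versus the human group on a single question); comparing all groups at once would require an extended variant such as the Fisher-Freeman-Halton test for larger tables.

```python
# Minimal sketch (not the study's SPSS workflow): Fisher's exact test on a
# hypothetical 2x2 table for one question, comparing one chatbot's repeated
# answers with the human group. All counts are illustrative placeholders.
from scipy.stats import fisher_exact

# Rows: group (chatbot, humans); columns: (correct, incorrect), aggregating
# the study's 1/0 coding into counts.
table = [
    [8, 1],      # hypothetical chatbot trials: 8 correct, 1 incorrect
    [382, 275],  # hypothetical split of the 657 general dentists
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.4f}")

# p < 0.05 would indicate a statistically significant difference in accuracy
# between the chatbot and the human group on this question.
```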
Results
The distribution of responses from the AI chatbots and general dentists is presented in Table 1. There was a statistically significant difference in the accuracy of chatbot responses to each question (p<0.05). Table 2 presents the distributions of correct and incorrect answers from the AI chatbots and general dentists across the ten multiple-choice questions; for every question, accuracy differed significantly between the chatbots and the human practitioners (p<0.05), indicating that response performance varied between the groups.
Question 1 (p<0.001): All chatbot models except Microsoft Bing and Google Gemini answered correctly with 100% accuracy, whereas only 58.2% of human participants chose the right answer, so most AI models outperformed humans on this question.
Question 2 (p<0.001): All major chatbot models except ChatGPT 01-Preview and 01-Mini failed to answer correctly, with ChatGPT-3.5, ChatGPT-4, and Google Gemini scoring 0%. Human respondents also struggled, achieving only 45.2% accuracy; the difference between the groups remained statistically significant.
Question 3 (p<0.001): ChatGPT-3.5 and ChatGPT 01-Preview achieved perfect scores (100%), and ChatGPT-4 also performed well (88.9%). Humans scored lower (45.8%), showing that some chatbot models performed significantly better than general dentists.
Question 4 (p=0.001): Most chatbot models, including ChatGPT-3.5, Claude 3.5 Sonnet, and ChatGPT 01-Preview, gave 100% correct answers. Human participants performed better here (76.1%) than on earlier questions, but the AI models still had the edge, with a statistically significant difference in accuracy.
Question 5 (p<0.001): This question challenged most chatbots: ChatGPT-3.5, ChatGPT-4, Microsoft Bing, Google Gemini, and Claude 3.5 Sonnet all failed (0–11.1% accuracy), while ChatGPT 01-Mini and ChatGPT 01-Preview fared better (100% and 66.7%, respectively). Human respondents were nearly evenly split (49.1% correct). The wide variability in performance produced a statistically significant difference between the groups.
Question 6 (p<0.001): ChatGPT-3.5 and ChatGPT 01-Preview performed flawlessly (100% correct), while Claude 3.5 Sonnet and ChatGPT 01-Mini failed completely. Human performance was moderate (51.9% accuracy), and the variance in AI performance compared to humans yielded a significant result.
Question 7 (p<0.001): All chatbot models failed this question entirely (0% correct), while humans performed better at 56.1% accuracy, highlighting a critical weakness shared by every evaluated chatbot.
Question 8 (p<0.001): Most chatbot models failed; only Claude 3.5 Sonnet (11.1%), ChatGPT 01-Mini (66.7%), and ChatGPT 01-Preview (100%) gave correct answers. Humans outperformed most models with 58.8% accuracy, and the significant p-value reflects these substantial performance differences.
Question 9 (p<0.001): Similar to Question 8, nearly all chatbot models failed except Claude 3.5 Sonnet, which achieved 88.9% accuracy. Human performance was relatively low at 44.4%, but the difference in distribution still reached statistical significance.
Question 10 (p<0.001): All chatbot models except Microsoft Bing performed well, with several achieving 100% accuracy. Human performance peaked here at 80.3%, the highest across all questions. Despite high accuracy on both sides, the difference in proportions remained statistically significant, driven by Bing’s poor performance and other group differences.
All ten questions showed statistically significant differences between chatbot and human responses (p<0.05), with variability across models and items. Some chatbot models (newer versions such as ChatGPT 01-Preview and Claude 3.5 Sonnet) occasionally outperformed humans, but none were consistently superior. Question 7 revealed a systematic failure across all chatbot systems, while human practitioners maintained stable performance, especially on clinical reasoning-based questions. ChatGPT-3.5, ChatGPT-4, and Google Gemini failed to answer Questions 2, 5, 7, 8, and 9 correctly, and Microsoft Bing failed to correctly answer Questions 5, 7, 8, and 10. No chatbot answered Question 7 correctly. General dentists demonstrated higher accuracy than most chatbots on Questions 10 (80.3%) and 9 (44.4%).
Cochran's Q test assessed the consistency of chatbot responses over time for binary outcomes (correct vs. incorrect answers). A non-significant p-value (p>0.05) indicates stable performance, suggesting repeatability and reliability. ChatGPT-3.5 showed perfect temporal consistency (Q=0.000, p=1.000), giving identical answers across repeated trials. This indicates a stable response pattern. Microsoft Bing, Google Gemini, ChatGPT 01-Preview, and ChatGPT 01-Mini also showed strong consistency with high p-values. Claude 3.5 Sonnet and ChatGPT-4 had relatively lower p-values (0.246 and 0.165, respectively), but still above 0.05. Despite ChatGPT-4 having a lower p-value than ChatGPT-3.5, it still provided consistent responses across time points. However, ChatGPT-3.5 showed the highest repeatability of all models. The Cochran’s Q test results confirm that all chatbot models provided stable and repeatable outputs, making them suitable for use where consistency is essential.
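As a companion sketch, here is how the temporal consistency check could be reproduced in Python using statsmodels' Cochran's Q implementation; the study ran this test in SPSS, and the answer matrix below is hypothetical rather than taken from the paper. Rows correspond to the ten questions, columns to repeated administrations, and entries use the study's 1 (correct) / 0 (incorrect) coding; a p-value above 0.05 would indicate temporally consistent answers.

```python
# Minimal sketch (not the study's SPSS output): Cochran's Q test on one
# chatbot's binary answers across repeated administrations.
# Rows = the 10 questions, columns = 3 repeated sessions (hypothetical data),
# entries = 1 (correct) / 0 (incorrect).
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

answers = np.array([
    [1, 1, 1],  # Q1
    [0, 0, 0],  # Q2
    [1, 1, 1],  # Q3
    [1, 1, 1],  # Q4
    [0, 0, 1],  # Q5: one inconsistent session (hypothetical)
    [1, 1, 1],  # Q6
    [0, 0, 0],  # Q7
    [0, 0, 0],  # Q8
    [0, 1, 0],  # Q9: one inconsistent session (hypothetical)
    [1, 1, 1],  # Q10
])

result = cochrans_q(answers, return_object=True)
print(f"Q = {result.statistic:.3f}, p = {result.pvalue:.3f}")

# p > 0.05 suggests the chatbot's correct/incorrect pattern did not change
# significantly across the repeated sessions, i.e. it was temporally stable.
```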
The results showed significant differences between AI chatbot responses and general practitioners' responses, so the null hypothesis was rejected.
AI in Education
Artificial intelligence is widely used in education, such as creating course materials, providing language translations, making recommendations for educators, designing assessment tasks, and evaluating student performance. Similarly, AI applications help students by answering questions, summarizing texts, and assisting with assignments. However, AI-generated responses may not always be accurate because they can be based on incomplete or incorrect data, which can lead to misinformation. Since 2022, AI-powered chatbots using natural language processing (NLP) have become more popular in education. Though chatbot technology goes back to 1966, current versions can understand commands, handle complex requests, and even produce spoken responses.
Several studies have evaluated how well AI chatbots answer complex medical questions. For instance, two studies analyzed how AI chatbots performed on the United States Medical Licensing Exam (USMLE). One study by Kung et al. evaluated ChatGPT's performance on the USMLE without additional training; the chatbot achieved an accuracy rate exceeding 50% across all exam sections. Another study by Singhal et al. assessed the performance of a different AI chatbot model (Flan-PaLM, Google) on the USMLE, reporting its accuracy rate. Revilla-León et al. compared the performance of ChatGPT-3.5, ChatGPT-4, and human dentists on a 50-question multiple-choice exam for Implant Dentistry Certification; both ChatGPT-3.5 and ChatGPT-4 passed the exam, with ChatGPT-4 scoring significantly higher than ChatGPT-3.5 and the human dentists. A recent study also evaluated ChatGPT-4's accuracy in answering prosthodontics questions and reported a low reliability rate. These results suggest that AI chatbots can process complex medical and clinical information, but their accuracy differs by application.
A survey-based study administered a questionnaire focused on artificial intelligence and provided recommendations for enhancing AI in dental education. In a study involving two ChatGPT versions, questions from the European Certification in Implant Dentistry examination were used; the AI models performed successfully overall, and ChatGPT-4.0 showed higher accuracy than the 3.5 version and licensed dentists. Another comparative study likewise highlighted the potential of AI as a supportive tool in dental education. Eraslan et al. assessed the performance of AI-based chatbots (ChatGPT-3.5, Gemini Advanced, Claude Pro, Microsoft Copilot, and Perplexity) in answering prosthodontics-related questions from the Dental Specialty Examination (DUS). Among the models, Microsoft Copilot achieved the highest accuracy rate and performed better than Perplexity. These findings add to the evidence that large language models can be effective tools in dental assessments.
Like the present study, Eraslan et al.'s research used questions from a standardized specialty exam and analyzed AI performance; its comparative evaluation across chatbot platforms adds a complementary dimension. While such comparative studies highlight the potential of current chatbot models in dental assessments, AI tools still have limitations as dependable educational resources. Continued advancements in AI are expected to enhance its integration into dental education.
This study is the first to evaluate seven AI chatbots on multiple-choice questions about prosthetic dentistry prepared for the Turkish Dental Specialization Examination by DUSDATA TR. ChatGPT-4, ChatGPT-3.5, and Claude outperformed general practitioners on Questions 1, 3, 4, and 10, while ChatGPT 01-Preview and ChatGPT 01-Mini performed better than general practitioners on Questions 1, 3, 5, and 8. None of the AI chatbots correctly answered Question 7, raising concerns about their limitations in interpreting nuanced clinical scenarios. Microsoft Bing gave the most incorrect responses, though its consistency over time did not differ significantly. Bing's integration with web search engines and exposure to irrelevant information may have led to indecisive answers. Despite these limitations, the results highlight the potential utility of AI chatbots in dental education and clinical decision-making; to be reliably integrated into healthcare settings, these technologies must consistently produce accurate and relevant responses.
Interestingly, human participants had low success rates on Questions 2 and 9, suggesting these questions were difficult, and the chatbots consistently failing the same items reflects that difficulty. Still, the knowledge humans gain from experience and education remains valuable and shows where AI lags. AI chatbots may struggle with the multiple-choice format: it forces the model to select a single fixed option, increasing the chance of error, whereas open-ended questions rely on narrative construction, where AI models tend to perform better. Multiple-choice formats therefore pose a greater challenge for AI systems than more flexible tasks. Prior research and this study highlight unresolved challenges that must be addressed before AI chatbots can be trusted in educational or clinical settings.
Limitations
AI's main limitation is generating responses that appear reliable but are incorrect; in healthcare, a recognized concern is that AI chatbots may give misleading advice. AI chatbots could transform healthcare, but every AI response must be verified against primary sources, and safe integration requires oversight. AI is an emerging field that enables computers to perform tasks once reserved for humans, and it has become integral to many sectors. In medicine, researchers are exploring AI to enhance care, and in dentistry every specialty can benefit from it through diagnostic precision, treatment planning, and efficiency, provided clinicians critically assess the systems they use. A recent review found that AI systems still lack human diagnostic accuracy and clinical intuition. This study evaluated only AI chatbot responses to multiple-choice questions from dental exams, overlooking AI's broader capabilities.
Therefore, the study did not comprehensively assess chatbot competence, and no details were provided about question difficulty or human performance benchmarks. Expanding evaluations beyond multiple-choice questions to better reflect clinical reasoning is warranted. Multiple-choice items were selected for objectivity and to minimize human error; however, AI hallucination (generating plausible but incorrect responses) remains an important concern, and chatbot accuracy ultimately depends on the quality of the training data. It is also unclear whether the AI models can access subscription-based dental databases; domain-specific AI systems may outperform general models by offering more reliable responses. Chau et al. argue that AI tools will be adopted for clinical purposes, so the ethical implications of AI integration must be carefully considered; ethical guidelines should ensure that AI complements human judgment, and global organizations have started to address these challenges. Rephrasing questions may reduce prompt-related biases, and future studies should investigate how question format affects performance. Another limitation is language: because this study used Turkish-language exam items, results may differ in other languages owing to variations in training data, so multilingual assessments should be included in future work.
This study used a single-response sampling method, preventing detailed response stability evaluation. The limited number of items and excluded image-based questions restrict the generalizability of the findings. This study is a pilot investigation, laying groundwork for comprehensive research including text-based and visual questions across dental disciplines.
Conclusions
Based on the findings, the following conclusions were drawn: there was a statistically significant difference between the accuracy rates of the responses given by the different chatbots for each question; responses from the different chatbot models remained consistent over time, indicating stability; and although there was no significant difference in time-based consistency, Bing provided the most incorrect responses. This study highlights the varying performance of AI chatbots on multiple-choice questions about prosthetic dentistry and the need for further research to assess their reliability.