AI Chatbot vs. Human Evidence Synthesis
Source: bmcmedresmethodol.biomedcentral.com
Artificial intelligence is being used more and more in research with the rise of large language models, potentially making research processes faster. A study compared how well chatbots and human researchers answered questions for evidence synthesis during a scoping review. The study analysed and compared responses from two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of an evidence synthesis from a scoping review about digitally supported interaction among healthcare workers.
The results showed that chatbot and human answers were judged to be similarly correct. However, the chatbots were better at recognising the context of the original text and gave more complete, though longer, answers, while the human answers added new content or interpreted the text less often. Among the chatbots, ZenoChat's answers were rated the highest, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 in third place. The completeness and correctness of an answer were positively correlated with correct contextualisation. Chatbots using large language models could help speed up qualitative evidence synthesis, and their applications in research are likely to grow as chatbots are rapidly developed and refined.
Since ChatGPT was launched in November 2022, large language model chatbots have gained attention from the public, politicians, and scientists. Their usefulness and challenges have been discussed in areas like education and healthcare. These advanced language models are trained on large amounts of data to mimic human conversation. Their scale, pre-training, contextual understanding, and flexibility distinguish them from previous machine learning tools. The use of artificial intelligence with large language models offers new possibilities for analysing complex datasets, making predictions, and supporting literature reviews, especially in healthcare. Its use in research is expanding because it could speed up the research process and improve transparency.
Evidence synthesis, such as systematic reviews, can take over a year from the literature search to publication. Large language models could speed up the creation of evidence-based guidelines, which could improve medical practice. ChatGPT has performed well in reproducing specific themes in qualitative research, but less well in establishing interpretative themes and creating depth when coding inductively. Qualitative research tools like MAXQDA and ATLAS.ti have added artificial intelligence tools with OpenAI to help users in different stages of their research. Chatbots could improve research by helping with tasks like writing search strings and summaries. ChatGPT is a well-known example, but there are many other chatbots that use large language models.
Several alternatives have appeared on the market in recent years. Although they use similar core technology, their different training and fine-tuning processes can lead to different response generation and capabilities. This study aims to compare the accuracy, completeness, and relevance of chatbot-generated answers to questions about pre-coded article excerpts. The chatbot responses are compared to human responses and across different chatbots. This will improve the understanding of how chatbots are used and how well they perform in evidence synthesis. A randomised and blinded survey-based process was used to evaluate how well chatbots support evidence synthesis. As part of a scoping review on digitally supported interaction between healthcare workers, 407 articles were manually analysed for data extraction.
The analysis scheme was based on Greenhalgh et al.’s NASSS framework (Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies). This framework contains seven categories with specific questions on each: the condition, the technology, the value proposition, the adopters, the organization, the wider system, and the evolution and adaptation over time. It was made into a coding framework of seven codes with questions such as: Why was the technology introduced? What are the key features of the technology? What is the technology’s positive and negative value proposition? What changes in staff roles, practices, and routines are implied? What is the organization’s readiness for technology-supported change? What is the context for program rollout? How much scope is there for adapting the technology and the service over time?
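As an illustration, the coding framework can be thought of as a mapping from each NASSS-derived code to its guiding question(s). The Python sketch below is only a schematic rendering of the framework described above; the identifiers are invented, and the real scheme attached one to four questions to each code.

    # Schematic rendering of the seven-code framework described above.
    # Each code maps to example question(s); the real scheme attached one to
    # four questions to each code.
    NASSS_CODES = {
        "condition": ["Why was the technology introduced?"],
        "technology": ["What are the key features of the technology?"],
        "value_proposition": [
            "What is the technology's positive and negative value proposition?"
        ],
        "adopters": [
            "What changes in staff roles, practices, and routines are implied?"
        ],
        "organization": [
            "What is the organization's readiness for technology-supported change?"
        ],
        "wider_system": ["What is the context for program rollout?"],
        "evolution_over_time": [
            "How much scope is there for adapting the technology and the service over time?"
        ],
    }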
The 407 articles covered four settings: hospital, ambulatory (outpatient), intersectoral, and other. The study selected 39 articles about the ambulatory setting and identified those with coded entries for each code. A researcher chose five articles for each of the six codes (condition, technology, value proposition, adopters, organization, and wider system) to ensure diverse representation while maintaining randomness. The code “the evolution and adaptation over time” was used in only two articles within the subset, and both were included. Different coded parts under the same code in one article were combined into a single text passage of varying length and complexity.
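A selection along these lines could be scripted roughly as follows. This is a hypothetical sketch: the article identifiers, seed, and mapping are invented, and the actual selection combined a researcher's judgment with randomness rather than being purely automated.

    import random

    # Hypothetical mapping: NASSS code -> IDs of ambulatory-setting articles with
    # coded entries for that code (the identifiers are invented for illustration).
    coded_articles = {
        "condition": ["A03", "A07", "A12", "A15", "A21", "A28"],
        "technology": ["A01", "A04", "A09", "A16", "A22"],
        "evolution_over_time": ["A05", "A19"],  # only two articles carried this code
        # ... remaining codes omitted for brevity
    }

    random.seed(2023)  # fixed seed so a purely random draw would be reproducible

    selected = {
        # up to five articles per code; codes with fewer coded articles keep them all
        code: random.sample(articles, k=min(5, len(articles)))
        for code, articles in coded_articles.items()
    }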
The text passages were given to two researchers and four chatbots, each of which was asked to answer the questions for each code with all relevant information. The two researchers had been actively involved in the research design and conduct. For each code, an example coded text passage is provided along with an answer from one of the four chatbots or a human researcher. The study compared four chatbots: ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash. ChatFlash, ChatGPT 3.5, and ChatGPT 4.0 are based on GPT (Generative Pre-trained Transformer). ZenoChat uses GPT 4.0, Sophos-2, or Mixtral, depending on its settings. Both ChatGPT 3.5 and 4.0, based on GPT 3.5 and GPT 4 respectively, were included, as well as the ZenoChat version based on Sophos-2.
To ensure a standardised process, all chatbots received the same prompt for the same pre-coded text: “Use the following paragraph to extract in academic style in bullet points the answers – if any answers are provided – to the following questions:”, followed by one to four questions about the specific code. There was no word count limit because the texts varied in complexity and information density. The prompt was developed through testing different prompt designs. The chatbots generated answers in November 2023. Digitally supported randomization in Microsoft Excel was used to select three text passages for each code, resulting in 20 text passages for the survey. The text passages came from articles about the condition, the technology, the value proposition, the adopters, the organization, the wider system, and the evolution and adaptation over time.
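The paper does not state whether the chatbots were queried through their web interfaces or programmatically. As a hedged illustration, the standardised prompt could be issued to one of the GPT-based chatbots via the OpenAI Python client roughly as below; the model name, the message layout, and the ordering of questions and paragraph are assumptions, not the authors' documented procedure.

    from openai import OpenAI  # assumes the official OpenAI Python client is installed

    PROMPT = (
        "Use the following paragraph to extract in academic style in bullet points "
        "the answers - if any answers are provided - to the following questions:"
    )

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_answers(passage: str, questions: list[str], model: str = "gpt-4") -> str:
        """Send one pre-coded text passage and its one to four code-specific questions."""
        user_message = PROMPT + "\n" + "\n".join(questions) + "\n\nParagraph:\n" + passage
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": user_message}],
        )
        return response.choices[0].message.content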
Each text passage had four chatbot-written responses and two human-written ones. A survey was created to evaluate the responses using a consistent evaluation framework for each question. The length of the response was measured on a Likert 1–3 scale: 1 – too short, 2 – appropriate length, 3 – too long. Completeness and correctness were measured on a Likert 1–3 scale: 1 – complete/correct, 2 – partially complete/partially correct, 3 – significant part(s) are missing/content is displayed incorrectly. Three more questions evaluated the correct identification of the context (correct/incorrect) and whether the answer included any addition of new content (yes/no) and/or an interpretation beyond the original text (yes/no). The surveys were set up as Google Forms, and respondents could provide open-ended feedback on both the response and the original text passage.
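One way to picture the resulting evaluation record is as a flat structure with one entry per rater and answer. The sketch below is illustrative only; the field names are assumptions and do not come from the study's actual survey export.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ResponseRating:
        """One rater's evaluation of one answer, mirroring the survey items above."""
        passage_id: str
        answer_id: str             # blinded: the rater does not know the answer's author
        length: int                # 1 = too short, 2 = appropriate length, 3 = too long
        completeness: int          # 1 = complete, 2 = partially complete, 3 = significant parts missing
        correctness: int           # 1 = correct, 2 = partially correct, 3 = displayed incorrectly
        context_correct: bool      # was the context of the original text identified correctly?
        addition: bool             # does the answer add content not present in the original text?
        interpretation: bool       # does the answer interpret beyond the original text?
        feedback: Optional[str] = None  # open-ended comment on the response or the passage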
Word counts were used to measure the length of the responses, and the word ratio was calculated by dividing the word count of an answer by that of the original text. Six independent survey participants from the Bavarian Research Center for Digital Health and Social Care were recruited. They had a background in social sciences and professional expertise ranging from nursing to physiotherapy. These individuals were familiar with the scoping review and the aim of the research but were not involved in deriving the answers. They received training before reviewing the text passages to align their understanding and reduce subjective interpretation. Text passages and their corresponding answers were assigned to raters using random sequence allocation, and raters were blinded to the authors of the answers. In total, 120 text passages were reviewed, and raters were not shown the ratings of other raters. To reduce recognition bias, the formatting of the answers was standardised and introductory phrases were omitted. The research process is summarised in Figure 1.
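The word ratio described above reduces to dividing two word counts. A minimal sketch, assuming simple whitespace tokenisation (the paper does not specify how words were counted):

    def word_ratio(answer: str, original: str) -> float:
        """Word count of the answer divided by the word count of the original passage."""
        return len(answer.split()) / len(original.split())

    # Example: a 42-word answer to a 200-word passage yields a ratio of 0.21,
    # which is in the range reported for the human researchers in the results.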
Quantitative data from the surveys were exported to Microsoft Excel and analysed using descriptive and comparative statistical methods in Stata. Statistical differences were determined using the non-parametric Kruskal–Wallis test at a significance level of p<0.05, and interrater reliability was calculated using Cohen’s kappa. Qualitative data were transferred to a Microsoft Excel document and analysed inductively. Raters showed fair inter-rater reliability, with an overall Cohen’s kappa of 0.30 (range 0.12–0.51) and a standard error of 0.11 (range 0.07–0.12). Cohen’s kappa was lower for context (κ=0.18±0.09) than for the other variables, which ranged from 0.27±0.11 for correctness to 0.39±0.10 for length.
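The analysis itself was run in Stata; as a rough illustration, the same two tests could be reproduced in Python along the following lines, using invented placeholder ratings.

    from scipy.stats import kruskal
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical correctness ratings (1-3) for the same answers, grouped by author.
    ratings_by_author = {
        "ZenoChat": [1, 1, 2, 1, 3, 1, 2, 1],
        "ChatGPT 3.5": [2, 1, 1, 3, 2, 1, 2, 2],
        "Researcher A": [1, 2, 2, 1, 1, 3, 2, 1],
    }

    # Kruskal-Wallis H test: do the rating distributions differ across authors?
    h_stat, p_value = kruskal(*ratings_by_author.values())
    print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")  # difference if p < 0.05

    # Cohen's kappa for two raters scoring the same answers on one variable.
    rater_1 = [1, 2, 1, 1, 3, 2, 1, 2]
    rater_2 = [1, 2, 2, 1, 3, 1, 1, 2]
    print(f"Cohen's kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")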
Across the dataset, the correctness of the responses was similar, but other categories showed statistically different results. Chatbot answers were generally seen to demonstrate better recognition of the context (chatbot: 92.42% vs. human: 84.85%) and were longer, with a mean word ratio of 0.45±0.50 compared to 0.21±0.26 for humans. Chatbot answers were also perceived as more complete than human answers (chatbot: 79.73% vs. human: 52.65%). However, human answers were considered superior in not including interpretation (human: 97.35% vs. chatbot: 81.44%) or adding material not in the original text (human: 97.73% vs. chatbot: 81.82%).
Among all chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Responses considered too long had a word ratio of 0.84±0.65, compared to 0.27±0.28 for adequate lengths and 0.10±0.11 for those considered too short. ChatGPT 3.5’s responses were longer (mean word ratio: 0.58±0.70) than ChatFlash’s (0.34±0.36) and ZenoChat’s (0.38±0.37). Human responses were shorter, with word ratios of 0.16±0.14 (Researcher A) and 0.25±0.34 (Researcher B). Raters noted that they marked answers as too long if they had redundant, irrelevant, or too detailed information. When responses were rated incomplete, raters sometimes noted the expected content.
The completeness evaluations of the chatbots didn't differ significantly from one another, with completeness ratings ranging from 76.52% (ZenoChat) to 81.82% (ChatFlash and ChatGPT 3.5). All chatbots’ answers were considered more complete than those of the human researchers, who scored 49.24% (Researcher A) and 56.06% (Researcher B). ZenoChat’s answers were rated as more correct (81.82%) than those of ChatGPT 3.5 (68.18%), ChatGPT 4.0 (68.18%), and the human researchers (A: 68.94%; B: 70.45%). ChatFlash’s answers (77.27%) were not evaluated significantly differently from those of either the other chatbots or the humans.
ChatFlash’s responses were perceived to show a better understanding of the context than ChatGPT 3.5’s (94.70% vs. 87.88% correct). Researcher B’s responses (86.26% correct) were rated as showing poorer contextualisation than ZenoChat’s (93.94%) and ChatFlash’s (94.70%), while Researcher A’s contextualisation was additionally rated below ChatGPT 4.0’s (83.33% vs. 93.18% correct). ChatGPT 3.5 and ChatGPT 4.0 were evaluated as containing more additions than ZenoChat and ChatFlash, with 29.55% and 22.72% of answers containing an addition versus 8.33% and 12.12%, respectively. Researcher A showed a slightly higher percentage of additions (4.55%) than Researcher B (0.00%); Researcher A’s responses were nonetheless perceived to contain fewer additions than those of ChatGPT 3.5, ChatGPT 4.0, and ChatFlash, and Researcher B’s responses (0.00%) also contained fewer additions than ZenoChat’s.
ChatGPT 4.0’s responses were evaluated as containing more interpretation than ZenoChat and ChatFlash, with 32.58% of answers containing an interpretation versus 5.30% and 13.66%, respectively. ZenoChat also contained fewer interpretations than ChatFlash and was the only chatbot to not provide more interpretation in its answers than the human researchers (A: 5.30%, B: 3.79%). Raters highlighted sentences they saw as containing interpretation, which included words such as ‘potentially’, ‘suggesting’, ‘pointing to’, ‘could’, ‘appears to be’, ‘indicating’, ‘may reflect’, and ‘may lead’.
Some raters acknowledged that some interpretation was needed to correctly answer the question, from recognising abbreviations to showing broader contextual understanding. One rater noted difficulty in answering the question because it required interpretation to grasp the right aspects, indicating that the system needs to perform semantic or interpretative tasks to answer the question.
Correlational analysis showed a moderate positive correlation between correctness and completeness (ρ=0.63), correctness and context (ρ=0.56), and length and word ratio (ρ=0.56). The correlation between variables in human responses was higher than for chatbots regarding correctness and completeness (ρ=0.71 vs. ρ=0.60) and correctness and context (ρ=0.72 vs. ρ=0.44). There was a low positive correlation between context and completeness (ρ=0.46) and between interpretation and addition (ρ=0.35). A low negative correlation was found between completeness and length (ρ=−0.33) and correctness and addition (ρ=−0.35).
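The paper reports ρ values without naming the coefficient; the sketch below assumes Spearman's rank correlation, which suits ordinal ratings, and again uses invented placeholder data.

    from scipy.stats import spearmanr

    # Hypothetical paired ratings across all evaluated answers (1 = best, 3 = worst).
    correctness = [1, 2, 1, 3, 2, 1, 1, 2, 3, 1]
    completeness = [1, 2, 2, 3, 2, 1, 1, 3, 3, 1]

    rho, p_value = spearmanr(correctness, completeness)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")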
In human responses, there was a low negative correlation between length and correctness (ρ=−0.34) and a moderate negative correlation between length and completeness (ρ=−0.66). Chatbot answers showed a low negative correlation between addition and correctness. Chatbots were considered better at recognising context and providing more complete, longer summaries, while humans were seen as less likely to add or interpret material. ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 equally in third place.
Statistical analysis showed a positive correlation between correct contextualisation and completeness, and between correct contextualisation and correctness. Qualitative feedback showed that longer answers often had redundant information and raised the question of the role of interpretation in answering the question effectively in evidence syntheses. Correct contextualisation and the absence of addition and interpretation were important factors in the research setup.
While Hamilton et al. found ChatGPT to have limited contextual understanding, this study found a clearer understanding of the context by the chatbots than in the human researchers’ answers. The distinction between recognising context and interpreting content to answer a question is complex. This is closely linked to the debate over what constitutes the addition of new content versus the interpretation of the original text’s content. Precise and specific instructions as prompts are essential to improve the accuracy and relevance of chatbot responses.
In this study, chatbots tended to give more extensive answers than humans, likely because of differing understandings of the exercise: the humans aimed for brevity to speed up subsequent analysis, but the chatbots did not have this knowledge. Chatbots are designed to infer user expectations from the prompt and tend to follow instructions rather than seeking clarification or responding according to their own skills and limitations. Recent research has shown that chatbots like ChatGPT can generate discharge and patient summaries of medical histories and summarise scientific literature with quality and accuracy comparable to or exceeding traditional methods. This shows their ability to extract main ideas from a text, in line with the findings of this study. However, these studies didn't compare the capacities of different chatbots.
Some studies have compared chatbots’ ability to answer complex medical examination questions. In these, ChatGPT 4.0 responded more accurately and concisely than Bard, but scored worse than the medical reference group. This suggests that a chatbot's ability to answer questions depends on the availability of an accurate reference text. Chatbots can be used in scenarios requiring idea extraction from a resource, such as content analysis for qualitative research. Also, the performance of chatbots in evidence syntheses depends on adequate prompting, training, and parameters of the respective chatbots. This is visible in the variations in results between different chatbots. Further investigation is needed regarding this aspect.
The study showed that AI-powered chatbots can improve the research process and help humans conduct reviews. However, human oversight and correction are essential. It is important to recognise the potential and be aware of the shortcomings of chatbots, which can include biases, non-disclosure of training data, incorrect information, and nonsensical responses. A major strength of this study is the comparison between different chatbots and their performance against human researchers, as most studies only compare chatbots against each other.
Despite this strength, the study has constraints. Four chatbots were selected from a growing number of chatbots using different large language models. Generalising the findings should be done with caution because the text parts used to elicit answers were preselected by human researchers for topical relevance. Eliciting responses from the full text might yield different outcomes. A custom-developed metric was used to assess performance, which has not been formally tested.
Future research should focus on refining prompts so that they better convey the implicit human understanding of the context and the specific objective expected from the chatbot, thereby improving the quality of chatbot performance. This includes researchers assessing their underlying assumptions and intentions, defining the criteria for evaluating responses, and employing prompt engineering methods to refine the prompt. Longitudinal studies are crucial to offer insights into how chatbot capabilities and performance change as the underlying large language models quickly evolve.
It is also recommended to assess the chatbots’ capabilities to answer questions when they receive the full text instead of curated parts. This would provide a fuller test of the chatbots’ abilities and could help identify additional information that might have escaped human judgment. This study demonstrates that chatbots can provide complete and correct answers to questions on a given text and can be useful in accelerating research processes, especially in qualitative evidence synthesis. Given the speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will likely continue to expand.