GPT-4 beat ChatGPT-3.5 on emergency medicine board questions, but Anki tuning did not add much

A 2025 JMIR AI study found GPT-4 far outperformed ChatGPT-3.5 on 598 emergency medicine board questions, while custom Anki tuning showed no clear accuracy gain.

Published: 1 January 2025

In one sentence

On 598 emergency medicine board-style questions, GPT-4 clearly outperformed ChatGPT-3.5, while a custom version tuned with Anki-style material did not significantly improve accuracy over standard GPT-4.

What the researchers did

As large language models became popular study tools, a reasonable question emerged: can a model customized with a learner's flashcards outperform the default version on high-stakes exam questions? This study tested that idea in emergency medicine using board-style items.

The researchers evaluated three systems on 598 questions from an emergency medicine board review source. One was ChatGPT-3.5, one was GPT-4, and the third was a custom GPT-4 configuration tuned with content drawn from Anki-style study material. The goal of the custom setup was to see whether adding domain-specific flashcard knowledge would boost exam performance beyond the base model.

The study compared overall accuracy across the three systems. In other words, could the models pick the correct answer to multiple-choice board questions, and did the Anki-informed setup produce a meaningful edge over default GPT-4?
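
As a rough illustration of what "overall accuracy" means here, the sketch below scores each system's multiple-choice answers against an answer key. It is not the study's code: the function name, the question IDs, the system labels, and every answer shown are made-up toy values, and counting an unanswered question as wrong is an assumption rather than something the paper summary states.

```python
from typing import Dict


def accuracy(answer_key: Dict[str, str], responses: Dict[str, str]) -> float:
    """Return the fraction of questions in answer_key answered correctly.

    Assumption: a question with no recorded response counts as incorrect.
    """
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(
        1
        for question_id, key in answer_key.items()
        if responses.get(question_id, "").strip().upper() == key.strip().upper()
    )
    return correct / len(answer_key)


# Toy, invented data for illustration only. These are NOT the study's
# questions, answers, or results.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
system_responses = {
    "older_model": {"q1": "A", "q2": "B", "q3": "B", "q4": "A"},
    "newer_model": {"q1": "A", "q2": "C", "q3": "B", "q4": "A"},
    "customized_model": {"q1": "a", "q2": "C", "q3": "B"},  # q4 left unanswered
}

scores = {
    name: accuracy(answer_key, responses)
    for name, responses in system_responses.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Scoring every system against the same key and the same question set is what makes the three accuracy figures directly comparable.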

This is a useful design because it separates two claims that are often mixed together. Claim one: newer general-purpose models are better than older ones. Claim two: personal or domain-specific study material, such as Anki decks, can make an already strong model substantially better at exams. The paper lets us inspect both.

What they found

The first result was straightforward: GPT-4 performed much better than ChatGPT-3.5 on the emergency medicine question set. That fits a broader pattern seen across many benchmark studies, where newer frontier models show clear gains on professional and academic tasks.

The more interesting result was the one that did not appear. The custom GPT-4 system, despite being tuned with Anki-related material, did not significantly outperform standard GPT-4. In practical terms, extra flashcard-style customization did not yield a reliable accuracy bump on this board-style exam set.
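
To make "did not significantly outperform" concrete: because all systems answered the same questions, one common way to compare two of them is a paired test such as McNemar's exact test, which looks only at the questions where the two systems disagree. The sketch below is an assumption about how such a comparison could be run, not a description of the statistical method the paper used, and the per-question outcomes in the usage example are invented.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    correct_a: Sequence[bool], correct_b: Sequence[bool]
) -> Tuple[int, int]:
    """Count questions that only system A, or only system B, answered correctly."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each disagreement is equally likely to favor
    either system, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagreed, so there is no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question outcomes (True = correct). Not the study's data.
base_model = [True, True, False, True, False, True, True, False]
custom_model = [True, False, True, True, False, True, True, True]

only_base, only_custom = discordant_counts(base_model, custom_model)
p_value = mcnemar_exact(only_base, only_custom)
print(f"only base correct: {only_base}, only custom correct: {only_custom}")
print(f"two-sided p-value: {p_value:.3f}")
```

A large p-value in a test like this means the disagreements between two systems are roughly balanced, which is the kind of pattern a "no significant gain" finding reflects.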

That matters because it challenges a common intuition. Many users assume that if they upload more personal notes, flashcards, or review materials into an AI system, exam accuracy should rise in a noticeable way. This study suggests the relationship is not that simple. Once the base model is already strong, adding study material may help with style, wording, or familiarity, but not necessarily with measurable correctness.

The result also hints that emergency medicine board performance may depend more on the core reasoning and knowledge already present in the base model than on lightweight customization with supplementary study resources.

What this means for learners and educators

For learners, the practical message is that the choice of model generation appears to matter more than lightweight customization. In this study, moving from ChatGPT-3.5 to GPT-4 produced a clear gain, while adding Anki-derived material to GPT-4 did not produce a significant one.

For educators, the study offers a caution against overclaiming AI personalization. There may be real value in using Anki decks or course notes to steer explanations, examples, or terminology. But this paper suggests that such tuning does not automatically translate into better exam-answering accuracy.

For people who already use Anki, the result should not be read as "flashcards do not work." Human learning and model evaluation are different things. Anki may still help students remember content effectively, even if feeding flashcard-like material into a language model does not noticeably raise the model's score on a question bank.

In other words, the study is more about the limits of AI customization than the limits of retrieval practice.

Limitations and what we don't know yet

The findings come from one exam domain, one question source, and one style of custom tuning. Different prompting strategies, retrieval systems, or more structured fine-tuning might produce different results. So this should not be treated as the final word on all Anki-plus-AI workflows.

Board-style multiple-choice accuracy is also only one outcome. A customized system might still offer benefits not captured here, such as better explanations, closer alignment with a curriculum, or more useful study dialogue for a particular learner.

The available summary of the paper emphasizes the lack of a significant accuracy gain but does not report every implementation detail. Practical replication would therefore depend on exactly how the custom system was constructed and how closely the evaluation matched real-world use.

Still, within those limits, the main takeaway is worth keeping in mind: a stronger base model can produce major gains, while attaching flashcard-derived context to an already strong model may not meaningfully improve test accuracy.