Is ChatGPT reliable in answering questions about fertility?
Answers given by ChatGPT to questions about fertility yield high-quality information with little evidence of commercial bias, according to a study presented at ESHRE 2023.
“It is known that people seeking fertility-related information rely heavily on online sources such as clinic websites, consumer advocacy organizations, patient support groups, and social media,” said researchers K Beilby and K Hammarberg from Monash University in Melbourne, Australia.
“Our findings suggest that ChatGPT may be a useful tool for patients seeking factual and unbiased information regarding fertility and fertility treatment,” they noted.
Beilby and Hammarberg used 10 common patient questions as prompts: three related to fertility awareness (impact of female/male age on fertility and fertile window in the menstrual cycle), one to the chance of success with in vitro fertilization (IVF), one to elective egg freezing, one to the benefits of add-ons, one to polycystic ovarian syndrome (PCOS) and pregnancy, one to choosing a fertility clinic, and one to how many IVF cycles should be attempted.
Two independent experts scored the quality of the information generated by ChatGPT using a scoring matrix (range, 0‒7, where higher scores indicate higher quality). They rated each response on how closely it corresponded to expert human answers (0‒3), absence of commercial bias or controversial claims (no=1, yes=0), use of accurate proportions/statistics (0‒2), and whether it stated that medical advice should be sought (yes=1, no=0).
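The additive rubric described above can be sketched as a simple function. This is an illustrative reconstruction, not the authors' actual instrument; the sub-score names, and the 0‒2 range for the statistics component, are assumptions inferred from the reported 7-point maximum.

```python
def score_answer(correspondence, no_bias, accurate_stats, advises_doctor):
    """Hypothetical additive scoring matrix (0-7), higher = better quality.

    correspondence:  0-3, agreement with expert human answers
    no_bias:         1 if no commercial bias/controversial claims, else 0
    accurate_stats:  0-2, accuracy of proportions/statistics (assumed range)
    advises_doctor:  1 if the answer recommends seeking medical advice, else 0
    """
    assert 0 <= correspondence <= 3
    assert no_bias in (0, 1)
    assert 0 <= accurate_stats <= 2
    assert advises_doctor in (0, 1)
    return correspondence + no_bias + accurate_stats + advises_doctor

# A flawless answer would score the maximum of 7:
print(score_answer(3, 1, 2, 1))  # 7
```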
Scores given by the two experts were closely aligned, differing by only 1 point on a single answer; that discrepancy was resolved through discussion. [ESHRE 2023, abstract O-089]
Notably, none of the answers achieved the maximum of 7, but six out of 10 received a score of 5 or more, while three received a score of 3‒4. Only one answer scored <3, and this was in response to the question about the benefits of add-ons. This question was also the one in which the response had evidence of commercial bias and one of two that made a controversial claim.
Reasons for caution
This study is limited by the unvalidated scoring method used by the experts, and the approach is exploratory in nature. However, expert evaluation is common practice when assessing the performance of machine learning models and is often used to improve their parameters and performance.
“People seeking fertility-related information rely on internet sources when deciding on reproductive planning and assisted conception,” the researchers said. “The quality of information within the commercial landscape of infertility treatment is poor.”
ChatGPT is a language model that uses deep learning to generate human-like text in response to user prompts. It produces answers by repeatedly predicting the next word in the sequence, based on patterns learned from its training data.
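The next-word-prediction idea can be illustrated with a toy model. This is a drastically simplified sketch (a bigram frequency table), not ChatGPT's actual architecture, which uses a large neural network; the training sentence is invented for the example.

```python
from collections import Counter, defaultdict

# "Train" by counting which word follows which in a tiny corpus.
training_text = "ivf success depends on age and ivf success varies by clinic"
words = training_text.split()

follow_counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    follow_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequently observed next word after `word`."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("ivf"))  # "success" (seen twice after "ivf" above)
```

A real language model does the same thing at vastly greater scale, scoring candidate continuations with learned neural-network weights rather than raw counts — which is also why biases in uncurated training data can surface in its answers.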
“The training data for GPT-3 is not curated, but a snapshot of the Web, which includes all kinds of information, including biases that may exist within sources,” according to the researchers.