Assessing the potential of artificial intelligence to enhance colonoscopy adenoma detection in clinical practice: a prospective observational trial
Abstract
Background/Aims
This study aimed to evaluate the effectiveness of the GI Genius (Medtronic) module in clinical practice, focusing on the adenoma detection rate (ADR) during colonoscopy. Computer-aided polyp detection (CADe) systems using artificial intelligence have been shown to improve adenoma detection in controlled trials. However, the effectiveness of these systems in clinical practice has recently been questioned.
Methods
This single-center prospective observational study was conducted at the University Hospital of Southern Denmark and included all individuals referred for colonoscopy between November 2020 and January 2021. The primary outcome was ADR, comparing patients examined with CADe to those examined without it. The selection of patients to be examined with the CADe module was effectively random (pseudo-randomized by the availability of endoscopy suites).
Results
A total of 502 patients were analyzed (318 in the control group and 184 in the CADe group). The overall ADR was 32.1%, with a slightly higher rate in the CADe group (34.7% vs. 30.5%). Multivariable analysis showed a modest, statistically non-significant increase in ADR (risk ratio, 1.12; 95% confidence interval, 0.88–1.43).
Conclusions
The use of CADe in clinical practice did not produce a statistically significant increase in ADR compared with colonoscopy without CADe. These findings suggest that the impact of CADe systems in everyday clinical practice is modest.
INTRODUCTION
Colonoscopy has the potential to prevent colorectal cancer (CRC) through the identification and removal of precancerous lesions, namely colorectal adenomas. The benchmark diagnostic quality measure for colonoscopy is the adenoma detection rate (ADR), which is defined as the proportion of colonoscopies with at least one detected adenoma.1 Higher ADRs are associated with a reduced risk of post-colonoscopy interval cancer.2 ADR is dependent on patient age, sex, reason for investigation, and modifiable procedural and operator-related factors such as the quality of bowel preparation, operator experience, recognition error fatigue, and time-of-day.3,4 Real-time computer-aided polyp detection (CADe) has been developed to improve the diagnostic quality of colonoscopies. Using artificial intelligence (AI), CADe devices are trained to process colonoscopy images and superimpose markers on the endoscopy display in real time, thus highlighting potential lesions.
Two recent systematic reviews have produced somewhat different results on the impact of CADe devices on the ADR.5,6 An analysis of data from randomized controlled trials (RCTs) found an improved ADR of 44.0% vs. 35.9% (risk ratio [RR], 1.24; 95% confidence interval [CI], 1.16–1.33).6 A review focusing on the implementation of CADe modules in clinical practice, which included only observational studies, found a difference in ADR of 36.3% vs. 35.8% (RR, 1.13; 95% CI, 1.01–1.28).5 Although both systematic reviews documented improved ADRs with the use of CADe, the effect was markedly lower when tested in clinical practice. This discrepancy requires further investigation to ensure that the results obtained in the highly controlled environments of RCTs can be translated into clinical practice.
Potential barriers may be encountered when the results of RCTs are replicated in real-world clinical practice. In RCTs, controlled patient, procedural, and examiner selection is instituted to optimize the trial’s ability to test the intervention at hand. In observational trials, such optimization is not feasible, and the performance of the intervention is tested as it would perform in clinical practice.5 However, these trials have limitations. Among the 12 clinical practice studies included in the systematic review, only one study used a concurrent comparator group, whereas the other 11 studies used historical controls.7 The use of historical controls can reduce the comparability between the two groups in question. This could be attributed to changes in the workflow, personnel, education, and bowel preparation regimens. We aimed to further clarify the extent to which CADe affects the ADR when tested in everyday clinical practice, in contrast to the strictly controlled and standardized environments of RCTs. This is the first Scandinavian study to use a prospectively defined concurrent comparator group.
METHODS
Population
All individuals referred for colonoscopy at the Department of Surgical Gastroenterology, Esbjerg Hospital, University Hospital of Southern Denmark, between November 2020 and January 2021 were eligible for inclusion. The study size was not based on a priori power calculation but sought to include all examined patients within this timeframe. The limits of the inclusion period were determined by the 3-month period during which the CADe module was available to the department. Indications for colonoscopy included CRC screening, post-polypectomy surveillance, post-CRC surgery surveillance, hereditary CRC surveillance, and symptomatic patients under evaluation for CRC. Follow-up colonoscopies for inflammatory bowel disease were not included in this study. The exclusion criteria were age less than 18 years and contraindications for biopsy or polyp removal.
Design
This was a single-center, prospective, observational study. Colonoscopies were performed in one of three adjacent endoscopy suites. A CADe module was installed in one of the three suites. The two endoscopists of the day rotated between the three suites, ensuring that the next available patient was investigated in the next available suite by the next available endoscopist. In effect, the assignment of patients to endoscopists and suites was random. With this pseudo-randomization set-up, the selection of patients examined using the CADe module was therefore also effectively random. The endoscopy suites were identical, except for the presence of the CADe module. Patients and investigators were not blinded to whether the GI Genius module was applied during colonoscopy.
The primary outcome was ADR, defined as the rate of colonoscopies with at least one histologically verified adenoma out of all endoscopies performed. We compared the ADR between patients with and without CADe. Adenomas were defined according to the Vienna classification.8
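As a minimal illustration of this definition, ADR is a simple proportion. The sketch below uses a hypothetical helper name (`adenoma_detection_rate`) and, as example input, the overall counts reported in the Results (161 of 502 colonoscopies). It is not the authors' analysis code.

```python
def adenoma_detection_rate(n_with_adenoma: int, n_colonoscopies: int) -> float:
    """Proportion of colonoscopies with at least one histologically verified adenoma."""
    if n_colonoscopies <= 0:
        raise ValueError("n_colonoscopies must be positive")
    if not 0 <= n_with_adenoma <= n_colonoscopies:
        raise ValueError("n_with_adenoma must lie between 0 and n_colonoscopies")
    return n_with_adenoma / n_colonoscopies


# Overall counts as reported in the Results: 161 of 502 colonoscopies.
overall_adr = adenoma_detection_rate(161, 502)
print(f"Overall ADR: {overall_adr:.1%}")  # Overall ADR: 32.1%
```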
AI (CADe)
The CADe module (GI Genius; Medtronic) is based on a convolutional neural network that has been trained and validated using 2684 videos of histologically confirmed polyps from 840 patients.9 GI Genius receives and analyzes the digital image from the endoscopy processor and highlights possible polyps using a green box on the endoscopy display monitor with a latency that is not perceivable (1.52±0.08 μs).10
Colonoscopy procedure
Bowel preparation was performed using either split-dose 1+1 L polyethylene glycol solution (MoviPrep; Norgine) for screening colonoscopies or split-dose picosulfate solution (Picoprep; Ferring Pharmaceuticals) for non-screening colonoscopies. Bowel preparation quality was assessed using the Boston bowel preparation scale (BBPS).11 All procedures were performed using a flexible, high-definition colonoscope (Evis Exera 190; Olympus) under fentanyl and midazolam sedation at the discretion of the endoscopist. Endoscopy was conducted between 9 am and 3 pm by endoscopists at the Department of Surgical Gastroenterology. The performing endoscopist ranged from senior house officer to consultant. All identified polyps were removed, except for diminutive (<5 mm) NBI International Colorectal Endoscopic type 1 rectal hyperplastic polyps. All the removed polyps were sent for histopathological analysis.
Data collection
The collected data included patient age, sex, indication for colonoscopy, use of CADe, intubation and withdrawal time (including time for polypectomy and any additional procedures), number of polyps sent for histopathological analysis, BBPS, and number of histologically confirmed adenomas.8 Data were recorded on individual Case Report Forms at the time of endoscopy, except for histological confirmation of adenomas, which was collected from the local electronic pathological register.
Statistical analysis
Patients with missing data on group assignments or polyp removal were excluded from the analysis. Statistical analyses were performed using Stata/IC (ver. 16.0; Stata Corp). Statistical significance was set at a two-sided p<0.05. Clinical and demographic characteristics of the CADe and control groups were compared using the χ² test for categorical variables and either the two-sample t-test or the Mann-Whitney U-test for continuous variables, depending on the normality of the data.
For the subgroup analysis, the endoscopists were divided into two groups. Group 1 included House Officers, Senior House Officers, and Specialist Registrars, while group 2 included certified nurse endoscopists, Senior Registrars, and Consultants. Endoscopist experience was based on the education level rather than the number of colonoscopies, as not all endoscopists had reliable estimates of the number of colonoscopies performed. The endoscopists in group 2 had completed their surgical training and obtained an appropriate specialist qualification (minimum 6 years of training), whereas the endoscopists in group 1 were still in surgical training.
A binomial generalized linear model was used for multivariable analysis (complete cases only), with adenoma detection as the outcome variable. The exposure variable was the use of CADe. The confounding variables were age, sex, BBPS, and whether the indication was colorectal screening or not.
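One way to write such a model is sketched below. It assumes a log link: the link function is not stated in the text, but a log link is what yields risk ratios directly from a binomial model. Here Y_i = 1 if at least one adenoma was detected in patient i, and the coefficient symbols are illustrative.

```latex
\log \Pr(Y_i = 1) = \beta_0 + \beta_1\,\mathrm{CADe}_i + \beta_2\,\mathrm{age}_i
  + \beta_3\,\mathrm{sex}_i + \beta_4\,\mathrm{BBPS}_i + \beta_5\,\mathrm{screening}_i,
\qquad
\mathrm{RR}_{\mathrm{CADe}} = e^{\beta_1}
```

Under this assumed parameterization, the adjusted RR for CADe is the exponentiated coefficient of the exposure, holding the other covariates fixed.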
Trial registration
Data handling was approved by the Region of Southern Denmark (Journal ID 20/53828). Ethical board approval was waived following correspondence with the regional Research Ethics Committees. Patients were included only after providing written informed consent.
The reporting follows the Strengthening the Reporting of Observational studies in Epidemiology guideline.12
Ethical statement
Ethical board approval was waived due to the non-interventional design. Patients were included only after written informed consent was obtained.
RESULTS
A total of 536 patients were included in the study (Fig. 1). Thirty-four patients were excluded from analysis due to unknown group assignment or lack of polyp detection information. Data from 502 patients were available for analysis (318 in the regular colonoscopy group and 184 in the CADe group). There were no significant differences in the baseline characteristics (Table 1) between the two groups.
The procedural and diagnostic differences were not significant between the two groups (Table 2). The overall ADR was 32.1% (161/502); 34.7% in the CADe group against 30.5% in the control group (Table 2). The median BBPS was six in both groups and the withdrawal time was 9.49 minutes in the CADe group vs. 9.53 minutes in the control group (Table 2).
The ADR for screening colonoscopies was 47%, with 42.1% (16/38) in the CADe group and 49.3% (36/73) in the control group (p=0.47). No significant differences were noted in ADR based on endoscopist experience. The ADRs in group 1 were 40.0% (28/70) in the CADe group and 33.3% (31/93) in the control group. In group 2, the ADRs were 32.1% (36/112) in the CADe group and 29.5% (66/224) in the control group.
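The screening-subgroup comparison can be checked from the reported counts. The sketch below assumes an uncorrected Pearson χ² test on the 2×2 table. The text names the χ² test for categorical variables but does not say whether a continuity correction was applied, so that choice is an assumption, as is the helper name `pearson_chi2_2x2`.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int):
    """Uncorrected Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Screening colonoscopies: adenoma detected / not detected in each group.
cade_yes, cade_no = 16, 38 - 16
ctrl_yes, ctrl_no = 36, 73 - 36
chi2, p = pearson_chi2_2x2(cade_yes, cade_no, ctrl_yes, ctrl_no)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")  # chi2 = 0.52, p = 0.47
```

Under these assumptions, the result agrees with the reported p=0.47.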
In the multivariable binomial model (Table 3), the use of CADe was not associated with a significantly higher probability of adenoma detection (RR, 1.12; 95% CI, 0.88–1.43; p=0.37).
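For comparison with the adjusted estimate, a crude (unadjusted) RR can be sketched from group-level counts. The counts 64/184 and 97/318 are not stated directly in the text. They are inferred here from the reported group percentages and the overall 161/502, and should be treated as an assumption. The helper name `risk_ratio_wald` is hypothetical, and the Wald interval on the log scale is a standard textbook choice, not necessarily the authors' method.

```python
import math


def risk_ratio_wald(events_exposed, n_exposed, events_control, n_control, z=1.96):
    """Crude risk ratio with a Wald confidence interval computed on the log scale."""
    risk_exposed = events_exposed / n_exposed
    risk_control = events_control / n_control
    rr = risk_exposed / risk_control
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Inferred counts (assumption): 64/184 with CADe, 97/318 without.
rr, lo, hi = risk_ratio_wald(64, 184, 97, 318)
print(f"crude RR = {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
# crude RR = 1.14 (95% CI 0.88-1.48)
```

The crude estimate is close to the adjusted RR of 1.12, which is consistent with the groups having similar baseline characteristics.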
DISCUSSION
This is one of the first prospective studies to investigate the effects of computer-aided adenoma detection in everyday clinical practice using a prospective comparator group.
Based on more than 500 consecutive patients who were pseudo-randomized to undergo colonoscopy with or without CADe, the overall ADR was slightly higher when CADe was used (34.7% vs. 30.5%). In multivariable analysis, an RR of 1.12 underscored the relatively minimal impact of CADe in clinical practice. Although not statistically significant, the point estimate was in concordance with the combined RR of 1.13 found in the systematic review of clinical practice studies and, as hypothesized, it was considerably lower than what has been reported in RCTs.5,6 A lack of statistical significance was expected with a point estimate close to the null value of 1, and the precision of the estimate was supported by a confidence limit ratio of 1.6.13 With an absolute risk difference in ADR between the two groups of 4.2 percentage points, the number needed to treat was 24. Thus, CADe would need to be applied in 24 colonoscopies to identify one additional patient with at least one adenoma.
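The arithmetic behind the number needed to treat and the confidence limit ratio can be reproduced from the reported figures. This is a minimal sketch with hypothetical helper names. It assumes the usual definitions: number needed to treat as the reciprocal of the absolute risk difference, rounded up, and confidence limit ratio as the upper confidence limit divided by the lower.

```python
import math


def number_needed_to_treat(risk_treated: float, risk_control: float) -> int:
    """Reciprocal of the absolute risk difference, rounded up to a whole patient."""
    risk_difference = risk_treated - risk_control
    if risk_difference <= 0:
        raise ValueError("risk_treated must exceed risk_control")
    return math.ceil(1 / risk_difference)


def confidence_limit_ratio(lower: float, upper: float) -> float:
    """Upper confidence limit divided by the lower; smaller means more precise."""
    return upper / lower


adr_cade, adr_control = 0.347, 0.305  # reported ADRs
ci_lower, ci_upper = 0.88, 1.43       # reported 95% CI of the adjusted RR

print(f"risk difference = {(adr_cade - adr_control) * 100:.1f} percentage points")
print(f"NNT = {number_needed_to_treat(adr_cade, adr_control)}")   # NNT = 24
print(f"CLR = {confidence_limit_ratio(ci_lower, ci_upper):.1f}")  # CLR = 1.6
```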
The overall ADR of 32% reflects a relatively low number of screening individuals compared with that of other studies.14-16 The ADR increased to 47% when only patients investigated as part of the Danish National Colorectal Cancer Screening Program were included. This proportion is comparable to the average ADR of 47% for screening colonoscopies in the region of Southern Denmark, where the hospital is located,17 indicating that the diagnostic quality of the examinations reflects standard clinical practice. The ADR was lower than that reported in many RCTs, which is an inherent feature of a study performed as part of clinical practice where patients are not excluded based on poor bowel cleansing or inability to reach the cecum. The ADR was higher, and the effect of CADe was more pronounced in less experienced endoscopists. However, when subdividing the results based on the level of experience, caution should be exercised in interpretation, as the observed results may be influenced by the relatively small sample size. Further real-world clinical data are necessary to assess the differences in performance of CADe modules, as RCTs have indicated improvement in ADR for both experienced and less experienced colonoscopists.18,19
The main strength of this study was that CADe was evaluated in clinical practice rather than in the more selected patient populations enrolled in RCTs. By extension, patients were not excluded based on incomplete bowel preparation, resulting in a more realistic patient cohort compared with that of other studies.20,21 In addition, all endoscopists were included regardless of their level of expertise, thus mimicking real-world clinical practice. The median BBPS for both the CADe and control groups was 6, and there was no difference in intubation or withdrawal times. Patients were not excluded based on a minimum withdrawal time or BBPS score, as has been done in some RCTs. Instead, withdrawal times and acceptable bowel cleansing were left to the endoscopist’s discretion, reflecting standard practice. The observed mean withdrawal time was 9.51 minutes (including the time for adenoma removal), which is in line with international guidelines that recommend a minimum withdrawal time of 6 minutes.22 The median BBPS of 6 in both groups is classified as good/adequate but not excellent,11 and the clinical practice design inherently leads to a lower BBPS compared to that in RCTs, where excellent bowel cleansing is an inclusion criterion. This could potentially result in a lower ADR and may explain why CADe demonstrated a markedly lower impact on ADR when evaluated in clinical practice. It raises the question of how CADe could be expected to perform better if the mucosa is not adequately visualized. The second major strength of this study was the inclusion of a prospective control group instead of a historical group. The use of historical cohorts was determined to be the “most significant limitation” by the authors of a systematic review on real-world CADe use.5
Our study had certain limitations. Owing to the observational study design and to imitate real-life clinical practice, blinding was not performed. This may have introduced investigator bias, because endoscopists may perform differently if they are aware of the application of CADe. We had no dropouts but excluded 6% of the included patients because of insufficient case report forms; no systematic reason was suspected, and this is unlikely to have affected the results of the study. Some RCTs on CADe have utilized a tandem endoscopy approach in which each patient was examined both with and without CADe. This would have resulted in paired data, which would have been better suited for the technical evaluation of whether CADe is capable of detecting adenomas that are visualized but not detected by the endoscopist. However, this set-up was not feasible while maintaining the clinical practice setting used in this study. The low number of screening colonoscopies prevented any conclusions from being drawn on the observed difference in ADR related to the use of CADe in this specific subgroup of patients. In addition, the number of included patients limited the conclusions drawn from the other subgroup analyses.
Hassan et al.6 concluded that the benefit of CADe was the enhanced detection of diminutive adenomas (<5 mm), likely because endoscopists miss smaller polyps more often than larger ones. Our dataset lacked sufficient information to evaluate the significance of lesion size, yet there is no apparent rationale to anticipate differences between real-world data and findings from RCTs in this respect. If this is the case, the clinical benefit of CADe may be even smaller, as many diminutive polyps are unlikely to progress to colorectal cancer. However, it could also be speculated that CADe could be useful in the detection of sessile serrated lesions, which are difficult to identify, especially during suboptimal bowel cleansing. The main factor in diagnosing lesions is the ability of the endoscopist to visualize the mucosa. This can be hampered by insufficient bowel preparation, endoscopic withdrawal in less than six minutes, and negligence of the endoscopist. CADe is not beneficial in such cases because it can only examine the mucosa that has already been visualized by the endoscopist. This is probably the primary reason for the limited clinical effects of CADe.
In conclusion, the use of CADe in everyday clinical practice did not significantly increase the ADR compared with colonoscopy without CADe. The observed RR of 1.12 for ADR in favor of CADe aligns with previous real-world studies, indicating that CADe may not be as beneficial in enhancing ADR as previously suggested in RCTs.
Notes
Conflicts of Interest
The authors have no potential conflicts of interest.
Funding
None.
Acknowledgments
The authors thank Medtronic for providing the GI Genius endoscopy module. Additionally, we extend our heartfelt thanks to the doctors and endoscopy nurses in the Department of Gastrointestinal Surgery at the University Hospital of Southern Denmark whose support and cooperation greatly facilitated the successful completion of this research project. Medtronic plc supplied the GI Genius artificial intelligence endoscopy module for a period of 3 months. They did not influence the study design or content.
Author Contributions
Conceptualization: RK, MP; Data curation: SNR, SU; Methodology: all authors; Project administration: RK; Supervision: RK, MP; Writing–original draft: SNR; Writing–review & editing: SU, RK, MP.