Validation of a novel, low-fidelity virtual reality simulator and an artificial intelligence assessment approach for peg transfer laparoscopic training



To objectively measure the training ability of the novel, self-developed, low-fidelity VR simulator for basic laparoscopic skills, the peg transfer test was performed both in the simulator and, as a control, in a conventional FLS All-In-One Trainer System (PN: 50306, Limbs & Things Inc., Savannah, GA, USA). Furthermore, to objectively validate the novel, automatic, AI-based assessment software, the peg transfer test results were evaluated by the software and compared with the standard, manual evaluation process as a control.

The development of a peg transfer VR simulator

The VR simulator used in the course was a proprietary development that ran on a Meta Quest 2 headset (Facebook Technologies, LLC, Menlo Park, CA 94025, USA). The Unity framework (Unity Technologies, 30 3rd Street, San Francisco, CA, USA) was used for surgical simulation software development, similar to previous studies35,36. For the VR simulator, the objects of the traditional FLS trainer (rubber ring, dissector and peg board) were created as three-dimensional (3D) models by our research team, using a maximum of 20,000 polygons per model and a maximum texture resolution of 4K. As recommended by the framework documentation, the interactions of the objects in the VR simulation were implemented using Unity’s physics system. Each object in the simulation had a mesh with a mesh collider and a rigid body used to set the weight and the constraints on external forces. The physics system used the mesh colliders to detect collisions and overlaps of objects and the rigid-body information to determine which object would be repelled. Based on the collision data, the software could register the critical events of the simulation, such as grabs, collisions or drops. To make the user experience as realistic as possible, additional 3D-printed hardware elements, such as grippers for the VR controllers, metal rods representing the laparoscopic instruments and a pelvic trainer box, were added to the system. Figure 1 shows the hardware elements of the simulator and the peg transfer playground. The complete list of the parts used and the blueprint of the simulator can be found in the data repository.
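To illustrate how such critical events can be derived from collision data, the following minimal Python sketch shows an engine-agnostic version of the grab/release logic; all names are hypothetical and do not reproduce the simulator's actual Unity (C#) implementation.

```python
# Engine-agnostic sketch of deriving critical events (grabs, releases)
# from collision/overlap callbacks. All names are hypothetical; the real
# simulator implements this logic inside Unity's physics system.
class EventDetector:
    def __init__(self):
        self.held_by = {}  # ring id -> id of the gripper currently holding it

    def on_overlap(self, gripper_id: int, ring_id: int, gripper_closed: bool):
        """Called when a gripper collider overlaps a rubber-ring collider."""
        if gripper_closed and self.held_by.get(ring_id) != gripper_id:
            self.held_by[ring_id] = gripper_id
            print(f"GRAB: gripper {gripper_id} picked up ring {ring_id}")

    def on_separation(self, gripper_id: int, ring_id: int):
        """Called when the overlap ends or the gripper opens."""
        if self.held_by.get(ring_id) == gripper_id:
            del self.held_by[ring_id]
            print(f"RELEASE/DROP: gripper {gripper_id} let go of ring {ring_id}")

detector = EventDetector()
detector.on_overlap(gripper_id=0, ring_id=3, gripper_closed=True)  # GRAB
detector.on_separation(gripper_id=0, ring_id=3)                    # RELEASE/DROP
```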

Figure 1

VR simulator hardware and the playground for the peg transfer exercise. Panel (A): The pelvic trainer used for the simulation of the FLS peg transfer task. Part a.) represents the box created with a 3D-printing technique. Part b.) indicates the 3D-printed grippers attached to the VR controllers. Part c.) shows the Meta Quest 2 VR goggles, and part d.) denotes the metal rods used to simulate the laparoscopic instruments. Panel (B): Virtual operating theatre used in the simulation. Panel (C): FLS peg transfer task, as seen in the VR goggles (side view). Panel (D): FLS peg transfer task, as seen in the VR goggles (top view). The models and images are the sole property of the University of Pecs.

VR simulation testing protocol

The course protocol had three parts, starting with the initial testing (pre-course test) of FLS-naive students on the conventional trainer. The students were randomised using the RANDBETWEEN function of Microsoft Excel (v16.0.16502.42308, Microsoft Corp., Redmond, WA, USA) into the VR simulator group or the control group (traditional trainer) for the practice training. In the second part of the protocol, the students participated in the practice training, which consisted of the peg transfer exercise performed four times over two weeks on preset dates, either in the VR simulator or in the conventional trainer, according to the randomisation. In the third part of the protocol, the students performed the peg transfer exercise in the traditional FLS trainer (post-course test). For both the pre- and post-course tests, the students had two attempts, and the better result was counted so that unintentional mistakes would not mask technical skill. The test protocol was established in agreement with the University of Pecs Medical School’s graduate surgical training plan.

The testing was performed on 65 third-year medical students as an extension of the obligatory ‘Basics of Surgery’ course. The only exclusion criterion was previous laparoscopic experience of any type. The mentor during the training protocol was a surgeon with more than 10 years of experience in both theoretical and practical surgical education, assisted in the practical sessions by two skilled technicians.

Throughout the training protocol, the practice and testing exercises were identical regardless of the training device (VR or conventional): the peg transfer exercise performed with two Maryland dissectors according to the FLS specifications. Each object had to be lifted from its peg, transferred in mid-air to the other dissector without using the board or pegs for assistance, and then placed on any empty peg on the other side of the pegboard. The procedure then had to be repeated in the opposite direction. Five students were excluded because they either did not attend the post-course test or only partly participated in the practice sessions.

Following the practice period, the VR group participants were asked to complete a five-point Likert scale questionnaire about their subjective impressions of the VR simulator and what they thought of it compared with the conventional FLS trainer. The questionnaire was created using Google Forms (Google LLC, Amphitheatre Parkway, Mountain View, CA, USA), and it was administered anonymously.

The pre- and post-course tests were recorded, and the videos were evaluated by six professionals, each with more than three years of experience in surgical education. Each video was evaluated by two professionals. If there was any disagreement between the two evaluators, a third professional was brought in, and all three evaluators decided on the questionable event or pitfall together. The evaluation of the pre- and post-course tests was based on the assessment criteria of the FLS peg transfer test (details in Fig. 2).

Figure 2

The protocol followed in the study. n: number of participants; i: number of times each participant completed the task during practice.

Development of AI-based automatic assessment software

To establish an automatic assessment system, AI-based software was developed using the same evaluation criteria as the standard assessment method (Fig. 2). During the automatic evaluation, the algorithm works frame by frame through the exercise video, using the trained AI models to detect the relevant objects and determine their positions in the image. Based on these data, the software analyses the execution of the exercise and determines the result. The evaluation is automated, objective and reproducible.

During the peg transfer exercise, the objects that had to be detected in the video were the pegs, the Maryland dissectors and the rubber rings, which had to be transferred, while the event to be detected was the grab (when one Maryland dissector grabs a rubber ring). Henceforth, they are referred to as the objects of the study.

A YOLOv8 AI model was trained using supervised learning in the PyTorch framework37,38, with a dataset of 22,675 images for training and validation (12.5% validation split) and a further 24,628 images for testing. The optimal number of epochs was determined from the learning curves, starting from the 300 epochs proposed in the YOLOv8 documentation; for our custom training dataset, the optimum was 100 epochs. To generate the training dataset, videos of the training sessions during the course were used. Exercises were randomly selected, and the recordings were split into frames, which were annotated using the CVAT software39. On the online CVAT interface, the annotators manually labelled the images with bounding boxes around the visible objects, and the datasets were exported from the interface in the appropriate format. The following training datasets were used: 14,223 annotated images for grabs, 1790 annotated images for dissectors and rubber rings, and 4142 annotated images for pegs. The validation dataset consisted of 1778 annotated images for grabs, 224 annotated images for dissectors and rubber rings, and 518 annotated images for pegs. To evaluate the models, the following test datasets were used: 6532 annotated images for grabs, 11,195 annotated images for dissectors and rubber rings, and 6901 annotated images for pegs.
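For illustration, training and evaluating such a model with the Ultralytics YOLOv8 API can be condensed as follows; the checkpoint, dataset file and image size are assumptions for the sketch, not the study's exact configuration.

```python
# Sketch of YOLOv8 training/evaluation with the Ultralytics API.
# "yolov8n.pt", "peg_transfer.yaml" and imgsz are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # pretrained checkpoint as a starting point
model.train(
    data="peg_transfer.yaml",     # lists train/val/test image paths and classes
    epochs=100,                   # optimum found from the learning curves
    imgsz=640,                    # assumed input resolution
)
metrics = model.val(data="peg_transfer.yaml", split="test")  # test-set evaluation
print(metrics.box.map50)          # mean average precision at IoU 0.50
```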

The following algorithm was developed for the automatic evaluation of the peg transfer tests (Fig. 3); a condensed sketch of the pipeline is given after the list:

1. The videos were read and split into frames using the OpenCV library.

2. Inference was performed using the trained YOLOv8 model on each frame to determine the type and location of the objects. Object detection is a key element of exercise evaluation, although it does not directly measure task accuracy; to provide a comprehensive evaluation, a custom algorithm was developed that determined the presence of errors and the overall exercise duration.

3. False-positive and false-negative detections were filtered out using a sliding average, making the individual states (e.g. grab, drop, rubber–peg contact) significantly more stable.

4. Based on the model results, the exercise time was determined and divided into sessions. A session lasts from the moment a rubber ring is picked up until it is released.

5. The sessions were validated with respect to pitfalls, i.e. errors such as a missed handover, an invalid handover and an invalid pickup or release. A missed handover was identified when the rubber ring was held by different instruments at the beginning and the end of a session. By examining faulty pickups (a ring picked up by the opposite instrument after being dropped), an invalid handover could be ruled out; that is, it could be confirmed that the handover occurred in mid-air.

6. The results were determined from the detected errors and the measured exercise duration.

7. The results were saved to a CSV file for easy comparison.
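The sketch below condenses steps 1–4, 6 and 7 of this pipeline (the pitfall validation of step 5 is omitted for brevity); the weights path, video path, smoothing window and threshold are illustrative assumptions, not the software's exact parameters.

```python
# Condensed sketch of the automated evaluation pipeline. The paths,
# 15-frame window and 0.5 threshold are illustrative assumptions.
from collections import deque
import csv

import cv2                    # step 1: video reading
from ultralytics import YOLO  # step 2: YOLOv8 inference

model = YOLO("peg_transfer_best.pt")    # assumed trained weights
cap = cv2.VideoCapture("exercise.mp4")  # assumed exercise recording
fps = cap.get(cv2.CAP_PROP_FPS)

window = deque(maxlen=15)  # step 3: sliding window over per-frame states
sessions, current, frame_idx = [], None, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]
    labels = {result.names[int(c)] for c in result.boxes.cls}

    # Step 3: a sliding average over recent frames stabilises the "grab"
    # state against single-frame false positives/negatives.
    window.append(1.0 if "grab" in labels else 0.0)
    grabbed = sum(window) / len(window) > 0.5

    # Step 4: a session lasts from a stable pickup until the release.
    if grabbed and current is None:
        current = {"start": frame_idx}
    elif not grabbed and current is not None:
        current["end"] = frame_idx
        sessions.append(current)
        current = None
    frame_idx += 1

cap.release()

# Steps 6-7: derive durations and save the results for easy comparison.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["session", "start_s", "end_s", "duration_s"])
    for i, s in enumerate(sessions, 1):
        writer.writerow([i, s["start"] / fps, s["end"] / fps,
                         (s["end"] - s["start"]) / fps])
```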

Figure 3

Representation of the developed AI-based evaluation algorithm. Panel (A): Input image. Panel (B): Visualisation of the exercise validation (current session and detected objects in the image). The models and images are the property of the University of Pecs.

The algorithm was run on all pre- and post-course videos, and its results were compared with the conventional ratings given by the professionals to determine interrater reliability (professionals’ consensus vs. AI-based algorithm).

For both the standard and the AI-based algorithm assessments, the time needed for the evaluation was recorded separately for every exercise. The automatic evaluations were performed on a PC workstation with an Intel® Core i9-9900KF processor (Intel Corporation, Santa Clara, CA, USA; 3.60 GHz) and an NVIDIA GeForce RTX 2080 Ti 11 GB graphics card (NVIDIA Corporation, Santa Clara, CA, USA). The graphics processing unit was used for inference.

Validation of the effectiveness of the developed VR simulator and AI-based automatic assessment software

To validate the effectiveness of the developed VR simulator, the pre- and post-training test results were compared between the VR and control groups. The developed algorithm’s reliability was assessed in several steps. To verify the accuracy of object detection, the models were evaluated on the test datasets, and the mean average precision metric was determined. The results of the automated evaluation were then compared with the experts’ manual assessments to determine accuracy. In this way, it could be verified whether the automated assessment evaluated an exercise correctly or produced a false-positive or false-negative result: a false-positive result occurs when the algorithm scores as passed an exercise that the expert evaluators scored as failed, and a false-negative result occurs when the algorithm scores as failed an exercise that the expert evaluators scored as passed.
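For illustration, this comparison against the expert consensus reduces to counting the two disagreement types; the label lists in the following sketch are placeholders, not study data.

```python
# Minimal sketch of the pass/fail comparison between the automated and
# the expert evaluations; the label lists are illustrative placeholders.
expert = ["pass", "fail", "pass", "pass", "fail"]  # expert consensus
auto   = ["pass", "fail", "fail", "pass", "pass"]  # AI-based algorithm

false_pos = sum(a == "pass" and e == "fail" for a, e in zip(auto, expert))
false_neg = sum(a == "fail" and e == "pass" for a, e in zip(auto, expert))
accuracy  = sum(a == e for a, e in zip(auto, expert)) / len(expert)
print(f"FP={false_pos}, FN={false_neg}, accuracy={accuracy:.2f}")
```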

Statistical analysis

The data for the pre- and post-course tests were collected from the experts and the algorithm in the same structure. For the analysis of the between-group test results, the chi-squared test was used. The Shapiro–Wilk test was performed to test normality and showed that the examined variables did not follow a normal distribution. Thus, the Mann–Whitney U test was used to compare independent groups (exercise durations in the VR and control groups), and the Wilcoxon signed-rank test was applied for paired comparisons of completion times within the study groups. Cohen’s kappa was used to determine inter-rater reliability.
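The named tests can be reproduced with SciPy and scikit-learn equivalents of the Jamovi analyses, as in the sketch below; all arrays are illustrative placeholders, not study data.

```python
# Sketch of the statistical tests named above, using SciPy/scikit-learn
# equivalents of the Jamovi analyses; all arrays are placeholders.
from scipy.stats import shapiro, mannwhitneyu, wilcoxon
from sklearn.metrics import cohen_kappa_score

vr_times      = [112.0, 98.5, 130.2, 105.0]   # post-course durations, VR group
control_times = [120.3, 140.1, 118.9, 133.4]  # post-course durations, control

print(shapiro(vr_times))                      # normality check
print(mannwhitneyu(vr_times, control_times))  # independent-groups comparison

pre  = [180.0, 150.2, 200.5, 175.0]           # paired pre-course times
post = [112.0, 98.5, 130.2, 105.0]            # paired post-course times
print(wilcoxon(pre, post))                    # paired within-group comparison

expert_pass = [1, 0, 1, 1]                    # expert consensus (1 = pass)
ai_pass     = [1, 0, 0, 1]                    # algorithm output
print(cohen_kappa_score(expert_pass, ai_pass))  # inter-rater reliability
```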

Jamovi (version 2; Sydney, Australia) was used for the statistical analysis40,41. The significance level was set at p < 0.05. For chart building, Origin (OriginLab Corporation, Northampton, MA, USA) was used.

Ethical approval

Based on the current regulations, the University of Pecs Medical School Regional Ethics Committee certified the research to be conducted without ethical authorisation (9749-PTE 2023). Informed consent was obtained from all subjects. All methods were carried out in accordance with relevant regulations. All of the data were collected anonymously, and the study was conducted in accordance with the Helsinki Declaration.
