The creation of image datasets for training deep neural networks mainly consists of data acquisition, data selection, and data labeling. Data acquisition is often limited, and data sharing is hampered by privacy regulations, especially in the medical imaging domain. Another major obstacle is data labeling, which is costly and time-intensive and often requires medical professionals. Synthetic data may offer numerous benefits, including the ability to augment datasets with diverse and realistic images where real data is limited [1,2], thereby reducing the costs and labor associated with annotating real images. Synthetic data also provides an ethical alternative to the use of sensitive patient data, neither compromising patient privacy nor requiring ad hoc ethical committee approval for each specific project.
Our project aims to design, implement, and test artificial intelligence tools for the large-scale generation of realistic synthetic data, with a threefold objective:
1. Enriching existing datasets with the final goal of enhancing the performance of machine learning models in the field of medical imaging;
2. Providing a cost-effective alternative to the labor-intensive task of collecting and annotating real medical data by generating (image, label) pairs based on user-defined classes;
3. Generating synthetic datasets that mimic the characteristics of real-world medical data while fully preserving patient privacy.
In our project, we focus on three data modalities: 3D medical images obtained from Cone Beam Computed Tomography (CBCT), high-resolution pyramidal images obtained with microscopy (confocal images and whole-slide images, WSI), and mammographic (X-ray) images.
In previous scientific collaborations, our groups have developed (i) deep learning algorithms to enhance 2D annotations of the Inferior Alveolar Canal (an osseous canal crossing the mandible) in CBCT scans, making them suitable for training 3D segmentation models [3], and (ii) generative models for the creation of synthetic pairs of dermoscopic images and segmentation masks [1].
In this proposal, the application of generative algorithms will be pushed further by designing algorithms able to generate entire sets of CBCT scans paired with ground-truth labels. In addition to 3D volumes, the algorithms will be tested in high-resolution image scenarios, specifically targeting WSI histological images and confocal data, as well as on X-ray data.
Specifically, we intend to collect a significant amount of real data in the context of maxillofacial surgery, prostate cancer, and breast cancer, and to develop machine learning techniques for the generation of synthetic datasets. The quality of the generated data will be both qualitatively evaluated by clinical experts in the field and quantitatively assessed by measuring the performance of state-of-the-art automatic classification and segmentation algorithms trained on such generated data, with the final goal of employing such models in daily clinical practice.
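To make the quantitative assessment concrete, the sketch below illustrates one common train-on-synthetic, test-on-real protocol for segmentation: a model trained on generated (image, label) pairs is scored on held-out real data with the Dice coefficient, a standard overlap metric. This is a minimal sketch, not the project's actual pipeline; the names `evaluate_on_real`, the `predict` method, and the `ThresholdModel` stand-in are hypothetical placeholders for a trained segmentation network and its data loaders.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def evaluate_on_real(model, real_pairs) -> float:
    """Mean Dice of a model (trained on synthetic data) over real (image, mask) pairs.

    `model` is assumed to expose a `predict(image) -> binary mask` method and
    `real_pairs` to be an iterable of (image, ground_truth_mask) arrays; both
    are illustrative conventions, not part of any specific library.
    """
    scores = [dice_coefficient(model.predict(img), gt) for img, gt in real_pairs]
    return float(np.mean(scores))

if __name__ == "__main__":
    class ThresholdModel:  # stand-in for a segmentation network trained on synthetic data
        def predict(self, image: np.ndarray) -> np.ndarray:
            return image > 0.5

    # Dummy "real" evaluation set: random images with random binary masks.
    rng = np.random.default_rng(0)
    pairs = [(rng.random((64, 64)), rng.random((64, 64)) > 0.5) for _ in range(4)]
    print(f"mean Dice on held-out real data: {evaluate_on_real(ThresholdModel(), pairs):.3f}")
```

In such a protocol, the gap between this score and that of the same architecture trained on real data gives one measure of how well the synthetic datasets substitute for real annotations.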