Automatic detection of microaneurysms in optical coherence tomography images of the retina using convolutional neural networks and transfer learning

The overall block diagram of the proposed method is shown in Figure 1 and the details are explained in the following subsections.

Figure 1

General functional diagram of the proposed method (SLO = Scanning Laser Ophthalmoscopy).

Image registration and preparation of OCT strips

For classification purposes, we need a dataset of OCT image strips with appropriate labels: MA, normal, abnormal, and vessel. To our knowledge, such labeled datasets are not yet available for OCT images, so we had to create one. MAs and vessels are difficult objects to detect in OCT images, so to prepare the dataset of labeled OCT strips, precise registration is performed to align the OCT images with the FA photographs. To do this, the dataset and the method proposed in32 are used for accurate registration of OCT and FA images, with SLO photographs serving as intermediate images. In32, the dataset includes 36 pairs of FA and SLO images from 21 subjects with diabetic retinopathy, where the SLO image pixels are perfectly matched to the OCT B-scans. The FA, OCT, and SLO images were captured with the Heidelberg Spectralis HRA2/OCT device. The FA and SLO images share the same size of 768 × 768 pixels, and the FA images were captured with two different fields of view (30° and 55°). In this method, after preprocessing, retinal vessel segmentation is applied to extract the blood vessels from the FA and SLO images. Next, a global registration based on a Gaussian model of the curved retinal surface is performed: a feature-based global rigid transformation is first applied to the FA vessel map to align it with the SLO vessel map, and the transformed image is then registered again globally under the Gaussian surface model to improve the accuracy of the previous step. Finally, a non-rigid local transformation is performed to register the two images precisely.
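As a rough illustration of the feature-based rigid step only (the vessel segmentation, Gaussian-surface refinement, and non-rigid local registration of32 are omitted), the sketch below matches ORB keypoints between two precomputed vessel maps and estimates a rigid transform with OpenCV; the file names and ORB settings are placeholders, not details taken from32:

```python
import cv2
import numpy as np

# Hypothetical inputs: binary vessel maps already extracted from the FA and SLO images.
fa = cv2.imread("fa_vessels.png", cv2.IMREAD_GRAYSCALE)
slo = cv2.imread("slo_vessels.png", cv2.IMREAD_GRAYSCALE)

# Detect and match ORB keypoints on the two vessel maps.
orb = cv2.ORB_create(1000)
kp1, des1 = orb.detectAndCompute(fa, None)
kp2, des2 = orb.detectAndCompute(slo, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)[:100]

src = np.float32([kp1[m.queryIdx].pt for m in matches])
dst = np.float32([kp2[m.trainIdx].pt for m in matches])

# Rigid (rotation + translation + uniform scale) transform, with RANSAC outlier rejection.
M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
fa_registered = cv2.warpAffine(fa, M, (slo.shape[1], slo.shape[0]))
```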

After that, as shown in Fig. 2, OCT strips are created from the OCT B-scans using the associated FA images, with four labels: MA, normal, abnormal, and vessel. The FA images are used only once, to create the OCT strips for the training process; in the test process, FA images are no longer needed. In our dataset, the MA, normal, abnormal, and vessel classes include 87, 100, 72, and 131 OCT strips, respectively. In this study, the scale factor of the OCT images in the x direction is 0.0115, i.e., each pixel in the OCT image spans 0.0115 mm in x. On the other hand, as indicated in33, an MA has a maximum outer diameter of 266 µm. Applied to our dataset, the maximum outer diameter of an MA corresponds to about 23.1 pixels, so the width of each OCT strip is set to 31 pixels, slightly larger than 23 pixels. Images are cropped to contain only the retinal layers, while other pixels are removed. For this purpose, the segmentation method presented in34 is first used to detect the retinal nerve fiber layer (RNFL) and the retinal pigment epithelium (RPE) layer, after which the OCT B-scan is cropped to span from the highest point of the RNFL to the lowest point of the RPE. This process is illustrated in Supplementary Fig. S1 online, and it is why the strips have different heights. The process of collecting the dataset of OCT strips and testing the B-scans is depicted in Supplementary Figs. S2 and S3 online. Note that when used as input to the CNNs, each strip is rescaled to 150 × 150 × 3. The dataset is publicly available at https://misp.mui.ac.ir/en/four-classes-dataset-oct-image-strips-png-format-%E2%80%8E-1.
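The strip-width arithmetic and strip extraction can be sketched as follows; only the 0.0115 mm/pixel scale, the 266 µm diameter, the 31-pixel width, and the 150 × 150 input size come from the text, while the non-overlapping stride and the resizing call are assumptions for illustration:

```python
import numpy as np
import cv2

X_SCALE_MM = 0.0115          # mm per pixel in the x direction of the OCT B-scan
MA_MAX_DIAMETER_MM = 0.266   # maximum outer MA diameter reported in ref. 33

# 0.266 / 0.0115 ≈ 23.1 pixels, so a 31-pixel strip comfortably covers one MA.
ma_pixels = MA_MAX_DIAMETER_MM / X_SCALE_MM
STRIP_WIDTH = 31

def extract_strips(bscan, width=STRIP_WIDTH, cnn_size=(150, 150)):
    """Cut a cropped B-scan into vertical strips and resize each for CNN input."""
    strips = []
    # Assumed non-overlapping stride; the paper does not state the stride here.
    for x0 in range(0, bscan.shape[1] - width + 1, width):
        strip = bscan[:, x0:x0 + width]
        strips.append(cv2.resize(strip, cnn_size))  # strips have varying heights
    return np.stack(strips)
```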

Figure 2

The process of creating OCT strips for the MA, normal, abnormal, and vessel classes using the matching FA. (a) The red circle indicates an MA in the FA. (b) B-scan matching the green line in (a). (c) Cropped ROI of (b). (d–f) Creation of a strip for the normal class. (g–i) Creation of a strip for the abnormal class. (j–l) Creation of a strip for the vessel class.

Organizing training, validation, and test data

The dataset is organized into training, validation, and test folders, each of which contains MA, normal, abnormal, and vessel image subfolders. Twenty percent of the dataset is allocated to the test set and is not used in the training process; this is the hold-out method. To validate each CNN, the Keras tuner's Bayesian optimization tuner class35 is used to search the hyperparameter space, which includes the learning rate, the momentum, and the number of units in the first dense layer. The numbers of trials and epochs in the validation process are set to 10 and 80, respectively. The hyperparameters tuned for each CNN are listed in Supplementary Table S1 online. Fifteen percent of the dataset is allocated to the validation set. Since our dataset is small compared to common deep learning datasets, which can lead to overfitting, data augmentation is applied: at each training epoch, the image data generator applies transformations including rotation, zooming, horizontal flipping, rescaling, and shifting to the training images, as sketched below. This helps the model avoid memorizing the images and therefore avoid overfitting.
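The described augmentation maps naturally onto Keras' ImageDataGenerator; in the sketch below, the directory name and the exact transformation ranges are illustrative placeholders, since they are not reported in the text:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transformations are re-drawn every epoch, so the network never sees
# exactly the same strip twice; rescaling maps pixel values to [0, 1].
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,        # illustrative range
    zoom_range=0.15,          # illustrative range
    width_shift_range=0.1,    # illustrative range
    height_shift_range=0.1,   # illustrative range
    horizontal_flip=True,
)

# Assumed directory layout: train/{MA,normal,abnormal,vessel}, and likewise for
# validation and test (whose generators would use rescaling only).
train_flow = train_gen.flow_from_directory(
    "train", target_size=(150, 150), class_mode="categorical", batch_size=32)
```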

Stacked generalization ensemble

The overall structure of the classifier presented in this research is shown in Fig. 3. As can be seen, a stacked generalization ensemble (stack) of four CNNs pretrained on the ImageNet dataset is used. The stacking ensemble has two levels, namely level 0 and level 1. The elements and the formation process of each level are elaborated in the following subsections.

Figure 3

Overall structure of the stacked generalization ensemble.

Stacking ensemble level 0

The CNNs used in this stacking ensemble are VGG16, VGG1936, Xception37, and InceptionV338. These CNNs, the so-called base learners, form level 0 of the stacking ensemble. The basic architecture of these base learners is shown in Fig. 4. Here the image size for the CNN input is 150 × 150 × 3 and average pooling is used. To adapt these networks, the last layer is removed and then flatten, batch normalization, dense, dropout, and dense layers are appended one after another. The added dropout layer has a rate of 0.35, and the last added dense layer has 4 units with the softmax activation function to handle our 4-class classification problem. The ReLU activation function is used for the first added dense layer, and the number of its units is determined with the Keras tuner during validation. Then, with the previous layers frozen, only the added layers are trained on the training and validation data for 5 epochs, using the Adam optimizer with a learning rate of 0.0001. This gives the added layers their initial weights; a minimal sketch for one base learner follows.
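A minimal sketch of one base learner (VGG16 shown) with the modified head, assuming the train_flow/val_flow generators from the earlier sketch; the 256 units in the first dense layer are a placeholder for the tuner-selected value:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False,
    pooling="avg", input_shape=(150, 150, 3))
base.trainable = False  # freeze the pretrained convolutional layers

# Flatten is a no-op after global average pooling but mirrors the listed head.
x = layers.Flatten()(base.output)
x = layers.BatchNormalization()(x)
x = layers.Dense(256, activation="relu")(x)   # units chosen by the Keras tuner
x = layers.Dropout(0.35)(x)
outputs = layers.Dense(4, activation="softmax")(x)  # MA/normal/abnormal/vessel

model = models.Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, validation_data=val_flow, epochs=5)  # warm up the new head
```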

Figure 4

Basic structure of the CNNs used in the stacking ensemble39.

After validating each CNN, each network is trained for 100 epochs using the full training and validation data. In this second round of training, some of the later layers are trainable while the earlier layers remain frozen, so their weights are not adjusted.

Stochastic gradient descent with momentum (SGDM) is used as the optimizer. The learning rate, momentum value, and number of trainable layers for each CNN are listed in Supplementary Table S1 online. In addition, categorical cross-entropy is used as the loss function to be minimized in both training rounds.
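This second round can be sketched as follows, continuing from the model above; the number of trainable layers and the SGDM settings are placeholders for the per-CNN values in Supplementary Table S1:

```python
import tensorflow as tf

N_TRAINABLE = 4  # placeholder; the per-CNN value is listed in Supplementary Table S1

# Unfreeze only the last N layers; earlier layers keep their pretrained weights.
for layer in model.layers[:-N_TRAINABLE]:
    layer.trainable = False
for layer in model.layers[-N_TRAINABLE:]:
    layer.trainable = True

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),  # placeholders
    loss="categorical_crossentropy",
    metrics=["accuracy"])
```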

During training, two callbacks are used for early stopping and for saving the best model. If the model's performance (minimization of the loss function) does not improve for a given number of epochs (the patience parameter), training is stopped before the maximum number of epochs is reached; in this work the patience parameter is set to 25 and the loss function is the monitored quantity. Also, since the model obtained when training ends, whether by completing all epochs or by early stopping, is not necessarily the best model, the model with the highest accuracy is saved during training and reloaded afterwards.
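In Keras these two callbacks correspond to EarlyStopping and ModelCheckpoint; in this sketch, the checkpoint file name and the choice of val_accuracy as the monitored metric are assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop if the validation loss has not improved for 25 consecutive epochs.
    EarlyStopping(monitor="val_loss", patience=25),
    # Keep the weights of the most accurate model seen so far, not the last one.
    ModelCheckpoint("best_model.keras", monitor="val_accuracy",
                    save_best_only=True),
]

history = model.fit(train_flow, validation_data=val_flow,
                    epochs=100, callbacks=callbacks)
```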

Stacking ensemble level 1

After all base learners have been trained on the training dataset, a meta-learner is introduced as level 1 of the stacking ensemble and trained to achieve higher accuracy by combining the results of the trained base learners. In this study, an MLP classifier is used as the meta-learner. It is trained on the combined training and validation data, taking as input the outputs (predictions) of the base learners trained in the previous step; for this purpose, the base-learner predictions are stacked and reshaped into input tensors for the MLP model. In effect, the base learners of the previous step are trained directly on the training dataset, while the MLP model is trained on it indirectly. To apply the resulting model to new test images, the test B-scans are split into strips and the resulting strips are fed into the stacking ensemble to be classified into one of the mentioned classes. The MLP classifier has a hidden layer size of 100 and its maximum-iteration parameter is set to 300.
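A minimal sketch of the level-1 training, assuming base_models holds the four fine-tuned CNNs and X_train/y_train (and X_test) hold the strip images and integer class labels:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_predictions(models, images):
    """Concatenate the softmax outputs of all base learners into one feature vector."""
    preds = [m.predict(images, verbose=0) for m in models]  # each: (n, 4)
    return np.hstack(preds)                                 # shape: (n, 4 * len(models))

meta_X = stack_predictions(base_models, X_train)

# One hidden layer of size 100; training runs for at most 300 iterations.
meta_learner = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)
meta_learner.fit(meta_X, y_train)

# At test time, each B-scan is split into strips and classified the same way.
test_pred = meta_learner.predict(stack_predictions(base_models, X_test))
```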