Semantic Segmentation from Remote Sensor Data and the Exploitation of Latent Learning for Classification of Auxiliary Tasks

Bodhiswatta Chatterjee
Immersive and Creative Technologies Lab, Department of Computer Science and Software Engineering
Gina Cody School of Engineering and Computer Science
Concordia University, Montreal, Quebec, Canada
Charalambos Poullis
Immersive and Creative Technologies Lab, Department of Computer Science and Software Engineering
Gina Cody School of Engineering and Computer Science
Concordia University, Montreal, Quebec, Canada
Summary of contributions:
  1. ICT-Net: Top-performing network architecture on two benchmark datasets (INRIA, AIRS)
  2. Relation between classification and reconstruction accuracy for 3D reconstruction
  3. A technique for externalizing the latent knowledge in a network trained for a binary classification (i.e. building/non-building) to perform sub-classification of the negative label (i.e. non-building → roads, cars, trees, low vegetation, clutter, etc.)
  4. Average F1-scores of ICT-Net (which was trained on binary classification of buildings using only the INRIA benchmark dataset) on the entire ISPRS benchmark dataset (which contains 5 additional classes for which no fine-tuning or further training was performed): building 81.84%, road 62.21%, low_vegetation 42.30%, tree 25.88%, clutter 14.97%, car 9.12%. Download the PDF with a detailed breakdown of scores for all 212 images.

Results on the randomly selected images (first column) from the ISPRS benchmark dataset (Potsdam). The second column shows the multi-label ground truth. The third column shows the building predictions from the pretrained ICT-Net. Note that the ISPRS dataset was not used during its training/validation. The fourth column is the sub-classification of the negative label (i.e. non-buildings) using the proposed technique. The fifth column shows the result of overlaying the building predictions (third column) on the sub-classification (fourth column). The sixth column shows the F1 scores for Building, Road, Car, Tree, Low Vegetation, and Clutter. These results are generated from a network trained on building/non-building classification. No additional training/fine-tuning was performed. Using the learned weights for the binary classification, we sub-classify the negative label by clustering the activation values at the penultimate layer.





In this paper we address three different aspects of semantic segmentation from remote sensor data using deep neural networks. Firstly, we focus on the semantic segmentation of buildings from remote sensor data and propose ICT-Net: a novel network with the underlying architecture of a fully convolutional network, infused with feature-recalibrated Dense blocks at each layer. Uniquely, the proposed network combines the localization accuracy and use of context of the U-Net network architecture, the compact internal representations and reduced feature redundancy of the Dense blocks, and the dynamic channel-wise feature re-weighting of the Squeeze-and-Excitation (SE) blocks. The proposed network has been tested on the INRIA and AIRS benchmark datasets and is shown to outperform all other state-of-the-art methods by more than 1.5% and 1.8% on the Jaccard index, respectively.

Secondly, as the building classification is typically the first step of the reconstruction process, we investigate the relationship of the classification accuracy to the reconstruction accuracy. A comparative quantitative analysis of reconstruction accuracies corresponding to different classification accuracies confirms the strong correlation between the two. We present results which show a consistent and considerable reduction in the reconstruction accuracy relative to the classification accuracy.

Finally, we present the simple yet compelling concept of latent learning and the implications it carries within the context of deep learning. We posit that a network trained on a primary task (i.e. building classification) unintentionally learns about auxiliary tasks (e.g. the classification of roads, trees, etc.) which are complementary to the primary task. Although embedded in a trained network, this latent knowledge relating to the auxiliary tasks is never externalized or immediately expressed; instead, only knowledge relating to the primary task is ever output by the network. We experimentally demonstrate this occurrence of incidental learning on the pre-trained ICT-Net and show how sub-classification of the negative label is possible without further training/fine-tuning. We present the results of our experiments and explain how knowledge about auxiliary and complementary tasks - for which the network was never trained - can be retrieved and utilized for further classification. We extensively tested the proposed technique on the ISPRS benchmark dataset which contains multi-label ground truth, and report an average classification accuracy (F1 score) of 54.29% (SD=17.03) for roads, 10.15% (SD=2.54) for cars, 24.11% (SD=5.25) for trees, 42.74% (SD=6.62) for low vegetation, and 18.30% (SD=16.08) for clutter.


latent learning; sub-classification of negative label; interpretability of semantic segmentation networks; building classification; building reconstruction; classification accuracy; reconstruction accuracy; relationship between classification and reconstruction accuracy

System Overview

Figure 1 gives an overview of our first two contributions. Firstly, an orthorectified RGB image is fed forward into the neural network to produce a binary (building/non-building) classification map. Next, the binary classification map becomes the input to the reconstruction process. Because it is extremely difficult to acquire building blueprints or CAD models for such large areas, and because depth/3D information is not available for the images of the benchmark, we posit that the building boundaries, once extracted from the binary classification map and refined, can serve as a proxy for the accuracy of the reconstruction. This is justified since the extracted boundaries are extruded in order to create the 3D models for the buildings. Therefore, the building boundaries are extracted, refined, and used for the comparative analysis and evaluation of the accuracy of the reconstruction.

The diagram gives an overview of our first two contributions. Firstly, we focus on the building classification and propose a novel network architecture which outperforms state-of-the-art on benchmark datasets and is currently top-ranking. Secondly, we investigate the relation between the classification accuracy and the reconstruction accuracy and conduct a comparative quantitative analysis which shows a strong correlation but also a consistent and considerable decrease of the reconstruction accuracy when compared to the classification accuracy.

Network Overview

The proposed network architecture is distinct and combines the strengths of the U-Net architecture, Dense blocks, and Squeeze-and-Excitation (SE) blocks. This results in improved prediction accuracy: on the INRIA benchmark dataset the network has been shown to outperform other state-of-the-art architectures, such as those proposed in [9], which have a much higher number of learnable parameters. Figure 2 shows a diagram of the proposed feature-recalibrated Dense block with 4 convolutional layers and a growth rate k=16 used by the ICT-Net. The proposed network has 11 feature-recalibrated dense blocks with [4,5,7,10,12,15,12,10,7,5,4] convolutional layers, respectively. Perhaps the closest architecture to ours is the one proposed in the paper "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation", which uses 103 convolutional layers. Introducing SE blocks at the output of every layer would cause a vast increase in the number of parameters, which would hinder training. In contrast, in our work we have chosen to include an SE block only at the end of every Dense block in order to re-calibrate the accumulated feature maps of all preceding layers. Thus, the feature maps learned at each layer are weighted by the SE block according to their importance, as determined by the loss function.

Proposed feature recalibrated Dense block with 4 convolutional layers and a growth rate k = 16 used by the ICT-Net. c stands for concatenation.
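The recalibrated Dense block in the figure can be sketched as follows. This is a minimal PyTorch illustration, not the released implementation: the 4 convolutional layers, growth rate k = 16, and the single SE block at the end follow the description above, while details such as the BN-ReLU-Conv ordering and the SE reduction ratio of 16 are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise re-weighting of feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling
        return x * w.view(b, c, 1, 1)     # excite: per-channel re-scaling

class RecalibratedDenseBlock(nn.Module):
    """Dense block whose concatenated output is re-calibrated by one SE block
    (rather than one SE block per layer, which would inflate the parameter count)."""
    def __init__(self, in_channels, n_layers=4, growth=16):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1)))
            ch += growth
        self.se = SEBlock(ch)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # each layer sees the concatenation of all preceding feature maps
            feats.append(layer(torch.cat(feats, dim=1)))
        # re-calibrate the accumulated feature maps of all preceding layers
        return self.se(torch.cat(feats, dim=1))
```

With 48 input channels the block emits 48 + 4 x 16 = 112 channels, all weighted by a single SE block.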

Latent Learning

Intuitively, we posit that in order for a network to learn task τ with labels υ, the network also has to learn auxiliary tasks with the associated complementary labels ¬υ. Thus, one can exploit this latent learning for further sub-classification of the negative label. In order to prove the above hypothesis, ground truth for the sub-classes of the negative label is required. In our experiments we use four orthophoto images from the ISPRS benchmark dataset. For each image, a labeled image is also provided as ground truth showing the manually assigned per-pixel classification into six classes: (a) buildings (blue), (b) roads (white), (c) trees (green), (d) clutter (red), (e) low vegetation/natural ground (cyan), (f) cars (yellow). The 'clutter' class contains areas for which a class could not be assigned, e.g. water, vertical walls, etc. The 'low vegetation/natural ground' class contains areas on the ground covered by vegetation other than trees, such as low bushes, grass, etc. Using the four pairs of images, we first estimate a probabilistic model for each sub-class as shown in Figure 3. Next, using these models as probability priors, we make sub-classification predictions at each point by estimating the maximum likelihoods as shown in Figure 4.

Probabilistic model estimation. We infer the activation values for each feature map at the penultimate layer for four images of size 640x640. The multi-label ground truth corresponding to the four input images is used to aggregate the activation values based on their label. Finally, for each label and for each feature map of the penultimate layer, a probabilistic model N(μ, σ) is estimated. Analysis of the histograms of the activation values shows that they resemble Gaussian distributions.
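The estimation step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: `activations` stands in for the penultimate-layer feature maps (K maps of size HxW) and `labels` for the multi-label ground truth; both names are hypothetical.

```python
import numpy as np

def estimate_models(activations, labels, n_classes):
    """Fit a Gaussian N(mu, sigma) per (class, feature map) pair from the
    activation values of the pixels carrying each ground-truth label.
    activations: (K, H, W) penultimate-layer feature maps.
    labels:      (H, W) integer class ids from the multi-label ground truth."""
    K = activations.shape[0]
    mu = np.zeros((n_classes, K))
    sigma = np.zeros((n_classes, K))
    for c in range(n_classes):
        mask = labels == c              # pixels labeled with class c
        vals = activations[:, mask]     # shape (K, number of class-c pixels)
        mu[c] = vals.mean(axis=1)
        sigma[c] = vals.std(axis=1) + 1e-8  # guard against zero variance
    return mu, sigma
```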

Sub-classification of the negative label. The pre-trained ICT-Net is used to infer the activation values at each feature map of the penultimate layer. A classification image Ck is generated for each feature map k using maximum likelihood estimation. Finally, the classification aggregator combines all the K classification images to produce the final classification labels.
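The per-feature-map maximum likelihood step can be sketched as follows: for every feature map k, each pixel is assigned the label whose fitted Gaussian gives the highest density at the observed activation, yielding the classification image Ck together with the winning density. Function and variable names are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify_per_map(activations, mu, sigma):
    """Maximum likelihood label per pixel, per feature map.
    activations: (K, H, W); mu, sigma: (C, K) per-class, per-map models.
    Returns per-map labels (K, H, W) and the winning densities (K, H, W)."""
    # likelihood[c, k, h, w]: density of pixel (h, w) in map k under class c
    like = gaussian_pdf(activations[None, :, :, :],
                        mu[:, :, None, None], sigma[:, :, None, None])
    return like.argmax(axis=0), like.max(axis=0)
```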

Classification aggregator. There is one classification image per feature map. Each classification image contains a (label, probability) pair for each pixel. The pairs are converted to one-hot vectors (e.g. vec[label] = probability) and aggregated together. The label with the highest probability is assigned to each pixel.
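The aggregation described above can be sketched as a weighted vote over the K classification images; this NumPy fragment is a minimal illustration under the stated one-hot scheme, with hypothetical names.

```python
import numpy as np

def aggregate(labels, probs, n_classes):
    """Combine per-map (label, probability) pairs: each pair becomes a one-hot
    vote vector (vec[label] = probability), the vectors are summed over the
    K feature maps, and the class with the highest total wins at each pixel.
    labels, probs: (K, H, W). Returns the final (H, W) label image."""
    votes = np.zeros((n_classes,) + labels.shape[1:])
    for c in range(n_classes):
        # total probability mass the K maps assigned to class c at each pixel
        votes[c] = np.where(labels == c, probs, 0.0).sum(axis=0)
    return votes.argmax(axis=0)
```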