On Building Classification from Remote Sensor Imagery Using Deep Neural Networks and the Relation Between Classification and Reconstruction Accuracy Using Border Localization as Proxy

Bodhiswatta Chatterjee
Immersive and Creative Technologies Lab, Department of Computer Science and Software Engineering
Gina Cody School of Engineering and Computer Science
Concordia University, Montreal, Quebec, Canada
Charalambos Poullis
Immersive and Creative Technologies Lab, Department of Computer Science and Software Engineering
Gina Cody School of Engineering and Computer Science
Concordia University, Montreal, Quebec, Canada

PAPER

SOURCE CODE

SUPPLEMENTAL MATERIAL

Abstract

Convolutional neural networks have been shown to achieve very high accuracy on certain visual tasks, in particular semantic segmentation. In this paper we address the problem of semantic segmentation of buildings from remote sensor imagery. We present ICT-Net: a novel network with the underlying architecture of a fully convolutional network, infused with feature re-calibrated Dense blocks at each layer. Uniquely, the proposed network combines the localization accuracy and use of context of the U-Net architecture, the compact internal representations and reduced feature redundancy of the Dense blocks, and the dynamic channel-wise feature re-weighting of the Squeeze-and-Excitation (SE) blocks. The proposed network has been tested on INRIA's benchmark dataset and is shown to outperform all other state-of-the-art networks by more than 1.5% on the Jaccard index.

Furthermore, as building classification is typically the first step of the reconstruction process, in the latter part of the paper we investigate the relationship between classification accuracy and reconstruction accuracy. A comparative quantitative analysis of reconstruction accuracies corresponding to different classification accuracies confirms the strong correlation between the two, but also reveals a consistent and considerable reduction in the reconstruction accuracy relative to the classification accuracy. The source code and supplemental material are publicly available below.

Keywords

building classification; building reconstruction; classification accuracy; reconstruction accuracy; relationship between classification and reconstruction accuracy;

System Overview

Figure 1 summarizes the pipeline followed in the paper. First, an orthorectified RGB image is fed forward through the neural network to produce a binary (building/non-building) classification map. Next, the binary classification map becomes the input to the reconstruction process. Because it is extremely difficult to acquire building blueprints or CAD models for such large areas, and because depth/3D information is not available for the benchmark images, we posit that the building boundaries extracted from the binary classification map and subsequently refined can serve as a proxy for the accuracy of the reconstruction. This is justified by the fact that the extracted boundaries are extruded in order to create the 3D models of the buildings. Therefore, the building boundaries are extracted, refined, and used for the comparative analysis and evaluation of the reconstruction accuracy (a minimal sketch of this step is given below).

The diagram summarizes the work presented in this paper. First, we focus on building classification and propose a novel network architecture which outperforms the state-of-the-art on benchmark datasets and is currently top-ranking. Second, we investigate the relation between the classification accuracy and the reconstruction accuracy and conduct a comparative quantitative analysis which shows a strong correlation, but also a consistent and considerable decrease in the reconstruction accuracy when compared to the classification accuracy.
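As a rough illustration of the boundary-as-proxy step described above, the sketch below extracts building outlines from a binary classification map, simplifies them, and extrudes the footprints into simple prisms. The use of OpenCV contour tracing, Douglas-Peucker simplification, and a fixed extrusion height are assumptions made for illustration only; the refinement procedure used in the paper may differ.

# Hedged sketch (assumes OpenCV 4.x and NumPy): building outlines are traced from a
# binary classification map, simplified, and extruded to prisms that stand in for
# the reconstructed 3D models. Not the paper's exact refinement procedure.
import cv2
import numpy as np

def extract_footprints(mask: np.ndarray, epsilon_px: float = 2.0):
    """Return simplified building boundary polygons from a binary mask (H x W, {0,1})."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for c in contours:
        # Douglas-Peucker simplification as a stand-in for boundary refinement.
        approx = cv2.approxPolyDP(c, epsilon_px, True)
        if len(approx) >= 3:
            polygons.append(approx.reshape(-1, 2))
    return polygons

def extrude(polygon: np.ndarray, height: float = 10.0):
    """Extrude a 2D footprint to a simple prism (vertex list only), mimicking the
    extrusion of refined boundaries into 3D building models."""
    base = np.hstack([polygon, np.zeros((len(polygon), 1))])
    top = np.hstack([polygon, np.full((len(polygon), 1), height)])
    return np.vstack([base, top])

# Usage with a hypothetical prediction map:
# mask = (prediction > 0.5)
# models = [extrude(p) for p in extract_footprints(mask)]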

Network Overview

The proposed network architecture is distinct in that it combines the strengths of the U-Net architecture, Dense blocks, and Squeeze-and-Excitation (SE) blocks. This results in improved prediction accuracy: on the INRIA benchmark dataset the network has been shown to outperform other state-of-the-art architectures, such as the ones proposed in [9], which have a much higher number of learnable parameters. Figure 2 shows a diagram of the proposed feature-recalibrated Dense block with 4 convolutional layers and a growth rate k = 12 used by ICT-Net. The proposed network has 11 feature-recalibrated Dense blocks with [4, 5, 7, 10, 12, 15, 12, 10, 7, 5, 4] convolutional layers per block. Perhaps the closest architecture to ours is the one proposed in "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation", which uses 103 convolutional layers. Introducing SE blocks at the output of every layer would cause a vast increase in the number of parameters and hinder training. In contrast, we include an SE block only at the end of every Dense block in order to re-calibrate the accumulated feature maps of all preceding layers. Thus, the feature maps learned at each layer are weighted by the SE block according to their importance, as determined by the loss function.

Proposed feature-recalibrated Dense block with 4 convolutional layers and a growth rate k = 12, as used by ICT-Net. "c" stands for concatenation.
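As a rough sketch of the block in Figure 2, the following PyTorch code implements a Dense block (4 layers, growth rate k = 12) whose concatenated output is re-weighted channel-wise by a single SE block, mirroring the placement of the SE block at the end of each Dense block described above. The BN/ReLU/Conv ordering, the SE reduction ratio of 16, and the module names are assumptions; the released source code defines the exact ICT-Net configuration.

# Minimal PyTorch sketch (not the official implementation) of a feature-recalibrated
# Dense block: DenseNet-style concatenation followed by one SE block that
# re-calibrates the accumulated feature maps of all preceding layers.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pool, bottleneck MLP, sigmoid gating."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # channel-wise weights
        return x * w                                       # re-calibrated feature maps

class RecalibratedDenseBlock(nn.Module):
    """Dense block whose concatenated output is recalibrated by a single SE block."""
    def __init__(self, in_channels, num_layers=4, growth_rate=12):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            c_in = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
                nn.Conv2d(c_in, growth_rate, kernel_size=3, padding=1, bias=False)))
        self.se = SEBlock(in_channels + num_layers * growth_rate)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # "c": concatenation
        return self.se(torch.cat(features, dim=1))  # SE applied once, at block output

# Example: a 4-layer block with k = 12 on a 48-channel input.
# y = RecalibratedDenseBlock(48)(torch.randn(1, 48, 64, 64))  # -> (1, 96, 64, 64)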