Method | Mean Accuracy (IOU) | |
---|---|---|
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (Feb 2018) | 89.0 | |
Spatial pyramid pooling modules and encoder-decoder structures are used in deep neural networks for the semantic segmentation task. The former encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages of both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 and Cityscapes datasets, achieving test set performance of 89.0% and 82.1% respectively, without any post-processing. Our paper is accompanied by a publicly available reference implementation of the proposed models in TensorFlow at https://github.com/tensorflow/models/tree/master/research/deeplab. |
||
Multi-Scale Context Intertwining for Semantic Segmentation (ECCV 2018) | 88.0 | |
Accurate semantic image segmentation requires the joint consideration of local appearance, semantic information, and global scene context. In today’s age of pre-trained deep networks and their powerful convolutional features, state-of-the-art semantic segmentation approaches differ mostly in how they choose to combine these different kinds of information. In this work, we propose a novel scheme for aggregating features from different scales, which we refer to as Multi-Scale Context Intertwining (MSCI). In contrast to previous approaches, which typically propagate information between scales in a one-directional manner, we merge pairs of feature maps in a bidirectional and recurrent fashion, via connections between two LSTM chains. By training the parameters of the LSTM units on the segmentation task, the above approach learns how to extract powerful and effective features for pixel-level semantic segmentation, which are then combined hierarchically. Furthermore, rather than using fixed information propagation routes, we subdivide images into super-pixels, and use the spatial relationship between them in order to perform image-adapted context aggregation. Our extensive evaluation on public benchmarks indicates that all of the aforementioned components of our approach increase the effectiveness of information propagation throughout the network, and significantly improve its eventual segmentation accuracy. |
||
ExFuse: Enhancing Feature Fusion for Semantic Segmentation (Apr 2018) | 87.9 | |
Modern semantic segmentation frameworks usually combine low-level and high-level features from pre-trained backbone convolutional models to boost performance. In this paper, we first point out that a simple fusion of low-level and high-level features can be less effective because of the gap in semantic levels and spatial resolution. We find that introducing semantic information into low-level features and high-resolution details into high-level features is more effective for the later fusion. Based on this observation, we propose a new framework, named ExFuse, to bridge the gap between low-level and high-level features, thus significantly improving the segmentation quality by 4.0% in total. Furthermore, we evaluate our approach on the challenging PASCAL VOC 2012 segmentation benchmark and achieve 87.9% mean IoU, which outperforms the previous state-of-the-art results. |
||
Rethinking Atrous Convolution for Semantic Image Segmentation (Jun 2017) | 86.9 | |
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context, which further boosts performance. We also elaborate on implementation details and share our experience in training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark. |
||
Feedforward semantic segmentation with zoom-out features (Dec 2014) | 64.4 | |
We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead, superpixels are classified by a feedforward multilayer network. Our architecture achieves new state-of-the-art performance in semantic segmentation, obtaining 64.4% average accuracy on the PASCAL VOC 2012 test set. |
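
The DeepLabv3 and DeepLabv3+ entries above revolve around atrous (dilated) convolution and the Atrous Spatial Pyramid Pooling (ASPP) module: parallel convolutions at several dilation rates, plus image-level pooling, capture context at multiple effective fields-of-view. The sketch below is a minimal, illustrative Keras version of an atrous depthwise separable convolution and an ASPP head; it is not the reference DeepLab implementation linked above, and the filter count (256) and rates (6, 12, 18) are assumptions chosen for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


class AtrousSeparableConv(layers.Layer):
    """Depthwise 3x3 convolution with dilation, followed by a 1x1 pointwise convolution."""

    def __init__(self, filters, rate):
        super().__init__()
        # Dilation enlarges the field-of-view without extra parameters or downsampling.
        self.depthwise = layers.DepthwiseConv2D(3, dilation_rate=rate,
                                                padding="same", use_bias=False)
        self.pointwise = layers.Conv2D(filters, 1, use_bias=False)
        self.bn_d = layers.BatchNormalization()
        self.bn_p = layers.BatchNormalization()

    def call(self, x, training=False):
        x = tf.nn.relu(self.bn_d(self.depthwise(x), training=training))
        return tf.nn.relu(self.bn_p(self.pointwise(x), training=training))


class ASPP(layers.Layer):
    """Atrous Spatial Pyramid Pooling: parallel atrous branches plus image-level pooling."""

    def __init__(self, filters=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv_1x1 = layers.Conv2D(filters, 1, use_bias=False)
        self.branches = [AtrousSeparableConv(filters, r) for r in rates]
        self.pool = layers.GlobalAveragePooling2D(keepdims=True)
        self.pool_conv = layers.Conv2D(filters, 1, use_bias=False)
        self.project = layers.Conv2D(filters, 1, use_bias=False)

    def call(self, x, training=False):
        size = tf.shape(x)[1:3]
        feats = [self.conv_1x1(x)] + [b(x, training=training) for b in self.branches]
        # Image-level features: global average pool, 1x1 conv, upsample back.
        feats.append(tf.image.resize(self.pool_conv(self.pool(x)), size))
        return self.project(tf.concat(feats, axis=-1))


if __name__ == "__main__":
    features = tf.random.normal([1, 32, 32, 512])  # backbone output at stride 16
    print(ASPP()(features).shape)                  # (1, 32, 32, 256)
```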
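
DeepLabv3+ and ExFuse both stress that how high-level (semantic) and low-level (high-resolution) features are fused in an encoder-decoder setup matters. The following is a minimal sketch of that fusion idea only, not either paper's published decoder; the channel counts and the single separable refinement convolution are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


class SimpleDecoder(layers.Layer):
    """Fuse upsampled semantic features with high-resolution low-level features,
    refine the result, and predict per-pixel class logits."""

    def __init__(self, num_classes, low_level_filters=48):
        super().__init__()
        # Reduce low-level channels so fine detail does not overwhelm semantics.
        self.reduce = layers.Conv2D(low_level_filters, 1, use_bias=False)
        self.refine = layers.SeparableConv2D(256, 3, padding="same", activation="relu")
        self.classifier = layers.Conv2D(num_classes, 1)

    def call(self, high_level, low_level, image_size):
        low_size = tf.shape(low_level)[1:3]
        x = tf.image.resize(high_level, low_size)              # upsample semantic features
        x = tf.concat([x, self.reduce(low_level)], axis=-1)    # fuse with fine detail
        x = self.refine(x)
        return tf.image.resize(self.classifier(x), image_size)  # full-resolution logits


if __name__ == "__main__":
    high = tf.random.normal([1, 32, 32, 256])    # e.g. ASPP output at output stride 16
    low = tf.random.normal([1, 128, 128, 256])   # backbone features at output stride 4
    logits = SimpleDecoder(num_classes=21)(high, low, image_size=(512, 512))
    print(logits.shape)                          # (1, 512, 512, 21)
```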