Multimodal Scene Understanding

Book description

Multimodal Scene Understanding: Algorithms, Applications and Deep Learning presents recent advances in multimodal computing, with a focus on computer vision and photogrammetry. It provides the latest algorithms and applications that combine multiple sources of information, and describes the role of multi-sensory data and approaches to multimodal deep learning. The book is ideal for researchers from the fields of computer vision, remote sensing, robotics, and photogrammetry, and thereby helps foster interdisciplinary interaction and collaboration among these communities.

Researchers who collect and analyze multi-sensory data (for example, the KITTI benchmark, which combines stereo and laser data) from platforms such as autonomous vehicles, surveillance cameras, UAVs, planes, and satellites will find this book very useful.

  • Contains state-of-the-art developments in multimodal computing
  • Focuses on algorithms and applications
  • Presents novel deep learning topics on multi-sensor fusion and multimodal deep learning

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. List of Contributors
  6. Chapter 1: Introduction to Multimodal Scene Understanding
    1. Abstract
    2. 1.1. Introduction
    3. 1.2. Organization of the Book
    4. References
  7. Chapter 2: Deep Learning for Multimodal Data Fusion
    1. Abstract
    2. 2.1. Introduction
    3. 2.2. Related Work
    4. 2.3. Basics of Multimodal Deep Learning: VAEs and GANs
    5. 2.4. Multimodal Image-to-Image Translation Networks
    6. 2.5. Multimodal Encoder–Decoder Networks
    7. 2.6. Experiments
    8. 2.7. Conclusion
    9. References
  8. Chapter 3: Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks
    1. Abstract
    2. 3.1. Introduction
    3. 3.2. Overview
    4. 3.3. Methods
    5. 3.4. Results and Discussion
    6. 3.5. Conclusion
    7. References
  9. Chapter 4: Learning Convolutional Neural Networks for Object Detection with Very Little Training Data
    1. Abstract
    2. Acknowledgement
    3. 4.1. Introduction
    4. 4.2. Fundamentals
    5. 4.3. Related Work
    6. 4.4. Traffic Sign Detection
    7. 4.5. Localization
    8. 4.6. Clustering
    9. 4.7. Dataset
    10. 4.8. Experiments
    11. 4.9. Conclusion
    12. References
  10. Chapter 5: Multimodal Fusion Architectures for Pedestrian Detection
    1. Abstract
    2. Acknowledgement
    3. 5.1. Introduction
    4. 5.2. Related Work
    5. 5.3. Proposed Method
    6. 5.4. Experimental Results and Discussion
    7. 5.5. Conclusion
    8. References
  11. Chapter 6: Multispectral Person Re-Identification Using GAN for Color-to-Thermal Image Translation
    1. Abstract
    2. Acknowledgements
    3. 6.1. Introduction
    4. 6.2. Related Work
    5. 6.3. ThermalWorld Dataset
    6. 6.4. Method
    7. 6.5. Evaluation
    8. 6.6. Conclusion
    9. References
  12. Chapter 7: A Review and Quantitative Evaluation of Direct Visual–Inertial Odometry
    1. Abstract
    2. 7.1. Introduction
    3. 7.2. Related Work
    4. 7.3. Background: Nonlinear Optimization and Lie Groups
    5. 7.4. Background: Direct Sparse Odometry
    6. 7.5. Direct Sparse Visual–Inertial Odometry
    7. 7.6. Calculating the Relative Jacobians
    8. 7.7. Results
    9. 7.8. Conclusion
    10. References
  13. Chapter 8: Multimodal Localization for Embedded Systems: A Survey
    1. Abstract
    2. 8.1. Introduction
    3. 8.2. Positioning Systems and Perception Sensors
    4. 8.3. State of the Art on Localization Methods
    5. 8.4. Multimodal Localization for Embedded Systems
    6. 8.5. Application Domains
    7. 8.6. Conclusion
    8. References
  14. Chapter 9: Self-Supervised Learning from Web Data for Multimodal Retrieval
    1. Abstract
    2. Acknowledgements
    3. 9.1. Introduction
    4. 9.2. Related Work
    5. 9.3. Multimodal Text–Image Embedding
    6. 9.4. Text Embeddings
    7. 9.5. Benchmarks
    8. 9.6. Retrieval on InstaCities1M and WebVision Datasets
    9. 9.7. Retrieval in the MIRFlickr Dataset
    10. 9.8. Comparing the Image and Text Embeddings
    11. 9.9. Visualizing CNN Activation Maps
    12. 9.10. Visualizing the Learned Semantic Space with t-SNE
    13. 9.11. Conclusions
    14. References
  15. Chapter 10: 3D Urban Scene Reconstruction and Interpretation from Multisensor Imagery
    1. Abstract
    2. 10.1. Introduction
    3. 10.2. Pose Estimation for Wide-Baseline Image Sets
    4. 10.3. Dense 3D Reconstruction
    5. 10.4. Scene Classification
    6. 10.5. Scene and Building Decomposition
    7. 10.6. Building Modeling
    8. 10.7. Conclusion and Future Work
    9. References
  16. Chapter 11: Decision Fusion of Remote-Sensing Data for Land Cover Classification
    1. Abstract
    2. 11.1. Introduction
    3. 11.2. Proposed Framework
    4. 11.3. Use Case #1: Hyperspectral and Very High Resolution Multispectral Imagery for Urban Material Discrimination
    5. 11.4. Use Case #2: Urban Footprint Detection
    6. 11.5. Final Outlook and Perspectives
    7. References
  17. Chapter 12: Cross-modal Learning by Hallucinating Missing Modalities in RGB-D Vision
    1. Abstract
    2. 12.1. Introduction
    3. 12.2. Related Work
    4. 12.3. Generalized Distillation with Multiple Stream Networks
    5. 12.4. Experiments
    6. 12.5. Conclusions and Future Work
    7. References
  18. Index

Product information

  • Title: Multimodal Scene Understanding
  • Author(s): Michael Ying Yang, Bodo Rosenhahn, Vittorio Murino
  • Release date: July 2019
  • Publisher(s): Academic Press
  • ISBN: 9780128173596