Self-Supervised Learning from Web Data for Multimodal Retrieval
Raul Gomez⁎,†; Lluis Gomez†; Jaume Gibert⁎; Dimosthenis Karatzas†
⁎ Eurecat, Centre Tecnològic de Catalunya, Unitat de Tecnologies Audiovisuals, Barcelona, Spain
† Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
Abstract
Self-supervised learning from multimodal image and text data allows deep neural networks to learn powerful features without the need for human-annotated data. Web and social media platforms provide a virtually unlimited amount of this multimodal data. In this work we propose to exploit this freely available data to learn a multimodal image and text embedding, aiming to leverage the semantic knowledge learned in the text domain and transfer ...
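The abstract describes learning a joint image-text embedding in which the paired text acts as free supervision for the image. The following is a minimal sketch of that idea, not the authors' exact pipeline: it assumes precomputed image features and text embeddings (random tensors stand in for real web data here) and trains a projection head that regresses each image representation onto the embedding of its associated text, so that visual features land in the same semantic space as the text.

```python
# Minimal sketch (assumed setup, not the authors' implementation): project
# image features into a text embedding space using the paired text as the
# self-supervised target.
import torch
import torch.nn as nn

IMG_FEAT_DIM = 2048   # e.g. CNN pooled features (assumption)
TXT_EMB_DIM = 300     # e.g. Word2Vec/GloVe text embedding size (assumption)

# Projection head mapping visual features into the text embedding space.
projector = nn.Sequential(
    nn.Linear(IMG_FEAT_DIM, 1024),
    nn.ReLU(),
    nn.Linear(1024, TXT_EMB_DIM),
)

optimizer = torch.optim.Adam(projector.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # regress the image projection onto the text embedding

# Placeholder batch: in practice these come from web/social-media image-text pairs.
image_features = torch.randn(32, IMG_FEAT_DIM)
text_embeddings = torch.randn(32, TXT_EMB_DIM)

for step in range(100):
    pred = projector(image_features)
    loss = loss_fn(pred, text_embeddings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, both images and free-text queries can be embedded in the same space and compared with, for example, cosine similarity for multimodal retrieval.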