Motion Estimation via Robust Decomposition with Constrained Rank

German Ros, Jose M. Alvarez and Julio Guerrero
IEEE T-IV 2017


In this work, we address the problem of outlier detection for robust motion estimation by using modern sparse-low-rank decompositions, i.e., Robust PCA-like methods, to impose global rank constraints. Robust decompositions have been shown to be effective at splitting a corrupted matrix into an uncorrupted low-rank matrix and a sparse matrix containing the outliers. However, this process only works when matrices have relatively low rank with respect to their ambient space, a property not met in motion estimation problems. As a solution, we propose to exploit the partial information present in the decomposition to decide which matches are outliers. We provide evidence showing that even when it is not possible to recover an uncorrupted low-rank matrix, the resulting information can be exploited for outlier detection. To this end, we propose the Robust Decomposition with Constrained Rank (RD-CR), a proximal-gradient-based method that enforces the rank constraints inherent to motion estimation. We also present a general framework to perform robust estimation for stereo Visual Odometry, based on our RD-CR and a simple but effective compressed optimization method that achieves high performance. Our evaluation on synthetic data and on the KITTI dataset demonstrates the applicability of our approach in complex scenarios, where it yields state-of-the-art performance.
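The sparse-low-rank decomposition the paper builds on can be sketched with a vanilla Robust-PCA-style alternating proximal scheme. This is a hedged illustration, not the RD-CR algorithm itself; the default thresholds `lam` and `mu` are assumptions:

```python
import numpy as np

def svt(X, tau):
    # Singular-value thresholding: proximal operator of tau * nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(X, tau):
    # Soft thresholding: proximal operator of tau * l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def robust_decompose(M, lam=None, mu=None, n_iter=100):
    """Split M into low-rank L and sparse S by block-coordinate proximal
    steps on 0.5*||M - L - S||_F^2 + mu*||L||_* + lam*mu*||S||_1."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))
    if mu is None:
        mu = 0.25 * np.abs(M).mean()
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S, mu)           # low-rank update
        S = shrink(M - L, lam * mu)  # sparse (outlier) update
    return L, S
```

Outliers then show up as the large-magnitude entries of `S` — the kind of partial information the paper exploits even when the low-rank recovery itself is imperfect.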


title = {Motion Estimation via Robust Decomposition with Constrained Rank},
author = {German Ros and Jose Alvarez and Julio Guerrero},
journal = {{IEEE Transactions on Intelligent Vehicles}},
year = {2017},


Joint coarse-and-fine reasoning for deep optical flow

Victor Vaquero, German Ros, Francesc Moreno-Noguer, Antonio M. Lopez, Alberto Sanfeliu
V. Vaquero, G. Ros, F. Moreno-Noguer, A. M. Lopez and A. Sanfeliu. Joint coarse-and-fine reasoning for deep optical flow, 2017 IEEE International Conference on Image Processing (ICIP), 2017, Beijing


We propose a novel representation for dense pixel-wise estimation tasks using CNNs that boosts accuracy and reduces training time by explicitly exploiting joint coarse-and-fine reasoning. The coarse reasoning is performed over a discrete classification space to obtain a general rough solution, while the fine details of the solution are obtained over a continuous regression space. In our approach, both components are jointly estimated, which proves to be beneficial for improving estimation accuracy. Additionally, we propose a new network architecture that combines the coarse and fine components by treating the fine estimation as a refinement built on top of the coarse solution, thereby adding details to the general prediction. We apply our approach to the challenging problem of optical flow estimation and empirically validate it against state-of-the-art CNN-based solutions trained from scratch and tested on large optical flow datasets.
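The coarse-plus-fine fusion can be illustrated with a toy numpy sketch. The bin layout and the assumption that the logits and residual come from the two CNN heads are illustrative, not the paper's exact architecture:

```python
import numpy as np

def combine_coarse_fine(logits, residual, bins):
    """Fuse a coarse classification over discrete flow bins with a fine
    continuous regression: the regressor only refines the selected bin."""
    coarse = bins[np.argmax(logits, axis=-1)]  # rough discrete solution
    return coarse + residual                   # add the fine details

# Toy example: 9 coarse horizontal-flow hypotheses from -16 to 16 px.
bins = np.linspace(-16.0, 16.0, 9)
logits = np.array([0.1, 0.2, 0.1, 0.3, 2.5, 0.2, 0.1, 0.1, 0.1])  # peak at bin 4 (0 px)
flow = combine_coarse_fine(logits, residual=0.7, bins=bins)  # -> 0.7
```

The design point is that the regression head only has to model a small residual, which is an easier continuous target than the full flow range.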

  author    = {Victor Vaquero and
               German Ros and
               Francesc Moreno-Noguer and
               Antonio M. Lopez and
               Alberto Sanfeliu},
  title     = {Joint coarse-and-fine reasoning for deep optical flow},
  booktitle = {IEEE International Conference on Image Processing (ICIP)},
  year      = {2017}

From Virtual to Real World Visual Perception using Domain Adaptation – The DPM as Example

Antonio M. Lopez, Jiaolong Xu, Jose L. Gomez, David Vazquez, German Ros
arXiv preprint, 2016


Supervised learning tends to produce more accurate classifiers than unsupervised learning in general. This implies that annotated training data is preferred. When addressing visual perception challenges, such as localizing certain object classes within an image, learning the involved classifiers turns out to be a practical bottleneck. The reason is that, at a minimum, we have to frame object examples with bounding boxes in thousands of images. A priori, the more complex the model is in terms of its number of parameters, the more annotated examples are required. This annotation task is performed by human oracles, which leads to inaccuracies and errors in the annotations (aka ground truth), since the task is inherently cumbersome and sometimes ambiguous. As an alternative, we have pioneered the use of virtual worlds for collecting such annotations automatically and with high precision. However, since models learned with virtual data must operate in the real world, we still need to perform domain adaptation (DA). In this chapter we revisit the DA of a deformable part-based model (DPM) as an exemplifying case of virtual-to-real-world DA. As a use case, we address the challenge of vehicle detection for driver assistance, using different publicly available virtual-world data. While doing so, we investigate questions such as how the domain gap behaves for virtual-vs-real data with respect to the dominant object appearance per domain, as well as the role of photo-realism in the virtual world.

  author    = {Antonio M. Lopez and
               Jiaolong Xu and
               Jose L. Gomez and
               David V{\'{a}}zquez and
               Germ{\'{a}}n Ros},
  title     = {From Virtual to Real World Visual Perception using Domain Adaptation
               - The {DPM} as Example},
  journal   = {CoRR},
  volume    = {abs/1612.09134},
  year      = {2016},

CARLA: An Open Urban Driving Simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, Vladlen Koltun


We introduce CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions. We use CARLA to study the performance of three approaches to autonomous driving: a classic modular pipeline, an end-to-end model trained via imitation learning, and an end-to-end model trained via reinforcement learning. The approaches are evaluated in controlled scenarios of increasing difficulty, and their performance is examined via metrics provided by CARLA, illustrating the platform’s utility for autonomous driving research.

title = {{CARLA}: {An} Open Urban Driving Simulator},
author = {Alexey Dosovitskiy and German Ros and Felipe Codevilla and Antonio Lopez and Vladlen Koltun},
booktitle = {Proceedings of the 1st Annual Conference on Robot Learning},
pages = {1--16},
year = {2017}

The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes

German Ros, Laura Sellart, Joanna Materzyńska, David Vazquez, Antonio Lopez
(short oral) IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA. 2016


Vision-based semantic segmentation in urban scenarios is a key functionality for autonomous driving. Recent revolutionary results of deep convolutional neural networks (DCNNs) foreshadow the advent of reliable classifiers to perform such visual tasks. However, DCNNs require learning many parameters from raw images; thus, a sufficient amount of diverse images with class annotations is needed. These annotations are obtained via cumbersome human labour, which is particularly challenging for semantic segmentation since pixel-level annotations are required. In this paper, we propose to use a virtual world to automatically generate realistic synthetic images with pixel-level annotations. Then, we address the question of how useful such data can be for semantic segmentation, in particular when using a DCNN paradigm. In order to answer this question we have generated a synthetic collection of diverse urban images, named SYNTHIA, with automatically generated class annotations. We use SYNTHIA in combination with publicly available real-world urban images with manually provided annotations. Then, we conduct experiments with DCNNs that show how the inclusion of SYNTHIA in the training stage significantly improves performance on the semantic segmentation task.


author={German Ros and Laura Sellart and Joanna Materzynska and David Vazquez and Antonio Lopez},
title={The {SYNTHIA} Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes},

Street-View Change Detection with Deconvolutional Networks

Simon Stent, Pablo F. Alcantarilla, German Ros, Roberto Arroyo, Riccardo Gherardi
Robotics: Science and Systems (RSS), Michigan, USA. 2016


We propose a system for performing structural change detection in street-view videos captured by a vehicle-mounted monocular camera over time. Our approach is motivated by the need for more frequent and efficient updates in the large-scale maps used in autonomous vehicle navigation. Our method chains a multi-sensor fusion SLAM and fast dense 3D reconstruction pipeline, which provides coarsely registered image pairs to a deep deconvolutional network for pixel-wise change detection. To train and evaluate our network we introduce a new urban change detection dataset which is an order of magnitude larger than existing datasets and contains challenging changes due to seasonal and lighting variations. Our method outperforms the existing literature on this dataset, which we make available to the community, as well as on an existing panoramic change detection dataset, demonstrating its wide applicability.


title={Street-View Change Detection with Deconvolutional Networks},
author={Pablo Alcantarilla and Simon Stent and German Ros and Roberto Arroyo and Riccardo Gherardi},
booktitle={Robotics: Science and Systems (RSS)},
year={2016}

Training Constrained Deconvolutional Networks for Road Scene Semantic Segmentation

German Ros, Simon Stent, Pablo F. Alcantarilla, Tomoki Watanabe
arXiv preprint, 2016


In this work we investigate the problem of road scene semantic segmentation using Deconvolutional Networks (DNs). Several constraints limit the practical performance of DNs in this context: firstly, the paucity of existing pixel-wise labelled training data, and secondly, the memory constraints of embedded hardware, which rule out the practical use of state-of-the-art DN architectures such as fully convolutional networks (FCNs). To address the first constraint, we introduce a Multi-Domain Road Scene Semantic Segmentation (MDRS3) dataset, aggregating data from six existing densely and sparsely labelled datasets for training our models, and two existing, separate datasets for testing their generalisation performance. We show that, while MDRS3 offers a greater volume and variety of data, end-to-end training of a memory-efficient DN does not yield satisfactory performance. We propose a new training strategy to overcome this, based on (i) the creation of a best-possible source network (S-Net) from the aggregated data, ignoring time and memory constraints; and (ii) the transfer of knowledge from S-Net to the memory-efficient target network (T-Net). We evaluate different techniques for S-Net creation and T-Net transferral, and demonstrate that training a constrained deconvolutional network in this manner can unlock better performance than existing training approaches. Specifically, we show that a target network can be trained to achieve improved accuracy versus an FCN despite using less than 1% of the memory. We believe that our approach can be useful beyond automotive scenarios, wherever labelled data is similarly scarce or fragmented and practical constraints exist on the desired model size. We make available our network models and aggregated multi-domain dataset for reproducibility.
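The S-Net-to-T-Net knowledge transfer is in the spirit of distillation; one common formulation is a KL loss on temperature-softened per-pixel class scores. This numpy sketch is an assumed stand-in for the paper's transfer procedure, not its exact loss:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL divergence from the teacher's (S-Net's) softened class
    distribution to the student's (T-Net's), scaled by T^2 as is
    conventional so gradients keep a comparable magnitude."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl)) * T * T
```

The temperature `T > 1` exposes the teacher's "dark knowledge" — the relative scores of the wrong classes — which is the extra signal a small target network can learn from.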


title = {Training Constrained Deconvolutional Networks for Road Scene Semantic Segmentation},
author = {German Ros and Simon Stent and Pablo F. Alcantarilla and Tomoki Watanabe},
journal = {{arXiv preprint abs/1604.01545}},
year = {2016},

Unsupervised Image Transformation for Outdoor Semantic Labelling

German Ros, Jose M. Alvarez
IEEE Intelligent Vehicles Symposium (IV), Seoul, Korea. 2015


Semantic labelling of urban images is a crucial component towards autonomous driving. The accuracy of current methods is highly dependent on the training set being used and drops drastically when the distribution of the test image does not match the expected distribution of the training set. This situation inevitably occurs, for instance, when the illumination changes from daytime to dusk. To address this problem we propose a fast unsupervised image transformation approach following a global color transfer strategy. Our proposal generalizes classical one-to-one color transfer schemes to the more suitable one-to-many scheme. In addition, our approach can naturally deal with the temporal consistency of video streams to perform a coherent transformation. We demonstrate the benefits of our proposal on two publicly available datasets using different state-of-the-art semantic labelling frameworks.
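The classical one-to-one global color transfer that the paper generalises can be sketched as per-channel mean/std matching (an illustrative baseline in the spirit of Reinhard et al., not the paper's one-to-many scheme):

```python
import numpy as np

def color_transfer(src, ref):
    """Global one-to-one color transfer: remap each channel of src so that
    its mean and standard deviation match those of the reference image."""
    out = np.empty_like(src, dtype=float)
    for c in range(src.shape[-1]):
        s = src[..., c].astype(float)
        r = ref[..., c].astype(float)
        sd = s.std() + 1e-8  # avoid division by zero on flat channels
        out[..., c] = (s - s.mean()) / sd * r.std() + r.mean()
    return out
```

A one-to-many generalisation would instead pick, per region or per frame, the best of several reference statistics — which is where the paper departs from this baseline.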


title={Unsupervised image transformation for outdoor semantic labelling},
author={Ros, German and Alvarez, Jose M},
booktitle={Intelligent Vehicles Symposium (IV), 2015 IEEE},

3D-Guided Multiscale Sliding Window for Pedestrian Detection

Alejandro Gonzalez, Gabriel Villalonga, German Ros, David Vazquez, Antonio Lopez
Iberian Conference on Pattern Recognition and Image Analysis, Santiago de Compostela, Spain. 2015


The most relevant modules of a pedestrian detector are candidate generation and candidate classification. The former aims at presenting image windows to the latter so that they can be classified as containing a pedestrian or not. Much attention has been paid to the classification module, while candidate generation has mainly relied on the (multiscale) sliding-window pyramid. However, candidate generation is critical for achieving real-time performance. In this paper we assume a context of autonomous driving based on stereo vision. Accordingly, we evaluate the effect of taking into account the 3D information (derived from the stereo) in order to prune the hundreds of thousands of windows per image generated by the classical pyramidal sliding window. For our study we use a multi-modal (RGB, disparity) and multi-descriptor (HOG, LBP, HOG+LBP) holistic ensemble based on linear SVM. Evaluation on data from the challenging KITTI benchmark suite shows the effectiveness of using 3D information to dramatically reduce the number of candidate windows, even improving the overall pedestrian detection accuracy.
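The pruning idea can be sketched from standard stereo geometry: the disparity fixes the depth of a window, and the depth fixes how many pixels tall a real pedestrian could be there. The focal length and baseline below are roughly KITTI-like, and the pedestrian height and tolerance are assumptions:

```python
def plausible_window(win_h, disparity, f=720.0, baseline=0.54,
                     ped_h=1.7, tol=0.4):
    """Keep a candidate window only if its pixel height matches the height
    a ~1.7 m pedestrian would have at the depth implied by the disparity.
    Illustrative sketch; the paper's actual pruning criteria may differ."""
    if disparity <= 0:
        return False
    depth = f * baseline / disparity   # stereo depth in metres
    expected_h = f * ped_h / depth     # pinhole projection, in pixels
    return abs(win_h - expected_h) <= tol * expected_h
```

Because most windows in a pyramid have a size/depth combination no pedestrian could produce, this single test discards the bulk of the candidates before any descriptor is computed.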


author="Gonz{\'a}lez, Alejandro and Villalonga, Gabriel and Ros, German and V{\'a}zquez, David and L{\'o}pez, Antonio M.",
editor="Paredes, Roberto and Cardoso, Jaime S. and Pardo, Xos{\'e} M.",
chapter="3D-Guided Multiscale Sliding Window for Pedestrian Detection",
title="Pattern Recognition and Image Analysis: 7th Iberian Conference, IbPRIA 2015, Santiago de Compostela, Spain, June 17-19, 2015, Proceedings",
publisher="Springer International Publishing",

Fast and Robust Fixed-Rank Matrix Recovery

German Ros, Julio Guerrero, Angel Sappa, Daniel Ponsa, Antonio Lopez
arXiv preprint, 2015


We address the problem of efficient sparse fixed-rank (S-FR) matrix decomposition, i.e., splitting a corrupted matrix M into an uncorrupted matrix L of rank r and a sparse matrix of outliers S. Fixed-rank constraints are usually imposed by the physical restrictions of the system under study. Here we propose a method to perform accurate and very efficient S-FR decomposition that is more suitable for large-scale problems than existing approaches. Our method is a careful combination of geometric and algebraic techniques, which avoids the bottleneck caused by the Truncated SVD (TSVD). Instead, a polar factorization is used to exploit the manifold structure of fixed-rank problems as the product of two Stiefel manifolds and an SPD manifold, leading to better convergence and stability. Then, closed-form projectors help to speed up each iteration of the method. We introduce a novel and fast projector for the SPD manifold together with a proof of its validity. Further acceleration is achieved using a Nyström scheme. Extensive experiments with synthetic and real data in the context of robust photometric stereo and spectral clustering show that our proposals outperform the state of the art.
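Two of the building blocks can be written down directly as reference implementations: the eigendecomposition-based projection onto the SPD cone (the paper derives a faster closed-form projector) and the rank-r TSVD projection whose cost the polar-factorisation route is designed to avoid:

```python
import numpy as np

def project_spd(A, eps=1e-8):
    """Project a square matrix onto the SPD cone: symmetrise, then clip
    eigenvalues from below. Reference version; slower than the paper's
    closed-form projector."""
    S = 0.5 * (A + A.T)
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, eps)) @ V.T

def project_fixed_rank(M, r):
    """Best rank-r approximation via truncated SVD -- the bottleneck the
    manifold-based method replaces with cheaper per-iteration projectors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```

The TSVD costs a full SVD per call; keeping the iterate factored on (Stiefel x SPD x Stiefel) lets each projection touch only the small factors instead.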

Code: github


title = {Fast and Robust Fixed-Rank Matrix Recovery},
author = {German Ros and Julio Guerrero and Angel Sappa and Daniel Ponsa and Antonio Lopez},
journal = {{arXiv preprint abs/1503.03004}},
year = {2015},

Vision-based Offline-Online Perception Paradigm for Autonomous Driving

German Ros, Sebastian Ramos, Manuel Granados, Amir Bakhtiary, David Vazquez, Antonio M. Lopez
IEEE Winter Conference on Applications of Computer Vision (WACV), Hawaii, USA. 2015


Autonomous driving is a key factor for future mobility. Properly perceiving the environment of a vehicle is essential for safe driving, which requires computing accurate geometric and semantic information in real time. In this paper, we challenge state-of-the-art computer vision algorithms to build a perception system for autonomous driving. An inherent drawback in the computation of visual semantics is the trade-off between accuracy and computational cost. We propose to circumvent this problem by following an offline-online strategy. During the offline stage dense 3D semantic maps are created. In the online stage the current driving area is recognized in the maps via a re-localization process, which allows retrieving the pre-computed accurate semantics and 3D geometry in real time. Then, by detecting the dynamic obstacles, we obtain a rich understanding of the current scene. We evaluate our proposal quantitatively on the KITTI dataset and discuss the related open challenges for the computer vision community.


title={Vision-based offline-online perception paradigm for autonomous driving},
author={Ros, German and Ramos, Sebastian and Granados, Manuel and Bakhtiary, Amir and Vazquez, David and Lopez, Antonio M},
booktitle={Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on},

VSLAM pose initialization via Lie-groups and Lie-algebras optimization

German Ros, Julio Guerrero, Angel D. Sappa, Daniel Ponsa, Antonio M. Lopez
IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany. 2013


We present a novel technique for estimating initial 3D poses in the context of localization and Visual SLAM problems. The presented approach can deal with noise, outliers and large amounts of input data and still performs in real time on a standard CPU. Our method produces solutions with an accuracy comparable to those produced by RANSAC but can be much faster when the percentage of outliers is high or for large amounts of input data. In the current work we propose to formulate pose estimation as an optimization problem on Lie groups, considering their manifold structure as well as their associated Lie algebras. This allows us to perform a fast and simple optimization while conserving all the constraints imposed by the Lie group SE(3). Additionally, we present several key design concepts related to the cost function and its Jacobian, aspects that are critical for the good performance of the algorithm.
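The mechanism that keeps optimization steps inside SE(3) is the exponential map from the Lie algebra: updates are taken in the unconstrained 6-vector se(3) and mapped back to a valid rigid transform. A minimal sketch (the [translation, rotation] ordering of `xi` is a convention choice, not taken from the paper):

```python
import numpy as np

def hat(w):
    """Skew-symmetric 3x3 matrix of a rotation 3-vector (so(3) hat operator)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map se(3) -> SE(3): xi = [v, w] to a 4x4 rigid transform,
    so every optimization step yields a valid rotation + translation."""
    v, w = xi[:3], xi[3:]
    th = np.linalg.norm(w)
    W = hat(w)
    if th < 1e-10:
        R, V = np.eye(3), np.eye(3)
    else:
        # Rodrigues' formula and the corresponding SE(3) left Jacobian.
        R = np.eye(3) + np.sin(th) / th * W + (1 - np.cos(th)) / th**2 * W @ W
        V = (np.eye(3) + (1 - np.cos(th)) / th**2 * W
             + (th - np.sin(th)) / th**3 * W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T
```

A gradient step `T_new = se3_exp(delta) @ T` therefore never needs re-orthogonalisation, which is the practical payoff of optimizing in the algebra.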

Code: github


title={VSLAM pose initialization via Lie groups and Lie algebras optimization},
author={Ros, German and Guerrero, Julio and Sappa, Angel Domingo and Ponsa, Daniel and Lopez, Antonio M},
booktitle={Robotics and Automation (ICRA), 2013 IEEE International Conference on},

Fast and Robust L1-averaging-based Pose Estimation for Driving Scenarios

German Ros, Julio Guerrero, Angel D. Sappa, Daniel Ponsa, Antonio M. Lopez
In the 24th British Machine Vision Conference (BMVC), Bristol, UK. 2013


Robust visual pose estimation is at the core of many computer vision applications, being fundamental for Visual SLAM and Visual Odometry problems. During the last decades, many approaches have been proposed to solve these problems, RANSAC being one of the most accepted and used. However, with the arrival of new challenges, such as large driving scenarios for autonomous vehicles, along with improvements in data-gathering frameworks, new issues must be considered. One of these issues is the capability of a technique to deal with very large amounts of data while meeting the real-time constraint. With this purpose in mind, we present a novel technique for the problem of robust camera-pose estimation that is more suitable for dealing with large amounts of data and that, additionally, helps improve the results. The method is based on a combination of a very fast coarse-evaluation function and a robust ℓ1-averaging procedure. Such a scheme leads to high-quality results while taking considerably less time than RANSAC. Experimental results on the challenging KITTI Vision Benchmark Suite are provided, showing the validity of the proposed approach.
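The robustness of ℓ1 averaging comes from computing a geometric median rather than a mean. A Euclidean Weiszfeld sketch conveys the idea (the paper averages on SE(3), not on plain vectors, so this is an assumption-laden simplification):

```python
import numpy as np

def l1_average(points, n_iter=100, eps=1e-9):
    """Weiszfeld iteration for the geometric median (l1 average) of a set
    of hypotheses: reweight each point by the inverse of its distance to
    the current estimate, so far-away outliers get negligible influence."""
    x = np.median(points, axis=0)  # robust starting point
    for _ in range(n_iter):
        d = np.linalg.norm(points - x, axis=1)
        w = 1.0 / np.maximum(d, eps)  # guard against division by zero
        x = (w[:, None] * points).sum(axis=0) / w.sum()
    return x
```

Unlike the ℓ2 mean, which a single gross outlier can drag arbitrarily far, the ℓ1 average stays with the consensus of the inliers.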

Code: github, github


title={Fast and Robust ℓ1-averaging-based Pose Estimation for Driving Scenarios},
author={Ros, German and Guerrero, Julio and Sappa, Angel D and Ponsa, Daniel and L{\'o}pez-Pe{\~n}a, Antonio M},

booktitle={British Machine Vision Conference (BMVC)},

Visual SLAM for Driverless Cars: A Brief Survey

German Ros, Angel Sappa, Daniel Ponsa, Antonio M. Lopez
In IEEE Workshop on Navigation, Perception, Accurate Positioning and Mapping for Intelligent Vehicles, Alcalá de Henares, Spain. 2012


This paper presents a brief survey of Visual SLAM methods in the context of urban ground vehicles. For this, we have reviewed relevant works which present interesting ideas applicable to future designs of VSLAM schemes for urban scenarios. Our analysis aims to provide a global picture of state-of-the-art VSLAM systems, using a simple taxonomy based on an identified standard pipeline. This helps to show the global and consistent structure behind the different aspects that form these approaches. In addition, we also bring to the reader a set of useful tools, such as freely available code and datasets, that could help develop and test new VSLAM systems.


title={Visual {SLAM} for driverless cars: A brief survey},
author={Ros, German and Sappa, A and Ponsa, Daniel and Lopez, Antonio M},
journal={Intelligent Vehicles Symposium (IV) Workshops},


Articulated Particle Filter for Hand Tracking

German Ros, Jesus Martinez del Rincon, Gines Garcia-Mateos
In the 21st International Conference on Pattern Recognition, Tsukuba Science City, Japan. 2012


This paper proposes a new version of the Particle Filter, called Articulated Particle Filter (ArPF), which has been specifically designed for efficient sampling of the hierarchical spaces generated by articulated objects. Our approach decomposes the articulated motion into layers for efficiency purposes, making use of a careful modeling of the diffusion noise along with its propagation through the articulations. This increases accuracy and prevents divergence. The algorithm is tested on hand tracking due to its complex hierarchical articulated nature. For this purpose, a new dataset generation tool for quantitative evaluation is also presented in this paper.
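The layered idea can be sketched with a toy particle filter: each articulation layer is diffused, weighted, and resampled in turn, so deeper joints are sampled conditioned on already-refined parents. The per-layer noise scales and the likelihood are assumptions standing in for the paper's noise-propagation model:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(particles, weights):
    """Systematic resampling of particles according to their weights."""
    n = len(particles)
    positions = (np.arange(n) + rng.random()) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    return particles[idx].copy()

def layered_step(particles, likelihood, noise_scales):
    """One layered update in the spirit of ArPF (sketch): handle one
    articulation layer at a time -- diffuse that layer, weight by the
    observation likelihood, resample -- before moving to the next layer."""
    for layer, scale in enumerate(noise_scales):
        particles[:, layer] += rng.normal(0.0, scale, len(particles))
        w = likelihood(particles)
        particles = resample(particles, w / w.sum())
    return particles
```

Compared with diffusing all dimensions at once, the layer-by-layer schedule concentrates the particle budget on one slice of the hierarchy at a time, which is what makes high-dimensional articulated spaces tractable.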

Website: url


title={Articulated particle filter for hand tracking},
author={Ros, German and Martinez del Rincon, Jesus and Garc{\'\i}a-Mateos, Gin{\'e}s},
booktitle={Pattern Recognition (ICPR), 2012 21st International Conference on},

Realidad aumentada basada en características naturales: Un enfoque práctico (Augmented Reality Based on Natural Features: A Practical Approach)

German Ros, Gines Garcia Mateos
Paperback – EAE-Publishing, May 31, 2012

MS-222 toxicity in juvenile seabream correlates with diurnal activity, as measured by a novel video-tracking method

Luisa M. Vera, German Ros, Gines Garcia Mateos, F. Javier Sanchez-Vazquez
Aquaculture, vol. 307 (2010), pp. 29–34. Elsevier


Fish are frequently exposed to anaesthetics since their use is necessary in several aquaculture procedures. The aim of this study was to investigate the existence of day–night differences in the toxicity and effectiveness of a common fish anaesthetic (MS-222) in juvenile gilthead seabream (Sparus aurata), determining the induction time of anaesthesia and subsequent recovery by a novel video-recording system. Our results showed that MS-222 toxicity was significantly higher at ML (mid-light) (LC50 = 85.5 mg/L) than at MD (mid-darkness) (LC50 = 107.6 mg/L) (trimmed Spearman-Karber method). In addition, when fish were exposed to a sublethal but effective MS-222 concentration (65 mg/L), 7 min passed before a 50% reduction in swimming activity was observed at ML, compared to the 9 min required at MD. As regards recovery, fish showed activity levels similar to basal levels 10 min after MS-222 removal at ML, but after only 6 min at MD. These results indicated that both toxicity and effectiveness were higher during the day than at night, coinciding with the diurnal activity pattern displayed by seabream, which should be taken into account when designing and applying daily protocols for anaesthesia in aquaculture.
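The LC50 estimator named in the abstract has a simple closed form. The sketch below is the untrimmed Spearman-Karber estimate (the study used the trimmed variant, which additionally discards the distribution tails); it assumes concentrations in ascending order with mortality proportions running from 0 to 1:

```python
import numpy as np

def spearman_karber_lc50(concs, mortality):
    """Untrimmed Spearman-Karber LC50: the mean of the log-concentration
    tolerance distribution, computed from mortality increments between
    consecutive test concentrations."""
    logc = np.log10(np.asarray(concs, dtype=float))
    p = np.asarray(mortality, dtype=float)
    mid = (logc[:-1] + logc[1:]) / 2.0   # interval midpoints (log scale)
    dp = np.diff(p)                      # mortality mass in each interval
    return 10.0 ** np.sum(dp * mid)
```

For example, mortality rising 0 → 0.5 → 1 across 1, 10 and 100 mg/L puts half the tolerance mass in each log-interval, giving an LC50 of 10 mg/L.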


title={MS-222 toxicity in juvenile seabream correlates with diurnal activity, as measured by a novel video-tracking method},
author={Vera, LM and Ros-Sanchez, German and Garcia-Mateos, Gines and Sanchez-Vazquez, F Javier},

A new taxonomy and graphical representation for visual fish analysis with a case study

German Ros, Gines Garcia Mateos, Luisa M. Vera, F. Javier Sanchez-Vazquez
In the Workshop on Visual Observation and Analysis of Animal and Insect Behavior (VAIB) in the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. 2010


The wide variety of problems in fish ethology ranges from small larva motion analysis to open-water tracking of big adult fish. In order to cover this variety of problems, and to develop a new systematic way to tackle them, it is necessary to perform an adequate classification according to their properties. In this paper, we define a new taxonomy for visual fish analysis, and propose an easy-to-read representation based on star charts. Each problem is described by a chart, allowing comparison among them. We also present a case study of visual fish location in tanks in day and night conditions. The problem is solved by using adaptive background models and a median-based locator. It has been applied to real zebrafish and gilthead seabream (GSB) experiments, achieving high accuracy and robustness.
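The case-study pipeline — an adaptive background model plus a median-based locator — can be sketched in a few lines (a minimal illustration; the learning rate and threshold are assumptions, not the paper's tuned values):

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Adaptive background model: exponential running average of frames,
    so slow illumination changes are absorbed into the background."""
    return (1.0 - alpha) * bg + alpha * frame

def locate_fish(frame, bg, thresh=30.0):
    """Median-based locator: the median coordinates of the foreground
    pixels, robust to a few spurious detections (unlike the centroid)."""
    ys, xs = np.nonzero(np.abs(frame.astype(float) - bg) > thresh)
    if len(xs) == 0:
        return None  # no foreground detected
    return float(np.median(xs)), float(np.median(ys))
```

Using the median rather than the mean of the foreground pixels is what makes the locator tolerate reflections or sensor noise without drifting.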


title={A new taxonomy and graphical representation for visual fish analysis with a case study},
author={Ros, German and Garcia-Mateos, Gines and Vera, Luisa M and Sanchez-Vazquez, Francisco J},
booktitle={Workshop on Visual Observation and Analysis of Animal and Insect Behavior (VAIB), at the 20th International Conference on Pattern Recognition (ICPR)},