Curriculum
Computer Science and Innovation for Societal Challenges, XXXV series
Grant sponsor
Dip. di Matematica, UNIPD
Supervisor
Lamberto Ballan
Co-supervisor
Gianluca Campana
Project: Prediction of Activities and Visual Concepts Under Complex and Changing Conditions
Full text of the dissertation can be downloaded from: https://www.research.unipd.it/handle/11577/3473495
Abstract: In recent years, computer vision has changed dramatically: deep learning systems have surpassed the performance of previous models on most tasks, establishing a new way of thinking about vision problems. Despite this success on traditional computer vision tasks, our systems are still a long way from the general visual intelligence of people. In this dissertation, I discuss my findings on several problems related to the prediction of activities and visual concepts under complex and changing conditions. A core problem of visual intelligence is the ability to anticipate future events in videos given the current state of knowledge; to achieve such predictive capabilities, vision systems must be designed to encode current representations and generate hypotheses about future scenarios. I discuss several directions I proposed for endowing vision systems with predictive capabilities, based on semantic label smoothing of future actions, representing videos at slow and fast temporal scales, predicting latent goals, and prototyping future action representations. Another challenge of visual intelligence is recognizing visual concepts that the system has never seen before. In this context, I discuss my work on open-set recognition, where the vision model must detect unknown classes not seen during training while maintaining recognition performance on previously seen categories. A further core task related to the prediction of visual concepts is representation learning, where the model must learn good representations from visual input without any supervision. Here I discuss how vision transformers can efficiently learn good representations on small datasets through self-supervised tasks based on the spatial relations of input patches.