
Züri ML #33: Directions in Convolutional Neural Networks Research

Hosted By
Julian Z. and Yannic K.

Details

Do Deep Neural Networks Suffer from Crowding?

Anna Volokitin, ETH Zürich

Abstract: Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks (DNNs) for object recognition. We investigate whether this effect is also present in standard Convolutional Neural Networks and in a multi-scale extension called eccentricity-dependent networks. Eccentricity-dependent networks have recently been proposed for modeling the feedforward path of the primate visual cortex and have scale invariance built into the architecture. We show that an eccentricity-dependent network trained on objects in isolation can recognize objects in clutter under certain conditions, whereas standard convolutional networks cannot.
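
For a concrete sense of the experimental setup, here is a minimal sketch (not the authors' code) of how a crowded test image can be built: a target patch is placed at the centre of a blank canvas and flanker patches are placed at a chosen spacing, so that a classifier trained on isolated targets can be evaluated as the flankers move closer. The helpers `paste` and `make_crowded_image`, and all sizes, are illustrative assumptions.

    import numpy as np

    def paste(canvas, patch, cx, cy):
        """Overlay a small grayscale patch onto the canvas, centred at (cx, cy)."""
        h, w = patch.shape
        y0, x0 = cy - h // 2, cx - w // 2
        canvas[y0:y0 + h, x0:x0 + w] = np.maximum(canvas[y0:y0 + h, x0:x0 + w], patch)

    def make_crowded_image(target, flankers, spacing, size=120):
        """Target at the image centre, flankers `spacing` pixels away,
        alternating right and left. Illustrative sizes; patches must be
        small enough to stay inside the canvas."""
        canvas = np.zeros((size, size), dtype=np.float32)
        c = size // 2
        paste(canvas, target, c, c)
        for i, flanker in enumerate(flankers):
            side = 1 if i % 2 == 0 else -1          # alternate right / left of the target
            offset = side * spacing * (i // 2 + 1)  # push further out for each extra pair
            paste(canvas, flanker, c + offset, c)
        return canvas

    # Example: a 28x28 "digit" target with two flankers, 40 pixels away.
    target = np.random.rand(28, 28).astype(np.float32)
    flankers = [np.random.rand(28, 28).astype(np.float32) for _ in range(2)]
    crowded = make_crowded_image(target, flankers, spacing=40)
    print(crowded.shape)  # (120, 120) -- feed to a classifier trained on isolated targets

Sweeping `spacing` from large to small and measuring the accuracy of a pre-trained classifier yields the kind of crowding curve the abstract refers to.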

CNNs in Video Analysis - An Overview, Biased Towards Fast Methods

Michael Gygli, ETH Zürich

Automatic video analysis has become increasingly popular in recent years. This talk will focus on new developments in using Convolutional Neural Networks (CNNs) and present recent advances in building fast automatic video analysis algorithms that can be used in production systems. The talk consists of three parts. First, I will discuss C3D, a spatio-temporal neural network that is widely used for video analysis tasks such as action recognition. In contrast to competing approaches, C3D operates directly on raw pixel inputs, allowing it to run at close to 400 FPS on a modern GPU. Second, I will present my recent method for shot boundary detection with fully convolutional CNNs. Its architecture is similar to C3D, but more compact and fully convolutional in time. Thanks to these changes, the shot detector runs at more than 230x real-time speed (5800 FPS) and can analyze a full-length movie in less than half a minute. Finally, the presentation closes with our approach to automatically finding highlights in videos. Our system first detects shots, which are then scored by a combination of C3D, audio features and a feed-forward neural network (FNN).

References:

Learning Spatiotemporal Features with 3D Convolutional Networks (C3D) (https://arxiv.org/abs/1412.0767)

Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks (https://arxiv.org/abs/1705.08214)

Video2GIF: Automatic Generation of Animated GIFs from Video (https://arxiv.org/abs/1605.04850)
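
As a rough illustration of the C3D idea referenced above, the sketch below stacks 3x3x3 convolutions over (time, height, width), pools spatially before pooling temporally, and stays fully convolutional until a global pooling layer. It is written with PyTorch and uses illustrative layer sizes; it is not the published C3D or shot-detection architecture.

    import torch
    import torch.nn as nn

    class Tiny3DConvNet(nn.Module):
        """Minimal C3D-style sketch: 3x3x3 convolutions over (time, H, W),
        spatial pooling first, then spatio-temporal pooling. Sizes are
        illustrative, not the published configuration."""
        def __init__(self, num_classes=101):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only in the early layer
                nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(2, 2, 2)),   # pool time and space
                nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),               # global pooling keeps it convolutional in time
            )
            self.classifier = nn.Linear(256, num_classes)

        def forward(self, clip):                       # clip: (batch, 3, frames, H, W)
            x = self.features(clip).flatten(1)
            return self.classifier(x)

    # A 16-frame RGB clip at 112x112, the input shape commonly used with C3D-style models.
    logits = Tiny3DConvNet()(torch.randn(1, 3, 16, 112, 112))
    print(logits.shape)  # (1, 101)

Because the network only ever pools or averages along the time axis, longer clips can be processed in one pass, which is the property the shot-boundary detector described in the talk exploits for speed.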

Zurich Machine Learning and Data Science