Züri ML #33: Directions in Convolutional Neural Networks Research


Details
Do Deep Neural Networks Suffer from Crowding?
Anna Volokitin, ETH Zürich
Abstract: Crowding is a visual effect experienced by humans, in which an object that can be recognised in isolation can no longer be recognised when other objects are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks (DNNs) for object recognition. We investigate whether this effect is present both in Convolutional Neural Networks and in a multi-scale extension called eccentricity-dependent networks. Eccentricity-dependent networks have recently been proposed for modeling the feedforward path of the primate visual cortex, and have scale invariance built into the architecture. We show that the eccentricity-dependent network trained on objects in isolation can recognise objects in clutter under certain conditions, whereas standard convolutional networks cannot.
CNNs in Video analysis - An overview, biased to fast methods
Michael Gygli, ETH Zürich
Abstract: Automatic video analysis has become increasingly popular in recent years. This talk focuses on new developments in using Convolutional Neural Networks (CNNs) and presents recent advances towards fast automatic video analysis algorithms that can be used in production systems. The talk consists of three parts. First, I will discuss C3D, a spatio-temporal neural network that is widely used for video analysis tasks such as action recognition. In contrast to competing approaches, C3D operates directly on raw pixel inputs, allowing it to run at close to 400 FPS on a modern GPU. Second, I will present my recent method for shot boundary detection with fully convolutional networks. Its model architecture is similar to C3D, but more compact and fully convolutional in time. Thanks to these changes, shot detection runs at more than 230x real-time speed (5800 FPS), so it can analyze full-length movies in less than half a minute. Finally, the presentation closes with our approach to automatically finding highlights in videos. Our system first detects shots, which are then scored by a combination of C3D, audio features and a feed-forward neural network (FNN).
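To put the quoted throughput in perspective, here is a rough back-of-the-envelope sketch. The 25 FPS source frame rate and the 90-minute movie length are assumptions chosen for illustration and are not stated in the abstract. The function names are hypothetical.

```python
def realtime_factor(throughput_fps: float, video_fps: float) -> float:
    """How many times faster than playback the analysis runs."""
    return throughput_fps / video_fps


def processing_seconds(duration_minutes: float, video_fps: float,
                       throughput_fps: float) -> float:
    """Time needed to process a video of the given length."""
    total_frames = duration_minutes * 60 * video_fps
    return total_frames / throughput_fps


# Assumed values (not from the abstract): 25 FPS video, 90-minute movie.
VIDEO_FPS = 25
MOVIE_MINUTES = 90
THROUGHPUT_FPS = 5800  # figure quoted in the abstract

speedup = realtime_factor(THROUGHPUT_FPS, VIDEO_FPS)
seconds = processing_seconds(MOVIE_MINUTES, VIDEO_FPS, THROUGHPUT_FPS)
print(f"{speedup:.0f}x real time, {seconds:.1f} s per movie")
```

Under these assumptions the numbers line up with the "more than 230x" and "less than half a minute" figures. A longer movie or a higher source frame rate would shift the result accordingly.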
References:
Learning Spatiotemporal Features with 3D Convolutional Networks (C3D) ( https://arxiv.org/abs/1412.0767 )
Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks ( https://arxiv.org/abs/1705.08214 )
Video2GIF: Automatic Generation of Animated GIFs from Video ( https://arxiv.org/abs/1605.04850 )
