Breaking Through The Challenges of Scalable Deep Learning for Video Analytics


Details
Agenda
- Use case introduction & tool selection
- NLP & Entity Analytics
- Deep Learning Video Analytics
- Deployment tool selection and tips
Speakers
- Steven Flores is a cognitive engineer at Comp Three Inc. in San Jose. He leverages state-of-the-art methods in AI and machine learning to deliver novel business solutions to clients. In 2012, Steven earned his Ph.D. in applied math from the University of Michigan, and in 2017, he completed a postdoc in mathematical physics at the University of Helsinki and Aalto University.
- Luke Hosking is a software engineer at Comp Three Inc. in San Jose. He works with clients to enable machine learning applications by building data management solutions which integrate data from disparate systems. Luke most recently came from the Healthcare Technology world, where he was the technical leader of a start-up that improved care outcomes by connecting previously isolated clinical applications. He is a general technologist who has been working with computing since the days of Internet Radio tuners.
Abstract
When developing a machine learning system, the possibilities can seem limitless. With the recent explosion of big data and AI, however, there are more options than ever to filter through: which technologies to select, which model topologies to build, and which infrastructure to use for deployment, to name a few. We explored these options, and their many roadblocks, while building a faceted refinement system for a video content library of more than 100K videos. Our three primary areas of focus were natural language processing, video frame sampling, and infrastructure deployment.
We use natural language processing on video transcripts to extract verbally mentioned entities. At a high level, entity extraction can be thought of as identifying nouns in text along with their type within a taxonomy: for instance, extracting "George Washington" first as a name, and second as the first president of the United States. A number of general-purpose solutions exist for this, but none handle video text that is domain specific. We therefore had to develop custom entity-extraction models for different client domains, which we will describe.
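The George Washington example above can be illustrated with a minimal gazetteer lookup. The taxonomy entries and function names here are hypothetical, a sketch of the idea only; the custom models the talk describes are trained on domain text rather than rule-based:

```python
# Minimal sketch of taxonomy-backed entity extraction (hypothetical toy data;
# a production system would use a trained statistical NER model instead).

# Toy domain taxonomy: surface form -> increasingly specific types.
TAXONOMY = {
    "george washington": ["name", "first president of the United States"],
    "cat": ["animal", "feline"],
}

def extract_entities(text):
    """Return (surface form, types) for each taxonomy entry found in text."""
    lowered = text.lower()
    hits = []
    for surface, types in TAXONOMY.items():
        if surface in lowered:
            hits.append((surface, types))
    return hits

entities = extract_entities("George Washington reportedly kept a cat.")
# Both "george washington" and "cat" are found, each with its taxonomy types.
```

The dictionary lookup stands in for the interesting part, which is resolving a surface form to the right node in a domain-specific taxonomy.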
The second set of challenges involves extracting entities through visual means: grabbing frames from the video (for example, every 5 seconds) and then using object detection models to identify the visual entities in them. For example, a video of cats with no narration and no words such as "cat" or "feline" should still be grouped with other cat videos. A collection of object detection models used in conjunction with third-party services provides decent coverage, but again, domain-specific images, such as real estate, require custom models (built with TensorFlow). Additionally, intelligent sampling of video frames is critical for performance. We therefore developed heuristics that sample enough frames to avoid missing critical visual elements, without sampling every frame, which would be computationally infeasible for a large number of videos.
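A simple form of such a sampling heuristic converts a time interval into frame indices and caps the total frames per video. The 5-second interval comes from the text above; the frame budget and function name are assumptions for illustration, not the actual heuristics used:

```python
def sample_frame_indices(duration_s, fps, interval_s=5, max_frames=200):
    """Pick frame indices spaced interval_s apart, capped at max_frames.

    If the fixed spacing would yield too many frames (very long videos),
    widen the spacing so the total stays within the budget.
    """
    total_frames = int(duration_s * fps)
    step = int(interval_s * fps)
    # Widen the step if the interval-based spacing exceeds the frame budget.
    if total_frames // step + 1 > max_frames:
        step = total_frames // max_frames
    return list(range(0, total_frames, step))

# A 60-second clip at 30 fps, sampled every 5 seconds -> 12 frame indices.
indices = sample_frame_indices(60, 30)
```

The cap is what keeps cost bounded: a 10-hour video degrades gracefully to wider spacing instead of producing thousands of frames to run object detection on.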
Finally, there are a number of options today for deploying machine learning solutions and models. We evaluated several, including Google Cloud Machine Learning, GPU machines in the cloud, and building our own dedicated GPU machine in-house. We will finish by outlining the benefits and challenges of each deployment approach, and which one we ultimately chose.