This event was canceled

DataTalks #32: Optimizing Elements of BERT & AutoGAN-Distiller ✔️🧠

Our 32nd DataTalks meetup will be held online and will focus on optimizing and distilling neural networks.

š—­š—¼š—¼š—ŗ š—¹š—¶š—»š—ø: https://us02web.zoom.us/j/89415600010?pwd=bEpoOHIwV3pRbEFWL0RrT0NkR2dSUT09

š—”š—“š—²š—»š—±š—®:
šŸ”· 18:00 - Opening words
šŸ”¶ 18:10 - 19:00 – schuBERT: Optimizing Elements of BERT – Zohar Karnin, Principal Applied Scientist at AWS
šŸ”“ 19:00 - 19:50 – GAN Distillation with AutoGAN-Distiller – Yoav Ramon, ML Engineer at Hi Auto

---------------------

š˜€š—°š—µš˜‚š—•š—˜š—„š—§: š—¢š—½š˜š—¶š—ŗš—¶š˜‡š—¶š—»š—“ š—˜š—¹š—²š—ŗš—²š—»š˜š˜€ š—¼š—³ š—•š—˜š—„š—§ – š—­š—¼š—µš—®š—æ š—žš—®š—æš—»š—¶š—», š—£š—æš—¶š—»š—°š—¶š—½š—®š—¹ š—”š—½š—½š—¹š—¶š—²š—± š—¦š—°š—¶š—²š—»š˜š—¶š˜€š˜ š—®š˜ š—”š—Ŗš—¦

Transformers (Vaswani et al., 2017) have gradually become a key component for many state-of-the-art natural-language-representation models. A recent Transformer-based model — BERT (Devlin et al., 2018) — achieved state-of-the-art results on various natural-language-processing tasks, including GLUE, SQuAD v1.1, and SQuAD v2.0.

This model, however, is computationally prohibitive and has a huge number of parameters. In this work we revisit the architecture choices of BERT in an effort to obtain a lighter model. We focus on reducing the number of parameters, yet our methods can be applied to other objectives such as FLOPs or latency.

We show that much more efficient, lighter BERT models can be obtained by shrinking algorithmically chosen architecture design dimensions rather than simply reducing the number of Transformer encoder layers. In particular, our schuBERT gives 6.6% higher average accuracy on the GLUE and SQuAD datasets than a BERT with three encoder layers while having the same number of parameters.
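
As a rough illustration of that distinction (this is not the schuBERT search itself, and the shrunken dimension values below are arbitrary), the two shrinking strategies can be compared with the Hugging Face transformers library:

# Sketch: two ways of making BERT smaller, compared by parameter count.
# The concrete dimension values are illustrative, not the ones found by schuBERT.
from transformers import BertConfig, BertModel

def param_count(config: BertConfig) -> int:
    # Build a randomly initialized model just to count its parameters.
    return BertModel(config).num_parameters()

base = BertConfig()  # BERT-base defaults: 12 layers, hidden 768, 12 heads, FFN 3072
slim = BertConfig(hidden_size=384, num_attention_heads=6, intermediate_size=1024)  # keep 12 layers, shrink the design dimensions
shallow = BertConfig(num_hidden_layers=3)  # keep the dimensions, drop encoder layers

for name, cfg in [("BERT-base", base), ("slim", slim), ("shallow", shallow)]:
    print(f"{name}: {param_count(cfg) / 1e6:.1f}M parameters")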

š—£š—®š—½š—²š—æ š—¹š—¶š—»š—ø: https://www.amazon.science/publications/schubert-optimizing-elements-of-bert

š—•š—¶š—¼: Zohar is a Principal Research Scientist at Amazon.

---------------------

š—šš—”š—” š——š—¶š˜€š˜š—¶š—¹š—¹š—®š˜š—¶š—¼š—» š˜„š—¶š˜š—µ š—”š˜‚š˜š—¼š—šš—”š—”-š——š—¶š˜€š˜š—¶š—¹š—¹š—²š—æ – š—¬š—¼š—®š˜ƒ š—„š—®š—ŗš—¼š—», š— š—Ÿ š—˜š—»š—“š—¶š—»š—²š—²š—æ š—®š˜ š—›š—¶ š—”š˜‚š˜š—¼

GANs can get extremely big, reaching up to 1,200 GFLOPs (a GFLOP is one billion floating-point operations). For reference, MobileNet requires about 0.5 GFLOPs.
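
For a sense of where a given network sits on this scale, one can count its per-image multiply-accumulate operations with a profiler; the sketch below uses fvcore and torchvision's MobileNetV2 purely as an illustration:

import torch
import torchvision
from fvcore.nn import FlopCountAnalysis

# Illustrative only: count the multiply-accumulates MobileNetV2 spends on one image.
model = torchvision.models.mobilenet_v2(weights=None).eval()
image = torch.randn(1, 3, 224, 224)

flops = FlopCountAnalysis(model, image)
# Roughly a third of a billion multiply-accumulates per image, i.e. well under
# 1 GFLOP, versus hundreds of GFLOPs for large image-to-image GAN generators.
print(f"{flops.total() / 1e9:.2f} GMACs per image")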

This is why, in many cases, we want to reduce the number of parameters of our GANs, both to save costs when running in the cloud and to be able to run these networks on edge devices. The problem is that classical methods such as pruning and model distillation, which work well for other networks, do not work well for GANs. AutoGAN-Distiller (Yonggan Fu et al.) is the first practical way to reduce the number of parameters of such GANs, and it does so with constrained AutoML techniques.
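
The sketch below shows only the basic distillation idea: a small student generator is trained to match the outputs of a large, frozen teacher generator. It is not the AutoGAN-Distiller method itself, which additionally searches for the student architecture under a compute budget; all module names and sizes here are made up for illustration.

import torch
import torch.nn as nn

def make_generator(width: int) -> nn.Sequential:
    # Toy image-to-image generator; `width` stands in for the channel widths
    # that AutoGAN-Distiller would search over.
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, 3, 3, padding=1),
    )

teacher = make_generator(width=256).eval()   # stands in for a pretrained, expensive generator
student = make_generator(width=32)           # much narrower, hence far fewer FLOPs

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
criterion = nn.L1Loss()  # plain output matching; the real method uses richer objectives

for step in range(100):                      # stands in for iterating over a real dataset
    x = torch.randn(8, 3, 64, 64)            # dummy input batch
    with torch.no_grad():
        target = teacher(x)                  # frozen teacher provides the targets
    loss = criterion(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()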

In my lecture I will talk about this research, and also about a project I did that involved distilling MelGAN, a vocoder used for real-time text-to-speech generation.

š—£š—®š—½š—²š—æ š—¹š—¶š—»š—ø: https://arxiv.org/pdf/2006.08198v1.pdf

š—„š—²š—½š—¼: https://github.com/TAMU-VITA/AGD

š—•š—¶š—¼: Yoav Ramon is an ML Engineer and first worker at Hi Auto, A newly founded startup.

---------------------

š—­š—¼š—¼š—ŗ š—¹š—¶š—»š—ø: https://us02web.zoom.us/j/89415600010?pwd=bEpoOHIwV3pRbEFWL0RrT0NkR2dSUT09
