DataTalks #32: Optimizing Elements of BERT & AutoGAN-Distiller
Details
Our 32nd DataTalks meetup will be held online and will focus on optimizing and distilling neural networks.
Registration using the link is mandatory!
Registration link: https://floor28.co.il/event/f00dbba5-7f1a-4588-8fe4-a221e41b363a
Agenda:
18:00 - Opening words
18:10 - 19:00 – schuBERT: Optimizing Elements of BERT – Zohar Karnin, Principal Applied Scientist at AWS
19:00 - 19:50 – GAN Distillation with AutoGAN-Distiller – Yoav Ramon, ML Engineer at Hi Auto
---------------------
schuBERT: Optimizing Elements of BERT – Zohar Karnin, Principal Applied Scientist at AWS
Transformers have gradually become a key component of many state-of-the-art natural language representation models. The recent transformer-based model BERT achieved state-of-the-art results on various natural language processing tasks, including GLUE, SQuAD v1.1, and SQuAD v2.0. This model, however, is computationally prohibitive and has a huge number of parameters.
In this work we revisit the architecture choices of BERT in an effort to obtain a lighter model. We focus on reducing the number of parameters, yet our methods can be applied to other objectives such as FLOPs or latency.
We show that much more efficient light models can be obtained by reducing algorithmically chosen, correct architecture design dimensions rather than the common choice of reducing the number of Transformer encoder layers. In particular, our method uncovers the usefulness of a non-standard design choice for multi-head attention layers, making them much more efficient. By applying our findings, our schuBERT gives 6.6% higher average accuracy on the GLUE and SQuAD datasets compared to BERT with three encoder layers, while having the same number of parameters.
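To make the "shrink the design dimensions, not the layer count" idea concrete, here is a minimal sketch using the Hugging Face transformers library. It is not the paper's code, and the slim configuration values are illustrative assumptions, not schuBERT's searched values:

```python
# Minimal sketch, assuming `transformers` and `torch` are installed.
# Compares the common layer-cutting approach with shrinking per-layer
# design dimensions instead. The slim values are illustrative only.
from transformers import BertConfig, BertModel

# Option A: keep BERT-base dimensions (hidden=768, 12 heads, FFN=3072),
# but cut the encoder down to 3 layers.
shallow_cfg = BertConfig(num_hidden_layers=3)

# Option B: keep all 12 layers and shrink the per-layer dimensions instead
# (hidden size must stay divisible by the number of attention heads).
slim_cfg = BertConfig(
    num_hidden_layers=12,
    hidden_size=384,
    num_attention_heads=6,
    intermediate_size=1024,
)

for name, cfg in [("3-layer BERT", shallow_cfg), ("slim 12-layer BERT", slim_cfg)]:
    model = BertModel(cfg)
    print(f"{name}: {model.num_parameters() / 1e6:.1f}M parameters")
```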
Paper link: https://www.aclweb.org/anthology/2020.acl-main.250.pdf
Bio: Zohar Karnin received his Ph.D. in computer science from the Technion, Israel Institute of Technology, in 2011. His research interests are in the area of large-scale and online machine learning algorithms. He is currently a Principal Scientist at Amazon AWS AI, leading the science for multiple efforts in SageMaker, an environment for machine learning development.
---------------------
GAN Distillation with AutoGAN-Distiller – Yoav Ramon, ML Engineer at Hi Auto
GANs can get extremely big, requiring up to 1,200 GFLOPs per inference (a GFLOP is one billion floating-point operations). For reference, MobileNet requires about 0.5 GFLOPs.
This is why, in many cases, we want to lower the number of parameters of our GANs, whether to save costs when running in the cloud or to be able to run those networks on edge devices. The problem is that classical methods such as pruning or model distillation, which work well for other networks, don't work well for GANs. AutoGAN-Distiller (Yonggan Fu et al.) is the first practical way to lower the computation and number of parameters of such GANs, and it does so with constrained AutoML techniques.
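As a rough, hedged illustration of what "constrained" means here (this is not the AutoGAN-Distiller code; `student`, `teacher`, and `estimate_flops` are hypothetical placeholders, and the budget and weight are made up): candidate compressed generators are trained to mimic the original GAN while paying a penalty whenever their estimated compute exceeds a FLOPs budget.

```python
# Hedged sketch only, not the paper's implementation.
import torch
import torch.nn.functional as F

FLOPS_BUDGET = 3e9   # e.g. allow ~3 GFLOPs per forward pass (illustrative)
LAMBDA = 1e-9        # weight of the budget penalty (illustrative)

def constrained_distillation_loss(student, teacher, x, estimate_flops):
    # Distillation term: the compact student mimics the original generator.
    with torch.no_grad():
        target = teacher(x)
    distill = F.l1_loss(student(x), target)

    # Constraint term: penalize only the compute that exceeds the budget,
    # so candidates under budget compete purely on output quality.
    overshoot = max(estimate_flops(student) - FLOPS_BUDGET, 0.0)
    return distill + LAMBDA * overshoot
```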
In my lecture I will talk about this research, and I will also describe a project I did that involved distilling MelGAN, a vocoder used for real-time text-to-speech generation.
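As background for the vocoder part, one generic way to distill a large neural vocoder into a smaller one is to train the compact student to reproduce the teacher's waveforms, e.g. with a time-domain L1 loss plus a spectral term. The sketch below shows that general recipe, not the speaker's actual setup; `student`, `teacher`, and `mels` are hypothetical placeholders.

```python
# Generic vocoder-distillation sketch (not the speaker's code). Assumes the
# student and teacher map mel-spectrograms to waveforms of shape (batch, samples).
import torch

def stft_mag_loss(pred, target, n_fft=1024, hop_length=256):
    # Match magnitude spectrograms so the student reproduces the teacher's
    # spectral content, not just raw sample values.
    window = torch.hann_window(n_fft, device=pred.device)
    p = torch.stft(pred, n_fft, hop_length, window=window, return_complex=True).abs()
    t = torch.stft(target, n_fft, hop_length, window=window, return_complex=True).abs()
    return torch.mean(torch.abs(p - t))

def vocoder_distill_loss(student, teacher, mels):
    with torch.no_grad():
        teacher_wav = teacher(mels)      # waveform from the large vocoder
    student_wav = student(mels)          # waveform from the compact vocoder
    time_loss = torch.mean(torch.abs(student_wav - teacher_wav))
    return time_loss + stft_mag_loss(student_wav, teacher_wav)
```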
Paper link: https://arxiv.org/pdf/2006.08198v1.pdf
Repo: https://github.com/TAMU-VITA/AGD
Bio: Yoav Ramon is an ML Engineer and the first employee at Hi Auto, a newly founded startup.
---------------------
Registration using the link is mandatory!
Registration link: https://floor28.co.il/event/f00dbba5-7f1a-4588-8fe4-a221e41b363a
