🚀 Abu Dhabi Machine Learning Meetup

We are looking forward to organizing the next ADML (Abu Dhabi Machine Learning) Meetup. Tentative theme:

Large-Scale, High-Throughput LLM Inference: The Art & Science of Building AI Systems at Scale

This meetup will bring together engineers, researchers, founders, and practitioners to discuss:

• High-throughput LLM inference architectures
• Batch processing and large-scale data enrichment
• vLLM, SGLang, TensorRT-LLM, Ray, and modern inference stacks
• GPU utilization, scheduling, batching, and cost optimization
• Structured extraction and derived data systems
• Production lessons learned from deploying AI at scale
• Open-source and frontier model ecosystems
• Building AI-native companies in the UAE and beyond

Whether you’re working on AI products, research, data platforms, inference infrastructure, or simply interested in learning from others in the ecosystem, we’d love to have you join us.

📍 Abu Dhabi
🗓 June 24, 2026
🌇 Evening event: 6 - 9 PM

If you’re interested in attending, speaking, sponsoring, hosting, or helping organize the meetup, please reach out!

Talk 1: SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Abstract:
Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
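
To make the third finding concrete, here is a minimal per-token sketch of combining a KD term with the language modeling loss. It is an illustration only, not the authors' implementation: the mixing weight `alpha`, the `temperature` parameter, the use of forward KL divergence, and all function names are assumptions that do not appear in the abstract.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def lm_loss(student_logits, target_id):
    """Cross-entropy of the student against the ground-truth next token."""
    return -log_softmax(student_logits)[target_id]

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary at one token position."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

def combined_loss(student_logits, teacher_logits, target_id,
                  alpha=0.5, temperature=1.0):
    """Weighted sum of the LM loss and the KD loss.

    alpha is a hypothetical mixing weight: alpha=1.0 is LM loss only,
    alpha=0.0 is KD only.
    """
    return (alpha * lm_loss(student_logits, target_id)
            + (1.0 - alpha) * kd_loss(student_logits, teacher_logits, temperature))

# Toy example with a 3-token vocabulary (values are made up for illustration).
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
print(combined_loss(student, teacher, target_id=0, alpha=0.5))
```

In a real training run these terms would be averaged over all token positions in a batch and computed with a tensor library; the sketch only shows how the two objectives are mixed.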

Bio:
Shengkun Tang is a PhD student in Machine Learning at MBZUAI, supervised by Prof. Zhiqiang Shen. He was a research intern on the Alibaba Qwen Pretraining Team, where he explored structured pruning and knowledge distillation to obtain powerful small LLMs. His research focuses on building efficient machine learning systems. He is interested in improving the full pipeline of modern foundation models, including inference, training, data efficiency, and novel efficient architectures.
