Name: High Performance Inferencing for LLMs
Start: 2025-11-01T09:00:00-07:00
End: 2025-11-01T10:30:00-07:00

Inferencing has become ubiquitous across cloud, regional, edge, and device environments, powering a wide spectrum of AI use cases spanning vision, language, and traditional machine learning applications. In recent years, Large Language Models (LLMs), initially developed for natural language tasks, have expanded to multi-modal applications including vision, speech, reasoning, and planning, each demanding distinct service-level objectives (SLOs). Achieving high-performance inferencing for such diverse workloads requires both model-level and system-level optimizations.

This talk focuses on system-level optimization techniques that maximize token throughput, meet user-experience metrics, and improve inference service-provider efficiency. We review several recent innovations, including KV caching, Paged/Flash/Radix Attention, Speculative Decoding, Prefill/Decode (P/D) Disaggregation, and KV Routing, and explain how these mechanisms enhance performance by reducing latency, memory footprint, and compute overhead. These techniques are implemented in leading open-source inference frameworks such as vLLM, SGLang, Hugging Face TGI, and NVIDIA NIM, which form the backbone of large-scale public and private LLM serving platforms.
The use of GPU training, inference, and analysis clusters with Multi-Instance GPU (MIG), and of federated models with QML applications, has now become practical.
Attendees will gain a practical understanding of the challenges in delivering scalable, low-latency LLM inference, and of the architectural and algorithmic innovations driving next-generation high-performance inference systems.

Prakash

Divya Roy

Taposh Roy

Silicon Valley Quantum & Advanced Computing Group

Silicon Valley Quantum Computing Group

Technology

Cloud Computing

High Scalability Computing

Statistical Computing

Artificial Intelligence

Machine Learning

Quantum Computing

Quantum Physics

Quantum Mechanics

Data Science

Ravishankar Ravindran has over 24 years of experience contributing to advanced data networking products and research. He currently serves in an advisory role at eOTF as Technical Director. Before that, as Telco Architect at F5, he led the system architecture and design for F5’s Telco Cloud platform, supporting 5G vRAN and Core workloads. His work included active participation in standards development, particularly in the O-RAN Alliance's Working Group 6 (Cloud Architecture and Orchestration), and contributions to the Nephio project under the Linux Foundation Networking (LFN), focusing on Kubernetes-based domain orchestration for Telco use cases spanning RAN, Core, and Transport networks. Previously, as Chief Architect at Corning Inc., he focused on the architecture and design of disaggregated RAN (CU/DU) for third-party cloud platforms and contributed to the integration of vDU with third-party O-RUs based on O-RAN’s Open Fronthaul (O-FH) specifications, with a focus on the M-plane. Prior to that, he served as Chief Architect at Sterlite Technologies (STL), where he was responsible for the end-to-end design of multi-tier RAN Intelligent Controllers (RICs), aimed at optimizing large-scale RAN systems through xApps such as Mobile Load Balancing, Traffic Steering, and Dynamic Spectrum Sharing. Before this, he led the Future and Network Theory Lab at Futurewei (Huawei Technologies) as a Principal Researcher, focusing on efficient networking for cloud robotics, autonomous vehicles, and drone systems. His research emphasized next-generation networking requirements, including information-centric networking (ICN), software-defined networking (SDN), and network virtualization, particularly addressing challenges in mobility, content distribution, and content-centric routing protocols.
Prior to this role, he was part of the CTO Office at Nortel, where he was a member of the Advanced Technology Group, working on research areas such as control plane routing protocols for IP/(G)MPLS, L2/L3 Virtualization services, scheduling problems in 4G wireless, and end-to-end QoE/QoS engineering for multimedia services. He later served as a Technology Advisor at Avaya. Ravindran has been an active contributor to numerous standardization bodies including the IETF, ITU, ATIS, the O-RAN Alliance, and LFN’s Nephio. He participated in the ITU’s Focus Group on IMT-2020, helping to define early standards for 5G. He holds a Ph.D. in Electrical Engineering from Carleton University, has served as an editor for the Springer Photonic Network Communications (PNET) journal, and has been part of technical program committees for top-tier conferences. Ravindran is a (co-)inventor on over 90 granted and filed U.S. patents (with additional patents pending) and has authored over 50 peer-reviewed papers in IEEE and ACM venues.

Ravishankar Ravindran

Online event
