Generative AI Paper Reading: Next Token Prediction Towards Multimodal Survey

Name: Generative AI Paper Reading: Next Token Prediction Towards Multimodal Survey
Start: 2025-05-19T17:30:00-07:00
End: 2025-05-19T19:30:00-07:00

Hosted By

Matt W.

Details

Join us for a paper discussion on "Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey".
The survey explores unified frameworks for multimodal understanding and generation built on the next-token prediction (NTP) paradigm.

## Featured Paper

"Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey" (Chen et al., 2024)
arXiv Paper

Discussion Topics

## Multimodal Tokenization

Discrete vs. continuous tokenization strategies (VQ-VAE, CLIP, HuBERT)
Tradeoffs between reconstruction fidelity and computational efficiency
Challenges in temporal alignment for video/audio and spatial modeling for images
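
As a warm-up for the tokenization discussion, here is a minimal sketch of discrete tokenization by nearest-codebook lookup, the core step in VQ-style tokenizers. The feature vectors and codebooks are made-up toy values, and the function names are hypothetical; a real tokenizer learns its codebook jointly with an encoder and decoder.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector` (squared L2)."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best_index, codebook[best_index]


def reconstruction_error(vectors, codebook):
    """Mean squared distance between each vector and its quantized code."""
    total = 0.0
    for vec in vectors:
        _, code = quantize(vec, codebook)
        total += sum((v - c) ** 2 for v, c in zip(vec, code))
    return total / len(vectors)


# Toy 2-D "patch features" (illustrative values, not from the paper).
features = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.5]]
small_codebook = [[0.0, 0.0], [1.0, 1.0]]
large_codebook = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]

# Each continuous vector becomes one discrete token id.
small_tokens = [quantize(f, small_codebook)[0] for f in features]
large_tokens = [quantize(f, large_codebook)[0] for f in features]

print(small_tokens, reconstruction_error(features, small_codebook))
print(large_tokens, reconstruction_error(features, large_codebook))
```

In this toy case the larger codebook reconstructs the features more closely, at the cost of a bigger vocabulary for the downstream model to predict over, which is one face of the fidelity vs. efficiency tradeoff listed above.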

## Model Architectures

Compositional Models: External encoders/decoders (e.g., CLIP for vision, Whisper for audio)
Unified Models: End-to-end NTP frameworks (e.g., VAR, Transfusion, Moshi)
Hybrid approaches balancing modality-specific and shared components
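
One way to picture a unified NTP model is that text tokens and image codes share a single id space, so a single next-token head can predict either modality. The sketch below assumes a simple offset scheme with begin/end-of-image markers; the vocabulary sizes, marker ids, and function names are hypothetical and not taken from any specific model in the survey.

```python
TEXT_VOCAB_SIZE = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 512   # hypothetical VQ codebook size
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image marker id
EOI = BOI + 1                                # end-of-image marker id


def to_unified(text_ids, image_codes):
    """Concatenate text ids and offset image codes into one token sequence."""
    image_ids = [TEXT_VOCAB_SIZE + code for code in image_codes]
    return list(text_ids) + [BOI] + image_ids + [EOI]


def modality_of(token_id):
    """Recover which modality a unified token id belongs to."""
    if token_id < TEXT_VOCAB_SIZE:
        return "text"
    if token_id < BOI:
        return "image"
    return "marker"


sequence = to_unified([5, 42], [0, 511])
print(sequence)
print([modality_of(t) for t in sequence])
```

A compositional model, by contrast, would keep the image side in an external encoder or decoder and pass features rather than shared token ids.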

## Training Objectives

Discrete token prediction (DTP) vs. continuous token prediction (CTP)
Alignment strategies for cross-modal pretraining (e.g., contrastive learning, reconstruction loss)
Instruction tuning and preference alignment (RLHF/DPO) for human-centric outputs
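
To make the DTP vs. CTP distinction concrete, the sketch below contrasts a cross-entropy loss over a discrete vocabulary with a regression loss on a continuous token vector. Mean squared error is used here only as a simple stand-in for a continuous objective; models covered in the survey may use other continuous losses (for example diffusion-based ones). Function names are illustrative.

```python
import math


def discrete_token_loss(logits, target_index):
    """Cross-entropy of the target token id under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def continuous_token_loss(predicted, target):
    """Mean squared error between predicted and target continuous tokens."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# DTP: the model outputs one logit per vocabulary entry.
print(discrete_token_loss([2.0, 0.5, -1.0], target_index=0))

# CTP: the model outputs a vector that should match the target embedding.
print(continuous_token_loss([1.0, 2.0], [1.0, 4.0]))
```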

## Performance Benchmarks

Vision: 42% accuracy gain on biomedical literature, 58% latency reduction in legal docs
Audio: 37% improvement in technical manual comprehension
Cross-Modal: Robustness in integrating tabular data with text (e.g., financial reports)

## Implementation Challenges

Memory overhead (22–38% increase vs. traditional RAG)
Privacy-preserving techniques for medical/legal data
Hardware optimization for hybrid CPU-GPU workflows

## Future Directions

Scaling laws for multimodal NTP models
Federated structurization for distributed training
Neuromorphic hardware integration for real-time video analysis

## Key Technical Insights

Tokenization: Vector quantization (VQ) enables discrete representation of continuous data (images, audio).
Inference Optimization: Adaptive attention masking for causal/semi-causal processing.
Unified Architectures: Models like Unified-IO and Emu3 demonstrate joint understanding/generation capabilities.
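
The attention-masking point can be illustrated with a small sketch. It assumes "semi-causal" means that tokens inside a designated span (for example, the patches of one image) attend to each other bidirectionally while the rest of the sequence stays causal; that reading, and the function names, are assumptions for illustration rather than the survey's definition.

```python
def causal_mask(length):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(length)] for q in range(length)]


def semi_causal_mask(length, bidirectional_spans):
    """Causal mask, except positions inside each (start, end) span see each other.

    `end` is exclusive. Positions outside the spans remain strictly causal.
    """
    mask = causal_mask(length)
    for start, end in bidirectional_spans:
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask


# Sequence of 5 tokens where positions 1..3 form one image block.
mask = semi_causal_mask(5, [(1, 4)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Position 1 can now see position 3 inside the block, but nothing in the block can see position 4, so generation after the block still proceeds left to right.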

---
Silicon Valley Generative AI has two meeting formats.

1. Paper Reading - Every second week we meet to discuss machine learning papers. This is a collaboration between Silicon Valley Generative AI and Boulder Data Science.

2. Talks - Once a month we meet to have someone present on a topic related to generative AI. Speakers include industry leaders, researchers, startup founders, subject matter experts, and anyone with an interest in a topic who would like to share it. Topics range from technical to business-focused: how the latest generative models work and how they can be used, applications and adoption of generative AI, project demos and startup pitches, and legal and ethical issues. The talks are meant to be inclusive and are aimed at a more general audience than the paper readings.

If you would like to be a speaker, please contact:
Matt White

Silicon Valley Generative AI – A GenAI Collective Member

Every 2 weeks on Monday

Online event

FREE