[Paper Reading]: How much do language models memorize?

![[Paper Reading]: How much do language models memorize?](https://secure.meetupstatic.com/photos/event/a/5/9/9/highres_528462393.webp?w=750)
Details
This week, we will walk through and discuss the paper: How much do language models memorize?
[https://arxiv.org/pdf/2505.24832]
Abstract of the paper:
We propose a new method for estimating how much a model "knows" about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: *unintended memorization*, the information a model contains about a specific dataset, and *generalization*, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from 500K to 1.5B parameters and produce a series of scaling laws relating model capacity and data size to membership inference.
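To put the abstract's headline number in perspective, here is a minimal back-of-the-envelope sketch (our illustration for discussion, not code from the paper) that converts a parameter count into an estimated memorization capacity using the roughly 3.6 bits/parameter figure quoted above. The two model sizes are simply the endpoints of the 500K–1.5B range the authors trained:

```python
# Back-of-the-envelope capacity estimate based on the abstract's figure of
# ~3.6 bits of memorization capacity per parameter for GPT-style models.
BITS_PER_PARAM = 3.6

def capacity_megabytes(num_params: int) -> float:
    """Estimated total memorization capacity in megabytes."""
    bits = num_params * BITS_PER_PARAM
    return bits / 8 / 1e6  # bits -> bytes -> megabytes

# Endpoints of the 500K-1.5B parameter range studied in the paper.
for n in (500_000, 1_500_000_000):
    print(f"{n:>13,} params -> ~{capacity_megabytes(n):,.1f} MB")
```

At ~3.6 bits/parameter, the smallest models in the study would saturate at well under 1 MB of memorized data, while a 1.5B-parameter model could hold on the order of 675 MB; once a training set carries more information than that, the paper reports that unintended memorization gives way to generalization.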
-------------------------
We are a group of applied AI practitioners and enthusiasts who have formed a collective learning community. Every Wednesday evening (Pacific Time), we hold our research paper reading seminar covering an AI topic. One member carefully explains the paper, making it accessible to a broader audience. We then follow the reading with a more informal discussion and socializing.
You are welcome to join in person or over Zoom. SupportVectors is an AI training lab located in Fremont, CA, close to Tesla and easily accessible by road and BART. We follow the weekly sessions with snacks, soft drinks, and informal discussion.
If you want to attend over Zoom, the registration link will become visible once you RSVP. Note that we have changed the Zoom link and added security measures to prevent Zoom-bombing.


Every Wednesday until February 11, 2026