Saturday, June 15, 2024

LLM Memorization

Michael Bommarito

Adviser, entrepreneur, educator, investor.

"LLMs don't memorize things. They *grok*. They pattern match. But they can't replicate their input data."

It's an argument I hear daily.

It's a nice idea, well-reinforced and repeated by some specific groups.

The problem is that it's just not true.

Base models trained to minimize next-token prediction loss have been designed to memorize from the start.

It's not even a newly discovered phenomenon: Google published results on this back in 2022 (presented at ICLR 2023). As Carlini et al. report in their abstract:

"Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model."

It's never been a secret or a surprise that bigger models inevitably memorize popular content.

So next time someone tells you that models can't memorize, help teach them.

While not all models memorize all input data, this ability emerges surprisingly quickly and reliably as models scale or examples appear more frequently.
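The test behind these results is simple to sketch. Below is a minimal, illustrative version of the extraction-style check Carlini et al. describe: an example counts as memorized if, prompted with its first k tokens, greedy decoding reproduces the next m tokens verbatim. The `toy_generate` lookup model here is a hypothetical stand-in for a real LLM's decoding call; in practice you would swap in an actual model's generate function.

```python
def is_memorized(example_tokens, generate, k=50, m=50):
    """Extraction-style memorization check: does `generate`, prompted with
    the first k tokens of a training example, reproduce the next m tokens
    verbatim?"""
    prefix = example_tokens[:k]
    target = example_tokens[k:k + m]
    return generate(prefix, max_new_tokens=len(target)) == target

# Hypothetical stand-in for a real LLM: it has "memorized" exactly one
# training sequence and falls back to padding tokens on anything else.
TRAINING_SEQ = list(range(120))  # pretend token IDs

def toy_generate(prefix, max_new_tokens):
    n = len(prefix)
    if TRAINING_SEQ[:n] == prefix:  # prefix seen in training: verbatim recall
        return TRAINING_SEQ[n:n + max_new_tokens]
    return [0] * max_new_tokens    # unseen prefix: no memorized continuation

print(is_memorized(TRAINING_SEQ, toy_generate))         # True: training data leaks
print(is_memorized(list(range(1, 121)), toy_generate))  # False: unseen sequence
```

The three factors in the quote above map directly onto this setup: bigger models memorize more sequences, duplicated examples are more likely to pass the check, and increasing k makes recovery easier.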

Here are a few more resources on the topic:

1. Emergent and Predictable Memorization in Large Language Models. Biderman et al., 2023. https://lnkd.in/dPthWVw8

2. Quantifying Memorization Across Neural Language Models. Carlini et al., 2022. https://lnkd.in/dZc_XDMH

3. Patronus CopyrightCatcher. https://lnkd.in/dWcGBM6k

4. leeky, an open-source library. https://lnkd.in/dvFPng3j
