Context Caching
Context caching is a technique that lets developers save the processed state of a very large prompt (such as a whole book or codebase) so that it does not have to be reprocessed, and paid for at the full rate, on every question.
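Why It Matters
It can make working with very large documents much faster and, depending on the provider's pricing, up to about 90% cheaper, because the shared part of the prompt is not processed from scratch on every request.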
How It Works
1. The key-value (KV) states that the transformer's attention mechanism computes for the prompt are stored in GPU memory or on disk.
2. When a new request shares the same prefix as the cached data, the model reuses the stored KV states instead of recomputing them for those tokens, which significantly reduces Time-to-First-Token (TTFT).
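The two steps above can be illustrated with a toy sketch. It is not how any real inference server is implemented: `compute_kv` is a hypothetical placeholder for the expensive attention computation, and `PrefixKVCache` is an illustrative name, with plain tuples standing in for KV tensors. The point is only to show that a request sharing a cached prefix pays compute for its new tokens alone.

```python
from typing import Dict, List, Tuple

KV = Tuple[int, int]  # toy stand-in for one token's key/value tensors


def compute_kv(position: int, token: str) -> KV:
    """Hypothetical placeholder for the expensive per-token attention computation."""
    return (position, sum(ord(ch) for ch in token))


class PrefixKVCache:
    """Toy prefix cache: maps a token prefix to its already-computed KV states."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, ...], List[KV]] = {}
        self.tokens_computed = 0  # counts how much real work was done

    def _longest_cached_prefix(self, tokens: List[str]) -> Tuple[str, ...]:
        best: Tuple[str, ...] = ()
        for prefix in self._store:
            if len(prefix) > len(best) and tuple(tokens[: len(prefix)]) == prefix:
                best = prefix
        return best

    def prefill(self, tokens: List[str]) -> List[KV]:
        """Return KV states for all tokens, computing only the uncached suffix."""
        prefix = self._longest_cached_prefix(tokens)
        states = list(self._store.get(prefix, []))
        for position in range(len(prefix), len(tokens)):
            states.append(compute_kv(position, tokens[position]))
            self.tokens_computed += 1
        return states

    def cache_prefix(self, tokens: List[str]) -> None:
        """Compute the prefix once and keep its KV states for later requests."""
        self._store[tuple(tokens)] = self.prefill(tokens)


cache = PrefixKVCache()
document = [f"word{i}" for i in range(1000)]  # stands in for a long document
cache.cache_prefix(document)
work_for_document = cache.tokens_computed

question = ["what", "does", "clause", "4", "say", "?"]
states = cache.prefill(document + question)
work_for_question = cache.tokens_computed - work_for_document
print(work_for_document, work_for_question)  # prints: 1000 6
```

The match must be on the prefix: if a request changes even the first token of the document, no cached entry applies and everything is recomputed.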
Real-World Example
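If you upload a 500-page legal contract to Gemini 1.5 Pro and ask 50 different questions about it, context caching means the contract is processed at the full rate only once. Each of the 50 questions then reuses the cached contract, typically billed at a reduced cached-token rate (plus any storage fee the provider charges), rather than paying to process the whole contract 50 times.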
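The arithmetic behind this example can be sketched as follows. Every number here is a made-up assumption chosen for illustration, not real pricing: the contract size in tokens, the question length, the price per million input tokens, and the 90% discount on cached tokens. Storage fees and output tokens are ignored, and `input_cost` is a hypothetical helper.

```python
def input_cost(doc_tokens: int, question_tokens: int, questions: int,
               price_per_million: float, cached_discount: float) -> dict:
    """Compare input-token cost with and without caching (all inputs are assumptions)."""
    per_token = price_per_million / 1_000_000

    # Without caching: the whole document is billed at full price on every question.
    without_cache = questions * (doc_tokens + question_tokens) * per_token

    # With caching: full price once to process the document, then the
    # discounted rate for the cached document on each question.
    first_pass = doc_tokens * per_token
    cached_reads = questions * doc_tokens * per_token * (1 - cached_discount)
    fresh_tokens = questions * question_tokens * per_token
    with_cache = first_pass + cached_reads + fresh_tokens

    return {
        "without_cache": without_cache,
        "with_cache": with_cache,
        "saving_fraction": 1 - with_cache / without_cache,
    }


# Hypothetical inputs: 300,000-token contract, 50-token questions, 50 questions,
# $1.00 per million input tokens, cached tokens billed at a 90% discount.
result = input_cost(300_000, 50, 50, 1.00, 0.90)
print(f"without cache: ${result['without_cache']:.2f}")
print(f"with cache:    ${result['with_cache']:.2f}")
print(f"saving:        {result['saving_fraction']:.0%}")
```

With these inputs the saving comes out near 88%, slightly below the assumed 90% discount, because the contract still has to be processed once at full price.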