What is Prompt Engineering?

Prompt engineering is the technique of controlling model behavior by shaping its context. Since an LLM has no long-term memory and no goals of its own, you need to be specific about what you ask for.

The model only predicts the next most likely token for a given sequence of tokens.

What is a token?

A token is the smallest unit of text an LLM can process. Text is broken into tokens before being sent to the model. A token can be a whole word, a part of a word (for example, 'engine' might be split into "eng" and "ine"), or a punctuation mark like a comma or a period.

For example:

"Prompt caching is useful" might be split like

["Prompt", " caching", " is", " useful"]

Ideal Structure of a Prompt

Every prompt is processed as layers of instructions, whether or not you write it in layers:

[System intent]
[Constraints & rules]
[Knowledge / context]
[Task]
[Input data]
[Output format]

If you don’t define a layer, the LLM will fill it with a default and process it anyway. This is why outputs are often unpredictable.
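
As an illustration, you can make the layers explicit by assembling the prompt from named parts before sending it. The values below are illustrative, based on the Redis example later in this post:

# Build the prompt from explicit layers so nothing is left to a default.
system_intent = "You are a senior backend engineer."
constraints   = "Avoid marketing language. Use bullet points."
knowledge     = "The reader maintains a Rails app that already uses PostgreSQL."
task          = "Explain Redis to a junior Rails engineer."
input_data    = ""  # no request-specific data in this example
output_format = "Respond with a short bulleted list."

layers = [system_intent, constraints, knowledge, task, input_data, output_format]
prompt = "\n\n".join(part for part in layers if part)  # skip empty layers
print(prompt)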

Bad example

Explain redis

Here, several things are missing:

- Audience?
- Depth?
- Format?
- Constraints?

Good example

You are a senior backend engineer.
Explain Redis to a junior Rails engineer.
Focus on data structures and use cases.
Avoid marketing language.
Use bullet points.

Ultimate template

## IDENTITY
You are an expert {{ROLE}} with deep experience in {{DOMAIN}}.

## GOAL
Your task is to {{PRIMARY_OBJECTIVE}}.

## CONTEXT
Here is the relevant context you must use:
{{CONTEXT}}

## CONSTRAINTS
- Assume the audience is {{AUDIENCE_LEVEL}}
- Be precise and avoid unnecessary verbosity
- Prefer concrete examples over abstract explanations
- If assumptions are required, state them explicitly
- Do NOT hallucinate facts; say “unknown” when appropriate

## QUALITY BAR
A high-quality answer will:
- Be logically structured and easy to follow
- Address edge cases and trade-offs
- Explain *why*, not just *what*
- Use correct terminology for {{DOMAIN}}

## OUTPUT FORMAT
Respond using the following structure:
1. Summary (3–5 bullet points)
2. Main explanation
3. Examples (if applicable)
4. Common mistakes / pitfalls
5. Next steps or further reading

## TASK INPUT
{{INPUT}}
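
The {{PLACEHOLDER}} slots can be filled programmatically before the prompt is sent. Here is a minimal Python sketch; the render_prompt helper, the template file name, and the example values are assumptions for illustration, not part of any provider SDK (a real project might use a templating engine such as Jinja2 instead):

# Fill {{PLACEHOLDER}} slots in the template with concrete values.
def render_prompt(template: str, values: dict[str, str]) -> str:
    prompt = template
    for key, value in values.items():
        prompt = prompt.replace("{{" + key + "}}", value)
    return prompt

with open("ultimate_template.txt") as f:   # the template above, saved to a file
    template = f.read()

prompt = render_prompt(template, {
    "ROLE": "backend engineer",
    "DOMAIN": "caching and data stores",
    "PRIMARY_OBJECTIVE": "explain Redis to a junior Rails engineer",
    "CONTEXT": "The team runs a Rails monolith backed by PostgreSQL.",
    "AUDIENCE_LEVEL": "junior",
    "INPUT": "When should we add Redis to our stack?",
})
print(prompt)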

Save Cost Using Prompt Caching

You can reduce LLM inference costs by up to 10× by leveraging prompt caching. This optimization is implemented at the LLM provider level.

When a request is sent, the provider tokenizes the prompt and computes internal intermediate representations (often referred to as attention key/value states). These intermediate values are expensive to compute, so providers cache them. If a new request shares an identical prompt prefix with a previously cached prompt, the model can reuse these cached states instead of recomputing them—saving both compute and time.

To benefit from prompt caching, you must structure your prompts intentionally. The key rule is: Place static, reusable content at the top of the prompt, and dynamic, request-specific content at the end.

Example

[System rules]
[Tool specs]
[Shared knowledge]
---
[User input]

With this layout, the static prefix is more likely to remain unchanged across requests, increasing cache hit rates. For more detail, see: Prompt caching: 10x cheaper LLM tokens, but how?
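
As a concrete sketch, here is how that layout might look with a chat-style API that takes a list of role/content messages. The STATIC_PREFIX text and build_messages helper are illustrative; whether and how the prefix is actually cached (minimum prefix length, cache markers, billing) depends on your provider.

# The static prefix stays byte-identical across requests; only the user input changes.
STATIC_PREFIX = """[System rules]
[Tool specs]
[Shared knowledge]"""

def build_messages(user_input: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_PREFIX},  # unchanged across requests -> cacheable
        {"role": "user", "content": user_input},       # dynamic part goes last
    ]

messages = build_messages("Summarize today's error logs.")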

Benefits

  • Lower latency: Faster time to first token due to reused computation
  • 💸 Lower cost: Cached tokens are significantly cheaper (often ~10×)
  • 📈 Better scalability: More efficient handling of high-throughput workloads

By treating prompt structure as an optimization surface—much like query planning or caching in traditional systems—you can achieve substantial performance and cost improvements.