Introduction: Navigating the Depths of Advanced Prompt Engineering
Welcome back to our exploration of the transformative world of prompt engineering! After delving into the basics in our beginner’s guide, it’s time to elevate our journey into the realm of advanced prompting techniques.
As we venture beyond the fundamentals, we enter a space where precision, creativity, and a deep understanding of large language models (LLMs) converge. Advanced prompt engineering isn’t just about asking the right questions; it’s about orchestrating the AI to perform intricate dances of logic, creativity, and analysis. This is where true AI craftsmanship comes to life.
In this guide, we’ll unravel the more complex aspects of prompt engineering. From nuanced control of language model outputs to fine-tuning prompts for specific, sophisticated tasks, we’ll explore how to harness the full potential of these powerful AI tools. Whether it’s generating nuanced text, extracting intricate data, or even guiding AI in subtle emotional responses, advanced prompting techniques open up a world of possibilities.
Topics to Cover
1. Zero-shot Prompting
2. Few-shot Prompting
3. Chain-of-Thought Prompting
4. Self-Consistency
5. Generated Knowledge Prompting
6. Tree of Thoughts
7. Retrieval Augmented Generation
8. Automatic Reasoning and Tool-use
1. Zero-shot Prompting
- LLMs today, such as GPT-3 and GPT-4, are tuned to follow instructions and are trained on large amounts of data, so they are capable of performing some tasks “zero-shot.”
- We tried a few zero-shot examples in the previous section. Here is one of the examples we used:
- Prompt:
Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:
Output:
Neutral
Note that in the prompt above we didn’t provide the model with any examples of text alongside their classifications; the LLM already understands “sentiment.” That’s its zero-shot capability at work.
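To make this concrete in code, here is a minimal sketch of a zero-shot classification call. The `call_llm` helper is a hypothetical placeholder for whichever LLM API you use; only the prompt construction matters here.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical helper: replace the body with your provider's
    # completion call (e.g., a chat-completion request).
    raise NotImplementedError("Wire this up to your LLM provider.")

def classify_sentiment_zero_shot(text: str) -> str:
    # No labeled examples are included: the instruction alone carries the task.
    prompt = (
        "Classify the text into neutral, negative or positive.\n"
        f"Text: {text}\n"
        "Sentiment:"
    )
    return call_llm(prompt).strip()

# Example usage:
# classify_sentiment_zero_shot("I think the vacation is okay.")  # -> "Neutral"
```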
2. Few-shot Prompting
- While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting can be used as a technique to enable in-context learning, where we provide demonstrations in the prompt to steer the model toward better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.
- Let’s demonstrate few-shot prompting via an example.
- Prompt:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that usesthe word whatpu is:We were traveling in Africa and we saw these very cute whatpus.To do a "farduddle" means to jump up and down really fast. An example of a sentence that usesthe word farduddle is:
Output:
When we won the game, we all started to farduddle in celebration.
We can observe that the model has learned how to perform the task from just a single example (i.e., 1-shot). For more difficult tasks, we can experiment with increasing the number of demonstrations (e.g., 3-shot, 5-shot, 10-shot, etc.).
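As a minimal sketch of how such a prompt might be assembled in code, the snippet below concatenates demonstration pairs ahead of the new query. The `call_llm` helper is again a hypothetical stand-in for your LLM API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your LLM provider's completion call."""
    raise NotImplementedError

def few_shot_prompt(demonstrations: list[tuple[str, str]], query: str) -> str:
    # Each (input, output) pair conditions the model on the task format.
    lines = []
    for question, answer in demonstrations:
        lines.append(question)
        lines.append(answer)
    lines.append(query)  # the model is expected to complete this last item
    return "\n".join(lines)

demos = [
    ('A "whatpu" is a small, furry animal native to Tanzania. '
     "An example of a sentence that uses the word whatpu is:",
     "We were traveling in Africa and we saw these very cute whatpus."),
]
query = ('To do a "farduddle" means to jump up and down really fast. '
         "An example of a sentence that uses the word farduddle is:")
# completion = call_llm(few_shot_prompt(demos, query))
```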
3. Chain-of-Thought Prompting
Chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.
Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.
The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
A: Adding all the odd numbers (11, 13) gives 24. The answer is True.
The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
Output:
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
Wow! We can see a perfect result when we provide the reasoning steps. In fact, we can solve this task with even fewer examples; just one seems to be enough:
Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25.
The answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
Output:
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
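In code, few-shot CoT simply means prepending one or more worked exemplars (question, reasoning, answer) to the new question. Here is a minimal sketch under the same assumption that `call_llm` stands in for your LLM API.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's call.")

# One worked exemplar (question + reasoning + answer) is prepended to the new
# question so the model imitates the reasoning format before answering.
COT_EXEMPLAR = (
    "The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.\n"
    "A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.\n"
)

def cot_answer(question: str) -> str:
    return call_llm(f"{COT_EXEMPLAR}{question}\nA:")

# cot_answer("The odd numbers in this group add up to an even number: "
#            "15, 32, 5, 13, 82, 7, 1.")
```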
Zero-Shot CoT Prompting
A more recent idea is zero-shot CoT, which essentially involves adding “Let’s think step by step” to the original prompt. Let’s try a simple problem and see how the model performs:
Prompt:
I went to the market and bought 10 apples.
I gave 2 apples to the neighbor and 2 to the repairman.
I then went and bought 5 more apples and ate 1.
How many apples did I remain with?
Output:
11 apples
The answer is incorrect! Now let’s try with the special prompt.
Prompt:
I went to the market and bought 10 apples.
I gave 2 apples to the neighbor and 2 to the repairman.
I then went and bought 5 more apples and ate 1.
How many apples did I remain with?
Let's think step by step.
Output:
First, you started with 10 apples.
You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left.
Then you bought 5 more apples, so now you had 11 apples.
Finally, you ate 1 apple, so you would remain with 10 apples.
It’s impressive that this simple prompt is effective at this task. It is particularly useful when you don’t have many examples to include in the prompt.
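The mechanics are trivial to express in code: append the trigger phrase to the question before calling the model. A minimal sketch, again assuming a hypothetical `call_llm` helper:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM provider's completion call."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Appending the trigger phrase nudges the model to emit intermediate steps.
    return call_llm(f"{question}\nLet's think step by step.")

# zero_shot_cot("I went to the market and bought 10 apples. "
#               "I gave 2 apples to the neighbor and 2 to the repairman. "
#               "I then went and bought 5 more apples and ate 1. "
#               "How many apples did I remain with?")
```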
Automatic Chain-of-Thought (Auto-CoT)
When applying chain-of-thought prompting with demonstrations, the process involves hand-crafting effective and diverse examples. This manual effort can lead to suboptimal solutions. Auto-CoT removes the manual effort by leveraging LLMs with the “Let’s think step by step” prompt to generate reasoning chains for the demonstrations one by one. This automatic process can still produce mistakes in the generated chains, and to mitigate their effect the diversity of the demonstrations matters: Auto-CoT samples questions with diversity and generates reasoning chains to construct the demonstrations.
Auto-CoT consists of two main stages:
- Stage 1, question clustering: partition the questions of a given dataset into a few clusters.
- Stage 2, demonstration sampling: select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with simple heuristics.
The simple heuristics could be length of questions (e.g., 60 tokens) and number of steps in rationale (e.g., 5 reasoning steps). This encourages the model to use simple and accurate demonstrations.
Code for Auto-CoT is available here.
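The sketch below illustrates the two stages under some simplifying assumptions: `call_llm` is a hypothetical stand-in for your LLM API, TF-IDF plus k-means stands in for the sentence embeddings used in practice, and the length/step heuristics are approximated with word and sentence counts.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's call.")

def auto_cot_demonstrations(questions: list[str], n_clusters: int = 4,
                            max_words: int = 60, max_steps: int = 5) -> list[str]:
    # Stage 1: cluster the questions so the demonstrations cover diverse types.
    vectors = TfidfVectorizer().fit_transform(questions)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vectors)

    demos = []
    for cluster in range(n_clusters):
        members = [q for q, lab in zip(questions, labels) if lab == cluster]
        for q in members:
            if len(q.split()) > max_words:
                continue  # heuristic: keep the representative question short
            # Stage 2: generate the rationale with Zero-Shot-CoT.
            rationale = call_llm(f"Q: {q}\nA: Let's think step by step.")
            # Heuristic: keep rationales with few reasoning steps
            # (sentence count used as a rough proxy here).
            if rationale.count(".") <= max_steps:
                demos.append(f"Q: {q}\nA: Let's think step by step. {rationale}")
                break
    return demos
```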
4. Self-Consistency
Perhaps one of the more advanced techniques out there for prompt engineering is self-consistency. Proposed by Wang et al. (2022), self-consistency aims “to replace the naive greedy decoding used in chain-of-thought prompting.” The idea is to sample multiple, diverse reasoning paths through few-shot CoT, and use the generations to select the most consistent answer. This helps boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.
Let’s try the following example for arithmetic reasoning:
Prompt:
When I was 6 my sister was half my age. Now I'm 70 how old is my sister?
Output:
35
The output is wrong! How can we improve this with self-consistency? Let’s try it out. We will use the few-shot exemplars from Wang et al. (2022) (Table 17):
Prompt:
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent $15. She has $8 left.
Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:
Output 1:
When I was 6 my sister was half my age, so she was 3.
Now I am 70, so she is 70 - 3 = 67.
The answer is 67.
Output 2:
When the narrator was 6, his sister was half his age, which is 3.
Now that the narrator is 70, his sister would be 70 - 3 = 67 years old.
The answer is 67.
Output 3:
When I was 6 my sister was half my age, so she was 3.
Now I am 70, so she is 70/2 = 35. The answer is 35.
Computing the final answer involves a few more steps, but for the sake of simplicity we can see that a majority answer is already emerging, so that would essentially become the final answer.
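A minimal sketch of that sampling-and-voting loop is shown below. It assumes a hypothetical `call_llm` helper that performs sampled (non-greedy) decoding, and it naively extracts the last number in each completion as the final answer.

```python
import re
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for a sampled (non-greedy) completion call."""
    raise NotImplementedError

def self_consistent_answer(cot_prompt: str, n_samples: int = 5) -> str:
    # Sample several diverse reasoning paths instead of a single greedy decode.
    answers = []
    for _ in range(n_samples):
        completion = call_llm(cot_prompt, temperature=0.7)
        numbers = re.findall(r"-?\d+", completion)
        if numbers:
            answers.append(numbers[-1])  # crude final-answer extraction
    if not answers:
        return ""
    # Marginalize over reasoning paths by majority vote on the final answers.
    return Counter(answers).most_common(1)[0][0]
```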
5. Generated Knowledge Prompting
LLMs continue to improve, and one popular technique is to incorporate knowledge or information into the prompt to help the model make more accurate predictions.
Let’s try a simple prompt:
Prompt:
Part of golf is trying to get a higher point total than others. Yes or No?
Output:
Yes.
This type of mistake reveals the limitations of LLMs in performing tasks that require more knowledge about the world. How do we improve this with knowledge generation?
First, we generate a few “knowledges”:
Prompt:
Input: Greece is larger than mexico.
Knowledge: Greece is approximately 131,957 sq km, while
Mexico is approximately 1,964,375 sq km, making Mexico 1,389% larger than
Greece.
Input: Glasses always fog up.
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your
sweat, breath, and ambient humidity lands on a cold surface, cools,
and then changes into tiny drops of liquid, forming a film that you see as fog.
Your lenses will be relatively cool compared to your breath,
especially when the outside air is cold.
Input: A fish is capable of thinking.
Knowledge: Fish are more intelligent than they appear. In many areas, such
as memory, their cognitive powers match or exceed those of ’higher’
vertebrates including non-human primates.
Fish’s long-term memories help them keep track of complex social relationships.
Input: A common effect of smoking lots of cigarettes in one’s lifetime
is a higher than normal chance of getting lung cancer.
Knowledge: Those who consistently averaged less than one cigarette per day
over their lifetime had nine times the risk of dying from lung cancer than
never smokers. Among people who smoked between one and 10 cigarettes per day,
the risk of dying from lung cancer was nearly 12 times higher than that of
never smokers.
Input: A rock is the same size as a pebble.
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64
millimetres based on the Udden-Wentworth scale of sedimentology.
Pebbles are generally considered larger than granules
(2 to 4 millimetres diameter) and smaller than cobbles
(64 to 256 millimetres diameter).
Input: Part of golf is trying to get a higher point total than others.
Knowledge:
Knowledge 1:
The objective of golf is to play a set of holes in the least number of strokes.
A round of golf typically consists of 18 holes.
Each hole is played once in the round on a standard golf course.
Each stroke is counted as one point, and the total number of strokes is
used to determine the winner of the game.
Knowledge 2:
Golf is a precision club-and-ball sport in which competing players
(or golfers) use many types of clubs to hit balls into a series of holes
on a course using the fewest number of strokes. The goal is to complete the
course with the lowest score, which is calculated by adding up the total
number of strokes taken on each hole. The player with the lowest score wins
the game.
The next step is to integrate the knowledge and get a prediction. I reformatted the question into QA format to guide the answer format.
Prompt:
Question: Part of golf is trying to get a higher point total than others.
Yes or No?
Knowledge: The objective of golf is to play a set of holes in the least
number of strokes. A round of golf typically consists of 18 holes.
Each hole is played once in the round on a standard golf course.
Each stroke is counted as one point, and the total number of strokes is
used to determine the winner of the game.
Explain and Answer:
Answer 1 (confidence very high):
No, the objective of golf is not to get a higher point total than others.
Rather, the objective is to play a set of holes in the least number of strokes.
The total number of strokes is used to determine the winner of the game,
not the total number of points.
Question: Part of golf is trying to get a higher point total than others.
Yes or No?
Knowledge: Golf is a precision club-and-ball sport in which competing
players (or golfers) use many types of clubs to hit balls into a series of
holes on a course using the fewest number of strokes. The goal is to complete
the course with the lowest score, which is calculated by adding up the total
number of strokes taken on each hole. The player with the lowest score wins
the game.
Explain and Answer:
Answer 2 (confidence is a lot lower):
Yes, part of golf is trying to get a higher point total than others.
Each player tries to complete the course with the lowest score,
which is calculated by adding up the total number of strokes taken on
each hole. The player with the lowest score wins the game.
Some really interesting things happened with this example. In the first answer, the model was very confident, but in the second, not so much. I simplified the process for demonstration purposes, but there are a few more details to consider when arriving at the final answer.
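Structurally, the technique is a two-stage pipeline: generate knowledge statements first, then answer the question once per statement and pick the most confident or most common answer. The sketch below captures that flow; `call_llm` is a hypothetical LLM API stand-in, and `KNOWLEDGE_DEMOS` is a placeholder for the few-shot "Input/Knowledge" examples shown above.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's call.")

# Placeholder: paste the "Input: ... / Knowledge: ..." demonstrations from above.
KNOWLEDGE_DEMOS = "..."

def generated_knowledge_answers(question: str, n_knowledge: int = 2) -> list[str]:
    # Step 1: generate a few knowledge statements about the question.
    knowledge = [
        call_llm(f"{KNOWLEDGE_DEMOS}\nInput: {question}\nKnowledge:")
        for _ in range(n_knowledge)
    ]
    # Step 2: answer the question once per knowledge statement; the final
    # answer can then be chosen by confidence or by majority vote.
    answers = []
    for k in knowledge:
        answers.append(call_llm(
            f"Question: {question} Yes or No?\nKnowledge: {k}\nExplain and Answer:"))
    return answers
```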
6. Tree of Thoughts (ToT)
For complex tasks that require exploration or strategic lookahead, traditional or simple prompting techniques fall short.
ToT maintains a tree of thoughts, where thoughts represent coherent language sequences that serve as intermediate steps toward solving a problem. This approach enables an LM to self-evaluate the progress intermediate thoughts make towards solving a problem through a deliberate reasoning process. The LM’s ability to generate and evaluate thoughts is then combined with search algorithms (e.g., breadth-first search and depth-first search) to enable systematic exploration of thoughts with lookahead and backtracking.
When using ToT, different tasks require defining the number of candidates and the number of thoughts/steps. For instance, Game of 24 is used as a mathematical reasoning task that requires decomposing the thoughts into 3 steps, each involving an intermediate equation. At each step, the best b=5 candidates are kept.
To perform BFS in ToT for the Game of 24 task, the LM is prompted to evaluate each thought candidate as “sure,” “maybe,” or “impossible” with regard to reaching 24. As the authors state, “the aim is to promote correct partial solutions that can be verdicted within few lookahead trials, and eliminate impossible partial solutions based on ‘too big/small’ commonsense, and keep the rest ‘maybe’.” Values are sampled 3 times for each thought.
In the reported results, ToT substantially outperforms the other prompting methods.
A sample ToT prompt is:
Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking, then share it
with the group. Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is...
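For a rough sense of how the search itself can be wired up, here is a simplified BFS-style ToT loop. Everything in it is a sketch under assumptions: `call_llm` is a hypothetical LLM API stand-in, and the proposal/evaluation prompts are illustrative rather than the ones used in the paper.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's call.")

def propose_thoughts(problem: str, partial: str, k: int = 5) -> list[str]:
    # Ask the model for k candidate next steps toward solving the problem.
    out = call_llm(f"Problem: {problem}\nSteps so far:\n{partial}\n"
                   f"Propose {k} possible next steps, one per line.")
    return [line.strip() for line in out.splitlines() if line.strip()][:k]

def evaluate_thought(problem: str, partial: str) -> str:
    # Classify the partial solution as sure / maybe / impossible.
    return call_llm(f"Problem: {problem}\nSteps so far:\n{partial}\n"
                    "Is this on track to solve the problem? "
                    "Answer with one word: sure, maybe, or impossible.").lower()

def tree_of_thoughts_bfs(problem: str, depth: int = 3, breadth: int = 5) -> list[str]:
    frontier = [""]  # each entry is a partial chain of thoughts
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            for thought in propose_thoughts(problem, partial, breadth):
                extended = f"{partial}{thought}\n"
                verdict = evaluate_thought(problem, extended)
                if "impossible" in verdict:
                    continue  # prune hopeless branches
                # Prefer "sure" over "maybe" when pruning below.
                candidates.append((0 if "sure" in verdict else 1, extended))
        candidates.sort(key=lambda c: c[0])
        frontier = [c[1] for c in candidates[:breadth]]  # keep the best b candidates
    return frontier
```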
7. Retrieval Augmented Generation (RAG)
General-purpose language models can be fine-tuned to achieve several common tasks such as sentiment analysis and named entity recognition. These tasks generally don’t require additional background knowledge.
For more complex and knowledge-intensive tasks, it’s possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of “hallucination”.
Meta AI researchers introduced a method called Retrieval Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified in an efficient manner and without needing retraining of the entire model.
RAG takes an input and retrieves a set of relevant/supporting documents from a given source. The documents are concatenated as context with the original input prompt and fed to the text generator, which produces the final output. This makes RAG adaptive to situations where facts can evolve over time, which is very useful given that an LLM’s parametric knowledge is static. RAG lets language models bypass retraining and access the latest information to generate reliable outputs via retrieval-based generation.
RAG generates responses that are more factual, specific, and diverse when tested on MS-MARCO and Jeopardy questions. RAG also improves results on FEVER fact verification.
This shows the potential of RAG as a viable option for enhancing outputs of language models in knowledge-intensive tasks.
More recently, these retriever-based approaches have become more popular and are combined with popular LLMs like ChatGPT to improve capabilities and factual consistency.
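The retrieve-then-generate flow can be sketched very compactly. The snippet below uses a toy word-overlap retriever purely for illustration (a real system would use a dense retriever or vector database), and `call_llm` remains a hypothetical stand-in for your LLM API.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's call.")

def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    query_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def rag_answer(query: str, documents: list[str]) -> str:
    # Concatenate the retrieved documents with the question as context.
    context = "\n".join(retrieve(query, documents))
    prompt = ("Answer the question using the context below.\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return call_llm(prompt)
```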
8. Automatic Reasoning and Tool-use (ART)
Combining CoT prompting and tools in an interleaved manner has been shown to be a strong and robust approach for addressing many tasks with LLMs. These approaches typically require hand-crafting task-specific demonstrations and carefully scripting the interleaving of model generations with tool use.
ART works as follows:
- given a new task, it selects demonstrations of multi-step reasoning and tool use from a task library
- at test time, it pauses generation whenever external tools are called, and integrates their output before resuming generation
ART encourages the model to generalize from demonstrations to decompose a new task and use tools in appropriate places, in a zero-shot fashion. In addition, ART is extensible, as it enables humans to fix mistakes in the reasoning steps or add new tools by simply updating the task and tool libraries.
ART substantially improves over few-shot prompting and automatic CoT on unseen tasks in the BIG-Bench and MMLU benchmarks, and exceeds the performance of hand-crafted CoT prompts when human feedback is incorporated.
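To show the "pause, call a tool, resume" loop in miniature, here is a sketch of an ART-style controller. It is not the paper's implementation: `call_llm` is a hypothetical LLM API stand-in, the bracketed tool-call syntax is invented for illustration, and the tool library holds a single toy calculator.

```python
import re
from typing import Optional

def call_llm(prompt: str, stop: Optional[str] = None) -> str:
    """Hypothetical stand-in; `stop` would map to your provider's stop sequences."""
    raise NotImplementedError

# Hypothetical tool library; ART would also draw demonstrations from a task library.
TOOLS = {
    # Demo only: never eval untrusted input in real code.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def art_style_loop(task_prompt: str, max_rounds: int = 5) -> str:
    transcript = task_prompt
    for _ in range(max_rounds):
        # Generate until the model emits a tool call like [calculator: 12*7].
        generation = call_llm(transcript, stop="]")
        transcript += generation
        match = re.search(r"\[(\w+):\s*([^\]]+)$", generation)
        if not match:
            return transcript  # no tool requested; generation is complete
        name, argument = match.group(1), match.group(2).strip()
        result = TOOLS.get(name, lambda a: "unknown tool")(argument)
        # Feed the tool output back and let the model resume reasoning.
        transcript += f"] -> {result}\n"
    return transcript
```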