Systematic Review of Prompt Engineering in Large Language Models

Prompt engineering has become a crucial strategy for enhancing the capabilities of large language models (LLMs) and vision-language models (VLMs). This method employs task-specific instructions, or prompts, to improve model performance without altering the underlying model architecture. Instead of retraining the model, prompts offer a straightforward way to adapt pre-trained models for various applications, guiding the model towards the desired output with carefully crafted instructions or learned vector representations that trigger relevant knowledge. This innovative approach has proven effective in a wide range of tasks, including question-answering and commonsense reasoning. Despite its growing success, there is still a need for a comprehensive understanding and systematization of the various prompt engineering techniques. This paper aims to fill this gap by presenting a structured review of the latest developments in prompt engineering, organized by their application domains. We summarize the prompting methods, their specific uses, the models they employ, and the datasets they utilize. Additionally, we evaluate the advantages and drawbacks of each method and provide a taxonomy diagram and table that highlight the datasets, models, and key features of different prompting techniques. Through this detailed examination, we seek to offer a clearer insight into the fast-evolving field of prompt engineering, thereby aiding future research by highlighting ongoing challenges and potential avenues for advancement.

Zero-Shot Prompting

Zero-shot prompting represents a significant advance in the use of large language models (LLMs), introduced by Radford et al. in 2019. The approach removes the need for large task-specific training datasets by using carefully designed prompts to steer the model toward tasks it has not been explicitly trained on. The model receives a description of the task in the prompt but no labeled examples from which to learn input-output mappings; it instead applies its pre-trained knowledge to produce responses or predictions for the new task, guided solely by the instructions in the prompt.
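
To make the setup concrete, the sketch below builds a zero-shot prompt for sentiment classification: the task is specified entirely by the instruction, with no demonstrations. The `call_llm` function and the prompt wording are illustrative assumptions, standing in for any text-completion API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any LLM completion API."""
    raise NotImplementedError("plug in a real model call here")

def zero_shot_sentiment(review: str) -> str:
    # The task is described by instruction alone; no labeled examples.
    prompt = (
        "Classify the sentiment of the following review as Positive or Negative.\n"
        f"Review: {review}\n"
        "Sentiment:"
    )
    return call_llm(prompt).strip()
```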

Few-Shot Prompting

Few-shot prompting, as described by Brown et al. in 2020, supplies the model with a small set of input-output examples to aid task comprehension, in contrast to zero-shot prompting, which provides none. Even a handful of high-quality demonstrations has been shown to lift performance on complex tasks above what is achieved without them. The technique does, however, consume additional tokens for the examples, which can be prohibitive for longer inputs. How examples are selected and ordered within the prompt also strongly shapes model behavior, and biases such as a preference for frequently occurring words continue to affect few-shot outcomes. Despite these challenges, few-shot prompting markedly improves the ability of large pre-trained models such as GPT-3 to tackle complex tasks, though careful prompt design remains essential to optimize performance and mitigate unintended biases.
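
A minimal sketch of few-shot prompt construction follows, reusing the sentiment task for continuity. The demonstrations, formatting, and task wording are illustrative assumptions; the essential point is that labeled examples precede the query, and each example consumes context tokens.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a prompt from (input, output) demonstrations plus a query."""
    lines = ["Classify the sentiment of each review as Positive or Negative.", ""]
    for review, label in examples:
        lines += [f"Review: {review}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

demos = [
    ("The plot was gripping from start to finish.", "Positive"),
    ("I walked out halfway through.", "Negative"),
]
print(few_shot_prompt(demos, "Beautifully shot, but emotionally hollow."))
```

Note that example order is itself part of the prompt: reordering the demonstrations can change the model's prediction, which is one source of the sensitivity discussed above.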

Automatic Chain-of-Thought (Auto-CoT) Prompting

Zhang et al. in 2022 introduced Automatic Chain-of-Thought (Auto-CoT) prompting to address the inefficiency of manually crafting high-quality Chain-of-Thought (CoT) examples, a process that is time-consuming and often suboptimal. Auto-CoT automates demonstration construction by instructing LLMs with a "Let’s think step-by-step" prompt to generate sequential reasoning chains. To guard against inaccuracies in any single generated chain, it relies on diverse sampling: reasoning chains are generated for a varied set of questions, yielding a broad pool of demonstrations. This automated diversity dampens the effect of individual errors and improves few-shot performance while eliminating the labor of hand-writing reasoning chains. Auto-CoT has been shown to enhance performance substantially, achieving average accuracy improvements of 1.33% and 1.5% on arithmetic and symbolic reasoning tasks, respectively, when tested with GPT-3, outperforming the manually constructed CoT approach.
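
A simplified sketch of Auto-CoT's two stages follows. The original method encourages diversity by clustering the question pool (e.g., with Sentence-BERT embeddings and k-means) and sampling one representative question per cluster; the even-spacing heuristic below is a crude stand-in for that step, and `call_llm` is a hypothetical completion function.

```python
COT_TRIGGER = "Let's think step by step."

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any LLM completion API."""
    raise NotImplementedError("plug in a real model call here")

def build_demonstrations(questions: list[str], k: int = 4) -> str:
    # Stage 1: pick k diverse questions. (The paper clusters the pool and
    # samples one question per cluster; even spacing is a crude proxy.)
    step = max(1, len(questions) // k)
    sampled = questions[::step][:k]
    # Stage 2: have the model write a reasoning chain for each question
    # via the zero-shot CoT trigger.
    demos = []
    for q in sampled:
        chain = call_llm(f"Q: {q}\nA: {COT_TRIGGER}")
        demos.append(f"Q: {q}\nA: {COT_TRIGGER} {chain}")
    return "\n\n".join(demos)

def auto_cot_answer(questions: list[str], target_question: str) -> str:
    # The auto-generated demonstrations are prepended to the target question.
    demos = build_demonstrations(questions)
    return call_llm(f"{demos}\n\nQ: {target_question}\nA: {COT_TRIGGER}")
```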

Self-Consistency

Wang et al. in 2022 introduced a decoding strategy called self-consistency, which improves reasoning beyond what traditional greedy decoding achieves in Chain-of-Thought (CoT) prompting. The strategy is particularly effective for complex reasoning tasks that admit several valid solution paths. Self-consistency samples a diverse set of reasoning chains from the language model's decoder and then selects the most consistent final answer across those chains, typically by majority vote. It leverages the insight that problems requiring deliberate thought can often be solved through multiple distinct reasoning paths that converge on the same answer. Combining self-consistency with CoT prompting has produced marked accuracy gains on several benchmarks, including increases of 17.9% on GSM8K, 11.0% on SVAMP, 12.2% on AQuA, 6.4% on StrategyQA, and 3.9% on ARC-Challenge, compared with the original CoT prompting approach.
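
A minimal sketch follows, assuming a sampling API `sample_llm` that returns one CoT completion per call at nonzero temperature, and a simple regex for pulling a numeric final answer out of each chain; both are illustrative assumptions.

```python
import re
from collections import Counter

def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in: one sampled CoT completion per call."""
    raise NotImplementedError("plug in a real model call here")

def extract_answer(chain: str) -> str | None:
    # Assumes chains end with a phrase like "The answer is 42."
    match = re.search(r"answer is\s*(-?[\d.,]+)", chain, re.IGNORECASE)
    return match.group(1).rstrip(".,") if match else None

def self_consistent_answer(cot_prompt: str, n_samples: int = 10) -> str | None:
    answers = []
    for _ in range(n_samples):
        chain = sample_llm(cot_prompt)   # one diverse reasoning path
        answer = extract_answer(chain)
        if answer is not None:
            answers.append(answer)
    # Marginalize over reasoning paths: the most frequent answer wins.
    return Counter(answers).most_common(1)[0][0] if answers else None
```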

Logical Chain-of-Thought (LogiCoT) Prompting

The capacity for logical reasoning is essential for large language models (LLMs) to tackle intricate, multi-stage problems across domains. While existing techniques such as Chain-of-Thought (CoT) prompting encourage sequential reasoning, they offer no robust means of verifying intermediate steps. Zhao et al. introduced Logical Chain-of-Thought (LogiCoT) prompting in 2023, a neurosymbolic approach that incorporates principles from symbolic logic to structure and check the reasoning process. In particular, LogiCoT applies the reductio ad absurdum principle: each reasoning step is verified by assuming its negation and checking whether a contradiction follows, and steps that fail verification are revised, improving the model's accuracy and coherence on complex reasoning tasks.
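
A hedged sketch of a verify-and-revise loop in the spirit of LogiCoT is given below. The prompt templates, the `call_llm` stub, and the VALID/INVALID protocol are illustrative assumptions rather than the authors' exact formulation; the essential idea is that each step is checked by reductio ad absurdum and revised if the check fails.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any LLM completion API."""
    raise NotImplementedError("plug in a real model call here")

def verify_step(context: str, step: str) -> bool:
    # Reductio ad absurdum: assume the step is false and look for a
    # contradiction with what has been established so far.
    verdict = call_llm(
        f"{context}\nProposed step: {step}\n"
        "Assume this step is FALSE and try to derive a contradiction with "
        "the context. Reply VALID if the negation leads to a contradiction "
        "(so the step holds), otherwise reply INVALID."
    )
    return verdict.strip().upper().startswith("VALID")

def logicot(question: str, max_steps: int = 8) -> str:
    context = f"Question: {question}"
    for _ in range(max_steps):
        step = call_llm(
            f"{context}\nNext reasoning step (or 'DONE: <answer>' if finished):"
        ).strip()
        if step.startswith("DONE"):
            return step.split(":", 1)[-1].strip()
        if not verify_step(context, step):  # failed check -> revise the step
            step = call_llm(
                f"{context}\nThe step '{step}' failed verification. "
                "Provide a corrected step:"
            ).strip()
        context += f"\n{step}"
    return call_llm(f"{context}\nFinal answer:")
```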

Tree-of-Thoughts (ToT) Prompting

In 2023, Yao et al. and Long independently introduced the Tree-of-Thoughts (ToT) framework to strengthen prompting for complex tasks that demand lookahead and exploration. Building on Chain-of-Thought (CoT) prompting, ToT arranges intermediate reasoning steps, or "thoughts", in a tree, where each thought moves one step closer to solving the problem. This structure lets language models work through problems more strategically by evaluating how much each thought contributes toward a solution. ToT combines the model's capacity to generate and assess thoughts with search algorithms such as breadth-first and depth-first search, enabling systematic exploration of reasoning paths, expansion of promising candidates, and backtracking from dead ends. The framework delivered striking gains: a 74% success rate on the Game of 24, compared with CoT's 4%, and on word-level reasoning tasks a 60% success rate versus CoT's 16%.
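
The sketch below shows a breadth-first variant of ToT with a small beam: each partial solution is expanded into several candidate thoughts, each candidate is scored by the model, and only the top-scoring states survive to the next depth. `call_llm`, the scoring prompt, and all hyperparameters are illustrative assumptions.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any LLM completion API."""
    raise NotImplementedError("plug in a real model call here")

def score_state(problem: str, state: str) -> float:
    # Ask the model itself to judge how promising a partial path is.
    reply = call_llm(
        f"Problem: {problem}\nPartial reasoning:\n{state}\n"
        "Rate how promising this path is from 0 (dead end) to 10 (almost solved):"
    )
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0

def tree_of_thoughts(problem: str, depth: int = 3,
                     branch: int = 3, beam: int = 2) -> str:
    states = [""]  # each state accumulates the thoughts chosen so far
    for _ in range(depth):
        candidates = []
        for state in states:
            for _ in range(branch):  # propose several next thoughts
                thought = call_llm(
                    f"Problem: {problem}\nSteps so far:\n{state}\nNext step:"
                )
                candidates.append(state + thought.strip() + "\n")
        # Keep only the most promising states (breadth-first with a beam);
        # pruning a state here is the analogue of backtracking from a dead end.
        candidates.sort(key=lambda s: score_state(problem, s), reverse=True)
        states = candidates[:beam]
    return call_llm(
        f"Problem: {problem}\nReasoning:\n{states[0]}\nFinal answer:"
    )
```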