Welcome to today's paper reading. Today's paper is titled "A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks." It was published as an arXiv preprint by Shubham Vatsal and Harsh Dubey from the Department of Computer Science at New York University.

Large language models (LLMs) have demonstrated exceptional performance across various natural language processing (NLP) tasks. A critical aspect of enhancing these models' capabilities is prompt engineering, which involves composing natural language instructions to elicit structured knowledge from LLMs. Unlike previous state-of-the-art models, prompt engineering does not necessitate extensive parameter retraining or fine-tuning, making it accessible to individuals without a deep machine learning background. This paper reviews different prompting techniques, categorizing them by the NLP tasks they are applied to, and discusses their performance on various datasets. The study encompasses a survey of 44 research papers detailing 39 distinct prompting methods across 29 NLP tasks, most of which have been published within the last two years.

The introduction of large language models has significantly advanced the field of artificial intelligence. These models, trained on vast corpora of text documents, have shown that performance improves as the number of model parameters increases. LLMs have achieved unprecedented performance on a wide array of NLP tasks, attracting interest from academia and from industries such as medicine, law, and finance. The current focus of research on LLMs is on their reasoning capacity via prompts, which has opened a new field of research around prompt engineering.

Prompt engineering is the process of creating natural language instructions to extract knowledge from LLMs in an organized manner. Unlike earlier models, it relies solely on the embedded knowledge of LLMs and does not require extensive parameter retraining. This new field has caught everyone's
attention, as it allows natural language exchange between researchers and LLMs to achieve the goals of the underlying NLP task. In this work, the authors enumerate several prompting strategies and group them according to different NLP tasks. They provide a taxonomy diagram, tabulate the prompting techniques tried on various datasets, discuss the LLMs employed, and list potential state-of-the-art methods for each dataset. The survey reviews and analyzes 44 research papers, the majority of which have been published in the previous two years, covering 39 prompting techniques applied to 29 different NLP tasks.

This section discusses various prompting methods and their impact on performance. The main variations are the zero-shot and few-shot settings. In zero-shot, no training data is used and the model relies solely on its pre-trained knowledge. In contrast, few-shot involves providing a few training examples so the model can better understand the task. Key prompting strategies include Chain-of-Thought (CoT), which breaks down complex problems into simpler sub-problems, showing significant improvements in mathematical and commonsense reasoning tasks. Self-Consistency uses multiple reasoning paths to find the most consistent answer, achieving notable gains in mathematical and commonsense reasoning. Ensemble Refinement (ER) builds on CoT and Self-Consistency, showing better performance across various tasks.

In the next section, the authors standardize the categorization of datasets under various NLP tasks. They define different NLP tasks and assign datasets to these tasks based on their strong association. They also discuss various prompting methods used for these tasks, noting that the performance of these methods can vary depending on the language model used. Their approach ensures that each dataset is assigned to only one NLP task, to avoid complex entanglements in performance analysis.

Figure 1 presents a taxonomy diagram that categorizes various prompt engineering methods across different NLP tasks. The diagram
highlights key methods such as Chain-of-Thought (CoT), Random CoT, and Complex CoT for mathematical problem solving; Basic, Program-Aided Language Models (PAL), and Synthetic Prompting for logical reasoning; and Contrastive CoT and Contrastive Self-Consistency for commonsense reasoning. This taxonomy helps in understanding the application of these methods across a wide range of NLP tasks, providing a structured overview of the current state of prompt engineering techniques.

Table 1 provides a detailed analysis of various prompt engineering strategies applied to different datasets in the context of mathematical problem-solving tasks. Key datasets include GSM8K, MATH, and SVAMP, each showcasing different prompting strategies such as p PN exploiting training, PoT, Analogical Reasoning, and Contrastive Self-Consistency. The table highlights the best-performing prompting methods for each dataset, offering insights into the effectiveness of different approaches in enhancing the performance of LLMs in mathematical problem-solving scenarios.

Table 2 provides an analysis of the prompt engineering strategies used for the logical reasoning task. It lists various datasets, such as Word Sorting, Logical Deduction, and Temporal Sequences, along with prompting strategies like Basic, Analogical Reasoning, and CoT. The table also mentions the LLMs used, including GPT-3.5 Turbo, GPT-4, PaLM 2-L, and PaLM 2-S. This analysis helps in understanding the best-performing prompting methods for each dataset in the context of logical reasoning tasks.

Table 3 provides an analysis of various datasets used in the commonsense reasoning task, detailing the prompting strategies and the best-performing LLMs. Key datasets include Reasoning about Colored Objects, CSQA, and Date Understanding. The table highlights the use of prompting strategies such as Basic, CoT, Auto-CoT, and Self-Consistency. Notable LLMs include PaLM 2-L, GPT-3.5 Turbo, GPT-4, and UL2 20B. This analysis is crucial for understanding the effectiveness of different prompting methods in enhancing the commonsense reasoning capabilities of LLMs.

Table 4 provides an analysis of prompt engineering strategies for the multi-hop reasoning task across various datasets. Key findings include the use of Active-Prompt for StrategyQA, C for HotpotQA, and DecomP for CommaQA. These strategies represent the best-performing methods on their respective datasets, highlighting the effectiveness of specific prompting techniques in enhancing multi-hop reasoning capabilities.

Table 5 provides an analysis of the prompt engineering strategies used for the causal reasoning task. It includes datasets such as Cause and Effect and Causal Judgment, and lists various prompting strategies like Basic, CoT, and LoT. The table also mentions the LLMs used, including GPT-3.5 Turbo, GPT-4, Vicuna-7B, and PaLM 2-S.

Table 6 provides an analysis of prompt engineering for the social reasoning task using the SocialQA dataset. The table highlights the prompting strategies of Chain-of-Thought (CoT) and Logical Thoughts (LoT), and lists the LLMs tested, including GPT-3.5 Turbo, GPT-4, and various Vicuna models (7B, 13B, 33B). This analysis is crucial for understanding how different prompting techniques and LLMs perform in tasks that require reasoning about human social interactions.

Table 7 provides an analysis of prompt engineering strategies for the contextual question-answering task across various datasets. Key datasets include ProcessBank, BioMRC, and MASH-QA, each tested with prompting strategies such as Basic, Implicit RAG, Chain-of-Thought (CoT), and Analogical Reasoning. The table highlights the best-performing prompting techniques for each dataset, aiding in understanding the effectiveness of different strategies in contextual QA tasks.

Table 8 provides an analysis of prompt engineering strategies for the context-free question-answering task. It lists various datasets, such as PopQA, EntityQ, and Wikidata, along with the prompting strategies, like Basic, CoT, and ThoT, that have been experimented on them. The table also highlights the best-performing prompting strategy for each dataset. Additionally, it mentions the LLMs used, including GPT-3.5 Turbo, GPT-4, and the Llama 2 series, which are crucial for understanding the effectiveness of different prompting techniques in context-free QA tasks.

Table 9 provides an analysis of prompt engineering for the spatial question-answering task. It includes datasets like Brick World, NLVR-based Manipulation, and Natural Language Navigation, among others. The table details various prompting strategies such as CoT, CoS, and Basic, along with the best-performing LLMs, including GPT-3.5, GPT-3.5 Turbo, GPT-4, and PaLM 2-S. This analysis helps in understanding the effectiveness of different prompting methods on spatial reasoning tasks.

Table 10 provides an analysis of prompt engineering strategies for the conversational contextual question-answering task. The dataset used is ConvFinQA, and the prompting strategies include PoT, CoT, Self-Consistency, and PAL. The LLMs involved are Codex, GPT-3, GPT-3.5 Turbo, CodeGen, CodeT5+, XGen, PaLM, and LaMDA. This table highlights the best-performing prompting strategies for enhancing the model's ability to answer interconnected queries in a conversational format.

Table 11 provides an analysis of prompt engineering strategies for the dialogue system task, focusing on the Multi-Turn Conversation Response (MTCR) dataset. The table lists various prompting strategies, such as Basic, Chain-of-Thought (CoT), and Thread-of-Thought (ThoT), and the corresponding LLMs, including GPT-3.5 Turbo, Llama 2 models, and Vicuna. This analysis is crucial for understanding the effectiveness of different prompting techniques in enhancing the performance of dialogue systems.

Table 12 provides an analysis of prompt engineering strategies for the code generation task. It lists datasets such as Codeforce Scraping, HumanEval, MBPP, and MBCPP, along with various prompting strategies like Analogical Reasoning, Chain-of-Thought (CoT), and Structured Chain-of-Thought (SCoT). The table also mentions the LLMs used, including GPT-3.5 Turbo, GPT-4, PaLM 2-L, and Codex. This analysis helps in understanding the best-performing prompting techniques for generating code on different datasets.

Table 13 provides an analysis of prompt engineering strategies for the free response task. It highlights the datasets used, which include Creative Writing and Longform Generation of Biographies. The table lists various prompting strategies such as Basic, CoT, Self-Consistency, and ToT. Additionally, it mentions the LLMs involved in the experiments, including GPT-4, Llama 65B, and Llama 2 70B Chat. The table aims to identify the best techniques for generating unconstrained textual responses.

Table 14 provides an analysis of prompt engineering techniques for the truthfulness task across various datasets. The datasets include SycophancyEval, FEVER, and GSM-IC. The prompting strategies evaluated are S2A, CoT, and Instructed Prompting. The leading language models used in these experiments are Llama 2 70B Chat, PaLM 540B, and GPT-3.5. This table highlights the best-performing techniques for ensuring factual communication without misinformation.

Table 15 provides an analysis of prompt engineering strategies for the table-based truthfulness task. The dataset used is TabFact, and the prompting strategies include Basic, CoT, Binder, Dater, and Chain-of-Table. The models evaluated are PaLM 2-S, GPT-3.5 Turbo, and Llama 2 17B Chat. The best-performing strategy identified is Chain-of-Table.

Table 16 provides an analysis of prompt engineering strategies for the table-based question-answering task. It includes datasets like WikiTQ and FeTaQA, and evaluates various prompting strategies such as Basic, Chain-of-Thought, Binder, Dater, and Chain-of-Table. The table also lists the LLMs used in the experiments, including PaLM 2-S, GPT-3.5 Turbo, Llama 2 17B Chat, and Codex. The state-of-the-art (SoTA) prompting strategy identified is Chain-of-Table.

Table 17 provides an analysis of prompt engineering strategies for table-based mathematical problem-solving tasks. It covers two datasets, TabMWP and Penguins in a Table, and lists various prompting strategies such as PoT, CoT, Self-Consistency, and PAL. The table also includes the LLMs used in the experiments, including Codex, GPT-3, GPT-3.5 Turbo, CodeGen, CodeT5+, XGen, PaLM, and LaMDA.

Table 18 provides an analysis of prompt engineering strategies for the recommender system task, specifically focusing on the Movie Recommendation dataset. The table lists various prompting strategies, such as Basic, Chain-of-Thought (CoT), and Chain of Context, and identifies the best-performing technique as PoT. This table is crucial for understanding the optimal prompting methods for enhancing recommender system performance.

Table 19 provides an analysis of prompt engineering strategies for the emotion/sentiment understanding task. It includes datasets like Snarks, Ruin Names, SemEval14, and Forex, and evaluates prompting strategies such as Basic, CoT, and THOR. The table also lists the LLMs used in the experiments, including PaLM 2-S, GPT-3.5, Flan-T5, and GPT-3.

Table 20 provides an analysis of prompt engineering techniques for the machine translation task. It includes datasets such as Salient Translation Error Detection, FLORES, WMT21, Multi-Domain, and PDC. The table lists various prompting strategies, like Basic, CoT, and Basic plus variations, and the corresponding LLMs used, including PaLM 2-S, GPT-3.5, GLM-130B, and Thor. This analysis helps in understanding the best-performing prompting strategy for each dataset in the context of machine translation.

Table 21 presents the analysis of prompting strategies for the named entity recognition (NER) task across various datasets. The datasets include MTSamples, VAERS, Research Papers, and BC5CDR-Chem. The prompting strategies evaluated are Basic, annotation guideline-based, and error analysis-based. The LLMs used in the experiments are GPT-3.5 Turbo, GPT-4, Llama 2 13B Chat, and PaLM Bison Chat. This table highlights the best-performing prompting strategies for each dataset in the context of NER.

Table 22 provides an analysis of prompt engineering for the word sense disambiguation (WSD) task. The WiC dataset is utilized, and various prompting strategies, such as Chain-of-Thought (CoT), Plan-and-Solve (PS), Self-Consistency, and Metacognitive Prompting (MP), are explored. The analysis includes performance data from several LLMs, including Llama 2 13B Chat, GPT-3.5 Turbo, GPT-4, and PaLM Bison Chat.

Table 23 provides an analysis of prompt engineering for the summarization task, focusing on the datasets WCEP and CCTC. The table lists the prompting strategies used, which include Basic and CoE, and the language model involved, specifically ChatGLM2-6B. This analysis helps in understanding the best-performing prompting strategies for summarizing lengthy texts into concise chunks while retaining essential information.

Table 24 provides an analysis of prompt engineering for the paraphrasing task, detailing the datasets used, the prompting strategies, and the language models involved. The datasets include QQP and soda, with prompting strategies such as Chain-of-Thought (CoT), Plan-and-Solve (PS), Self-Consistency, and Metacognitive Prompting (MP) being employed. The language models tested include Llama 2 13B Chat, GPT-3.5 Turbo, GPT-4, and PaLM Bison Chat.

Table 25 provides an analysis of prompt engineering for the stance detection task, focusing on datasets like SemEval-2016, VAST, and P-Stance. The table highlights the use of the Chain-of-Thought (CoT) prompting strategy across these datasets, with GPT-3.5 Turbo as the primary language model utilized. This approach aims to evaluate the model's ability to determine the stance of the author on a given topic, whether in favor, against, or neutral.

Table 26 provides an analysis of prompt engineering strategies for the natural language inference (NLI) task. It covers two datasets, QNLI and MedNLI, and evaluates various prompting strategies, including Chain-of-Thought (CoT), Plan-and-Solve (PS), Self-Consistency, and Metacognitive Prompting (MP). The table lists the LLMs used in the experiments, such as Llama 2 13B Chat, GPT-3.5 Turbo, GPT-4, and PaLM Bison Chat. The best-performing prompting method is highlighted for each dataset.

Table 27 provides an analysis of prompt engineering strategies for the relation extraction task, focusing on the DDI dataset. The table lists various prompting strategies, such as Chain-of-Thought (CoT), Plan-and-Solve (PS), Self-Consistency, and Metacognitive Prompting (MP), which have been tested with LLMs including Llama 2 13B Chat, GPT-3.5 Turbo, GPT-4, and PaLM Bison Chat. The best-performing strategy identified is Metacognitive Prompting (MP).

Table 28 provides an analysis of prompt engineering strategies for language-based task completion. It covers four datasets: ALFWorld, SCAN, WebShop, and SayCan. The table lists various prompting strategies, such as Act, ReAct, Basic, CoT, and Least-to-Most, which have been experimented with on these datasets. Additionally, it mentions the LLMs used, including PaLM 540B, GPT-3, Codex, LaMDA 137B, and UL2 20B. The table highlights the best-performing prompting method for each dataset, offering insights into effective strategies for language-based task completion.

Table 29 presents an analysis of prompt engineering strategies for the multi-label text classification task across three datasets: EUR-LEX, UNFAIR-ToS, and LEDGAR. The table lists the prompting strategies experimented with, including Chain-of-Thought (CoT), Plan-and-Solve (PS), Self-Consistency, and Metacognitive Prompting (MP). The best-performing models for these strategies are also noted, such as Llama 2 13B Chat, GPT-3.