Zero-Shot Prompting

Zero-shot prompting is a fundamental technique in prompt engineering where a large language model (LLM) performs a task based solely on a natural language instruction, without any task-specific examples or prior demonstrations. Its primary purpose is to leverage the model’s pre-trained knowledge and instruction-following capabilities to enable rapid task execution in environments where labeled training data is scarce or unavailable [3][1]. This method matters in prompt engineering because it democratizes AI development by reducing dependency on curated datasets, enables immediate deployment across diverse domains, and scales efficiently to power production systems built on models like GPT-4 and Claude 3 [3][1][6].

Overview

Zero-shot prompting emerged as a natural consequence of the scaling revolution in large language models during the late 2010s and early 2020s. As models grew from millions to billions of parameters and were trained on increasingly diverse internet-scale corpora, researchers observed emergent capabilities—the ability to perform tasks the models were never explicitly trained to do [3][1]. This phenomenon addressed a fundamental challenge in traditional machine learning: the requirement for extensive labeled datasets and task-specific fine-tuning for every new application, which created bottlenecks in deployment speed and accessibility.

The practice evolved significantly with the introduction of instruction tuning, where models like FLAN and InstructGPT were fine-tuned on collections of instruction-response pairs across diverse tasks [3][1]. This advancement, combined with reinforcement learning from human feedback (RLHF), dramatically improved zero-shot performance by teaching models to better interpret and follow natural language instructions [6]. Modern zero-shot prompting has matured from simple command-based interactions to sophisticated techniques incorporating role assignment, emotional cues, and structured output formatting, transforming from an experimental curiosity into a production-ready methodology that underpins commercial AI applications [1][6][3].

Key Concepts

Instruction Prompt

The instruction prompt is the explicit natural language directive that defines the task the model should perform, serving as the primary mechanism for invoking the model’s capabilities without examples [3][2]. This component must clearly articulate the desired action, output format, and any constraints to align with the model’s internalized patterns from pre-training.

Example: A customer service automation system uses the instruction prompt: “Classify the following customer email as ‘urgent-technical’, ‘billing-inquiry’, ‘general-question’, or ‘feedback’. Provide only the category label.” When processing the email “My account was charged twice for the same order and I need this resolved immediately,” the model correctly outputs “billing-inquiry” based solely on the instruction’s clarity about categories and format.
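The prompt above can be assembled programmatically. The following is an illustrative sketch; the helper name and category list are assumptions for demonstration, not part of any particular API.

```python
CATEGORIES = ["urgent-technical", "billing-inquiry", "general-question", "feedback"]

def build_classification_prompt(email_text: str, categories: list[str]) -> str:
    """Assemble a zero-shot classification prompt with explicit categories
    and an output-format constraint, and no task examples."""
    quoted = ", ".join(f"'{c}'" for c in categories[:-1])
    quoted += f", or '{categories[-1]}'"
    return (
        f"Classify the following customer email as {quoted}. "
        "Provide only the category label.\n\n"
        f"Email: {email_text}"
    )

prompt = build_classification_prompt(
    "My account was charged twice for the same order.", CATEGORIES
)
```

Keeping the category list in one place makes it trivial to update the taxonomy without touching the instruction wording.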

Emergent Abilities

Emergent abilities refer to the phenomenon where sufficiently large language models develop capabilities to perform tasks they were not explicitly trained on, arising from the complex interactions of patterns learned during pre-training on massive diverse datasets [3][1]. These abilities enable zero-shot inference by allowing models to generalize from their training distribution to novel task formulations.

Example: A legal technology startup discovers that GPT-4 can perform zero-shot contract clause extraction without any legal-specific fine-tuning. Given the prompt “Extract all indemnification clauses from this contract and list them with their section numbers,” the model successfully identifies and formats relevant clauses from a 50-page software licensing agreement, despite never being explicitly trained on legal document parsing—demonstrating emergent understanding of legal terminology and document structure.

Output Directive

An output directive is a specification within the prompt that constrains the format, structure, or style of the model’s response, ensuring consistency and parsability in production systems [3][2]. This element is critical for integrating LLM outputs into downstream workflows and reducing post-processing requirements.

Example: An e-commerce platform implementing product categorization uses the output directive: “Category: [primary category] | Subcategory: [subcategory] | Confidence: [high/medium/low]”. When processing the product description “Wireless Bluetooth earbuds with noise cancellation and 24-hour battery life,” the model returns “Category: Electronics | Subcategory: Audio Accessories | Confidence: high”—a format that directly feeds into the inventory management database without additional parsing.
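A pipe-delimited directive like the one above is easy to parse deterministically downstream. A minimal sketch, assuming the model follows the “Key: value | Key: value” format:

```python
def parse_directive_output(response: str) -> dict[str, str]:
    """Parse a 'Category: X | Subcategory: Y | Confidence: Z' response into
    a dict. Raises ValueError when the model drifts from the format, so
    malformed outputs can be flagged instead of silently ingested."""
    fields = {}
    for part in response.split("|"):
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"Malformed field: {part!r}")
        fields[key.strip()] = value.strip()
    return fields

parsed = parse_directive_output(
    "Category: Electronics | Subcategory: Audio Accessories | Confidence: high"
)
```

The explicit exception on format drift is what lets the directive feed a database “without additional parsing” in practice: non-conforming responses fail loudly and can be retried or reviewed.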

Instruction Tuning

Instruction tuning is a fine-tuning methodology where language models are trained on diverse collections of instruction-response pairs to enhance their ability to follow natural language directives in zero-shot scenarios [3][1]. This process bridges the gap between pre-training objectives (typically next-token prediction) and instruction-following behavior.

Example: A research team comparing base GPT-3 with instruction-tuned GPT-3.5 finds dramatic differences in zero-shot performance. When given the prompt “Summarize this research abstract in one sentence suitable for a general audience,” the base model often continues the abstract rather than summarizing it, while the instruction-tuned version consistently produces appropriate summaries—demonstrating how instruction tuning specifically enables zero-shot task completion.

Prompt Encoding

Prompt encoding is the process by which the textual prompt is transformed into vector representations (embeddings) that the model uses for pattern matching against its learned knowledge during inference [3]. This transformation determines how effectively the model can map the instruction to relevant patterns from its training data.

Example: A multilingual content moderation system processes the zero-shot prompt “Determine if this comment violates community guidelines: [user comment]”. The model’s tokenizer and embedding layers encode both the English instruction and a comment in Spanish, mapping them into a shared semantic space where the model’s pre-trained patterns about policy violations can be activated—enabling cross-lingual zero-shot moderation without language-specific training.

Role Assignment

Role assignment is a prompting technique where the instruction explicitly defines a persona, expertise level, or perspective the model should adopt when generating responses, improving output quality and consistency [1][6]. This approach leverages the model’s ability to simulate different writing styles and knowledge domains learned during pre-training.

Example: A medical information service compares two prompts for symptom assessment. The basic prompt “What could cause chest pain and shortness of breath?” yields generic, sometimes alarming responses. The role-assigned prompt “You are a triage nurse providing initial assessment guidance. What questions would you ask a patient reporting chest pain and shortness of breath to determine urgency?” produces structured, clinically appropriate triage questions that align with medical protocols—demonstrating how role assignment shapes zero-shot medical reasoning.

Prompt Ambiguity

Prompt ambiguity refers to unclear, vague, or underspecified instructions that lead to inconsistent, irrelevant, or hallucinated outputs because the model cannot reliably map the prompt to appropriate learned patterns [3][1]. Reducing ambiguity is essential for reliable zero-shot performance.

Example: A content generation team initially uses the ambiguous prompt “Write about the product” for generating marketing copy, resulting in wildly inconsistent outputs—some technical, some emotional, varying in length from one sentence to multiple paragraphs. After refining to “Write a 3-sentence product description for [product name] highlighting its primary benefit, target user, and key differentiator. Use an enthusiastic but professional tone,” output consistency improves from 40% usable to 85% usable without any model changes—illustrating how ambiguity directly impacts zero-shot reliability.

Applications in Natural Language Processing

Sentiment Analysis and Classification

Zero-shot prompting enables immediate deployment of sentiment analysis systems without collecting and labeling training data for specific domains or products [3][2]. Organizations can classify customer feedback, social media mentions, or product reviews by simply instructing the model to categorize sentiment, with performance often approaching supervised baselines for general sentiment tasks.

Example: A restaurant chain launching a new menu item needs immediate sentiment tracking across review platforms. Using the zero-shot prompt “Classify the sentiment of this review about our new menu item as positive, negative, or neutral, then identify the main topic (taste, price, portion, service): [review text]”, the system processes 10,000 reviews in the first week, identifying that 72% positive sentiment relates primarily to taste while 18% negative sentiment focuses on portion size—actionable insights delivered without weeks of data collection and model training that traditional approaches would require.

Content Moderation and Safety

Zero-shot prompting powers scalable content moderation systems that can adapt to evolving policy guidelines and handle multilingual content without language-specific training datasets [1][3]. This application is critical for platforms needing to enforce community standards across diverse content types and languages.

Example: A global social media platform implements zero-shot content moderation with the prompt: “Evaluate if this post violates our policies against: hate speech, harassment, misinformation, or graphic violence. Respond with: VIOLATION: [category] or SAFE, followed by a brief explanation.” When processing a post containing subtle coded language used by extremist groups—language that emerged after the last model training—the system correctly flags it as “VIOLATION: hate speech – contains coded supremacist terminology” because the model’s broad pre-training enables recognition of harmful patterns even in novel formulations, providing coverage that static classifiers miss.

Enterprise Knowledge Extraction

Organizations deploy zero-shot prompting to extract structured information from unstructured documents, enabling rapid digitization of legacy content and automated processing of incoming documents without document-type-specific training [3][1]. This application dramatically reduces the time and cost of information management workflows.

Example: An insurance company processing diverse claim documents (medical reports, police reports, repair estimates) uses zero-shot extraction: “Extract from this document: incident date, claimant name, claimed amount, incident type, and supporting evidence mentioned. Format as JSON.” Applied to a handwritten police report transcription about a vehicle accident, the model successfully extracts structured data that populates the claims management system, reducing processing time from 15 minutes of manual data entry to 30 seconds of automated extraction with human verification—scaling across 50,000 monthly claims without training separate models for each document type.
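When the prompt requests JSON, a thin validation layer can check each extraction for required fields before it reaches the claims system. A sketch under the assumption that the model emits snake_case keys matching the requested fields:

```python
import json

# Hypothetical field names mirroring the extraction prompt above.
REQUIRED_FIELDS = {"incident_date", "claimant_name", "claimed_amount",
                   "incident_type", "supporting_evidence"}

def validate_claim_extraction(raw: str) -> tuple[dict, list[str]]:
    """Parse the model's JSON output and report any missing required fields
    so incomplete extractions can be routed to human verification."""
    data = json.loads(raw)
    missing = sorted(REQUIRED_FIELDS - data.keys())
    return data, missing

data, missing = validate_claim_extraction(
    '{"incident_date": "2024-03-02", "claimant_name": "J. Doe",'
    ' "claimed_amount": 4200, "incident_type": "vehicle accident"}'
)
```

Here the extraction is parseable but incomplete, so `missing` would tell the pipeline to hold this record for the human-verification step mentioned above.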

Multilingual Customer Support

Zero-shot prompting enables customer support automation across languages without maintaining separate models or training data for each language, leveraging the multilingual capabilities of modern LLMs [6][3]. This application is particularly valuable for companies serving global markets with limited resources for each language.

Example: A software company with customers in 40 countries implements zero-shot support triage using: “Categorize this support ticket’s issue type and urgency. Respond in English regardless of input language: Issue: [category], Urgency: [high/medium/low], Suggested team: [team name].” When a customer submits a ticket in Portuguese describing installation failures, the system correctly identifies “Issue: Technical-Installation, Urgency: high, Suggested team: Technical Support” and routes it appropriately—providing consistent global support without Portuguese-specific training data or separate models for each language market.

Best Practices

Use Explicit Output Formatting

Clearly specify the desired output structure, format, and constraints within the prompt to ensure consistent, parseable responses that integrate smoothly into downstream systems [3][1]. This practice reduces post-processing requirements and improves reliability in production environments.

Rationale: LLMs trained on diverse internet text have learned countless formatting conventions, making their default output format unpredictable. Explicit formatting instructions activate specific learned patterns, dramatically improving consistency.

Implementation Example: A financial services firm analyzing earnings call transcripts initially uses “Identify the key financial metrics mentioned” and receives inconsistent free-text responses mixing narrative and numbers. After revising to “Extract financial metrics in this format: METRIC | VALUE | CHANGE_FROM_PREVIOUS | CONTEXT. List each metric on a new line,” output consistency improves from 60% to 94%, enabling automated population of financial dashboards without manual reformatting—the explicit structure activates the model’s learned patterns for tabular data presentation.
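The revised pipe-delimited format above is what makes automated dashboard population possible. A minimal parser sketch (field names are assumptions matching the example prompt), which also measures format compliance by simply skipping non-conforming lines:

```python
def parse_metric_lines(text: str) -> list[dict[str, str]]:
    """Parse 'METRIC | VALUE | CHANGE_FROM_PREVIOUS | CONTEXT' lines,
    skipping any line that does not have exactly four fields."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 4:
            rows.append(dict(zip(["metric", "value", "change", "context"],
                                 parts)))
    return rows

rows = parse_metric_lines(
    "Revenue | $2.1B | +8% | driven by cloud segment\n"
    "The company also discussed guidance."   # narrative line, ignored
)
```

The ratio of parsed rows to total lines gives a cheap, automatable proxy for the consistency metric cited in the example.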

Incorporate Role and Expertise Framing

Begin prompts with explicit role assignments that define the perspective, expertise level, or professional context the model should adopt when responding [1][6]. This technique leverages the model’s ability to simulate different writing styles and knowledge domains learned during pre-training.

Rationale: Role framing activates domain-specific language patterns and reasoning approaches in the model’s learned representations, improving output quality and appropriateness for specialized tasks without domain-specific fine-tuning.

Implementation Example: A legal research platform compares outputs for contract analysis. The generic prompt “Analyze this clause for potential risks” yields superficial observations. The role-framed prompt “You are a corporate attorney reviewing a vendor contract for a Fortune 500 client. Analyze this indemnification clause for potential risks to the client, considering: scope of coverage, liability caps, and carve-outs. Provide specific concerns and suggested modifications” produces detailed analysis identifying three specific liability gaps and proposing concrete contract language revisions—demonstrating how role framing elevates zero-shot legal reasoning to professional utility.

Iterate with Prompt Ablation Testing

Systematically test prompt variations by changing one element at a time (wording, structure, specificity, examples of format) to identify which components most impact output quality for your specific use case [3][6]. This empirical approach optimizes zero-shot performance without requiring model changes.

Rationale: Zero-shot performance is highly sensitive to prompt formulation, with seemingly minor wording changes sometimes producing dramatically different results. Systematic testing reveals which elements matter most for your specific task and model.

Implementation Example: A content marketing team optimizing blog title generation tests five prompt variations, changing only the specificity of constraints: from “Generate a blog title” to “Generate a blog title (8-12 words, include the keyword ‘[keyword]’, use a question or how-to format, appeal to beginners)”. Testing each variant on 50 topics reveals that keyword inclusion improves SEO relevance by 40%, word count limits reduce unusably long titles from 30% to 5%, and format specification increases click-through rates by 25%—insights that shape their production prompt template and deliver measurable business impact through systematic ablation.

Leverage Instruction-Tuned Models

Prioritize using models that have undergone instruction tuning (like GPT-3.5/4, Claude, or FLAN-T5) rather than base models for zero-shot tasks, as instruction tuning specifically enhances instruction-following capabilities [3][1]. This choice provides immediate performance improvements without additional prompt engineering effort.

Rationale: Base language models are trained primarily on next-token prediction and may continue or rephrase prompts rather than following them as instructions. Instruction-tuned models have been specifically optimized to interpret and execute directives, dramatically improving zero-shot task completion.

Implementation Example: A research team benchmarking product categorization compares base GPT-3 (davinci) with instruction-tuned GPT-3.5 (gpt-3.5-turbo) using identical zero-shot prompts across 1,000 products. Base GPT-3 achieves 62% categorization accuracy and frequently generates product descriptions instead of categories, requiring extensive output parsing. GPT-3.5 achieves 89% accuracy with 98% output format compliance, reducing post-processing code from 200 lines to 20 lines—demonstrating that model selection is often more impactful than prompt optimization when instruction-tuned alternatives exist.

Implementation Considerations

Model Selection and API Access

Choosing the appropriate LLM and access method significantly impacts zero-shot performance, cost, latency, and data privacy [1][3]. Organizations must balance model capability (larger models generally perform better at zero-shot tasks) against practical constraints like API costs, response times, and data governance requirements.

Example: A healthcare startup evaluating zero-shot medical coding must consider that GPT-4 provides superior accuracy (87% vs. 76% for GPT-3.5) but costs 10x more per API call and has 2-3x higher latency. For their use case processing 100,000 patient records monthly, they implement a hybrid approach: GPT-3.5 for initial zero-shot coding with confidence scoring, escalating only low-confidence cases (15%) to GPT-4 for review. This reduces costs by 70% while maintaining 94% overall accuracy. Additionally, they deploy Azure OpenAI Service rather than OpenAI’s public API to meet HIPAA compliance requirements, demonstrating how implementation decisions extend beyond pure model performance.

Prompt Templating and Standardization

Developing standardized prompt templates with variable placeholders enables consistent zero-shot performance across teams, use cases, and time periods while facilitating maintenance and improvement [1][6]. Template libraries become organizational assets that encode best practices and domain knowledge.

Example: A customer service organization initially allows 15 support agents to write custom zero-shot prompts for email classification, resulting in inconsistent categorization that fragments reporting. They implement a template library with standardized prompts like: “Classify this customer email into exactly one category: {category_list}. Consider: {classification_criteria}. Email: {email_text}. Output format: CATEGORY: [category_name]”. This standardization improves inter-agent classification agreement from 68% to 91%, enables centralized prompt optimization (improvements benefit all agents immediately), and facilitates A/B testing of prompt variations—transforming ad-hoc prompting into a managed capability with measurable quality metrics.
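A template library can be as simple as named `string.Template` objects, which fail loudly when a placeholder is left unfilled. A sketch mirroring the standardized prompt above (the template name and placeholder values are illustrative):

```python
from string import Template

EMAIL_CLASSIFY = Template(
    "Classify this customer email into exactly one category: $category_list. "
    "Consider: $classification_criteria. Email: $email_text. "
    "Output format: CATEGORY: [category_name]"
)

# substitute() raises KeyError if any placeholder is missing, catching
# incomplete prompt assembly before an API call is ever made.
prompt = EMAIL_CLASSIFY.substitute(
    category_list="billing, technical, general",
    classification_criteria="the customer's primary request",
    email_text="I can't log in after the last update.",
)
```

Centralizing templates this way is what enables the “improvements benefit all agents immediately” property: one edit to the template propagates everywhere it is used.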

Evaluation and Monitoring Infrastructure

Implementing systematic evaluation of zero-shot outputs and monitoring for performance drift over time is essential for production reliability [3][1]. Unlike traditional ML models with fixed behavior, LLM outputs can vary with model updates, and zero-shot performance may degrade as input distributions shift.

Example: An e-commerce platform using zero-shot product categorization implements a multi-layered evaluation system: (1) automated format validation checking that outputs match expected structure (runs on 100% of predictions), (2) confidence scoring where outputs below threshold trigger human review (flags 8% of cases), (3) weekly random sampling of 500 predictions for expert evaluation tracking accuracy trends, and (4) A/B testing of prompt variations on 5% of traffic before full deployment. This infrastructure detects a 12% accuracy drop when the platform expands to a new product vertical (camping gear), triggering prompt refinement that incorporates category-specific terminology—preventing quality degradation that would have impacted 50,000 products without monitoring.
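Layer (1) of the system above, automated format validation, is typically just a strict pattern check run on every prediction. A minimal sketch, assuming single-word category names and an output contract of exactly “CATEGORY: <name>”:

```python
import re

def validate_category_output(output: str, allowed: set[str]) -> bool:
    """Return True only when the output is exactly 'CATEGORY: <name>'
    with a name from the allowed set; anything else fails validation
    and should be flagged for review or retry."""
    m = re.fullmatch(r"CATEGORY:\s*(\S+)", output.strip())
    return bool(m) and m.group(1) in allowed

allowed = {"electronics", "outdoor", "apparel"}
ok = validate_category_output("CATEGORY: outdoor", allowed)
bad = validate_category_output("It looks like outdoor gear to me.", allowed)
```

Because this check is cheap and deterministic, it can run on 100% of predictions, while the more expensive sampled expert review in layers (3) and (4) tracks accuracy rather than format.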

Domain-Specific Customization

Adapting zero-shot prompts to incorporate domain terminology, constraints, and context significantly improves performance for specialized applications [3][6]. This customization leverages the model’s broad pre-training while guiding it toward domain-appropriate responses.

Example: A pharmaceutical company implementing zero-shot adverse event detection from clinical trial reports initially uses a generic prompt: “Identify any adverse events mentioned in this report.” This yields 60% recall, missing events described with technical medical terminology. After consulting with clinical experts, they revise to: “You are a pharmacovigilance specialist reviewing a clinical trial report. Identify all adverse events including: serious adverse events (SAEs), adverse drug reactions (ADRs), and adverse events of special interest (AESIs). Include events described with medical terminology (e.g., ‘myocardial infarction’ not just ‘heart attack’). For each event, extract: event term, severity grade (1-5), relationship to study drug, and outcome.” This domain-customized prompt improves recall to 87% and precision to 92%, approaching the 90%/95% performance of their specialized fine-tuned model—demonstrating that domain expertise embedded in prompts can partially substitute for domain-specific training data.

Common Challenges and Solutions

Challenge: Inconsistent Output Quality

Zero-shot prompting often produces variable output quality across different inputs, with some responses being highly accurate while others are irrelevant, incomplete, or hallucinated [3][1]. This inconsistency creates reliability concerns for production systems where predictable performance is essential. The challenge intensifies when inputs vary in complexity, length, or domain—a prompt that works well for simple cases may fail on edge cases.

Solution:

Implement a multi-strategy approach combining prompt refinement, output validation, and confidence-based routing [1][3]. First, enhance prompt specificity by adding explicit constraints, examples of desired output format (not task examples), and error-prevention instructions like “If you cannot determine the answer from the provided information, respond with ‘INSUFFICIENT_DATA’ rather than guessing.” Second, build automated output validation that checks for format compliance, completeness, and logical consistency—flagging anomalies for human review. Third, implement confidence scoring or request the model to indicate certainty, routing low-confidence outputs to alternative processing paths.

Example: A legal document analysis service experiencing 30% unusable outputs implements this solution: they revise their contract clause extraction prompt to include “Only extract clauses explicitly present in the text—do not infer or generate clauses” and add output format validation checking for required fields. They also modify the prompt to request: “Confidence: [high/medium/low]” for each extraction. Low-confidence extractions (18% of total) are routed to a few-shot prompting pipeline with examples, while validation failures (8%) trigger human review. This reduces unusable outputs from 30% to 4% while maintaining processing speed for the 88% of cases that pass validation—creating a reliable production system from initially inconsistent zero-shot performance.
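The routing logic in this solution reduces to a small, testable function. A sketch under the assumption that the model was prompted to self-report confidence and to answer `INSUFFICIENT_DATA` when unsure (field names are illustrative):

```python
def route_by_confidence(result: dict) -> str:
    """Decide the next processing stage for a zero-shot extraction based
    on its self-reported confidence and the INSUFFICIENT_DATA sentinel."""
    if result.get("answer") == "INSUFFICIENT_DATA":
        return "human_review"
    confidence = result.get("confidence", "low")
    if confidence == "high":
        return "accept"
    if confidence == "medium":
        return "few_shot_retry"   # re-run through a few-shot prompting pipeline
    return "human_review"

route = route_by_confidence({"answer": "Clause 7.2", "confidence": "medium"})
```

Defaulting a missing confidence field to `"low"` is a deliberately conservative choice: an output that ignored the directive should never be auto-accepted.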

Challenge: Domain-Specific Knowledge Gaps

While LLMs possess broad general knowledge, they often lack deep expertise in specialized domains, leading to superficial or inaccurate zero-shot responses for technical, medical, legal, or niche industry tasks [3][6]. The model may use correct-sounding terminology while making substantive errors, creating dangerous false confidence in outputs.

Solution:

Augment zero-shot prompts with retrieval-augmented generation (RAG) patterns that provide relevant domain knowledge within the prompt context, or implement domain expert review workflows for critical applications [6][3]. For RAG approaches, retrieve relevant documentation, guidelines, or reference materials based on the input, then include this context in the prompt: “Using the following reference information: {retrieved_context}, perform this task: {instruction}.” For critical applications where errors have serious consequences, implement mandatory expert review with the zero-shot output serving as a draft to accelerate rather than replace human expertise.

Example: A medical device company using zero-shot prompting to generate regulatory compliance documentation initially produces submissions with 40% error rate in technical specifications—the model uses plausible-sounding but incorrect regulatory terminology. They implement a RAG solution: when generating compliance documentation for a device component, their system first retrieves relevant sections from FDA guidance documents, ISO standards, and their internal compliance manual, then provides this context in the prompt: “Using these regulatory requirements: {retrieved_regulations}, generate the compliance documentation for {device_component} addressing: {specific_requirements}.” This reduces errors to 12%. For the remaining errors, they implement a workflow where regulatory specialists review and edit the zero-shot draft rather than writing from scratch, reducing documentation time by 60% while maintaining 100% accuracy after review—combining zero-shot efficiency with domain expertise.
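The prompt-assembly half of a RAG pipeline is straightforward string composition; the retrieval half (vector search, keyword search) is out of scope here. A hedged sketch with illustrative document snippets:

```python
def build_rag_prompt(instruction: str, retrieved_docs: list[str]) -> str:
    """Wrap a zero-shot instruction with retrieved reference material and
    constrain the model to that context, reducing domain-knowledge errors."""
    context = "\n---\n".join(retrieved_docs)
    return (
        f"Using the following reference information:\n{context}\n\n"
        f"Perform this task: {instruction}\n"
        "Answer only from the reference information; if it is insufficient, "
        "respond with 'INSUFFICIENT_CONTEXT'."
    )

rag_prompt = build_rag_prompt(
    "Summarize the labeling requirements for this device component.",
    ["ISO standard excerpt ...", "Internal compliance manual, section 4 ..."],
)
```

The closing constraint sentence is the key hallucination control: it converts “the model doesn’t know” from a fabrication risk into an explicit, routable signal.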

Challenge: Prompt Sensitivity and Brittleness

Zero-shot performance can be highly sensitive to minor prompt variations, with seemingly insignificant wording changes producing dramatically different output quality [1][3]. This brittleness makes prompt engineering feel more like art than science and creates maintenance challenges when prompts need updating.

Solution:

Develop systematic prompt testing protocols that evaluate multiple prompt variations across diverse test cases before production deployment [3][1]. Create a test suite of representative inputs spanning common cases, edge cases, and known failure modes. For each prompt variation, measure performance across this suite using quantitative metrics (accuracy, format compliance, latency) and qualitative assessment. Document which prompt elements most impact performance for your specific use case. Implement version control for prompts and A/B testing infrastructure to validate improvements before full rollout.

Example: A content moderation team discovers that changing “Determine if this violates our policy” to “Evaluate whether this violates our policy” reduces false positive rate from 15% to 9%—a significant impact from one word change. They implement a systematic testing protocol: maintaining a test suite of 500 labeled examples (300 policy violations, 200 acceptable content) spanning 10 content categories. Before deploying any prompt change, they evaluate it against this suite, tracking precision, recall, and per-category performance. They discover that prompts using “determine” perform better on explicit violations while “evaluate” handles nuanced cases better, leading them to implement category-specific prompts. They also establish A/B testing where new prompt variants run on 10% of traffic for one week before full deployment, catching a prompt revision that improved overall accuracy by 3% but degraded performance on hate speech detection by 8%—preventing a critical regression through systematic testing.
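The per-variant scoring in this protocol is ordinary classification metrics over the labeled test suite. A minimal sketch for a binary moderation label (the label strings are illustrative):

```python
def score_variant(predictions: list[str], labels: list[str]) -> dict[str, float]:
    """Compute accuracy plus precision/recall for the 'violation' class,
    the metrics tracked per prompt variant in the testing protocol."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == l == "violation" for p, l in pairs)
    fp = sum(p == "violation" and l != "violation" for p, l in pairs)
    fn = sum(p != "violation" and l == "violation" for p, l in pairs)
    correct = sum(p == l for p, l in pairs)
    return {
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

scores = score_variant(
    ["violation", "safe", "violation", "safe"],
    ["violation", "safe", "safe", "violation"],
)
```

Running this for every prompt variant over the same fixed suite is what turns “determine vs. evaluate” wording choices from intuition into a measured decision.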

Challenge: Hallucination and Factual Errors

LLMs sometimes generate plausible-sounding but factually incorrect information in zero-shot scenarios, particularly when asked about specific facts, dates, statistics, or technical details [3][1]. These hallucinations can be difficult to detect because they’re often presented with confident, authoritative language that mimics accurate information.

Solution:

Implement explicit hallucination-reduction techniques in prompts and build verification workflows for factual claims [1][3]. Prompt techniques include: instructing the model to acknowledge uncertainty (“If you don’t know, say ‘I don’t have enough information’ rather than guessing”), requesting citations or reasoning (“Explain your reasoning for this conclusion”), and constraining responses to provided context (“Answer only using information from the following text: {context}”). For applications requiring factual accuracy, implement automated fact-checking against authoritative sources or human verification workflows for claims before publication or decision-making.

Example: A financial news summarization service using zero-shot prompting to generate market summaries discovers that 22% of summaries contain factual errors—incorrect stock prices, misattributed quotes, or fabricated statistics. They implement a multi-layered solution: (1) revise prompts to include “Only include specific numbers, dates, and quotes that appear in the source articles. If you’re unsure about a detail, omit it rather than approximating,” (2) add a verification step where another LLM call checks each factual claim against the source articles, flagging discrepancies, (3) implement automated validation for numerical claims (stock prices, percentages) against market data APIs. This reduces factual errors to 3%, with the remaining errors caught by human editors before publication. The verification step adds 2 seconds to processing time but prevents publication of misinformation—demonstrating that hallucination mitigation requires both prompt engineering and systematic verification rather than relying solely on prompt design.
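The automated validation of numerical claims in step (3) can start with a crude but effective check: any number in the generated summary that never appears verbatim in the sources gets flagged. A sketch (a real system would normalize formats and units):

```python
import re

def unverified_numbers(summary: str, sources: list[str]) -> list[str]:
    """Flag numeric claims in a generated summary that do not appear
    verbatim in any source article, as a cheap hallucination screen."""
    source_text = " ".join(sources)
    claims = re.findall(r"\d+(?:\.\d+)?%?", summary)
    return [c for c in claims if c not in source_text]

flags = unverified_numbers(
    "Shares rose 7% to 142.50 after earnings.",
    ["The stock closed up 7% at 141.80 on heavy volume."],
)
```

Flagged numbers are not necessarily wrong (a legitimate rounding would trip this check), which is why they route to the human editors rather than being auto-rejected.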

Challenge: Scaling Costs and Latency

While zero-shot prompting eliminates training costs, production deployment at scale can incur significant API costs and latency challenges, particularly when using large, capable models or processing high volumes [1][6]. A prototype that costs pennies can become expensive at production scale, and response times acceptable for demos may be problematic for user-facing applications.

Solution:

Implement cost and latency optimization strategies including model tiering, prompt compression, caching, and batch processing [1][6]. Model tiering routes simple queries to smaller, faster, cheaper models while reserving large models for complex cases requiring advanced reasoning. Prompt compression removes unnecessary verbosity while preserving essential instructions. Caching stores responses for identical or similar queries to avoid redundant API calls. Batch processing aggregates multiple requests to amortize overhead and potentially access volume discounts.

Example: A customer service platform using zero-shot email classification processes 500,000 emails monthly, initially using GPT-4 for all classifications at $0.03 per email ($15,000/month) with 3-second average latency. They implement optimization: (1) develop a complexity classifier that routes 70% of straightforward emails to GPT-3.5 ($0.002 per email) and 30% of complex cases to GPT-4, reducing costs to $5,200/month, (2) implement semantic caching that identifies emails similar to previously processed ones (15% cache hit rate), saving an additional $780/month, (3) batch process non-urgent emails (60% of volume) in 100-email batches during off-peak hours, reducing per-email costs by 20% for batched emails, (4) compress prompts by removing redundant instructions, reducing token usage by 30%. Combined, these optimizations reduce monthly costs from $15,000 to $3,200 (79% reduction) while improving average latency from 3 seconds to 1.8 seconds for the majority of emails—demonstrating that production-scale zero-shot deployment requires systematic optimization beyond initial prompt design.
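The tiering arithmetic in step (1) is worth making explicit, since it drives the budget decision. A sketch with illustrative per-call prices (real API pricing is per token, not per email, so these are simplifications):

```python
def monthly_api_cost(n_cheap: int, n_premium: int,
                     cheap_price: float, premium_price: float) -> float:
    """Blended monthly cost under model tiering: simple requests go to a
    cheaper model, complex requests to the premium model."""
    return n_cheap * cheap_price + n_premium * premium_price

# 500k emails/month; baseline sends everything to the premium model,
# tiering routes 70% (350k) to the cheap model and 30% (150k) to premium.
baseline = monthly_api_cost(0, 500_000, 0.002, 0.03)
tiered = monthly_api_cost(350_000, 150_000, 0.002, 0.03)
```

Note that tiering only helps if the complexity classifier is cheap relative to the premium model and rarely mis-routes hard cases to the cheap tier; its error rate should be monitored like any other model in the pipeline.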

References

  1. Prompting Guide. (2024). Zero-Shot Prompting. https://www.promptingguide.ai/techniques/zeroshot
  2. Texas A&M University-Corpus Christi Libraries. (2024). Prompt Engineering: Shots. https://guides.library.tamucc.edu/prompt-engineering/shots
  3. AI21 Labs. (2024). Zero-Shot Prompting. https://www.ai21.com/glossary/foundational-llm/zero-shot-prompting/
  4. Learn Prompting. (2024). Introduction to Zero-Shot Prompting. https://learnprompting.org/docs/advanced/zero_shot/introduction
  5. IBM. (2024). Zero-Shot Prompting. https://www.ibm.com/think/topics/zero-shot-prompting
  6. Shelf. (2024). Zero-Shot and Few-Shot Prompting. https://shelf.io/blog/zero-shot-and-few-shot-prompting/
  7. Newline. (2024). Zero-Shot vs Few-Shot Prompting: Key Differences. https://www.newline.co/@zaoyang/zero-shot-vs-few-shot-prompting-key-differences–b4c84775
  8. Wei, J., et al. (2022). Finetuned Language Models Are Zero-Shot Learners. arXiv:2109.01652. https://arxiv.org/abs/2109.01652
  9. OpenAI. (2025). Prompt Engineering Guide. https://platform.openai.com/docs/guides/prompt-engineering