AI feature development playbook
This playbook outlines the key aspects of working with Large Language Models (LLMs), prompts, data, evaluation, and system architecture. It serves as a playbook for AI feature development and operational considerations.
Understanding prompt engineering
For an overview, see this video.
Most important takeaways:
-
Definition of a prompt:
- An instruction sent to a language model to solve a task
- Forms the core of AI features in user interfaces
-
Importance of prompt quality:
- Greatly influences the quality of the language model's response
- Iterating on prompts is crucial for optimal results
-
Key considerations when crafting prompts:
- Understand the task you're asking the model to perform
- Know what kind of response you're expecting
- Prepare a dataset to test the prompts
- Be specific - provide lots of details and context to help the AI understand
- Give examples of potential questions and desired answers
-
Prompt universality:
- Prompts are not universal across different language models
- When changing models, prompts need to be adjusted
- Consult the language model provider's documentation for specific tips
- Test new models before fully switching
-
Tools for working with prompts:
- Anthropic Console: A platform for writing and testing prompts
- Generator Prompt: A tool that creates crafted prompts based on task descriptions
-
Prompt structure:
- Typically includes a general task description
- Contains placeholders for input text
- May include specific instructions and suggested output formats
- Consider wrapping inputs in XML tags for better understanding and data extraction
-
System prompts:
- Set the general tone and role for the AI
- Can improve the model's performance
- Usually placed at the beginning of the prompt
- Set the role for the language model
-
Best practices:
- Invest time in understanding the assignment
- Use prompt generation tools as a starting point
- Test and iterate on prompts to improve results
- Use proper English grammar and syntax to help the AI understand
- Allow uncertainty - tell the AI to say "I don't know" if it is unsure
- Use positive phrasing - say what the AI should do, not what it shouldn't do
Best practices for writing effective prompts
For an overview, see this video about writing effective prompts.
Here are the key takeaways from this video:
-
No universal "good" prompt:
- The effectiveness of a prompt depends on the specific task
- There's no one-size-fits-all approach to prompt writing
-
Characteristics of effective prompts:
- Clear and explanatory of the task and expected outcomes
- Direct and detailed
- Specific about the desired output
-
Key elements to consider:
- Understand the task, audience, and end goal
- Explain these elements clearly in the prompt
-
Strategies for improving prompt performance:
- Add instructions in sequential steps
- Include relevant examples
- Ask the model to think in steps (chain of thought)
- Request reasoning before providing answers
- Guide the input - use delimiters to clearly indicate where the user's input starts and ends
-
Adapting to model preferences:
- Adjust prompts to suit the preferred data structure of the model
- For example, Anthropic models work well with XML tags
-
Importance of system prompts:
- Set the role for the language model
- Placed at the beginning of the interaction
- Can include awareness of tools or long context
-
Iteration is crucial:
- Emphasized as the most important part of working with prompts
- Continual refinement leads to better results
- Build quality control - automate testing prompts with RSpec or Rake tasks to catch differences
-
Use traditional code:
- If a task can be done efficiently outside of calling an LLM, use code for more reliable and deterministic outputs
Tuning and optimizing workflows for prompts
Prompt tuning for LLMs using Langsmith and Anthropic Workbench together + CEF
Iterating on the prompt using Anthropic console
For an overview, see this video.
Iterating on the prompt using Langsmith
For an overview, see this video.
Using Datasets for prompt tuning with Langsmith
For an overview, see this video.
Using automated evaluation in Langsmith
For an overview, see this video.
Using pairwise experiments in Langsmith
For an overview, see this video.
When to use Langsmith and when ELI5
For an overview, see this video.
Key Points on ELI5 (Eval like I'm 5) Project
-
Initial Development
- Start with pure Langsmith for prompt iteration
- Easier and quicker to set up
- More cost-effective for early stages
-
When to Transition to ELI5
- When investing more in the feature
- For working with larger datasets
- For repeated, long-term use
-
ELI5 Setup Considerations
- Requires upfront time investment
- Need to adjust evaluations for specific features
- Set up input data (e.g., local GDK for chat features)
-
Challenges
- Ensuring consistent data across different users
- Exploring options like seats and imports for data sharing
-
Current ELI5 Capabilities
- Supports chat questions about code
- Handles documentation-related queries
- Includes evaluations for code suggestions
-
Advantages of ELI5
- Allows running evaluations on local GDK
- Results viewable in Langsmith UI
- Enables use of larger datasets
-
Flexibility
- Requires customization for specific use cases
- Not a one-size-fits-all solution
-
Documentation
- ELI5 has extensive documentation available
-
Adoption
- Already in use by some teams, including code suggestions and create teams
Evaluation & Monitoring
Building Datasets for Eval
Using CEF dashboard and troubleshooting
Using automated evaluation pipelines for CEF
Continuous monitoring and applying as guidance for Prompt Tuning
A/B testing strategies for Gen AI features
Further resources
For more comprehensive prompt engineering guides, see: