AI feature development playbook

This playbook covers the key aspects of working with Large Language Models (LLMs): prompts, data, evaluation, and system architecture. Use it as a guide for AI feature development and operational considerations.

Understanding prompt engineering

For an overview, see this video.

Most important takeaways:

Definition of a prompt:
- An instruction sent to a language model to solve a task
- Forms the core of AI features in user interfaces
Importance of prompt quality:
- Greatly influences the quality of the language model's response
- Iterating on prompts is crucial for optimal results
Key considerations when crafting prompts:
- Understand the task you're asking the model to perform
- Know what kind of response you're expecting
- Prepare a dataset to test the prompts
- Be specific - provide lots of details and context to help the AI understand
- Give examples of potential questions and desired answers
Prompt universality:
- Prompts are not universal across different language models
- When changing models, prompts need to be adjusted
- Consult the language model provider's documentation for specific tips
- Test new models before fully switching
Tools for working with prompts:
- Anthropic Console: A platform for writing and testing prompts
- Prompt generator: A tool that creates crafted prompts based on task descriptions
Prompt structure:
- Typically includes a general task description
- Contains placeholders for input text
- May include specific instructions and suggested output formats
- Consider wrapping inputs in XML tags for better understanding and data extraction
System prompts:
- Set the role and general tone for the language model
- Can improve the model's performance
- Usually placed at the beginning of the prompt
Best practices:
- Invest time in understanding the assignment
- Use prompt generation tools as a starting point
- Test and iterate on prompts to improve results
- Use proper English grammar and syntax to help the AI understand
- Allow uncertainty - tell the AI to say "I don't know" if it is unsure
- Use positive phrasing - say what the AI should do, not what it shouldn't do
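
The prompt structure described above (a system prompt, a task description, a placeholder for input wrapped in XML tags, a suggested output format, and permission to admit uncertainty) can be sketched as a small template builder. This is only an illustration under stated assumptions: the function name `build_prompt`, the tag names `<document>` and `<summary>`, and the wording of both prompts are hypothetical and do not come from any provider's documentation.

```python
from xml.sax.saxutils import escape

# Hypothetical system prompt: sets the role and tone for the model.
SYSTEM_PROMPT = "You are a senior software engineer who reviews merge requests."

# Hypothetical task template. The tag names are illustrative only.
PROMPT_TEMPLATE = """Summarize the merge request description inside the <document> tags.

<document>
{input_text}
</document>

Write the summary in at most three sentences inside <summary> tags.
If the description does not contain enough information, answer "I don't know"."""


def build_prompt(input_text: str) -> dict:
    """Return the system prompt and the user prompt with the input filled in."""
    # Escape <, > and & so user text cannot close the <document> tag early.
    safe_text = escape(input_text.strip())
    return {
        "system": SYSTEM_PROMPT,
        "user": PROMPT_TEMPLATE.format(input_text=safe_text),
    }


if __name__ == "__main__":
    prompt = build_prompt("Fixes a race condition in the pipeline scheduler.")
    print(prompt["system"])
    print(prompt["user"])
```

Keeping the template in one constant makes it easy to iterate on the wording and to run the same dataset against each revision.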

Best practices for writing effective prompts

For an overview, see this video about writing effective prompts.

Here are the key takeaways from this video:

No universal "good" prompt:
- The effectiveness of a prompt depends on the specific task
- There's no one-size-fits-all approach to prompt writing
Characteristics of effective prompts:
- Clear and explanatory of the task and expected outcomes
- Direct and detailed
- Specific about the desired output
Key elements to consider:
- Understand the task, audience, and end goal
- Explain these elements clearly in the prompt
Strategies for improving prompt performance:
- Add instructions in sequential steps
- Include relevant examples
- Ask the model to think in steps (chain of thought)
- Request reasoning before providing answers
- Guide the input - use delimiters to clearly indicate where the user's input starts and ends
Adapting to model preferences:
- Adjust prompts to suit the preferred data structure of the model
- For example, Anthropic models work well with XML tags
Importance of system prompts:
- Set the role for the language model
- Placed at the beginning of the interaction
- Can include awareness of tools or long context
Iteration is crucial:
- The most important part of working with prompts
- Continual refinement leads to better results
- Build quality control - automate testing prompts with RSpec or Rake tasks to catch differences
Use traditional code:
- If a task can be done efficiently outside of calling an LLM, use code for more reliable and deterministic outputs
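
The quality-control point above names RSpec or Rake tasks. The same idea is sketched here in Python purely for illustration: run a fixed dataset through the prompt and fail when the pass rate drops. Everything in this sketch is an assumption: `fake_model` is an offline stand-in for a real LLM call, and the dataset rows, the template wording, the `evaluate_prompt` helper, and the 0.9 threshold are hypothetical.

```python
from typing import Callable

# Hypothetical test dataset: inputs paired with a substring the answer must contain.
DATASET = [
    {"input": "How do I create a merge request?", "expected": "merge request"},
    {"input": "What is a pipeline?", "expected": "pipeline"},
]

# Hypothetical prompt under test: delimiters mark where the user's input starts and ends.
PROMPT_TEMPLATE = (
    "Answer the question wrapped in XML tags below. "
    "Think step by step before giving the final answer.\n"
    "<question>{question}</question>"
)


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline.

    It echoes the question text back, which is enough to exercise the harness.
    """
    start = prompt.index("<question>") + len("<question>")
    end = prompt.index("</question>")
    return f"Here is what I know about: {prompt[start:end]}"


def evaluate_prompt(template: str, model: Callable[[str], str]) -> float:
    """Run every dataset row through the model and return the pass rate."""
    passed = 0
    for row in DATASET:
        answer = model(template.format(question=row["input"]))
        # Case-insensitive substring check; a real evaluation would be stricter.
        if row["expected"].lower() in answer.lower():
            passed += 1
    return passed / len(DATASET)


if __name__ == "__main__":
    rate = evaluate_prompt(PROMPT_TEMPLATE, fake_model)
    print(f"pass rate: {rate:.0%}")
    # Fail the run if a prompt change drops quality below a chosen threshold.
    assert rate >= 0.9, "prompt regression detected"
```

Swapping `fake_model` for a function that calls the real model turns this into a check that can run after every prompt change.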

Tuning and optimizing workflows for prompts

Prompt tuning for LLMs using Langsmith and Anthropic Workbench together + CEF

Iterating on the prompt using Anthropic Console

For an overview, see this video.

Iterating on the prompt using Langsmith

For an overview, see this video.

Using Datasets for prompt tuning with Langsmith

For an overview, see this video.

Using automated evaluation in Langsmith

For an overview, see this video.

Using pairwise experiments in Langsmith

For an overview, see this video.

View the ELI5 documentation.

When to use Langsmith and when ELI5

For an overview, see this video.

Key Points on ELI5 (Eval like I'm 5) Project

Initial Development
- Start with pure Langsmith for prompt iteration
- Easier and quicker to set up
- More cost-effective for early stages
When to Transition to ELI5
- When investing more in the feature
- For working with larger datasets
- For repeated, long-term use
ELI5 Setup Considerations
- Requires upfront time investment
- Need to adjust evaluations for specific features
- Set up input data (e.g., local GDK for chat features)
Challenges
- Ensuring consistent data across different users
- Exploring options like seeds and imports for data sharing
Current ELI5 Capabilities
- Supports chat questions about code
- Handles documentation-related queries
- Includes evaluations for code suggestions
Advantages of ELI5
- Allows running evaluations on local GDK
- Results viewable in Langsmith UI
- Enables use of larger datasets
Flexibility
- Requires customization for specific use cases
- Not a one-size-fits-all solution
Documentation
- ELI5 has extensive documentation available
Adoption
- Already in use by some teams, including the Code Suggestions and Create teams

Evaluation & Monitoring

Building Datasets for Eval

Using CEF dashboard and troubleshooting

Using automated evaluation pipelines for CEF

Continuous monitoring and applying as guidance for Prompt Tuning

A/B testing strategies for Gen AI features

Further resources