DSPy-MIPRO
Overview
DSPy-MIPRO is a prompt optimization algorithm from the DSPy project and paper. This implementation is based on DSPy's MIPROv2 optimizer as of 2024.10.21.
Description
MIPRO (Multi-prompt Instruction PRoposal Optimizer) optimizes prompts by proposing several candidate instructions and several candidate groups of few-shot examples at the same time. It then uses the hyperparameter optimization library Optuna to select the best combination of proposed instructions and few-shot examples.
Differences Between Implementation and Paper
Single Prompt Optimization
While the original DSPy project and the MIPRO algorithm are designed for end-to-end LLM chain optimization, Ape’s implementation focuses on single prompt optimization, in line with Ape’s development philosophy.
Improvements in Instruction Suggestion
In DSPy’s MIPRO implementation, one selected tip is used for every suggested instruction. However, in Ape’s implementation, tips are randomly selected for each suggested instruction to enhance variety.
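The difference between the two tip strategies can be sketched as follows. This is a minimal illustration, not the DSPy or Ape code: the `TIPS` dictionary, the `build_proposal_requests` function, and the prompt text are all hypothetical names chosen for the example.

```python
import random

# Hypothetical tips; the actual tip lists in DSPy and Ape may differ.
TIPS = {
    "creative": "Don't be afraid to be creative.",
    "simple": "Keep the instruction clear and concise.",
    "descriptive": "Make the instruction informative and descriptive.",
}


def build_proposal_requests(base_prompt, num_candidates, tip_per_candidate, seed=0):
    """Build one instruction-proposal request per candidate.

    tip_per_candidate=False: one tip is drawn once and reused for every candidate.
    tip_per_candidate=True:  a tip is drawn independently for each candidate.
    """
    rng = random.Random(seed)
    tip_keys = sorted(TIPS)
    shared_tip = rng.choice(tip_keys)
    requests = []
    for _ in range(num_candidates):
        tip_key = rng.choice(tip_keys) if tip_per_candidate else shared_tip
        requests.append(
            {"tip": tip_key, "prompt": f"{base_prompt}\nTip: {TIPS[tip_key]}"}
        )
    return requests


shared = build_proposal_requests("Propose a better instruction.", 5, tip_per_candidate=False)
varied = build_proposal_requests("Propose a better instruction.", 5, tip_per_candidate=True)
```

With `tip_per_candidate=False` every request carries the same tip. With `tip_per_candidate=True` each request draws its own, so the proposals are more likely to differ in style.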
Unique Insights/Techniques of the Paper
Few-shot Example Selection
The few-shot example selection algorithm is unique. It selects examples on which the model successfully solves the task, referred to as bootstrapped_fewshot. This is achieved by adding intermediate steps, such as chain-of-thought reasoning or retrieval-augmented generation steps, into the few-shot examples.
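The selection idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not DSPy's implementation: `toy_program` is a hypothetical stand-in for an LLM call that returns its reasoning and its answer, and `Demo` and `bootstrap_fewshot` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    reasoning: str  # intermediate step kept alongside the answer
    answer: str


def toy_program(question):
    """Hypothetical stand-in for an LLM call; returns (reasoning, answer)."""
    a, b = (int(x) for x in question.split("+"))
    # Deliberately wrong for large operands, to mimic model failures.
    total = a + b if a < 10 else a - b
    return f"Add {a} and {b}.", str(total)


def bootstrap_fewshot(trainset, program, metric, max_demos):
    """Keep only the examples the program solves, including its reasoning."""
    demos = []
    for question, gold in trainset:
        reasoning, prediction = program(question)
        if metric(prediction, gold):
            demos.append(Demo(question, reasoning, prediction))
        if len(demos) >= max_demos:
            break
    return demos


trainset = [("1+2", "3"), ("12+5", "17"), ("4+4", "8"), ("3+6", "9")]
demos = bootstrap_fewshot(trainset, toy_program, lambda p, g: p == g, max_demos=2)
```

Here the second training example is dropped because the toy program answers it incorrectly, so `demos` holds the traces for `1+2` and `4+4`.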
Hyperparameter Optimization Approach
Optuna is used to select the optimal combination of suggested instructions and few-shot examples. This allows the algorithm to consider not only the individual impact of each prompt improvement but also the interaction effects between different improvements.
Theoretically, this expands the search space from k*N (where k is the number of types of improvements, and N is the number of suggestions for each type) to N^k. Despite the larger search space, the algorithm navigates it efficiently by using Bayesian optimization with Optuna's TPE (Tree-structured Parzen Estimator) sampler.
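A toy example makes the k*N versus N^k distinction concrete. It is a sketch under assumptions: the candidate names and the `score` function are made up, scores are in percentage points, and exhaustive enumeration stands in for Optuna's TPE sampler, which would sample the joint space rather than enumerate it.

```python
from itertools import product

instructions = ["terse", "step_by_step", "role_play"]  # N = 3 candidates
fewshot_sets = ["set_a", "set_b", "set_c"]             # N = 3 candidates
k = 2                                                  # two types of improvement


def score(instruction, fewshot):
    """Hypothetical evaluation score with one interaction effect."""
    base = {"terse": 50, "step_by_step": 60, "role_play": 55}[instruction]
    bonus = {"set_a": 5, "set_b": 10, "set_c": 0}[fewshot]
    # "terse" and "set_c" are weak alone but strong together.
    interaction = 25 if (instruction, fewshot) == ("terse", "set_c") else 0
    return base + bonus + interaction


# Independent search: vary one axis at a time around a default (k*N evaluations).
default_instruction, default_fewshot = "role_play", "set_a"
best_instruction = max(instructions, key=lambda i: score(i, default_fewshot))
best_fewshot = max(fewshot_sets, key=lambda f: score(default_instruction, f))
independent_choice = (best_instruction, best_fewshot)
independent_evals = k * len(instructions)

# Joint search: every combination is a candidate (N^k combinations).
joint_space = list(product(instructions, fewshot_sets))
joint_choice = max(joint_space, key=lambda pair: score(*pair))
```

The independent search picks `("step_by_step", "set_b")` with a score of 70, while the joint search finds `("terse", "set_c")` with a score of 75, a combination that only pays off when both parts are chosen together.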
Potential Limitations
Limited Instruction Suggestion Space
Since instruction suggestions are based solely on the LLM and predefined tips, the diversity of suggestions is somewhat limited. While the paper mentions using previous suggestions in instruction proposals, this has not been implemented in the DSPy repository as of 2024.10.21. Additionally, MIPRO is not an iterative algorithm, meaning previous suggestion history must be managed outside the optimization algorithm.
Few-shot Example Selection Algorithm
The few-shot example selection algorithm is innovative, but it may not align well with single-prompt optimization for state-of-the-art models. In the paper, bootstrapped few-shot examples are used to train a student model to mimic a teacher model’s behavior. This setup may not be optimal for prompt optimization when the student and teacher models are the same (as is the case with state-of-the-art models).
Limited Number of Improvements
This method applies only two types of improvement in each optimization: instruction improvement and few-shot example improvement. As a result, its performance is unlikely to keep improving as the training dataset grows.
Difficult to Use Iteratively
This method applies instruction and few-shot example improvements at the same time. Instruction-based improvements can usually be applied iteratively, but few-shot improvements cannot. This makes it hard to apply the method iteratively to a previously optimized prompt.
Benchmark Performance
Summary
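DSPy-MIPRO improves on the baseline for the reasoning benchmarks MATH and GPQA on both the trainset and the testset. On the testset, it scores below the baseline on BIRD-bench and New York Times Topics.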
Trainset Scores
Benchmarks \ Methods | Baseline | finetuned baseline | DSPy-MIPRO |
---|---|---|---|
BIRD-bench (SQL) | 0.291 | 0.449 (▲) | 0.439 (▲) |
BoolQ (QA) | 0.906 | 1.000 (▲) | 0.960 (▲) |
GPQA (Reasoning) | 0.186 | 0.184 (▼) | 0.240 (▲) |
MATH (Reasoning) | 0.626 | 0.566 (▼) | 0.760 (▲) |
New York Times Topics (Classification) | 0.836 | 0.914 (▲) | 0.920 (▲) |
Testset Scores
Benchmarks \ Methods | Baseline | finetuned baseline | DSPy-MIPRO |
---|---|---|---|
BIRD-bench (SQL) | 0.307 | 0.473 (▲) | 0.242 (▼) |
BoolQ (QA) | 0.850 | 0.892 (▲) | 0.860 (▲) |
GPQA (Reasoning) | 0.146 | 0.080 (▼) | 0.180 (▲) |
MATH (Reasoning) | 0.610 | 0.426 (▼) | 0.650 (▲) |
New York Times Topics (Classification) | 0.794 | 0.818 (▲) | 0.700 (▼) |