DSPy-MIPRO

Overview

DSPy-MIPRO is an algorithm from the DSPy project and paper. This implementation is based on the MIPROv2 optimizer as it was implemented in the DSPy repository as of 2024.10.21.

Description

MIPRO (Multiprompt Instruction Proposal Optimizer) optimizes a prompt by proposing multiple candidate instructions and multiple sets of few-shot examples at the same time. It then uses the hyperparameter optimization library Optuna to select the best combination of a proposed instruction and a few-shot example set.

Differences Between Implementation and Paper

Single Prompt Optimization

While the original DSPy project and the MIPRO algorithm are designed for end-to-end LLM chain optimization, Ape’s implementation focuses on single prompt optimization, in line with Ape’s development philosophy.

Improvements in Instruction Suggestion

In DSPy’s MIPRO implementation, one selected tip is used for every suggested instruction. However, in Ape’s implementation, tips are randomly selected for each suggested instruction to enhance variety.
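The difference between the two tip strategies can be sketched in a few lines. This is a minimal illustration, not code from DSPy or Ape: the tip texts, the `TIPS` dictionary, and both function names are hypothetical.

```python
import random

# Hypothetical tips for illustration; the real tip list lives in the library source.
TIPS = {
    "creative": "Don't be afraid to be creative.",
    "simple": "Keep the instruction clear and concise.",
    "description": "Make the instruction informative and descriptive.",
}


def tips_fixed(num_candidates, rng):
    """Draw one tip and reuse it for every proposed instruction."""
    tip = rng.choice(sorted(TIPS))
    return [tip] * num_candidates


def tips_per_candidate(num_candidates, rng):
    """Draw a tip independently for each proposed instruction."""
    keys = sorted(TIPS)
    return [rng.choice(keys) for _ in range(num_candidates)]


rng = random.Random(0)
fixed = tips_fixed(8, rng)           # same tip name repeated 8 times
varied = tips_per_candidate(8, rng)  # tip names may differ per candidate
```

With per-candidate sampling, each instruction proposal prompt can carry a different tip, which is the source of the added variety described above.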

Unique Insights/Techniques of the Paper

Few-shot Example Selection

The few-shot example selection algorithm is unique. It runs the model on training examples and keeps those that the model solves successfully, referred to as bootstrapped_fewshot. Intermediate steps, such as chain-of-thought reasoning or retrieval-augmented generation steps, are added to the resulting few-shot examples.
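A rough sketch of the bootstrapping idea follows. It is an assumption-laden toy, not the DSPy implementation: `toy_model` stands in for an LLM call, and `Demo`, `bootstrap_fewshot`, and the exact-match metric are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    reasoning: str  # intermediate step kept alongside the final answer
    answer: str


def toy_model(question):
    """Hypothetical stand-in for an LLM call; returns (reasoning, answer)."""
    a, b = (int(x) for x in question.split("+"))
    # Deliberately wrong for large operands, to mimic model failures.
    total = a + b if a < 10 else a - b
    return f"{a} plus {b} is {total}", str(total)


def bootstrap_fewshot(trainset, model, metric, max_demos):
    """Keep only the examples the model solves, including its reasoning trace."""
    demos = []
    for question, gold in trainset:
        reasoning, answer = model(question)
        if metric(answer, gold):
            demos.append(Demo(question, reasoning, answer))
        if len(demos) >= max_demos:
            break
    return demos


trainset = [("1+2", "3"), ("12+5", "17"), ("4+4", "8"), ("3+6", "9")]
demos = bootstrap_fewshot(trainset, toy_model, lambda pred, gold: pred == gold, max_demos=2)
```

Here the second training example is dropped because the toy model answers it incorrectly, so only successful traces end up as demonstrations.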

Hyperparameter Optimization Approach

Optuna is used to select the optimal combination of suggested instructions and few-shot examples. This allows the algorithm to consider not only the individual impact of each prompt improvement but also the interaction effects between different improvements.

Theoretically, this expands the search space from k*N (where k is the number of types of improvements, and N is the number of suggestions for each type) to N^k. Despite this larger search space, the algorithm navigates it efficiently using Bayesian TPE (Tree-structured Parzen Estimator) sampling.
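The two search-space sizes can be checked with a small calculation. The values k = 2 and N = 10 are illustrative assumptions, and the function names are hypothetical.

```python
from itertools import product


def independent_evaluations(k, n):
    """Candidates to score when each improvement type is tuned on its own."""
    return k * n


def joint_combinations(k, n):
    """Candidates when one suggestion per type is chosen jointly."""
    return n ** k


k, n = 2, 10  # two improvement types, ten suggestions each (assumed values)
combos = list(product(range(n), repeat=k))  # every joint choice, enumerated
```

For these values, tuning each type separately covers 20 candidates, while the joint space holds 100 combinations, which is why a sample-efficient search strategy matters.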

Potential Limitations

Limited Instruction Suggestion Space

Since instruction suggestions are based solely on the LLM and predefined tips, the diversity of suggestions is somewhat limited. While the paper mentions using previous suggestions in instruction proposals, this has not been implemented in the DSPy repository as of 2024.10.21. Additionally, MIPRO is not an iterative algorithm, meaning previous suggestion history must be managed outside the optimization algorithm.

Few-shot Example Selection Algorithm

The few-shot example selection algorithm is innovative, but it may not align well with single-prompt optimization for state-of-the-art models. In the paper, bootstrapped few-shot examples are used to train a student model to mimic a teacher model’s behavior. This setup may not be optimal for prompt optimization when the student and teacher models are the same (as is the case with state-of-the-art models).

Limited Number of Improvement Types

This method applies only two types of improvement in each optimization run: instruction improvement and few-shot example improvement. As a result, its performance is hard to improve further as the training dataset size increases.

Difficult to Use Iteratively

This method applies instruction and few-shot example improvements at the same time. Instruction-based improvements can usually be applied iteratively, but few-shot improvements cannot. This makes it hard to apply the method iteratively to a previously optimized prompt.

Benchmark Performance

Summary

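DSPy-MIPRO performs well on reasoning benchmarks such as MATH and GPQA, where it improves on the baseline on both the trainset and the testset. On the testset it scores below the baseline on BIRD-bench and New York Times Topics.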

Trainset Scores

Benchmarks \ Methods	Baseline	Fine-tuned Baseline	DSPy-MIPRO
BIRD-bench (SQL)	0.291	0.449 (▲)	0.439 (▲)
BoolQ (QA)	0.906	1.000 (▲)	0.960 (▲)
GPQA (Reasoning)	0.186	0.184 (▼)	0.240 (▲)
MATH (Reasoning)	0.626	0.566 (▼)	0.760 (▲)
New York Times Topics (Classification)	0.836	0.914 (▲)	0.920 (▲)

Testset Scores

Benchmarks \ Methods	Baseline	Fine-tuned Baseline	DSPy-MIPRO
BIRD-bench (SQL)	0.307	0.473 (▲)	0.242 (▼)
BoolQ (QA)	0.850	0.892 (▲)	0.860 (▲)
GPQA (Reasoning)	0.146	0.080 (▼)	0.180 (▲)
MATH (Reasoning)	0.610	0.426 (▼)	0.650 (▲)
New York Times Topics (Classification)	0.794	0.818 (▲)	0.700 (▼)