OptunaTrainer
Overview
OptunaTrainer is inspired by DSPy but focuses solely on instruction optimization.
Description
OptunaTrainer is a variation of DSPy-MIPRO. Unlike DSPy-MIPRO, which optimizes both instructions and few-shot examples, OptunaTrainer optimizes instructions only, generating two types of instruction candidates. The second type uses metadata from evaluation metrics to guide the optimization process.
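To make the second candidate type concrete, below is a minimal sketch of an evaluation metric that returns metadata alongside its score. The function name and output format are hypothetical illustrations, not OptunaTrainer's actual interface; metadata such as the most frequent label confusions is the kind of signal that can be fed back into instruction generation.

```python
# Hypothetical sketch: a classification metric that returns a score plus metadata.
from collections import Counter

def micro_f1_with_metadata(predictions: list[str], labels: list[str]) -> dict:
    """Compute micro-F1 for single-label classification and collect common confusions."""
    tp, fp, fn = 0, 0, 0
    confusions = Counter()
    for pred, gold in zip(predictions, labels):
        if pred == gold:
            tp += 1
        else:
            fp += 1  # wrong prediction counts against the predicted class
            fn += 1  # and as a miss for the gold class
            confusions[(gold, pred)] += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"score": f1, "metadata": {"most_common_confusions": confusions.most_common(3)}}

print(micro_f1_with_metadata(["sports", "politics", "sports"], ["sports", "politics", "arts"]))
```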
Motivation for developing this method
While DSPy-MIPRO’s hyperparameter optimization approach is powerful, it has some limitations. It doesn’t learn directly from the training dataset, and its instruction optimization relies entirely on predefined tips.
Team Weavel discovered that, in some real-world cases, people prefer to evaluate and optimize prompts based on complex metrics, not just accuracy. For example, some users may want to optimize for metrics like ROUGE, BLEU, Precision, Recall, or MICRO-F1. To address these needs, OptunaTrainer was developed.
Key Differences
Compared to DSPy-MIPRO, OptunaTrainer generates additional optimized instruction candidates instead of few-shot example candidates. For these new candidates, it extracts insights from the metadata of evaluation metrics and combines them with predefined tips; each candidate is generated using different, randomly selected tips. After generating both types of candidates, OptunaTrainer searches over them with Optuna's TPESampler, as sketched below.
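The selection step can be illustrated with plain Optuna. The sketch below assumes hypothetical candidate lists and a placeholder `evaluate_prompt` scorer; it shows how a TPESampler searches over combinations of tip-based and metric-based instruction candidates, and is not the library's actual implementation.

```python
# Minimal, self-contained sketch of candidate selection with Optuna's TPESampler.
import optuna

# Assumed candidate pools: one generated from predefined tips,
# one generated from evaluation-metric metadata.
tip_based_candidates = [
    "Think step by step before answering.",
    "Answer concisely and cite the relevant passage.",
]
metric_based_candidates = [
    "Pay extra attention to rarely occurring labels.",   # e.g. derived from recall metadata
    "Do not guess a label when the text is ambiguous.",  # e.g. derived from precision metadata
]

def evaluate_prompt(instruction: str) -> float:
    # Placeholder scorer so the sketch runs; in practice this would run the
    # prompt against a validation set and return the chosen metric.
    return len(instruction) % 10 / 10

def objective(trial: optuna.Trial) -> float:
    # TPE proposes which candidate from each pool to try next, based on past trials.
    tip_instr = trial.suggest_categorical("tip_instruction", tip_based_candidates)
    metric_instr = trial.suggest_categorical("metric_instruction", metric_based_candidates)
    return evaluate_prompt(f"{tip_instr}\n{metric_instr}")

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```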
Benchmark Performance
Summary
OptunaTrainer shows the largest training-set gains among the compared methods, including the finetuned baseline. However, the improvement on the test set is less clear: on BIRD-bench, GPQA, and New York Times Topics, the prompt optimized by OptunaTrainer scores below the baseline. This suggests that OptunaTrainer is prone to overfitting.
Trainset Scores
| Benchmarks \ Methods | Baseline | Finetuned baseline | OptunaTrainer |
|---|---|---|---|
| BIRD-bench (SQL) | 0.291 | 0.449 (▲) | 0.490 (▲) |
| BoolQ (QA) | 0.906 | 1.000 (▲) | 1.000 (▲) |
| GPQA (Reasoning) | 0.186 | 0.184 (▼) | 0.280 (▲) |
| MATH (Reasoning) | 0.626 | 0.566 (▼) | 0.760 (▲) |
| New York Times Topics (Classification) | 0.836 | 0.914 (▲) | 0.960 (▲) |
Testset Scores
| Benchmarks \ Methods | Baseline | Finetuned baseline | OptunaTrainer |
|---|---|---|---|
| BIRD-bench (SQL) | 0.307 | 0.473 (▲) | 0.294 (▼) |
| BoolQ (QA) | 0.850 | 0.892 (▲) | 0.860 (▲) |
| GPQA (Reasoning) | 0.146 | 0.080 (▼) | 0.120 (▼) |
| MATH (Reasoning) | 0.610 | 0.426 (▼) | 0.720 (▲) |
| New York Times Topics (Classification) | 0.794 | 0.818 (▲) | 0.710 (▼) |