EvoPrompt

EvoPrompt is an algorithm described in the paper EvoPrompt and its accompanying GitHub repository EvoPrompt, implemented as of 2024.10.21.

Description

EvoPrompt is a prompt optimization method based on evolutionary algorithms. In each generation, the prompt evolves in one of three ways: paraphrasing, genetic algorithms, or differential evolution. For genetic algorithms and differential evolution, the improved portions of each prompt are extracted and used in the evolution process.

Differences Between Implementation and Paper

The original implementation in the paper is based on the Text Completion API, whereas Ape’s implementation is adapted for the Chat Completion API.

Unique Insights/Techniques from the Paper

The paper highlights that evolutionary algorithm-based paraphrasing is significantly more effective than simple random paraphrasing. This insight can be applied to other prompt optimization methods by introducing additional paraphrasing steps for each improvement.

Potential Limitations

Limited Suggestion Space

Since the next generation of prompts is generated only through the LLM’s paraphrasing, it doesn’t directly learn from the training dataset, leading to limited diversity. Due to this limitation, the average performance within each generation gradually improves with each generation, but the peak performance doesn’t show significant improvement.

Benchmark Performance

Summary

The frequency of performance improvements throughout the training process is notably low. For BoolQ, no performance improvements were observed, while BIRD-bench and MATH showed only one instance of improvement each.

Furthermore, the performance improvements observed in the training dataset do not correlate well with those in the test dataset.

View in GitHub

Trainset Scores

Benchmarks \ Methods	Baseline	finetuned baseline	EvoPrompt
BIRD-bench (SQL)	0.291	0.449 (▲)	0.368 (▲)
BoolQ (QA)	0.906	1.000 (▲)	0.900 (▼)
GPQA (Reasoning)	0.186	0.184 (▼)	0.190 (▲)
MATH (Reasoning)	0.626	0.566 (▼)	0.680 (▲)
New York Times Topics (Classification)	0.836	0.914 (▲)	0.840 (▲)

Testset Scores

Benchmarks \ Methods	Baseline	finetuned baseline	EvoPrompt
BIRD-bench (SQL)	0.307	0.473 (▲)	0.292 (▼)
BoolQ (QA)	0.850	0.892 (▲)	0.870 (▲)
GPQA (Reasoning)	0.146	0.080 (▼)	0.120 (▼)
MATH (Reasoning)	0.610	0.426 (▼)	0.670 (▲)
New York Times Topics (Classification)	0.794	0.818 (▲)	0.600 (▼)