Does the model need to be fine-tuned for SkillOpt to work?

No. Model weights remain completely frozen — SkillOpt modifies only the textual skill files the agent receives as instructions.

How large are the optimized skills?

The median length of optimized skill files is approximately 920 tokens, with only 1 to 4 accepted textual edits per optimization run.

Do learned skills transfer across different models?

Yes. Optimized skills have proven transferable across models of different sizes and different execution environments, with one documented cross-harness transfer yielding an improvement of 59.7 percentage points.

SkillOpt: Skill Files as Trainable Parameters

Microsoft Research has published SkillOpt — a system that optimizes agent skill files through an iterative forward-backward cycle without touching model weights. Across 52 evaluation cells it achieved best or tied results, and GPT-5.5 with optimized skills jumped from 58.8 to 82.3 percent average accuracy.

SkillOpt addresses a problem that has largely been ignored: how to systematically improve the behavior of an AI agent without touching model weights. Instead of fine-tuning, it treats instruction and skill files as trainable parameters and applies an optimization cycle exclusively to the text the agent receives as instructions.

Forward-Backward-Update: What One Cycle Looks Like

The procedure runs in three repeating steps:

Forward pass — the frozen target model executes tasks using the current version of the skill file. Nothing in the model changes; the only thing recorded is the trajectory — the sequence of actions and intermediate results.

Backward pass — a separate optimizer model analyzes the trajectories and identifies patterns: what worked, what didn’t, where the agent went off track. Based on that analysis, it proposes limited textual edits: adding a sentence, deleting an instruction, replacing a formulation.

Update step — proposed edits pass through a validation gate. Only those that improve results on held-out validation data are accepted. Rejected edits enter a feedback loop for the next optimizer call, and at the epoch level, slower meta-updates consolidate long-term lessons.

The mechanism that prevents prompt drift — the situation in which a skill file degenerates into nonsense through accumulated edits — is best-version selection: every edit must be better than the current version, not merely different.
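The three steps and the best-version rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SkillOpt's actual implementation: the `Edit` class, `apply_edit`, `optimize`, and the toy agent, scorer, and proposer are all hypothetical stand-ins, and the epoch-level meta-updates are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edit:
    """One limited textual edit to a skill file (hypothetical representation)."""
    op: str                       # "add", "delete", or "replace"
    text: str = ""                # new line for "add" / "replace"
    target: Optional[str] = None  # existing line for "delete" / "replace"


def apply_edit(skill: str, edit: Edit) -> str:
    lines = skill.splitlines()
    if edit.op == "add":
        lines.append(edit.text)
    elif edit.op == "delete":
        lines = [ln for ln in lines if ln != edit.target]
    elif edit.op == "replace":
        lines = [edit.text if ln == edit.target else ln for ln in lines]
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return "\n".join(lines)


def optimize(skill, run_agent, propose_edits, score, train_tasks, val_tasks, steps=3):
    best_skill, best_score = skill, score(skill, val_tasks)
    rejected = []  # fed back to the optimizer on its next call
    for _ in range(steps):
        # Forward pass: the frozen model runs tasks; only trajectories are recorded.
        trajectories = [run_agent(best_skill, t) for t in train_tasks]
        # Backward pass: a separate optimizer proposes limited textual edits.
        proposals = propose_edits(best_skill, trajectories, rejected)
        # Update step: accept an edit only if it beats the current best
        # on held-out validation tasks (best-version selection).
        for edit in proposals:
            candidate = apply_edit(best_skill, edit)
            candidate_score = score(candidate, val_tasks)
            if candidate_score > best_score:
                best_skill, best_score = candidate, candidate_score
            else:
                rejected.append(edit)
    return best_skill, best_score, rejected


# --- Toy stand-ins so the sketch runs without any model (all hypothetical) ---
def toy_run_agent(skill, task):
    topic = task.split(":")[0]
    return {"task": task, "success": topic in skill}


def toy_score(skill, tasks):
    return sum(toy_run_agent(skill, t)["success"] for t in tasks) / len(tasks)


def toy_propose(skill, trajectories, rejected):
    failed = [tr["task"].split(":")[0] for tr in trajectories if not tr["success"]]
    proposals = [Edit("add", text=f"Remember to handle {t}.") for t in failed[:1]]
    proposals.append(Edit("add", text="Be helpful."))  # no measurable effect
    return [e for e in proposals if e not in rejected]


final_skill, final_score, rejected_edits = optimize(
    "Answer the question.",
    toy_run_agent, toy_propose, toy_score,
    train_tasks=["dates:t1", "units:t2"],
    val_tasks=["dates:v1", "units:v2"],
)
print(final_skill)
print(final_score, rejected_edits)
```

In this toy run the two topic-specific edits pass the gate, while the generic "Be helpful." line is rejected because it does not raise the validation score. The strict `>` comparison is what keeps edits that are merely different out of the skill file.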

52 Evaluation Cells: Consistency as the Key Finding

Researchers tested SkillOpt on 6 benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), 7 models, and 3 execution modes, for a total of 52 evaluation cells. That is fewer than the 126 combinations a full 6 × 7 × 3 grid would contain, so not every combination was evaluated. Across all 52, SkillOpt achieved best or tied results compared to relevant baselines.

The largest documented gains, all measured on GPT-5.5:

Benchmark	Before	After	Gain
Six-benchmark average	58.8%	82.3%	+23.5 pp
SpreadsheetBench	41.8%	80.7%	+39.0 pp
OfficeQA	33.1%	72.1%	+39.0 pp
LiveMathematicianBench	37.6%	66.9%	+29.3 pp
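The gains in the table are percentage-point differences (after minus before), not relative improvements. A short Python check, using only the before and after figures from the table, makes the distinction concrete; the function names `gain_pp` and `relative_gain_pct` are illustrative.

```python
# (before, after) accuracy in percent, copied from the table above
results = {
    "Six-benchmark average": (58.8, 82.3),
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "LiveMathematicianBench": (37.6, 66.9),
}


def gain_pp(before, after):
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def relative_gain_pct(before, after):
    """Gain relative to the starting accuracy, in percent."""
    return round(100 * (after - before) / before, 1)


for name, (before, after) in results.items():
    print(f"{name}: +{gain_pp(before, after)} pp, "
          f"+{relative_gain_pct(before, after)}% relative")
```

Recomputed from the rounded table values, the SpreadsheetBench gain comes out at 38.9 pp rather than the listed 39.0 pp; the 0.1 difference is presumably due to rounding in the published figures.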

Particularly noteworthy is the OfficeQA result: an improvement of 39 percentage points achieved with a single edit to the skill file. This suggests that current instructions contain high-leverage errors, formulations that systematically misdirect the agent.

Compactness and Transferability

Final skill files contain a median of approximately 920 tokens with 1 to 4 accepted edits per case. The compactness is not accidental — the validation gate naturally filters out redundant edits that do not produce measurable improvement.

Transferability is documented at multiple levels. Optimization for one harness (e.g., Codex) yielded +24.8 pp, and the same skills on the Claude Code harness delivered +19.1 pp without re-optimization. One cross-harness transfer recorded +59.7 pp — meaning an agent with skills optimized for one platform outperformed its own baseline on an entirely different platform.

Why Is This Different From Prompt Engineering?

Manual prompt engineering is iterative, but not systematic. Engineers change instructions based on intuition, without quantitative feedback per edit and without a mechanism that prevents regression. SkillOpt formalizes that process: every change is measured, every step is auditable, and the final artifact — the optimized skill file — can be versioned, shared, and applied to any compatible model.
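One way to picture "every change is measured, every step is auditable" is a log with one record per proposed edit. The sketch below is an assumption about how such a record could look, not a format SkillOpt is documented to use; `skill_version`, `audit_record`, and all field names are hypothetical.

```python
import hashlib
import json


def skill_version(skill_text: str) -> str:
    """Short content hash, so each skill file version has a stable identifier."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()[:12]


def audit_record(parent_text, candidate_text, edit_description,
                 score_before, score_after):
    """One log entry per proposed edit, accepted or not."""
    return {
        "parent": skill_version(parent_text),
        "candidate": skill_version(candidate_text),
        "edit": edit_description,
        "score_before": score_before,
        "score_after": score_after,
        "accepted": score_after > score_before,
    }


# Illustrative values only, not results from the paper.
record = audit_record(
    parent_text="Answer the question.",
    candidate_text="Answer the question.\nCite the source cell.",
    edit_description='add: "Cite the source cell."',
    score_before=0.50,
    score_after=0.75,
)
print(json.dumps(record))  # one JSON object per line can be appended to a log file
```

Because each record names both the parent and the candidate version, any accepted skill file can be traced back through the edits that produced it, which is what manual prompt tweaking usually lacks.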

For organizations that already have agent infrastructure in place, the implication is clear: the model does not need to be better for the agent to be better. It is sufficient to systematically optimize the text the model receives.
