AXyBench · CommanderOS

CMD Evolution

Tracks the improvement trend across scored rounds in which CommanderOS CMD1 is run repeatedly against the same AXyBench question pool.

Latest

CMD-130

Score

88.8

OK Rate

100%

Latency

30.9s

Progression

CMD-121 → CMD-130

judged score · measured ok rate

Round Log

campaign: openbeta-chat-final · status: draft · scored rounds: 7

Round	Status	Score	OK	Needs Judge	Latency	Cost	Profile	Change
CMD-121	Scored	87.0	100%	100%	22.8s	$0.087	cmd_1_0_default	B4 full-category probe after r120 complete-artifact sufficiency fixes; all 5 cells reached judgeable final output.
CMD-123	Scored	88.0	100%	100%	57.5s	$0.019	cmd_1_0_default	Generic SaaS marketing deterministic repair and DataLab keyword normalization; q1-only validation.
CMD-124	Scored	86.0	100%	100%	24.2s	$0.083	cmd_1_0_default	B4 full-category rerun after generic SaaS repair push
CMD-125	Scored	88.0	100%	100%	25.1s	$0.085	cmd_1_0_default	Exact channel-copy constraint and ad source-basis harness fixes
CMD-126	Scored	88.5	100%	100%	21.8s	$0.084	cmd_1_0_default	Korean LinkedIn tone self-check leakage gate
CMD-127	Scored	86.6	100%	80%	25.1s	$0.084	cmd_1_0_default	LLM call ledger observability; no prompt-quality change
CMD-130	Scored	88.8	100%	100%	30.9s	$0.098	cmd_1_0_default	Generic channel-copy completeness gate for concrete sales numbers, channel labels, and truthful self-check counts
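
The progression across these rounds can be summarized with a few lines of arithmetic. The sketch below is illustrative only: `ROUNDS` and `summarize` are hypothetical names, not part of any AXyBench or CommanderOS tooling, and the values are copied by hand from the Round Log above (round, score, latency in seconds, cost in USD).

```python
from statistics import mean

# Hand-copied from the Round Log: (round, score, latency in s, cost in USD).
# This list and the function below are hypothetical helpers for illustration.
ROUNDS = [
    ("CMD-121", 87.0, 22.8, 0.087),
    ("CMD-123", 88.0, 57.5, 0.019),
    ("CMD-124", 86.0, 24.2, 0.083),
    ("CMD-125", 88.0, 25.1, 0.085),
    ("CMD-126", 88.5, 21.8, 0.084),
    ("CMD-127", 86.6, 25.1, 0.084),
    ("CMD-130", 88.8, 30.9, 0.098),
]


def summarize(rounds):
    """Return the first-to-last score delta, mean score, best round, and total cost."""
    scores = [score for _, score, _, _ in rounds]
    best = max(rounds, key=lambda r: r[1])  # round with the highest judged score
    return {
        "delta": round(scores[-1] - scores[0], 1),
        "mean_score": round(mean(scores), 2),
        "best_round": best[0],
        "total_cost": round(sum(cost for *_, cost in rounds), 3),
    }


if __name__ == "__main__":
    print(summarize(ROUNDS))
```

On the seven rounds listed, the first-to-last change (CMD-121 to CMD-130) is 1.8 points. Note that CMD-123 was a q1-only validation, so its score, latency, and cost may not be directly comparable with the full-category rounds.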

AXyBench commander-cmd1 product endpoint. Scores are attached after judge/manual review.

updated 2026-06-04T10:34:02Z