AXyBench · CommanderOS
LLM BenchCMD Evolution
Records the score progression across judged rounds in which CommanderOS CMD1 is run repeatedly against the same AXyBench question pool.
Latest
CMD-130
Score
88.8
OK Rate
100%
Latency
30.9s
Progression
CMD-121 → CMD-130
Chart series: judged score, measured OK rate
Round Log
Changes and results by round
campaign: openbeta-chat-final · status: draft · scored rounds: 7
| Round | Status | Score | OK | Needs Judge | Latency | Cost | Profile | Change |
|---|---|---|---|---|---|---|---|---|
| CMD-121 | Scored | 87.0 | 100% | 100% | 22.8s | $0.087 | cmd_1_0_default | B4 full-category probe after r120 complete-artifact sufficiency fixes; all 5 cells reached judgeable final output. |
| CMD-123 | Scored | 88.0 | 100% | 100% | 57.5s | $0.019 | cmd_1_0_default | Generic SaaS marketing deterministic repair and DataLab keyword normalization; q1-only validation. |
| CMD-124 | Scored | 86.0 | 100% | 100% | 24.2s | $0.083 | cmd_1_0_default | B4 full-category rerun after generic SaaS repair push |
| CMD-125 | Scored | 88.0 | 100% | 100% | 25.1s | $0.085 | cmd_1_0_default | Exact channel-copy constraint and ad source-basis harness fixes |
| CMD-126 | Scored | 88.5 | 100% | 100% | 21.8s | $0.084 | cmd_1_0_default | Korean LinkedIn tone self-check leakage gate |
| CMD-127 | Scored | 86.6 | 100% | 80% | 25.1s | $0.084 | cmd_1_0_default | LLM call ledger observability; no prompt-quality change |
| CMD-130 | Scored | 88.8 | 100% | 100% | 30.9s | $0.098 | cmd_1_0_default | Generic channel-copy completeness gate for concrete sales numbers, channel labels, and truthful self-check counts |
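
The progression in the table can be summarized as round-over-round score changes. The snippet below is a minimal sketch of that arithmetic, using only the judged scores listed above; the `round_deltas` helper and the variable names are illustrative assumptions, not part of AXyBench or CommanderOS.

```python
# Judged scores copied from the Round Log table (scored rounds only).
scores = [
    ("CMD-121", 87.0),
    ("CMD-123", 88.0),
    ("CMD-124", 86.0),
    ("CMD-125", 88.0),
    ("CMD-126", 88.5),
    ("CMD-127", 86.6),
    ("CMD-130", 88.8),
]


def round_deltas(rows):
    """Return (round, score change vs. the previous scored round) pairs.

    Hypothetical helper for illustration; results are rounded to one
    decimal place to match the precision shown in the table.
    """
    return [
        (cur_name, round(cur - prev, 1))
        for (_, prev), (cur_name, cur) in zip(rows, rows[1:])
    ]


deltas = round_deltas(scores)
# Net change from the first scored round to the latest one.
net_change = round(scores[-1][1] - scores[0][1], 1)
# Round with the highest judged score.
best_round = max(scores, key=lambda row: row[1])[0]

for name, delta in deltas:
    print(f"{name}: {delta:+.1f}")
print(f"net: {net_change:+.1f}, best: {best_round}")
```

Because some rounds cover different scopes (CMD-123 is noted as q1-only validation), these deltas are best read as a rough trend rather than a like-for-like comparison.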
AXyBench commander-cmd1 product endpoint. Scores are attached after judge/manual review.
updated 2026-06-04T10:34:02Z