Synced
Frontend SSOT
lib/reranker-bench.ts
Single data ledger for on-screen scores, caveats, candidate pools, and launch lanes.
AXyBench · RAG Reranker
Separates text, file, entity, and multimodal routes in one table and picks the rank policy that fits each route.
benchmarks
10
Public benchmarks, internal product gates, and oracle pools are shown separately.
public ready
8
Public evaluation axes, with Ready and Sample labeled separately.
rank routes
5
Text, file, entity, multimodal, and heavy rerank are kept separate.
next work
4
One blocked model lane is held separately.
Scores and caveats are pinned to run artifacts and documents.
Synced
lib/reranker-bench.ts
Single data ledger for on-screen scores, caveats, candidate pools, and launch lanes.
Synced
260529_reranker_domain_finetune_poc.md
Synced with the pre-training evaluation criteria, held-out grades, and base-selection conclusion.
Preserved
server A data/
baseline_summary_sds_sample_20260602.md · 99 rows · miss 0.
Synced
qfilter_full50_smoke_compare_20260603.md
BGE/DK full FT 50-step results and the SDS prod regression are reflected on screen.
Synced
prod_harder_bge_full50_lr5e6_public_core_gate_20260603.md
official core coverage PASS, promotion FAIL. SDS regression -0.0042.
Synced
prod_harder_lr2e6_s50_route_release_gate_20260604.md
first high-ceiling checkpoint: public core GREEN, SDS +0.0002, Route release PASS. Preserved as ceiling evidence, not current selected stable candidate.
Synced
prod_harder_lr2e6_s50_repeatability_20260604.md
8 train/dev split seeds on server A 4x3090: GREEN 4/8, original+repeat GREEN 5/9. High ceiling confirmed; recipe-level marketing claim blocked.
Synced
internal_validation_selector.py
Fixed 500-row split created; selector rejects official gate shapes. Real retrain selected s02 before public core.
Synced
prod_harder_bge_full50_lr2e6_s50_fixedval_s02_public_core_gate_20260604.md
AutoRAG +0.0065, Allganize +0.0018, MIRACL flat, SDS -0.0022. Average +0.0015 but promotion FAIL.
Synced
product_file_fusion_gate.py
8 scenarios · Hit@1 1.000 · Hit@3 1.000 · AI Box content violations 0.
Synced
export_product_file_fusion_live_golden.py
Pre-repair .720/.800 → repair1 100 scenarios: Hit@1 1.000 · Hit@3 1.000 · content violations 0 · review_required.
Synced
export_product_file_fusion_trace_golden.py
168h/500 trace events/500 files → 100 scenarios. Tool mix codesign 49 · cmd_page 34 · files 12. Gate .200/.350, content violations 0.
Synced
generated_artifact_metadata.py
future generated Files rows now keep artifact_title/source_prompt/source_trace_summary/rank_hints. Historical trace rows need backfill or fresh smoke.
Synced
build_generated_artifact_metadata_backfill.py
100 trace scenarios → 88 proposal rows. Patched dry-run Hit@1 .920 · Hit@3 .920 · MRR .925 · content violations 0; still FAIL on Files lookup cases.
Synced
product_file_fusion_residual_review.py
8 residual scenarios, all files tool. Expected ranks 9/11x3/24/40/55/64; likely label cleanup before scorer changes.
Synced
product_file_fusion_residual_decision_suggestions.py
machine suggestion ledger: replace 4, needs-human-review 4; stress fixtures 4, EN/KO alias gaps 6, competitor ambiguity 1.
Synced
product_file_fusion_gate.py
patched trace gate reached Hit@1 .970 · Hit@3 .990 · MRR .980 · content violations 0 before final residual drop.
Synced
build_product_file_fusion_review_decisions_from_gate.py
delegated ledger ready_for_apply true, reviewed golden 99 scenarios, contract pass, Hit@1 .980 · Hit@3 1.000 · MRR .990.
Synced
reranker_route_release_gate.py
single checkpoint clears text + Product/File reviewed checks. Next blockers are internal seed selection, repeatability, and latency before product swap.
Synced
product_file_fusion_review_pack.py
100 scenarios · decisions template created · not human-reviewed golden yet.
Synced
product_file_fusion_review_batches.py
100 scenarios → five 20-row batches, index JSON/Markdown generated.
Synced
merge_product_file_fusion_review_decisions.py
batch reviewed files → one ledger with duplicate guard. Real release still waits for reviewer-filled .reviewed.jsonl files.
Synced
audit_product_file_fusion_review_decisions.py
100-row template audit blocks unreviewed=100, ready_for_apply=false.
Synced
apply_product_file_fusion_review_decisions.py
unreviewed template fails as intended. approve-all smoke passes only as tooling check.
Synced
product_file_fusion_reviewed_gate.py
Seed fixture rejected; reviewed smoke accepted. Final gate waits for real reviewer ledger.
Next
top50/top100 capture
Separately from the BM25 proxy, a ledger of the actual BGE-M3 dense+sparse candidate pool is needed.
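The fixedval s02 entry above reports a positive average delta that still ends in promotion FAIL. A minimal sketch of how such a gate could combine an average check with a no-regression check follows. The function name, the `protected` parameter, and the zero tolerance are assumptions for illustration, not the rule implemented in the gate scripts listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    average_delta: float
    regressions: dict
    promoted: bool


def promotion_gate(deltas, protected=("SDS",), tolerance=0.0):
    """Hypothetical gate: average must rise AND protected sets must not regress."""
    average = sum(deltas.values()) / len(deltas)
    # A protected benchmark regresses when its delta falls below -tolerance.
    regressions = {
        name: delta
        for name, delta in deltas.items()
        if name in protected and delta < -tolerance
    }
    promoted = average > 0 and not regressions
    return GateResult(average, regressions, promoted)


# Deltas quoted in the fixedval s02 entry above ("MIRACL flat" taken as 0.0).
deltas = {"AutoRAG": 0.0065, "Allganize": 0.0018, "MIRACL": 0.0, "SDS": -0.0022}
result = promotion_gate(deltas)
print(f"average {result.average_delta:+.4f}, promoted={result.promoted}")
```

Under these assumptions the average stays positive while the SDS regression alone blocks promotion, which matches the shape of the reported outcome.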
Instead of a single champion reranker, a different rank policy is applied depending on the nature of the candidate pool and the cost budget.
text-semantic
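Body chunks, OCR text, and general Korean document queries
Owns public text benchmark claims and body-text RAG quality. The qfilter/prod-harder experiments are interpreted only within this lane.
product-file-fusion
Queries where Files search, attachments, and folder/path/filename/date/permission/AI Box status matter
Decides whether real CMD usage succeeds or fails. Content-excluded files stay on metadata-only rank rather than body-text rerank.
entity-graph
Queries that follow companies, people, contracts, accounts, relationships, and history
Not mixed into the general text nDCG average. Relational questions are judged by the success rate of the graph route itself.
visual-page-maxsim
Scanned PDFs, tables, figures, slides, and image-centric page retrieval
Not averaged with the cross-encoder text reranker. Its own maxsim already acts as the reranking step.
heavy-final-rerank
High-value questions where cost is acceptable; the final judgment after the top candidates are compressed
Not an always-on model; used as the upper bound for premium/high-risk routes and as a distillation source.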
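The five routes above can be read as a small lookup table. The sketch below is illustrative only: `RankPolicy`, `POLICIES`, `choose_scorer`, and the scorer labels are hypothetical names, and leaving heavy-final-rerank out of the text nDCG average is an assumption. The one rule taken directly from the route notes is that content-excluded files stay on metadata-only rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankPolicy:
    route: str
    scorer: str                 # illustrative label, not a real component name
    in_text_ndcg_average: bool  # may its scores be pooled with text nDCG?


# One policy per route. Scorer labels are placeholders for illustration;
# excluding heavy-final-rerank from the text average is an assumption.
POLICIES = {
    p.route: p
    for p in (
        RankPolicy("text-semantic", "cross-encoder", True),
        RankPolicy("product-file-fusion", "metadata+content fusion", False),
        RankPolicy("entity-graph", "graph route", False),
        RankPolicy("visual-page-maxsim", "maxsim", False),
        RankPolicy("heavy-final-rerank", "premium reranker", False),
    )
}


def choose_scorer(route: str, content_excluded: bool = False) -> str:
    """Return the scorer label for a route, honoring content exclusion."""
    policy = POLICIES[route]  # KeyError on an unknown route is intentional
    if route == "product-file-fusion" and content_excluded:
        # Content-excluded files are never sent to a body-text reranker.
        return "metadata-only"
    return policy.scorer


print(choose_scorer("product-file-fusion", content_excluded=True))
print(sorted(r for r, p in POLICIES.items() if p.in_text_ndcg_average))
```

Keeping the exclusion rule inside the lookup, rather than in each caller, makes it harder for a new route to bypass it by accident.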
As of the 2026-06-03 server A baseline + qfilter full50 · nDCG@10 is shown ×100, hit rate in % · the top score in each row is in bold.

| Benchmark | No rerank (No) | dragonkue (DK) | DK FT50 (DK50) | BGE clean (BGE) | BGE FT50 (BGE50) | BGE hard50 (Hard50) | Qwen3 0.6B (Q0.6) | Qwen3 4B (Q4) | GTE multi (GTE) | Pool |
|---|---|---|---|---|---|---|---|---|---|---|
| AutoRAGRetrieval [Grade A · Ready · public full · BM25 proxy]: finance, public sector, medical, legal, commerce · nDCG@10 · 114Q / 720 docs | 81.7 | **91.6** | **91.6** | 90.8 | 91.0 | 91.5 | 82.1 | — | — | BM25 top50 · MIT |
| → AutoRAG rises even with fixed-validation selected s02. The release decision is on hold until the SDS no-regression check passes. | | | | | | | | | | |
| Allganize rag-ko [Grade A · Ready · public full · BM25 proxy]: Korean RAG queries · nDCG@10 · includes HF context | 78.8 | 88.3 | 88.3 | 88.5 | **88.9** | 88.7 | — | — | — | BM25 top50 · check per dataset |
| → BGE qfilter gives a small gain. Further training of dragonkue brings no gain on this axis. | | | | | | | | | | |
| PublicHealthQA-ko [Grade A · Ready · public full · license caution]: health and public-sector queries · nDCG@10 · 77Q / 77 docs | 45.4 | 71.7 | 71.6 | 72.4 | **73.6** | — | — | — | — | BM25 top50 · CC-BY-NC caution |
| → BGE qfilter shows a sizable gain, but because of the license, marketing citation is limited to a supporting role. | | | | | | | | | | |
| MIRACL-ko dev [Grade B · Ready · public sample · positive-preserving]: Wikipedia retrieval · nDCG@10 · 213Q / 50k sampled docs | 46.1 | 66.6 | — | 66.8 | **66.9** | 66.8 | — | — | — | BM25 top50 + positive-preserving 50k sample · Apache-2.0 |
| → BGE qfilter gives a marginal gain. Included in official core coverage, but standalone claims are prohibited. | | | | | | | | | | |
| Ko-StrategyQA [Grade A · Ready · public full · retriever miss]: commonsense and reasoning retrieval · nDCG@10 · 592Q / 9,251 docs · retriever miss 30.9% | 38.4 | 55.1 | — | **56.7** | — | — | — | — | — | BM25 top50 · check per dataset |
| → Reranker improvement and retriever-miss improvement must be assessed separately. | | | | | | | | | | |
| XPQA-ko [Grade B · Oracle · public oracle · retriever miss]: multilingual QA retrieval · nDCG@10 · 654Q / 889 docs · retriever miss 71.6% | 15.4 | 63.9 | — | **65.8** | — | — | — | — | — | BM25 top50 oracle · check per dataset |
| → The plain BM25 pool miss is large, so reranker evaluation is distorted unless an oracle pool is run alongside. | | | | | | | | | | |
| SDS-KoPub sample [Grade B · Sample · public sample · BM25 proxy · positive-preserving]: public documents across 6 fields · nDCG@10 · 99Q / 5,098 pages | 53.0 | 98.0 | — | **98.4** | — | — | — | — | — | BM25 top50 · positive-preserving 100Q/5k sample · CC-BY-SA 4.0 / KOGL Type1 · server A data/baseline_summary_sds_sample_20260602.md |
| → Candidate for the AXyBench reproducibility pack. Shown publicly with sample and BM25 proxy labels. | | | | | | | | | | |
| SDS-KoPub-VDR [Grade B · Ready · public backend full · hybrid retriever]: public documents across 6 fields · answer hit rate · 600Q / 40,781 pages | 63.3% | 70.0% | — | 71.4% | — | — | — | **81.7%** | — | backend full 600 · BGE-M3 hybrid top50 · CC-BY-SA 4.0 / KOGL Type1 · backend full 600 baseline · Qwen3 4B teacher probe |
| → The backend full baseline is kept. The public full pack will be added separately from the sample after runtime measurement. | | | | | | | | | | |
| SDS-KoPub prod top50 [Grade B · Ready · public prod-style · hybrid retriever · full50 smoke]: public documents across 6 fields · nDCG@10 · 600Q / 40,781 pages | 62.7 | 69.9 | 68.3 | **71.5** | 70.3 | 71.3 | — | — | — | isolated qdrant-0 full 600 · BGE-M3 hybrid top50 · CC-BY-SA 4.0 / KOGL Type1 · server A data/qfilter_full50_smoke_compare_20260603.md |
| → There was a first GREEN high point, but fixed-validation selected s02 lowers SDS and is not a release candidate. Next step: SDS no-regression repair. | | | | | | | | | | |
| AI Hub held-out 569 [Grade Internal · Ready · Internal · internal held-out · no redistribution]: internal Korean QA · answer hit rate · 569 held-out | 81.9% | **91.8%** | — | 90.5% | — | — | — | — | — | backend held-out · BGE-M3 hybrid top50 · internal only |
| → Valid for operational smoke tests but not used for external claims. | | | | | | | | | | |
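Most rows above report nDCG@10, and several carry a retriever-miss caveat. The sketch below shows one common way to compute nDCG@k and why a positive missing from the candidate pool caps the score regardless of the reranker. It assumes binary relevance, linear gain, and a log2 discount; the benchmark packs may use a different gain function, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k, normalized by the ideal ordering of ALL judged documents.

    Using all judged documents (not only the retrieved pool) means a
    positive that the retriever missed lowers the score.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


judged = [1]              # one relevant document exists for this query
before = [0, 0, 1, 0, 0]  # positive sits at rank 3 in the candidate pool
after = [1, 0, 0, 0, 0]   # a reranker moves it to rank 1
missed = [0, 0, 0, 0, 0]  # retriever miss: positive never entered the pool

for name, ranking in [("before", before), ("after", after), ("missed", missed)]:
    print(name, round(100 * ndcg_at_k(ranking, judged), 1))
```

The three toy rankings score 50.0, 100.0, and 0.0 on the ×100 scale. The last case is the retriever-miss situation: no reranker can recover it, which is why the oracle pool is tracked separately.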
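active candidates are compared directly; teacher models serve as the upper bound and labeling source; watch candidates are promoted after being measured on the same pack.
Primary main base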
BAAI/bge-reranker-v2-m3
full FT smoke
runs/20260603_qfilter_bge_v2_m3_full50
fixed-validation selected candidate
runs/20260604_prod_harder_bge_v2_m3_lr2e6_s50_fixedval_s02
Production baseline · delta lane
dragonkue/bge-reranker-v2-m3-ko
delta smoke
runs/20260603_qfilter_dragonkue_full50
High-ceiling probe
Qwen/Qwen3-Reranker-0.6B
teacher / upper-bound
Qwen/Qwen3-Reranker-4B
Efficiency lane
Alibaba-NLP/gte-multilingual-reranker-base
Korean watch baseline
upskyy/ko-reranker-8k
Korean watch baseline
Dongjin-kr/ko-reranker
Commercial/general-purpose watch baseline
jina-reranker-v2
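The AXyBench reranker bench manages the held-out sets for external claims and the prod gate on the same screen.
eval_manifest + blacklist
Dedup criteria between the public/internal evaluation sets and the training pool are fixed.
text-only parquet loader
100Q/5k sample generated, validated, and baselined. Reflected in the on-screen SSOT.
600Q / 40,781 corpus
The full public BM25 pack (not the sample) will be added after runtime measurement.
BAAI/bge-reranker-v2-m3
qfilter full50 complete. Some public benchmarks rise; SDS prod-style drops.
dragonkue current baseline
qfilter full50 complete. The delta gain is small and SDS prod-style drops.
BGE clean rank 8/16
For data-ratio experiments. The encoder-only path needs a separate PEFT wrapper.
SDS/prod hard negatives
rank 1-50 prod-harder negatives + LR2e-6 s50 reached a GREEN high point. fixedval selected s02 is at SDS -0.0022, so the mix needs repair.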
internal_validation_selector.py
500-row fixed internal split + real retrain selector PASS. s02 selected before official benchmark.
internal_select_train + Trainer seed
s01-s04 done on server A 4x3090. Selector picked s02; official public core is AMBER due to SDS regression.
data mix + loss recipe
Re-run experiments with small combinations of SDS/prod hard negatives, easy/random negatives, public/AIHub positives, and KD labels.
product_file_fusion_gate.py
8-scenario seed PASS. AI Box content-exclusion violations: 0.
export_product_file_fusion_live_golden.py
Converts real files rows into a review-required candidate golden. After repair1, 100-scenario smoke PASS; review pack ready.
export_product_file_fusion_trace_golden.py
Successful-trace 100-scenario smoke: Hit@1 .200 · Hit@3 .350 · MRR .344 · content violations 0. Generated-artifact intent metadata needs repair.
generated_artifact_metadata.py + save paths
codesign canonical/export child and cmd_word/cmd_page exports store source_prompt/source_trace_summary/rank_hints in source_metadata.
build_generated_artifact_metadata_backfill.py
Generates review-required source_metadata_patch JSONL from the trace golden. Dry-run patched gate .920/.920, no DB writes.
product_file_fusion_residual_review.py
Only the 8 trace_files_lookup cases outside the patched gate's Top3 are split out into a focused review pack. replace/drop first.
product_file_fusion_residual_decision_suggestions.py
8 residual scenarios → replace suggestion 4 · needs-human-review 4. machine_suggested only, not held-out labels.
product_file_fusion_gate.py
proposal/rfp/deck/ir alias overlap. Patched trace 100: Hit@1 .970 · Hit@3 .990 · MRR .980 · 1 residual.
reviewed_gate + delegated review ledger
99 approve · 1 explicit drop. Reviewed-only gate PASS: Hit@1 .980 · Hit@3 1.000 · MRR .990 · content violations 0.
reranker_route_release_gate.py
first GREEN checkpoint route release PASS. fixedval selected s02 is public core AMBER, so no release route yet.
product_file_fusion_review_pack.py
Generates Markdown + decisions JSONL template. Supports single-batch generation.
product_file_fusion_review_batches.py
Generates 20-row batches and an index for the entire candidate golden in one pass.
merge_product_file_fusion_review_decisions.py
Merges per-batch .reviewed.jsonl files into a single ledger. Blocks duplicate scenario ids and audit failures.
audit_product_file_fusion_review_decisions.py
Blocks missing/unreviewed/invalid/replace id errors before apply.
apply_product_file_fusion_review_decisions.py
Blocks unreviewed/missing decisions. Generates the held-out golden after the reviewer ledger is applied.
product_file_fusion_reviewed_gate.py
Rejects seed/live packs. Product/File held-out metrics run only when human-reviewed metadata is present.
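The review tooling above splits the candidate golden into 20-row batches and later merges the reviewed batches behind a duplicate guard. A minimal sketch of that flow follows. The row fields `scenario_id` and `decision` and both function names are hypothetical; the decision values approve/replace/drop come from the ledger entries above, and the real scripts may validate more than this.

```python
ALLOWED_DECISIONS = {"approve", "replace", "drop"}


def split_batches(scenario_ids, batch_size=20):
    """Split scenario ids into consecutive fixed-size review batches."""
    return [
        scenario_ids[i:i + batch_size]
        for i in range(0, len(scenario_ids), batch_size)
    ]


def merge_reviewed(batches):
    """Merge reviewed batches into one ledger keyed by scenario id.

    Raises ValueError on a duplicate scenario id or on a row whose
    decision is missing or not one of the allowed values.
    """
    ledger = {}
    for batch in batches:
        for row in batch:
            sid = row["scenario_id"]
            if sid in ledger:
                raise ValueError(f"duplicate scenario id: {sid}")
            if row.get("decision") not in ALLOWED_DECISIONS:
                raise ValueError(f"unreviewed or invalid decision for {sid}")
            ledger[sid] = row
    return ledger


ids = [f"s{n:03d}" for n in range(100)]
batches = split_batches(ids)
# Approve-all rows stand in for reviewer output; a tooling smoke only.
reviewed = [
    [{"scenario_id": sid, "decision": "approve"} for sid in batch]
    for batch in batches
]
ledger = merge_reviewed(reviewed)
print(len(batches), len(ledger))
```

Failing loudly on the first duplicate or unreviewed row keeps a partially reviewed ledger from being applied by mistake.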
BGE clean 25-step / LR2e-6
A conservative update experiment to bring SDS back to raw BGE level or above.
runtime/backend
The efficiency lane baseline cannot be finalized until the index error is resolved.
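The Product/File gate figures in the entries above are reported as Hit@1, Hit@3, and MRR. As a reference for reading them, the sketch below computes all three from the 1-based rank of the expected file in each scenario. It assumes one expected item per scenario and uses `None` for an item that was not ranked at all; the gate scripts may define these metrics differently.

```python
def hit_at_k(ranks, k):
    """Fraction of scenarios whose expected item is ranked within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; an unranked item (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Toy data: 1-based rank of the expected file in five scenarios.
ranks = [1, 1, 2, 4, None]

print("Hit@1", hit_at_k(ranks, 1))
print("Hit@3", hit_at_k(ranks, 3))
print("MRR", mrr(ranks))
```

On this toy input Hit@1 is 0.4, Hit@3 is 0.6, and MRR is 0.55. MRR rewards moving an item from rank 4 to rank 2 even when Hit@1 does not change, which is why the gates report both.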
Source: CommanderOS reranker pack · eval_manifest, fixed candidates, held-out blacklist, server A baseline and qfilter full50 summaries.
Next: compare SDS-style hard negative mixes and a LoRA/PEFT adapter sweep on the same held-out pack.