
AXyBench · RAG Reranker

Korean reranker training now starts with a per-route view

Text, file, entity, and multimodal routes are separated in a single table, and a rank policy is chosen to fit each route.

benchmarks

10

Public benchmarks, internal product gates, and oracle pools are shown separately.

public ready

8

Public evaluation axes, with Ready and Sample labeled separately.

rank routes

5

Text, file, entity, multimodal, and heavy rerank are separated.

next work

4

One blocked model lane is held separately.

Evidence Ledger

Reranker bench ledger

Scores and caveats are pinned to artifacts and documents.

Synced

Frontend SSOT

lib/reranker-bench.ts

Single data ledger for on-screen scores, caveats, the candidate pool, and launch lanes.

Synced

CommanderOS SSOT

260529_reranker_domain_finetune_poc.md

Synced with the pre-training evaluation criteria, held-out grades, and the base-selection conclusion.

Preserved

SDS sample artifact

server A data/

baseline_summary_sds_sample_20260602.md · 99 rows · miss 0.

Synced

qfilter full50 smoke

qfilter_full50_smoke_compare_20260603.md

BGE/DK full FT 50-step results and the SDS prod regression are reflected on screen.

Synced

prod-harder LR5e-6 gate

prod_harder_bge_full50_lr5e6_public_core_gate_20260603.md

official core coverage PASS, promotion FAIL. SDS regression -0.0042.

Synced

prod-harder LR2e-6 s50 GREEN

prod_harder_lr2e6_s50_route_release_gate_20260604.md

first high-ceiling checkpoint: public core GREEN, SDS +0.0002, Route release PASS. Preserved as ceiling evidence, not current selected stable candidate.

Synced

prod-harder LR2e-6 s50 repeatability

prod_harder_lr2e6_s50_repeatability_20260604.md

8 train/dev split seeds on server A 4x3090: GREEN 4/8, original+repeat GREEN 5/9. High ceiling confirmed; recipe-level marketing claim blocked.

Synced

Internal validation selector

internal_validation_selector.py

Fixed 500-row split created; selector rejects official gate shapes. Real retrain selected s02 before public core.

Synced

fixedval s02 public core AMBER

prod_harder_bge_full50_lr2e6_s50_fixedval_s02_public_core_gate_20260604.md

AutoRAG +0.0065, Allganize +0.0018, MIRACL flat, SDS -0.0022. Average +0.0015 but promotion FAIL.
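A positive average can still fail promotion when a protected benchmark regresses. The sketch below illustrates that reasoning using the four deltas quoted above. The `Delta` type, the `no_regression` flag, and the GREEN/AMBER/RED rule are assumptions for illustration, not the logic of the actual gate script.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Delta:
    benchmark: str
    ndcg_delta: float            # candidate minus baseline, on a 0-1 scale
    no_regression: bool = False  # assumption: True for benchmarks that must not drop


def gate_verdict(deltas: List[Delta], tolerance: float = 0.0) -> Tuple[float, str, List[str]]:
    """Return (average delta, verdict, regressed benchmarks). Hypothetical rule."""
    average = sum(d.ndcg_delta for d in deltas) / len(deltas)
    regressed = [d.benchmark for d in deltas
                 if d.no_regression and d.ndcg_delta < -tolerance]
    if regressed:
        verdict = "AMBER" if average > 0 else "RED"
    else:
        verdict = "GREEN" if average > 0 else "AMBER"
    return average, verdict, regressed


# Deltas as reported for the fixedval s02 checkpoint ("MIRACL flat" taken as 0.0).
fixedval_s02 = [
    Delta("AutoRAG", 0.0065),
    Delta("Allganize", 0.0018),
    Delta("MIRACL", 0.0),
    Delta("SDS", -0.0022, no_regression=True),
]
average, verdict, regressed = gate_verdict(fixedval_s02)
print(f"average={average:+.4f} verdict={verdict} regressed={regressed}")
```

Under this assumed rule the average of the four deltas is about +0.0015, and the SDS drop alone is enough to keep the checkpoint out of GREEN.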

Synced

Product/File fusion seed

product_file_fusion_gate.py

8 scenarios · Hit@1 1.000 · Hit@3 1.000 · AI Box content violations 0.

Synced

Product/File live export

export_product_file_fusion_live_golden.py

Pre-repair .720/.800 → repair1 100 scenarios: Hit@1 1.000 · Hit@3 1.000 · content violations 0 · review_required.

Synced

Product/File trace export

export_product_file_fusion_trace_golden.py

168h/500 trace events/500 files → 100 scenarios. Tool mix codesign 49 · cmd_page 34 · files 12. Gate .200/.350, content violations 0.

Synced

Generated artifact metadata repair

generated_artifact_metadata.py

future generated Files rows now keep artifact_title/source_prompt/source_trace_summary/rank_hints. Historical trace rows need backfill or fresh smoke.

Synced

Generated artifact metadata backfill dry-run

build_generated_artifact_metadata_backfill.py

100 trace scenarios → 88 proposal rows. Patched dry-run Hit@1 .920 · Hit@3 .920 · MRR .925 · content violations 0; still FAIL on Files lookup cases.

Synced

Residual Files lookup review pack

product_file_fusion_residual_review.py

8 residual scenarios, all files tool. Expected ranks 9/11x3/24/40/55/64; likely label cleanup before scorer changes.

Synced

Residual decision suggestions

product_file_fusion_residual_decision_suggestions.py

machine suggestion ledger: replace 4, needs-human-review 4; stress fixtures 4, EN/KO alias gaps 6, competitor ambiguity 1.

Synced

Product/File alias repair

product_file_fusion_gate.py

patched trace gate reached Hit@1 .970 · Hit@3 .990 · MRR .980 · content violations 0 before final residual drop.

Synced

Product/File trace reviewed gate

build_product_file_fusion_review_decisions_from_gate.py

delegated ledger ready_for_apply true, reviewed golden 99 scenarios, contract pass, Hit@1 .980 · Hit@3 1.000 · MRR .990.
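For readers unfamiliar with the gate metrics, here is a minimal sketch of how Hit@k and MRR can be computed from the 1-based rank of the expected file in each scenario. The rank distribution used is hypothetical: it is one distribution over 99 scenarios that happens to be consistent with the reported figures, not the actual gate data.

```python
from typing import List, Optional


def hit_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Share of scenarios whose expected item is ranked at or above position k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Mean of 1/rank; a scenario with no hit (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical: 97 scenarios hit at rank 1 and 2 scenarios hit at rank 2.
ranks = [1] * 97 + [2] * 2

hit1 = hit_at_k(ranks, 1)
hit3 = hit_at_k(ranks, 3)
mrr = mean_reciprocal_rank(ranks)
print(f"Hit@1 {hit1:.3f} · Hit@3 {hit3:.3f} · MRR {mrr:.3f}")
```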

Synced

Route release gate

reranker_route_release_gate.py

single checkpoint clears text + Product/File reviewed checks. Next blockers are internal seed selection, repeatability, and latency before product swap.

Synced

Product/File review pack

product_file_fusion_review_pack.py

100 scenarios · decisions template created · not human-reviewed golden yet.

Synced

Product/File batch index

product_file_fusion_review_batches.py

100 scenarios → five 20-row batches, index JSON/Markdown generated.
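The batching step is simple enough to sketch. The function and field names below are hypothetical and only illustrate splitting 100 scenario ids into 20-row batches with a small index; the real script's output format is not shown in the ledger.

```python
from typing import Dict, List


def make_batches(scenario_ids: List[str], batch_size: int = 20) -> List[List[str]]:
    """Split ids into consecutive batches; the last batch may be shorter."""
    return [scenario_ids[i:i + batch_size]
            for i in range(0, len(scenario_ids), batch_size)]


def build_index(batches: List[List[str]]) -> List[Dict[str, object]]:
    """One index row per batch, suitable for dumping as JSON or Markdown."""
    return [
        {"batch": n, "rows": len(batch), "first": batch[0], "last": batch[-1]}
        for n, batch in enumerate(batches, start=1)
    ]


scenario_ids = [f"scenario-{n:03d}" for n in range(1, 101)]  # hypothetical ids
batches = make_batches(scenario_ids)
index = build_index(batches)
```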

Synced

Product/File decision merge

merge_product_file_fusion_review_decisions.py

batch reviewed files → one ledger with duplicate guard. Real release still waits for reviewer-filled .reviewed.jsonl files.
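A duplicate guard during merge might look like the following sketch. It assumes each reviewed JSONL row carries a `scenario_id` field; that field name and the error type are illustrative, not taken from the actual script.

```python
import json
from typing import Dict, Iterable, List


class DuplicateScenarioError(ValueError):
    """Raised when the same scenario id appears in more than one reviewed row."""


def merge_reviewed(batches: Iterable[Iterable[str]]) -> List[dict]:
    """Merge JSONL lines from several batches into one ledger, rejecting duplicates."""
    ledger: Dict[str, dict] = {}
    for batch_no, lines in enumerate(batches, start=1):
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines in JSONL files
            row = json.loads(line)
            scenario_id = row["scenario_id"]  # assumed field name
            if scenario_id in ledger:
                raise DuplicateScenarioError(
                    f"{scenario_id} appears again in batch {batch_no}")
            ledger[scenario_id] = row
    return list(ledger.values())


batch_1 = ['{"scenario_id": "s-001", "decision": "approve"}',
           '{"scenario_id": "s-002", "decision": "drop"}']
batch_2 = ['{"scenario_id": "s-003", "decision": "approve"}', ""]
merged = merge_reviewed([batch_1, batch_2])
```

Failing loudly on a duplicate, rather than letting the later row win, keeps a reviewer's decision from being silently overwritten.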

Synced

Product/File decision audit

audit_product_file_fusion_review_decisions.py

100-row template audit blocks unreviewed=100, ready_for_apply=false.

Synced

Product/File reviewed apply

apply_product_file_fusion_review_decisions.py

unreviewed template fails as intended. approve-all smoke passes only as tooling check.

Synced

Product/File reviewed gate

product_file_fusion_reviewed_gate.py

Seed fixture rejected; reviewed smoke accepted. Final gate waits for real reviewer ledger.

Next

Prod retriever lane

top50/top100 capture

Separately from the BM25 proxy, a ledger of real BGE-M3 dense+sparse candidate pools is needed.

Route Policy

Rank strategy per CMD RAG route

Instead of a single champion reranker, a different rank policy is applied according to the nature of the candidate pool and the cost budget.

text-semantic

Text semantic RAG

Active

Body chunks, OCR text, and general Korean document queries

Policy
Rerank the BGE-M3 dense+sparse candidate pool with a BGE-family cross-encoder (FT/KD)
Model
BAAI/bge-reranker-v2-m3 FT/KD · dragonkue is the production baseline
Gate
MTEB/MIRACL/AutoRAG/Allganize + SDS text lane no-regression
Marketing High
Product High

Owns public text benchmark claims and body-text RAG quality. The qfilter/prod-harder experiments are interpreted only within this lane.

product-file-fusion

Product/File fusion

Active

Queries where Files search, attachments, and folder/path/filename/date/permission/AI Box state matter

Policy
retriever score + metadata + folder path + file type + recency + permission + optional text rerank
Model
Fusion ranker + selective BGE text rerank
Gate
Seed/live PASS; trace reviewed gate PASS .980/1.000 after 99 approve + 1 drop
Marketing Medium
Product Critical

Decides whether real CMD usage succeeds. Content-excluded files are ranked metadata-only, not by body-text rerank.
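The fusion policy for this route can be pictured as a weighted sum over the listed signals, with the text rerank term skipped for content-excluded files. The sketch below is only an illustration: the weights, the field names, and the choice to treat permission as a hard filter are all assumptions, not the behavior of `product_file_fusion_gate.py`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical weights for illustration only.
WEIGHTS = {
    "retriever": 0.35,
    "metadata": 0.20,
    "path": 0.15,
    "file_type": 0.10,
    "recency": 0.10,
    "text_rerank": 0.10,
}
BASE_SIGNALS = ("retriever", "metadata", "path", "file_type", "recency")


@dataclass
class FileCandidate:
    file_id: str
    retriever: float   # all signals assumed normalized to 0-1
    metadata: float
    path: float
    file_type: float
    recency: float
    permitted: bool = True
    content_excluded: bool = False        # e.g. excluded from content use by AI Box state
    text_rerank: Optional[float] = None   # optional cross-encoder score


def fusion_score(c: FileCandidate) -> float:
    score = sum(WEIGHTS[name] * getattr(c, name) for name in BASE_SIGNALS)
    # Content-excluded files stay metadata-only: the body-text score is never used.
    if c.text_rerank is not None and not c.content_excluded:
        score += WEIGHTS["text_rerank"] * c.text_rerank
    return score


def rank_files(candidates: List[FileCandidate]) -> List[FileCandidate]:
    visible = [c for c in candidates if c.permitted]  # assumption: hard filter
    return sorted(visible, key=fusion_score, reverse=True)


candidates = [
    FileCandidate("deck.pptx", 0.9, 0.8, 0.7, 1.0, 0.6, text_rerank=0.9),
    FileCandidate("excluded.pdf", 0.9, 0.8, 0.7, 1.0, 0.6,
                  content_excluded=True, text_rerank=0.9),
    FileCandidate("private.docx", 1.0, 1.0, 1.0, 1.0, 1.0, permitted=False),
]
ranked = rank_files(candidates)
```

In this toy input the excluded file still ranks, but only on its metadata signals, and the unpermitted file never appears.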

entity-graph

Entity / Graph RAG

Next

Queries that follow companies, people, contracts, accounts, relationships, and history

Policy
Combine entity match, relation traversal, PPR/community score, and source chunk confidence
Model
GraphRAG traversal + entity-aware rank + optional final text rerank
Gate
entity relation golden, multi-hop path accuracy, hallucinated edge rate
Marketing Medium
Product High

Not mixed into the general text nDCG average. Relational questions are judged by the graph route's own success rate.

visual-page-maxsim

Visual / Page MaxSim

Next

Scanned PDFs, tables, figures, slides, and image-centric page retrieval

Policy
Rerank ColQwen/ColPali/ColBERT-style late-interaction MaxSim candidates within the route
Model
ColQwen/ColPali style multi-vector ranker
Gate
page-level visual QA, table/figure hit@k, OCR-miss recovery
Marketing High
Product High

Not averaged with the cross-encoder text reranker. Its own MaxSim scoring already acts as the reranking step.
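Late-interaction MaxSim scoring sums, over query token vectors, the best similarity against any document (or page patch) vector. The sketch below uses plain Python lists and toy 2-dimensional vectors; real ColPali-style models produce many high-dimensional vectors per page, and the page names here are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum over query vectors of the maximum dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    return sorted(pages, key=lambda name: maxsim(query_vecs, pages[name]), reverse=True)


query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.5, 0.5]],
    "page_b": [[0.0, 1.0], [0.2, 0.9]],
}
order = rank_pages(query, pages)
```

Because each query vector picks its own best match, the score already reorders candidates at fine granularity, which is why this route does not need to be averaged with a cross-encoder.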

heavy-final-rerank

Heavy final rerank

Next

High-value questions that justify extra cost; the final judgment after the top candidates have been narrowed

Policy
Shrink top-k, then rerank with a Qwen3 4B/8B teacher or an LLM judge score
Model
Qwen3-Reranker-4B/8B teacher lane
Gate
quality gain vs latency/cost, p95 budget, disagreement audit
Marketing Medium
Product Medium

Not an always-on model; used as the upper bound for premium/high-risk routes and as a distillation source.
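The "shrink top-k, then rerank" policy is a two-stage pattern: a cheap score narrows the pool, and the expensive scorer only sees the shortlist. In the sketch below both scorers are stand-ins (token overlap), not Qwen3 or an LLM judge; `keep` and the function names are assumptions.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def two_stage_rerank(query: str, candidates: List[str],
                     cheap_score: Scorer, heavy_score: Scorer,
                     keep: int = 5) -> List[str]:
    """Keep the top `keep` candidates by cheap score, then reorder them by heavy score."""
    shortlist = sorted(candidates, key=lambda d: cheap_score(query, d), reverse=True)[:keep]
    return sorted(shortlist, key=lambda d: heavy_score(query, d), reverse=True)


def token_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


heavy_calls: List[str] = []  # records which documents reached the expensive stage


def stand_in_heavy_score(query: str, doc: str) -> float:
    heavy_calls.append(doc)
    return token_overlap(query, doc) / (len(doc.split()) or 1)


docs = [
    "budget notice 2026 seoul",
    "seoul budget",
    "weather report",
    "budget",
    "seoul travel guide budget tips",
]
final = two_stage_rerank("seoul budget", docs, token_overlap, stand_in_heavy_score, keep=3)
```

The expensive scorer runs only `keep` times per query, which is what makes a p95 latency budget enforceable on this route.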

Benchmark Matrix

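Candidate scores per benchmark

Based on the 2026-06-03 server A baseline + qfilter full50 · nDCG@10 is shown ×100, hit rate in % · the accent cell in each row is the top score.

Benchmark · No · DK · DK50 · BGE · BGE50 · Hard50 · Q0.6 · Q4 · GTE · Pool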
Grade A
Ready
public
full
BM25 proxy

AutoRAGRetrieval

Finance, public sector, medical, legal, commerce · nDCG@10 · 114Q / 720 docs

81.7
91.6
91.6
90.8
91.0
91.5
82.1

BM25 top50

MIT

The fixed-validation selected s02 also improves on AutoRAG. The release decision is on hold until SDS no-regression passes.
Grade A
Ready
public
full
BM25 proxy

Allganize rag-ko

Korean RAG queries · nDCG@10 · includes HF context

78.8
88.3
88.3
88.5
88.9
88.7

BM25 top50

Check per dataset

BGE qfilter rises slightly. Further training of dragonkue yields no gain on this axis.
Grade A
Ready
public
full
license caution

PublicHealthQA-ko

Health and public-sector queries · nDCG@10 · 77Q / 77 docs

45.4
71.7
71.6
72.4
73.6

BM25 top50

CC-BY-NC caution

BGE qfilter improves substantially, but licensing limits marketing citation to a supporting role.
Grade B
Ready
public
sample
positive-preserving

MIRACL-ko dev

Wikipedia retrieval · nDCG@10 · 213Q / 50k sampled docs

46.1
66.6
66.8
66.9
66.8

BM25 top50 + positive-preserving 50k sample

Apache-2.0

BGE qfilter rises only marginally. Included in official core coverage, but standalone claims are prohibited.
Grade A
Ready
public
full
retriever miss

Ko-StrategyQA

Commonsense and reasoning retrieval · nDCG@10 · 592Q / 9,251 docs

retriever miss 30.9%

38.4
55.1
56.7

BM25 top50

Check per dataset

Reranker improvement and retriever-miss improvement must be assessed separately.
Grade B
Oracle
public
oracle
retriever miss

XPQA-ko

Multilingual QA retrieval · nDCG@10 · 654Q / 889 docs

retriever miss 71.6%

15.4
63.9
65.8

BM25 top50 oracle

Check per dataset

The plain BM25 pool has a high miss rate, so reranker evaluation is distorted unless an oracle pool is run alongside.
Grade B
Sample
public
sample
BM25 proxy
positive-preserving

SDS-KoPub sample

Public-sector documents in 6 fields · nDCG@10 · 99Q / 5,098 pages

53.0
98.0
98.4

BM25 top50 · positive-preserving 100Q/5k sample

CC-BY-SA 4.0 / KOGL Type1

server A data/baseline_summary_sds_sample_20260602.md

Candidate for the AXyBench reproduction pack. Shown publicly with the sample and BM25 proxy labels attached.
Grade B
Ready
public
backend full
hybrid retriever

SDS-KoPub-VDR

Public-sector documents in 6 fields · answer hit rate · 600Q / 40,781 pages

63.3%
70.0%
71.4%
81.7%

backend full 600 · BGE-M3 hybrid top50

CC-BY-SA 4.0 / KOGL Type1

backend full 600 baseline · Qwen3 4B teacher probe

The backend full baseline is kept. The public full pack is added separately from the sample after runtime measurement.
Grade B
Ready
public
prod-style
hybrid retriever
full50 smoke

SDS-KoPub prod top50

Public-sector documents in 6 fields · nDCG@10 · 600Q / 40,781 pages

62.7
69.9
68.3
71.5
70.3
71.3

isolated qdrant-0 full 600 · BGE-M3 hybrid top50

CC-BY-SA 4.0 / KOGL Type1

server A data/qfilter_full50_smoke_compare_20260603.md

There was a first GREEN ceiling, but the fixed-validation selected s02 drops on SDS, so it is not a release candidate. Next is SDS no-regression repair.
Grade Internal
Ready
Internal
internal
held-out
no redistribution

AI Hub held-out 569

Internal Korean QA · answer hit rate · 569 held-out

81.9%
91.8%
90.5%

backend held-out · BGE-M3 hybrid top50

internal only

Valid for production smoke tests, but not used for external claims.
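Most rows in the matrix report nDCG@10. As a reference point, a minimal sketch of the metric is shown below, assuming linear gain and an ideal ranking built from all judged positives for the query. The evaluation harness behind these numbers may use a different gain function or library, so treat this as an illustration of the definition only.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels: List[float], judged_rels: List[float], k: int = 10) -> float:
    """ranked_rels: relevance of each returned doc in rank order.
    judged_rels: every relevance label known for the query (defines the ideal)."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# One relevant document, returned at rank 2.
second_place = ndcg_at_k([0, 1, 0], [1])
# Retriever miss: the relevant document never entered the candidate pool.
retriever_miss = ndcg_at_k([0, 0, 0], [1])
print(f"{100 * second_place:.1f} vs {100 * retriever_miss:.1f}")
```

The second case shows why retriever miss matters for the rows that flag it: when the positive is absent from the top-50 pool, no reranker can score above zero on that query.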
Model Slate

Training and comparison candidates

active: compared right away; teacher: upper bound and labeling; watch: promoted after measurement on the same pack.

Primary main base

BGE clean base

Active

BAAI/bge-reranker-v2-m3

Signal
Leads on Allganize, PublicHealth, MIRACL, Ko-StrategyQA, XPQA, and SDS sample
Inference
CrossEncoder / sequence-classifier
Risk
Korean domain tuning must be applied directly

full FT smoke

BGE qfilter FT50

Watch

runs/20260603_qfilter_bge_v2_m3_full50

Signal
Allganize +0.0037, PublicHealth +0.0121
Inference
CrossEncoder / full checkpoint
Risk
SDS prod top50 -0.0114, so not a promotion candidate

fixed-validation selected candidate

BGE prod-harder selected

Watch

runs/20260604_prod_harder_bge_v2_m3_lr2e6_s50_fixedval_s02

Signal
internal selector PASS: nDCG +0.0237, Hit@1 +0.046; public core AMBER avg +0.0015
Inference
CrossEncoder / full checkpoint
Risk
SDS prod top50 -0.0022, so not a release/marketing candidate. Needs re-selection after SDS repair.

Production baseline · delta lane

dragonkue current

Active

dragonkue/bge-reranker-v2-m3-ko

Signal
Best on AutoRAG and AI Hub held-out
Inference
Current production baseline reranker
Risk
Already Korean-tuned, so additional LoRA gains and overfitting need to be checked

delta smoke

DK qfilter FT50

Watch

runs/20260603_qfilter_dragonkue_full50

Signal
AutoRAG +0.0004, Allganize +0.0000
Inference
CrossEncoder / full checkpoint
Risk
PublicHealth -0.0011, SDS prod top50 -0.0155

Ceiling probe

Qwen3 0.6B

Watch

Qwen/Qwen3-Reranker-0.6B

Signal
0.6B was weak, but it serves to probe the Qwen family's ceiling
Inference
Slow · AutoRAG p95 about 2.04s / 50 docs
Risk
In its current state, unfavorable on cost/speed as a prod candidate

teacher / upper-bound

Qwen3 4B

Teacher

Qwen/Qwen3-Reranker-4B

Signal
81.74% teacher upper-bound on SDS backend full 600
Inference
For training-label generation and upper-bound measurement
Risk
Too heavy to serve as a production reranker

Efficiency lane

GTE multilingual

Blocked

Alibaba-NLP/gte-multilingual-reranker-base

Signal
Not eliminated as a candidate
Inference
custom model forward runtime needs to be checked
Risk
index error on both CPU and GPU on server A

Korean watch baseline

upskyy ko-reranker

Watch

upskyy/ko-reranker-8k

Signal
Baseline to beat according to existing docs
Inference
Awaiting further measurement
Risk
No claims until measured on the same top50 pack

Korean watch baseline

Dongjin ko-reranker

Watch

Dongjin-kr/ko-reranker

Signal
Baseline to beat according to existing docs
Inference
Awaiting further measurement
Risk
No claims until measured on the same top50 pack

Commercial/general-purpose watch baseline

Jina reranker v2

Watch

jina-reranker-v2

Signal
Baseline to watch
Inference
Awaiting further measurement
Risk
The license and measurement path must be fixed first

Launch Plan

Remaining work before full-scale training

The AXyBench reranker bench manages the held-out sets for external claims and the prod gates on the same screen.

01

Held-out pack 봉인

eval_manifest + blacklist

Dedup criteria between the public/internal evaluation sets and the training pool are fixed.
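One way to enforce a held-out blacklist is to fingerprint normalized evaluation queries and drop any training row that collides. The normalization rule, the `query` field name, and the helper names below are assumptions; the actual dedup criteria live in the eval_manifest and are not reproduced here.

```python
import hashlib
import unicodedata
from typing import Dict, Iterable, List, Set


def fingerprint(text: str) -> str:
    """Hash of the text after NFKC normalization, lowercasing, and whitespace collapse."""
    normalized = " ".join(unicodedata.normalize("NFKC", text).lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_blacklist(eval_queries: Iterable[str]) -> Set[str]:
    return {fingerprint(q) for q in eval_queries}


def filter_training_pool(rows: List[Dict[str, str]], blacklist: Set[str]) -> List[Dict[str, str]]:
    """Keep only training rows whose query does not collide with a held-out query."""
    return [row for row in rows if fingerprint(row["query"]) not in blacklist]


eval_queries = ["서울시  예산 공고", "What is BM25?"]
training_pool = [
    {"query": "서울시 예산 공고"},   # same query modulo whitespace
    {"query": "what is bm25?"},      # same query modulo case
    {"query": "리랭커 학습 방법"},
]
clean_pool = filter_training_pool(training_pool, build_blacklist(eval_queries))
```

Exact-match fingerprints only catch trivial variants; near-duplicate paraphrases would need a fuzzier check, which this sketch does not attempt.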

02

SDS public loader

text-only parquet loader

100Q/5k sample generated, validated, and baselined. Reflected in the on-screen SSOT.

03

SDS full public pack

600Q / 40,781 corpus

The full (non-sample) public BM25 pack is added after runtime measurement.

04

BGE clean full FT smoke

BAAI/bge-reranker-v2-m3

qfilter full50 done. Some public benchmarks rise; SDS prod-style drops.

05

dragonkue delta FT smoke

dragonkue current baseline

qfilter full50 done. The delta gain is small and SDS prod-style drops.

06

LoRA/PEFT adapter sweep

BGE clean rank 8/16

For data-ratio experiments. The encoder-only path needs a separate PEFT wrapper.

07

Product mix repair

SDS/prod hard negatives

rank 1-50 prod-harder negatives + LR2e-6 s50 reached a GREEN ceiling. The fixedval selected s02 shows SDS -0.0022, so the mix needs repair.

08

Fixed validation selector

internal_validation_selector.py

500-row fixed internal split + real retrain selector PASS. s02 selected before official benchmark.

09

Stable tiny retrain

internal_select_train + Trainer seed

s01-s04 done on server A 4x3090. Selector picked s02; official public core AMBER due SDS regression.

10

SDS no-regression repair

data mix + loss recipe

Re-run with a small combination of SDS/prod hard negatives, easy/random negatives, public/AIHub positives, and KD labels.

11

Product/File fusion seed

product_file_fusion_gate.py

8-scenario seed PASS. 0 AI Box content-exclusion violations.

12

Live Files export

export_product_file_fusion_live_golden.py

Converts real files rows into review-required candidate golden. After repair1, the 100-scenario smoke PASSes and the review pack is ready.

13

Trace task export

export_product_file_fusion_trace_golden.py

Successful-trace 100-scenario smoke: Hit@1 .200 · Hit@3 .350 · MRR .344 · content violations 0. Generated-artifact intent metadata needs repair.

14

Generated artifact metadata repair

generated_artifact_metadata.py + save paths

codesign canonical/export children and cmd_word/cmd_page exports store source_prompt/source_trace_summary/rank_hints in source_metadata.

15

Historical metadata backfill proposal

build_generated_artifact_metadata_backfill.py

Generates review-required source_metadata_patch JSONL from the trace golden. Dry-run patched gate .920/.920, no DB writes.

16

Residual Files lookup review

product_file_fusion_residual_review.py

Only the 8 trace_files_lookup cases outside the patched gate's Top3 are split into a focused review pack. replace/drop decisions take priority.

17

Residual decision suggestions

product_file_fusion_residual_decision_suggestions.py

8 residual scenarios → replace suggestion 4 · needs-human-review 4. machine_suggested only, not held-out labels.

18

Product/File alias repair

product_file_fusion_gate.py

proposal/rfp/deck/ir alias overlap. Patched trace 100: Hit@1 .970 · Hit@3 .990 · MRR .980 · 1 residual.

19

Product/File trace reviewed gate

reviewed_gate + delegated review ledger

99 approve · 1 explicit drop. Reviewed-only gate PASS: Hit@1 .980 · Hit@3 1.000 · MRR .990 · content violations 0.

20

Route release gate

reranker_route_release_gate.py

first GREEN checkpoint route release PASS. fixedval selected s02 is public core AMBER, so no release route yet.

21

Human review pack

product_file_fusion_review_pack.py

Generates Markdown + a decisions JSONL template. Single-batch generation is supported.

22

Review batch index

product_file_fusion_review_batches.py

Generates the entire candidate golden as 20-row batches plus an index in one pass.

23

Review decision merge

merge_product_file_fusion_review_decisions.py

Merges per-batch .reviewed.jsonl files into a single ledger. Blocks duplicate scenario ids and audit failures.

24

Decision audit

audit_product_file_fusion_review_decisions.py

Blocks missing/unreviewed/invalid/replace id errors before apply.
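The four error classes named in this step can be checked with a small audit pass. The sketch below assumes hypothetical field names (`scenario_id`, `decision`, `replace_file_id`) and a decision vocabulary of approve/drop/replace; the real script's schema is not shown on this page.

```python
from typing import Dict, Iterable, List, Optional, Set

VALID_DECISIONS = {"approve", "drop", "replace"}  # assumed vocabulary
ERROR_KEYS = ("missing", "unreviewed", "invalid", "bad_replace_id")


def audit_decisions(expected_ids: Iterable[str],
                    decisions: List[Dict[str, Optional[str]]],
                    known_file_ids: Set[str]) -> Dict[str, object]:
    """Classify every expected scenario and report whether the ledger can be applied."""
    by_id = {d["scenario_id"]: d for d in decisions}
    report: Dict[str, object] = {key: [] for key in ERROR_KEYS}
    for scenario_id in expected_ids:
        row = by_id.get(scenario_id)
        if row is None:
            report["missing"].append(scenario_id)
            continue
        decision = row.get("decision")
        if not decision:
            report["unreviewed"].append(scenario_id)
        elif decision not in VALID_DECISIONS:
            report["invalid"].append(scenario_id)
        elif decision == "replace" and row.get("replace_file_id") not in known_file_ids:
            report["bad_replace_id"].append(scenario_id)
    report["ready_for_apply"] = not any(report[key] for key in ERROR_KEYS)
    return report


# An untouched 100-row template: every decision is still empty.
expected = [f"s-{n:03d}" for n in range(1, 101)]
template = [{"scenario_id": sid, "decision": None} for sid in expected]
template_report = audit_decisions(expected, template, known_file_ids=set())
```

Run against an unfilled template, the audit reports 100 unreviewed rows and refuses to mark the ledger ready, matching the behavior described for the template audit.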

25

Reviewed golden apply

apply_product_file_fusion_review_decisions.py

Blocks unreviewed/missing decisions. Generates the held-out golden after the reviewer ledger is applied.

26

Reviewed-only gate

product_file_fusion_reviewed_gate.py

Rejects seed/live packs. Product/File held-out metrics run only when human-reviewed metadata is present.

27

Lower-strength repair

BGE clean 25-step / LR2e-6

A conservative update experiment to bring SDS back to at or above raw BGE.

28

GTE adapter fix

runtime/backend

The efficiency-lane baseline cannot be confirmed until the index error is resolved.

Source: CommanderOS reranker pack · eval_manifest, fixed candidates, held-out blacklist, server A baseline and qfilter full50 summaries.

Next: compare the SDS-style hard negative mix and the LoRA/PEFT adapter sweep on the same held-out pack.