1. Evaluation Overview
Objective: Select an embedding model optimized for a Korean statute and ordinance search (RAG) system
Evaluation Dataset
- KCL-MCQA (Korean Canonical Legal Benchmark)
- 282 questions, 867 case law documents (expert-tagged Ground Truth)
Rationale for Data Selection
- Currently, no public benchmark dataset exists for Korean statutes/ordinances
- KCL-MCQA is the only verified Korean search evaluation dataset in the legal domain
- Case law and statutes/ordinances share the same legal terminology and writing style, so similar embedding performance can be expected
- Re-evaluation recommended when statute/ordinance-specific evaluation datasets are built
Evaluation Environment
- Search Engine: PostgreSQL pgvector with HNSW index
- Evaluation Metrics: Recall@5, Precision@5, MRR, NDCG@5
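For reference, a minimal sketch of this retrieval path is shown below. It assumes a hypothetical `documents` table with `doc_id` and `embedding` columns (not the exact schema used in this evaluation) and uses cosine distance via pgvector's `<=>` operator, with the HNSW parameters listed in the appendix.

```python
# Illustrative sketch only. Assumptions (not from the report): a "documents"
# table with doc_id and a pgvector "embedding" column, accessed via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=legal_rag")
cur = conn.cursor()

# HNSW index with the parameters from the appendix (m=16, ef_construction=64),
# using cosine distance for embedding-similarity search.
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_hnsw
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()

def top_k(query_embedding, k=5):
    """Return the k nearest documents by cosine distance (pgvector <=> operator)."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        """
        SELECT doc_id, embedding <=> %s::vector AS distance
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s;
        """,
        (vec, vec, k),
    )
    return cur.fetchall()
```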
2. Comparison Models
| Model | Provider | Dimension | Characteristics |
|---|---|---|---|
| Amazon Titan V2 | AWS Bedrock | 1024 | AWS native, cost-effective |
| Cohere Embed V4 | AWS Bedrock | 1536 | Multilingual specialized, high-performance |
| KURE-v1 | HuggingFace (requires SageMaker hosting) | 1024 | Korean-specialized open-source |
3. Evaluation Metrics Explanation
Recall@K ⭐⭐⭐
Definition: Proportion of relevant documents included in top K results
Formula: (Number of relevant documents found in top K) / (Total number of relevant documents)
Interpretation
- Higher values mean better identification of relevant documents without omissions
- Most important metric for legal search (prevents missing critical documents)
Example: If there are 5 relevant cases but only 3 appear in top 5 results → Recall@5 = 60%
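In formula form, this worked example is:

$$\text{Recall@}5 = \frac{\#\{\text{relevant docs in top }5\}}{\#\{\text{all relevant docs}\}} = \frac{3}{5} = 60\%$$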
Precision@K ⭐
Definition: Proportion of relevant documents among top K results
Formula: (Number of relevant documents found in top K) / K
Interpretation
- Higher values mean fewer unnecessary documents in search results
- Reduces the number of documents users need to review
Example: If 3 out of top 5 results are actually relevant → Precision@5 = 60%
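The same example in formula form:

$$\text{Precision@}5 = \frac{\#\{\text{relevant docs in top }5\}}{5} = \frac{3}{5} = 60\%$$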
MRR (Mean Reciprocal Rank) ⭐⭐
Definition: Average over all queries of the reciprocal rank at which the first relevant document appears
Formula: reciprocal rank = 1 / (rank of the first relevant document), averaged across all evaluation queries
Interpretation
- Higher values indicate relevant documents appear higher in rankings
- Measures the quality of the first result users see
Example
- First result is relevant → reciprocal rank = 1.0
- Third result is the first relevant one → reciprocal rank = 1/3 ≈ 0.33
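Written out, the per-query reciprocal ranks are averaged over the query set $Q$; treating the two examples above as a tiny two-query set gives:

$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\text{rank}_i} = \frac{1}{2}\left(\frac{1}{1} + \frac{1}{3}\right) \approx 0.67$$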
NDCG@K (Normalized Discounted Cumulative Gain) ⭐⭐
Definition: Comprehensive search quality score considering ranking positions
Interpretation
- Higher scores when relevant documents rank higher
- Evaluates “where” documents appear, not just “whether” they appear
- Values range from 0 to 1; values closer to 1 indicate a ranking closer to the ideal ordering
Example: A relevant document at rank 1 or 2 contributes far more than one at rank 4 or 5, and documents outside the top K contribute nothing to NDCG@K
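The standard formula is shown below for reference, where $rel_i$ is the relevance of the document at rank $i$ (presumably binary in this evaluation, given the expert-tagged ground truth) and IDCG@K is the DCG of the ideal ordering:

$$\text{DCG@}K = \sum_{i=1}^{K}\frac{rel_i}{\log_2(i+1)}, \qquad \text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$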
4. Evaluation Results
4.1 Performance Metrics Comparison
| Model | Recall@5 | Precision@5 | MRR | NDCG@5 | Rank |
|---|---|---|---|---|---|
| Cohere Embed V4 | 55.60% | 34.54% | 69.45% | 55.14% | 1st |
| KURE-v1 | 47.15% | 29.36% | 63.57% | 47.02% | 2nd |
| Amazon Titan V2 | 40.09% | 24.18% | 59.00% | 40.68% | 3rd |
4.2 Performance Comparison Chart

Figure: Performance metrics comparison of three models
4.3 Performance Difference Analysis
| Comparison | Recall@5 Difference | Relative Improvement | MRR Difference | Relative Improvement |
|---|---|---|---|---|
| Cohere vs Titan | +15.5%p | 38.7% | +10.5%p | 17.7% |
| Cohere vs KURE | +8.5%p | 17.9% | +5.9%p | 9.3% |
| KURE vs Titan | +7.1%p | 17.6% | +4.6%p | 7.7% |
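Here "Relative Improvement" is the absolute difference divided by the lower-scoring model's value; for example, Cohere vs. Titan on Recall@5:

$$\text{Relative improvement} = \frac{55.60\% - 40.09\%}{40.09\%} \approx 38.7\%$$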
4.4 Key Findings
- Cohere Embed V4 ranks first across all metrics
- Korean-specialized KURE-v1 outperforms general-purpose Titan V2, but falls short of Cohere
- MRR differences are smaller than the Recall differences → all models surface a first relevant result reasonably well; the larger gaps are in retrieving all relevant documents
5. Cost Analysis
5.1 Embedding API Costs (Bedrock)
| Model | Price (per 1K tokens) | Cost for 1,000 documents | Cost for 1M documents |
|---|---|---|---|
| Amazon Titan V2 | $0.00002 | $0.012 | $12.30 |
| Cohere Embed V4 | $0.00012 | $0.060 | $60.00 |
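The per-document costs imply an average document length of roughly 500 to 600 tokens (an assumption inferred from the table, not stated by the pricing itself); for example, the Cohere figure for 1M documents works out as:

$$1{,}000{,}000\ \text{docs} \times 0.5\text{K tokens/doc} \times \$0.00012\ \text{per 1K tokens} = \$60$$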
5.2 KURE-v1 Hosting Costs (SageMaker)
KURE-v1 is open-source but requires SageMaker infrastructure for hosting.
| Hosting Method | Memory/Instance | Unit Price | Monthly (100k requests) | Notes |
|---|---|---|---|---|
| Serverless | 4GB | ~$0.0000467/sec | $5-10 | Variable cost |
| Real-time | ml.g4dn.xlarge | ~$0.7364/hour | $530 | Fixed cost |
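These monthly estimates follow directly from the unit prices: the real-time figure assumes an always-on instance (about 720 hours per month), and the serverless figure assumes roughly 1 to 2 seconds of billed compute per request (an inferred assumption):

$$\$0.7364/\text{h} \times 720\ \text{h} \approx \$530, \qquad 100{,}000 \times (1\text{ to }2)\ \text{s} \times \$0.0000467/\text{s} \approx \$4.7\text{ to }\$9.3$$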
5.3 Monthly Operating Cost Comparison (100k requests basis)
| Model | Hosting Method | Monthly Cost | Management Burden |
|---|---|---|---|
| Titan V2 | Bedrock API | ~$1.23 | None |
| Cohere V4 | Bedrock API | ~$6.00 | None |
| KURE-v1 | SageMaker Serverless | ~$5-10 | Medium (cold start latency) |
| KURE-v1 | SageMaker Real-time | ~$530 | High (instance management) |
6. Detailed Model Analysis
6.1 Cohere Embed V4
| Aspect | Details |
|---|---|
| Strengths | Highest performance (Recall@5 55.6%); excellent understanding of Korean legal terminology thanks to multilingual training; easy to operate via Bedrock integration |
| Weaknesses | About 5x higher cost than Titan V2 |
| Best For | Legal services, B2B services where accuracy is critical |
6.2 KURE-v1 (Korea University)
| Aspect | Details |
|---|---|
| Strengths | Korean-specialized training; customizable as open source; storage-efficient at 1024 dimensions |
| Weaknesses | Requires SageMaker hosting (additional infrastructure management); serverless cold-start latency of several seconds; real-time endpoints incur significant fixed cost; Recall@5 is 8.5%p lower than Cohere |
| Best For | Cases requiring fine-tuning/customization, internal systems (where latency tolerance is acceptable) |
6.3 Amazon Titan V2
| Aspect | Details |
|---|---|
| Strengths | Most cost-effective; native Bedrock integration; minimal operational complexity |
| Weaknesses | Lowest performance (Recall@5 40.1%); not optimized for the Korean legal domain |
| Best For | MVP/prototypes, cost-sensitive services, where speed/cost matters more than accuracy |
7. Recommendations
7.1 Final Recommendation
Cohere Embed V4
In the legal domain, a 15.5%p difference in Recall@5 far outweighs the cost difference.
7.2 Rationale
- Legal Domain Specificity
  - Missed statutes/ordinances can lead to incorrect legal interpretation.
  - Recall@5 of 55.6% vs 40.1% means roughly 1.5 more of every 10 relevant documents are actually retrieved.
  - Those additional documents could contain critical legal provisions.
- Relative Cost Significance
| Item | Cost |
|---|---|
| Additional Cohere cost (monthly, 100k requests) | $4.77 |
| Retainer fee for a single legal service | Several hundred thousand KRW or more |
→ From an ROI perspective, the additional Cohere cost is negligible
- Operational Convenience
  - No infrastructure management needed with the Bedrock API
  - Avoids the SageMaker operational burden that KURE-v1 would require
8. Conclusion
Performance Rankings
| Rank | Model | Recall@5 | Characteristics |
|---|---|---|---|
| 1 | Cohere Embed V4 | 55.6% | Highest performance |
| 2 | KURE-v1 | 47.2% | Korean-specialized |
| 3 | Amazon Titan V2 | 40.1% | Cost-effective |
Core Insights
| Perspective | Conclusion |
|---|---|
| Absolute Performance | Cohere V4 superior across all metrics |
| Cost Efficiency | Titan V2 most efficient but sacrifices performance significantly |
| Korean Specialization | KURE-v1 better than Titan but cannot match Cohere |
| Operational Convenience | Bedrock models (Titan, Cohere) more convenient than SageMaker |
Final Conclusion
The core value in legal domains is “reliability.”
Cohere V4’s 55.6% Recall is not perfect, but significantly more trustworthy than Titan V2’s 40.1% or KURE-v1’s 47.2%.
The risk of “missing critical legal provisions” in statute/ordinance search cannot be offset by saving a few dollars monthly.
Therefore, Cohere Embed V4 is strongly recommended.
Appendix: Evaluation Environment
| Item | Details |
|---|---|
| Evaluation Date | January 2026 |
| Dataset | KCL-MCQA (lbox/kcl) |
| Evaluation Queries | 282 |
| Corpus | 867 case law documents (500+ characters) |
| Search Engine | PostgreSQL 16 + pgvector 0.7 |
| Index | HNSW (m=16, ef_construction=64) |