HTTP HEAD pre-check · High-Water Mark reverse scanning · Score threshold noise filtering
Real-world guide to syncing 81MB CSV → PostgreSQL in a serverless environment

Problem Definition — Why Optimization Was Needed

I was implementing a “district unit plan notice matching” feature for a building code review AI system. When querying district unit plan information for a specific address via VWorld API, it returns ntfc_sn (notice serial number), but this number alone doesn’t provide access to the actual notice document.

The notice list exists on eum.go.kr (Korea Land Information Portal), but their WAF blocks AWS IP ranges, making direct API calls impossible. As an alternative, I chose to sync the notice list CSV file from the public data portal (data.go.kr) to a PostgreSQL DB for matching.

Data Scale

| Item | Value |
|---|---|
| CSV file size | ~81MB |
| Total records | 448,100 |
| District unit plan related | ~28,300 |
| CSV update frequency | 1-2 times/month |
| New additions | Tens to hundreds/month |

The problem was that downloading and parsing all 448K records every time took about 52 seconds. We couldn’t make users wait 52 seconds while an agent responds to their questions.

Constraints — Limitations of Serverless Environment

The system runs serverlessly on AWS Bedrock AgentCore. This environment has the following constraints:

No local filesystem: Since containers spin up fresh each time, we cannot cache the CSV locally.

Cold start consideration: Reasonable response time is needed even on first invocation.

DB is the only persistent storage: PostgreSQL is the only place that can maintain state.

No HTTP Range support: data.go.kr doesn’t support Range requests, so partial downloads are impossible.

Key question: “If only tens of records are added monthly, do we really need to process all 448K records every time?”

Overall Architecture — 3-Layer Optimization Strategy

Following the principle of “do as little as possible,” I designed a 3-stage optimization. The earlier we exit from each layer, the lower the total cost.

| Layer | Strategy | Cost | Skip condition |
|---|---|---|---|
| Layer 1 | HEAD request → compare Content-Length | 0.3s | Exit immediately if file size is identical |
| Layer 2 | Full download + reverse line scan | ~2s | Stop scan at the max_seq boundary |
| Layer 3 | Parse only new records with csv.reader + INSERT | ~0.01s | Process only new records |

The key to this strategy is that “most calls end at Layer 1.” If the CSV hasn’t been updated, the entire process terminates with just one HEAD request (0.3 seconds).

Call flow:

  [HEAD request 0.3s]
       ├─ Content-Length same? ──→ Exit (0.3s)     ← 90% of cases
       ├─ Different → [Full download ~1.5s]
       │              │
       │         [Reverse scan ~0.1s]
       │              │
       │              ├─ 0 new records? ──→ Save CL and exit (~2s)
       │              │
       │              └─ N new records → [Parse+INSERT ~0.01s] ──→ Exit (~2s)
       └─ HEAD failed → Proceed with full download (fallback)

Layer 1: HEAD Request Pre-check

The Idea

HTTP HEAD requests return only headers without the response body. By storing Content-Length in the DB and comparing it on the next call, we can determine if the file has changed without downloading 81MB.

Preliminary Investigation — data.go.kr Response Headers

Results from actual HEAD request:

| Header | Value | Usable? |
|---|---|---|
| Content-Length | 84,733,371 (~81MB) | Yes: used for change detection |
| Last-Modified | (none) | No: not provided |
| ETag | (none) | No: not provided |

Only Content-Length is provided, but since the file size always changes when rows are appended to the CSV, it's sufficient. ETag or Last-Modified would have been more precise, but public data portals rarely provide them.

Layer 1 Implementation

-- Store Content-Length in the app_settings table
CREATE TABLE IF NOT EXISTS app_settings (
    setting_key VARCHAR(100) PRIMARY KEY,
    setting_value TEXT,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

# HEAD request → compare → decide whether to skip
import urllib.request

req = urllib.request.Request(url, method='HEAD')
with urllib.request.urlopen(req, timeout=30) as resp:
    remote_cl = int(resp.headers.get('Content-Length', 0))

stored_cl = get_stored_content_length()  # Query from DB

if remote_cl > 0 and remote_cl == stored_cl:
    return {"status": "ok", "message": "No changes"}  # Exit in 0.3s

Note: HEAD requests can fail (timeout, server errors, etc.), so we proceed with full download as fallback. Even if optimization fails, the feature itself should work.
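That fallback behavior can be sketched as a pair of small helpers. The function names and the separation into "fetch" and "decide" steps are illustrative, not taken from the actual codebase:

```python
import urllib.request
from typing import Optional

def fetch_remote_content_length(url: str, timeout: int = 10) -> Optional[int]:
    """HEAD request; returns Content-Length, or None on any failure
    (timeout, 5xx, missing header) so the caller falls back to a full download."""
    try:
        req = urllib.request.Request(url, method='HEAD')
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            cl = resp.headers.get('Content-Length')
            return int(cl) if cl is not None else None
    except Exception:
        return None

def should_skip_download(remote_cl: Optional[int], stored_cl: Optional[int]) -> bool:
    """Skip only when both sides are known and equal; any uncertainty
    means we download, so the feature still works when the optimization fails."""
    return remote_cl is not None and remote_cl > 0 and remote_cl == stored_cl
```

Returning None on every failure mode keeps the decision logic trivial: only a confirmed, matching size skips the download.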

Layer 2: Reverse Line Scan (High-Water Mark)

High-Water Mark Pattern

This is a commonly used pattern in CDC (Change Data Capture). Record “the last processed point” and process only after that point next time.

In this CSV, each row’s eum_url has a sequence number in the format seq=NNNNN, and seq increases as new notices are added. By querying MAX(seq) from the DB, we know “where we processed up to.”

CSV structure example:

─────────────────────────────────────────────────────
Row 1:     "Seoul Notice No. 2020-001", ..., seq=100001
Row 2:     "Busan Notice No. 2020-002", ..., seq=100002
...
Row 448098: "Gyeonggi Notice No. 2025-311", ..., seq=611080  ← DB max_seq
Row 448099: "Ministry Notice No. 2025-598", ..., seq=611081  ← New
Row 448100: "Seoul Notice No. 2025-155", ..., seq=611082  ← New
─────────────────────────────────────────────────────

Why Scan in Reverse (from the end)?

Forward scanning (from the beginning) requires checking all 448,100 rows to find new records. But since new rows are always appended to the end of the file, scanning from the end allows us to stop immediately upon encountering existing data (seq ≤ max_seq).

| Scan direction | 0 new | 5 new | 1,000 new |
|---|---|---|---|
| Forward (start→end) | Scan all 448,100 rows | Scan all 448,100 rows | Scan all 448,100 rows |
| Reverse (end→start) | Stop after 1 row | Stop after 6 rows | Stop after 1,001 rows |

In practice, where tens to hundreds of records are added monthly, reverse scanning runs in time proportional to the number of new rows — effectively constant relative to the 448K total.

Layer 2 Implementation

import urllib.request

# Query the last processed point (high-water mark) from the DB
max_seq = db.query("SELECT MAX(seq) FROM gosi_info WHERE eum_url LIKE '%seq=%'")

# Load the entire CSV into memory (no local filesystem available)
csv_text = urllib.request.urlopen(url).read().decode('utf-8-sig')
lines = csv_text.split('\n')

# Scan from the end — stop immediately when seq ≤ max_seq
new_line_indices = []
for i in range(len(lines) - 1, 0, -1):
    line = lines[i].strip()
    if not line:
        continue

    seq_pos = line.find('seq=')
    if seq_pos == -1:
        continue

    seq = int(line[seq_pos + 4:].split(',')[0].split('"')[0])

    if seq <= max_seq:
        break              # Already in DB from here → Stop scanning

    new_line_indices.append(i)

new_line_indices.reverse()  # Restore original order

Key point: instead of parsing the entire CSV with DictReader, we extract only the seq= value from the raw string; full parsing happens only for new records. In measurements, a full DictReader parse of 448,100 rows took 1.60s versus 0.14s for the reverse scan, about 11x faster.
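The shape of that difference can be reproduced with a synthetic benchmark. The row count, column layout, and URLs below are stand-ins, not the real data, so absolute timings will differ from the article's numbers:

```python
import csv
import io
import time

# Build a synthetic CSV shaped like the real one: 4 columns, seq embedded in the URL
n = 100_000
header = "gosi_no,gosi_date,eum_url,gosi_title"
rows_src = [
    f'"Notice {i}","2025-01-02","https://example.test/view?seq={100000 + i}","Plan {i}"'
    for i in range(n)
]
csv_text = "\n".join([header] + rows_src)

# Approach 1: full DictReader parse of every row
t0 = time.perf_counter()
rows = list(csv.DictReader(io.StringIO(csv_text)))
full_parse = time.perf_counter() - t0

# Approach 2: reverse scan on the raw string, stopping at the high-water mark
max_seq = 100000 + n - 1  # pretend every row is already in the DB
t0 = time.perf_counter()
split_lines = csv_text.split("\n")
new_indices = []
for i in range(len(split_lines) - 1, 0, -1):
    pos = split_lines[i].find("seq=")
    if pos == -1:
        continue
    seq = int(split_lines[i][pos + 4:].split('"')[0])
    if seq <= max_seq:
        break  # everything from here down is already processed
    new_indices.append(i)
scan = time.perf_counter() - t0
```

With zero new rows, the reverse scan touches one line (plus the string split), while DictReader still materializes a dict per row.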

Layer 3: Lightweight Parsing + Batch INSERT

After securing only the indices of new rows through reverse scanning, we parse only those rows with csv.reader.

csv.DictReader vs csv.reader

DictReader creates a dictionary for each row, which has overhead. For CSVs with fixed column order, index-based access is more efficient.

# CSV column order: notice no[0], notice date[1], URL[2], notice title[3]
new_rows = []
for i in new_line_indices:
    parsed = next(csv.reader([lines[i]]))
    gosi_no    = parsed[0].strip()
    gosi_date  = parsed[1].strip()
    eum_url    = parsed[2].strip()
    gosi_title = parsed[3].strip()

    # Tuple order matches the INSERT column list below; the raw date string
    # is stored alongside the (possibly corrected) gosi_date
    new_rows.append((gosi_no, gosi_date, gosi_date, gosi_title, eum_url))

Why use csv.reader: str.split(",") is faster, but it breaks when values contain commas (e.g., “Seoul Gangnam-gu, Seocho-gu”). csv.reader safely handles quoted values according to RFC 4180 specification.
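A quick illustration of why the naive split breaks (the sample line is made up):

```python
import csv

# A field value containing a comma, quoted per RFC 4180
line = '"Seoul Gangnam-gu, Seocho-gu","2025-01-02","https://example.test/view?seq=1","Notice title"'

naive = line.split(',')            # splits inside the quoted field: 5 pieces
parsed = next(csv.reader([line]))  # respects quoting: 4 fields

# naive[0]  == '"Seoul Gangnam-gu'             (wrong: truncated, quote kept)
# parsed[0] == 'Seoul Gangnam-gu, Seocho-gu'   (correct)
```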

Batch INSERT + ON CONFLICT

from psycopg2.extras import execute_batch

execute_batch(cursor,
    """INSERT INTO gosi_info (gosi_no, gosi_date, gosi_date_raw, gosi_title, eum_url,
                              data_source, updated_at, created_at)
       VALUES (%s, %s, %s, %s, %s, 'csv', NOW(), NOW())
       ON CONFLICT (eum_url) DO NOTHING""",
    new_rows,
    page_size=5000,
)

ON CONFLICT DO NOTHING: If reverse scanning works perfectly, duplicate INSERTs shouldn’t occur, but we add a UNIQUE index on eum_url and ON CONFLICT as a safety net. This is defensive programming.
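For reference, a unique index like the following (the index name here is hypothetical) is what gives ON CONFLICT (eum_url) its conflict target:

```sql
-- Without a unique constraint or index on eum_url,
-- ON CONFLICT (eum_url) raises an error at execution time
CREATE UNIQUE INDEX IF NOT EXISTS idx_gosi_info_eum_url
    ON gosi_info (eum_url);
```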

PoC-Based Decision — Removing sido/sigungu Columns

In the initial design, the gosi_info table had sido (province name) and sigungu (city/district name) columns, parsed and stored from the notice number. For example, extracting “Seoul Metropolitan City” from “Seoul Metropolitan City Notice No. 2025-155” and storing it in the sido column.

Data Quality Issues Discovered

Validating actual data revealed parsing quality problems:

| Type | Ratio | Example |
|---|---|---|
| Normal (province only) | ~30% | "Seoul Metropolitan City" |
| Province+city mixed | ~65% | "Gyeonggi-do Yongin-si", "Chungcheongnam-do Asan-si" |
| Unparseable | ~5% | Issuing authority not specified |

With ~65% of the values malformed, cleansing logic would be needed. But do we really need this column at all?

PoC: Can We Get the Same Results Without sido?

For the address “Daegu Metropolitan City Suseong-gu Yeonho-dong 203-2”, matched with ntfc_sn 27000NTC202411080001:

| Search method | Result count | Top-1 result |
|---|---|---|
| sido column filter | 5 | Daegu Yeonho Public Housing District Plan Amendment (4th) [score=0.75] |
| gosi_no LIKE filter | 5 | Daegu Yeonho Public Housing District Plan Amendment (4th) [score=0.75] |
| No filter (date+title only) | 50 | Daegu Yeonho Public Housing District Plan Amendment (4th) [score=0.75] |

Conclusion: all three methods returned the same top result. The sido column narrows candidates via SQL filtering, but score-based sorting surfaces the same result anyway. And with the min_score=0.3 threshold applied, the 50 unfiltered candidates shrink to 2 once noise is removed.

Benefits of Removal

  • Preprocessing time reduced: No sido/sigungu UPDATE needed for 448K records
  • 2 fewer columns on INSERT → slight I/O savings
  • Data consistency issues (65% errors) eliminated entirely
  • Code simplification: _update_sido_sigungu() function deleted

Lesson: Before adding columns that “seem necessary,” validate actual necessity with PoC. In this case, removing the column entirely was better than the cost of cleansing 65% erroneous data.

Score Threshold for Removing Search Noise

After optimizing DB sync, the next problem was search quality. After narrowing candidates by date (±7 days) parsed from ntfc_sn and “district unit plan” keyword, we scored and sorted by multiple factors — but returning top 10 without threshold resulted in severe noise.

Problem: 8 of Top 10 Results Were Noise

Below are results for “Daegu Suseong-gu Yeonho-dong” address searched without threshold:

| Rank | Notice title (summary) | Score |
|---|---|---|
| 1 | Daegu Yeonho Public Housing District Plan Amendment (4th) | 0.75 |
| 2 | Daegu Yeonho Public Housing District Plan Amendment (4th) Topographic Map | 0.75 |
| 3 | Ulsan Jung-gu Bangu-dong District Unit Plan… | 0.10 |
| 4 | Gangwon-do Wonju-si Haengu-dong District Unit Plan… | 0.10 |
| 5 | Gyeongsangnam-do Geoje-si Jangpyeong-dong District Unit Plan… | 0.10 |
| … | (same pattern of unrelated notices) | 0.10 |
| 10 | Seoul Seocho-gu Bangbae-dong District Unit Plan… | 0.10 |

Only the top 2 (score=0.75) were actually relevant notices, while the remaining 8 (score=0.10) were completely unrelated data from different regions that just happened to be published in the same period. 0.10 is just a base score for “exists within date range,” not an actual match.

Score Composition Analysis

The matching score is a sum of 4 factors:

| Factor | Score | Condition |
|---|---|---|
| Base (date range match) | +0.10 | Notice date within ±7 days of the ntfc_sn date |
| Province name match | +0.30 | gosi_no contains the province name (e.g., "Daegu") |
| Dong name match | +0.40 | gosi_title contains the dong name (e.g., "Yeonho") |
| City/district bonus | +0.20 | gosi_title contains the city/district name (e.g., "Suseong-gu") |

Gap in the score distribution: there's a clear gap between relevant notices (0.75) and irrelevant ones (0.10). A match on the province name or the dong name pushes the score to at least 0.40, so anything in the 0.10~0.30 range is noise that merely shares a date window.
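The four-factor sum can be sketched as a plain function. This is a hypothetical reimplementation of calculate_match_score with a simplified signature, not the production code:

```python
def calculate_match_score(gosi_no: str, gosi_title: str,
                          sido_name: str, dong_name: str,
                          sigungu_name: str) -> float:
    """Sum of the four factors from the table above (sketch)."""
    score = 0.10  # base: the candidate already passed the ±7-day date filter
    if sido_name and sido_name in gosi_no:
        score += 0.30  # province name appears in the notice number
    if dong_name and dong_name in gosi_title:
        score += 0.40  # dong name appears in the title
    if sigungu_name and sigungu_name in gosi_title:
        score += 0.20  # city/district bonus
    return round(score, 2)
```

Plain substring checks like these are cheap enough to run over every date-filtered candidate before sorting.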

Solution: Apply min_score=0.3 Threshold

def search_gosi_by_ntfc_sn(
    ntfc_sn: str,
    address: str = None,
    limit: int = 10,
    min_score: float = 0.3    # This one line is key
) -> List[Dict[str, Any]]:

    # ... Query candidates + calculate scores ...

    for row in rows:
        score = calculate_match_score(row, sido_name, dong_name, address)
        if score >= min_score:     # Discard below threshold
            results.append({...})

Results After Application

| Metric | No min_score | min_score=0.3 |
|---|---|---|
| Results returned | 10 | 2 |
| Relevant notices | 2 (score=0.75) | 2 (score=0.75) |
| Noise | 8 (score=0.10) | 0 |
| Precision | 20% | 100% |

Why 0.3: The weakest meaningful match is “province name only” (0.10 + 0.30 = 0.40). 0.3 is slightly below this minimum meaningful match — a conservative threshold that definitely filters out pure noise (0.10) while not missing weak matches. In actual testing, no scores between 0.10 and 0.40 were observed, so any value in the 0.2~0.35 range would have the same effect.

Lesson: Rather than always returning results up to the limit count, it’s much better for user experience to not return results that don’t meet quality criteria. “Nothing is better than wrong.” Especially when AI agents consume these results, noise leads to incorrect inference.

Performance Measurement Results

Response Time by Scenario

| Scenario | Time | Description |
|---|---|---|
| Initial full load | 52.2s | Download (1.5s) + parse (4.2s) + INSERT (46.4s) |
| Incremental (HEAD skip) | 0.32s | Content-Length unchanged → exit immediately |
| Incremental (5 new) | 2.26s | Download (1.5s) + scan (0.1s) + INSERT (0.01s) |
| Incremental (0 new, CL changed) | 2.02s | Download (1.5s) + scan (0.1s) + no new records |

Time Breakdown by Layer

| Step | Time | Notes |
|---|---|---|
| HEAD request | 0.14s | Network latency only |
| Full download (81MB) | 1.5s | Server→server (within AWS) |
| Reverse scan (448K rows) | 0.14s | 0 new → checks only 1 row |
| Full DictReader parse (comparison) | 1.60s | Pre-optimization method |
| csv.reader parse, 5 records | < 0.01s | Negligible |
| execute_batch INSERT, 5 records | < 0.01s | Negligible |
| execute_batch INSERT, 448K records | 46.4s | Initial load only |

Bottleneck Analysis: The bottleneck in initial full load is the 46.4s INSERT (89% of total). However, since initial load happens only once and subsequent syncs need only incremental sync (0.3~2s), this was excluded from optimization targets. Could be improved with COPY FROM, but ROI is low.
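If the initial load ever did need speeding up, a COPY-based path might look like the sketch below. The column list is abbreviated and the cursor call is shown but not exercised; rows_to_copy_buffer is a hypothetical helper:

```python
import csv
import io

def rows_to_copy_buffer(rows):
    """Serialize row tuples into an in-memory CSV buffer for COPY FROM STDIN."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

# With a live psycopg2 cursor (assumed), the bulk load would be:
# cursor.copy_expert(
#     "COPY gosi_info (gosi_no, gosi_date, gosi_title, eum_url) "
#     "FROM STDIN WITH (FORMAT csv)",
#     rows_to_copy_buffer(new_rows),
# )
```

COPY streams rows in one server round trip instead of batched INSERT statements, which is why it typically wins for one-time bulk loads; the trade-off is that it has no per-row ON CONFLICT handling.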

Before/After Comparison (Typical Calls)

| Scenario | Before optimization | After optimization | Improvement |
|---|---|---|---|
| No changes | 52s (full load every time) | 0.3s (HEAD skip) | 173x |
| 5 new records | 52s (full parse + compare all) | 2.3s (reverse scan + incremental) | 23x |

Key Lessons

1. Check HTTP Headers First

Before downloading data, determine “whether you need to download” with a HEAD request first. Any one of Content-Length, ETag, or Last-Modified is sufficient. Public data APIs often don’t provide these headers well, but Content-Length is mostly available.

2. Design Algorithms Based on Data Characteristics

The domain knowledge that “new data is always appended at the end” enabled reverse scanning. Algorithms tailored to data characteristics are far more efficient than general-purpose ones (full scan, hash comparison).

3. Validate “Seemingly Necessary” Features with PoC

The sido/sigungu columns seemed intuitively necessary, but actual PoC results showed we could get the same results without them. Additional columns/indexes come with maintenance costs and data consistency risks. One PoC before writing code can avoid unnecessary complexity.

4. Separate Optimization Layers

Rather than trying to achieve perfect optimization at once, separating by layers allows independent testing and improvement of each layer. A structure where Layer 2 works even if Layer 1 fails is more stable.

5. Code Defensively, But Don’t Sacrifice Performance

We maximized happy path performance while adding safety nets like ON CONFLICT DO NOTHING, fallback on HEAD failure, and date parsing correction (“2020-03-00” → “2020-03-01”).
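The date correction mentioned above can be a tiny normalizer. The function name is hypothetical, and only the day-00 case cited in the article is handled:

```python
import re

def normalize_gosi_date(raw: str) -> str:
    """Fix invalid day-00 dates like '2020-03-00' -> '2020-03-01'.
    Anything that isn't a YYYY-MM-DD string passes through unchanged."""
    m = re.fullmatch(r'(\d{4}-\d{2})-(\d{2})', raw.strip())
    if not m:
        return raw
    year_month, day = m.groups()
    return f'{year_month}-01' if day == '00' else f'{year_month}-{day}'
```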

6. Search Results: Quality Over Quantity

A single threshold (min_score=0.3) improved Precision from 20% to 100%. Especially in pipelines where AI agents consume results, noise leads to incorrect inference. Apply the principle “nothing is better than wrong.”