Building a CSV Diff Tool for API Regression Testing
When you're working with data feed systems, one of the scariest moments is deploying a configuration change and hoping nothing breaks. A field mapping that worked fine in staging might silently corrupt data in production. HTML content that rendered correctly might get stripped or double-encoded.
After one too many "wait, why is this field empty now?" incidents, I decided to build a tool that would catch these regressions before they hit production.
The Requirements
I needed something that could:
- Fetch CSV exports from both production and development environments
- Handle thousands of rows without running out of memory
- Compare field-by-field and tell me exactly what changed
- Run fast enough to be useful in a CI/CD pipeline
The Approach
Python's asyncio and aiohttp made the concurrency part straightforward. Instead of fetching one URL at a time, I could fire off 200 requests simultaneously and wait for them all to complete.
```python
import asyncio
import aiohttp

async def fetch_all(urls: list[str], concurrency: int = 200):
    # A single shared semaphore caps how many requests run at once
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_semaphore(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
```
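Here, `fetch_with_semaphore` is the piece that actually enforces the cap. A minimal sketch of what it looks like, using the same imports as above and assuming each export URL is a plain GET that returns the CSV as text:

```python
async def fetch_with_semaphore(
    session: aiohttp.ClientSession, url: str, semaphore: asyncio.Semaphore
) -> str:
    # Hold a semaphore slot for the whole request, so at most
    # `concurrency` requests are in flight at any one time
    # (sketch: a plain GET is an assumption here)
    async with semaphore:
        async with session.get(url) as response:
            response.raise_for_status()  # surface HTTP errors immediately
            return await response.text()
```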
The tricky part was the diffing logic. CSV cells can contain HTML, which means whitespace normalization gets complicated. I ended up building a normalize function that handles the common edge cases (sketched after this list):
- Stripping leading/trailing whitespace
- Normalizing line endings
- Handling encoded HTML entities
- Dealing with floating-point precision issues in numeric fields
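A simplified sketch of that function. The exact rules here (the whitespace collapsing and the six-significant-digit float rounding in particular) are illustrative choices, not the definitive implementation:

```python
import html
import re

def normalize(value: str) -> str:
    """Normalize a CSV cell so cosmetic differences don't register as changes."""
    # Unify line endings first
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    # Decode HTML entities (&amp;, &#39;) so equivalent encodings compare equal
    value = html.unescape(value)
    # Browsers collapse runs of spaces and tabs anyway, so the diff can too
    # (an extra rule beyond the list above; adjust to taste)
    value = re.sub(r"[ \t]+", " ", value)
    # Strip leading/trailing whitespace
    value = value.strip()
    # If the cell is numeric, render it at fixed precision so float noise
    # (0.30000000000000004 vs 0.3) doesn't show up as a regression
    try:
        return f"{float(value):.6g}"
    except ValueError:
        return value
```

Both environments' exports go through the same function before comparison, so even an imperfect rule cancels out as long as it's applied consistently to both sides.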
The Result
The tool now runs as part of our pre-deployment checklist. When someone makes a change to a feed configuration, they can run the diff against both environments and see exactly what will change before it goes live.
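To make "exactly what will change" concrete: stripped down, the comparison is a field-by-field walk over rows from both environments. A sketch, reusing the normalize function from above and assuming rows pair up on a unique id column (an assumption; use whatever key your feed guarantees is unique):

```python
import csv
import io

def diff_csv(prod_text: str, dev_text: str, key: str = "id"):
    """Yield (row_key, field, prod_value, dev_value) for each changed cell."""
    prod_rows = {row[key]: row for row in csv.DictReader(io.StringIO(prod_text))}
    dev_rows = {row[key]: row for row in csv.DictReader(io.StringIO(dev_text))}
    for row_key, prod_row in prod_rows.items():
        dev_row = dev_rows.get(row_key)
        if dev_row is None:
            # The whole row vanished in dev; flag it and move on
            yield (row_key, "*", "<row present>", "<row missing>")
            continue
        for field, prod_value in prod_row.items():
            dev_value = dev_row.get(field, "")
            if normalize(prod_value or "") != normalize(dev_value or ""):
                yield (row_key, field, prod_value, dev_value)
```

Rows that exist only in dev need a symmetric check, omitted here for brevity.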
It's caught several bugs that would have been painful to debug in production—fields that got renamed, HTML that got double-encoded, and one memorable case where a regex replacement was eating legitimate data.
Lessons Learned
The biggest takeaway: investing time in testing infrastructure pays off quickly. The tool took a weekend to build, but it's saved hours of debugging and prevented at least a few production incidents.
Sometimes the best code you write is the code that tells you when other code is broken.