Building a CSV Diff Tool for API Regression Testing
When you're working with data feed systems, one of the scariest moments is deploying a configuration change and hoping nothing breaks. A field mapping that worked fine in staging might silently corrupt data in production. HTML content that rendered correctly might get stripped or double-encoded.
After one too many "wait, why is this field empty now?" incidents, I decided to build a tool that would catch these regressions before they hit production.
The Requirements
I needed something that could:
- Fetch CSV exports from both production and development environments
- Handle thousands of rows without running out of memory
- Compare field-by-field and tell me exactly what changed
- Run fast enough to be useful in a CI/CD pipeline
The Approach
Python's asyncio and aiohttp made the concurrency part straightforward. Instead of fetching one URL at a time, I could fire off 200 requests simultaneously and wait for them all to complete.
```python
import asyncio
import aiohttp

async def fetch_all(urls: list[str], concurrency: int = 200):
    # A single shared semaphore caps how many requests run at once
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_semaphore(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
```
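Here, `fetch_with_semaphore` is the piece that actually enforces the cap. A minimal sketch of what it looks like, using the same imports as above and assuming each export URL is a plain GET that returns the CSV as text:

```python
async def fetch_with_semaphore(
    session: aiohttp.ClientSession, url: str, semaphore: asyncio.Semaphore
) -> str:
    # Hold a semaphore slot for the whole request, so at most
    # `concurrency` requests are in flight at any one time
    # (sketch: a plain GET is an assumption here)
    async with semaphore:
        async with session.get(url) as response:
            response.raise_for_status()  # surface HTTP errors immediately
            return await response.text()
```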
The tricky part was the diffing logic. CSV cells can contain HTML, which means whitespace normalization gets complicated. I ended up building a normalize function that handles the common edge cases (sketched after this list):
- Stripping leading/trailing whitespace
- Normalizing line endings
- Handling encoded HTML entities
- Dealing with floating-point precision issues in numeric fields
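A simplified sketch of that function. The exact rules here (the whitespace collapsing and the six-significant-digit float rounding in particular) are illustrative choices, not the definitive implementation:

```python
import html
import re

def normalize(value: str) -> str:
    """Normalize a CSV cell so cosmetic differences don't register as changes."""
    # Unify line endings first
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    # Decode HTML entities (&amp;, &#39;) so equivalent encodings compare equal
    value = html.unescape(value)
    # Browsers collapse runs of spaces and tabs anyway, so the diff can too
    # (an extra rule beyond the list above; adjust to taste)
    value = re.sub(r"[ \t]+", " ", value)
    # Strip leading/trailing whitespace
    value = value.strip()
    # If the cell is numeric, render it at fixed precision so float noise
    # (0.30000000000000004 vs 0.3) doesn't show up as a regression
    try:
        return f"{float(value):.6g}"
    except ValueError:
        return value
```

Both environments' exports go through the same function before comparison, so even an imperfect rule cancels out as long as it's applied consistently to both sides.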
The Result
The tool now runs as part of our pre-deployment checklist. When someone makes a change to a feed configuration, they can run the diff against both environments and see exactly what will change before it goes live.
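To make "exactly what will change" concrete: stripped down, the comparison is a field-by-field walk over rows from both environments. A sketch, reusing the normalize function from above and assuming rows pair up on a unique id column (an assumption; use whatever key your feed guarantees is unique):

```python
import csv
import io

def diff_csv(prod_text: str, dev_text: str, key: str = "id"):
    """Yield (row_key, field, prod_value, dev_value) for each changed cell."""
    prod_rows = {row[key]: row for row in csv.DictReader(io.StringIO(prod_text))}
    dev_rows = {row[key]: row for row in csv.DictReader(io.StringIO(dev_text))}
    for row_key, prod_row in prod_rows.items():
        dev_row = dev_rows.get(row_key)
        if dev_row is None:
            # The whole row vanished in dev; flag it and move on
            yield (row_key, "*", "<row present>", "<row missing>")
            continue
        for field, prod_value in prod_row.items():
            dev_value = dev_row.get(field, "")
            if normalize(prod_value or "") != normalize(dev_value or ""):
                yield (row_key, field, prod_value, dev_value)
```

Rows that exist only in dev need a symmetric check, omitted here for brevity.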
It's caught several bugs that would have been painful to debug in production—fields that got renamed, HTML that got double-encoded, and one memorable case where a regex replacement was eating legitimate data.
Lessons Learned
The biggest takeaway: investing time in testing infrastructure pays off quickly. The tool took a weekend to build, but it's saved hours of debugging and prevented at least a few production incidents.
Sometimes the best code you write is the code that tells you when other code is broken.