The ask
A proptech startup reached out in mid-2024. They needed unified property tax data across a large number of US counties for a product they were building. Their users (real estate investors) needed fresh assessor records, tax status, ownership history, and encumbrances for the counties they invested in.
The challenge was not any single county. Any developer can write a scraper for one county assessor website. The challenge was doing it across 932 counties, each with a different website, a different search interface, a different data schema, and a different idea of what "property records" means.
No API exists for this. The enterprise data providers cover it at enterprise pricing. The investor platforms package it inside a subscription UI. This client wanted the raw data, in their format, flowing into their own system, on a schedule they controlled.
Why county websites are harder than they look
From the outside, county assessor websites seem simple. Search by address or parcel number, get back property details. No login required, no anti-bot protection, just a public service.
The complexity is in the heterogeneity.
Different underlying platforms. Across the US, county assessor websites run on a small number of software platforms, but each platform works differently. Some render results through server-side HTML. Others require JavaScript execution. Some use session tokens with per-request validation. Others expose search results as paginated HTML tables with no consistent pagination pattern.
Different schemas. One county calls it "Assessed Value." Another calls it "Total Appraised." A third calls it "Land + Improvement Total" and requires you to add two fields together. One county includes mailing address in the property record. Another puts it in a separate "owner" tab that requires a second request. Normalizing these into a consistent schema requires per-county mapping rules.
Different access patterns. Some counties return search results immediately. Others require selecting a county, then a municipality, then entering a name or parcel number, then clicking through a CAPTCHA, then paginating results 10 at a time. A scraper that works for Harris County, Texas will not work for Cook County, Illinois without significant modification.
Frequent changes. County websites update when the county migrates to a new vendor, redesigns the site, or adds new required fields to the search form. This happens more often than you would expect. Across 932 counties, I see 5 to 10 site changes per month that require scraper updates.
How the pipeline works
I am not going to detail the specific technical approach (that is proprietary and hard-won), but the high-level architecture is worth understanding because it explains why this is structured as a managed service rather than a one-time build.
Classification first
The first step was cataloging every target county website and classifying it by the software platform it runs on. This reduced 932 unique websites into a much smaller number of platform groups. Each group gets a specialized parser that handles the common patterns for that platform type.
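The classification step can be sketched as a fingerprinting pass: fetch each county's search page once and look for markers that identify the underlying vendor platform. The platform names and markers below are invented for illustration; the real fingerprints are part of the proprietary build.

```python
# Illustrative sketch: classify county sites into platform groups by
# matching fingerprint markers in their HTML. Platform names and marker
# strings are hypothetical placeholders, not real vendor identifiers.
PLATFORM_FINGERPRINTS = {
    "vendor_a": ["vendorA-search.js", "va-results-grid"],
    "vendor_b": ["VendorBPortal", "__VIEWSTATE"],
}

def classify(html: str) -> str:
    """Return the first platform group whose markers all appear in the page."""
    for platform, markers in PLATFORM_FINGERPRINTS.items():
        if all(marker in html for marker in markers):
            return platform
    # No group matched: this county needs a handwritten per-county parser.
    return "custom"
```

Once every county carries a platform label, building one parser per group covers the bulk of the 932 sites, and only the "custom" bucket needs individual attention.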
Per-county overrides
Within each platform group, individual counties diverge. A county might use the same underlying software as 50 others but have a custom search form, a different URL structure, or an additional data field. These get per-county configuration overrides layered on top of the group parser.
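The override layering amounts to a config merge: group defaults first, county-specific keys on top. A minimal sketch, with made-up config keys and county IDs:

```python
# Hypothetical group defaults and county overrides; key names are illustrative.
GROUP_DEFAULTS = {
    "vendor_a": {"search_path": "/search", "page_size": 25, "needs_js": False},
}

COUNTY_OVERRIDES = {
    "tx-harris": {"search_path": "/RealProperty/Search", "page_size": 10},
}

def config_for(group: str, county: str) -> dict:
    # County-specific keys win; anything unspecified falls back to the group.
    return {**GROUP_DEFAULTS[group], **COUNTY_OVERRIDES.get(county, {})}
```

The practical benefit is that a platform-wide fix lands in the group defaults once, while a county quirk stays isolated in its own override and cannot break the other counties in the group.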
Normalization layer
Every county's raw output gets mapped into a unified schema. For this client, that schema has 42 standardized fields covering parcel identification, ownership, assessed and market values, tax amounts and status, transfer history, and lien records. The mapping rules are defined per county and tested against known records before going live.
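The per-county mapping rules can be sketched as small transforms keyed by normalized field name. The source field names below echo the examples earlier in the post ("Assessed Value", "Total Appraised", land plus improvement); the rule format itself is an assumption, not the actual implementation.

```python
# Illustrative per-county mapping rules: each rule takes a raw record and
# produces one normalized field. County IDs and rule syntax are hypothetical.
MAPPINGS = {
    "county_a": {"assessed_value": lambda r: r["Assessed Value"]},
    "county_b": {"assessed_value": lambda r: r["Total Appraised"]},
    # Some counties require combining fields to get the normalized value.
    "county_c": {"assessed_value": lambda r: r["Land Value"] + r["Improvement Value"]},
}

def normalize(county: str, raw: dict) -> dict:
    out = {}
    for field, rule in MAPPINGS[county].items():
        try:
            out[field] = rule(raw)
        except KeyError:
            # A missing source field surfaces as null, which is exactly the
            # signal the schema-drift monitoring watches for.
            out[field] = None
    return out
```

Testing each county's rules against known records before go-live catches the mappings that silently return the wrong field.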
Scheduling and monitoring
The pipeline runs on a configurable schedule (daily for this client) with monitoring that catches failed counties on the first attempt. If a county site changes and the scraper breaks, I get alerted immediately, not after a week of missed data.
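The monitoring check after each run can be as simple as comparing per-county record counts against a floor. A sketch under that assumption; a real check would compare against each county's historical baseline rather than a fixed floor, and the print is a stand-in for a pager or email hook.

```python
import datetime

def run_report(results: dict[str, int], expected_floor: int = 1) -> list[str]:
    """Return the counties whose record count fell below the expected floor.

    `results` maps county id -> records extracted this run.
    """
    failed = [county for county, count in sorted(results.items())
              if count < expected_floor]
    if failed:
        # Stand-in for the real alerting hook (pager, email, Slack, etc.).
        print(f"[{datetime.date.today()}] ALERT: "
              f"{len(failed)} counties failed: {failed}")
    return failed
```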
The numbers
After building and maintaining this pipeline for over a year:
- 932 counties covered
- 4.2 million property records extracted
- 42 normalized fields per record
- 98.1% successful extraction rate across all counties per run
- Monthly delivery via PostgreSQL with change detection reports (CSV)
The 1.9% failure rate comes from counties that are temporarily down, counties undergoing site migrations, and a small number of counties with access restrictions that prevent reliable automated extraction. These are flagged in every delivery so the client knows exactly which counties produced data and which did not.
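The change detection reports mentioned above can be sketched as a field-level diff between two snapshots keyed by parcel ID. This is an illustrative reconstruction, not the actual delivery code:

```python
import csv
import io

def diff_snapshots(prev: dict[str, dict], curr: dict[str, dict]) -> str:
    """Emit a CSV of field-level changes between two runs, keyed by parcel id."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["parcel_id", "field", "old", "new"])
    for parcel_id, row in sorted(curr.items()):
        before = prev.get(parcel_id, {})
        for field, new_val in row.items():
            old_val = before.get(field)
            if old_val != new_val:
                writer.writerow([parcel_id, field, old_val, new_val])
    return buf.getvalue()
```

For an investor-facing product, the diff is often more valuable than the full dump: an ownership change or a new lien is a signal; 4.2 million unchanged records are not.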
What broke along the way
No scraping project runs perfectly. Here is what went wrong and how it got fixed:
The vendor migration problem
Three months into the project, one of the most common county software platforms pushed a major update that changed its URL structure, session handling, and result rendering. Sixty-two counties broke overnight. The fix took three days of full-time work to update the group parser and test against every affected county.
This is the single biggest argument for structured maintenance rather than a one-time build. If this pipeline had been a script sitting on someone's server with no monitoring, the data would have silently stopped flowing for 62 counties and nobody would have noticed for weeks.
The CAPTCHA counties
About 40 of the 932 counties have CAPTCHAs on their search pages. For most of these, the CAPTCHAs are standard and solvable within the provider's terms. For a handful, the CAPTCHAs are aggressive enough that extraction rates drop below an acceptable threshold. These counties are marked in the delivery metadata and the client decides whether the lower extraction rate is acceptable for their use case.
The schema drift problem
Counties occasionally change what fields they display. A county that previously showed "Last Sale Price" might remove it or rename it in a site update. The normalization layer catches these changes because the expected field is suddenly null or mapped to a different source field. But catching it does not mean fixing it is automatic. Each schema drift requires investigation: did the field move? Was it removed? Is it now behind a different tab?
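The detection side of schema drift can be sketched as a null-rate comparison: if a normalized field that was historically populated suddenly comes back mostly null, something moved on the source site. The threshold and rate inputs here are assumptions for illustration.

```python
def drift_alerts(null_rates: dict[str, float],
                 baseline: dict[str, float],
                 jump: float = 0.5) -> list[str]:
    """Flag fields whose null rate jumped relative to the historical baseline.

    A sudden jump usually means the source site renamed or moved the field,
    not that the data genuinely disappeared.
    """
    return [field for field, rate in sorted(null_rates.items())
            if rate - baseline.get(field, 0.0) >= jump]
```

The alert only says *that* a field drifted; as the text notes, figuring out *why* (moved, removed, behind a different tab) is still manual investigation.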
Over the course of a year, I handled approximately 30 schema drift events across 932 counties. Each one took 1 to 4 hours to resolve.
What I learned
Heterogeneity is the real challenge
The technical scraping is not the hard part. The hard part is the variation across hundreds of sources. Any one county is a simple extraction. Nine hundred counties is a systems problem.
Monitoring pays for itself
The monitoring layer is the single most valuable part of the pipeline. It catches breakage within one run cycle, alerts me before the client notices, and gives me the context to fix it quickly. Without monitoring, this pipeline would deliver stale or incomplete data regularly and nobody would know.
Maintenance is the product
The initial build took about 6 weeks of full-time work. The ongoing maintenance averages 15 to 20 hours per month. That maintenance is what keeps the data flowing reliably. Without it, the pipeline would degrade to maybe 60% extraction rate within 6 months as counties update their sites.
The service is not "I built you a scraper." The service is "your data keeps flowing and I handle everything that tries to stop it."
Start with what the client needs, not what the counties have
Early in the project I tried to extract every field every county exposes. That was a mistake. The normalization effort scales with the number of fields, and many fields are only relevant to title companies or tax assessors, not to real estate investors. Scoping the schema to exactly what the client needs (42 fields, not 120) reduced the build time by at least 40%.
Would I do it again?
Yes. This is one of the most interesting projects I have worked on because it combines real technical challenge (the heterogeneity problem) with real business value (the data directly powers a product people pay for). The pipeline has been running for over a year now and the client has not had to think about county websites once since we launched.
If you need county property data for your own use case, whether it is 5 counties or 500, the infrastructure and approach are the same. The per-county effort decreases as coverage grows because most counties fall into platform groups that are already handled. Only the normalization rules and the monitoring scope change.
The county records data service page has more details on pricing, output schema, and delivery options.