The short answer
Use an API when one exists and gives you the data you need. Use web scraping when there is no API, the API does not expose what you need, or the API costs more than the data is worth to you.
That rule settles 90% of cases. The rest of this article covers the 10% where the answer is not obvious.
What an API gives you
An API (Application Programming Interface) is an official, structured way to get data from a platform. The platform decides what data to expose, how to expose it, and what it costs. You send a request in a specified format, you get back a response in a specified format.
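That request/response contract is the whole value proposition. A minimal sketch of what consuming a documented response looks like, using an invented endpoint payload (the field names and version string here are hypothetical, not any real platform's API):

```python
import json

# A hypothetical, documented API response: fields, types, and version
# are specified up front, so parsing is mechanical and predictable.
raw = """
{
  "api_version": "2024-01",
  "data": {"company": "Acme Corp", "employees": 412},
  "rate_limit": {"remaining": 98, "reset_seconds": 60}
}
"""

response = json.loads(raw)
print(response["data"]["company"])          # a field the contract guarantees
print(response["rate_limit"]["remaining"])  # documented rate-limit metadata
```

Contrast this with scraping, where the "schema" is whatever HTML the site happens to render today.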
The advantages are real:
- Structured and documented. The response format is defined, versioned, and (usually) stable. You know what you are going to get.
- Permitted by the platform. You are using the service as intended. No gray areas.
- Reliable at scale. Official APIs are built to handle load. Rate limits are documented. Error codes are meaningful.
- Maintained by the provider. When the platform changes its internal structure, the API stays stable (usually). Breaking changes are versioned and announced.
What an API does not give you
The gap between what a platform shows on its website and what it exposes through its API can be enormous. A few examples from real projects:
- LinkedIn has no commercial API for profile enrichment. The Marketing API covers ad campaigns. The Sales Navigator API is locked to enterprise contracts. The public profile data that everyone wants is not available through any official endpoint at a useful scale.
- Instagram's Graph API only shows data for accounts you own or manage. Competitor profiles, hashtag analytics across other accounts, and public engagement data on posts you did not create are all outside the API scope.
- SEC EDGAR has an official API that covers XBRL-tagged financial data beautifully. But it does not cover the narrative sections of a 10-K filing (Management Discussion, Risk Factors, Legal Proceedings) where much of the investment signal lives.
- County assessor websites have no APIs at all. 3,143 US counties, each with its own website, its own search form, and its own data format. The only way to get property records programmatically is to scrape the individual county sites.
In each of these cases, the API either does not exist, does not cover the data you need, or requires an enterprise contract that costs more than the project is worth.
When scraping fills the gap
Web scraping works from the same data the website shows to a human visitor. If you can see it in a browser, a scraper can extract it. This means the data coverage is always equal to what the website displays, not what the platform has decided to expose through an API.
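To make that concrete, here is a minimal extraction sketch using only the standard library. The HTML and class names are invented; a real site's markup would differ, but the principle is the same: whatever the browser renders, the parser can pull out.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, the same HTML a browser would render.
SAMPLE = """
<ul>
  <li class="listing"><span class="title">123 Main St</span></li>
  <li class="listing"><span class="title">456 Oak Ave</span></li>
</ul>
"""

class TitleExtractor(HTMLParser):
    """Collects the text inside <span class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

parser = TitleExtractor()
parser.feed(SAMPLE)
print(parser.titles)  # ['123 Main St', '456 Oak Ave']
```

Note that the extractor is coupled to the markup: rename that `class` attribute and the scraper silently returns nothing, which is exactly the fragility discussed below.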
The tradeoffs are also real:
- Fragile. A scraper depends on the website's HTML structure, which can change without notice.
- Rate-limited in practice, not by contract. You cannot hammer a website the way you can hammer an API. You need to pace requests to avoid bans.
- Legally gray in some cases. Scraping publicly accessible data is broadly legal in the US (the hiQ v. LinkedIn rulings are the usual reference point), but many platforms have terms of service that restrict automated access. The practical risk varies by platform and use case.
- Maintenance overhead. When the target site changes, the scraper breaks and someone needs to fix it.
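The pacing point deserves a sketch. A common pattern is a fixed minimum delay plus random jitter between requests; the numbers below are illustrative defaults, and `fetch` is a stand-in for whatever request function you actually use:

```python
import random
import time

def paced_fetch(urls, fetch, min_delay=2.0, jitter=1.0):
    """Fetch each URL with a randomized delay between requests.

    min_delay and jitter are illustrative; tune them to the target
    site's tolerance. The pacing logic is the point, not the values.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:  # no delay before the first request
            time.sleep(min_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results

# Usage with a stand-in fetch function and tiny delays for demonstration:
pages = paced_fetch(
    ["/page/1", "/page/2"],
    fetch=lambda u: f"<html>{u}</html>",
    min_delay=0.01,
    jitter=0.01,
)
print(pages)
```

Randomized jitter matters because perfectly regular intervals are one of the easiest bot signatures for a site to detect.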
The decision framework
Here is how I think about it for client projects:
Start with the API
If a platform offers an API that covers your data needs at a price that makes sense, use it. Do not scrape what you can get officially. It is more reliable, more maintainable, and less legally ambiguous.
Scrape when the API falls short
If the API does not exist, does not cover the fields you need, rate-limits below your volume needs, or prices above your budget, scraping is the practical alternative. This is the majority of projects I work on.
Combine both
Some of the most robust pipelines use both. Pull what you can from the official API and supplement with scraped data for the fields or coverage the API does not provide. For SEC EDGAR projects, I use the official XBRL API for structured financial data and custom scrapers for narrative sections. Both data sources feed into the same normalized output.
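The merge step in such a pipeline is usually simple: join the two sources on a shared identifier. A sketch with invented field names (the CIK key mirrors the EDGAR example, but the records and values here are illustrative, not real filings data):

```python
def normalize(api_records, scraped_records, key="cik"):
    """Merge API-sourced and scraped records into one list, joined on `key`.

    Later sources overwrite earlier ones on field collisions; a real
    pipeline would be more deliberate about precedence.
    """
    merged = {}
    for rec in api_records:
        merged.setdefault(rec[key], {}).update(rec)
    for rec in scraped_records:
        merged.setdefault(rec[key], {}).update(rec)
    return list(merged.values())

# Structured numbers from the API, narrative text from the scraper:
api_data = [{"cik": "0000320193", "revenue": 383_285_000_000}]
scraped = [{"cik": "0000320193", "risk_factors": "Competition, supply chain..."}]
print(normalize(api_data, scraped))
```

The output record carries both the API's structured fields and the scraper's narrative fields, which is the whole point of combining the two.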
Avoid scraping when the data changes too fast
Real-time data (stock prices, live inventory, auction bids) is rarely a good fit for scraping because the latency of a scrape cycle means the data is already stale when it arrives. For these use cases, a streaming API or WebSocket connection is the right tool. If neither exists, you need a very fast scraping loop with the infrastructure to support it, which is expensive.
Cost comparison
| Factor | Official API | Web Scraping |
|---|---|---|
| Setup cost | Low (documentation exists) | Medium to high (reverse engineering) |
| Per-record cost | Usually explicit (credits, seats) | Infrastructure cost (proxies, servers) |
| Maintenance | Low (provider maintains) | Medium to high (you maintain) |
| Data coverage | Limited to what provider exposes | Equal to what website displays |
| Reliability | High (SLAs, versioning) | Medium (depends on site stability) |
| Legal clarity | Clear (permitted use) | Varies (public data is broadly OK) |
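One way to turn that table into a decision is a breakeven volume: at what monthly record count does scraping's fixed infrastructure cost pay for itself against the API's per-record price? All the numbers below are hypothetical; plug in your own quotes.

```python
def breakeven_records(api_cost_per_record, scrape_monthly_fixed,
                      scrape_cost_per_record):
    """Records per month at which scraping becomes cheaper than the API.

    Returns None when the API is cheaper at any volume.
    """
    saving_per_record = api_cost_per_record - scrape_cost_per_record
    if saving_per_record <= 0:
        return None
    return scrape_monthly_fixed / saving_per_record

# Example: $0.10/record API vs $300/mo for proxies and servers
# at roughly $0.01/record in marginal cost.
print(breakeven_records(0.10, 300.0, 0.01))  # about 3,333 records/month
```

Below the breakeven volume, pay for the API and skip the maintenance burden; above it, scraping starts earning its keep, provided the data coverage argument does not already force your hand.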
A real example: county property records
One of the clearest cases for scraping over API is US county property data. ATTOM Data and CoreLogic (now Cotality) offer property APIs, but they cost thousands per month and their data is aggregated, not live county records. They are designed for mortgage servicers and insurance companies, not for a real estate investor who needs fresh tax-delinquent rolls from 10 specific counties.
The county websites publish this data for free, for public access. But there is no API. You have to search, paginate, and extract the HTML. Across 200+ projects through sidb.work, the county records pipeline is one of the most common recurring engagements precisely because the API alternative is either too expensive or does not exist.
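The search-paginate-extract loop those sites force on you looks roughly like this. `fetch_page` and `extract_rows` are stand-ins for site-specific code; every county needs its own versions of both:

```python
def scrape_all_pages(fetch_page, extract_rows, max_pages=500):
    """Walk a paginated result set until an empty page or the cap.

    max_pages is a safety valve against sites whose pagination never
    terminates cleanly.
    """
    rows, page = [], 1
    while page <= max_pages:
        html = fetch_page(page)
        batch = extract_rows(html)
        if not batch:  # an empty page means we've walked off the end
            break
        rows.extend(batch)
        page += 1
    return rows

# Stand-in "site" with 3 pages of 2 records each:
fake_fetch = lambda p: [f"parcel-{p}-{i}" for i in range(2)] if p <= 3 else []
records = scrape_all_pages(fake_fetch, extract_rows=lambda h: h)
print(records)
```

The loop itself is trivial; the recurring engineering cost is in the per-county `fetch_page` and `extract_rows` implementations, which is why 10 counties means 10 small scrapers, not one.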
What I would tell you in a scoping call
If you come to me with a data project, the first thing I check is whether an official API covers your need. If it does, I will tell you to use it and save your money. If it does not, I will tell you what scraping the source actually involves, what it costs, and what the maintenance looks like over time.
The worst outcome is spending weeks building a scraper for data you could have gotten through an API call. The second worst outcome is paying enterprise API prices for data you could get from a public website for a fraction of the cost. Knowing which path fits your situation is the first step.