BUSINESS INTELLIGENCE

SEC EDGAR data scraping service.

Custom filing pipelines, narrative section extraction, and real-time 8-K delivery for quant funds and research firms.

The problem.

WHY THIS IS HARDER THAN IT LOOKS

Quant hedge funds, equity research firms, ESG analysts, and compliance teams all use SEC EDGAR data, and they all eventually hit the same limit. The official SEC API at data.sec.gov covers structured financial statements through XBRL-tagged line items and filer metadata. It serves its purpose well for standardized financial metrics. But it does not cover the narrative sections of a 10-K where a significant portion of the investment signal lives: Management's Discussion and Analysis, Risk Factors, Legal Proceedings, Critical Accounting Policies, and the footnotes that contextualize the numbers. These narrative sections are where companies disclose emerging risks, strategy shifts, litigation exposure, and accounting judgment calls that the structured data does not capture.

The enterprise data terminals and research platforms serve the high end of this market well and price accordingly, often running into five or six figures per seat per year. Generic EDGAR API wrappers cover the easy structured data at low cost but struggle with custom schemas, extension taxonomies, and the narrative text that requires actual parsing.

The gap is a managed service that handles custom extraction questions case by case: narrative sections parsed into structured fields, XBRL normalized across extension taxonomies, real-time 8-K delivery filtered to a watchlist, or a bespoke pipeline feeding directly into a fund's internal research system. This is not a wrapper around a public API. It is a custom pipeline that handles the parsing, normalization, and delivery that the official endpoints do not cover.

Across 200+ projects through sidb.work, the SEC EDGAR work has supported one running hedge fund client with a 1,200-ticker watchlist, 2,000+ filings processed daily, a 3.2-minute average latency from SEC publication to structured data delivery, and 99.4% parse accuracy across 180+ normalized financial metrics. The same approach applies to any filing type, any watchlist size, and any delivery target.

For firms that need the data the official structured endpoints do not cover, in the format their systems already consume, at a latency that matters for their workflow, this is what I build. You tell me the filings, the fields, the tickers, and the delivery targets. I build the pipeline.

Is this right for you?

GOOD FIT IF ANY OF THESE SOUND LIKE YOU

You run a quant fund or research firm and need filing data faster or in a different shape than generic API wrappers provide

You need narrative section extraction (MD&A, Risk Factors, Legal Proceedings) that XBRL does not cover

You want real-time 8-K pipelines filtered to a specific watchlist, with sub-5-minute latency

You have custom research questions that require bespoke parsing, not a standard schema

What you receive.

EXACT FIELDS, DELIVERED IN YOUR FORMAT

accession_number (string): SEC-assigned unique identifier for the filing. Primary key across the system.
filer_cik (string): Central Index Key of the filing entity, zero-padded to 10 digits.
ticker (string): Primary trading symbol for the filer, resolved from the CIK.
form_type (enum): Filing form type: 10-K, 10-Q, 8-K, DEF 14A, S-1, and others supported on request.
filing_date (date): Date the document was filed with the SEC.
period_of_report (date): Fiscal period end date covered by the filing.
xbrl_metrics (object): Normalized financial metrics parsed from XBRL, mapped to a consistent schema across filers.
risk_factors_extracted (array): Risk Factors section parsed into individual risk items with classification tags.
mdna_summary (string): Management's Discussion and Analysis section extracted as structured text with headings preserved.
legal_proceedings (array): Legal Proceedings section parsed into individual case summaries.
segment_breakdown (object): Segment-level financial data with unit conversion and scale factor normalization.
publication_latency_seconds (number): Time from SEC publication to delivery to your system, for SLA tracking.
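
A minimal sketch of how a delivered record could be modeled on the consumer side. Field names mirror the schema above; the dataclass and its validation helper are illustrative, not part of the delivered payload:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FilingRecord:
    """One delivered filing, mirroring the field schema above."""
    accession_number: str          # primary key, e.g. "0000789019-26-000042"
    filer_cik: str                 # zero-padded to 10 digits
    ticker: str
    form_type: str                 # "10-K", "10-Q", "8-K", ...
    filing_date: date
    period_of_report: date
    xbrl_metrics: dict = field(default_factory=dict)
    risk_factors_extracted: list = field(default_factory=list)
    publication_latency_seconds: float = 0.0

    def __post_init__(self):
        # CIKs are delivered zero-padded to 10 digits; enforce that here
        if len(self.filer_cik) != 10 or not self.filer_cik.isdigit():
            raise ValueError(f"malformed CIK: {self.filer_cik!r}")

rec = FilingRecord(
    accession_number="0000789019-26-000042",
    filer_cik="0000789019",
    ticker="MSFT",
    form_type="10-K",
    filing_date=date(2026, 3, 15),
    period_of_report=date(2025, 12, 31),
)
```

Typing the record at the boundary makes malformed deliveries fail loudly instead of silently corrupting downstream joins on the CIK.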

Sample record.

sec-edgar.sample.json
{
  "accession_number": "0000789019-26-000042",
  "filer_cik": "0000789019",
  "ticker": "MSFT",
  "form_type": "10-K",
  "filing_date": "2026-03-15",
  "period_of_report": "2025-12-31",
  "xbrl_metrics": {
    "revenue_usd": 245122000000,
    "operating_income_usd": 109433000000,
    "net_income_usd": 88136000000
  },
  "risk_factors_extracted": ["Competition", "Cybersecurity", "Regulatory", "AI Model Liability"],
  "segment_breakdown": {
    "Productivity and Business Processes": { "revenue_usd": 74021000000 },
    "Intelligent Cloud": { "revenue_usd": 106362000000 },
    "More Personal Computing": { "revenue_usd": 64739000000 }
  },
  "publication_latency_seconds": 192,
  "delivered_at": "2026-03-15T22:03:12Z"
}

Straightforward pricing.

SCALE DETERMINES PRICE · NO HIDDEN FEES

Filing set extraction

from $199

One-time extraction of a specific filing set. Delivered in 3 to 7 days.

  • Up to 1,000 filings
  • Custom field schema
  • XBRL + narrative sections
  • CSV, JSON, or Parquet
Get a quote →

Live pipeline

from $499/mo

Real-time processing of new filings on a configured watchlist.

  • Watchlist of up to 500 tickers
  • All form types (10-K, 10-Q, 8-K, etc.)
  • Target latency under 5 minutes
  • Webhook + database delivery
Get a quote →

Research partnership

Custom

Larger watchlists, custom research questions, and bespoke schemas.

  • Full ticker universe
  • Narrative topic extraction
  • Per-engagement scoping
  • Research call included
Get a quote →

Frequently asked questions.

EVERYTHING YOU NEED TO KNOW

Which filing types do you support?

All of the common ones: 10-K, 10-Q, 8-K, DEF 14A, S-1, S-4, and any form type available through EDGAR on request. The supporting case study covers 2,000+ filings processed per day across 1,200 watchlist tickers with a 3.2-minute average latency from publication to delivery.

Can you extract narrative sections like Risk Factors and MD&A?

Yes. Risk Factors, MD&A, Legal Proceedings, and Critical Accounting Policies are parsed into structured fields with headings preserved. For Risk Factors specifically I can classify individual risk items into categories you define (cybersecurity, regulatory, competition, etc.) and track how the language evolves across years for the same filer.
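
As a rough illustration of category tagging, here is a keyword-based classifier over parsed risk items. The category names and keyword lists are placeholders; in practice the taxonomy comes from the client during scoping:

```python
# Illustrative category -> keyword mapping; real categories are
# defined per engagement, not hard-coded like this.
RISK_CATEGORIES = {
    "cybersecurity": ["cyber", "data breach", "ransomware"],
    "regulatory": ["regulation", "compliance", "antitrust"],
    "competition": ["competitor", "market share", "pricing pressure"],
}

def classify_risk(item_text: str) -> list[str]:
    """Tag one parsed risk item with every matching category."""
    text = item_text.lower()
    tags = [cat for cat, keywords in RISK_CATEGORIES.items()
            if any(kw in text for kw in keywords)]
    return tags or ["uncategorized"]
```

Running the same classifier over the same filer's 10-Ks year over year is what makes the language-evolution tracking possible.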

How fast is real-time delivery?

The target for the Live Pipeline tier is under 5 minutes from SEC publication to your system. One running client pipeline averages 3.2 minutes end-to-end with 99.4% parse accuracy across 180+ normalized metrics. Delivery is push-based via webhook or direct database write.
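
A minimal consumer-side sketch of SLA tracking built on the publication_latency_seconds field. The 300-second threshold mirrors the sub-5-minute target; the function names are illustrative:

```python
from datetime import datetime, timezone

SLA_SECONDS = 300  # Live Pipeline target: under 5 minutes

def latency_seconds(published_at: str, delivered_at: str) -> float:
    """Seconds from SEC publication to delivery, for SLA tracking."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    pub = datetime.strptime(published_at, fmt).replace(tzinfo=timezone.utc)
    dlv = datetime.strptime(delivered_at, fmt).replace(tzinfo=timezone.utc)
    return (dlv - pub).total_seconds()

def met_sla(record: dict) -> bool:
    """True if a delivered record met the configured latency SLA."""
    return record["publication_latency_seconds"] <= SLA_SECONDS
```

Logging both numbers per filing gives you an audit trail for the SLA rather than a single averaged claim.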

How do you handle custom XBRL extension taxonomies?

Custom XBRL extensions get resolved against the base US-GAAP or IFRS taxonomy at parse time, with explicit mapping decisions surfaced in the output so your analysts can review anything the mapper was not certain about. For engagements with specific taxonomy concerns this is scoped during the initial call.
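
A simplified sketch of what "surfacing mapping decisions" can look like. The extension tags and their target concepts below are hypothetical examples; real mapping tables are built per engagement and per filer:

```python
# Base concepts pass through untouched; extension tags resolve via a
# per-engagement mapping table with a confidence label.
BASE_TAXONOMY = {"us-gaap:Revenues", "us-gaap:OperatingIncomeLoss"}
EXTENSION_MAP = {
    # hypothetical extension tags, for illustration only
    "msft:CloudRevenue": ("us-gaap:Revenues", "high"),
    "acme:AdjustedOperatingIncome": ("us-gaap:OperatingIncomeLoss", "review"),
}

def resolve_tag(tag: str) -> dict:
    """Resolve an XBRL tag, flagging uncertain mappings for analyst review."""
    if tag in BASE_TAXONOMY:
        return {"tag": tag, "mapped_to": tag, "needs_review": False}
    base, confidence = EXTENSION_MAP.get(tag, (None, "unmapped"))
    return {"tag": tag, "mapped_to": base, "needs_review": confidence != "high"}
```

Anything with needs_review set lands in the output with the original tag preserved, so nothing gets silently renamed.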

What delivery formats do you support?

Six formats cover most clients: CSV or JSON to S3 or Google Cloud Storage, Parquet files for analytical workloads, direct writes to PostgreSQL or BigQuery, and webhook pushes to internal research systems. For quant clients the most common integration is webhook to a research engine plus a Parquet archive.
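
For the flat-file formats, serialization is straightforward. A stdlib-only sketch of the same records going to CSV and JSON Lines (the field subset here is illustrative):

```python
import csv
import io
import json

records = [
    {"accession_number": "0000789019-26-000042", "ticker": "MSFT",
     "form_type": "10-K", "revenue_usd": 245_122_000_000},
]

def to_csv(rows: list[dict]) -> str:
    """Serialize records to CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_jsonl(rows: list[dict]) -> str:
    """Serialize records to JSON Lines, one filing per line."""
    return "\n".join(json.dumps(r) for r in rows)
```

Parquet and direct database writes follow the same record shape; only the sink changes.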

Do you handle amended and restated filings?

Yes. Amended filings (10-K/A, 10-Q/A) and restated figures are tracked as diffs against the original filing, so downstream systems can see exactly what changed and when. This is important for historical analysis and risk modeling.
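
The diff itself is conceptually simple. A sketch of field-level diffing between an original filing's metrics and its /A amendment; the figures below are made up for illustration:

```python
def diff_metrics(original: dict, amended: dict) -> dict:
    """Field-level diff between an original filing and its /A amendment."""
    changes = {}
    for key in original.keys() | amended.keys():
        before, after = original.get(key), amended.get(key)
        if before != after:
            changes[key] = {"before": before, "after": after}
    return changes

# hypothetical restatement: net income revised down in the amendment
orig = {"revenue_usd": 245_122_000_000, "net_income_usd": 88_136_000_000}
amend = {"revenue_usd": 245_122_000_000, "net_income_usd": 87_900_000_000}
```

Downstream systems receive only the changed keys with before/after values, which keeps restatement handling explicit in historical backtests.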

Can you normalize the same metrics across many companies?

Yes. For research engagements I build cross-filer datasets that normalize the same metrics across all companies in a given sector, SIC code, or custom peer group. This is the most common use case for the Research Partnership tier and is how many quant clients build factor models or ESG scoring systems from raw filing data.

Can you search for specific topics across narrative sections?

Yes. For example, if you want every mention of 'supply chain risk' or 'AI regulation' across all 10-Ks filed by a peer group in the last 3 years, I build a topic extraction pipeline that searches the narrative sections and returns structured results with filing context, section location, and surrounding text. For clients building ESG models this is often the primary engagement: tracking how disclosure language around climate risk, workforce diversity, or board governance evolves across reporting periods. This is scoped as a Research Partnership engagement because each extraction question is configured differently.
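
At its simplest, a topic search over an extracted narrative section looks like the sketch below: each hit comes back with its offset and surrounding text. The sample sentence is invented, and real pipelines layer synonym expansion and section metadata on top of this:

```python
import re

def find_topic_mentions(section_text: str, phrases: list[str]) -> list[dict]:
    """Return each phrase hit with roughly 40 chars of surrounding context."""
    hits = []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), section_text, re.IGNORECASE):
            start = max(0, m.start() - 40)
            hits.append({
                "phrase": phrase,
                "offset": m.start(),
                "context": section_text[start:m.end() + 40],
            })
    return hits

# hypothetical MD&A excerpt for illustration
mdna = ("Our results were affected by supply chain risk in Asia, "
        "and evolving AI regulation may increase compliance costs.")
```

Returning offsets rather than bare booleans is what lets analysts jump from a hit straight to the filing text.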

Do you cover foreign private issuers?

Foreign private issuers that file on SEC EDGAR (20-F, 6-K) are fully supported. These filings sometimes use IFRS instead of US-GAAP taxonomy, and the parsing engine handles the taxonomy mapping transparently. Dual-listed companies that file both a 20-F and domestic filings can be tracked under a single watchlist entry. Filings on non-SEC systems (such as Companies House in the UK or SEDAR+ in Canada) are outside the current scope but can be discussed during scoping if your research requires cross-jurisdiction coverage.

How far back does historical coverage go?

Any filing on SEC EDGAR, which goes back to 1993 for most filers and 2001 for full-text search. For historical backfills I scope by the number of filings and the fields you need. Large historical reconstructions (for example, 10-K Risk Factors across a full sector for 15 years) are common research engagements and scope cleanly into the Filing Set Extraction tier for smaller runs or the Research Partnership tier for larger ones.

Can you integrate with our existing research stack?

Yes. The most common Live Pipeline integration is webhook-to-research-system plus a parallel Parquet archive for historical replay. For clients running Databricks or Snowflake, direct table writes with schema sync are supported. For clients with an internal factor library, custom delivery targets are scoped in the initial call.

How long does setup and delivery take?

For a filing set extraction, the typical turnaround is 3 to 7 business days depending on the number of filings and the complexity of the field schema you need. For a live pipeline, the setup takes approximately 5 to 10 business days for watchlist configuration, parser setup, delivery integration testing, and a validation run against known filings. After the pipeline is live, new filings are processed automatically with the target latency defined in your SLA.

How much does it cost?

A one-time filing set extraction starts at $199 for up to 1,000 filings, depending on field depth and form type complexity. The Live Pipeline tier starts at $499 per month for a watchlist of up to 500 tickers with real-time delivery. Research partnerships are scoped per engagement. This is the same pricing model I have used across 200+ projects delivered through sidb.work.

Ready to get SEC EDGAR data?

Book a 30-minute call and I’ll scope it live.