BUSINESS INTELLIGENCE

SEC EDGAR data scraping service.

Custom filing pipelines, narrative section extraction, and real-time 8-K delivery for quant funds and research firms.

The problem.

WHY THIS IS HARDER THAN IT LOOKS

Quant hedge funds, equity research firms, ESG analysts, and compliance teams all use SEC EDGAR data, and they all eventually hit the same limit. The official SEC API at data.sec.gov covers structured financial statements through XBRL-tagged line items and filer metadata. It serves its purpose well for standardized financial metrics. But it does not cover the narrative sections of a 10-K where a significant portion of the investment signal lives: Management's Discussion and Analysis, Risk Factors, Legal Proceedings, Critical Accounting Policies, and the footnotes that contextualize the numbers. These narrative sections are where companies disclose emerging risks, strategy shifts, litigation exposure, and accounting judgment calls that the structured data does not capture.

The enterprise data terminals and research platforms serve the high end of this market well and price accordingly, often running into five or six figures per seat per year. Generic EDGAR API wrappers cover the easy structured data at low cost but struggle with custom schemas, extension taxonomies, and the narrative text that requires actual parsing.

The gap is a managed service that handles custom extraction questions case by case: narrative sections parsed into structured fields, XBRL normalized across extension taxonomies, real-time 8-K delivery filtered to a watchlist, or a bespoke pipeline feeding directly into a fund's internal research system. This is not a wrapper around a public API. It is a custom pipeline that handles the parsing, normalization, and delivery that the official endpoints do not cover.

Across 200+ projects through sidb.work, the SEC EDGAR work has supported one running hedge fund client with a 1,200-ticker watchlist, 2,000+ filings processed daily, a 3.2-minute average latency from SEC publication to structured data delivery, and 99.4% parse accuracy across 180+ normalized financial metrics. The same approach applies to any filing type, any watchlist size, and any delivery target.

For firms that need the data the official structured endpoints do not cover, in the format their systems already consume, at a latency that matters for their workflow, this is what I build. You tell me the filings, the fields, the tickers, and the delivery targets. I build the pipeline.

Is this right for you?

GOOD FIT IF ANY OF THESE SOUND LIKE YOU

You run a quant fund or research firm and need filing data faster or in a different shape than generic API wrappers provide

You need narrative section extraction (MD&A, Risk Factors, Legal Proceedings) that XBRL does not cover

You want real-time 8-K pipelines filtered to a specific watchlist, with sub-5-minute latency

You have custom research questions that require bespoke parsing, not a standard schema

What you receive.

EXACT FIELDS, DELIVERED IN YOUR FORMAT

accession_number (string): SEC-assigned unique identifier for the filing. Primary key across the system.
filer_cik (string): Central Index Key of the filing entity, zero-padded to 10 digits.
ticker (string): Primary trading symbol for the filer, resolved from the CIK.
form_type (enum): Filing form type: 10-K, 10-Q, 8-K, DEF 14A, S-1, and others supported on request.
filing_date (date): Date the document was filed with the SEC.
period_of_report (date): Fiscal period end date covered by the filing.
xbrl_metrics (object): Normalized financial metrics parsed from XBRL, mapped to a consistent schema across filers.
risk_factors_extracted (array): Risk Factors section parsed into individual risk items with classification tags.
mdna_summary (string): Management's Discussion and Analysis section extracted as structured text with headings preserved.
legal_proceedings (array): Legal Proceedings section parsed into individual case summaries.
segment_breakdown (object): Segment-level financial data with unit conversion and scale factor normalization.
publication_latency_seconds (number): Time from SEC publication to delivery to your system, for SLA tracking.
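
A minimal sketch of how a delivered record could be modeled on the consumer side. Field names mirror the schema above; the dataclass and its validation helper are illustrative, not part of the delivered payload:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FilingRecord:
    """One delivered filing, mirroring the field schema above."""
    accession_number: str          # primary key, e.g. "0000789019-26-000042"
    filer_cik: str                 # zero-padded to 10 digits
    ticker: str
    form_type: str                 # "10-K", "10-Q", "8-K", ...
    filing_date: date
    period_of_report: date
    xbrl_metrics: dict = field(default_factory=dict)
    risk_factors_extracted: list = field(default_factory=list)
    publication_latency_seconds: float = 0.0

    def __post_init__(self):
        # CIKs are delivered zero-padded to 10 digits; enforce that here
        if len(self.filer_cik) != 10 or not self.filer_cik.isdigit():
            raise ValueError(f"malformed CIK: {self.filer_cik!r}")

rec = FilingRecord(
    accession_number="0000789019-26-000042",
    filer_cik="0000789019",
    ticker="MSFT",
    form_type="10-K",
    filing_date=date(2026, 3, 15),
    period_of_report=date(2025, 12, 31),
)
```

Typing the record at the boundary makes malformed deliveries fail loudly instead of silently corrupting downstream joins on the CIK.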

Sample record.

sec-edgar.sample.json
{
  "accession_number": "0000789019-26-000042",
  "filer_cik": "0000789019",
  "ticker": "MSFT",
  "form_type": "10-K",
  "filing_date": "2026-03-15",
  "period_of_report": "2025-12-31",
  "xbrl_metrics": {
    "revenue_usd": 245122000000,
    "operating_income_usd": 109433000000,
    "net_income_usd": 88136000000
  },
  "risk_factors_extracted": ["Competition", "Cybersecurity", "Regulatory", "AI Model Liability"],
  "segment_breakdown": {
    "Productivity and Business Processes": { "revenue_usd": 74021000000 },
    "Intelligent Cloud": { "revenue_usd": 106362000000 },
    "More Personal Computing": { "revenue_usd": 64739000000 }
  },
  "publication_latency_seconds": 192,
  "delivered_at": "2026-03-15T22:03:12Z"
}

Straightforward pricing.

SCALE DETERMINES PRICE · NO HIDDEN FEES

Filing set extraction

from $199

One-time extraction of a specific filing set. Delivered in 3 to 7 days.

  • Up to 1,000 filings
  • Custom field schema
  • XBRL + narrative sections
  • CSV, JSON, or Parquet
Get a quote →

Live pipeline

from $499/mo

Real-time processing of new filings on a configured watchlist.

  • Watchlist of up to 500 tickers
  • All form types (10-K, 10-Q, 8-K, etc.)
  • Target latency under 5 minutes
  • Webhook + database delivery
Get a quote →

Research partnership

Custom

Larger watchlists, custom research questions, and bespoke schemas.

  • Full ticker universe
  • Narrative topic extraction
  • Per-engagement scoping
  • Research call included
Get a quote →

Frequently asked questions.

EVERYTHING YOU NEED TO KNOW

Which filing types do you support?

All of the common ones: 10-K, 10-Q, 8-K, DEF 14A, S-1, S-4, and any form type available through EDGAR on request. The supporting case study covers 2,000+ filings processed per day across 1,200 watchlist tickers with a 3.2-minute average latency from publication to delivery.

Can you extract narrative sections like Risk Factors and MD&A?

Yes. Risk Factors, MD&A, Legal Proceedings, and Critical Accounting Policies are parsed into structured fields with headings preserved. For Risk Factors specifically I can classify individual risk items into categories you define (cybersecurity, regulatory, competition, etc.) and track how the language evolves across years for the same filer.
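
As a rough illustration of category tagging, here is a keyword-based classifier over parsed risk items. The category names and keyword lists are placeholders; in practice the taxonomy comes from the client during scoping:

```python
# Illustrative category -> keyword mapping; real categories are
# defined per engagement, not hard-coded like this.
RISK_CATEGORIES = {
    "cybersecurity": ["cyber", "data breach", "ransomware"],
    "regulatory": ["regulation", "compliance", "antitrust"],
    "competition": ["competitor", "market share", "pricing pressure"],
}

def classify_risk(item_text: str) -> list[str]:
    """Tag one parsed risk item with every matching category."""
    text = item_text.lower()
    tags = [cat for cat, keywords in RISK_CATEGORIES.items()
            if any(kw in text for kw in keywords)]
    return tags or ["uncategorized"]
```

Running the same classifier over the same filer's 10-Ks year over year is what makes the language-evolution tracking possible.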

How fast is real-time delivery?

The target for the Live Pipeline tier is under 5 minutes from SEC publication to your system. One running client pipeline averages 3.2 minutes end-to-end with 99.4% parse accuracy across 180+ normalized metrics. Delivery is push-based via webhook or direct database write.
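
A minimal consumer-side sketch of SLA tracking built on the publication_latency_seconds field. The 300-second threshold mirrors the sub-5-minute target; the function names are illustrative:

```python
from datetime import datetime, timezone

SLA_SECONDS = 300  # Live Pipeline target: under 5 minutes

def latency_seconds(published_at: str, delivered_at: str) -> float:
    """Seconds from SEC publication to delivery, for SLA tracking."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    pub = datetime.strptime(published_at, fmt).replace(tzinfo=timezone.utc)
    dlv = datetime.strptime(delivered_at, fmt).replace(tzinfo=timezone.utc)
    return (dlv - pub).total_seconds()

def met_sla(record: dict) -> bool:
    """True if a delivered record met the configured latency SLA."""
    return record["publication_latency_seconds"] <= SLA_SECONDS
```

Logging both numbers per filing gives you an audit trail for the SLA rather than a single averaged claim.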

How do you handle custom XBRL extension taxonomies?

Custom XBRL extensions get resolved against the base US-GAAP or IFRS taxonomy at parse time, with explicit mapping decisions surfaced in the output so your analysts can review anything the mapper was not certain about. For engagements with specific taxonomy concerns this is scoped during the initial call.
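
A simplified sketch of what "surfacing mapping decisions" can look like. The extension tags and their target concepts below are hypothetical examples; real mapping tables are built per engagement and per filer:

```python
# Base concepts pass through untouched; extension tags resolve via a
# per-engagement mapping table with a confidence label.
BASE_TAXONOMY = {"us-gaap:Revenues", "us-gaap:OperatingIncomeLoss"}
EXTENSION_MAP = {
    # hypothetical extension tags, for illustration only
    "msft:CloudRevenue": ("us-gaap:Revenues", "high"),
    "acme:AdjustedOperatingIncome": ("us-gaap:OperatingIncomeLoss", "review"),
}

def resolve_tag(tag: str) -> dict:
    """Resolve an XBRL tag, flagging uncertain mappings for analyst review."""
    if tag in BASE_TAXONOMY:
        return {"tag": tag, "mapped_to": tag, "needs_review": False}
    base, confidence = EXTENSION_MAP.get(tag, (None, "unmapped"))
    return {"tag": tag, "mapped_to": base, "needs_review": confidence != "high"}
```

Anything with needs_review set lands in the output with the original tag preserved, so nothing gets silently renamed.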

What delivery formats do you support?

Six formats cover most clients: CSV or JSON to S3 or Google Cloud Storage, Parquet files for analytical workloads, direct writes to PostgreSQL or BigQuery, and webhook pushes to internal research systems. For quant clients the most common integration is webhook to a research engine plus a Parquet archive.
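
For the flat-file formats, serialization is straightforward. A stdlib-only sketch of the same records going to CSV and JSON Lines (the field subset here is illustrative):

```python
import csv
import io
import json

records = [
    {"accession_number": "0000789019-26-000042", "ticker": "MSFT",
     "form_type": "10-K", "revenue_usd": 245_122_000_000},
]

def to_csv(rows: list[dict]) -> str:
    """Serialize records to CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_jsonl(rows: list[dict]) -> str:
    """Serialize records to JSON Lines, one filing per line."""
    return "\n".join(json.dumps(r) for r in rows)
```

Parquet and direct database writes follow the same record shape; only the sink changes.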

Do you handle amended and restated filings?

Yes. Amended filings (10-K/A, 10-Q/A) and restated figures are tracked as diffs against the original filing, so downstream systems can see exactly what changed and when. This is important for historical analysis and risk modeling.
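
The diff itself is conceptually simple. A sketch of field-level diffing between an original filing's metrics and its /A amendment; the figures below are made up for illustration:

```python
def diff_metrics(original: dict, amended: dict) -> dict:
    """Field-level diff between an original filing and its /A amendment."""
    changes = {}
    for key in original.keys() | amended.keys():
        before, after = original.get(key), amended.get(key)
        if before != after:
            changes[key] = {"before": before, "after": after}
    return changes

# hypothetical restatement: net income revised down in the amendment
orig = {"revenue_usd": 245_122_000_000, "net_income_usd": 88_136_000_000}
amend = {"revenue_usd": 245_122_000_000, "net_income_usd": 87_900_000_000}
```

Downstream systems receive only the changed keys with before/after values, which keeps restatement handling explicit in historical backtests.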

Can you normalize the same metrics across many companies?

Yes. For research engagements I build cross-filer datasets that normalize the same metrics across all companies in a given sector, SIC code, or custom peer group. This is the most common use case for the Research Partnership tier and is how many quant clients build factor models or ESG scoring systems from raw filing data.

Can you search for specific topics across narrative sections?

Yes. For example, if you want every mention of 'supply chain risk' or 'AI regulation' across all 10-Ks filed by a peer group in the last 3 years, I build a topic extraction pipeline that searches the narrative sections and returns structured results with filing context, section location, and surrounding text. For clients building ESG models this is often the primary engagement: tracking how disclosure language around climate risk, workforce diversity, or board governance evolves across reporting periods. This is scoped as a Research Partnership engagement because each extraction question is configured differently.
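
At its simplest, a topic search over an extracted narrative section looks like the sketch below: each hit comes back with its offset and surrounding text. The sample sentence is invented, and real pipelines layer synonym expansion and section metadata on top of this:

```python
import re

def find_topic_mentions(section_text: str, phrases: list[str]) -> list[dict]:
    """Return each phrase hit with roughly 40 chars of surrounding context."""
    hits = []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), section_text, re.IGNORECASE):
            start = max(0, m.start() - 40)
            hits.append({
                "phrase": phrase,
                "offset": m.start(),
                "context": section_text[start:m.end() + 40],
            })
    return hits

# hypothetical MD&A excerpt for illustration
mdna = ("Our results were affected by supply chain risk in Asia, "
        "and evolving AI regulation may increase compliance costs.")
```

Returning offsets rather than bare booleans is what lets analysts jump from a hit straight to the filing text.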

Do you cover foreign private issuers?

Foreign private issuers that file on SEC EDGAR (20-F, 6-K) are fully supported. These filings sometimes use IFRS instead of US-GAAP taxonomy, and the parsing engine handles the taxonomy mapping transparently. Dual-listed companies that file both a 20-F and domestic filings can be tracked under a single watchlist entry. Filings on non-SEC systems (such as Companies House in the UK or SEDAR+ in Canada) are outside the current scope but can be discussed during scoping if your research requires cross-jurisdiction coverage.

How far back does historical coverage go?

Any filing on SEC EDGAR, which goes back to 1993 for most filers and 2001 for full-text search. For historical backfills I scope by the number of filings and the fields you need. Large historical reconstructions (for example, 10-K Risk Factors across a full sector for 15 years) are common research engagements and scope cleanly into the Filing Set Extraction tier for smaller runs or the Research Partnership tier for larger ones.

Can you integrate with our existing research stack?

Yes. The most common Live Pipeline integration is webhook-to-research-system plus a parallel Parquet archive for historical replay. For clients running Databricks or Snowflake, direct table writes with schema sync are supported. For clients with an internal factor library, custom delivery targets are scoped in the initial call.

How long does setup and delivery take?

For a filing set extraction, the typical turnaround is 3 to 7 business days depending on the number of filings and the complexity of the field schema you need. For a live pipeline, the setup takes approximately 5 to 10 business days for watchlist configuration, parser setup, delivery integration testing, and a validation run against known filings. After the pipeline is live, new filings are processed automatically with the target latency defined in your SLA.

How much does it cost?

A one-time filing set extraction starts at $199 for up to 1,000 filings, depending on field depth and form type complexity. The Live Pipeline tier starts at $499 per month for a watchlist of up to 500 tickers with real-time delivery. Research partnerships are scoped per engagement. This is the same pricing model I have used across 200+ projects delivered through sidb.work.

Ready to get SEC EDGAR data?

Book a 30-minute call and I’ll scope it live.