
API Integration Challenges: The Breakdown

In the last post, I used the duck migration example to illustrate why API integration is harder than it looks. Now I’ll break down the specific challenges we’ve encountered building Plotter across 50+ data sources.

I’m grouping these into three categories: Technical Complexity, Data Reliability, and Compliance & Operations. Understanding where an API falls on each dimension tells you whether you’re looking at a 3-day integration or a 3-month one.

Technical Complexity

Architecture and Documentation

Architecture: Modern REST APIs with stable versioning are generally easy to integrate. Legacy or SOAP-based systems aren’t. The BEA API is a good example: it uses older patterns and ships its documentation as a static PDF, making testing slow and awkward. In contrast, Alpaca offers a modern REST interface with clear examples, stable versioning, and an interactive playground, making onboarding dramatically faster.

Documentation: Good documentation is just as important. Accurate examples and correct response schemas make integration straightforward. Missing, outdated, or incorrect docs do the opposite. When the documentation doesn’t match what the API actually returns, you can lose hours debugging problems that aren’t yours at all.

Internal System Configuration

Configuration: Integration depends on how well an API’s structure matches your existing systems, such as your data warehouse, ETL pipelines, and internal schemas. Some APIs, like FRED, return small, clean responses that fit easily into standard models. Others, like BLS, provide complex, interconnected tables where you have to infer relationships across columns. That’s less “configuration” and more “architectural heavy lifting.”

Uptime Monitoring: Reliability matters, too. When an API has high latency or frequent downtime, you’re forced to add caching, retries, and monitoring just to keep your pipelines stable. That’s extra infrastructure you wouldn’t need if the API were dependable.
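The caching half of that extra infrastructure can start very small. Here’s a minimal in-memory sketch in Python; the `TTLCache` and `fetch_with_cache` names are my own for illustration, not from any particular library:

```python
import time

class TTLCache:
    """Tiny in-memory cache so a flaky upstream API isn't hit on every call."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # missing or expired (note: can't cache a literal None)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

def fetch_with_cache(cache, key, fetch_fn):
    """Return a cached value when still fresh; otherwise call the API once."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = fetch_fn(key)
    cache.set(key, value)
    return value
```

Production versions add shared storage and stale-while-revalidate behavior, but the shape is the same: every cache hit is one fewer chance for a down API to break your pipeline.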

Request Accessibility

Simple Requests: The best APIs let you make a single HTTPS call and get everything you need: no custom clients, no extra steps.

Complex Requests: Others require multi-stage authentication or chained requests, like fetching categories → sections → data. Econoday is a prime example: you can’t request a table directly. You must pull events in a time window, filter them, then fetch each event individually, which can lead to thousands of calls for one historic dataset. It also doesn’t reliably publish events on time, so you need polling and backoff logic just to keep data current.
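That window → filter → fetch-each pattern generalizes to any chained-request API. A sketch, with hypothetical `list_events` and `fetch_event` callables standing in for a real client:

```python
def fetch_dataset(list_events, fetch_event, start, end, keep):
    """Chained-request pattern: list events in a time window, filter
    client-side, then fetch each surviving event individually.

    list_events(start, end) -> list of event stubs, each with an "id"
    fetch_event(event_id)   -> full event payload
    keep(stub)              -> True if the stub belongs to the table we want
    """
    stubs = list_events(start, end)                # call 1: whole window
    wanted = [s for s in stubs if keep(s)]         # filter locally
    return [fetch_event(s["id"]) for s in wanted]  # one extra call per event
```

The last line is where a single "dataset" quietly becomes thousands of HTTP calls, which is why rate limits (below) bite hardest on exactly these APIs.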

Compute and Technical Resources

Compute: Some APIs can run on lightweight, once-a-day functions. Others require heavy memory, long runtimes, or full-time environments. NDL, for example, needs high memory just to load its data at a reasonable speed, even though it rarely updates. FRED is the opposite: each job is tiny, but you may need hundreds running daily to cover all their series.

Integration Effort: This affects not just cost but architecture. Do you need dedicated infrastructure? Kubernetes orchestration? How does the workload scale as you add more sources? Developer effort varies, too—a clean API might take a week to integrate, while a messy one with surprises can take a month or more.

The key is building reusable infrastructure. Each source is a different puzzle but requires the same fundamental skills.

Data Reliability

Format Chaos

Every API structures data differently. That’s not hyperbole — I mean every single one has its quirks.

Formatting: The variety is endless—time series vs. tables, nested JSON vs. CSV, even compressed archives like .tar.bz2 and other proprietary formats requiring special libraries. Date formats alone are chaos: 01-01-2023, 2023-01-01, January 1st 2023, 2023-01-01T00:00:00z, and sometimes all of them mixed together. We’ve even seen a column of dates where one entry says “January 3rd” and breaks the parser. These inconsistencies require heavy normalization before you can even begin storing or processing the data.
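In practice the date normalizer boils down to an ordered list of known formats plus cleanup for the weird cases. A sketch (the format list is illustrative; every new source adds entries):

```python
import re
from datetime import datetime

# Formats we try in order. Illustrative, not exhaustive.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",   # 2023-01-01T00:00:00Z
    "%Y-%m-%d",              # 2023-01-01
    "%m-%d-%Y",              # 01-01-2023
    "%B %d %Y",              # January 1 2023
]

def normalize_date(raw: str) -> str:
    """Coerce any of the supported date spellings to ISO 8601 (YYYY-MM-DD)."""
    cleaned = raw.strip().replace(",", "")
    # "January 3rd" -> "January 3": strip ordinal suffixes after digits
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", cleaned)
    if cleaned.endswith("z"):
        cleaned = cleaned[:-1] + "Z"  # strptime's %z wants an uppercase Z
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"Unrecognized date format: {raw!r}")
```

Failing loudly on an unknown format matters: a silently mis-parsed date is far worse downstream than a rejected row.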

Inconsistency: Units and definitions vary just as much. BEA’s regional datasets use labels like “CHAINED 2024 DOLLARS,” “CONSTANT 2024 DOLLARS,” and “2024 DOLLARS”—all meaning the same thing but written three different ways, requiring a normalization rule for each variation. Some APIs return human-readable text (“two thousand”) instead of machine-readable numbers, which is fine for display but terrible for computation. Without standardization rules, automation falls apart.
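Those normalization rules end up as a lookup table of known aliases that fails loudly on anything unseen. A sketch; the canonical token on the right is our own convention for illustration, not BEA’s:

```python
import re

# Label variants that all refer to the same unit in BEA regional data.
UNIT_ALIASES = {
    "CHAINED 2024 DOLLARS": "2024_USD",
    "CONSTANT 2024 DOLLARS": "2024_USD",
    "2024 DOLLARS": "2024_USD",
}

def normalize_unit(label: str) -> str:
    """Collapse case/whitespace variants, then map known aliases to one token."""
    key = re.sub(r"\s+", " ", label.strip().upper())
    if key not in UNIT_ALIASES:
        # A new spelling means a human must add a rule, not that we guess.
        raise ValueError(f"No normalization rule for unit label {label!r}")
    return UNIT_ALIASES[key]
```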

In short, every API has quirks. Cleaning and normalizing the data — dates, units, structure, formats — is often a larger challenge than fetching it in the first place.

Discovery and Updating

Discovery: Many APIs don’t clearly list what data exists or how to access it. You have to manually investigate which endpoints matter, which tables update, and what the relationships are. BLS is a good example: dozens of tables with key-value mappings, and no clear guidance on how they relate. We ended up reverse-engineering the relationships to detect which columns reference which tables.

Updating: Updating logic is equally inconsistent. Well-designed APIs support incremental updates with “last modified” timestamps. Poorly designed APIs require full reloads. You pull everything, compare against what you already have, deduplicate, and store. That’s slower, more expensive, and error-prone. Econoday is especially challenging: because you can only fetch events by time window, not by table, keeping data current requires constant polling and checking for late arrivals.
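For those full-reload sources, the compare-and-deduplicate step looks roughly like this in-memory sketch (real pipelines run this against the warehouse, not a Python dict):

```python
def diff_full_reload(existing, fetched):
    """Full-reload update for an API with no 'last modified' support:
    pull everything, compare against stored rows, keep only the deltas.

    existing: {record_id: record} already in storage
    fetched:  list of records, each a dict with an "id" key
    Returns (new_records, changed_records).
    """
    new, changed = [], []
    for rec in fetched:
        old = existing.get(rec["id"])
        if old is None:
            new.append(rec)        # never seen before
        elif old != rec:
            changed.append(rec)    # revised since last load
        # else: unchanged, nothing to write
    return new, changed
```

The cost asymmetry is the point: an incremental API hands you only the deltas, while here you pay to transfer and compare the entire dataset just to find them.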

Cross-Reference Fragmentation

Cross-Referencing: APIs often spread related data across multiple tables or hierarchies. Getting a complete dataset means joining them across layers, which requires understanding the schema relationships. The scientific duck database is a good example: Family → Subfamily → Genus → Species. One lookup is fine; aggregating data means chaining queries across five tables.

Integration Impact: BLS works the same way. To load employment data, you must cross-reference series IDs with area codes, industry codes, and time-period definitions, each stored separately. Miss any one piece and the data is incomplete or unusable.

Normalization isn’t bad design, but when relationships aren’t well documented or when everything is identified by opaque codes like 6452183 instead of human-readable keys, integration becomes significantly more complex and time-consuming.
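In code, the cross-referencing amounts to chained lookups that must fail loudly when an opaque code has no mapping. A sketch with made-up column names and codes in the BLS style:

```python
def resolve_series(series_row, lookups):
    """Join one series row against its lookup tables.

    series_row: e.g. {"series_id": "S123", "area_code": "A1", "industry_code": "I9"}
    lookups:    {"area_code": {...}, "industry_code": {...}} mapping opaque
                codes to human-readable names.
    A code missing from its table makes the row unusable, so raise rather
    than emit an incomplete record.
    """
    resolved = dict(series_row)
    for column, table in lookups.items():
        code = series_row[column]
        if code not in table:
            raise KeyError(f"{column}={code!r} has no entry in its lookup table")
        resolved[column.replace("_code", "_name")] = table[code]
    return resolved
```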

Rate Limits and Volume

Rate Limits: Most APIs impose quotas — some generous (thousands of requests per minute), some extremely strict (a few hundred per day). Strict limits force you to batch requests, schedule loads during off-peak times, and implement retry logic with exponential backoff. FRED’s rate limits aren’t draconian, but because we’re loading hundreds of series daily, we hit them regularly. That means careful orchestration to stay under the threshold while keeping data fresh. Alpaca has hourly limits on certain endpoints, so pulling data for many symbols requires spreading requests out across time or using bulk endpoints.
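The retry-with-exponential-backoff piece is simple but essential. A minimal sketch, assuming the HTTP client raises a dedicated exception when the API answers 429 (the `RateLimitError` name is my own):

```python
import time

class RateLimitError(Exception):
    """Stand-in for 'the API answered 429 Too Many Requests'."""

def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a rate-limited call, doubling the wait after each failure.
    Non-rate-limit errors propagate immediately."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the orchestrator decide
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, 8s, ...
```

Injecting `sleep` keeps the helper testable; real versions also add jitter and honor any Retry-After header the API sends back.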

Data Volume: Large payloads take longer to transfer and parse, adding latency and processing overhead. Gigabyte-scale datasets require efficient pagination, parallelization, and compression. The combination of strict rate limits and large volume is the worst case: you can’t parallelize aggressively because you’ll hit the limit, but sequential requests take too long. Finding the balance requires tuning.
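Offset-based pagination is the simplest of those patterns. A sketch with a hypothetical `fetch_page(offset, limit)` signature (real APIs vary: cursors, page tokens, Link headers):

```python
def fetch_all_pages(fetch_page, page_size=1000):
    """Walk an offset-paginated endpoint until it returns a short page.

    fetch_page(offset, limit) -> list of rows for that slice.
    """
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:  # short page means we've reached the end
            return rows
        offset += page_size
```

For gigabyte-scale sources you’d stream each page to storage instead of accumulating in memory, but the termination logic is the same.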

Compliance and Operations

Authentication and Security

Authentication: Simple API keys are effortless; generate a key, add it to your request headers, and you’re done. But more advanced systems add complexity. OAuth requires multi-step authentication, token refresh cycles, and expiration handling. If tokens expire mid-process, your job fails and needs retry logic. Some APIs demand cryptographically signed requests or finely tuned permission scopes that fail with vague errors if misconfigured. Higher security is good, but it adds integration friction and requires solid key storage, rotation policies, and monitoring.
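To survive tokens expiring mid-process, we wrap the refresh cycle in a small manager that renews slightly early. A sketch, assuming a `refresh_fn` that returns the token and its lifetime in seconds (real providers differ in payload shape):

```python
import time

class TokenManager:
    """Keep an OAuth access token fresh so long-running jobs don't fail mid-run.

    refresh_fn() -> (access_token, expires_in_seconds)
    """
    def __init__(self, refresh_fn, clock=time.monotonic, margin=60):
        self.refresh_fn = refresh_fn
        self.clock = clock
        self.margin = margin       # renew this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def token(self):
        if self._token is None or self.clock() >= self._expires_at - self.margin:
            self._token, ttl = self.refresh_fn()
            self._expires_at = self.clock() + ttl
        return self._token
```

Every outgoing request asks the manager for the current token instead of caching one at job start; the early-renewal margin is what prevents the "expired between check and use" race.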

Privacy: When APIs involve sensitive data, compliance matters. GDPR, CCPA, HIPAA, and similar standards require secure handling of PII, encryption keys, and audit logs. Compliance and security overhead together make integrations noticeably more complex.

Data Ownership and Vendor Lock-In

Data Ownership: Usage rights aren’t always clear. Can you cache the data? Redistribute it? Use it for training models? Some APIs have explicit licenses, others have vague terms that create legal risk. Without clear licensing, even basic storage or sharing can be questionable.

Vendor Lock-In: Proprietary formats and frequent version changes can create heavy dependence on a provider. If an API deprecates an endpoint, changes its structure, or alters authentication requirements, your integration can break overnight. We’ve seen this happen across multiple sources when new formats and auth requirements are introduced; it forces rewrites, adds maintenance costs, and ties you tightly to the vendor’s decisions.

What This Means

Not all APIs are created equal. About 10% integrate smoothly – modern architecture, clean data, good docs. Another 25% are worst-case scenarios requiring 5-10× more effort than you’d expect. The rest fall somewhere in between, and the challenge is you often can’t tell which is which until you’re halfway through implementation.

In the next post, I’ll lay out the framework we use to evaluate APIs upfront: best-case vs worst-case across these three categories. Then I’ll show what we built to handle the worst cases without multiplying engineering effort every time we add a new source.

If you’re building a data platform, this evaluation framework is how you avoid getting stuck in integration hell. To better illustrate these challenges, it is helpful to look at the extremes of the best vs. worst case for an API, which I go into in depth in my next post: API Integration Best and Worst Case