A client's payment provider was sending webhook notifications correctly. Their system acknowledged every one. And then quietly threw most of them away.
A client's AI features were burning through their OpenAI budget 3x faster than projected. Adding OpenTelemetry's GenAI semantic conventions revealed the problem wasn't what anyone expected.
Most healthcheck endpoints return 200 OK as long as the process is running. That's not a healthcheck — it's a pulse check. Here's what happened when we confused the two, and what a real healthcheck should verify.
A client's pods were getting OOMKilled during peak traffic, but the team spent days chasing application bugs. The real problem was resource limits that nobody had revisited since the initial cluster setup.