A client's AI features were burning through their OpenAI budget 3x faster than projected. Adding OpenTelemetry's GenAI semantic conventions revealed the problem wasn't what anyone expected.
Most healthcheck endpoints return 200 OK as long as the process is running. That's not a healthcheck — it's a pulse check. Here's what happened when we confused the two, and what a real healthcheck should verify.
A client's pods were getting OOMKilled during peak traffic, but the team spent days chasing application bugs. The real problem was resource limits that nobody had revisited since the initial cluster setup.
A client was confident about how their services talked to each other. Then we instrumented the system with OpenTelemetry and found out what was actually happening.