A GraphQL health check is a dedicated endpoint or query that verifies whether your GraphQL server is fully operational — not just running, but capable of processing real requests. A robust health check validates critical dependencies: database connectivity, cache availability, and schema integrity. This matters because a server can respond to TCP pings while silently failing to execute any query. Automated infrastructure like Kubernetes, load balancers, and CI/CD pipelines rely on health check results to route traffic, restart failing pods, and gate deployments.
Key Benefits at a Glance
- Improved Reliability: Detect database disconnects and dependency failures before users notice — health checks surface silent failures that a simple ping misses entirely.
- Automated Recovery: Kubernetes and load balancers use health check results to remove unhealthy instances from rotation and restart broken pods without human intervention.
- Faster Troubleshooting: A well-designed health response tells you which dependency failed — database, cache, or upstream service — cutting diagnosis time from minutes to seconds.
- Safer Deployments: Gate production traffic in CI/CD pipelines on a passing readiness check, so a broken release never reaches users.
- Lightweight Monitoring: A minimal health query like { __typename } validates schema parsing and resolver infrastructure with negligible performance overhead.
Understanding GraphQL Health Checks: What They Are and Why You Need Them
A GraphQL health check endpoint is the contract between your API and the infrastructure managing it. When a load balancer or the Kubernetes kubelet needs to know whether a pod can receive traffic, it doesn’t execute a business query; it calls your health endpoint and reads the HTTP status code. A 200 OK means the server is ready; a 503 Service Unavailable pulls it from rotation.
The minimal JSON response for a healthy GraphQL service looks like this:
{
  "status": "ok",
  "timestamp": "2025-06-01T12:00:00Z",
  "dependencies": {
    "database": "ok",
    "cache": "ok"
  }
}
When a dependency fails, return 503 and mark the component:
{
  "status": "degraded",
  "dependencies": {
    "database": "error: connection timeout after 5000ms",
    "cache": "ok"
  }
}
- Health checks enable zero-downtime deployments by allowing gradual traffic shifting
- Orchestration platforms like Kubernetes rely on health checks for automatic pod recovery
- Monitoring systems use health check data to trigger alerts and scaling decisions
- Proper health checks reduce mean time to recovery during incidents
The key insight is that health checks serve the infrastructure, not developers. They must be fast (under 1 second), unauthenticated, and accurate. A check that returns 200 while the database is down is actively harmful — it creates false confidence and prevents automatic recovery.
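The status-code contract described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions rather than a prescribed implementation: the name `summarizeHealth` and its input shape (a map of dependency name to either 'ok' or an error string) are hypothetical and chosen to mirror the JSON examples in this section.

```javascript
// Hypothetical helper: maps dependency states to the HTTP contract
// (200 when every dependency is 'ok', 503 otherwise).
function summarizeHealth(dependencies) {
  const healthy = Object.values(dependencies).every((state) => state === 'ok');
  return {
    httpStatus: healthy ? 200 : 503,
    body: {
      status: healthy ? 'ok' : 'degraded',
      timestamp: new Date().toISOString(),
      dependencies,
    },
  };
}

const failing = summarizeHealth({
  database: 'error: connection timeout after 5000ms',
  cache: 'ok',
});
console.log(failing.httpStatus, failing.body.status); // prints: 503 degraded
```

Keeping the decision in one function makes it hard for the status code and the body to disagree, which is the failure mode that creates false confidence.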
Types of GraphQL Health Checks: Liveness vs. Readiness
Understanding the distinction between liveness and readiness checks is the most important design decision when implementing health checks for GraphQL services. They answer fundamentally different questions and trigger different infrastructure responses.
Liveness answers: “Is the process alive and not deadlocked?” A liveness failure tells Kubernetes the pod is unrecoverable — it gets restarted. Only check the bare minimum: HTTP server responding, event loop not blocked.
Readiness answers: “Is this instance ready to handle production traffic?” A readiness failure removes the pod from service endpoints without restarting it, allowing temporary issues (startup lag, cache warming) to resolve naturally. This is where you check database connections, upstream services, and schema loading.
| Check Type | Purpose | What to Check | Failure Consequence | Complexity |
|---|---|---|---|---|
| Liveness | Is the process alive? | HTTP server responding, no deadlock | Pod restart | Low |
| Readiness | Ready for traffic? | DB connection, schema loaded, cache available | Removed from load balancer rotation | Medium–High |
- Never check database connectivity in a liveness probe — a DB outage should not restart your pods
- Readiness probes can and should check all critical dependencies
- A pod can be live but not ready (e.g., still loading schema)
- Separate endpoints: /health/live and /health/ready is a clean pattern
A common mistake is putting database checks inside a liveness probe. If the database goes down, every pod fails its liveness check and gets restarted in a loop — making the outage worse. Readiness probes are built for this: they simply remove the pod from rotation until the database recovers, then automatically restore it.
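The difference between the two probes can be sketched with a small in-memory state object. This is an illustrative sketch only: `createHealthState` and its method names are hypothetical, and a real service would set these flags from its actual startup and connection logic.

```javascript
// Hypothetical in-memory state shared by both probes.
function createHealthState() {
  let schemaLoaded = false;
  let databaseConnected = false;
  return {
    markSchemaLoaded() { schemaLoaded = true; },
    setDatabaseConnected(value) { databaseConnected = value; },
    // Liveness: if this code runs at all, the process is alive. No dependency checks.
    liveness() {
      return { httpStatus: 200, body: { status: 'alive' } };
    },
    // Readiness: report ready only when everything needed for real traffic is in place.
    readiness() {
      const ready = schemaLoaded && databaseConnected;
      return {
        httpStatus: ready ? 200 : 503,
        body: { status: ready ? 'ready' : 'not ready', schemaLoaded, databaseConnected },
      };
    },
  };
}

const probes = createHealthState();
console.log(probes.liveness().httpStatus, probes.readiness().httpStatus); // prints: 200 503
probes.markSchemaLoaded();
probes.setDatabaseConnected(true);
console.log(probes.readiness().httpStatus); // prints: 200
probes.setDatabaseConnected(false); // simulated database outage
console.log(probes.liveness().httpStatus, probes.readiness().httpStatus); // prints: 200 503
```

During the simulated outage the liveness result stays at 200, so nothing is restarted, while readiness drops to 503 and takes the instance out of rotation until the flag flips back.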
Readiness probes should validate downstream dependencies. Integrate observability checks to ensure health endpoints reflect true service capability.
Implementing Health Checks in Popular GraphQL Servers
Each major GraphQL server handles health checks differently — some provide built-in endpoints, others require middleware. Here’s what you actually need to configure for each.
Apollo Server Health Checks
Apollo Server 4 does not expose a health endpoint by default. You add one via Express middleware before the GraphQL handler:
import express from 'express';
import { ApolloServer } from '@apollo/server';
import { expressMiddleware } from '@apollo/server/express4';

// typeDefs, resolvers, and db (a database client) are assumed to be defined elsewhere
const app = express();
const server = new ApolloServer({ typeDefs, resolvers });
await server.start();

// Health check: mount BEFORE the GraphQL middleware
app.get('/health', async (req, res) => {
  try {
    // Optional: verify the DB connection
    await db.query('SELECT 1');
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'error', message: err.message });
  }
});

// expressMiddleware expects a parsed JSON body, so register express.json() first
app.use('/graphql', express.json(), expressMiddleware(server));
- Start Apollo Server and attach Express middleware
- Mount the /health route before the GraphQL handler
- Add dependency checks (DB ping, cache ping) inside the handler
- Return 200 for healthy, 503 for any dependency failure
- Configure Kubernetes probes to hit this endpoint
“GraphOS Router and Apollo Router Core support a basic HTTP-level health check. This is enabled by default and is served on port 8088 at the URL path /health. This returns a 200 status code if the HTTP server is successfully serving.”
— Apollo GraphQL Docs
GraphQL-level Health Checks for Apollo Server
GraphQL-level health checks validate more components than an HTTP ping — they exercise schema parsing, the execution engine, and resolver infrastructure. The standard pattern uses { __typename } as a minimal introspection query that requires no resolver logic:
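# Send as an HTTP GET. Apollo Server's default CSRF prevention rejects a bare GET,
# so include a header such as apollo-require-preflight
GET /graphql?query={__typename}
apollo-require-preflight: true
# Expected response when healthy
{ "data": { "__typename": "Query" } }
This approach catches issues like schema compilation errors or resolver initialization failures that an HTTP-only check would miss. Keep CSRF prevention enabled when you allow GET requests, and have the probe send the required header rather than switching the protection off. Because this check goes through the full GraphQL endpoint, including any authentication in front of it, a separate unauthenticated health endpoint is often the simpler choice for infrastructure probes.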
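A GraphQL-level probe along these lines can be sketched as a small function. The function name `probeGraphQL`, the injectable `fetchImpl` parameter, and the result shape are all hypothetical. The `apollo-require-preflight` header is included on the assumption that the target is an Apollo Server with its default CSRF prevention enabled; other servers may not need it.

```javascript
// Hypothetical probe: sends { __typename } as a GET request and inspects the reply.
// fetchImpl is injectable so the function can be exercised without a live server.
async function probeGraphQL(endpoint, fetchImpl = fetch) {
  const url = `${endpoint}?query=${encodeURIComponent('{__typename}')}`;
  try {
    const response = await fetchImpl(url, {
      method: 'GET',
      // Assumption: Apollo Server's default CSRF prevention needs this header on GETs
      headers: { 'apollo-require-preflight': 'true' },
    });
    if (!response.ok) {
      return { healthy: false, reason: `HTTP ${response.status}` };
    }
    const payload = await response.json();
    const typename = payload && payload.data && payload.data.__typename;
    return typename
      ? { healthy: true, typename }
      : { healthy: false, reason: 'response had no data.__typename' };
  } catch (err) {
    // Network errors and invalid JSON both count as unhealthy
    return { healthy: false, reason: err.message };
  }
}
```

Calling `probeGraphQL('http://localhost:4000/graphql')` against a running server would resolve to an object with `healthy: true` when the schema and execution engine are working. The function never throws, so a caller can map the result directly to 200 or 503.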
GraphQL Yoga Server Health Checks
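GraphQL Yoga ships with a /health endpoint enabled out of the box, with no configuration needed for basic usage. Readiness checks with custom dependency logic go through the plugin system (older Yoga releases also exposed a built-in /readiness endpoint, so check the behavior of the version you run):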
import { createYoga } from 'graphql-yoga';
import { useReadinessCheck } from '@graphql-yoga/plugin-readiness-check';

// schema, db, and redis are assumed to be defined elsewhere
const yoga = createYoga({
  schema,
  plugins: [
    useReadinessCheck({
      endpoint: '/readiness',
      check: async () => {
        // Map each ping to a strict boolean: resolved means ready, rejected means not ready
        const dbOk = await db.ping().then(() => true, () => false);
        const cacheOk = await redis.ping().then(() => true, () => false);
        return dbOk && cacheOk;
      },
    }),
  ],
});
- Built-in /health (liveness) endpoint with zero config; readiness via the readiness-check plugin
- Plugin-based custom readiness logic for any dependency
- Works with Node.js, Bun, Cloudflare Workers, and Deno
- Readiness endpoint returns 200 or 503 based on plugin result
Hasura GraphQL Engine Health Checks
“Hasura gives a health check endpoint to monitor the status of the GraphQL API. This is available under the /healthz endpoint for all Hasura projects.”
— Hasura Docs
Hasura exposes /healthz out of the box. Query it directly:
# Basic health check
curl https://your-hasura-domain/healthz
# Healthy response: HTTP 200 with a plain-text body
OK
# With metadata consistency check
curl "https://your-hasura-domain/healthz?strict=true"
The strict=true parameter adds metadata consistency validation, which is useful in production where you want to catch schema drift between the database and Hasura’s tracked tables. Because the endpoint is plain HTTP, any monitor that can issue a GET and inspect the status code can poll it, for example a Prometheus blackbox exporter probe or a Datadog HTTP check.
GraphQL Mesh Health Checks
GraphQL Mesh is built on GraphQL Yoga and inherits its /health and /readiness endpoints. In Mesh’s layered architecture, readiness checks are especially critical because Mesh federates multiple upstream GraphQL sources — a readiness check should reflect the aggregate health of all of them:
// mesh.config.ts
export const config = {
  serve: {
    healthCheckEndpoint: '/health',
    readinessCheckEndpoint: '/readiness',
  },
  // Custom readiness: validate all upstream sources
};
Configure readiness to check whether all upstream sources have successfully loaded their schemas. A Mesh instance that starts before upstream services are ready should fail readiness until federation is complete.
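The aggregate readiness idea can be sketched independently of Mesh's own configuration. Everything in this example is hypothetical: `checkUpstreams`, the `name` field, and the `isSchemaLoaded()` method stand in for however your gateway tracks its sources, and none of them are Mesh APIs.

```javascript
// Hypothetical aggregate readiness: every upstream source must report its schema as loaded.
async function checkUpstreams(sources) {
  // The async wrapper turns synchronous throws into rejections, so one bad source
  // cannot abort the whole check.
  const settled = await Promise.allSettled(
    sources.map(async (source) => source.isSchemaLoaded())
  );
  const upstreams = {};
  settled.forEach((outcome, index) => {
    const name = sources[index].name;
    if (outcome.status === 'rejected') {
      const reason = outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason);
      upstreams[name] = `error: ${reason}`;
    } else {
      upstreams[name] = outcome.value ? 'ok' : 'schema not loaded';
    }
  });
  const ready = Object.values(upstreams).every((state) => state === 'ok');
  return { ready, upstreams };
}

// Example with stub sources: one loaded, one still starting up
checkUpstreams([
  { name: 'users', isSchemaLoaded: async () => true },
  { name: 'orders', isSchemaLoaded: async () => false },
]).then((result) => console.log(result.ready, result.upstreams));
```

Reporting per-source state alongside the overall flag lets an operator see which upstream is holding the gateway out of rotation.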
Custom GraphQL Health Check Implementations
A production-grade health check runs its dependency checks in parallel to keep response time low, applies a per-dependency timeout so one hung call cannot stall the response, and returns structured metadata so monitoring tools can distinguish a full outage from partial degradation.
import { Router } from 'express';

// db and redis are assumed to be client instances defined elsewhere
const healthRouter = Router();

// Reject if fn() has not settled within timeoutMs; always clear the timer afterwards
async function checkWithTimeout(fn, timeoutMs = 3000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

healthRouter.get('/health/ready', async (req, res) => {
  const start = Date.now();
  // Run all checks in parallel so one slow check does not block the others
  const [dbResult, cacheResult] = await Promise.allSettled([
    checkWithTimeout(() => db.query('SELECT 1')),
    checkWithTimeout(() => redis.ping()),
  ]);
  const dependencies = {
    database: dbResult.status === 'fulfilled' ? 'ok' : `error: ${dbResult.reason.message}`,
    cache: cacheResult.status === 'fulfilled' ? 'ok' : `error: ${cacheResult.reason.message}`,
  };
  const healthy = Object.values(dependencies).every((v) => v === 'ok');
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    uptime: process.uptime(),
    responseTime: `${Date.now() - start}ms`,
    dependencies,
  });
});
- Use Promise.allSettled, not Promise.all: you want all results even when one fails
- Wrap every check in a timeout; a hanging DB call shouldn’t delay your health response
- Return responseTime in the body, since slow health checks are a performance warning sign
- Consider circuit breaker patterns: if a dependency has failed 5 times in 30 seconds, skip checking it
- Never throw inside the health handler; catch everything and map to 200/503
For services that can operate in a degraded state (e.g., serving cached data when the primary DB is slow), return 200 with "status": "degraded" rather than 503. This keeps the pod in rotation while signaling to monitoring that intervention may be needed.
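One way to express this rule is to separate critical dependencies from optional ones. The sketch below is an assumption about how a team might encode that policy: `classifyHealth` and the `criticalNames` list are hypothetical, and which dependencies count as critical is a decision for each service.

```javascript
// Hypothetical classification: a failed critical dependency means 503,
// while a failed optional dependency only degrades the service.
function classifyHealth(dependencies, criticalNames) {
  const failed = Object.keys(dependencies).filter((name) => dependencies[name] !== 'ok');
  const criticalFailed = failed.some((name) => criticalNames.includes(name));
  if (criticalFailed) return { httpStatus: 503, status: 'unhealthy', failed };
  if (failed.length > 0) return { httpStatus: 200, status: 'degraded', failed };
  return { httpStatus: 200, status: 'ok', failed };
}

const verdict = classifyHealth(
  { database: 'ok', search: 'error: timeout after 3000ms' },
  ['database']
);
console.log(verdict.httpStatus, verdict.status); // prints: 200 degraded
```

The `failed` array gives monitoring a machine-readable list to alert on even when the HTTP status stays at 200.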
Testing Health Check Endpoints
Always test both the happy path and failure scenarios. The most common oversight is only testing that health checks return 200 when everything works, never verifying they return 503 when a dependency is down.
# Test healthy state
curl -i http://localhost:4000/health/ready
# Expected: HTTP/1.1 200 OK
# Body: {"status":"ok","dependencies":{"database":"ok","cache":"ok"}}
# Simulate DB failure (stop your local DB), then:
curl -i http://localhost:4000/health/ready
# Expected: HTTP/1.1 503 Service Unavailable
# Body: {"status":"degraded","dependencies":{"database":"error: ...","cache":"ok"}}
| Testing Approach | Use Case | Automation Level | Complexity |
|---|---|---|---|
| Manual curl | Development testing | None | Low |
| CI/CD pipeline | Pre-deployment validation | Full | Medium |
| Monitoring system | Production health tracking | Full | High |
In CI/CD pipelines, add a post-deploy step that polls the health endpoint until it returns 200 or times out — this is your deployment gate. Combine it with a synthetic failure test: temporarily kill the database container and assert your health endpoint returns 503 before restoring it.
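The polling gate described above might look like the following sketch. The function name `waitForHealthy`, its option names, and the default timings are hypothetical; `fetchImpl` is injectable so the logic can be tested with a stub instead of a deployed service.

```javascript
// Hypothetical deployment gate: poll until the health endpoint returns 200 or time runs out.
async function waitForHealthy(url, options = {}) {
  const { timeoutMs = 60000, intervalMs = 2000, fetchImpl = fetch } = options;
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (Date.now() < deadline) {
    attempts += 1;
    try {
      const response = await fetchImpl(url);
      if (response.status === 200) return { healthy: true, attempts };
    } catch (err) {
      // Connection errors are expected while the service is still starting
    }
    // Wait before the next attempt
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return { healthy: false, attempts };
}

// Usage in a deploy script (assumed URL):
// const gate = await waitForHealthy('http://localhost:4000/health/ready');
// if (!gate.healthy) process.exit(1);
```

Exiting with a non-zero code when the gate fails is what lets the pipeline block promotion automatically.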
Validate health check logic using unit testing frameworks to ensure endpoints correctly report service state under mocked dependency conditions.
Integrating Health Checks with Deployment Environments
Health checks are the primary interface between your GraphQL service and the infrastructure managing it. Getting this integration right is what enables zero-downtime deployments, automatic recovery, and reliable rolling updates.
Kubernetes Integration
Configure separate liveness and readiness probes pointing to different endpoints. Here’s a production-ready pod spec for a GraphQL service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql-api
spec:
  selector:                # required for apps/v1; the label value is illustrative
    matchLabels:
      app: graphql-api
  template:
    metadata:
      labels:
        app: graphql-api
    spec:
      containers:
        - name: graphql-api
          image: your-graphql-image:latest
          ports:
            - containerPort: 4000
          livenessProbe:
            httpGet:
              path: /health/live      # Only checks: is the HTTP server responding?
              port: 4000
            initialDelaySeconds: 15   # Wait for app startup
            periodSeconds: 20
            failureThreshold: 3
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health/ready     # Checks DB, cache, schema
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
            timeoutSeconds: 5
- Set initialDelaySeconds based on your actual startup time: measure it, don’t guess
- Point liveness to a lightweight endpoint (HTTP server only, no DB checks)
- Point readiness to your full dependency-checking endpoint
- Set failureThreshold: 3 to tolerate transient failures without premature action
- Monitor probe behavior in production via kubectl describe pod and events
For GraphQL services that load large schemas or pre-warm caches on startup, increase initialDelaySeconds generously. A pod that restarts because its liveness probe fired too early (before startup completed) is a common configuration mistake that causes deployment instability.
Configure Kubernetes probes to expect standard HTTP codes using GraphQL HTTP status code conventions, ensuring orchestration layers correctly interpret service health.
Docker Health Check Configuration
Add a HEALTHCHECK instruction to your Dockerfile to enable container-level health monitoring:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 4000
HEALTHCHECK \
--interval=30s \
--timeout=10s \
--start-period=60s \
--retries=3 \
CMD wget -qO- http://localhost:4000/health/ready || exit 1
CMD ["node", "server.js"]
- --interval=30s: check every 30 seconds in production
- --timeout=10s: kill the check if it hasn’t responded in 10 seconds
- --start-period=60s: grace period after container start before failures count
- --retries=3: require 3 consecutive failures before marking unhealthy
- Use wget instead of curl in Alpine-based images, since it’s included by default
Docker health status is visible via docker inspect --format='{{.State.Health.Status}}' container_name. Kubernetes does not act on the Dockerfile HEALTHCHECK instruction; it relies on the probes defined in the pod spec. The Dockerfile check is still useful for local debugging and for standalone Docker or Docker Compose environments.
Using in Containerized Environments
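The most common health check failure in containerized environments is a server bound to 127.0.0.1 (localhost). Kubernetes httpGet probes are sent by the kubelet to the pod’s IP address, and traffic through a published Docker port also arrives on a non-loopback interface, so neither can reach a service that listens only on localhost. A Dockerfile HEALTHCHECK command runs inside the container and can still reach localhost, which can hide the problem until the service is deployed: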
// Wrong: health checks will fail in containers
server.listen(4000, '127.0.0.1');
// Correct: bind to all interfaces
server.listen(4000, '0.0.0.0');
// Or simply:
server.listen(4000);
Verify your service is bound correctly inside a running container:
# Check what interfaces the server is listening on
docker exec container_name netstat -tlnp | grep 4000
# Should show 0.0.0.0:4000 (or :::4000 for a dual-stack listener), not 127.0.0.1:4000
Advanced Health Check Strategies
Production systems need more than binary healthy/unhealthy states. Advanced strategies handle the reality that dependencies fail partially, transiently, or in ways that don’t require taking the entire service offline.
| Strategy | Complexity | Effectiveness | Best Use Cases |
|---|---|---|---|
| Partial availability | Medium | High | Services with optional dependencies (analytics, search) |
| Graceful degradation | High | Very High | High-traffic systems with cache fallback |
| Circuit breaking | High | High | Services with unreliable external APIs |
Circuit breaking inside health checks prevents cascading failures. If a dependency has failed 5 consecutive times, stop checking it for 30 seconds and report it as “circuit open” rather than hammering a failing service with health check requests. Libraries like opossum implement this pattern for Node.js GraphQL services.
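The behavior described here can be sketched in a few lines. This is a simplified illustration, not a replacement for a library such as opossum: `createCheckBreaker`, its option names, and the injectable `now` clock are hypothetical, and the thresholds simply mirror the numbers in the paragraph above.

```javascript
// Hypothetical circuit breaker wrapped around a single health-check dependency.
function createCheckBreaker(check, options = {}) {
  const { failureThreshold = 5, cooldownMs = 30000, now = Date.now } = options;
  let consecutiveFailures = 0;
  let openedAt = null;

  return async function guardedCheck() {
    // While the circuit is open, skip the real check until the cooldown has passed
    if (openedAt !== null && now() - openedAt < cooldownMs) {
      return 'circuit open';
    }
    try {
      await check();
      // Any success closes the circuit and resets the failure count
      consecutiveFailures = 0;
      openedAt = null;
      return 'ok';
    } catch (err) {
      consecutiveFailures += 1;
      // Open (or re-open) the circuit once the threshold is reached
      if (consecutiveFailures >= failureThreshold) openedAt = now();
      return `error: ${err.message}`;
    }
  };
}

// Usage (db is assumed to be a database client defined elsewhere):
// const guardedDbCheck = createCheckBreaker(() => db.query('SELECT 1'));
// const databaseState = await guardedDbCheck(); // 'ok', 'error: ...', or 'circuit open'
```

After the cooldown, the next call acts as a trial request: a success closes the circuit, and a failure re-opens it for another cooldown period.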
Graceful degradation is the most valuable pattern for high-traffic GraphQL APIs. If your service can serve cached data during a database outage, your health check should return 200 with "status": "degraded" — keeping the pods in rotation while your alerting notifies on-call engineers of the underlying issue.
Correlate health check failures with GraphQL timeout configurations — a service that times out frequently may report healthy while consistently failing real user requests.
Logging and Monitoring Health Checks
Log health check failures in structured JSON — successful checks rarely need logging in production, but failures should include enough context for immediate diagnosis:
// Only log failures and slow checks
if (!healthy || responseTime > 500) {
  logger.warn({
    event: 'health_check',
    status: healthy ? 'ok' : 'degraded',
    responseTime,
    dependencies,
    timestamp: new Date().toISOString(),
  });
}
Health check patterns in logs reveal system behavior before it becomes a user-visible incident. Set alerts on sustained failure patterns — not individual failures — to reduce noise. A single 503 from a health check is often a transient network blip; three consecutive 503s from the same pod signal a real problem.
- Log health check failures with dependency-level error detail
- Track responseTime, since consistently slow health checks indicate latency problems
- Use structured JSON logging so log platforms such as Grafana Loki and Datadog can parse individual fields
- Alert on sustained failure patterns (3+ consecutive), not individual failures
- Correlate health check failures with deployment events in your monitoring tool
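The "three consecutive failures" rule can be sketched as a small tracker. In practice this logic usually lives in the alerting system rather than application code; `createFailureTracker` and its return shape are hypothetical and only illustrate the counting.

```javascript
// Hypothetical tracker: signal an alert only after N consecutive failures from the same pod.
function createFailureTracker(threshold = 3) {
  const streaks = new Map(); // pod name -> current run of failed checks
  return function record(pod, httpStatus) {
    // A 200 resets the streak; anything else extends it
    const streak = httpStatus === 200 ? 0 : (streaks.get(pod) || 0) + 1;
    streaks.set(pod, streak);
    return { pod, streak, shouldAlert: streak >= threshold };
  };
}

const record = createFailureTracker(3);
record('pod-a', 503);
record('pod-a', 503);
console.log(record('pod-a', 503).shouldAlert); // prints: true
console.log(record('pod-a', 200).shouldAlert); // prints: false
```

Tracking streaks per pod keeps a single flapping instance from being masked by healthy neighbors, and keeps one transient 503 from paging anyone.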
Correlate health check failures with resolver-level metrics using GraphQL monitoring dashboards to accelerate root cause analysis during incidents.
Best Practices and Common Pitfalls
- DO: Keep health checks fast: target under 500ms, hard limit 1 second
- DO: Separate liveness (/health/live) from readiness (/health/ready) endpoints
- DO: Run dependency checks in parallel with Promise.allSettled
- DO: Return 200/503 HTTP status codes, because orchestration platforms read the code, not the body
- DO: Test failure scenarios explicitly during development
- DON’T: Put database checks in liveness probes, which causes restart loops during DB outages
- DON’T: Require authentication on health endpoints, since it adds failure points and complexity
- DON’T: Bind your server to 127.0.0.1 in containers, or health checks won’t reach it
- DON’T: Log every successful check; 200 successful checks per minute creates noise that obscures failures
- DON’T: Let a health check hang; always wrap dependency calls in timeouts
The single most critical principle: your health check must accurately reflect your service’s ability to handle real user requests. A health check that returns 200 while your resolvers are all failing is worse than no health check — it prevents automatic recovery and delays incident response. Build your health checks to fail loudly when the service is genuinely broken, and stay silent when everything works.
Use GraphQL load testing to validate that your health check endpoints remain accurate and responsive under peak traffic conditions: a health check that returns correct results at 10 RPS may time out at 1,000 RPS.
Frequently Asked Questions
What is a GraphQL health check?
A GraphQL health check is a dedicated HTTP endpoint (typically /health or /healthz) that verifies whether your GraphQL server is operational and capable of handling requests. Unlike a basic TCP ping, a proper health check validates critical dependencies — database connectivity, cache availability, schema loading — and returns a structured JSON response with 200 OK when healthy and 503 Service Unavailable when not. This allows load balancers, Kubernetes, and monitoring systems to make intelligent traffic-routing and recovery decisions automatically.
Why are health checks important for GraphQL services?
Health checks are the primary interface between your GraphQL service and the infrastructure managing it. Without them, Kubernetes cannot restart broken pods, load balancers cannot remove failing instances, and CI/CD pipelines cannot gate deployments on a working service. A GraphQL server can respond to TCP pings while failing to execute any query — health checks catch these silent failures. They reduce mean time to recovery by enabling automatic remediation without waiting for a human to notice and respond to an incident.
What is the difference between liveness and readiness checks?
Liveness checks answer “is the process alive?” — they should only verify that the HTTP server is responding and the event loop isn’t blocked. A liveness failure triggers a pod restart. Readiness checks answer “is this instance ready to handle traffic?” — they verify database connectivity, schema loading, cache availability, and upstream service health. A readiness failure removes the pod from the load balancer rotation without restarting it. The critical rule: never put database checks in a liveness probe, or a database outage will cause all your pods to restart in a loop.
How do you implement health checks in popular GraphQL servers?
For Apollo Server 4 with Express, add a /health route before the GraphQL middleware: ping your database, return 200 with {"status":"ok"} on success, or 503 with an error message on failure. GraphQL Yoga provides /health and /readiness built-in — use the readiness check plugin for custom dependency validation. Hasura exposes /healthz out of the box. The common requirements for all implementations: no authentication, response under 1 second, and accurate reflection of the service’s ability to handle real requests.
How do you test a GraphQL health check endpoint?
Test both success and failure paths. For the happy path: curl -i http://localhost:4000/health/ready and verify you get HTTP 200 with a valid JSON body. For the failure path: stop your database, hit the same endpoint, and verify you get HTTP 503 with the database listed as the failing dependency. In CI/CD pipelines, automate both scenarios — deploy the service, poll the health endpoint until it returns 200, then gate production promotion on that result. Automated failure testing is the step most teams skip and then regret during an incident.
How do you configure GraphQL health checks in Kubernetes?
Add livenessProbe and readinessProbe sections to your pod spec. Point liveness to a lightweight endpoint like /health/live (HTTP-only check), and readiness to /health/ready (full dependency check). Set initialDelaySeconds to match your actual startup time — measure it rather than guessing. Use failureThreshold: 3 so transient issues don’t trigger unnecessary restarts. Both probes use httpGet with your service port. Verify the configuration works by watching pod events with kubectl describe pod after deployment.



