GraphQL Health Check: Liveness, Readiness & Kubernetes Setup

Q: How do you implement a basic health check endpoint in GraphQL?

For Apollo Server 4 with Express, add a /health route before the GraphQL middleware: ping your database, return 200 with {"status":"ok"} on success, or 503 with an error message on failure. GraphQL Yoga provides /health built in; in Yoga v3 and later, add the readiness check plugin to get a readiness endpoint with custom dependency validation. Hasura exposes /healthz out of the box. The common requirements for all implementations: no authentication, response under 1 second, and accurate reflection of the service's ability to handle real requests.

Q: How can you test a GraphQL health check endpoint?

Test both success and failure paths. For the happy path: curl -i http://localhost:4000/health/ready and verify you get HTTP 200 with a valid JSON body. For the failure path: stop your database, hit the same endpoint, and verify you get HTTP 503 with the database listed as the failing dependency. In CI/CD pipelines, automate both scenarios — deploy the service, poll the health endpoint until it returns 200, then gate production promotion on that result. Automated failure testing is the step most teams skip and then regret during an incident.

Q: How do you integrate GraphQL health checks with Kubernetes?

Add livenessProbe and readinessProbe sections to your pod spec. Point liveness to a lightweight endpoint like /health/live (HTTP-only check), and readiness to /health/ready (full dependency check). Set initialDelaySeconds to match your actual startup time — measure it rather than guessing. Use failureThreshold: 3 so transient issues don't trigger unnecessary restarts. Both probes use httpGet with your service port. Verify the configuration works by watching pod events with kubectl describe pod after deployment.

A GraphQL health check is a dedicated endpoint or query that verifies whether your GraphQL server is fully operational — not just running, but capable of processing real requests. A robust health check validates critical dependencies: database connectivity, cache availability, and schema integrity. This matters because a server can respond to TCP pings while silently failing to execute any query. Automated infrastructure like Kubernetes, load balancers, and CI/CD pipelines rely on health check results to route traffic, restart failing pods, and gate deployments.

Key Benefits at a Glance

Improved Reliability: Detect database disconnects and dependency failures before users notice — health checks surface silent failures that a simple ping misses entirely.
Automated Recovery: Kubernetes and load balancers use health check results to remove unhealthy instances from rotation and restart broken pods without human intervention.
Faster Troubleshooting: A well-designed health response tells you which dependency failed — database, cache, or upstream service — cutting diagnosis time from minutes to seconds.
Safer Deployments: Gate production traffic in CI/CD pipelines on a passing readiness check, so a broken release never reaches users.
Lightweight Monitoring: A minimal health query like { __typename } validates schema parsing and resolver infrastructure with negligible performance overhead.

Understanding GraphQL Health Checks: What They Are and Why You Need Them

A GraphQL health check endpoint is the contract between your API and the infrastructure managing it. When a load balancer or the Kubernetes kubelet needs to know whether a pod can receive traffic, it doesn’t execute a business query; it calls your health endpoint and reads the HTTP status code. A 200 OK means the server is ready (Kubernetes counts any status from 200 through 399 as success); a 503 Service Unavailable pulls it from rotation.

The minimal JSON response for a healthy GraphQL service looks like this:

{
  "status": "ok",
  "timestamp": "2025-06-01T12:00:00Z",
  "dependencies": {
    "database": "ok",
    "cache": "ok"
  }
}

When a dependency fails, return 503 and mark the component:

{
  "status": "degraded",
  "dependencies": {
    "database": "error: connection timeout after 5000ms",
    "cache": "ok"
  }
}

Health checks enable zero-downtime deployments by allowing gradual traffic shifting
Orchestration platforms like Kubernetes rely on health checks for automatic pod recovery
Monitoring systems use health check data to trigger alerts and scaling decisions
Proper health checks reduce mean time to recovery during incidents

The key insight is that health checks serve the infrastructure, not developers. They must be fast (under 1 second), unauthenticated, and accurate. A check that returns 200 while the database is down is actively harmful — it creates false confidence and prevents automatic recovery.

Types of GraphQL Health Checks: Liveness vs. Readiness

Understanding the distinction between liveness and readiness checks is the most important design decision when implementing health checks for GraphQL services. They answer fundamentally different questions and trigger different infrastructure responses.

Liveness answers: “Is the process alive and not deadlocked?” A liveness failure tells Kubernetes the pod is unrecoverable — it gets restarted. Only check the bare minimum: HTTP server responding, event loop not blocked.

Readiness answers: “Is this instance ready to handle production traffic?” A readiness failure removes the pod from service endpoints without restarting it, allowing temporary issues (startup lag, cache warming) to resolve naturally. This is where you check database connections, upstream services, and schema loading.

Check Type	Purpose	What to Check	Failure Consequence	Complexity
Liveness	Is the process alive?	HTTP server responding, no deadlock	Pod restart	Low
Readiness	Ready for traffic?	DB connection, schema loaded, cache available	Removed from load balancer rotation	Medium–High

Never check database connectivity in a liveness probe — a DB outage should not restart your pods
Readiness probes can and should check all critical dependencies
A pod can be live but not ready (e.g., still loading schema)
Separate endpoints: /health/live and /health/ready is a clean pattern

A common mistake is putting database checks inside a liveness probe. If the database goes down, every pod fails its liveness check and gets restarted in a loop — making the outage worse. Readiness probes are built for this: they simply remove the pod from rotation until the database recovers, then automatically restore it.
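
To make the split concrete, here is a minimal sketch of the decision each endpoint makes. The helper name probeStatus and the shape of the checks object are illustrative assumptions, not part of any framework; in a real service the returned status and body would be written to the HTTP response.

```javascript
// Hypothetical helper: decides the HTTP status for each probe path.
// `checks` maps a dependency name to an async function that resolves when
// the dependency is reachable and rejects otherwise.
async function probeStatus(path, checks) {
  if (path === '/health/live') {
    // Liveness: reaching this line already shows the event loop is running.
    // No dependency is consulted on purpose.
    return { status: 200, body: { status: 'ok' } };
  }
  if (path === '/health/ready') {
    const names = Object.keys(checks);
    const results = await Promise.allSettled(
      names.map((name) => Promise.resolve().then(() => checks[name]()))
    );
    const dependencies = {};
    results.forEach((result, i) => {
      dependencies[names[i]] = result.status === 'fulfilled' ? 'ok' : 'error';
    });
    const ready = results.every((result) => result.status === 'fulfilled');
    return {
      status: ready ? 200 : 503,
      body: { status: ready ? 'ok' : 'unavailable', dependencies },
    };
  }
  return { status: 404, body: { status: 'not found' } };
}

// Simulated database outage: liveness stays green, readiness goes red.
const checks = {
  database: async () => { throw new Error('connection refused'); },
  cache: async () => 'PONG',
};

probeStatus('/health/live', checks).then((r) => console.log('live', r.status));
probeStatus('/health/ready', checks).then((r) => console.log('ready', r.status, r.body.dependencies));
```

With the database check failing, the pod would be taken out of rotation by readiness but never restarted by liveness, which is the behavior the paragraph above argues for.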

Readiness probes should validate downstream dependencies. Integrate observability checks to ensure health endpoints reflect true service capability.

Implementing Health Checks in Popular GraphQL Servers

Each major GraphQL server handles health checks differently — some provide built-in endpoints, others require middleware. Here’s what you actually need to configure for each.

Apollo Server Health Checks

Apollo Server 4 does not expose a health endpoint by default. You add one via Express middleware before the GraphQL handler:

import express from 'express';
import { ApolloServer } from '@apollo/server';
import { expressMiddleware } from '@apollo/server/express4';

// typeDefs, resolvers, and db (a connected database client) are assumed to be defined elsewhere
const app = express();
const server = new ApolloServer({ typeDefs, resolvers });
await server.start();

// Health check: mount BEFORE the GraphQL middleware
app.get('/health', async (req, res) => {
  try {
    // Optional: verify the DB connection
    await db.query('SELECT 1');
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'error', message: err.message });
  }
});

// expressMiddleware expects a parsed JSON body, so install express.json() first
app.use('/graphql', express.json(), expressMiddleware(server));

app.listen(4000);

Start Apollo Server and attach Express middleware
Mount /health route before the GraphQL handler
Add dependency checks (DB ping, cache ping) inside the handler
Return 200 for healthy, 503 for any dependency failure
Configure Kubernetes probes to hit this endpoint

“GraphOS Router and Apollo Router Core support a basic HTTP-level health check. This is enabled by default and is served on port 8088 at the URL path /health. This returns a 200 status code if the HTTP server is successfully serving.”
— Apollo GraphQL Docs

GraphQL-level Health Checks for Apollo Server

GraphQL-level health checks validate more components than an HTTP ping: they exercise schema parsing, the execution engine, and resolver infrastructure. The standard pattern uses { __typename }, a minimal query on a built-in meta field that requires no resolver logic of your own:

# Send as HTTP GET with the query URL-encoded. Apollo Server 4's CSRF prevention
# blocks GET requests that cannot trigger a preflight, so include this header.
GET /graphql?query=%7B__typename%7D
apollo-require-preflight: true

# Expected response when healthy
{ "data": { "__typename": "Query" } }

This approach catches issues like schema compilation errors or resolver initialization failures that an HTTP-only check would miss. Keep CSRF prevention enabled rather than switching it off for the probe. If the GraphQL endpoint requires authentication, point probes at a separate unauthenticated health endpoint rather than exposing the full GraphQL endpoint unauthenticated.
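
A probe client for this pattern could be sketched as follows. The function names and the localhost URL are placeholders. The apollo-require-preflight header is included on the assumption that the server is Apollo Server 4 with its default CSRF prevention, which otherwise rejects a plain GET; other servers may not need it.

```javascript
// Builds the request for a GraphQL-level probe. The endpoint is a placeholder.
function buildTypenameProbe(endpoint) {
  const url = new URL(endpoint);
  // URLSearchParams percent-encodes the braces for us.
  url.searchParams.set('query', '{__typename}');
  return {
    url: url.toString(),
    // Assumption: Apollo Server 4 with CSRF prevention enabled.
    headers: { 'apollo-require-preflight': 'true' },
  };
}

// A response counts as healthy only if it is a 200 that carries
// data.__typename and no GraphQL errors.
function isHealthyGraphQLResponse(httpStatus, body) {
  return Boolean(
    httpStatus === 200 &&
      body &&
      !body.errors &&
      body.data &&
      typeof body.data.__typename === 'string'
  );
}

const probe = buildTypenameProbe('http://localhost:4000/graphql');
console.log(probe.url); // http://localhost:4000/graphql?query=%7B__typename%7D
```

Checking the body as well as the status code matters here because a GraphQL server can answer 200 while returning an errors array.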

GraphQL Yoga Server Health Checks

GraphQL Yoga ships with a /health liveness endpoint out of the box, with no configuration needed. In Yoga v3 and later, readiness is not part of the core server; it comes from the @graphql-yoga/plugin-readiness-check plugin, which is also where custom dependency checks go:

import { createYoga } from 'graphql-yoga';
import { useReadinessCheck } from '@graphql-yoga/plugin-readiness-check';

// schema, db, and redis are assumed to be defined elsewhere
const yoga = createYoga({
  schema,
  plugins: [
    useReadinessCheck({
      endpoint: '/readiness',
      check: async () => {
        // Returning false (or throwing) marks the server as not ready
        const dbOk = await db.ping().then(() => true, () => false);
        const cacheOk = await redis.ping().then(() => true, () => false);
        return dbOk && cacheOk;
      },
    }),
  ],
});

Built-in /health (liveness) endpoint, zero config
Plugin-based readiness endpoint with custom logic for any dependency
Works with Node.js, Bun, Cloudflare Workers, and Deno
Readiness endpoint returns 200 or 503 based on plugin result

Hasura GraphQL Engine Health Checks

“Hasura gives a health check endpoint to monitor the status of the GraphQL API. This is available under the /healthz endpoint for all Hasura projects.”
— Hasura Docs

Hasura exposes /healthz out of the box. Query it directly:

# Basic health check
curl https://your-hasura-domain/healthz

# Healthy response: HTTP 200 with a plain-text body
OK

# With metadata consistency check
curl "https://your-hasura-domain/healthz?strict=true"

By default /healthz still returns 200 (with a warning in the body) when Hasura’s metadata is inconsistent. The strict=true parameter turns that case into a 500, which is useful in production where you want to catch drift between the database and Hasura’s tracked tables. Because it is a plain HTTP endpoint, any HTTP-based monitor can poll it, for example the Prometheus blackbox exporter or a Datadog HTTP check.

GraphQL Mesh Health Checks

GraphQL Mesh is built on GraphQL Yoga and inherits its /health and /readiness endpoints. In Mesh’s layered architecture, readiness checks are especially critical because Mesh federates multiple upstream GraphQL sources — a readiness check should reflect the aggregate health of all of them:

// mesh.config.ts
export const config = {
  serve: {
    healthCheckEndpoint: '/health',
    readinessCheckEndpoint: '/readiness',
  },
  // Custom readiness: validate all upstream sources
};

Configure readiness to check whether all upstream sources have successfully loaded their schemas. A Mesh instance that starts before upstream services are ready should fail readiness until federation is complete.
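
One way to compute that aggregate is sketched below. This is a generic helper, not part of Mesh's configuration API: checkUpstreams, the sources map, and the injectable fetchImpl are all assumptions made for illustration. It sends { __typename } to each upstream in parallel and reports ready only when every source answers cleanly.

```javascript
// `sources` maps a source name to its GraphQL endpoint URL (placeholders).
// `fetchImpl` defaults to the global fetch available in Node.js 18+.
async function checkUpstreams(sources, fetchImpl = fetch) {
  const names = Object.keys(sources);
  const results = await Promise.allSettled(
    names.map(async (name) => {
      const response = await fetchImpl(sources[name], {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ query: '{ __typename }' }),
      });
      const payload = await response.json();
      if (!response.ok || payload.errors || !payload.data) {
        throw new Error(`unhealthy (HTTP ${response.status})`);
      }
    })
  );

  const upstreams = {};
  results.forEach((result, i) => {
    upstreams[names[i]] = result.status === 'fulfilled' ? 'ok' : result.reason.message;
  });
  // Ready only when every upstream source passed.
  return { ready: results.every((r) => r.status === 'fulfilled'), upstreams };
}

// Example wiring (URLs are placeholders):
// checkUpstreams({ users: 'http://users:4001/graphql', orders: 'http://orders:4002/graphql' })
//   .then(({ ready, upstreams }) => console.log(ready, upstreams));
```

A function like this could serve as the body of a readiness check, so the gateway stays out of rotation until every upstream it depends on is reachable.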

Custom GraphQL Health Check Implementations

A production-grade health check runs its dependency checks in parallel to keep response time low, applies a per-dependency timeout so one hung call cannot stall the response, and returns structured metadata so monitoring tools can distinguish a full outage from partial degradation.

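import { Router } from 'express';

// db and redis are assumed to be already-connected clients defined elsewhere
const healthRouter = Router();

async function checkWithTimeout(fn, timeoutMs = 3000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    // Whichever settles first wins: the check itself or the timeout
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer); // don't leave a pending timer behind once the check settles
  }
}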

healthRouter.get('/health/ready', async (req, res) => {
  const start = Date.now();

  // Run all checks in parallel — don't let one slow check block others
  const [dbResult, cacheResult] = await Promise.allSettled([
    checkWithTimeout(() => db.query('SELECT 1')),
    checkWithTimeout(() => redis.ping()),
  ]);

  const dependencies = {
    database: dbResult.status === 'fulfilled' ? 'ok' : dbResult.reason.message,
    cache: cacheResult.status === 'fulfilled' ? 'ok' : cacheResult.reason.message,
  };

  const healthy = Object.values(dependencies).every(v => v === 'ok');

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    uptime: process.uptime(),
    responseTime: `${Date.now() - start}ms`,
    dependencies,
  });
});

Use Promise.allSettled, not Promise.all — you want all results even when one fails
Wrap every check in a timeout; a hanging DB call shouldn’t delay your health response
Return responseTime in the body — slow health checks are a performance warning sign
Consider circuit breaker patterns: if a dependency has failed 5 consecutive times, skip checking it for 30 seconds
Never throw inside the health handler — catch everything and map to 200/503

For services that can operate in a degraded state (e.g., serving cached data when the primary DB is slow), return 200 with "status": "degraded" rather than 503. Unlike the handler above, which treats every dependency as critical, this keeps the pod in rotation while signaling to monitoring that intervention may be needed.

Testing Health Check Endpoints

Always test both the happy path and failure scenarios. The most common oversight is only testing that health checks return 200 when everything works, never verifying they return 503 when a dependency is down.

# Test healthy state
curl -i http://localhost:4000/health/ready

# Expected: HTTP/1.1 200 OK
# Body: {"status":"ok","dependencies":{"database":"ok","cache":"ok"}}

# Simulate DB failure (stop your local DB), then:
curl -i http://localhost:4000/health/ready

# Expected: HTTP/1.1 503 Service Unavailable
# Body: {"status":"degraded","dependencies":{"database":"error: ...","cache":"ok"}}

Testing Approach	Use Case	Automation Level	Complexity
Manual curl	Development testing	None	Low
CI/CD pipeline	Pre-deployment validation	Full	Medium
Monitoring system	Production health tracking	Full	High

In CI/CD pipelines, add a post-deploy step that polls the health endpoint until it returns 200 or times out — this is your deployment gate. Combine it with a synthetic failure test: temporarily kill the database container and assert your health endpoint returns 503 before restoring it.
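
The polling gate can be sketched in a few lines. The function name waitForHealthy, its option names, and the default timings are assumptions chosen for illustration; fetchImpl and sleep are injectable only so the loop can be exercised without a network.

```javascript
// Polls `url` until it answers 200 or the deadline passes.
async function waitForHealthy(url, options = {}) {
  const {
    timeoutMs = 60000,
    intervalMs = 2000,
    fetchImpl = fetch, // global fetch, available in Node.js 18+
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  const deadline = Date.now() + timeoutMs;
  let attempts = 0;

  while (Date.now() < deadline) {
    attempts += 1;
    try {
      const response = await fetchImpl(url);
      if (response.status === 200) return { healthy: true, attempts };
    } catch (err) {
      // Connection errors are expected while the service is still starting.
    }
    await sleep(intervalMs);
  }
  return { healthy: false, attempts };
}

// Example use in a deploy script (HEALTH_URL is a placeholder):
// waitForHealthy(process.env.HEALTH_URL)
//   .then(({ healthy }) => process.exit(healthy ? 0 : 1));
```

Exiting non-zero on timeout is what turns the poll into a gate: the pipeline step fails and promotion stops.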

Validate health check logic using unit testing frameworks to ensure endpoints correctly report service state under mocked dependency conditions.

Integrating Health Checks with Deployment Environments

Health checks are the primary interface between your GraphQL service and the infrastructure managing it. Getting this integration right is what enables zero-downtime deployments, automatic recovery, and reliable rolling updates.

Kubernetes Integration

Configure separate liveness and readiness probes pointing to different endpoints. Here’s an example Deployment manifest for a GraphQL service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql-api
spec:
  selector:
    matchLabels:
      app: graphql-api
  template:
    metadata:
      labels:
        app: graphql-api   # must match spec.selector
    spec:
      containers:
        - name: graphql-api
          image: your-graphql-image:latest
          ports:
            - containerPort: 4000
          livenessProbe:
            httpGet:
              path: /health/live    # Only checks: is the HTTP server responding?
              port: 4000
            initialDelaySeconds: 15  # Wait for app startup
            periodSeconds: 20
            failureThreshold: 3
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health/ready   # Checks DB, cache, schema
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
            timeoutSeconds: 5

Set initialDelaySeconds based on your actual startup time — measure it, don’t guess
Point liveness to a lightweight endpoint (HTTP server only, no DB checks)
Point readiness to your full dependency-checking endpoint
Set failureThreshold: 3 to tolerate transient failures without premature action
Monitor probe behavior in production via kubectl describe pod and events

For GraphQL services that load large schemas or pre-warm caches on startup, increase initialDelaySeconds generously. A pod that restarts because its liveness probe fired too early (before startup completed) is a common configuration mistake that causes deployment instability.
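
A quick back-of-the-envelope calculation shows how the probe settings translate into time. The two helper functions below are illustrative approximations, not Kubernetes formulas: the kubelet adds jitter and real timing varies, so treat the results as rough budgets.

```javascript
// Roughly how long after container start the app must begin answering before
// the liveness probe has failed `failureThreshold` times in a row.
function startupDeadlineSeconds({ initialDelaySeconds, periodSeconds, failureThreshold }) {
  // First probe at initialDelaySeconds, then one more every periodSeconds.
  return initialDelaySeconds + (failureThreshold - 1) * periodSeconds;
}

// Rough upper bound on how long a failure can go unnoticed once the pod is
// running: failureThreshold probe periods plus the timeout of the last probe.
function worstCaseDetectionSeconds({ periodSeconds, failureThreshold, timeoutSeconds }) {
  return periodSeconds * failureThreshold + timeoutSeconds;
}

// Values taken from the manifest above.
const livenessProbe = { initialDelaySeconds: 15, periodSeconds: 20, failureThreshold: 3, timeoutSeconds: 5 };
const readinessProbe = { initialDelaySeconds: 10, periodSeconds: 10, failureThreshold: 3, timeoutSeconds: 5 };

console.log(startupDeadlineSeconds(livenessProbe));     // 55
console.log(worstCaseDetectionSeconds(livenessProbe));  // 65
console.log(worstCaseDetectionSeconds(readinessProbe)); // 35
```

Under these assumptions, a service that needs more than about 55 seconds to start would be restarted by the liveness probe before it ever becomes ready, which is the instability described above.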

Kubernetes HTTP probes treat any status code from 200 through 399 as success and everything else as failure, so follow GraphQL HTTP status code conventions and return 200 or 503 to make sure the orchestration layer interprets service health correctly.

Docker Health Check Configuration

Add a HEALTHCHECK instruction to your Dockerfile to enable container-level health monitoring:

FROM node:20-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

EXPOSE 4000

HEALTHCHECK \
  --interval=30s \
  --timeout=10s \
  --start-period=60s \
  --retries=3 \
  CMD wget -qO- http://localhost:4000/health/ready || exit 1

CMD ["node", "server.js"]

--interval=30s — check every 30 seconds in production
--timeout=10s — kill the check if it hasn’t responded in 10 seconds
--start-period=60s — grace period after container start before failures count
--retries=3 — require 3 consecutive failures before marking unhealthy
Use wget instead of curl in Alpine-based images — it’s included by default

Docker health status is visible via docker inspect --format='{{.State.Health.Status}}' container_name. Kubernetes ignores the Dockerfile HEALTHCHECK instruction and relies solely on the probes in the pod spec, so treat HEALTHCHECK as container runtime visibility for local debugging and standalone Docker Compose environments rather than as a substitute for probes.

Using in Containerized Environments

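The most common health check failure in containerized environments is a server bound to 127.0.0.1 (localhost). Kubernetes probes are sent by the kubelet to the pod’s IP address, so they originate outside the container’s network namespace and cannot reach a localhost-bound service. A Docker HEALTHCHECK command runs inside the container and can still reach localhost, which makes this failure easy to miss until the image is deployed to a cluster:

// Wrong: Kubernetes probes (and all external traffic) cannot reach this
server.listen(4000, '127.0.0.1');

// Correct: bind to all interfaces
server.listen(4000, '0.0.0.0');
// Or simply (Node.js listens on all interfaces when no host is given):
server.listen(4000);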

Verify your service is bound correctly inside a running container:

# Check what interfaces the server is listening on
docker exec container_name netstat -tlnp | grep 4000
# Should show 0.0.0.0:4000 (or :::4000 for a dual-stack listener), not 127.0.0.1:4000

Advanced Health Check Strategies

Production systems need more than binary healthy/unhealthy states. Advanced strategies handle the reality that dependencies fail partially, transiently, or in ways that don’t require taking the entire service offline.

Strategy	Complexity	Effectiveness	Best Use Cases
Partial availability	Medium	High	Services with optional dependencies (analytics, search)
Graceful degradation	High	Very High	High-traffic systems with cache fallback
Circuit breaking	High	High	Services with unreliable external APIs

Circuit breaking inside health checks prevents cascading failures. If a dependency has failed 5 consecutive times, stop checking it for 30 seconds and report it as “circuit open” rather than hammering a failing service with health check requests. Libraries like opossum implement this pattern for Node.js GraphQL services.
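
The following is a hand-rolled sketch of that behavior, written without a library so the state machine is visible. It is not opossum's API; the class name, option names, and the "circuit open" status string are assumptions for illustration. The clock is injectable so the cooldown can be tested without waiting.

```javascript
class HealthCircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.consecutiveFailures = 0;
    this.openedAt = null; // timestamp at which the circuit opened, or null when closed
  }

  // Runs `check` unless the circuit is open, and returns a status string.
  async run(check) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        return 'circuit open'; // skip the dependency entirely during cooldown
      }
      this.openedAt = null; // cooldown elapsed: allow one trial check
    }
    try {
      await check();
      this.consecutiveFailures = 0;
      return 'ok';
    } catch (err) {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      return `error: ${err.message}`;
    }
  }
}

// Example: one breaker per dependency, reused across health requests.
// const dbBreaker = new HealthCircuitBreaker();
// const database = await dbBreaker.run(() => db.query('SELECT 1'));
```

Because the failure counter is only reset on success, a failed trial check after the cooldown reopens the circuit immediately instead of allowing another five attempts.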

Graceful degradation is the most valuable pattern for high-traffic GraphQL APIs. If your service can serve cached data during a database outage, your health check should return 200 with "status": "degraded" — keeping the pods in rotation while your alerting notifies on-call engineers of the underlying issue.
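
The decision between 200 and 503 then comes down to which dependencies are critical. A minimal sketch, assuming dependency results are already collected as 'ok' or an error string, and using a hypothetical evaluateHealth helper:

```javascript
// `results` maps a dependency name to 'ok' or an error string.
// `critical` lists the dependencies the service cannot run without.
function evaluateHealth(results, critical) {
  const failed = Object.keys(results).filter((name) => results[name] !== 'ok');
  const criticalDown = failed.some((name) => critical.includes(name));

  if (criticalDown) return { httpStatus: 503, status: 'error', failed };
  if (failed.length > 0) return { httpStatus: 200, status: 'degraded', failed };
  return { httpStatus: 200, status: 'ok', failed };
}

// A service that can answer from cache treats only the cache as critical,
// so a database outage leaves it in rotation but flagged as degraded.
const outcome = evaluateHealth(
  { database: 'error: connection timeout after 5000ms', cache: 'ok' },
  ['cache']
);
console.log(outcome.httpStatus, outcome.status); // 200 degraded
```

Which dependencies belong in the critical list is a product decision; the sketch only shows how that list maps to status codes.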

Correlate health check failures with GraphQL timeout configurations — a service that times out frequently may report healthy while consistently failing real user requests.

Logging and Monitoring Health Checks

Log health check failures in structured JSON — successful checks rarely need logging in production, but failures should include enough context for immediate diagnosis:

// Only log failures and slow checks (responseTime is a number of milliseconds here)
if (!healthy || responseTime > 500) {
  logger.warn({
    event: 'health_check',
    status: healthy ? 'ok' : 'degraded',
    responseTime,
    dependencies,
    timestamp: new Date().toISOString(),
  });
}

Health check patterns in logs reveal system behavior before it becomes a user-visible incident. Set alerts on sustained failure patterns — not individual failures — to reduce noise. A single 503 from a health check is often a transient network blip; three consecutive 503s from the same pod signal a real problem.
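
The "three consecutive failures" rule can be expressed as a small per-pod counter. This is a sketch of the alerting logic only; the function name and the idea of keying by pod name are assumptions, and most monitoring tools offer an equivalent built-in setting.

```javascript
// Returns a `record` function that reports true exactly once per failure
// streak, at the moment the streak reaches `threshold`.
function createFailureStreakTracker(threshold = 3) {
  const streaks = new Map(); // pod name -> consecutive failure count

  return function record(pod, httpStatus) {
    if (httpStatus === 200) {
      streaks.set(pod, 0); // any success resets the streak
      return false;
    }
    const streak = (streaks.get(pod) || 0) + 1;
    streaks.set(pod, streak);
    return streak === threshold;
  };
}

const record = createFailureStreakTracker(3);
console.log(record('pod-a', 503)); // false
console.log(record('pod-a', 503)); // false
console.log(record('pod-a', 503)); // true
```

Firing only when the streak reaches the threshold, rather than on every later failure, keeps a long outage from producing a new alert on each probe.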

Log health check failures with dependency-level error detail
Track responseTime — consistently slow health checks indicate latency problems
Use structured JSON logging so log pipelines such as Grafana Loki or Datadog can parse each field; expose failure counts as metrics if you want Prometheus to scrape them
Alert on sustained failure patterns (3+ consecutive), not individual failures
Correlate health check failures with deployment events in your monitoring tool

Correlate health check failures with resolver-level metrics using GraphQL monitoring dashboards to accelerate root cause analysis during incidents.

Best Practices and Common Pitfalls

DO: Keep health checks fast — target under 500ms, hard limit 1 second
DO: Separate liveness (/health/live) from readiness (/health/ready) endpoints
DO: Run dependency checks in parallel with Promise.allSettled
DO: Return 200/503 HTTP status codes — orchestration platforms read the code, not the body
DO: Test failure scenarios explicitly during development
DON’T: Put database checks in liveness probes — this causes restart loops during DB outages
DON’T: Require authentication on health endpoints — it adds failure points and complexity
DON’T: Bind your server to 127.0.0.1 in containers — health checks won’t reach it
DON’T: Log every successful check — 200 successful checks per minute creates noise that obscures failures
DON’T: Let a health check hang — always wrap dependency calls in timeouts

The single most critical principle: your health check must accurately reflect your service’s ability to handle real user requests. A health check that returns 200 while your resolvers are all failing is worse than no health check — it prevents automatic recovery and delays incident response. Build your health checks to fail loudly when the service is genuinely broken, and stay silent when everything works.

Use GraphQL load testing to validate that your health check endpoints remain accurate and responsive under peak traffic conditions. A health check that returns correct results at 10 RPS may time out at 1,000 RPS.

Frequently Asked Questions

Q: What is a GraphQL health check?

A GraphQL health check is a dedicated HTTP endpoint (typically /health or /healthz) that verifies whether your GraphQL server is operational and capable of handling requests. Unlike a basic TCP ping, a proper health check validates critical dependencies — database connectivity, cache availability, schema loading — and returns a structured JSON response with 200 OK when healthy and 503 Service Unavailable when not. This allows load balancers, Kubernetes, and monitoring systems to make intelligent traffic-routing and recovery decisions automatically.

Q: Why are health checks important for GraphQL services?

Health checks are the primary interface between your GraphQL service and the infrastructure managing it. Without them, Kubernetes cannot restart broken pods, load balancers cannot remove failing instances, and CI/CD pipelines cannot gate deployments on a working service. A GraphQL server can respond to TCP pings while failing to execute any query — health checks catch these silent failures. They reduce mean time to recovery by enabling automatic remediation without waiting for a human to notice and respond to an incident.

Q: What is the difference between liveness and readiness checks?

Liveness checks answer “is the process alive?” — they should only verify that the HTTP server is responding and the event loop isn’t blocked. A liveness failure triggers a pod restart. Readiness checks answer “is this instance ready to handle traffic?” — they verify database connectivity, schema loading, cache availability, and upstream service health. A readiness failure removes the pod from the load balancer rotation without restarting it. The critical rule: never put database checks in a liveness probe, or a database outage will cause all your pods to restart in a loop.
