Hiring a DevOps engineer is one of the most technically nuanced recruiting challenges an organization faces. The role spans infrastructure, automation, software delivery, security, and operations — and a candidate who interviews well on CI/CD concepts might have no real production experience managing infrastructure at scale.
This guide provides 50+ DevOps interview questions organized by category and difficulty level, along with guidance on what strong answers look like. Whether you are a hiring manager building an interview panel, a recruiter screening candidates, or a DevOps engineer preparing for interviews, this resource will help you assess (or demonstrate) the skills that matter.
If you are actively hiring for DevOps roles, our DevOps and SRE recruiting team can help you find engineers who have been technically vetted against questions like these.
How to Use This Guide
Each question includes a difficulty level (Junior, Mid, Senior) and what to look for in the candidate’s response. The difficulty levels correspond roughly to:
- Junior (0-2 years): Understands fundamental concepts and has used the tools in a learning or early professional context.
- Mid-Level (3-5 years): Has production experience and can explain trade-offs, not just definitions.
- Senior (6+ years): Can discuss architectural decisions, failure modes, scaling challenges, and has opinions backed by experience.
A strong interview process uses questions from multiple categories and multiple difficulty levels. Do not just ask about the candidate’s strongest area — probe for breadth as well as depth.
CI/CD Pipeline Questions
1. What is the difference between continuous integration, continuous delivery, and continuous deployment?
Difficulty: Junior
What to look for: Continuous integration is the practice of merging code changes to a shared branch frequently (multiple times per day) with automated builds and tests. Continuous delivery extends CI by ensuring the codebase is always in a deployable state, with automated release processes that can deploy to production at the push of a button. Continuous deployment goes one step further by automatically deploying every change that passes the pipeline to production with no manual gate. Strong candidates will note that most organizations practice continuous delivery, not continuous deployment, because they want a human approval step before production releases.
2. Describe how you would design a CI/CD pipeline for a microservices application with 15-20 services.
Difficulty: Senior
What to look for: The candidate should discuss monorepo vs polyrepo considerations and their impact on pipeline design. Look for mention of pipeline-per-service with shared pipeline templates or libraries, independent deployability of services, contract testing between services, canary or blue-green deployment strategies, and rollback mechanisms. Senior candidates should discuss how to handle cross-service dependencies during deployment and database migration strategies.
3. What is the purpose of a build artifact repository, and which ones have you used?
Difficulty: Junior
What to look for: Understanding that artifact repositories (JFrog Artifactory, Nexus, GitHub Packages, AWS ECR, Azure Container Registry) store versioned build outputs so deployments are reproducible and decoupled from the build process. The candidate should understand why you deploy artifacts rather than rebuilding from source at deployment time.
4. How do you handle secrets management in a CI/CD pipeline?
Difficulty: Mid
What to look for: Secrets should never be stored in source code or pipeline configuration files. Look for experience with secrets management tools (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or platform-native CI/CD secret storage). Strong candidates will discuss secret rotation, least-privilege access, audit logging, and the difference between build-time and runtime secret injection.
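As a hedged illustration of runtime injection, a CI step can pull a secret from the platform's encrypted store into an environment variable instead of hard-coding it. This sketch uses GitHub Actions syntax; the secret name DEPLOY_TOKEN and the deploy script are hypothetical:

```yaml
# Runtime secret injection in GitHub Actions.
# DEPLOY_TOKEN is configured in the repository's encrypted secret
# store, never committed to the repo or the workflow file.
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
        run: ./deploy.sh   # reads the token from the environment at runtime
```

Candidates should be able to explain why this is still weaker than fetching short-lived credentials from a secrets manager at runtime.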
5. Explain the concept of pipeline as code. What are the advantages over GUI-configured pipelines?
Difficulty: Junior
What to look for: Pipeline as code (e.g., Jenkinsfile, .gitlab-ci.yml, GitHub Actions workflows, Azure DevOps YAML) means defining pipeline configuration in version-controlled files alongside the application code. Advantages include version history, code review of pipeline changes, reproducibility, branch-specific pipeline behavior, and the ability to test pipeline changes in feature branches before merging.
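A minimal pipeline-as-code example helps anchor the discussion. This is a sketch in GitHub Actions syntax; the job names and `make test` command are placeholders:

```yaml
# .github/workflows/ci.yml — versioned and reviewed alongside the app code.
name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test
```

Because this file lives in the repository, a change to the pipeline goes through the same pull-request review as a change to the application.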
6. What strategies would you use to reduce CI/CD pipeline execution time from 45 minutes to under 10 minutes?
Difficulty: Senior
What to look for: Parallelization of test suites, caching (dependency caches, Docker layer caching, build caches), incremental builds (only building and testing changed services in a monorepo), test splitting across multiple runners, moving slow integration tests to a separate pipeline stage that does not block deployment, optimizing Docker image sizes, and using faster runner hardware. Senior candidates should discuss the trade-offs between speed and thoroughness.
7. How do you implement rollback in a CI/CD pipeline?
Difficulty: Mid
What to look for: Multiple approaches: redeploying a previous known-good artifact version, blue-green deployment (switching traffic back to the previous environment), feature flags (disabling the problematic feature without redeploying), and database rollback strategies (backward-compatible migrations). The candidate should understand that rollback is not just redeploying old code — it must account for database state, configuration changes, and dependent service compatibility.
Containerization and Orchestration Questions
8. Explain the difference between a Docker image and a Docker container.
Difficulty: Junior
What to look for: An image is a read-only template (built from a Dockerfile) that contains the application code, runtime, libraries, and dependencies. A container is a running instance of an image — an isolated process with its own filesystem, networking, and process space. Multiple containers can be created from the same image. This is a fundamental concept, and candidates who struggle here likely lack hands-on container experience.
9. What is a multi-stage Docker build, and why would you use one?
Difficulty: Mid
What to look for: Multi-stage builds use multiple FROM statements in a single Dockerfile, allowing you to use one stage for building/compiling and a separate, minimal stage for the runtime image. This produces smaller, more secure production images because build tools, source code, and intermediate artifacts are not included in the final image. Candidates should be able to provide a concrete example, such as compiling a Go binary in one stage and copying it into a scratch or alpine image.
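The Go example the answer mentions can be sketched as follows (the module path `./cmd/server` and image tags are illustrative):

```dockerfile
# Build stage: full Go toolchain, source code, dependencies.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: minimal image with only the compiled binary.
# No compiler, shell, or source code ships to production.
FROM gcr.io/distroless/static
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```

A candidate who has done this should be able to quantify the result, e.g. a toolchain image of several hundred MB shrinking to a runtime image of a few MB.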
10. How does Kubernetes handle service discovery?
Difficulty: Mid
What to look for: Kubernetes provides built-in service discovery through Services (ClusterIP, NodePort, LoadBalancer) and DNS. When a Service is created, Kubernetes assigns it a stable DNS name (service-name.namespace.svc.cluster.local) that other pods can use to communicate. CoreDNS resolves these names to the Service’s ClusterIP, which load-balances across healthy pods. Strong candidates will also mention headless services, endpoint slices, and external service discovery patterns.
11. Explain the difference between a Deployment, StatefulSet, and DaemonSet in Kubernetes.
Difficulty: Mid
What to look for: Deployments manage stateless applications with interchangeable pods, supporting rolling updates and rollbacks. StatefulSets manage stateful applications where each pod has a unique, persistent identity, stable network identifiers, and ordered deployment/scaling (e.g., databases, message queues). DaemonSets ensure a copy of a pod runs on every node (or a subset of nodes), used for node-level operations like log collection, monitoring agents, and network plugins.
12. How would you troubleshoot a pod stuck in CrashLoopBackOff?
Difficulty: Junior
What to look for: Check pod events with kubectl describe pod, examine container logs with kubectl logs (including --previous for crashed container logs), verify resource limits are not too restrictive, check liveness/readiness probe configurations, validate environment variables and secrets are correctly mounted, and verify the container image exists and is pullable. Strong candidates will have a systematic troubleshooting methodology rather than random guessing.
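The systematic sequence described above might look like the following. Pod and namespace names are placeholders, and the commands require access to a live cluster:

```shell
kubectl describe pod my-pod -n my-ns      # events: OOMKilled? probe failures? pull errors?
kubectl logs my-pod -n my-ns --previous   # logs from the last crashed container
kubectl get pod my-pod -n my-ns -o yaml   # inspect probes, limits, env vars, mounts
kubectl get events -n my-ns --sort-by=.lastTimestamp
```

A good follow-up is asking the candidate what each common exit code (137 for OOM kill, 1 for application error) tells them.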
13. Describe how you would implement zero-downtime deployments in Kubernetes.
Difficulty: Senior
What to look for: Rolling update strategy with appropriate maxUnavailable and maxSurge settings, properly configured readiness probes (so traffic is not sent to pods that are not ready), preStop lifecycle hooks (to allow in-flight requests to complete before pod termination), pod disruption budgets (to prevent too many pods from being unavailable during voluntary disruptions), and graceful shutdown handling in the application. Senior candidates may also discuss canary deployments, service mesh traffic splitting, or Argo Rollouts.
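Several of the settings above live in the Deployment spec itself. This is a sketch, not a production manifest; the image, probe path, and port are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  strategy:
    rollingUpdate:
      maxUnavailable: 0   # never drop below full capacity during the rollout
      maxSurge: 1         # bring up one new pod at a time
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: web
          image: example/web:1.2.3
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]  # delay shutdown so the endpoint is deregistered first
```

The application must also handle SIGTERM gracefully; the manifest alone does not guarantee zero downtime.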
14. What are Kubernetes resource requests and limits, and how do you determine appropriate values?
Difficulty: Mid
What to look for: Requests are the guaranteed minimum resources a container needs (used by the scheduler for placement decisions). Limits are the maximum resources a container can use (enforced by the kubelet). Setting requests too low leads to resource contention and throttling; setting them too high wastes cluster capacity. The candidate should describe using monitoring data (Prometheus metrics, Vertical Pod Autoscaler recommendations) from actual production workloads to right-size these values rather than guessing.
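The container-level syntax is small, but candidates should know what each line means operationally. Values here are illustrative only; real numbers come from production metrics:

```yaml
resources:
  requests:
    cpu: 250m       # scheduler reserves this much when placing the pod
    memory: 256Mi
  limits:
    cpu: "1"        # CPU is throttled above this
    memory: 512Mi   # the container is OOM-killed above this
```

A useful probe: ask what happens when memory usage exceeds the limit versus when CPU usage does, since the failure modes (kill vs throttle) are different.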
Infrastructure as Code Questions
15. What is Infrastructure as Code, and what problems does it solve?
Difficulty: Junior
What to look for: IaC is the practice of defining infrastructure (servers, networks, databases, etc.) in declarative or imperative code files that are version-controlled, reviewable, and reproducible. It solves configuration drift, undocumented changes, environment inconsistency, and the inability to reproduce infrastructure reliably. The candidate should understand the difference between declarative (Terraform, CloudFormation, Pulumi) and imperative (scripts, Ansible playbooks) approaches.
16. Explain the difference between Terraform and Ansible. When would you use each?
Difficulty: Mid
What to look for: Terraform is a declarative infrastructure provisioning tool: you describe the desired end state, and Terraform calls provider APIs to create or modify resources until reality matches. Ansible is primarily a configuration management and automation tool that runs procedural tasks on existing infrastructure. Terraform excels at creating and managing cloud resources (VMs, networks, databases). Ansible excels at configuring those resources after they exist (installing packages, managing config files, running scripts). Most mature environments use both: Terraform for provisioning, Ansible (or cloud-init, Packer) for configuration.
17. How do you manage Terraform state in a team environment?
Difficulty: Mid
What to look for: Remote state backends (S3 + DynamoDB for locking, Azure Blob Storage, Terraform Cloud, GCS) are essential for team use. The candidate should discuss state locking to prevent concurrent modifications, state file security (it often contains sensitive values), and state file organization (per-environment, per-service, or per-team). Senior candidates should discuss state import, state moves, and strategies for refactoring Terraform configurations without destroying and recreating resources.
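A typical remote backend configuration for a team looks like this sketch (bucket, key, and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"  # prevents concurrent applies
    encrypt        = true               # state often contains sensitive values
  }
}
```

Candidates with real team experience will have opinions on the `key` layout, since it determines the blast radius of any single state file.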
18. What is the difference between Terraform modules and workspaces?
Difficulty: Mid
What to look for: Modules are reusable packages of Terraform configuration — they encapsulate a set of related resources (e.g., a VPC module, a Kubernetes cluster module) that can be called with different parameters. Workspaces provide separate state files within the same configuration, allowing you to manage multiple environments (dev, staging, prod) from the same codebase. The candidate should note that many teams prefer directory-based environment separation over workspaces for clarity, and understand the trade-offs of each approach.
19. Describe a time when an IaC change caused an unexpected production issue. How did you handle it?
Difficulty: Senior
What to look for: This is a behavioral question that reveals real-world experience. Look for specific details: what the change was, how it affected production (data loss, downtime, security exposure), how they detected the issue, their incident response process, and what preventive measures they implemented afterward (plan review, better testing, drift detection, blast radius reduction). Candidates who have never had a production IaC issue either have very limited experience or are not being honest.
20. How do you handle secrets and sensitive data in Terraform?
Difficulty: Mid
What to look for: Never commit secrets to Terraform files or state. Use variable files excluded from version control, environment variables, or integrate with a secrets manager (Vault, AWS Secrets Manager). Mark sensitive outputs with sensitive = true. Understand that Terraform state contains sensitive values in plaintext and must be stored securely. Discuss strategies like using data sources to reference secrets stored externally rather than passing them as variables.
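The data-source pattern the answer describes can be sketched like this (secret path and resource names are hypothetical):

```hcl
# Fetch the secret at plan/apply time instead of committing it.
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  # ... other arguments omitted for brevity
  password = data.aws_secretsmanager_secret_version.db.secret_string
}
```

A strong candidate will point out the caveat: the value still lands in the state file in plaintext, so securing the state backend remains essential.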
Monitoring, Observability, and Incident Response
21. What is the difference between monitoring and observability?
Difficulty: Mid
What to look for: Monitoring is the practice of collecting and analyzing predefined metrics and logs to detect known failure modes (e.g., CPU > 90%, error rate > 1%). Observability is the ability to understand the internal state of a system from its external outputs — it helps you diagnose unknown failure modes and answer questions you did not anticipate. Observability is built on three pillars: metrics (quantitative measurements), logs (discrete event records), and traces (request paths through distributed systems). Strong candidates will explain that monitoring tells you when something is wrong; observability helps you understand why.
22. How would you set up alerting that minimizes alert fatigue?
Difficulty: Senior
What to look for: Alert on symptoms, not causes (e.g., alert on user-visible error rates, not CPU usage). Use severity tiers: pages for critical user-impacting issues, tickets for non-urgent issues. Set meaningful thresholds based on SLOs, not arbitrary values. Implement proper alert routing (on-call rotations, escalation policies). Regularly review and prune alerts — if an alert fires frequently and is always ignored, either fix the underlying issue or remove the alert. Strong candidates will reference Google’s SRE book principles.
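A symptom-based alert in Prometheus rule syntax makes the principle concrete. Metric names and the 1% threshold are illustrative, not a recommendation:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m                # must persist before firing, filtering blips
        labels:
          severity: page       # user-impacting: wake someone up
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```

Contrast this with a `node_cpu_seconds_total > 0.9` alert, which pages on a cause that may have no user impact at all.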
23. Explain distributed tracing. Why is it important for microservices?
Difficulty: Mid
What to look for: Distributed tracing tracks a request as it flows through multiple services, creating a trace that shows the full request lifecycle, including latency at each service, dependencies between services, and where errors occur. Tools include Jaeger, Zipkin, AWS X-Ray, and OpenTelemetry (the emerging standard). It is important for microservices because a single user request may touch 5-20 services, and without tracing, diagnosing latency or failures requires manually correlating logs across services.
24. Describe your incident response process for a production outage.
Difficulty: Senior
What to look for: A structured approach: detection (monitoring/alerting triggers), triage (assess severity and user impact), communication (status page update, stakeholder notification), diagnosis (systematic investigation using observability tools), mitigation (restore service, even if the root cause is not yet understood), root cause analysis (post-incident investigation), and prevention (implementing fixes and process improvements). Senior candidates should discuss blameless post-mortems, incident commanders, communication protocols, and how they balance speed of mitigation with thoroughness of diagnosis.
25. What SLIs, SLOs, and SLAs would you define for a customer-facing API?
Difficulty: Senior
What to look for: SLIs (Service Level Indicators) are specific metrics: request latency (p50, p95, p99), error rate, availability (percentage of successful requests). SLOs (Service Level Objectives) are target values for SLIs: 99.9% availability, p99 latency under 500ms. SLAs (Service Level Agreements) are contractual commitments with consequences for violations. The candidate should discuss how to choose meaningful SLIs that reflect user experience, how to set SLOs that balance reliability with velocity (error budgets), and the relationship between internal SLOs and external SLAs.
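The error-budget arithmetic behind an SLO is worth making concrete; strong candidates can do this on a whiteboard. A minimal sketch, using the 99.9% target from the answer above:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime over the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, observed: float, window_days: int = 30) -> float:
    """Budget left after observed availability over the same window."""
    spent = window_days * 24 * 60 * (1 - observed)
    return error_budget_minutes(slo, window_days) - spent

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))        # 43.2
# At 99.95% observed availability, half the budget remains.
print(round(budget_remaining(0.999, 0.9995), 1))    # 21.6
```

Error budgets turn the SLO into a velocity control: while budget remains, ship; when it is exhausted, prioritize reliability work.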
Cloud Platform Questions
26. Compare the networking models of AWS, Azure, and GCP at a high level.
Difficulty: Senior
What to look for: All three use VPC/VNet concepts with subnets, but differ in scope and defaults. AWS VPCs are regional with AZ-scoped subnets; Azure VNets are regional with region-scoped subnets; GCP VPCs are global with region-scoped subnets. The candidate should discuss routing, peering, and connectivity options (Transit Gateway, Azure Virtual WAN, GCP Cloud Interconnect). Deep expertise in one platform is more valuable than shallow knowledge of all three, but a senior engineer should understand the conceptual mapping between platforms.
27. How do you implement least-privilege IAM policies in AWS or Azure?
Difficulty: Mid
What to look for: Start with zero permissions and grant only what is needed. Use managed policies where available, but create custom policies for fine-grained control. Leverage IAM Access Analyzer (AWS) or Azure AD access reviews to identify unused permissions. Implement service-linked roles for AWS services, managed identities for Azure. Use conditions (IP restrictions, MFA requirements, time-based access) to further restrict access. Avoid wildcard permissions in production. The candidate should understand that IAM is the most important security control in any cloud environment.
28. What is the shared responsibility model, and how does it affect DevOps practices?
Difficulty: Junior
What to look for: Cloud providers are responsible for security of the cloud (physical infrastructure, hypervisor, network infrastructure). Customers are responsible for security in the cloud (data, applications, identity management, network configuration, OS patching for IaaS). The boundary shifts depending on the service model (IaaS vs PaaS vs SaaS). DevOps teams must understand what they are responsible for at each layer and automate security controls accordingly (patch management, vulnerability scanning, compliance checks).
29. Explain the trade-offs between serverless (Lambda/Functions) and container-based architectures.
Difficulty: Senior
What to look for: Serverless advantages: no infrastructure management, automatic scaling to zero, pay-per-execution pricing. Serverless disadvantages: cold start latency, execution time limits, vendor lock-in, limited runtime customization, difficulty with stateful workloads, and debugging complexity. Containers offer more control, portability, predictable performance, and better cost efficiency at sustained high throughput. The right choice depends on workload characteristics: event-driven, bursty workloads with low throughput favor serverless; steady-state, high-throughput workloads favor containers. Many production systems use both.
30. How do you manage cloud costs as a DevOps engineer?
Difficulty: Mid
What to look for: Implement tagging standards for cost allocation. Use reserved instances or savings plans for predictable workloads. Rightsize instances based on actual utilization data. Set up budget alerts. Automate shutdown of non-production environments outside business hours. Use spot/preemptible instances for fault-tolerant workloads. Regularly review and delete unused resources (unattached disks, old snapshots, idle load balancers). The candidate should view cost management as an ongoing practice, not a one-time exercise.
Security and Compliance Questions
31. What is DevSecOps, and how does it differ from traditional security practices?
Difficulty: Junior
What to look for: DevSecOps integrates security practices into every phase of the software delivery lifecycle rather than treating security as a gate at the end. This includes automated security scanning in CI/CD pipelines (SAST, DAST, SCA, container image scanning), infrastructure security validation (IaC security scanning, compliance as code), and shifting security left so vulnerabilities are caught earlier when they are cheaper to fix. The cultural aspect is important: security is everyone’s responsibility, not just the security team’s.
32. How do you implement container image security in a CI/CD pipeline?
Difficulty: Mid
What to look for: Use minimal base images (distroless, alpine). Scan images for vulnerabilities (Trivy, Snyk, Prisma Cloud) in the pipeline and block deployment if critical vulnerabilities are found. Sign images and enforce signature verification at deployment time. Use read-only root filesystems. Run containers as non-root users. Implement image pull policies that prevent running unscanned images. Regularly rebuild images to incorporate base image security patches.
33. Describe how you would implement network segmentation in a cloud environment.
Difficulty: Senior
What to look for: Use VPCs/VNets to isolate environments (production, staging, development). Implement subnet-level controls with network ACLs. Use security groups for instance-level firewall rules. Deploy private endpoints for cloud services to keep traffic off the public internet. Implement service mesh (Istio, Linkerd) for service-to-service mTLS and network policies in Kubernetes. Use VPC flow logs for traffic analysis and anomaly detection. The candidate should discuss the principle of least connectivity and how to balance security with operational complexity.
34. How do you handle compliance requirements (SOC 2, HIPAA, PCI-DSS) in a DevOps environment?
Difficulty: Senior
What to look for: Implement compliance as code using tools like Open Policy Agent, Sentinel, or AWS Config Rules. Automate evidence collection for audits. Use IaC to ensure infrastructure configurations are compliant and auditable. Implement comprehensive logging and audit trails. Encrypt data at rest and in transit. Manage access through centralized identity providers with MFA. The candidate should understand that compliance is not a one-time audit but an ongoing operational practice that must be baked into the delivery pipeline.
35. What is a supply chain attack in the context of software delivery, and how do you protect against it?
Difficulty: Senior
What to look for: Supply chain attacks target the software delivery pipeline itself — compromised dependencies, malicious packages, tampered build artifacts, or compromised CI/CD infrastructure. Protection measures include dependency pinning and lock files, vulnerability scanning of dependencies (Dependabot, Snyk), SBOM (Software Bill of Materials) generation, build provenance verification (SLSA framework, Sigstore), private package registries, and CI/CD pipeline hardening (ephemeral runners, minimal permissions).
Linux and Networking Questions
36. A server is running slowly. Walk me through your troubleshooting process.
Difficulty: Mid
What to look for: Systematic approach using USE (Utilization, Saturation, Errors) or RED (Rate, Errors, Duration) methodology. Check CPU (top, htop, mpstat), memory (free, vmstat), disk I/O (iostat, iotop), network (netstat, ss, iftop), and process state (ps, strace). Look at system logs (dmesg, journalctl). Check for resource contention from other processes or containers. The candidate should have a repeatable methodology, not just throw random commands at the wall.
37. Explain the difference between TCP and UDP. When would you use each?
Difficulty: Junior
What to look for: TCP is connection-oriented, provides guaranteed delivery with ordering and error checking, and uses a three-way handshake. UDP is connectionless, provides no delivery guarantee, and has lower overhead. TCP is used for HTTP/HTTPS, SSH, database connections — anything where data integrity is critical. UDP is used for DNS, video streaming, VoIP, and gaming — where speed matters more than guaranteed delivery. Strong candidates will mention that HTTP/3 (QUIC) uses UDP as its transport layer.
38. What happens when you type a URL into a browser? Explain the full network path.
Difficulty: Mid
What to look for: DNS resolution (recursive resolver, root servers, TLD servers, authoritative servers, caching), TCP connection (three-way handshake), TLS handshake (for HTTPS), HTTP request, server processing, HTTP response, browser rendering. This classic question reveals how well the candidate understands the full networking stack. Strong candidates will discuss CDNs, load balancers, reverse proxies, and connection pooling as they appear in production architectures.
39. How do you troubleshoot DNS resolution issues in a Kubernetes cluster?
Difficulty: Mid
What to look for: Check CoreDNS pods are running and healthy. Verify the pod’s DNS configuration (/etc/resolv.conf). Test resolution from within the pod using nslookup or dig. Check CoreDNS logs for errors. Verify network policies are not blocking DNS traffic (port 53). Check if the issue is with cluster-internal DNS (service names) or external DNS. Understand the DNS resolution chain: pod -> CoreDNS -> upstream DNS. Verify the kube-dns Service is correctly configured.
Behavioral and Scenario Questions
40. Describe a time when you had to balance speed of delivery with operational stability.
Difficulty: Mid
What to look for: Specific examples of trade-off decisions. Did the candidate implement guardrails (feature flags, canary deployments, automated rollback) to enable faster delivery without sacrificing stability? Did they push back on unreasonable timelines with data? Did they negotiate scope to meet deadlines without cutting quality? The best answers show an understanding that speed and stability are not inherently opposed — good DevOps practices enable both.
41. How do you approach learning a new tool or technology that you have never used before?
Difficulty: Junior
What to look for: The DevOps landscape changes rapidly, and the ability to learn quickly is more valuable than knowledge of any specific tool. Look for a structured approach: reading official documentation first, building a proof-of-concept in a sandbox environment, understanding how the tool fits into the broader ecosystem, and evaluating it against alternatives. Red flag: candidates who say they only learn from YouTube tutorials or never experiment hands-on.
42. Tell me about a production incident you caused. What happened, and what did you learn?
Difficulty: Mid
What to look for: Honesty and self-awareness. Every experienced DevOps engineer has caused at least one production incident. The value is in the aftermath: did they take responsibility, communicate transparently, lead or participate in a blameless post-mortem, and implement preventive measures? Candidates who claim they have never caused a production issue either have very limited production experience or lack self-awareness.
43. How do you handle disagreements with developers about deployment practices or architecture decisions?
Difficulty: Senior
What to look for: The candidate should demonstrate collaboration, not confrontation. Look for an approach that focuses on shared goals (reliability, velocity, user experience), uses data and evidence to support positions, seeks to understand the developer’s constraints and priorities, and finds compromises where possible. Senior candidates should give examples of times they changed their mind based on new information, not just examples of being right.
44. Describe your approach to documentation. What do you document, and what do you automate instead?
Difficulty: Mid
What to look for: A pragmatic approach that avoids both extremes (no documentation and exhaustive documentation that no one reads). Automate runbooks where possible (self-healing infrastructure, automated remediation). Document architecture decisions (ADRs), operational procedures that require human judgment, onboarding guides, and system boundaries/ownership. Keep documentation close to the code (README files, inline comments, IaC annotations). The candidate should understand that the best documentation is the code and automation itself.
Advanced and Architecture Questions
45. How would you design a disaster recovery strategy for a multi-region application?
Difficulty: Senior
What to look for: Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) based on business requirements. Discuss active-active vs active-passive configurations. Address data replication strategies (synchronous vs asynchronous, conflict resolution). Cover DNS failover (Route 53 health checks, Traffic Manager). Discuss infrastructure-as-code for rapid environment recreation. Test DR regularly through chaos engineering or game days. The candidate should understand that DR design is driven by business requirements and cost constraints, not just technical capability.
46. What is GitOps, and how does it differ from traditional CI/CD?
Difficulty: Mid
What to look for: GitOps uses Git as the single source of truth for declarative infrastructure and application configuration. A GitOps operator (ArgoCD, Flux) continuously reconciles the actual state of the system with the desired state defined in Git. Unlike traditional CI/CD where the pipeline pushes changes to the target environment, GitOps uses a pull model where the operator in the cluster pulls changes from Git. Benefits include auditability, easy rollback (git revert), and consistent environments.
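The pull model is easiest to see in a declaration like this Argo CD Application sketch (repo URL, path, and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config
    targetRevision: main
    path: apps/web
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

Nothing outside the cluster pushes changes in; the operator continuously reconciles the cluster toward whatever is in `apps/web` on `main`.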
47. How do you implement feature flags, and what role do they play in DevOps?
Difficulty: Mid
What to look for: Feature flags decouple deployment from release, allowing code to be deployed to production without being visible to users. This enables trunk-based development, canary releases, A/B testing, and instant rollback without redeployment. Tools include LaunchDarkly, Unleash, Split, or simple configuration-based approaches. The candidate should understand the operational overhead of feature flags (flag debt, testing complexity) and the importance of cleaning up old flags.
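A minimal sketch of a config-driven flag check with percentage rollout shows the core mechanics. The flag names and hashing scheme are illustrative, not any vendor's API:

```python
import hashlib

# Flag configuration; in production this would come from a flag service
# or config store, not a hard-coded dict.
FLAGS = {
    "new-checkout": {"enabled": True, "rollout_percent": 25},
    "dark-mode":    {"enabled": False, "rollout_percent": 0},
}

def is_enabled(flag: str, user_id: str) -> bool:
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    # Stable per-user bucketing: the same user always gets the same answer,
    # so a 25% rollout does not flicker between requests.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < cfg["rollout_percent"]

print(is_enabled("dark-mode", "user-42"))  # False: flag is off for everyone
```

The instant-rollback property follows directly: flipping `enabled` to `False` in config turns the feature off without a deployment.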
48. Describe how you would migrate a monolithic application to microservices from a DevOps perspective.
Difficulty: Senior
What to look for: A pragmatic, incremental approach — not a big-bang rewrite. Start with the strangler fig pattern: identify bounded contexts, extract services one at a time, and route traffic between the monolith and new services. From a DevOps perspective, this requires building CI/CD pipelines for each new service, implementing service discovery and inter-service communication, adding distributed tracing and centralized logging, and evolving the deployment strategy from monolithic to per-service. The candidate should discuss the operational complexity that microservices introduce and when a monolith might actually be the better choice.
49. What is platform engineering, and how does it relate to DevOps?
Difficulty: Senior
What to look for: Platform engineering is the discipline of building and maintaining internal developer platforms (IDPs) that provide self-service capabilities to development teams. It evolved from DevOps as organizations realized that asking every development team to manage their own infrastructure, CI/CD, and operations creates cognitive overload and inconsistency. Platform teams build golden paths (opinionated, well-supported workflows) that make it easy for developers to do the right thing. The candidate should discuss internal developer platforms, developer experience, and the balance between standardization and flexibility.
50. How do you approach capacity planning for a rapidly growing application?
Difficulty: Senior
What to look for: Establish baseline metrics (current resource utilization, growth rate, traffic patterns). Use monitoring data to identify bottlenecks before they become incidents. Implement auto-scaling with appropriate policies (scale-up fast, scale-down slowly). Conduct regular load testing to validate capacity under peak conditions. Plan for 2-3x current capacity to handle unexpected spikes. Consider horizontal vs vertical scaling trade-offs. The candidate should discuss how they balance the cost of over-provisioning against the risk of under-provisioning.
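A useful follow-up is to ask the candidate to do the back-of-envelope math. Assuming compound month-over-month growth, the lead time before peak utilization crosses a chosen ceiling falls out of a one-line formula; the 70% ceiling and growth figures below are illustrative, not a universal rule:

```python
import math

def months_until_exhaustion(current_util: float,
                            monthly_growth: float,
                            ceiling: float = 0.7) -> float:
    """Months until compound traffic growth pushes peak utilization
    past the ceiling at which you want new capacity already in place
    (e.g. act before exceeding 70% peak)."""
    if current_util >= ceiling:
        return 0.0  # already past the threshold: capacity work is due now
    return math.log(ceiling / current_util) / math.log(1 + monthly_growth)

# At 40% peak utilization and 10% month-over-month growth,
# capacity work is due in roughly six months.
print(round(months_until_exhaustion(0.40, 0.10), 1))  # → 5.9
```

The point of the exercise is not the arithmetic but the reasoning: a strong candidate will immediately caveat that growth is rarely smooth, that peak (not average) utilization is what matters, and that the answer should trigger load testing rather than a blind provisioning order.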
Bonus Questions for Specialized Roles
51. How do you implement policy-as-code for Kubernetes?
Difficulty: Senior
What to look for: Using Open Policy Agent (OPA) with Gatekeeper, or Kyverno, to enforce policies on Kubernetes resources at admission time. Examples: requiring resource limits on all pods, preventing privileged containers, enforcing image pull policies, requiring specific labels. The candidate should discuss the balance between enforcing policies and not blocking developer velocity.
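As a concrete illustration of the first example (requiring resource limits on all pods), a Kyverno policy might look like the following. The policy name and message are illustrative; field layout follows the Kyverno v1 ClusterPolicy API:

```yaml
# Illustrative Kyverno ClusterPolicy: reject any Pod whose containers
# do not declare CPU and memory limits.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce   # start with "Audit" to measure impact first
  rules:
    - name: check-container-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must declare CPU and memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

Candidates who mention rolling the policy out in audit mode before switching to enforce are demonstrating exactly the velocity-versus-enforcement balance the question probes for.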
52. Describe your experience with chaos engineering. What experiments have you run?
Difficulty: Senior
What to look for: Chaos engineering is the practice of deliberately injecting failures to test system resilience. Tools include Chaos Monkey, LitmusChaos, Gremlin, and Chaos Mesh. The candidate should describe specific experiments (killing pods, introducing network latency, simulating AZ failures) and the insights they gained. The key principle is forming a hypothesis about expected behavior before running the experiment.
53. How do you manage database migrations in a continuous deployment environment?
Difficulty: Senior
What to look for: Backward-compatible migrations (expand-and-contract pattern). Run migrations before deploying new application code so rollback does not require database rollback. Use migration tools (Flyway, Liquibase, Rails migrations, Alembic) with versioned, idempotent scripts. Never perform destructive schema changes (dropping columns, renaming tables) in a single step — add the new structure, migrate data, deploy code that uses the new structure, then remove the old structure.
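The expand-and-contract sequence can be sketched as a series of versioned migrations. The example below renames a column safely across releases; table and column names are hypothetical, and exact DDL syntax varies by database:

```sql
-- Expand-and-contract rename of users.email_addr to users.email,
-- spread across releases so every step is backward compatible.

-- Release 1 (expand): add the new column alongside the old one.
ALTER TABLE users ADD COLUMN email TEXT;

-- Release 1 (backfill): copy existing data; batch this on large tables
-- to avoid long-held locks.
UPDATE users SET email = email_addr WHERE email IS NULL;

-- Release 2: deploy application code that reads and writes only `email`
-- (no schema change; the old column is still present for rollback).

-- Release 3 (contract): once no deployed code references the old
-- column, drop it.
ALTER TABLE users DROP COLUMN email_addr;
```

The key property to listen for: at every intermediate state, both the previous and the current application version run correctly against the schema, so a failed deploy rolls back without touching the database.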
Building Your Interview Panel
A strong DevOps interview process goes beyond just asking technical questions. Here are recommendations for structuring your interview:
First screen (30-45 minutes): Use 5-8 questions from the Junior/Mid categories to establish baseline competence. Focus on fundamentals: Linux, networking, CI/CD basics, and containerization. This screen can be conducted by a recruiter using a scoring rubric or by a mid-level engineer.
Technical deep-dive (60-90 minutes): Use 6-10 questions from the Mid/Senior categories, tailored to your tech stack. This should be conducted by a senior engineer or engineering manager who can evaluate the depth and nuance of answers. Include at least one architecture or design question.
Behavioral interview (45-60 minutes): Use 3-5 behavioral questions to assess collaboration, communication, incident response temperament, and learning ability. DevOps is as much about culture and collaboration as it is about technical skills.
Take-home or live exercise (2-4 hours): Provide a practical exercise: write a Terraform module, debug a broken CI/CD pipeline, or design a monitoring strategy for a given architecture. Evaluate the candidate’s approach, code quality, and documentation — not just whether they get the “right” answer.
The time investment in a thorough interview process pays for itself. As we discuss in our analysis of how long it should take to hire a DevOps engineer, rushing the interview process to save time often results in a bad hire that costs far more than the time saved.
Frequently Asked Questions
How many interview questions should I ask a DevOps candidate?
For a standard interview loop, plan for 15-20 technical questions across 2-3 interview sessions, plus 5-8 behavioral questions. Do not try to cover every category in a single interview — focus on the skills most relevant to your specific role and tech stack. Quality of conversation matters more than quantity of questions.
What is more important: tool-specific knowledge or foundational understanding?
Foundational understanding wins in almost every case. A candidate who deeply understands networking, Linux, distributed systems, and software delivery principles can learn any specific tool in weeks. A candidate who knows the commands for Terraform but does not understand state management, dependency graphs, or cloud networking will hit a ceiling quickly. Use tool-specific questions to validate hands-on experience, but weight conceptual understanding more heavily.
How do I assess DevOps candidates if I am not technical?
Focus on behavioral questions and problem-solving approach rather than specific technical answers. Ask candidates to explain their answers in plain language — strong engineers can explain complex concepts simply. Use a structured scoring rubric so you can compare candidates consistently. And most importantly, work with a recruiting partner that has technical evaluators who can assess DevOps skills on your behalf. We cover this in detail in our guide on how to evaluate technical candidates without being technical.
Should I include coding questions in a DevOps interview?
Yes, but they should be relevant to the role. Asking a DevOps engineer to implement a linked list reversal is not useful. Asking them to write a Python or Bash script that parses log files, a Terraform module that creates a VPC, or a Dockerfile for a multi-stage build is directly relevant and reveals real-world skills. The question should reflect the actual work the candidate will do.
How do I avoid bias in DevOps interviews?
Use structured interviews where every candidate receives the same (or equivalent) questions. Define evaluation criteria before the interview, not after. Use a diverse interview panel. Evaluate answers against a rubric rather than gut feeling. Be aware that certifications and specific tool experience can introduce bias toward candidates who had access to training resources — focus on demonstrated problem-solving ability over credentials.
Start Hiring Better DevOps Engineers
Strong DevOps interview questions are essential, but they are only one part of a successful hiring process. You also need a pipeline of qualified candidates, a competitive offer, and a hiring timeline that does not lose top talent to faster-moving companies.
If you are hiring DevOps or SRE engineers, our technical recruiting team specializes in finding candidates who can answer these questions from real production experience — not just textbook knowledge. Start your search today and speak with an engineer who understands the role.