Remoteria — Pre-Vetted Offshore Staffing

Syed Ali


Cloud Engineer Interview Questions & Answers Guide (2026)

A hiring manager's interview kit for cloud engineers, with specific “what to look for” notes on every answer, red flags to watch for, and a practical test.


Key facts

Role: Cloud Engineer
Technical questions: 14
Behavioral: 7
Role-fit: 5
Red flags: 8
Practical test: Included

How to use this guide

Pick 4-6 technical questions across difficulty levels, 2-3 behavioral questions, and 1-2 role-fit questions for a 45-minute interview. For senior roles, weight the harder technical questions and the role-fit questions more heavily. Always close with the practical test so you are hiring on evidence, not impressions. The “what to look for” notes are a scoring rubric: strong answers touch most points; weak answers miss them or replace them with platitudes.

Technical questions — Easy

1. Explain the difference between an Application Load Balancer and a Network Load Balancer, and when you would pick each.

Easy

What to look for: ALB: Layer 7, HTTP/HTTPS routing, host and path rules, WAF integration. NLB: Layer 4, static IPs, source IP preservation, millions of requests per second, TLS passthrough. Picks NLB for non-HTTP protocols, extreme performance, or a static IP requirement; ALB otherwise.

Technical questions — Medium

1. Design the AWS account structure for a 50-person SaaS company with production, staging, dev, and a separate audit/logging boundary. Walk me through the decisions.

Medium

What to look for: AWS Organizations, separate OUs for Prod/NonProd/Security/Infrastructure, SCPs at OU level, centralized logging account, separate log archive, payer vs workload accounts. Mentions Control Tower or homegrown landing zone. Explains why not one account.

2. A developer needs to call RDS from a Lambda in a VPC. Walk me through the IAM and networking setup.

Medium

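What to look for: Lambda in private subnets with VPC config, Lambda security group allowing egress to the RDS security group, RDS security group allowing ingress on the database port (5432 for Postgres) from the Lambda security group, IAM execution role with the AWSLambdaVPCAccessExecutionRole managed policy attached, IAM DB auth or Secrets Manager for credentials. Flags the cold-start implications of VPC Lambda (much smaller since AWS moved to shared Hyperplane ENIs in 2019).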
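The security group pairing in this answer can be sketched as plain data. This is a minimal illustration in Python, not real AWS API calls; the group IDs and the `can_connect` helper are hypothetical, and the port assumes Postgres.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    direction: str  # "ingress" or "egress"
    protocol: str
    port: int
    peer_sg: str    # security group on the other side of the connection

# Hypothetical security group IDs, for illustration only.
LAMBDA_SG = "sg-lambda"
RDS_SG = "sg-rds"
POSTGRES_PORT = 5432

rules = {
    LAMBDA_SG: [Rule("egress", "tcp", POSTGRES_PORT, RDS_SG)],
    RDS_SG: [Rule("ingress", "tcp", POSTGRES_PORT, LAMBDA_SG)],
}

def can_connect(src_sg, dst_sg, port, rules):
    """True when src allows egress to dst and dst allows ingress from src on the port."""
    egress_ok = any(
        r.direction == "egress" and r.port == port and r.peer_sg == dst_sg
        for r in rules.get(src_sg, [])
    )
    ingress_ok = any(
        r.direction == "ingress" and r.port == port and r.peer_sg == src_sg
        for r in rules.get(dst_sg, [])
    )
    return egress_ok and ingress_ok
```

Referencing the peer security group rather than a CIDR range is the point a strong candidate usually makes: the rule keeps working when the Lambda's ENI addresses change.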

3. What is IRSA (IAM Roles for Service Accounts) and why is it better than node IAM roles on EKS?

Medium

What to look for: Explains OIDC provider on EKS, service account annotation, pod-level IAM instead of node-wide. Fixes blast radius: a compromised pod does not get all node permissions. Mentions Workload Identity as GCP equivalent.
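For reference, a sketch of the role trust policy a candidate might describe, built as a Python dict. The account ID, OIDC provider path, namespace, and service account name are all hypothetical placeholders. The service account itself would carry an eks.amazonaws.com/role-arn annotation pointing at the role.

```python
import json

def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build a trust policy that lets one Kubernetes service account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Restrict the role to a single namespace and service account.
                    f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }

# Hypothetical values, for illustration only.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    namespace="payments",
    service_account="api",
)
print(json.dumps(policy, indent=2))
```

The `sub` condition is what narrows the blast radius: only pods running as that one service account can obtain the role's credentials.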

4. Your AWS bill jumped 40% last month with no new deploys. Walk me through the investigation.

Medium

What to look for: Cost Explorer grouped by service → linked account → usage type. Common culprits: NAT Gateway egress, cross-AZ traffic, CloudWatch Logs ingestion, S3 request charges, untagged EC2/EBS. Uses the Cost and Usage Report (CUR) for detailed analysis. Has a real story from past experience.
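The drill-down can be sketched as a month-over-month comparison grouped by service and usage type. The rows below are invented sample data standing in for a billing export, and the usage type names are simplified placeholders.

```python
from collections import defaultdict

# Hypothetical (month, service, usage_type, cost_usd) rows standing in for a billing export.
rows = [
    ("2026-01", "EC2", "BoxUsage", 12000.0),
    ("2026-01", "NAT Gateway", "NatGateway-Bytes", 1500.0),
    ("2026-01", "CloudWatch", "DataProcessing-Bytes", 900.0),
    ("2026-02", "EC2", "BoxUsage", 12100.0),
    ("2026-02", "NAT Gateway", "NatGateway-Bytes", 6200.0),
    ("2026-02", "CloudWatch", "DataProcessing-Bytes", 1800.0),
]

def cost_deltas(rows, previous, current):
    """Rank (service, usage_type) pairs by cost increase between two months."""
    totals = defaultdict(lambda: {previous: 0.0, current: 0.0})
    for month, service, usage_type, cost in rows:
        if month in (previous, current):
            totals[(service, usage_type)][month] += cost
    deltas = [(key, t[current] - t[previous]) for key, t in totals.items()]
    return sorted(deltas, key=lambda item: item[1], reverse=True)

ranked = cost_deltas(rows, "2026-01", "2026-02")
for (service, usage_type), delta in ranked:
    print(f"{service:12} {usage_type:22} {delta:+10.2f}")
```

In this made-up data the total rises from $14,400 to $20,100, and ranking by delta shows that NAT Gateway data processing accounts for most of the increase.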

5. Design a least-privilege IAM policy for a CI pipeline that needs to deploy a Lambda function, update an S3 bucket, and invalidate a CloudFront distribution.

Medium

What to look for: Resource-scoped ARNs (not *), specific actions (lambda:UpdateFunctionCode not lambda:*), condition keys where useful (aws:SourceVpc, aws:PrincipalTag), OIDC federation from GitHub Actions rather than long-lived access keys.
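A sketch of what a passing answer might look like, written as a Python dict, plus a small check for wildcards. The account ID, region, function name, bucket name, and distribution ID are hypothetical, and a real pipeline may need further actions (for example publishing a Lambda version).

```python
# Hypothetical account ID, region, and resource names, for illustration only.
ACCOUNT = "111122223333"
REGION = "us-east-1"

ci_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DeployLambda",
            "Effect": "Allow",
            "Action": ["lambda:UpdateFunctionCode"],
            "Resource": f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:example-api",
        },
        {
            "Sid": "UploadAssets",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-assets-bucket/*",
        },
        {
            "Sid": "InvalidateCdn",
            "Effect": "Allow",
            "Action": ["cloudfront:CreateInvalidation"],
            "Resource": f"arn:aws:cloudfront::{ACCOUNT}:distribution/EXAMPLEDISTID",
        },
    ],
}

def as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]

def wildcard_findings(policy):
    """Return Sids of statements using "*" or "service:*" actions, or a bare "*" resource."""
    findings = []
    for stmt in policy["Statement"]:
        actions = as_list(stmt["Action"])
        resources = as_list(stmt["Resource"])
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            findings.append(stmt.get("Sid", "<no Sid>"))
    return findings

print(wildcard_findings(ci_policy))
```

The check is deliberately crude: it flags only whole-service and bare wildcards, which are the patterns the red-flag list in this guide calls out.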

6. A Terraform module you wrote is used by 20 teams. You need to make a breaking change. How?

Medium

What to look for: SemVer with major version bump, publish new version to registry, old version still pinned by consumers, migration guide in CHANGELOG, deprecation warnings in current version first, moved blocks for internal refactors, coordinated rollout.
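The SemVer reasoning can be made concrete with a small sketch that finds which consumers need the migration guide. The team names and pinned versions are hypothetical, and the parser assumes plain MAJOR.MINOR.PATCH strings with no pre-release suffix.

```python
def parse(version):
    """Split "MAJOR.MINOR.PATCH" into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def is_breaking_upgrade(pinned, candidate):
    """Under SemVer, a major version increase signals a breaking change."""
    return parse(candidate)[0] > parse(pinned)[0]

# Hypothetical consumers and the module versions they currently pin.
consumers = {"team-a": "1.4.2", "team-b": "1.9.0", "team-c": "2.0.0"}
new_release = "2.0.0"

needs_migration = sorted(
    team for team, pinned in consumers.items()
    if is_breaking_upgrade(pinned, new_release)
)
print(needs_migration)
```

Because consumers pin versions, publishing 2.0.0 changes nothing for them until they opt in, which is what makes a coordinated rollout possible.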

7. Compare AWS Reserved Instances, Savings Plans, and Spot. When do you use each?

Medium

What to look for: Reserved Instances: specific instance family/region, steady workload. Compute Savings Plans: flexible across family/region, steady workload. Spot: interruptible batch, up to 90% off, needs a Spot-aware app. Mentions commitment modeling based on 14-30 days of usage.
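One simple way to illustrate commitment modeling is to commit at a low percentile of observed hourly spend, so the commitment is fully used in most hours. This is an illustrative heuristic with invented numbers, not an AWS recommendation; the function name and the default percentile are assumptions.

```python
def baseline_commitment(hourly_spend, percentile=10):
    """Pick an hourly commitment at a low percentile of observed spend."""
    if not hourly_spend:
        raise ValueError("need at least one hour of usage data")
    ordered = sorted(hourly_spend)
    # Integer arithmetic avoids floating-point surprises when indexing.
    index = (percentile * (len(ordered) - 1)) // 100
    return ordered[index]

# Hypothetical hourly on-demand compute spend in USD.
# A real window would cover every hour across 14-30 days.
hourly_spend = [10, 12, 11, 30, 9, 10, 14, 10, 13, 25, 10]

commitment = baseline_commitment(hourly_spend)
covered_hours = sum(1 for spend in hourly_spend if spend >= commitment)
print(commitment, covered_hours, len(hourly_spend))
```

Spend above the committed baseline stays on-demand or moves to Spot, which is why candidates usually describe the three options as layers rather than alternatives.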

Technical questions — Hard

1. Your Terraform state file is corrupted and the lock is stuck. Walk me through recovery without nuking infrastructure.

Hard

What to look for: Releases the DynamoDB lock with terraform force-unlock and the lock ID, restores a known-good state from S3 versioning, uses terraform state list/pull to audit, uses import blocks to reconcile drift. Last resort: terraform state rm and re-import, or hand-editing a pulled copy of the state. Mentions why not to commit state to git.

2. Design a multi-region active-passive failover for a Postgres-backed web app. What is your RTO and RPO, and how do you hit them?

Hard

What to look for: Aurora Global Database or cross-region read replica for RPO < 1s, Route 53 health check + DNS failover or Global Accelerator, standby ALB and ASG in the secondary region (for example us-west-2), secrets replicated, how to promote the replica, how to test it. Realistic RTO of 5-15 minutes on managed services.
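A good candidate can break the RTO into its parts. The sketch below adds up a failover budget; every number is a hypothetical planning input, not a measured or guaranteed figure for any AWS service.

```python
from dataclasses import dataclass

@dataclass
class FailoverBudget:
    health_check_interval_s: int  # how often the health check probes
    failure_threshold: int        # consecutive failures before failover triggers
    dns_ttl_s: int                # how long clients may cache the old record
    replica_promotion_s: int      # time to promote the standby database
    app_warmup_s: int             # time for the standby app tier to take traffic

    def detection_s(self):
        return self.health_check_interval_s * self.failure_threshold

    def rto_s(self):
        return (
            self.detection_s()
            + self.dns_ttl_s
            + self.replica_promotion_s
            + self.app_warmup_s
        )

# Hypothetical inputs, for illustration only.
budget = FailoverBudget(
    health_check_interval_s=30,
    failure_threshold=3,
    dns_ttl_s=60,
    replica_promotion_s=300,
    app_warmup_s=120,
)
print(f"estimated RTO: {budget.rto_s() / 60:.1f} minutes")
```

With these inputs the estimate is 570 seconds, or 9.5 minutes. The value of the exercise is that each term maps to something the candidate can test in a failover drill.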

3. How do you handle secrets rotation for a database password used by 15 microservices?

Hard

What to look for: Secrets Manager with Lambda rotation function, apps read from Secrets Manager on each connection (or cached with TTL), no app restart required, staged rotation (AWSCURRENT/AWSPENDING), monitors failed rotations. NOT: rotating then restarting 15 services.
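The "cached with TTL" pattern can be sketched without any AWS dependency. Here `fetch` is a hypothetical callable standing in for a Secrets Manager lookup, and the class name and default TTL are assumptions for illustration.

```python
import time

class CachedSecret:
    """Cache a secret for a limited time so rotation is picked up without a restart."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # hypothetical stand-in for the secret lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self):
        """Call after an authentication failure so the next get() re-fetches."""
        self._value = None
```

A service would call `get()` when opening each new database connection and `invalidate()` when a login is rejected, so a rotated password is picked up within one TTL or on the first failed attempt.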

4. Walk me through mapping SOC 2 CC6 (logical access) controls to AWS services. What evidence does an auditor want?

Hard

What to look for: IAM policies + SCPs = access control, IAM Identity Center (formerly AWS SSO) for centralized identity, CloudTrail for access logs, MFA enforcement via SCP, periodic access reviews (Access Analyzer), off-boarding runbook. Auditor wants logs, policy exports, review records, not screenshots.

5. Your org runs 30 EKS clusters. How do you manage upgrades across them safely?

Hard

What to look for: Canary cluster pattern, documented upgrade runbook, kube-no-trouble or pluto for deprecated API scanning, automated via Terraform or Cluster API (CAPI), staggered rollout, version skew policy between control plane and nodes. Not: click through 30 consoles.
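The canary and staggered rollout ideas can be sketched as a wave planner plus a version skew check. The cluster names, wave size, and `max_skew` default are assumptions; the allowed skew depends on the Kubernetes version, so check the upstream version skew policy before relying on a number.

```python
def minor(version):
    """Extract the minor version from a string like "1.30"."""
    return int(version.split(".")[1])

def skew_ok(control_plane, node, max_skew=2):
    """Nodes must not be newer than the control plane, nor more than max_skew minors older."""
    gap = minor(control_plane) - minor(node)
    return 0 <= gap <= max_skew

def rollout_waves(clusters, canary_count=1, wave_size=10):
    """Upgrade canary clusters first, then the rest in fixed-size waves."""
    waves = [clusters[:canary_count]]
    rest = clusters[canary_count:]
    waves += [rest[i:i + wave_size] for i in range(0, len(rest), wave_size)]
    return [wave for wave in waves if wave]

# Hypothetical cluster names, for illustration only.
clusters = [f"cluster-{n:02d}" for n in range(1, 31)]
waves = rollout_waves(clusters)
print([len(wave) for wave in waves])
```

Each wave would only start after the previous one has passed its health checks, with the deprecated API scan run against every cluster before its wave begins.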

6. Tell me about the biggest reliability incident you owned. What was the root cause, and what infrastructure guardrail did you add after?

Hard

What to look for: Specific incident, clear causal chain, blameless analysis, concrete infra change (limit, alert, IaC guardrail, automated test). Not vague "we added monitoring".

Behavioral questions

1. Describe a time you disagreed with an engineer about cloud architecture. How did it resolve?

What to look for: Brought data (cost, reliability, complexity), wrote a short ADR, respected final decision, not ego-driven.

2. Tell me about a cost optimization project you led. What was the baseline, target, and outcome?

What to look for: Specific numbers: "cut $40k/mo to $28k/mo by moving X to Graviton and adding SP coverage". Measured impact, did not break production capacity.

3. Walk me through a compliance audit you supported. What went well, what was painful?

What to look for: Specific framework (SOC 2 Type 2, HIPAA), evidence they collected, tooling (Drata, Vanta, Tugboat), lessons about preparation cadence.

4. How do you convince a product team to adopt your platform module instead of rolling their own?

What to look for: Meets them where they are, makes the golden path easier than custom, docs + examples, offers pairing. Does not force by decree.

5. Describe a time you had to explain a cloud architecture decision to non-technical stakeholders.

What to look for: Analogies, visuals, focused on business outcome (cost, risk, speed), not jargon.

6. How do you keep current on the big three clouds when they ship 100+ features a year?

What to look for: Specific: AWS re:Invent talks, release notes RSS, specific newsletters (Last Week in AWS), experimentation in a sandbox. Not vague.

7. Tell me about a production change you reverted.

What to look for: Detection, decision to revert vs fix forward, communication, postmortem follow-up. Not blaming.

Role-fit questions

1. Where do you see the line between Cloud Engineer and DevOps Engineer?

What to look for: Cloud = platform, services, landing zone, IAM, cost, networking. DevOps = pipelines, deploy tooling, on-call, release cadence. Acknowledges overlap without being territorial.

2. AWS, Azure, or GCP — what do you actually prefer and why?

What to look for: Has an informed opinion grounded in services, tooling, ecosystem, pricing. Not religious. Can still work in the others.

3. We run on AWS with Terraform, EKS, and Datadog. What do you already know, and where would you ramp?

What to look for: Honest gap assessment with concrete plan. Fakery is a red flag.

4. How do you feel about owning the cloud bill and presenting it monthly to the CFO?

What to look for: Comfortable with FinOps accountability, understands spend as a product, can translate services into business language.

5. What does your first 30 days look like here?

What to look for: Read-only audit in week 1, first Terraform fix in week 2, cost baseline report in week 3, Well-Architected review draft by the end of month 1. Not passive.

Red flags

Any one of these alone is usually reason to pass, especially combined with weak answers elsewhere.

• Writes IAM policies with Action: * or Resource: * and shrugs when asked.
• Manages Terraform state by committing .tfstate to git.
• Has never done a restore test — "we have backups" without evidence.
• Cannot explain the difference between a security group and a NACL.
• Claims multi-cloud is always better without acknowledging operational cost.
• Names compliance frameworks (SOC 2, HIPAA) but cannot cite a specific control family they worked on.
• Uses root account credentials for day-to-day work.
• Does not know what a service control policy (SCP) is or confuses it with IAM policy.

Practical test

6-hour take-home: given a Terraform repo with an existing AWS landing zone and a deliberately over-permissioned IAM policy for a Lambda, (1) refactor the IAM policy to least-privilege with resource-scoped ARNs and justify each action in comments, (2) add a module that provisions a multi-AZ RDS Postgres with automated backups and a restore runbook, and (3) produce a 1-page cost and reliability analysis. Graded on: IAM rigor (30%), Terraform module quality and reusability (25%), backup/restore correctness (20%), the written analysis including trade-offs (15%), and commit hygiene (10%).
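The grading weights can be applied mechanically. This sketch assumes each criterion is scored on a 0-100 scale, which the guide does not specify; the criterion keys are illustrative names for the five graded areas.

```python
# Weights in percent, taken from the grading breakdown above.
WEIGHTS = {
    "iam_rigor": 30,
    "terraform_module": 25,
    "backup_restore": 20,
    "written_analysis": 15,
    "commit_hygiene": 10,
}

def practical_score(scores):
    """Weighted total on the same scale as the per-criterion scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100

# Hypothetical candidate scores on an assumed 0-100 scale.
candidate = {
    "iam_rigor": 80,
    "terraform_module": 70,
    "backup_restore": 90,
    "written_analysis": 60,
    "commit_hygiene": 100,
}
print(practical_score(candidate))
```

For this sample candidate the weighted total is 78.5. Deciding the pass mark in advance keeps the practical test from being graded on impressions.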

Scoring rubric

Score each answer 1-4: (1) Misses most of the rubric or gives platitudes; (2) Hits some points but cannot go deep when pressed; (3) Covers the rubric and can defend the answer under follow-ups; (4) Adds unprompted nuance, trade-offs, or real examples beyond the rubric. Hire at an average of 3.0+ across technical, behavioral, and role-fit, with zero red flags, and a pass on the practical test.
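The hire rule can be written down as a short function. This sketch reads "average of 3.0+" as the mean across all scored answers, which is one interpretation of the rubric; the function name and sample scores are hypothetical.

```python
from statistics import mean

def hire_decision(answer_scores, red_flags, practical_passed, threshold=3.0):
    """Apply the rubric: average at or above threshold, zero red flags, practical test passed."""
    if not answer_scores:
        raise ValueError("need at least one scored answer")
    if any(not 1 <= score <= 4 for score in answer_scores):
        raise ValueError("scores must be between 1 and 4")
    return mean(answer_scores) >= threshold and red_flags == 0 and practical_passed

# Hypothetical interview: eight scored answers, no red flags, practical test passed.
print(hire_decision([3, 4, 3, 3, 2, 3, 4, 3], red_flags=0, practical_passed=True))
```

The sample scores average 3.125, so this candidate clears the bar. A single red flag or a failed practical test overrides the average.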

Written by Syed Ali