The pager didn’t just beep. It screamed. It’s a specific frequency—somewhere between a dying hawk and a car alarm—that PagerDuty reserves for the kind of “Priority 0” events that end careers. I swiped the screen with a thumb that was shaking more from a lack of sleep than from fear.
03:14:22 UTC.
Error Code: AADSTS7000222
Message: The provided client secret keys are expired. Visit the Azure Portal to create new keys for your application, or consider using certificate credentials for added security.
I sat up. The room was freezing, the radiator having given up around midnight. I reached for the mug of espresso I’d left on the nightstand—cold, oily, and bitter. Just like the next six hours of my life were going to be.
The message was clear: our primary Service Principal, the one that handles the automated scaling for the entire prod-us-east-2 cluster, had just timed out. Its secret had hit its one-year expiration. Because we’re a “fast-moving” company, nobody had documented which vault held the source of truth, and the rotation script I’d written two years ago had been disabled by a “Security Architect” who thought it was too risky.
Now, the cluster was trying to scale out to handle the early morning batch processing for our European clients. It couldn’t. The nodes were redlining. The site was throwing 503s. And I was the only one awake.
[INCIDENT-8821] The Portal is a Liar and a Thief of Time
My first instinct was the “Manual Hell” approach. I know, I know. “Use the CLI,” they say. “Stay in the terminal,” they say. But when your brain is 40% caffeine and 60% panic, you find yourself clicking into the Azure Portal like a moth to a bug zapper.
I logged into the tenant. The UI felt like it was moving through molasses. I navigated to App Registrations, searched for the SPN, and clicked Certificates & secrets. The spinning blue circle of death mocked me for a solid thirty seconds.
“Come on, you piece of…” I muttered, the words catching in my dry throat.
When it finally loaded, I saw the red text. Expired: 03:14:00 UTC. I generated a new secret. I copied the value. Now I just had to update the Key Vault. But wait—which Key Vault? We have fourteen of them in that subscription alone because “microservices” apparently means “give everyone a vault and let God sort them out.”
I tried to use the search bar. No results found. I tried to filter by tags. No tags found.
I spent forty-five minutes clicking through the UI, manually checking the Access Policies of every vault to see which one had the Microsoft.Compute/virtualMachines provider registered as a contributor. I was drowning in a sea of “Essentials” blades and “Activity Logs” that told me nothing. The Portal is designed to make simple things look easy and hard things impossible. It’s a dashboard for people who don’t actually have to fix things at 4 AM.
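For the record, here is the check that would have skipped the forty-five minutes of clicking. A Python sketch, assuming vault JSON shaped like `az keyvault show` output; the function name, the sample data, and the object IDs are mine, not real subscription state:

```python
# Sketch: given vault descriptions like `az keyvault show` returns,
# find which vaults list a given principal objectId in their access
# policies. Illustrative only -- feed it real JSON from the CLI.
def vaults_granting(vaults, object_id):
    matches = []
    for v in vaults:
        policies = v.get("properties", {}).get("accessPolicies", [])
        if any(p.get("objectId") == object_id for p in policies):
            matches.append(v["name"])
    return matches

vaults = [
    {"name": "kv-prod-secrets-001",
     "properties": {"accessPolicies": [{"objectId": "d3b07384-aaaa"}]}},
    {"name": "kv-team-misc-007",
     "properties": {"accessPolicies": [{"objectId": "someone-else"}]}},
]
print(vaults_granting(vaults, "d3b07384-aaaa"))  # ['kv-prod-secrets-001']
```

Fourteen vaults, one loop, done. The Portal makes you do this by hand, one Access Policies blade at a time.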
By 04:00 AM, the latency on our Standard_D2s_v3 instances in the staging environment—which I’d foolishly used as a canary—was hitting 4000ms. The disk IOPS were throttled because I’d picked a SKU that couldn’t handle the burst. I was losing the war.
[OPS-WAR-ROOM] The Transition to DevOps Azure Salvation
I closed the browser tab with a vengeance. The “Manual Hell” phase was over. If I was going to save this, I had to stop acting like a junior admin and start using the devops azure toolset properly, even if the transition felt like dragging my soul over broken glass.
I opened VS Code and pulled the repository for our infrastructure-as-code. We were in the middle of a half-baked migration from ARM templates to Bicep. Half the modules were legacy JSON garbage that looked like a bracket factory exploded, and the other half were Bicep files that were “clean” but lacked the specific parameters needed for our custom VNET injection.
I needed to update the Service Connection in Azure DevOps. This is where the “devops azure” workflow usually breaks down: the bridge between the identity provider (Entra ID, or whatever they’re calling it this week) and the orchestration engine.
TECHNICAL INTERLUDE #1: THE CLI TRUTH
I ran a quick check to see exactly what the state of my resources was. I didn’t trust the Portal anymore.
# Attempting to verify the service principal's current role assignments
az role assignment list --assignee "d3b07384-xxxx-xxxx-xxxx-xxxxxxxxxxxx" --output table
# Result:
# (stderr) AD Graph lookup failed. This can happen if the service principal
# has been deleted or if there are eventual consistency issues in Entra ID.
# (stdout) []
# Let's check the actual resource provider status for the VMSS
az resource show \
  --resource-group "rg-prod-compute-001" \
  --name "vmss-app-prod" \
  --resource-type "Microsoft.Compute/virtualMachineScaleSets" \
  --api-version "2023-09-01" \
  --query "{Status:provisioningState, SKU:sku.name}"

# Result:
# {
#   "Status": "Failed",
#   "SKU": "Standard_D2s_v3"
# }
The provisioningState was Failed. Of course it was. The scale set was stuck in an “Updating” loop because it couldn’t authenticate to the Key Vault to pull the disk encryption key. I had to fix the identity, and I had to do it through the pipeline, or the deployed infrastructure would drift out of sync with the repo forever.
[DEPLOY-PIPELINE-ALPHA] The Bicep vs. ARM Cold War
I started rewriting the deployment module. We had been using a monolithic ARM template that was 4,000 lines of JSON. If you’ve never had to debug a nested copyIndex() function in an ARM template at 4:30 AM, count your blessings. It is a special kind of psychological torture.
I moved the logic to Bicep. Bicep is supposed to be the answer to our prayers, but in a devops azure environment, it’s just a prettier way to fail. The syntax is cleaner, sure, but the underlying API (Microsoft.Resources/deployments) still has the same quirks.
I wrote the Bicep module to handle the Managed Identity instead of the Service Principal. That was the goal: get away from secrets that expire. But Managed Identities have their own friction. You can’t just “create” one and use it; you have to wait for the identity to propagate through the Azure global graph. If your pipeline moves too fast, the next step—assigning permissions—will fail with a PrincipalNotFound error.
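The workaround is dumb but necessary: retry the assignment with exponential backoff until the principal actually shows up in the graph. Here it is sketched in Python rather than pipeline YAML so the logic is visible; the PrincipalNotFound class and the assign() callable are stand-ins for whatever your deployment step really raises, not a real SDK surface:

```python
# Sketch: retry a role assignment until the new identity has propagated,
# instead of dying on the first PrincipalNotFound. Stand-in types only.
import time

class PrincipalNotFound(Exception):
    pass

def assign_with_retry(assign, attempts=6, base_delay=1.0, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return assign()
        except PrincipalNotFound:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Simulate an identity that becomes visible on the third lookup.
calls = {"n": 0}
def flaky_assign():
    calls["n"] += 1
    if calls["n"] < 3:
        raise PrincipalNotFound()
    return "roleAssignmentId"

print(assign_with_retry(flaky_assign, sleep=lambda s: None))  # roleAssignmentId
```

In a real pipeline this is a retry loop around `az role assignment create`, or a deliberate delay step between the identity deployment and the assignment deployment. Either way, the backoff is load-bearing.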
TECHNICAL INTERLUDE #2: THE YAML INDENTATION INCIDENT
I pushed the change to the azure-pipelines.yml file. I was confident. I was smug. I was an idiot.
stages:
  - stage: Deploy
    jobs:
      - job: Infrastructure
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: AzureResourceManagerTemplateDeployment@3
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: 'sc-prod-connection'
              subscriptionId: 'sub-id-hidden'
              action: 'Create Or Update Resource Group'
              resourceGroupName: 'rg-prod-compute-001'
              location: 'East US 2'
              templateLocation: 'Linked artifact'
              csmFile: '$(Build.SourcesDirectory)/infra/main.bicep'
              overrideParameters: >
                -vmSku "Standard_D2s_v3"
                -adminPassword "$(vm-password)"
                -managedIdentityName "id-prod-app"
The pipeline failed instantly. Error: Line 42, Column 1: Expected a mapping value.
I stared at line 42 for twenty minutes. It looked perfect. I deleted the spaces. I re-added them. I switched to a different text editor. It turned out that a stray tab character—likely introduced when I copied a snippet from a StackOverflow post from 2019—had infiltrated the file. YAML is a language designed by people who hate engineers. It’s a configuration format that treats whitespace as logic, which is like building a skyscraper where the structural integrity depends on how hard the wind is blowing.
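The lint check that would have saved me twenty minutes is about ten lines. A Python sketch, purely illustrative; the one thing it relies on is the YAML spec itself, which forbids tabs in indentation, so any tab in leading whitespace is a bug by definition:

```python
# Sketch: flag tab characters hiding in YAML leading whitespace.
# Returns (line, column) pairs, 1-indexed, for every offending tab.
def find_indentation_tabs(text):
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            hits.append((lineno, indent.index("\t") + 1))
    return hits

yaml_text = "stages:\n- stage: Deploy\n\tjobs:\n  - job: Infrastructure\n"
print(find_indentation_tabs(yaml_text))  # [(3, 1)]
```

Wire something like this into a pre-commit hook and the 2019 StackOverflow snippet never makes it past `git add`.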
In a devops azure context, your pipeline is only as strong as your last git push. And mine was currently a pile of garbage.
[INFRA-RECOVERY] Managed Identities and the “Standard_D2s_v3” Trap
Once the YAML was fixed, the deployment started. But then I hit the next wall: the SKU.
I had specified Standard_D2s_v3. In my head, this was fine. It’s a general-purpose SKU. But I forgot that in prod-us-east-2, the quota for the D series was nearly maxed out. The pipeline sat there for fifteen minutes in “Created” status before failing with:
Operation results in exceeding quota limits of Core. Maximum allowed: 100, Current in use: 98, Additional requested: 4.
I had to pivot. I needed to change the SKU to something available, but changing a VM SKU in a Scale Set isn’t always a “hot” swap. Sometimes it requires a full re-image of the nodes. If I did that now, I’d take down the remaining 5% of the site that was still limping along.
I had to manually go into the Bicep file and implement a conditional logic gate for the SKU selection based on the region’s capacity—something that should be handled by the cloud provider but instead falls on the shoulders of the SRE.
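The gate itself is trivial once you strip away the Bicep syntax: compute the extra cores a SKU needs, check them against the remaining quota, fall through to the next candidate. A Python sketch of the decision; the core counts and the fallback SKU are assumptions for illustration, not a published Azure mapping:

```python
# Sketch: pick the first SKU whose additional cores fit under the
# subscription's regional core quota. Core counts are assumed values.
SKU_CORES = {"Standard_D2s_v3": 2, "Standard_B1s": 1}

def pick_sku(preferred, instances, quota_max, quota_used, fallbacks=()):
    for sku in (preferred, *fallbacks):
        needed = SKU_CORES[sku] * instances
        if quota_used + needed <= quota_max:
            return sku
    raise RuntimeError("no SKU fits the remaining core quota")

# The 03:14 incident in miniature: 98 of 100 cores used, 4 more requested.
print(pick_sku("Standard_D2s_v3", 2, 100, 98,
               fallbacks=("Standard_B1s",)))  # Standard_B1s
```

In Bicep this becomes a ternary on a capacity parameter; the arithmetic is identical, it just hides behind the `?:` operator.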
TECHNICAL INTERLUDE #3: THE BICEP FIX
Here is the Bicep snippet that finally stabilized the environment. Note the use of the existing keyword to reference the Key Vault—this is crucial for avoiding the “Resource Not Found” errors that plague devops azure deployments.
param location string = resourceGroup().location
param vmSku string = 'Standard_D2s_v3'
param managedIdentityName string

resource mi 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = {
  name: managedIdentityName
  location: location
}

resource kv 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: 'kv-prod-secrets-001'
}

resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(mi.id, 'Key Vault Secrets User')
  scope: kv
  properties: {
    principalId: mi.properties.principalId
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-17de-408a-b874-0445c86b69e6')
    principalType: 'ServicePrincipal'
  }
}

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2023-09-01' = {
  name: 'vmss-app-prod'
  location: location
  sku: {
    name: vmSku
    tier: 'Standard'
    capacity: 10
  }
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${mi.id}': {}
    }
  }
  properties: {
    overprovision: true
    upgradePolicy: {
      mode: 'Automatic'
    }
    virtualMachineProfile: {
      extensionProfile: {
        extensions: [
          {
            name: 'KeyVaultForLinux'
            properties: {
              publisher: 'Microsoft.Azure.KeyVault'
              type: 'KeyVaultForLinux'
              typeHandlerVersion: '2.0'
              autoUpgradeMinorVersion: true
              settings: {
                secretsManagementSettings: {
                  pollingIntervalInS: '3600'
                  certificateStoreLocation: '/var/lib/waagent/Microsoft.Azure.KeyVault'
                }
              }
            }
          }
        ]
      }
    }
  }
}
I triggered the pipeline again. 05:15 AM. The sun was starting to come up, a sickly orange glow filtering through the smog. I watched the logs.
Checking for existing deployment...
Validating Bicep module...
Deployment started...
Ten minutes later, the VMSS started churning. The new nodes were coming online with the Managed Identity. No secrets. No expirations. Just pure, unadulterated Entra ID tokens being passed through the metadata service.
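For the curious, “tokens through the metadata service” means IMDS: on an Azure VM you request a token from the link-local endpoint 169.254.169.254 with a Metadata: true header, and Entra ID hands back a token for the attached identity. A Python sketch that only builds the request, since actually sending it works only from inside an Azure VM:

```python
# Sketch: construct (but don't send) a managed-identity token request
# against the Azure Instance Metadata Service. The endpoint, api-version,
# and required Metadata header are the documented IMDS contract.
from urllib.parse import urlencode
from urllib.request import Request

IMDS = "http://169.254.169.254/metadata/identity/oauth2/token"

def imds_token_request(resource, api_version="2018-02-01"):
    qs = urlencode({"api-version": api_version, "resource": resource})
    return Request(f"{IMDS}?{qs}", headers={"Metadata": "true"})

req = imds_token_request("https://vault.azure.net")
print(req.get_header("Metadata"))  # true
```

No secret in the request, nothing to expire, nothing to rotate. That is the entire argument for managed identities in one URL.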
I checked the metrics. The 503s were dropping. The latency was stabilizing. The site was back.
[POST-MORTEM-HARD-TRUTHS] Your Tools Won’t Save You
It’s 06:00 AM now. The rest of the team is starting to wake up, seeing the “resolved” notifications in Slack and pretending they would have known what to do. They’ll talk about “best practices” in the retro. They’ll talk about how we need to “embrace a devops azure culture.”
Let me tell you something about “culture.” Culture is what’s left after the fire has burned everything else away.
Here are the hard truths from the trenches:
- DevOps Azure is not a product you buy. It’s not a set of licenses for Boards and Pipelines. If you think installing a tool makes you an SRE, you’re the reason I don’t sleep. Most people use these tools to automate their own incompetence. They build faster pipelines to deploy worse code.
- The Portal is a drug. It’s easy at first, but it makes you weak. If you can’t rebuild your entire production environment using nothing but a terminal and a few YAML files, you don’t own your infrastructure—Azure does. And Azure is a fickle god.
- Managed Identities are the only way forward. If I see one more Service Principal secret stored in a plaintext Notepad file or a “shared” Key Vault, I’m going to lose my mind. Secrets are technical debt that eventually comes due at 3 AM.
- Abstraction is a lie. Bicep is better than ARM, but you still need to understand the underlying Resource Provider APIs. You still need to know that Microsoft.Compute/virtualMachines at version 2023-09-01 behaves differently than it did in 2021-03-01. If you don’t read the API documentation, you’re just guessing.
- SKUs matter. Don’t let a developer pick a VM size. They will pick the one with the most RAM and ignore the IOPS limits, the network bandwidth, and the temporary disk size. A Standard_D2s_v3 is a fine machine until you actually try to do work with it.
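And since someone will ask in the retro: the expiry check we should have been running nightly is roughly this. A Python sketch, assuming credential metadata shaped like `az ad app credential list` output; the helper name and the 30-day threshold are mine:

```python
# Sketch: flag Service Principal secrets that are expired or expiring
# soon, given credential records with an ISO-8601 "endDateTime".
from datetime import datetime, timedelta, timezone

def expiring_credentials(creds, within_days=30, now=None):
    """Return keyIds whose endDateTime falls within `within_days` of now."""
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=within_days)
    flagged = []
    for c in creds:
        end = datetime.fromisoformat(c["endDateTime"].replace("Z", "+00:00"))
        if end <= cutoff:
            flagged.append(c["keyId"])
    return flagged

creds = [
    {"keyId": "old-secret",   "endDateTime": "2024-03-14T03:14:00Z"},
    {"keyId": "fresh-secret", "endDateTime": "2030-01-01T00:00:00Z"},
]
now = datetime(2024, 3, 15, tzinfo=timezone.utc)
print(expiring_credentials(creds, within_days=30, now=now))  # ['old-secret']
```

Run it on a schedule, page someone at 3 PM instead of 3 AM. That’s the whole trick.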
I’m going to finish this espresso. It’s cold, it’s gritty, and it tastes like failure. But the site is up. The “devops azure” workflow held together by the skin of its teeth.
Don’t ask me for a “comprehensive” report. Don’t ask me how we can “empower” the developers to avoid this next time. The answer is simple: hire people who actually care about how things work under the hood, and stop believing the marketing fluff that says the cloud is easy.
The cloud is just someone else’s computer, and right now, that computer is on fire. I’m just the guy with the bucket.
Status: Resolved.
Root Cause: Human arrogance and a stray tab character.
Next Action: Sleep. Maybe.