Surface ARM deployment root cause with actionable hints #6801
spboyer wants to merge 1 commit into Azure:main
Pull request overview
This PR adds root cause identification and actionable guidance for ARM deployment errors. When deployments fail, azd now attempts to identify the deepest error code in the nested error tree and provides targeted hints for 10 common ARM error codes. This addresses issue #6795, which notes that ARM errors represent 45.26% of all azd errors.
Changes:
- Added `armErrorHints` map with user guidance for 10 common ARM error codes
- Added `RootCause()` and `RootCauseHint()` methods to surface the most specific error and provide actionable guidance
- Updated `Error()` method to append hints when available
- Added 7 new tests covering various error tree structures
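The deepest-error search behind `RootCause()` can be sketched as a depth-tracking recursion. This is a minimal sketch, not the PR's `findDeepestError`: the `DeploymentErrorLine` below is a hypothetical, simplified stand-in with only `Code`, `Message`, and `Inner`, and `findDeepest` is an illustrative helper name. It compares depths explicitly, so a shallower sibling visited later cannot overwrite a deeper match (the "last error encountered" failure mode flagged in the Copilot review).

```go
package main

// DeploymentErrorLine is a hypothetical, simplified stand-in for the azd type of
// the same name: one node in a nested ARM error tree.
type DeploymentErrorLine struct {
	Code    string
	Message string
	Inner   []*DeploymentErrorLine
}

// findDeepest returns the node with a non-empty Code at the greatest depth at or
// below line, plus that depth. Ties keep the first node found in depth-first
// order. It returns (nil, -1) when no node carries a Code.
func findDeepest(line *DeploymentErrorLine, depth int) (*DeploymentErrorLine, int) {
	if line == nil {
		return nil, -1
	}
	var best *DeploymentErrorLine
	bestDepth := -1
	if line.Code != "" {
		best, bestDepth = line, depth
	}
	for _, child := range line.Inner {
		// Replace the current best only when the child branch is strictly deeper,
		// so a shallower sibling visited later cannot overwrite a deeper match.
		if node, d := findDeepest(child, depth+1); d > bestDepth {
			best, bestDepth = node, d
		}
	}
	return best, bestDepth
}

// RootCause returns the most specific (deepest) coded error in the tree, or nil.
func RootCause(details *DeploymentErrorLine) *DeploymentErrorLine {
	root, _ := findDeepest(details, 0)
	return root
}
```

With the tree from `Test_RootCause_MultipleBranches`, this returns the `AuthorizationFailed` node at depth 2 whether the `ValidationError` sibling is listed first or last.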
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cli/azd/pkg/azapi/azure_deployment_error.go | Added hint map, RootCause/RootCauseHint methods, and findDeepestError helper to identify and provide guidance for root causes |
| cli/azd/pkg/azapi/azure_deployment_error_test.go | Added unit tests for root cause detection and hint retrieval functionality |
| cli/azd/pkg/azapi/testdata/arm_sample_error_01.txt | Updated expected output to include hint for Conflict error code |
Comments suppressed due to low confidence (2)
cli/azd/pkg/azapi/azure_deployment_error_test.go:174
- This test does not verify which error code is returned when multiple branches exist at different depths. It should assert that the deepest error (`AuthorizationFailed` at depth 2) is returned, not just that some non-empty code is returned. This would have caught the bug in `findDeepestError` where it returns the last error encountered rather than the deepest one. Add: `require.Equal(t, "AuthorizationFailed", root.Code)`
```go
func Test_RootCause_MultipleBranches(t *testing.T) {
	err := &AzureDeploymentError{
		Details: &DeploymentErrorLine{
			Code: "",
			Inner: []*DeploymentErrorLine{
				{
					Code: "Conflict",
					Inner: []*DeploymentErrorLine{
						{Code: "AuthorizationFailed"},
					},
				},
				{
					Code: "ValidationError",
				},
			},
		},
	}
	root := err.RootCause()
	require.NotNil(t, root)
	require.NotEmpty(t, root.Code)
}
```
cli/azd/pkg/azapi/azure_deployment_error.go:42
- The hint suggests using `az vm list-usage`, which is specific to VM quotas, but `InsufficientQuota` errors can occur for many Azure resource types (Storage, Networking, etc.). Consider a more generic hint like "Your subscription has insufficient quota for this resource type. Check your quotas in the Azure portal or request an increase." This avoids suggesting a VM-specific command that may not apply to the actual quota issue.
```go
"InsufficientQuota": "Your subscription has insufficient quota. Check usage with 'az vm list-usage --location <region>' or request an increase in the Azure portal.",
```
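A hint table plus an `Error()` that appends the hint can be sketched as below. This is a minimal sketch under stated assumptions: `deploymentError` is a hypothetical stand-in for `AzureDeploymentError`, the table holds a single illustrative entry using the reviewer's resource-agnostic `InsufficientQuota` wording (not the PR's final text), and the `Hint:` label with a blank-line separator is an assumed format, not the one in `arm_sample_error_01.txt`.

```go
package main

import "fmt"

// armErrorHints is an illustrative one-entry code -> hint table. The text follows the
// reviewer's resource-agnostic suggestion rather than the PR's VM-specific wording.
var armErrorHints = map[string]string{
	"InsufficientQuota": "Your subscription has insufficient quota for this resource type. " +
		"Check your quotas in the Azure portal or request an increase.",
}

// deploymentError is a hypothetical stand-in for AzureDeploymentError: Message is the
// formatted ARM error text and RootCode is the code RootCause() selected.
type deploymentError struct {
	Message  string
	RootCode string
}

// Compile-time check that *deploymentError satisfies the error interface.
var _ error = (*deploymentError)(nil)

// RootCauseHint returns the hint for the root-cause code, or "" if none is known.
func (e *deploymentError) RootCauseHint() string {
	return armErrorHints[e.RootCode]
}

// Error appends the hint after the original message when one exists; the
// "Hint:" label and blank-line separator are assumed formatting.
func (e *deploymentError) Error() string {
	hint := e.RootCauseHint()
	if hint == "" {
		return e.Message
	}
	return fmt.Sprintf("%s\n\nHint: %s", e.Message, hint)
}
```

Unknown codes fall through to the unmodified message, so adding the table cannot change output for errors it does not cover.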
Branch updated from `2cc05e7` to `4ed2cbd`.
Add RootCause() method to AzureDeploymentError that finds the deepest error code in the ARM error tree. Add RootCauseHint() with guidance for the top-10 most common ARM error codes. Display hints in deployment error output to help users resolve failures faster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Branch updated from `4ed2cbd` to `06a790a`.
```go
// Display ARM deployment root cause hint if available
var armErr *azapi.AzureDeploymentError
if errors.As(err, &armErr) {
```
Let's see if we can build this on top of my closed PR. I will reopen it.
#6700
, Azure#6801 Remove placeholder/generated error rules and replace with real error scenarios backed by telemetry data:
- ARM soft-delete conflicts (PR Azure#6810): FlagMustBeSetForRestore, ConflictError, Conflict/RequestConflict with soft-delete keywords
- ARM root-cause hints (PR Azure#6801): InsufficientQuota, SkuNotAvailable, SubscriptionIsOverQuotaForSku, LocationIsOfferRestricted, AuthorizationFailed, InvalidTemplate, ValidationError, ResourceNotFound
- PowerShell hook failures (PR Azure#6804): ExitError type matching with stderr patterns for module loading, Az module, execution policy, and error action preference issues
- Keep only validated text patterns: AADSTS, BCP codes, QuotaExceeded, azure.yaml parsing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
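The "Conflict/RequestConflict with soft-delete keywords" rule combines a code check with a case-insensitive keyword search over the message. A minimal sketch, assuming the keyword list given in the #6827 commit message on this page (soft delete, soft-delete, purge, deleted vault, deleted resource, recover or purge); `isSoftDeleteConflict` is a hypothetical helper, not the YAML rule engine itself:

```go
package main

import "strings"

// softDeleteKeywords is taken from the keyword list in the #6827 commit message.
var softDeleteKeywords = []string{
	"soft delete", "soft-delete", "purge", "deleted vault", "deleted resource", "recover or purge",
}

// isSoftDeleteConflict reports whether an ARM code/message pair looks like a
// soft-delete conflict: the code must be Conflict or RequestConflict, and the
// message must contain one of the keywords, compared case-insensitively.
func isSoftDeleteConflict(code, message string) bool {
	if code != "Conflict" && code != "RequestConflict" {
		return false
	}
	lower := strings.ToLower(message)
	for _, kw := range softDeleteKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}
```

Requiring both the code and a keyword keeps a generic `Conflict` (for example, a concurrent operation) from being misreported as a soft-delete problem, which is also why the bare `Conflict` rule is ordered after the keyword rules.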
Closing in favor of #6827, which implements a comprehensive YAML-driven error-handling pipeline. All ARM root-cause patterns from this PR (AuthorizationFailed, InvalidTemplate, ValidationError, ResourceNotFound, soft-delete conflicts) are covered declaratively in the `error_suggestions.yaml` rules, and the reflect-based DFS traversal with `DeploymentErrorLine.Unwrap() []error` provides the same deep error-tree walking.
…errors. (#6827)

* Custom error patterns
* Extensible error handler pipeline with typed error matching
  - Add ErrorHandlerPipeline that evaluates YAML rules with three matching strategies: text patterns, error type via reflection, and property dot-path matching
  - Add ErrorHandler interface for named IoC-registered handlers that compute dynamic suggestions
  - Move ErrorWithSuggestion to pkg/errorhandler for extension visibility
  - Add errorType, properties, and handler fields to YAML schema
  - Add ARM deployment error rules (soft-delete, quota, SKU, auth)
  - Update ErrorMiddleware to use pipeline instead of direct service
  - Comprehensive tests for reflection matching and pipeline
* Replace sample rules with real scenarios from PRs #6810, #6804, #6801
  Remove placeholder/generated error rules and replace with real error scenarios backed by telemetry data:
  - ARM soft-delete conflicts (PR #6810): FlagMustBeSetForRestore, ConflictError, Conflict/RequestConflict with soft-delete keywords
  - ARM root-cause hints (PR #6801): InsufficientQuota, SkuNotAvailable, SubscriptionIsOverQuotaForSku, LocationIsOfferRestricted, AuthorizationFailed, InvalidTemplate, ValidationError, ResourceNotFound
  - PowerShell hook failures (PR #6804): ExitError type matching with stderr patterns for module loading, Az module, execution policy, and error action preference issues
  - Keep only validated text patterns: AADSTS, BCP codes, QuotaExceeded, azure.yaml parsing
* Support regex matching in properties (same as patterns)
  Property values now support the same matching conventions as patterns: case-insensitive substring by default, or `regex:` prefix for regular expressions. This enables filtering ExitError by Cmd field using `regex:(?i)pwsh|powershell` to avoid false positives on non-PowerShell commands.
* Replace regex: prefix with regex flag, add missing PR scenarios
  - Add 'regex: true' boolean field on rules instead of 'regex:' prefix convention on individual patterns. Cleaner YAML, consistent behavior across patterns and properties.
  - Consolidate soft-delete Conflict/RequestConflict keyword rules into single regex rules covering all keywords from PR #6810: soft delete, soft-delete, purge, deleted vault, deleted resource, recover or purge
  - Add missing PowerShell hook scenarios from PR #6804: Connect-AzAccount auth expired, login token expired
  - Update docs, tests, and matcher to use useRegex flag
* Add Unwrap support to AzureDeploymentError and DeploymentErrorLine
  Make AzureDeploymentError and DeploymentErrorLine idiomatic Go errors:
  - DeploymentErrorLine now implements error interface (Error() string)
  - DeploymentErrorLine.Unwrap() []error returns Inner children, enabling errors.As to traverse the full ARM error tree
  - AzureDeploymentError.Unwrap() []error returns both Inner error and Details tree for complete error chain traversal
  - findErrorByTypeName now supports multi-unwrap (Unwrap() []error) via depth-first stack traversal

  This means YAML rules like `errorType: DeploymentErrorLine` with `properties: { Code: FlagMustBeSetForRestore }` now match error codes buried 3-4 levels deep in ARM deployment error trees without any special traversal logic.
* Use DeploymentErrorLine for deep ARM error code matching
  Switch YAML rules from 'errorType: AzureDeploymentError' with 'Details.Code' to 'errorType: DeploymentErrorLine' with 'Code'. Since DeploymentErrorLine now implements Unwrap() []error, the pipeline's findErrorByTypeName traverses the full ARM error tree and finds DeploymentErrorLine nodes at any depth. This means error codes like FlagMustBeSetForRestore buried 3-4 levels deep are now matched correctly.
* Add end-to-end tests for deep nested ARM error matching
  Add 7 integration tests using real AzureDeploymentError JSON that verify the full pipeline finds DeploymentErrorLine codes at any depth in the ARM error tree:
  - FlagMustBeSetForRestore 3 levels deep
  - InsufficientQuota under DeploymentFailed
  - Conflict code + soft-delete keyword in message
  - No match when code differs
  - First matching rule wins with multiple codes
  - Matching through fmt.Errorf wrapper
  - ValidationError 4 levels deep

  Also fix findErrorByTypeName to check properties during traversal rather than only on the first type match. This ensures that when multiple DeploymentErrorLine nodes exist in the tree (some with empty Code due to DeploymentFailed stripping), the search continues until it finds one where both type AND properties match.
* Update cli/azd/docs/error-suggestions.md
* Reorder YAML rules by specificity: most specific first
  - Add ordering header explaining the specificity principle
  - Move bare 'Conflict' code rule AFTER Conflict + keyword rules
  - Move broad text patterns (AADSTS, quota) to very bottom
  - Remove overly broad 'OperationNotAllowed' text pattern (too generic, could match non-quota errors)
  - Group text patterns: specific first, broad/generic last
  - Typed error rules (errorType + properties) naturally come first since they are more specific than text-only patterns
* Add SkuNotAvailableHandler as custom handler example
  Demonstrates the named ErrorHandler pattern:
  - Handler registered in IoC as 'skuNotAvailableHandler'
  - YAML rule references it via handler: 'skuNotAvailableHandler'
  - Handler dynamically includes current AZURE_LOCATION in suggestion and provides az CLI command to list available SKUs
  - Falls back to generic guidance when no location is set
  - Unit tests for both with/without AZURE_LOCATION scenarios
* Update error-suggestions docs to reflect current implementation
  - Update examples to use DeploymentErrorLine instead of AzureDeploymentError
  - Update property paths from Details.Code to Code
  - Add real SkuNotAvailableHandler example with code
  - Add specificity ordering best practice
  - Add sku_handler.go to file layout table
  - Note multi-unwrap traversal in architecture section
* Replace SkuNotAvailableHandler with ResourceNotAvailableHandler
  - Query ARM Providers API for available regions per resource type
  - Extract resource type from error message via regex
  - Move ARM SDK implementation to pkg/azapi/resource_type_locations.go
  - Use ResourceTypeLocationResolver interface to avoid import cycles
  - Both SkuNotAvailable and LocationIsOfferRestricted use new handler
  - Add comprehensive tests for resource type extraction and suggestions
* Remove soft_delete_hint.go in favor of error pipeline, fix review feedback
  - Remove cmd/middleware/soft_delete_hint.go and tests (superseded by YAML rules)
  - Fix concurrency: add sync.RWMutex to PatternMatcher regex cache
  - Fix misleading test comment about LLM feature disabled vs no-prompt mode
* Add LocationNotAvailableForResourceType rule for Static Web Apps
  - Add YAML rule matching DeploymentErrorLine with Code LocationNotAvailableForResourceType using resourceNotAvailableHandler
  - Add integration test with real ARM validation error JSON
  - Add unit test with mock resolver for the full handler flow
  - Covers the exact error returned when deploying Static Web Apps to an unsupported region (e.g. eastus)
* Add ResponseError rule for LocationNotAvailableForResourceType
  When BeginValidateAtSubscriptionScope fails immediately (before polling), the error is an azcore.ResponseError wrapped in fmt.Errorf, not an AzureDeploymentError. Add a rule matching ResponseError.ErrorCode so the handler fires for both code paths.
* Fix resource type extraction to prefer quoted form over URLs
  extractResourceType now first looks for 'resource type Microsoft.X/Y' in the error message before falling back to bare matches. This prevents matching Microsoft.Resources/deployments from the ARM URL instead of the actual resource type like Microsoft.Web/staticSites.
* Use azd environment instead of OS env vars for handler
  Inject EnvironmentResolver interface into ResourceNotAvailableHandler so it reads AZURE_LOCATION and AZURE_SUBSCRIPTION_ID from the azd environment (.env file) rather than OS environment variables. The lazyEnvironmentResolver adapter in cmd/container.go wraps the scoped lazy environment, falling back gracefully when no project is loaded.
* Replace DocUrl with Links for multiple reference links
  - ErrorWithSuggestion now supports Links []ErrorLink (URL + optional Title)
  - UX display renders links as bulleted list with hyperlinks via WithHyperlink
  - Drop indentation from error display for cleaner output
  - Convert all YAML rules from docUrl to links with titles
  - Update all tests and middleware mapping
* Pass matching rule to error handlers for link merging
  Update ErrorHandler.Handle signature to accept the matching ErrorSuggestionRule, giving handlers access to links and other static data defined in the YAML. The ResourceNotAvailableHandler now merges rule links into its output.
* Remove duplicate short-circuit block in error middleware
* Remove dead ErrorSuggestionService and MatchedSuggestion types
  These were superseded by the ErrorHandlerPipeline and only used in their own tests.
* Move ResponseError test to errorhandler, rename integration tests
  Address review feedback: pipeline tests belong in errorhandler package. ResponseError test moved with local mock type to avoid azapi import. DeploymentErrorLine tests stay in azapi (need NewAzureDeploymentError).
* Rollback extension changes
* Add errorhandler and Getenv to cspell dictionary
* Validate and improve YAML reference links
  - Replace aka.ms/azure-dev/azure-yaml (sign-in gate) with direct learn.microsoft.com/azure/developer/azure-developer-cli/azd-schema
  - Add Azure products-by-region link to resource availability rules
  - Improve link titles for clarity
* Updates links
* Add JSON schema for error_suggestions.yaml
  Provides intellisense for YAML authors: field descriptions, validation (anyOf patterns/errorType required), link format, etc. Schema lives next to the YAML in resources/ as a local reference.
* Fix Az-module pattern over-broad matching, restore skip guard
  Address spboyer review feedback:
  - Combine Az module patterns into single regex requiring both `Az.<cmdlet>` and 'is not recognized' to match together
  - Restore skipAnalyzingErrors block for control-flow errors (interrupt, environment already initialized, etc.) before the AI agent flow

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kristen Womack <5034778+kristenwomack@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
Adds root cause identification and actionable guidance to ARM deployment errors. When a deployment fails, azd now highlights the deepest error code and provides targeted hints for the top-10 most common ARM error codes.
Fixes #6795
Changes
- `azure_deployment_error.go`
- `azure_deployment_error_test.go`
Data Context
ARM errors account for 45.26% of all azd errors (~57,956). This improves the UX by surfacing the root cause along with actionable next steps.