D365 Copilot Testing: How to Validate AI-Generated Actions in Dynamics 365

Testing D365 Copilot features requires validating AI-generated decisions, not just UI states. Learn how to validate Dynamics 365 Copilot output across Finance, Supply Chain, Sales, and Business Central.

Dynamics 365 Copilot is no longer an optional add-on or an experimental feature. With Wave 1 2026, Microsoft has made Copilot central to how users interact with D365. It is embedded in Finance, Supply Chain, Sales, Customer Service, and Business Central, where it generates summaries, suggests actions, codes invoices, routes cases, and, in Business Central, executes entire AP and sales order workflows autonomously.

And most D365 QA teams have no strategy for testing any of it.

That is not a criticism. Testing AI-generated outputs is a genuinely new problem. Traditional ERP test automation (task recordings, RSAT, scripted UI tests) works by asserting that a known input produces a known output. Copilot does not work that way. Its outputs are probabilistic, context-dependent, and grounded in live ERP data. The test approach that catches a broken form submission cannot catch an AP agent that assigned the wrong GL code to a hundred invoices before anyone noticed.

This post covers what testing D365 Copilot features actually requires: the specific validation challenges each module introduces, the three testing approaches available and their trade-offs, and how AI agent-based testing closes the gap that scripted tools cannot reach.

For context on how Copilot testing fits within the broader D365 test automation picture, the Dynamics 365 test agents overview covers the full testing landscape across Finance, Supply Chain, Sales, and Business Central.

Traditional ERP test automation is deterministic. Given the same input, the same output should always appear. A journal with debit account 4000 and credit account 2000 should always produce exactly that ledger entry. A purchase order with vendor X, item Y, and quantity 10 should always generate the same purchase order line. This predictability is what makes scripted test automation feasible: you can assert the expected output and fail the test if it does not match.

Copilot breaks this model in two ways.

First, Copilot outputs are probabilistic. The same invoice processed by the BC Payables Agent today and tomorrow may produce slightly different internal reasoning, even if the GL coding decision is the same. The Copilot sidecar in Finance may summarise a period close in different language across two runs of the same data. Any test that asserts on the exact text of a Copilot response will be brittle: it will fail for the wrong reasons and pass for the wrong ones.

Second, what matters about a Copilot action is not the text it generated but the ERP state it created. When the BC Payables Agent processes an invoice, the question that matters is not whether the agent said “I have matched this invoice to PO-1234.” The question is whether the resulting AP ledger entry used the correct GL account, the correct vendor, the correct payment terms, and the correct posting period. The outcome lives in the database. The Copilot narrative is commentary.
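The difference can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical `ApLedgerEntry` record that a test harness has already read back from the ERP after the agent ran; the field names and values are invented for the example and are not the Business Central schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApLedgerEntry:
    """Hypothetical, simplified view of a posted AP ledger entry."""
    vendor_no: str
    gl_account: str
    payment_terms: str
    posting_period: str  # e.g. "2026-03"


def brittle_text_check(agent_message: str) -> bool:
    # Depends on exact phrasing, not on what was actually posted.
    return agent_message == "I have matched this invoice to PO-1234."


def outcome_check(actual: ApLedgerEntry, expected: ApLedgerEntry) -> list[str]:
    """Return the names of fields whose posted value differs from the expected value."""
    fields = ("vendor_no", "gl_account", "payment_terms", "posting_period")
    return [name for name in fields if getattr(actual, name) != getattr(expected, name)]


expected = ApLedgerEntry("V-1001", "6100", "NET30", "2026-03")
posted = ApLedgerEntry("V-1001", "6150", "NET30", "2026-03")

# Same decision, different wording: the text check fails for the wrong reason.
print(brittle_text_check("Invoice matched to purchase order PO-1234."))  # False
# The outcome check ignores wording and reports the field that is actually wrong.
print(outcome_check(posted, expected))  # ['gl_account']
```

The text check would also pass on an invoice whose narrative is word-perfect but whose posting is wrong, which is the more expensive failure.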

“Testing Copilot means testing what it did to the data, not what it said about the data. Those are different problems that require different tools.”

This distinction is why D365 Copilot testing cannot be solved by adding more RSAT recordings or extending existing test scripts. It requires an approach that can validate ERP data outcomes after an AI action, which is precisely what AI agent-based testing is designed to do. The agentic ERP testing framework explains why this architectural difference matters for any ERP QA team working with AI-generated workflows.

Before building a Copilot testing strategy, it helps to map exactly what each D365 module’s Copilot features produce and where the specific failure risks live.

| D365 Module | Key Copilot features (Wave 1 2026) | What to validate in testing |
| --- | --- | --- |
| D365 Finance | AI-extended Copilot sidecar with Client Actions, MCP-powered agent responses, financial period summaries, GL anomaly detection | GL account suggestions, posting period accuracy, MCP agent decision outputs, period summary numerical accuracy |
| D365 Supply Chain | AI demand forecasting, Copilot sidecar with natural language queries, MCP data operations, AI-powered supplier suggestions | Demand plan accuracy, Copilot query response correctness, supplier suggestion validation against master data |
| D365 Business Central | Payables Agent (AP end-to-end), Sales Order Agent, Agent Designer for custom agents, bank reconciliation Copilot | AP agent invoice matching accuracy, GL coding decisions, Sales Order data integrity, bank reconciliation match rate |
| D365 Sales | AI-assisted lead scoring, opportunity summaries, email drafting grounded in CRM data, meeting intelligence | Lead score accuracy vs. CRM data, opportunity summary correctness, email draft factual accuracy, CRM record updates |
| D365 Customer Service | AI case routing, supervisor assistance agent, Work IQ M365 Copilot integration, case summarisation | Case routing decision accuracy, escalation threshold correctness, case summary factual grounding, resolution suggestions |

The Business Central row carries the highest financial risk because BC’s Payables Agent and Sales Order Agent are the furthest along the autonomy spectrum: they do not suggest actions for a human to approve; they execute them. <CITE>The Payables Agent automates accounts payable end-to-end, reading invoices, matching vendors and accounts, and preparing invoices for approval with human oversight.</CITE> “Human oversight” means a human reviews before final posting, but the matching, coding, and preparation all happen autonomously. If the agent’s matching logic or GL coding is wrong, it will be wrong consistently across every invoice it processes before the oversight step catches it.

The Finance MCP row is worth highlighting separately. <CITE>Wave 1 2026 introduced Model Context Protocol (MCP) for Finance and Supply Chain, allowing agents to understand business logic without explicit instruction and navigate ERP data intelligently.</CITE> MCP agents operating in Finance can trigger actions based on natural language instructions grounded in live GL data. A misconfigured MCP agent does not produce an error message. It produces a transaction.

Finance and Supply Chain: validating Copilot sidecar and MCP outcomes

In Finance and Supply Chain, Copilot operates primarily through the sidecar chat experience and through MCP-enabled autonomous agents. Testing Copilot in this context means validating two distinct output types: informational outputs (period summaries, demand forecasts, anomaly alerts) and action outputs (transactions triggered by MCP agents or Client Actions).

Informational output validation requires checking that Copilot’s responses are correctly grounded in the underlying ERP data: that a period summary reflects the actual trial balance, that a demand forecast is based on the correct historical demand data, and that an anomaly alert corresponds to an actual anomaly in the GL. A Copilot summary that references the wrong figures is a compliance risk if it informs a decision before anyone checks the underlying data.
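A grounding check of this kind can be approximated with a small script. The sketch below assumes the summary text and the trial balance figures have already been exported by some other means; the regular expression, the `trial_balance` dictionary, and the sample figures are all illustrative assumptions, and the pattern only recognises amounts written with thousands separators or decimals.

```python
import re
from decimal import Decimal

# Matches amounts such as 1,250,000.00 or 48200.50, but not bare years like 2026.
AMOUNT_PATTERN = re.compile(r"\b(?:\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+\.\d+)")


def extract_amounts(summary: str) -> list[Decimal]:
    """Pull formatted monetary amounts out of free-text summary."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_PATTERN.findall(summary)]


def ungrounded_amounts(summary: str, ledger_figures: dict[str, Decimal]) -> list[Decimal]:
    """Return amounts quoted in the summary that match no known ledger figure."""
    known = set(ledger_figures.values())
    return [amount for amount in extract_amounts(summary) if amount not in known]


# Hypothetical figures exported from the trial balance for the period under test.
trial_balance = {
    "Revenue": Decimal("1250000.00"),
    "Operating expenses": Decimal("830400.00"),
}
summary = (
    "Revenue for March 2026 closed at 1,250,000.00 "
    "against operating expenses of 834,000.00."
)

print(ungrounded_amounts(summary, trial_balance))  # [Decimal('834000.00')]
```

A check like this only flags figures that appear in the text and nowhere in the ledger. It does not confirm that a correct figure is attached to the correct label, so it supports a human grounding review rather than replacing it.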

Action output validation, when Copilot triggers an ERP transaction through MCP or Client Actions, requires the same financial data validation as any other ERP transaction: correct GL account, correct financial dimensions, correct period, and balanced entry. The fact that Copilot triggered the action rather than a human does not change what the correct outcome looks like.
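As a rough sketch, those four checks might look like this once the journal lines have been read back from the ERP. `JournalLine` is a hypothetical flattened record, not a D365 entity, and the account numbers, period format, and dimension names are assumptions made for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class JournalLine:
    """Hypothetical flattened journal line, not a D365 entity definition."""
    account: str
    debit: Decimal
    credit: Decimal
    period: str            # e.g. "2026-03"
    dimensions: frozenset  # names of the financial dimensions set on the line


def validate_entry(lines, expected_accounts, open_period, required_dimensions):
    """Return a list of problems; an empty list means the entry passed every check."""
    problems = []
    total_debit = sum((line.debit for line in lines), Decimal("0"))
    total_credit = sum((line.credit for line in lines), Decimal("0"))
    if total_debit != total_credit:
        problems.append(f"unbalanced entry: debit {total_debit}, credit {total_credit}")
    for number, line in enumerate(lines, start=1):
        if line.account not in expected_accounts:
            problems.append(f"line {number}: unexpected GL account {line.account}")
        if line.period != open_period:
            problems.append(f"line {number}: posted to {line.period}, expected {open_period}")
        missing = sorted(required_dimensions - line.dimensions)
        if missing:
            problems.append(f"line {number}: missing dimensions {missing}")
    return problems


entry = [
    JournalLine("6100", Decimal("500.00"), Decimal("0"), "2026-03",
                frozenset({"Department", "CostCenter"})),
    JournalLine("2100", Decimal("0"), Decimal("500.00"), "2026-02",
                frozenset({"Department"})),
]
for problem in validate_entry(entry, {"6100", "2100"}, "2026-03",
                              {"Department", "CostCenter"}):
    print(problem)
```

The same function applies unchanged whether a person or Copilot created the entry, which is the point: the definition of a correct posting does not depend on who triggered it.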

Business Central: validating AI agent decisions at scale

BC’s Payables Agent and Sales Order Agent require the most rigorous testing framework because they operate autonomously at scale. A human processes invoices one at a time and catches errors in the process. The Payables Agent processes them in batches. Testing strategy needs to reflect that difference: not spot-checking one output, but building assertion-based coverage that validates every transaction the agent produces against the expected GL structure.

Key validation points for the BC Payables Agent:

  • Invoice-to-PO matching accuracy: did the agent match the invoice to the correct purchase order line?
  • GL account assignment: does the assigned GL code correspond to the correct account for this vendor category and item type?
  • Financial dimension carry-through: do the dimensions from the purchase order carry correctly to the AP entry?
  • Payment terms application: did the agent apply the correct payment terms for this vendor?
  • Posting period assignment: did the agent post to the correct open period, not the prior period?
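The five validation points above can be expressed as one check per field. The sketch below is illustrative only: the record types, the `GL_RULES` mapping from vendor category and item type to GL account, and all sample values are hypothetical stand-ins for whatever your test fixtures and chart of accounts actually define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseOrderLine:
    po_no: str
    line_no: int
    item_type: str
    dimensions: frozenset  # (dimension name, value) pairs


@dataclass(frozen=True)
class Vendor:
    vendor_no: str
    category: str
    payment_terms: str


@dataclass(frozen=True)
class ApEntry:
    invoice_no: str
    po_no: str
    po_line_no: int
    gl_account: str
    dimensions: frozenset
    payment_terms: str
    posting_period: str


# Hypothetical rule table: (vendor category, item type) -> expected GL account.
GL_RULES = {("FREIGHT", "SERVICE"): "6400", ("OFFICE", "ITEM"): "6100"}


def check_ap_entry(entry, po_line, vendor, open_period):
    """Return one pass/fail result per validation point for a single invoice."""
    return {
        "po_match": (entry.po_no, entry.po_line_no) == (po_line.po_no, po_line.line_no),
        "gl_account": entry.gl_account == GL_RULES.get((vendor.category, po_line.item_type)),
        "dimensions": entry.dimensions == po_line.dimensions,
        "payment_terms": entry.payment_terms == vendor.payment_terms,
        "posting_period": entry.posting_period == open_period,
    }


dims = frozenset({("Department", "LOG"), ("CostCenter", "CC-20")})
po_line = PurchaseOrderLine("PO-1234", 1, "SERVICE", dims)
vendor = Vendor("V-1001", "FREIGHT", "NET30")
# Entry as posted by the agent in this example: correct except for the period.
entry = ApEntry("INV-778", "PO-1234", 1, "6400", dims, "NET30", "2026-02")

results = check_ap_entry(entry, po_line, vendor, open_period="2026-03")
print([name for name, passed in results.items() if not passed])  # ['posting_period']
```

Keeping the result per field, rather than a single pass or fail per invoice, is what lets you later see whether one check is failing across many invoices.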

For BC specifically, the Business Central test automation guide covers the Agent Designer, AL extension regression, and how to build test coverage for BC AI workflows introduced in Wave 1 2026.

Sales and Customer Service: validating Copilot suggestion accuracy

In Sales and Customer Service, Copilot operates more in an advisory capacity: suggesting next best actions, drafting emails, scoring leads, summarising cases. Testing here is less about validating transaction outcomes and more about validating grounding: is Copilot’s output factually consistent with the CRM data it was supposed to draw from? A lead score that ignores the most recent activity log, or a case summary that misquotes the customer’s stated issue, is a quality problem even if it does not produce a wrong ERP transaction.

There are three approaches D365 teams are using to test Copilot features in 2026. Each has genuine strengths and genuine limitations. Most teams will need a combination of all three, applied to different Copilot feature types.

| Approach | How it works | Strengths | Limitations |
| --- | --- | --- | --- |
| Manual spot-checking | QA team manually reviews a sample of Copilot outputs after each release wave | Human judgement · catches obvious hallucinations | Inconsistent · not scalable · misses edge cases · provides no audit trail |
| Assertion-based testing | Tests assert that Copilot output contains specific expected values (e.g., correct GL code, correct vendor) | Consistent · automatable · repeatable | Fragile on probabilistic outputs · breaks when Copilot phrasing changes · requires constant maintenance |
| AI agent outcome validation | Agent validates the ERP state after Copilot acts, checking the data layer, not the Copilot response text | Self-healing · data-layer accurate · audit-grade evidence · scales with coverage | Requires an agent testing platform, not achievable with scripts or recording tools |

The key insight from the table above is that the right approach depends on what type of Copilot output you are validating. Manual spot-checking is appropriate for informational outputs that are difficult to assert programmatically: narrative summaries, email drafts, suggested next actions. Assertion-based testing works for structured outputs with defined expected values: specific GL codes, specific lead scores, and specific case classifications. AI agent outcome validation is the right approach for any Copilot feature that triggers an ERP action: AP agent postings, MCP workflow transactions, Sales Order Agent order creation.

For the Finance and BC scenarios that carry compliance and audit risk, only the third approach produces the field-level evidence that auditors require. A manual spot-check log and an assertion-based test suite both produce confidence. Only agent outcome validation, which asserts that the GL entry used the correct account, the correct dimensions, and the correct period, produces evidence.

The distinction between evidence and confidence in D365 Finance testing is covered in detail in the D365 Finance testing guide. The same principle applies to Copilot-generated Finance transactions.

A Copilot testing framework for D365 is not a single test suite; it is a tiered approach matched to the risk profile of each Copilot feature in your environment. Here is a practical framework to build on:

Tier 1: Informational Copilot outputs (manual review + grounding checks)

Features: period summaries, demand forecasts, Copilot chat responses, email drafts, case summaries. These are advisory outputs that a human reviews before acting on. Testing strategy: periodic manual grounding checks against the underlying ERP data. Confirm that key figures in Copilot summaries match the actual ledger balances, forecast models, or case records they claim to reflect. Flag discrepancies: they indicate hallucination or stale data grounding.

Tier 2: Structured Copilot suggestions (assertion testing for key field values)

Features: lead scores, case routing classifications, GL code suggestions, demand plan recommendations. These are Copilot suggestions that feed into ERP decisions. Testing strategy: build assertion tests that validate the suggestion output against expected values for defined input scenarios. A lead from a known high-value segment should score above threshold. An invoice from a vendor in category X should receive a GL suggestion from the correct account range. These tests are repeatable and automatable.
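One way to keep such assertions robust against probabilistic variation is to assert on thresholds and ranges rather than exact values. The following is a minimal sketch under that assumption; the `Scenario` type, the threshold of 75, the 6000 to 6999 account range, and the captured suggestion values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    suggestion: object                      # value captured from Copilot for a fixed input
    is_acceptable: Callable[[object], bool]  # rule the suggestion must satisfy


def at_least(threshold: int) -> Callable[[object], bool]:
    """Accept numeric scores at or above the threshold."""
    return lambda score: score >= threshold


def in_account_range(low: int, high: int) -> Callable[[object], bool]:
    """Accept GL codes that are numeric and fall inside the inclusive range."""
    return lambda code: str(code).isdigit() and low <= int(code) <= high


def failing_scenarios(scenarios) -> list[str]:
    """Return the names of scenarios whose suggestion breaks its rule."""
    return [s.name for s in scenarios if not s.is_acceptable(s.suggestion)]


scenarios = [
    Scenario("high-value segment lead score", 82, at_least(75)),
    Scenario("category X vendor GL suggestion", "7100", in_account_range(6000, 6999)),
]

print(failing_scenarios(scenarios))  # ['category X vendor GL suggestion']
```

A score of 82 today and 79 tomorrow both pass, while a GL suggestion outside the expected range fails regardless of how Copilot phrased its explanation.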

Tier 3: Autonomous agent actions (AI agent outcome validation)

Features: BC Payables Agent invoice processing, BC Sales Order Agent order creation, MCP workflow transactions in Finance and Supply Chain, any Copilot feature that triggers an ERP write. Testing strategy: AI agent outcome validation that asserts the ERP data state after the Copilot action. This tier requires a testing platform that can access ERP data at the record level, not just read the Copilot response, and assert specific field values across GL entries, AP postings, inventory records, and dimension sets.

This third tier is where most D365 Copilot testing frameworks currently have the largest gap, and where the financial exposure is highest. As Microsoft expands autonomous agent capabilities in Wave 2 2026, this tier will grow faster than either of the first two.

The tier that most teams are missing is Tier 3, and it’s the tier with the highest financial consequences. An AP agent that processes invoices correctly 98% of the time will generate wrong transactions at scale. Manual review catches individual errors. Agent outcome validation catches systematic ones.
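To make the scale argument concrete, the sketch below does two things: it works out how many wrong postings a 98% accuracy rate implies at a given volume, and it groups per-invoice check results to separate systematic failures from isolated ones. The invoice volume, the grouping by vendor category, and the `min_count` and `min_rate` thresholds are illustrative assumptions, not recommended values.

```python
from collections import Counter


def expected_wrong(volume: int, accuracy_pct: int) -> int:
    """Number of wrong transactions implied by an accuracy percentage at a given volume."""
    return volume * (100 - accuracy_pct) // 100


def systematic_failures(results, min_count=3, min_rate=0.5):
    """Flag groups that fail often enough to suggest a rule-level problem.

    `results` is an iterable of (group_key, passed) pairs, for example one pair
    per invoice keyed by vendor category.
    """
    totals, failures = Counter(), Counter()
    for group, passed in results:
        totals[group] += 1
        if not passed:
            failures[group] += 1
    return {
        group: failures[group] / totals[group]
        for group in totals
        if failures[group] >= min_count and failures[group] / totals[group] >= min_rate
    }


print(expected_wrong(10_000, 98))  # 200

# Hypothetical GL-account check results from one batch run.
batch = (
    [("FREIGHT", False)] * 4 + [("FREIGHT", True)]
    + [("OFFICE", True)] * 9 + [("OFFICE", False)]
)
print(systematic_failures(batch))  # {'FREIGHT': 0.8}
```

The single OFFICE failure is the kind of error a reviewer might catch by hand. The FREIGHT pattern is the kind that only shows up when every transaction in the batch is checked and the results are aggregated.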

Validate D365 Copilot actions at the data layer, not just the UI.

Sofy’s D365 agents test Copilot and AI agent outputs at the outcome level (GL entries, AP ledger records, financial dimensions), producing field-level evidence across Finance, Supply Chain, Sales, and Business Central.

What is D365 Copilot testing?

D365 Copilot testing is the process of validating that D365 Copilot features, including AI-generated summaries, suggestions, and autonomous agent actions, produce accurate, correctly grounded, and financially correct outputs in your D365 environment. It is distinct from traditional ERP testing because Copilot outputs are probabilistic and context-dependent, requiring a different validation approach than deterministic UI interaction testing.

Can RSAT or scripted tools be used to test D365 Copilot features?

No. RSAT and scripted UI automation tools record and replay human UI interactions. D365 Copilot features, particularly autonomous agents like the BC Payables Agent or MCP workflows, do not interact with the UI the way a human does. Their outputs live in ERP data records, not in UI states. Testing Copilot requires validating ERP data outcomes, which requires an approach that can access and assert against the database layer, not just the screen layer.

What should I validate when testing the D365 Copilot Payables Agent?

For the BC Payables Agent, validate five things at the data layer for each invoice the agent processes: (1) the invoice was matched to the correct purchase order, (2) the GL account assigned is correct for the vendor category and item type, (3) the financial dimensions from the purchase order carry through to the AP entry, (4) the payment terms applied match the vendor master, and (5) the posting period is the correct open period. None of these can be validated by reading the agent’s narrative output; all require direct assertion against the AP ledger entry.

How do AI test agents validate D365 Copilot output?

Sofy’s D365 agents validate Copilot output by checking the ERP data state after a Copilot action executes, not by parsing the Copilot response text. After the BC Payables Agent processes an invoice, Sofy queries the resulting AP ledger entry and asserts that each field value matches the expected outcome: correct GL account, correct dimensions, correct period, correct vendor. This produces a field-level assertion log that auditors can review and that catches systematic errors across batch processing, not just individual ones.