Copilot Studio Evaluations

Share
A feature I’ve been getting more hands-on with recently is Copilot Studio evaluations. These give us the ability to run pre-defined tests against an agent to determine the strength of outputs.
Are the responses accurate? Is the agent making up its own responses? Is anything sensitive sneaking through custom or out-of-the-box guardrails? Are there gaps in knowledge?
Evaluations attempt to help us solve these questions. Is it a 100% substitute for red teaming your agents? Nope, but it can get us some of the way there.
In this first of a few articles, the focus will be on evaluating single responses.
Table of Contents
ToggleAccessing Evaluations
Copilot Studio evaluations are now generally available and visible inside every agent build. You can find evaluations in the menu across the top when in design mode:

Here, you can access all your created test sets and add news ones:

Creating a new Evaluation
Selecting New evaluation in the top left takes you to the setup wizard. We can design test sets for single responses or full conversations (in preview in time of writing).

There are different ways to create your test set, depending on your needs and data.
CSV upload
You can work on a csv file of generic/popular questions and expected responses. Once complete, you can upload the file into your Copilot Studio evaluation.
For best results, use the CSV template that’s linked above the upload box:

I’ve noticed from a few videos I’ve watched that this template has changed a few times. That said, it’s a good guide for what you need to do. As it stands, the current template has a few sample questions in it to get you started. Make sure you take note of, and follow the instructions at the top of the template. For example, max of 100 questions, max 500 characters per question. You can use the LEN() function in a csv/excel file to check the length of text in a cell. The number returned includes spaces.

You don’t need to fill in the expectedResponse for each question. It largely depends on what type of test you want to run for each question. I cover the different evaluation types later in this article.
In this crazy world of AI, we still can’t beat a good old csv or xlsx file 😀
Quick question set
Speaking of AI, this option will use AI to generate 10 questions based on your agents description & instructions. This is another good reason to make sure your instructions are of 5-star quality.
Clicking this option will navigate you to the setup experience to show the generated questions:

From there, you can update the test set with any expected responses to the questions it’s generated. It will pay to have those closest to the data & processes to help flesh this out.
Keep in mind, this is AI generated so may not be 100% accurate.
Full question set
This goes beyond the 10-question set generation and focuses on generating with available knowledge sources or topics instead:

As you can see, the above shows I have no supported sources. However, I’ve connected to both SharePoint and manually uploaded files, following the guidance here, but still not recognised. So, it can be a bit hit & miss sometimes (or I’m doing something wrong!).
You have more control here as to how many questions you’d like generated, in increments of 10.

Selecting Topics to generate a test case will give you the same option to control how many are created.
Use your test chat conversation
This is pretty cool, as it can ingest your most recent chat and use it to create questions in a new test case. It will capture all the questions you’ve asked the agent in that test session, and the agents cumulative responses.
You can also use this icon at the top of a test session to generate the test set that way.

Evaluating multi-turn conversations is currently a preview aspect that I’ll cover in another post.
Evaluation methods
Each test case can use one or multiple evaluation methods to help assess quality. You can mix and match regardless of what method you’ve used to generate the case.
The default for Copilot Studio evaluations is always General quality, you can keep or remove this.

Use the Add test method option to add others. Any previously selected will be greyed out.

I’ve played around with all of these, so sharing my experiences and thoughts for each one below.
General meaning
This will cover any open-ended questions where there’s no single correct and exact answer, but the response is coherent, accurate and well-structured.
Use this method if you want to assess a response and ensure it’s meeting expected standards. Likely best for user questions that are geared towards “explain”, “summarise”, “what is” and “where is” scenarios.
For the expected response, put some suggestions of the key points that should feature. Remember, an agent using generative AI won’t generate the same response each time, but it should/must include the key points and themes.
Example:
Question: “What pension contributions are available in my company?”
Expected response: “An explanation that must cover the levels – 3%, 5% and 7% and the corresponding business match – 2%, 4%, 6%”
The wording might change from evaluation to evaluation, but I’d expect it to show the percentages each time to show the general meaning qualifies as a match.

Compare meaning
This tests whether an agent can respond with the same meaning. Again, it won’t ever be an exact match if generative AI is creating the answers each time, but the intent must be similar. If you want to get fancy, this is also known as semantic equivalence.
This method is great for assessing paraphrased things like policy facts where wording flexibility is fine, but the substance & tone must be on point.
You can set the pass score threshold for this one when creating the test case:

You can edit the threshold at any time by opening the test case and selecting this icon:

Tool use
This method assesses whether a user input will invoke a tool or topic within the agent.
Extending agents with tools and topics is increasingly common as our maturity with the capabilities grows. So, it’s going to be very important for us to be able to validate that a users’ natural language enquiry can trigger them as needed. It’ll then allow us to evaluate the action or response is correct too.
The configuration allows an additional ‘Tools’ column, where you can select one or multiple tools or topics relating to the question:

The only thing to be careful of here would be who is on the receiving end of these tools. For example, a tool to send an email to HR is going to be great for a Production agent, but does it need a different email address for testing and evaluation runs?
Keyword match
This method checks that the response generated contains one, or several keywords. Could be a great fit for things like acronyms, named policies or references; things very specific that need to be in a non-deterministic response.
When adding this as a method, you can choose whether the response should contain any or all of your specific keywords:

At this stage, worth noting how this works. I tried to use the file template to upload a test set, separating my keywords with semi-colons. Based on experience across the Microsoft stack, I thought the semi-colon approach would work here too:

The upload reads this literally, showing the expected response as the strings defined in the upload file. In hindsight, should have expected this when using a csv file:

This approach doesn’t work and will cause your evaluation to 100% fail. Instead, you need to leave the Expected response column blank in the csv template and add each keyword in manually, one at a time:

Caught me out a couple of times, so hopefully reading this means you won’t be.
Text similarity
Use this method when you want a correct answer to match, or almost match what’s generated. According to the documentation, this is great when paired with the Compare meaning method.
With General, you have a lot of flex with how a response is scored compared to the expected answer. If you’d like a more thorough scoring of the response and refining your expected answer, Text similarity is the winner here.
As with other methods, you can determine the pass rate score out of 100.
Exact match
As it says on the tin: is the response generated by the agent an exact match to the expected response, every time?
In my experience so far, this isn’t such a great fit for any AI-generated answers, that of course won’t be an exact match, word for word, every time. If you want to determine whether keywords exist in a response, opt for the Keyword match method.
Exact match seems to be stronger for deterministic scenarios. For example, if you have a topic containing a hard-coded response to a question, this will yield an exact match every time.

Custom
If none of the above methods work for you, building a custom one might be your way forward. This lets you test agent responses using your own specific criteria.
For this to work, firstly you must configure suitable instructions. As per Microsoft’s documentation, these should be goal orientated, use only allowed characters and specify criteria with bullet points.
From there, you can determine the custom outputs, known as Labels, with the relevant success or failure outcomes.

User authentication
In Copilot Studio evaluations, connections and authentication control what your agent can access during testing. You choose a test profile (user account) to simulate different users, allowing you to see how responses might vary based on permissions; for example, a manager vs. their direct reports. However, multi-profile testing only works for agents that don’t rely on authenticated connectors.
A key point is that tool connections always use your currently signed-in account, not the selected test profile. If your logged-in account doesn’t have the required connections configured, the evaluation will fail. So in practice, your maker account must already be authenticated to any tools the agent uses.
Finally, evaluations run using real permissions, meaning test cases can surface sensitive data accessible to that account. Since anyone with access to the agent can view its test sets, it’s important to use the right accounts and apply proper governance controls when testing.
You can run a test set without authentication should you wish, however the recommendation is to provide a connection:

Results and test sets
The main page for Copilot Studio evaluations shows your configured test sets and recent runs. As per the documentation, the agent stores test results for 89 days:

The overview gives quick insights into test runs (most recent at the top) and their overall score:

You have some options available when hovering over a test run. Exporting results will download that instance into a csv file for you. There’s also a dedicated option to re-evaluate, plus a sub-menu with additional options:

Selecting an evaluation will show the results. Depending on the method, this might show pass/fail (ie for Tool use) or deeper insights such as Pass score (ie for Text similarity):

Clicking on a specific record within the evaluation will show further insights into the scoring, what the question was, knowledge sources, response time and more:

There is also the opportunity to compare a selected evaluation to a previous one of the same method. At present, this only shows for Exact match evaluations types. The documentation suggests this should be an option for all types, so hoping that’s the case soon.

Clicking the blue Compare button shows a popup with a side by side comparison. You can use the arrows in the top right to cycle through each question & response from the test set:

Comparing test sets is hopefully going to be a great way to see regression. This is going to be important whenever your knowledge sources changes (ie new documents added), or when the agents default model changes.
Solution bug
A good practice is to create a solution in your development environment, then kick off a new agent build from there:

Then, when building your agent, all its components are auto-added to the solution; system and custom topics, tools, knowledge etc.
However, this doesn’t always happen for test cases so you’ll need to add them manually. This is an important step if you want to promote your agents to test/production and keep the evaluations as part of the deployment process.
To add them manually, go to your solution and filter by agents. Select the agent, then Advanced menu, select Add required objects:

This will then add the test cases and all their questions into the Agent Components section of the solution. Keep in mind though, this adds both the test case (parent) and all the questions. If you’re using the same question in different test cases, it will be difficult to see what relates to what:

This is where a good naming convention for your test cases might come very handy indeed. Hopefully the above solution issue will be resolved in due course.
Other things to note
Couple of other things worth mentioning when it comes to Copilot Studio evaluations:
- To repeat – this is not a substitute for robust red teaming
- You can only run one evaluation at a time, per agent.
- Sometimes evaluations can take a while to run, so be patient 🙂
- Be sure to test for things that SHOULD fail
I think the last point is especially important. Yes, we want to validate answers the agent should give us. But what if users ask questions that shouldn’t be answered?
This can be a way of testing Responsible AI filters are working, or simply that the agent isn’t using its own knowledge for answer generation.
Therefore, consider at least 1 test set where you expect every response to fail:

I believe Copilot Studio evaluations should be mandatory for every agent build. It’s a strong addition to the product and ensures our agent responses remain of high quality.
Evaluations have their own API too, meaning we can bring this automated testing into pipelines or via scheduled flows. I’ll cover this in my next article.
I also enjoy seeing others shine a light on these kind of features – not always the headline grabbers but of upmost importance. Check out this post by Vivian Voss for additional insights into evaluations.
Thanks for reading. If you liked this article and want to receive more helpful tips about Power Platform / Agent governance, build & architecture, don’t forget to subscribe or follow me on socials 😊

