As explained in the previous section, our agent has been told what to do using natural language. As such, there is some ‘fuzziness’ between the instructions as we intend them and how the agent actually behaves. To confirm that it is behaving as we want it to, we must test it with queries and evaluate its responses.
We need to demonstrate that our agent correctly interprets prompts as students might actually word them. We used some authentic queries from previous students on the unit, but also brainstormed hypothetical questions. We did not solicit student feedback at this stage, as we were initially more concerned with the accuracy of the information given by the agent.
As with all prompts, wording affects the outcome. We must be cognisant that students may articulate the same query with different vocabulary, grammar, spelling and inference depending on their background. When evaluating a student-facing agent, we should therefore test various ways of phrasing, including idioms, subtext and spelling errors in technical terminology.
Given a prompt, the agent can essentially do one of four things, depending on whether the desired information is in the knowledge sources (Figure 1):

We obviously want the agent to reliably recall information present in the knowledge sources, so we must also test for when it fails to recall, or, more insidiously, misrepresents the content in the source.
Similarly, we have directed the agent not to use outside information. If the answer is not in the knowledge sources, we want the agent to refer the user to the unit director or administrators (i.e. direct them to a human), as we do not want the agent to be a ‘dead end’ for student queries. We absolutely do not want the AI generating its own information beyond what is in the source documents, as it risks providing conflicting or incorrect guidance beyond our supervision. The agent must therefore also be tested with prompts whose answers lie outside the knowledge sources, including seemingly irrelevant ones.
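If you prefer to script this kind of testing rather than paste prompts in by hand, a small evaluation set makes the checks repeatable. The sketch below is purely illustrative: `ask_agent` is a hypothetical helper standing in for however you actually query the agent (the Copilot Studio test pane, or an integration if one is available to you), and the prompts and expected behaviours are examples based on our use case rather than anything built into the platform.

```python
# A minimal evaluation set: each case pairs a prompt variant (paraphrase,
# misspelling, subtext) with the behaviour we expect from the agent.
# "ask_agent" is a hypothetical helper passed in as a parameter, so this
# sketch makes no assumptions about Copilot Studio itself.

test_cases = [
    ("When is the poster deadline?",              "answer_from_source"),
    ("wen do we hand the postr in",               "answer_from_source"),  # spelling errors
    ("Can I get an extension on the deadline?",   "refer_to_human"),      # not in the sources
    ("What's the point of doing a presentation?", "wellbeing_referral"),  # subtext of concern
]

def evaluate(ask_agent):
    """Run every test prompt and record the reply alongside the expected behaviour."""
    results = []
    for prompt, expected in test_cases:
        reply = ask_agent(prompt)  # hypothetical helper returning the reply text
        results.append({"prompt": prompt, "expected": expected, "reply": reply})
    return results
```

Even a handful of cases like these makes it easier to re-run the same checks after the platform, the knowledge sources or the agent’s configuration changes.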
Generally, our agent performed well with articulate queries, retrieving relevant information and conveniently providing a link to the source document for the user. However, it quickly became apparent that, despite being prompted not to use outside sources in the initial configuration, the Copilot agent was enabled by default to use its own general knowledge. This had deleterious outcomes, with the agent generating information far beyond its remit. For our purposes, the agent’s general knowledge needed to be explicitly disabled with a separate option in the Copilot settings (Figure 2), which significantly improved the output.

Example Prompts and Responses
Over the course of a few hours, the agent was tested with various user prompts. Here are some examples of its outputs; note, however, that generative AI responses are dynamic and the same prompt can yield a wide variety of answers. Identical prompts were tested repeatedly and we generally observed similar responses, though some replies were more acceptable than others.
Another problem we noticed was that the behaviour of our agent changed over a period of weeks – presumably due to updates to the Copilot Studio platform and the LLM used. This does raise concerns about the consistency of an agent once deployed (and also makes it quite hard to write a blog about it). Interpret these examples as what can happen, rather than what will necessarily happen for a given prompt.
![]() | A good response from the agent for a typical student query regarding the deadline of the assessment. It is clear and concise as directed, rephrasing information from one of the documents in the knowledge sources. Note, however, that it does not present all the relevant detail from that document, namely that the poster is submitted in week 16 followed by the presentation in week 17. As with most Copilot outputs, a convenient link to the source is provided, though this actually opens as an unformatted text file. |
![]() | For this prompt the agent again performs well, summarising guidance from multiple source documents in a way that reflects the intent of our instructions. |
![]() | This is perhaps a lazy question to ask a chatbot, but one that is often asked in various ways by students in person. The agent correctly referred the user to a list of provided examples; however, we would prefer that it state the stipulations in full, namely that students can present on any CNS-acting drug as long as it is in clinical use or being investigated for clinical use. This requirement appears in close proximity to the example list in the documents provided to our agent. What we see is that while the agent can effectively parse and present relevant information from a source document, it may be selective, omitting anticipated information. |
![]() | Here the agent’s response is a little misleading. The students are not so much encouraged as told explicitly that they must present a drug which acts on the CNS. The concern here is that, from the wording of this response alone, a student could justify writing about any non-CNS drug. |
![]() | When the prompt uses the word ‘worried’, the agent correctly directs students to wellbeing support with the ‘topics’ functionality behaving as anticipated. |
![]() | Gratifyingly, the agent also responds correctly to prompts which convey a sense of concern but do not necessarily include the phrases that were used to generate the topic (stressed, scared, worried, overwhelmed, sad, upset). For example, these prompts, which include synonyms of those key phrases, all correctly refer the student to wellbeing services. |
![]() | Some more inferential prompts that are suggestive of concern are also correctly directed by the topic function, again demonstrating the ability of the agent to infer context and intention. |
![]() | However, some prompts which asked more focussed hypothetical questions or contained subtext were not consistently directed to the wellbeing message. The responses to these particular prompts were very variable, even across the same chat session. On the face of it, the agent returns a reasonable answer; however, in my experience, when a student asks what the point of something is, it is not because they wish to discuss pedagogy. |
![]() | Here, the agent unfortunately does not recognise the implied concern. Even worse, it presents the user with the lower end of the marking descriptors as the consequence of presenting poorly. |
![]() | Disappointingly, for this prompt the agent actually tries to suggest why someone might be bad at presenting. Such information is not present in the student guidance documents; these concepts have been generated despite the agent being explicitly told not to use general knowledge or internet sources. While some of the information here may be useful, it has not been evaluated by teaching staff and could be misleading. Even worse, this response misses the concerning subtext of the prompt. Similar prompts using words like ‘rubbish’ or ‘bad’ more consistently referred the student to wellbeing through the topic function. |
With the last three examples, the preferred outcome is to get the student talking to a tutor or academic for support. There are obvious dangers in a student feeling that they are unsupported or, even worse, being given responses which actively contribute to their distress. One thing that we noticed over the course of testing these prompts (repeatedly) was that the referrals to wellbeing became less reliable as the conversation with the agent continued. This conversational drift is a known phenomenon in AI chatbots. Put overly simplistically, AI chatbots (like humans) can forget what they were talking about if they are not reminded occasionally.
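Drift of this kind is worth probing deliberately rather than waiting to stumble across it. The sketch below is one illustrative way to do so, assuming a hypothetical `start_chat` helper that opens a fresh session with the agent and exposes a `send` method returning the reply text; the filler prompts, the concern prompt and the crude keyword check are all placeholder choices, not anything provided by Copilot Studio.

```python
# Probing conversational drift: how often does the wellbeing referral still
# fire after a number of unrelated turns? "start_chat" is a hypothetical
# helper that opens a fresh session and returns an object with a
# send(prompt) -> reply-text method; it is passed in as a parameter.

FILLER_TURNS = [
    "When is the poster deadline?",
    "How long should the presentation be?",
    "What referencing style should I use?",
]

CONCERN_PROMPT = "I'm really worried I'm going to mess up the presentation."

def referral_rate(start_chat, n_filler_turns, repeats=10):
    """Estimate how often the wellbeing referral appears after n earlier turns."""
    hits = 0
    for _ in range(repeats):
        chat = start_chat()  # fresh session for each repeat
        for prompt in FILLER_TURNS[:n_filler_turns]:
            chat.send(prompt)  # unrelated turns before the test prompt
        reply = chat.send(CONCERN_PROMPT)
        if "wellbeing" in reply.lower():  # crude keyword check for the referral message
            hits += 1
    return hits / repeats
```

Comparing the rate at zero, three and more filler turns gives a rough picture of how quickly the wellbeing topic stops firing as the conversation lengthens.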
![]() | Information about extensions and exceptional circumstances is not included in the documents we provide to students (though they are signposted to this information elsewhere on the OLE). The agent behaves correctly by not providing an answer; however, it fails to refer the user to the Unit Director’s Padlet or the Unit administrator email as per our initial instructions. |
![]() | The agent’s response to this query highlights the commerce-centric nature of Copilot Studio. In this case, the agent is not generating text but is returning a message from a ‘system topic’ that is created by default for all Copilot agents. It is actually behaving as it should. Like the topic we created to direct students to wellbeing support, this built-in topic responds to phrases such as ‘talk’, ‘call’ and ‘customer service’ with the placeholder message we see here. System topics can be modified, so we can edit this message to direct users to the Unit Director and administrators. We can also disable this ‘escalation topic’ so that the agent evaluates the query like any other; however, this can cause erratic behaviour from the agent! |
![]() | Interestingly, the agent determined that aspirin would not be a suitable drug based on the given criteria, demonstrating reasoning based on external knowledge. Unfortunately, it is also incorrect: aspirin does have effects on the central nervous system and would be an acceptable subject for a student poster in this assessment. We do not want our agent to ‘overthink’ or use external information for exactly this reason. The agent also steers the student towards ‘suggested drugs’ for the assessment. The nuance here is that we provide a list of suggestions but state explicitly that we encourage the students to be creative and demonstrate independent research. |
![]() | The agent correctly interprets the misspelled drug name (morphine), but again draws on external knowledge and reasoning to state that it is an acceptable topic for the poster. The agent also provides a verbose list of what to include in the poster. While these general requirements are in the knowledge sources, their specific application to morphine is not. The response is somewhat at odds with our instruction to be concise, and unfortunately also circumvents some of the thinking we want our students to demonstrate when building their poster. |
This relatively brief characterisation of the agent demonstrates a number of positive outcomes, but also a number of shortfalls.
View the other posts in this series here.

















