-
5. Reliability and safety
5.1 Data suitability
The data used to operate, train and validate your AI system has a significant impact on its performance, fairness and safety. In your answer to this question, explain why the chosen data is suitable for your use case. Some relevant considerations are outlined below.
When choosing between datasets, consider whether the data can be disaggregated by marginalised groups, particularly by Indigeneity. If the data is Indigenous data, you should refer to the guidelines in the Framework for Governance of Indigenous Data (see section 5.2 below).
Data quality should be assessed prior to use in AI systems. Agencies should select applicable metrics to determine a dataset’s quality and identify any remediation required before using it for training or validation in AI systems. Relevant metrics to consider include relevance, accuracy, completeness, timeliness, validity and lack of duplication. One method to help ensure good quality data is to set minimum thresholds appropriate to specific use cases, such as through the acceptance criteria discussed below at 5.4. An example of a specific framework for determining data quality in statistical uses is the ABS Data Quality Framework.
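As an illustration only, the sketch below computes two of these metrics (completeness and duplication) for a tabular dataset and compares them against minimum thresholds. The file name, thresholds and pass/fail logic are hypothetical and would need to be set for your specific use case and acceptance criteria.

```python
import pandas as pd

# Hypothetical minimum quality thresholds for a specific use case.
THRESHOLDS = {
    "completeness": 0.98,  # minimum share of non-missing values in the worst column
    "duplication": 0.01,   # maximum share of duplicate rows
}

def assess_quality(df: pd.DataFrame) -> dict:
    """Compute two simple data quality metrics for a tabular dataset."""
    return {
        "completeness": float(df.notna().mean().min()),  # worst column completeness
        "duplication": float(df.duplicated().mean()),    # share of duplicate rows
    }

def meets_thresholds(metrics: dict) -> bool:
    """Check the computed metrics against the agreed minimum thresholds."""
    return (
        metrics["completeness"] >= THRESHOLDS["completeness"]
        and metrics["duplication"] <= THRESHOLDS["duplication"]
    )

if __name__ == "__main__":
    data = pd.read_csv("training_data.csv")  # hypothetical training dataset
    metrics = assess_quality(data)
    print(metrics, "PASS" if meets_thresholds(metrics) else "REMEDIATE BEFORE USE")
```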
Where third party material or data is being used to operate, train or validate an AI system, agencies should assess the data and the AI system for copyright concerns due to the potential for copying or transforming material that is protected by copyright or broader intellectual property laws.
You should also consider:
Data provenance
Involves creating an audit trail to assign custody and trace accountability for issues. It provides assurance of the chain of custody and its reliability, insofar as origins of the data are documented.
Data lineage
Involves documenting data origins and flows to enable stakeholders to better understand how datasets are constructed and processed. This fosters transparency and trust in AI systems.
Data volume
Consider the volume of data you need to support the operation, training and validation of your AI system.
5.2 Indigenous data
Describe how any components of your AI system have used or will use Indigenous data, or where any outputs relate to Indigenous individuals, communities or groups.
All Australian Public Service (APS) agencies are required to implement the Framework for Governance of Indigenous Data (GID). The GID adopts the definition of ‘Indigenous data’ as provided by Maiam nayri Wingara Indigenous Data Sovereignty Collective:
Information or knowledge, in any format or medium, which is about and may affect Indigenous peoples both collectively and individually.
If the data used to operate, train or validate your AI system, or any outputs from your AI system, are Indigenous data in line with the Maiam nayri Wingara definition above, you should refer to the guidelines in the GID.
This includes applying the principles of respect for cultural heritage, informed consent, privacy (including collective or group privacy) and trust, to all stages of the ‘Data Lifecycle’. These concepts, including the FAIR (Findable, Accessible, Interoperable, and Reusable) and CARE (Collective Benefit, Authority to Control, Responsibility, Ethics) principles, are described in the GID.
Relevant practices to consider in this context include:
- Checking if datasets used to train the AI included diverse and representative samples of cultural expression, artifacts, languages and practices. This supports the AI system being able to recognise and appropriately respond to a greater range of cultural contexts in a less biased manner.
- Describing any mechanisms in place for engaging with Indigenous individuals, communities or group representatives and collecting and incorporating their feedback on the AI system’s performance, especially regarding cultural aspects.
- Describing processes to review documentation and protocols that ensure the project has incorporated the GID principles. Look for evidence of meaningful engagement with and input from suitably qualified and experienced Indigenous individuals, communities and groups. Assess if the system includes features or options that allow Indigenous stakeholders to control how their data is used and represented and describe how benefits of the project to First Nations Peoples, to which the data relate, have been considered.
Also consider the use of Indigenous data in the context of the United Nations Declaration on the Rights of Indigenous Peoples and apply the concept of ‘free, prior and informed consent’ in relation to the use of Indigenous data in AI systems.
5.3 Suitability of procured AI model
If you are procuring an AI model (or system) from a third‑party provider, your procurement process should consider whether the provider has appropriate data management (including data quality and data provenance), governance, data sourcing, privacy, security, intellectual property, and cybersecurity practices in relation to the model. This will help you to identify whether the AI model is fit for the context and purpose of your AI use case.
The data used to train an AI model shapes its outputs and may not be relevant to your use case or the Australian context. Consider whether the model is likely to make accurate or reliable predictions about Australian subject matter if it has been trained on, for example, US‑centric data.
In addition, there are a number of other considerations you should take into account when selecting a procured AI model. The following considerations may be relevant to your use case.
- Does the AI model meet the functional requirements needed for your use case?
- How was the model evaluated? What test data and benchmarks were used?
- How is versioning for the AI model handled?
- What support does the provider offer to users/procurers?
- What provisions apply regarding potential liability issues? If the product fails, is accountability clear between your agency and the provider?
- What security precautions have been taken? What residual risks remain and how are these being mitigated?
- Are there any guarantees that data handling and management (for the entire lifecycle of the data) for the procured model meet internal agency and legislative requirements? What guarantees are there regarding the robustness of the model?
- What measures have been taken to prevent or reduce hallucinations, unwanted bias and model drift?
- Is the explainability and interpretability of the model sufficient for your use case?
- What computing and storage capacities are necessary for operating the model on‑premises?
- What capability is needed to maintain the AI model? Can this be done in‑house, or will this need to be sourced externally?
- If you are considering using a platform as a service (PaaS) to run and support your AI system or AI model, have you considered risks associated with outsourcing?
Consider also how your agency will support transparency across the AI supply chain, for example, by notifying the developer of issues encountered in using the model or system.
5.4 Testing
Testing is a key element for assuring the responsible and safe use of AI models – for both models developed in-house and externally procured – and in turn, of AI systems. Rigorous testing helps validate that the system performs as intended across diverse scenarios. Thorough and effective testing helps identify problems before deployment.
Testing AI systems against test datasets can reveal biases or possible unintended consequences or issues before real-world deployment. Testing on data that is limited or skewed can fail to reveal shortcomings.
Consider establishing clear and measurable acceptance criteria for the AI system that, if met, would be expected to control harms that are relevant in the context of your AI use case. Acceptance criteria should be specific, objective and verifiable. They are meant to specify the conditions under which a potential harm is adequately controlled.
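As a hedged illustration, acceptance criteria of this kind can be expressed as automated checks that run against a held-out test set. The sketch below uses pytest-style tests with hypothetical thresholds; `model`, `X_test`, `y_test` and `group_labels` are assumed to be supplied (for example, as test fixtures), and the metric names and values shown are placeholders, not recommended targets.

```python
# Hypothetical acceptance criteria expressed as automated (pytest-style) checks.
# The metric names and thresholds are placeholders, not recommended targets.
from sklearn.metrics import accuracy_score, recall_score

ACCEPTANCE_CRITERIA = {
    "min_accuracy": 0.95,          # overall accuracy on the held-out test set
    "min_recall_per_group": 0.90,  # recall for each cohort in the test data
}

def test_overall_accuracy(model, X_test, y_test):
    predictions = model.predict(X_test)
    assert accuracy_score(y_test, predictions) >= ACCEPTANCE_CRITERIA["min_accuracy"]

def test_recall_for_each_group(model, X_test, y_test, group_labels):
    predictions = model.predict(X_test)
    for group in set(group_labels):
        mask = [label == group for label in group_labels]
        recall = recall_score(
            [y for y, keep in zip(y_test, mask) if keep],
            [p for p, keep in zip(predictions, mask) if keep],
        )
        assert recall >= ACCEPTANCE_CRITERIA["min_recall_per_group"], group
```

Documenting the results of checks like these in the test report supports accountability and makes it clear whether each acceptance criterion was met before deployment.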
Consider developing a test plan for the acceptance criteria to outline the proposed testing methods, tools and metrics. Documenting results through a test report will assist with demonstrating accountability and transparency. A test report could include the following:
- a summary of the testing objectives, methods and metrics used
- results for each test case
- an analysis of the root causes of any identified issues or failures
- recommendations for remediation or improvement, and whether the improvements should be done before deployment or as a future release.
In your explanation, outline any areas of concern in results from testing. If you have not started testing, outline elements to be considered in testing plans.
Model accuracy
As an example, model accuracy is a key metric for evaluating the performance of an AI system. Accuracy should be considered in the specific context of the AI use case, as the consequences of errors or inaccuracies can vary significantly depending on the domain and application.
Some of the factors that can influence AI model output accuracy and reliability include:
- choice of AI model or model architecture
- quality, accuracy and representativeness of training data
- presence of bias in the training data or AI model
- robustness to noise, outliers and edge cases
- ability of the AI model to generalise to new data
- potential for errors or ‘hallucinations’ in outputs
- environmental factors (such as lighting conditions for computer vision systems)
- adversarial attacks (such as malicious actors manipulating input data to affect outputs)
- stability and consistency of performance over time.
Ways to assess and validate the accuracy of your model for your AI use case include the following (a brief illustrative sketch is provided after this list):
- quantitative metrics
- qualitative analysis (for example, manual review of output, error analysis, user feedback)
- domain-specific benchmarks or performance standards
- comparison to human performance or alternative models.
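For instance, quantitative metrics can be computed on a held-out test set. The sketch below is illustrative only and assumes a fitted scikit-learn style classifier `clf` and test data `X_test` and `y_test`; the appropriate metrics depend on your use case and risk context.

```python
# Illustrative quantitative evaluation for a classification use case, assuming a
# fitted scikit-learn style classifier `clf` and a held-out test set.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate(clf, X_test, y_test):
    predictions = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print("Confusion matrix:\n", confusion_matrix(y_test, predictions))
    # Per-class precision, recall and F1 support qualitative error analysis.
    print(classification_report(y_test, predictions))
```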
It is important to set accuracy targets that are appropriate for the risk and context of the use case. For high-stakes decisions, you should aim for a very high level of accuracy and have clear processes for handling uncertain or borderline cases.
5.5 Pilot
Conducting a pilot study is a valuable way to assess the real-world performance and impact of your AI use before full deployment. A well-designed pilot can surface issues related to reliability, safety, fairness and usability that may not be apparent in a controlled development environment.
If you are planning a pilot, your explanation should provide a brief overview of the pilot's:
- scope and duration
- objectives and key results (OKRs)
- key performance indicators (KPIs)
- participant selection and consent process
- risk mitigation strategies.
If you have already completed a pilot, reflect on the key findings and lessons learned. How did the pilot outcomes compare to your expectations? What issues or surprises emerged? How did you adapt your AI use case based on the pilot results?
If you are not planning to conduct a pilot, explain why not. Consider whether the scale, risk or novelty of your use case warrants a pilot phase. Discuss alternative approaches you are taking to validate the performance of your AI use case and gather user feedback prior to full deployment.
5.6 Monitoring
Monitoring is key to maintaining the reliability and safety of AI systems over time. It enables active rather than passive oversight and governance.
Your monitoring plan should be tailored to the specific risks and requirements of your use case. In your explanation, describe your approach to monitoring any measurable acceptance criteria (as discussed above at 5.4) as well as other relevant metrics such as performance metrics or anomaly detection. In your plan, you should include your proposed monitoring intervals for your use case. Consider including procedures for reporting and learning from incidents. You may wish to refer to the OECD paper on Defining AI incidents and related terms.
Periodically evaluate your monitoring and evaluation mechanisms to ensure they remain effective and aligned with evolving conditions throughout the lifecycle of your AI use case. Examples of events that could influence your monitoring plan are system upgrades, error reports, changes in input data, performance deviation or feedback from stakeholders.
Monitoring can help identify issues that can impact the safety and reliability of your AI system, such as concept or data drift.
- Concept drift refers to a change in the relationship between the input data and the feature being predicted.
- Data drift refers to a change in input data patterns compared to the data used to train the model (a simple data drift check is sketched below).
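As a minimal illustration of the data drift check referred to above, the sketch below compares the distribution of numeric features in recent production inputs against the training data using a two-sample Kolmogorov–Smirnov test. The monitored features and significance threshold are hypothetical and should be chosen for your use case and monitoring intervals.

```python
# Minimal data drift check: compare recent production inputs against training
# data feature by feature using a two-sample Kolmogorov-Smirnov test.
# The monitored features and significance threshold are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # below this, flag the feature as having drifted

def check_data_drift(train: pd.DataFrame, recent: pd.DataFrame, features: list) -> list:
    drifted = []
    for feature in features:
        statistic, p_value = ks_2samp(train[feature].dropna(), recent[feature].dropna())
        if p_value < DRIFT_P_VALUE:
            drifted.append({"feature": feature, "statistic": float(statistic)})
    return drifted  # feed results into monitoring alerts and incident reporting
```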
Vendors offer monitoring tools that may be worth considering for your use case. For more information, see pp. 26-27 of the NAIC’s Implementing Australia’s AI Ethics Principles report.
5.7 Preparedness to intervene or disengage
Relevant stakeholders, including those who operate, use or interact with the AI system, those who monitor AI system performance, and affected stakeholders identified at section 2.4, should have the ability to raise concerns about insights or decisions informed by the AI system.
Agencies should develop clear escalation processes for raising concerns, such as designated points of contact, guidelines and criteria for when human intervention is necessary and timelines for response and resolution. Agencies should also consider documenting and reviewing any interventions that occur to ensure consistency and fairness.
In addition, agencies should be prepared to quickly and safely disengage an AI system when an unresolvable issue is identified. This could include a data breach, unauthorised access or system compromise. Consider such scenarios in business continuity, data breach and security response plans.
Agencies should consider the techniques below to avoid overreliance on AI system outputs.
System design stage
Build in transparency about system limitations
Incorporate prompts to remind users to critically analyse outputs, such as explanations of outputs, hallucination reminders, and accuracy scores.
Build in 2-way feedback pathways
Prompt users to assess the quality of the AI system’s outputs and provide feedback.
Similarly, provide feedback to users on their interactions with the systems (e.g. feedback on ineffective prompts, alerts when the user has accepted a risky decision).
Prompt human decisions
Consider designing your AI system to provide options for the user to choose from, rather than a single solution, to encourage user engagement with AI outputs.
Evaluation stage
Ensure regular evaluation
Involve users in regular evaluations of your AI system. Encourage users to assess the effectiveness of the AI system and identify areas for improvement.
-
6. Privacy protection and security
6.1 Minimise and protect personal information
Data minimisation
Data minimisation is an important consideration when developing and deploying AI systems for several reasons, including protecting privacy and improving data quality and model stability. In some cases, more data may be warranted (for example, for some large language models), but it is important that you follow good practice in determining the data needed for your use case.
Privacy requirements for personal information under the Australian Privacy Principles (APPs) are an important consideration in responding to this question. Ensure you have considered your obligations under the APPs, particularly APPs 3, 6 and 11.
For more information, you should consult the APP guidelines, your agency’s internal privacy policy and resources and privacy officer.
Privacy enhancing technologies
Your agency may want or need to use privacy enhancing technologies to assist in de‑identifying personal information under the APPs or as a risk mitigation/trust building approach. Under the Privacy Act 1988 (Cth) and the APPs, where information has been appropriately de‑identified it is no longer personal information and can be used in ways that the Privacy Act would normally restrict.
The Office of the Australian Information Commissioner’s (OAIC) website provides detailed guidance on De-identification and the Privacy Act that agencies should consider. You may also wish to refer to the De-identification Decision-Making Framework, jointly developed by the OAIC and CSIRO Data61.
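As a minimal sketch only, and not a substitute for the OAIC guidance or the De-identification Decision-Making Framework, the example below shows one common privacy-enhancing step: replacing a direct identifier with a salted hash and dropping other direct identifiers. The column names and salt handling are hypothetical, and re-identification risk would still need to be assessed across the whole dataset.

```python
import hashlib
import pandas as pd

SALT = "retrieve-from-a-secret-store-not-source-code"  # hypothetical secret salt

def pseudonymise(df: pd.DataFrame) -> pd.DataFrame:
    """Replace a direct identifier with a salted hash and drop other identifiers."""
    out = df.copy()
    out["person_id"] = out["person_id"].astype(str).map(
        lambda value: hashlib.sha256((SALT + value).encode()).hexdigest()
    )
    # Drop other direct identifiers (hypothetical column names).
    return out.drop(columns=["name", "email"], errors="ignore")
```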
6.2 Privacy assessment
The Australian Government Agencies Privacy Code (the Privacy Code) requires Australian Government agencies subject to the Privacy Act 1988 to conduct a privacy impact assessment (PIA) for all ‘high privacy risk projects’. A project may be a high privacy risk if the agency reasonably considers that the project involves new or changed ways of handling personal information that are likely to have a significant impact on the privacy of individuals.
A Privacy Threshold Assessment (PTA) is a preliminary assessment to help you determine your project’s potential privacy impacts and give you a sense of the risk level, including whether it could be a ‘high privacy risk project’ requiring a PIA under the Code.
This assurance framework does not determine the timing for conducting a PIA or PTA – it may be appropriate that you conduct a PIA or PTA earlier than your assessment of the AI use case under this framework.
If no PIA or PTA has been undertaken, explain why and what consideration there has been of potential privacy impacts.
Privacy assessments should consider whether relevant individuals have provided informed consent, where required, to the collection, sharing and use of their personal information in the AI system’s training or operation, or in outputs used to make inferences. Also consider how any consent obtained has been recorded, including a description of the processes used to obtain it.
For more information, you should consult the guidance on the Office of the Australian Information Commissioner’s website. You can also consult your agency’s privacy officer and internal privacy policy and resources.
If your AI system has used or will use Indigenous data, you should also consider whether notions of ‘collective’ or ‘group’ privacy of First Nations people are relevant and refer to the guidelines in the Framework for Governance of Indigenous Data (see 5.2).
6.3 Authority to operate
The Protective Security Policy Framework (PSPF) applies to non‑corporate Commonwealth entities subject to the Public Governance, Performance and Accountability Act 2013 (PGPA Act).
Refer to the relevant sections of the PSPF on safeguarding information and communication technology (ICT) systems to support the secure and continuous delivery of government business.
Under the PSPF, entities must effectively implement the Australian Government Information Security Manual (ISM) security principles and must only use ICT systems that the determining authority (or their delegate) has authorised to operate based on the acceptance of the residual security risks associated with its operation.
In addition, the Australian Signals Directorate’s Engaging with Artificial Intelligence guidance outlines mitigation considerations for organisations to consider. It is highly recommended that your agency engages with and implements the mitigation considerations in the guidance.
AI systems that have already been authorised or fall within existing authorisations by your agency’s IT Security Adviser (ITSA) do not have to be re‑authorised.
It is recommended you engage with your agency’s ITSA early to ensure all PSPF and ISM requirements are fulfilled.
-
7. Transparency and explainability
7.1 Consultation
You should consult with a diverse range of internal and external stakeholders at every stage of your AI system’s deployment to help identify potential biases, privacy concerns, and other ethical and legal issues present in your AI use case. This process can also help foster transparency, accountability, and trust with your stakeholders and can help improve their understanding of the technology’s benefits and limitations. Refer to the stakeholders you identified in section 2.4.
If your project has the potential to significantly impact Aboriginal and Torres Strait Islander peoples or communities, it is critical that you meaningfully consult with relevant community representatives.
Consultation resources
APS Framework for Engagement and Participation – sets principles and standards that underpin effective APS engagement with citizens, community and business and includes practical guidance on engagement methods.
Office of Impact Analysis Best Practice Consultation guidance note – provides a detailed explanation of the application of the whole-of-government consultation principles outlined in the Australian Government Guide to Policy Impact Analysis.
AIATSIS Principles for engagement in projects concerning Aboriginal and Torres Strait Islander peoples – provides non-Indigenous policy makers and service designers with the foundational principles for meaningfully engaging with Aboriginal and Torres Strait Islander peoples on projects that impact their communities.
7.2 Public visibility
Where appropriate, you should make the scope and goals of your AI use case publicly available. You should consider publishing relevant, accessible information about your AI use case in a centralised location on your agency website. This information could include:
- use case purpose
- overview of model and application
- benefits
- risks and mitigations
- training data sources
- compliance with the Policy for the responsible use of AI in government
- contact officer information.
Note: All agencies in scope of the Policy for the responsible use of AI in government are required to publish an AI transparency statement. More information on this requirement can be found in the policy and associated guidance. You may wish to include information about your use case in your agency’s AI transparency statement.
Considerations for publishing
In some circumstances it may not be appropriate to publish detailed information about your AI use case. When deciding whether to publish this information you should balance the public benefits of AI transparency with the potential risks as well as compatibility with any legal requirements around publication.
For example, you may choose to limit the amount of information you publish or not publish any information at all if:
- the AI use case is still in the experimentation phase
- publishing may have negative implications for national security
- publishing may have negative implications for criminal intelligence activities
- publishing may significantly increase the risk of fraud or non-compliance
- publishing may significantly increase the risk of cybersecurity threats
- publishing may jeopardise commercial competitiveness.
You may also wish to refer to the exemptions under the Freedom of Information Act 1982 in considering whether it is appropriate to publish information about your AI use case.
7.3 Maintain appropriate documentation and records
Agencies should comply with legislation, policies and standards for maintaining reliable and auditable records of decisions, testing, and the information and data assets used in an AI system. This will enable internal and external scrutiny, continuity of knowledge and accountability. This will also support transparency across the AI supply chain – for example, this documentation may be useful to any downstream users of AI models or systems developed by your agency.
Agencies should document AI technologies they are using to perform government functions as well as essential information about AI models, their versions, creators and owners. In addition, artifacts used and produced by AI – such as prompts, inputs and raw outputs – may constitute Commonwealth records under the Archives Act 1983 and may need to be kept for certain periods of time identified in records authorities issued by the National Archives of Australia (NAA).
To identify their legal obligations, business areas implementing AI in agencies may want to consult with their information and records management teams. The NAA can also provide advice on how to manage data and records produced by different AI use cases.
The NAA Information Management Standard for Australian Government outlines principles and expectations for the creation and management of government business information. Further guidance relating to AI records is available on the NAA website under Information Management for Current, Emerging and Critical Technologies.
AI documentation types
Where suitable, you should consider creating the following forms of documentation for any AI system you build. If you are procuring an AI system from an external provider, it may be appropriate to request these documents as part of your tender process.
System factsheet/model card
A system factsheet (sometimes called a model card) is a short document designed to provide an overview of an AI system to non-technical audiences (such as users, members of the public, procurers, and auditors). These factsheets usually include information about the AI system’s purpose, intended use, limitations, training data, and performance against key metrics.
Examples of system factsheets include Google Cloud Model Cards and IBM AI factsheets.
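As an illustration only, a system factsheet can be maintained as structured, machine-readable data alongside the model so it stays current and easy to publish or audit. The fields and values below are a hypothetical minimal set, not a prescribed template.

```python
import json

# Hypothetical minimal system factsheet captured as structured, machine-readable data.
factsheet = {
    "system_name": "Example correspondence triage assistant",
    "version": "1.2.0",
    "purpose": "Rank incoming correspondence for manual review",
    "intended_use": "Decision support only; a human officer makes the final decision",
    "limitations": ["Not validated on non-English correspondence"],
    "training_data": "Reference to the relevant datasheet",
    "key_metrics": {"accuracy": 0.93, "recall_high_priority": 0.97},
    "owner": "Business area responsible for the use case",
    "last_reviewed": "2025-01-01",
}

with open("system_factsheet.json", "w") as f:
    json.dump(factsheet, f, indent=2)
```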
Datasheets
Datasheets are documents completed by dataset creators to provide an overview of the data used to train and evaluate an AI system. Datasheets provide key information about the dataset including its contents, data owners, composition, intended uses, sensitivities, provenance, labelling and representativeness.
Examples of datasheets include Google’s AI data cards and Microsoft’s Aether Data Documentation template.
System decision registries
System decision registries record key decisions made during the development and deployment of an AI system. These registries contain information about what decisions were made, when they were made, who made them and why they were made (the decision rationale).
Examples of decision registries include Atlassian’s DACI decision documentation template and Microsoft’s Design Decision Log.
Documentation in relation to reliability and safety
It is also best practice to maintain documentation on testing, piloting and monitoring and evaluation of your AI system and use case, in line with the practices outlined in section 5.
See Implementing Australia’s AI Ethics Principles for more on AI documentation.
7.4 Disclosing AI interactions and outputs
You should design your use case to inform people (including members of the public, APS staff and decision-makers) that they are interacting with an AI system or are being exposed to content that has been generated by AI.
When to disclose use of AI
You should ensure that you disclose when a user is directly interacting with an AI system, especially:
- when AI plays a significant role in critical decision-making processes
- when AI has potential to influence opinions, beliefs or perceptions
- where there is a legal requirement regarding AI disclosure
- where AI is used to generate recommendations for content, products or services.
You should ensure that you disclose when someone is being exposed to AI-generated content and:
- any of the content has not been through a contextually appropriate degree of fact checking and editorial review by a human with the appropriate skills, knowledge or experience in the relevant subject matter
- the content purports to portray real people, places or events or could be misinterpreted that way
- the intended audience for the content would reasonably expect disclosure.
Exercise judgment and consider the level of disclosure that the intended audience would expect, including where AI-generated content has been through rigorous fact-checking and editorial review. Err on the side of greater disclosure – norms around appropriate disclosure will continue to develop as AI-generated content becomes more ubiquitous.
Mechanisms for disclosure of AI interactions:
When designing or procuring an AI system, you should consider the most appropriate mechanism(s) for disclosing AI interactions. Some examples are outlined below:
Verbal or written disclosures
Verbal or written disclosures are statements that are heard by or shown to users to inform them that they are interacting with (or will be interacting with) an AI system.
For example, disclaimers, warnings, specific clauses in privacy policy and/or terms of use, content labels, visible watermarks, by-lines, physical signage, communication campaigns.
Behavioural disclosures
Behavioural disclosure refers to the use of stylistic indicators that help users to identify that they are engaging with AI-generated content. These indicators should generally be used in combination with other forms of disclosure.
For example, using clearly synthetic voices, formal and structured language, or robotic avatars.
Technical disclosures
Technical disclosures are machine-readable identifiers for AI‑generated content.
For example, inclusion in metadata, technical watermarks, cryptographic signatures.
Agencies should consider AI systems that support industry-standard provenance technologies, such as those aligned with the standard developed by the Coalition for Content Provenance and Authenticity (C2PA).
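The sketch below is a simplified illustration of a technical disclosure: attaching machine-readable provenance metadata and a cryptographic signature to AI-generated text so downstream systems can verify its origin. It is not a C2PA implementation; production systems should prefer industry-standard provenance tooling, and the key handling and field names here are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

SIGNING_KEY = b"retrieve-from-a-key-management-service"  # hypothetical signing key

def label_ai_output(text: str, model_id: str) -> dict:
    """Attach machine-readable provenance metadata and a signature to AI output."""
    record = {
        "content": text,
        "generator": model_id,
        "ai_generated": True,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return record

def verify(record: dict) -> bool:
    """Recompute the signature to confirm the provenance record is intact."""
    claimed = record.pop("signature")
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = claimed  # restore the record
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(claimed, expected)
```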
7.5 Offer appropriate explanations
Explainability refers to accurately and effectively conveying an AI system’s decision process to a stakeholder, even if they don’t fully understand the specifics of how the model works. Explainability facilitates transparency, independent expert scrutiny and access to justice.
You should be able to clearly explain how a government decision or outcome has been made or informed by AI to a range of technical and non-technical audiences. You should also be aware of any requirements in legislation to provide reasons for decisions, both generally and in relation to the particular class of decisions that you are seeking to make using AI.
Explanations may apply globally (how a model broadly works) or locally (why the model has come to a specific decision). You should determine which is more appropriate for your audience.
Principles for providing effective explanations
Contrastive
Outline why the AI system output one outcome instead of another outcome.
Selective
Focus on the most relevant factors contributing to the AI system’s decision process.
Consistent with the audience’s understanding
Align with the audience’s level of technical (or non-technical) background.
Generalisation to similar cases
Generalise to similar cases to help the audience predict what the AI system will do.
You may wish to refer to Interpretable Machine Learning: A Guide for Making Black Box Models Explainable for further advice and examples.
Tools for explaining non-interpretable models
While explanations for interpretable models (i.e. low complexity with clear parameters) are relatively straightforward, in practice most AI systems have low interpretability and require effective post-hoc explanations that strike a balance between accuracy and simplicity. Agencies should also consider, among other matters, appropriate timeframes for providing explanations in the context of their use case.
Below are some tools and approaches that can assist with developing explanations. Note that explainable AI algorithms are not the only way to improve system explainability; designing effective explanation interfaces, for example, can also help.
Local explanations
- Feature-importance analysis (e.g. random forest feature permutation analysis, saliency maps, feature reconstructions, individual conditional expectation (ICE) plots)
- Partial dependence plots (PDPs)
- Shapley values
Global explanations
- Example-based methods (contrastive and counterfactual examples, data explorers/visualisation)
- Model-agnostic methods
- Feature-importance methods
- Methods specifically for neural-network interpretation
- Methods specifically for deep learning in cloud environments
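As a hedged illustration of the feature-importance and partial dependence approaches listed above, the sketch below uses scikit-learn and assumes a fitted estimator `clf`, a pandas feature matrix `X` and labels `y`; it is a starting point for analysis, not a complete explanation method.

```python
# Illustrative post-hoc explanation aids, assuming a fitted scikit-learn
# estimator `clf`, a pandas DataFrame of features `X` and labels `y`.
from sklearn.inspection import partial_dependence, permutation_importance

def explain(clf, X, y):
    # Global view: how much does performance degrade when each feature is shuffled?
    result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
    ranked = sorted(zip(X.columns, result.importances_mean), key=lambda item: -item[1])
    print("Top features by permutation importance:", ranked[:5])

    # Average predicted outcome across values of the most important feature.
    top_feature = ranked[0][0]
    pd_result = partial_dependence(clf, X, features=[top_feature])
    print("Partial dependence (average predictions):", pd_result["average"][0])
```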
Advice on appropriate explanations is available in the NAIC’s Implementing Australia’s AI Ethics Principles report.
-
8. Contestability
8.1 Notification of AI affecting rights
You should notify individuals, groups, communities or businesses when an administrative action materially influenced by an AI system has a legal or similarly significant effect on them. This notification should state that the action was materially influenced by an AI system and include information on available review rights and whether and how the individual can challenge the action.
An action producing a legal effect is when an individual, group, community or business’s legal status or rights are affected, and includes:
- provision of benefits granted by legislation
- contractual rights.
An action producing a similarly significant effect is when an individual, group, community or business’s circumstances, behaviours or choices are affected, and includes:
- denial of consequential services or support, such as housing, insurance, education enrolment, criminal justice, employment opportunities and health care services
- provision of basic necessities, such as food and water.
A decision may be considered to have been materially influenced by an AI system if:
- the decision was automated by an AI system, with little to no human oversight
- a component of the decision was automated by an AI system, with little to no human oversight (for example, a computer makes the first 2 limbs of a decision, with the final limb made by a human)
- the AI system is likely to influence decisions that are made (for example, the output of the AI system recommended a decision to a human for consideration or provided substantive analysis to inform a decision).
‘Administrative action’ is any of the following:
- making, refusing or failing to make a decision
- exercising, refusing or failing to exercise a power
- performing, refusing or failing to perform a function or duty.
Note: this guidance is designed to supplement, not replace, existing administrative law requirements pertaining to notification of administrative decisions. The Attorney-General’s Department is leading work to develop a consistent legislative framework for automated decision making (ADM), as part of the government’s response to recommendation 17.1 of the Robodebt Royal Commission Report. The Australian Government AI assurance framework will continue to evolve to ensure alignment as this work progresses.
8.2 Challenging administrative actions influenced by AI
Individuals, groups, communities or businesses subject to an administrative action materially influenced by an AI system that has a legal or similarly significant effect on them should be provided with an opportunity to challenge this action. This is an important administrative law principle. See guidance on section 8.1 above for assistance interpreting terminology.
Administrative actions may be subject to both merits review and judicial review. Merits review considers whether a decision made was the correct or preferable one in the circumstances, and includes internal review conducted by the agency and external review processes. Judicial review examines whether a decision was legally correct.
You should ensure that review rights that ordinarily apply to human-made decisions or actions are not impacted or limited because an AI system has been used.
Notifications discussed at section 8.1 should include information about available review mechanisms so that people can make informed decisions about disputing administrative actions.
You will need to ensure a person within your agency is able to answer questions in a court or tribunal about an administrative action taken by an AI system if that matter is ultimately challenged. Review mechanisms also impact on the obligation to provide reasons. For example, the Administrative Decisions (Judicial Review) Act 1977 gives applicants a right to reasons for administrative decisions.
-
9. Accountability
9.1 Establishing responsibilities
Establishing clear roles and responsibilities is essential for ensuring accountability in the development and use of AI systems. In this section, you are asked to identify the individuals responsible for 3 key aspects of your AI system:
Use of AI insights and decisions
The person responsible for the application of the AI system’s outputs, including making decisions or taking actions based on those outputs.
Monitoring the performance of the AI system
The person responsible for overseeing the ongoing performance and safety of the AI system, including monitoring for errors, biases or unintended consequences.
Data governance
The person responsible for the governance of the data used for operating, training or validating the AI system.
Where feasible, it is recommended that these 3 roles not all be held by the same person. The responsible officers should be appropriately senior, skilled and qualified for their respective roles.
9.2 Training of AI system operators
AI system operators play a crucial role in ensuring the responsible and effective use of AI. They must have the necessary skills, knowledge and judgment to understand the system’s capabilities and limitations, how to appropriately use the system, interpret its outputs and make informed decisions based on those outputs.
In your answer, describe the process for ensuring AI system operators are adequately trained and skilled. This may include:
Initial training
What training do operators receive before being allowed to use the AI system? Does this training cover technical aspects of the system, as well as ethical and legal considerations?
Ongoing training
Is there a process for continuous learning and skill development? How are operators kept up to date with changes or updates to the AI system?
Evaluation
Are operators’ skills and knowledge assessed? Are there any certification or qualification requirements?
Support
What resources and support are available to operators if they have questions or encounter issues?
Consider whether this needs to be tailored to the specific needs and risks of your AI system or proposed use case or whether general AI training requirements are sufficient.
-
10. Human-centred values
10.1 Incorporating diversity
Diversity of perspective promotes inclusivity, mitigates biases, supports critical thinking and should be incorporated in all AI system lifecycle stages.
AI systems require input from stakeholders from a variety of backgrounds, including different ethnicities, genders, ages, abilities and socio-economic statuses. This also includes people with diverse professional backgrounds, such as ethicists, social scientists and domain experts relevant to the AI application. Determining which stakeholders and user groups to consult, which data to use, and the optimal team composition will depend on your AI system.
The following examples demonstrate the often-unintended negative consequences of AI systems that failed to adequately incorporate diversity into relevant lifecycle stages.
- AI systems that were ineffective at predicting recidivism outcomes for defendants of colour and underestimated the health needs of patients from marginalised racial and ethnic backgrounds.
- AI job recruitment systems that unfairly affected employment outcomes.
- Algorithms used to prioritise patients for high-risk care management programs that were less likely to refer black patients than white patients with the same level of health.
- An AI system designed to detect cancers that showed bias towards lighter skin tones, stemming from an oversight in collecting a more diverse set of skin tone images, potentially delaying life-saving treatments.
Resources, including approaches, templates and methods to ensure sufficient diversity and inclusion of your AI system, are described in the NAIC’s Implementing Australia’s AI Ethics Principles report.
10.2 Human rights obligations
You should consult an appropriate source of legal advice or otherwise ensure that your AI use case and use of data align with human rights obligations. If you have not done so, explain your reasoning.
It is recommended that you complete this question after you have completed the previous sections of the assessment. This will provide more complete information to enable an assessment of the human rights implications of your AI use case.
In Australia, it is unlawful to discriminate on the basis of a number of protected attributes including age, disability, race, sex, intersex status, gender identity and sexual orientation in certain areas of public life, including education and employment. Australia's federal anti‑discrimination laws are contained in the following legislation:
- Age Discrimination Act 2004
- Disability Discrimination Act 1992
- Racial Discrimination Act 1975
- Sex Discrimination Act 1984.
Human rights are defined in the Human Rights (Parliamentary Scrutiny) Act 2011 as the rights and freedoms contained in the 7 core international human rights treaties to which Australia is a party, namely the:
- International Covenant on Civil and Political Rights (ICCPR).
- International Covenant on Economic, Social and Cultural Rights (ICESCR).
- International Convention on the Elimination of All Forms of Racial Discrimination (CERD).
- Convention on the Elimination of All Forms of Discrimination against Women (CEDAW).
- Convention against Torture and Other Cruel, Inhuman or Degrading Treatment or Punishment (CAT).
- Convention on the Rights of the Child (CRC).
- Convention on the Rights of Persons with Disabilities (CRPD).
-
11. Internal review and next steps
11.1 Legal review of AI use case
If the threshold assessment in section 3 results in a risk rating of ‘medium’ or ‘high’, your AI use case must undergo legal review to ensure that the use case and associated use of data meet legal requirements.
The nature of the legal review is context dependent. Without limiting the scope of legal review, examples of potentially applicable legislation, policies and frameworks are outlined at Attachment A of the Policy for the responsible use of AI in government.
If there are significant changes to the AI use case (including changes introduced due to recommendations from internal or external review), then the advice should be revisited to ensure the AI use case and associated use of data continues to meet legal requirements.
11.2 Risk summary table
To complete the risk summary table, list any:
- risks assessed in section 3 (the threshold assessment) as ‘medium’ or ‘high’
- instances where you have answered ‘no’ to questions in sections 4 to 10. You are encouraged to identify risk treatments in relation to these; however, you do not need to assign a residual risk rating to those risks
- additional risks that have been identified throughout the assessment process
- risk treatments identified during internal review (section 11.3) and, if applicable, external review (section 11.4) – using the risk matrix in section 3 to assess residual risk.
11.3 Internal review of AI use case
This requires an internal agency governance body designated by your agency’s Accountable Authority to review the assessment and the risks outlined in the risk summary table.
The governance body may decide to accept any ‘medium’ risks, to recommend risk treatments, or decide not to accept the risk and recommend not proceeding with the AI use case. You should list the recommendations of your agency governance body in the text box provided.
11.4 External review of AI use case
If, following internal review (section 11.3), there are any residual risks with a ‘high’ risk rating, your agency should consider whether the AI use case and this assessment would benefit from external review. This external review may recommend further risk treatments or adjustments to the use case.
In line with the APS Strategic Commissioning Framework, consider whether someone in the APS could conduct this review or whether the nature of the use case and identified risks warrant independent outside review and expertise.
Your agency must consider recommendations of an external review, decide which to implement, and whether to accept any residual risk and proceed with the use case. If applicable, you should list any recommendations arising from external review in the text box provided and record the agency's response to these recommendations.
-
Attachment
Risk consequence rating advice
Negatively affecting public accessibility or inclusivity of government services
Insignificant
- Insignificant compromises to accessibility or inclusivity of services.
- Minor technical issues causing brief inconvenience but no actual barriers to access or inclusion.
- Issues rapidly resolved with minimal impact on user experience.
Minor
- Limited, reversible compromises to accessibility or inclusivity of services.
- Some people experience difficulties accessing services due to technical issues or design oversights.
- Barriers are short-term and addressed once identified, with additional support provided to people affected.
Moderate
- Many compromises are made to the accessibility or inclusivity of services.
- Considerable access challenges for a modest number of users.
- Resolving access issues requires substantial effort and resources.
- Certain groups may be disproportionately impacted.
- Affected users experience frustration and delays in receiving services.
Major
- Extensive compromises are made to the accessibility or inclusivity of services, which may include some essential services.
- Ongoing delays that require external technical assistance to resolve.
- Widespread inconvenience, frustration, public distress and potential legal implications.
- Vulnerable user groups disproportionately impacted.
Severe
- Widespread irreversible ongoing compromises are made to the accessibility or inclusivity of services, including some essential services.
- Majority of users, especially vulnerable groups affected.
- Essential services inaccessible for extended periods, causing significant public distress, legal implications, and a loss of trust in government efficiency.
- Comprehensive and immediate actions are urgently needed to rectify the situation.
Unfair discrimination against individuals, communities or groups
Insignificant
- Negligible instances of discrimination, with virtually no discernible effect on individuals, communities, or groups.
- Issues are proactively identified and rapidly addressed before causing harm.
Minor
- Limited instances of unfair discrimination occur, affecting a small number of individuals.
- Relatively isolated cases, and corrective measures minimise their impact.
Moderate
- Moderate levels of discrimination leading to noticeable harm to certain individuals, communities, or groups.
- These incidents raise bias and fairness concerns and require targeted interventions.
Major
- Significant discrimination results in major, tangible harm to individuals and multiple communities or groups.
- Rebuilding trust requires substantial reforms and remediation efforts.
Severe
- Pervasive and systemic discrimination causes severe harm across a broad spectrum of the population, particularly marginalised and vulnerable groups.
- Public outrage, potential legal action, and a profound loss of trust in government.
- Immediate, sweeping reforms and accountability measures are required.
Perpetuating stereotyping or demeaning representations of individuals, communities or groups
Insignificant
- Inadvertently reinforce mild stereotypes, but these instances are quickly identified and rectified with no lasting harm or public concern.
Minor
- Isolated cases of stereotyping, affecting a limited number of community members, with some noticing and raising concerns.
- Prompt action mitigates the issue, preventing broader impact.
Moderate
- Moderate stereotyping by AI systems leads to noticeable public discomfort and criticism.
- Disproportionately affecting certain communities or groups.
- Requires targeted corrective measures to address and prevent recurrence.
Major
- Significant and widespread reinforcement of harmful stereotypes and demeaning representations.
- Causes public outcry and damages the relationship between communities and government entities.
- Urgent, comprehensive strategies are needed to rectify these representations and restore trust.
Severe
- Pervasive and damaging stereotyping severely harms multiple communities, leading to widespread distress.
- Potential legal consequences, and a profound breach of trust in government use of technology.
- Requires immediate, sweeping actions to address the harm, including system overhauls and public apologies.
Harm to individuals, communities, groups, businesses or the environment
Insignificant
- Inconsequential glitches with no real harm to the public, business operations or ecosystems.
- Easily managed through routine measures.
Minor
- Isolated incidents mildly affecting the public.
- Slight inconveniences or disruptions to businesses, leading to manageable financial costs.
- Limited manageable environmental disturbances affecting local ecosystems or resource consumption.
Moderate
- Noticeable negative effects on the public.
- Businesses face operational challenges or financial losses, affecting their competitiveness.
- Obvious environmental degradation, including pollution or habitat disruption, prompting public concern.
Major
- Significant public harm causing distress and potentially lasting damage.
- Significant harm to a wide range of businesses, resulting in substantial financial losses, layoffs, and long-term reputational damage.
- Compromises ecosystem wellbeing causing substantial pollution, loss of biodiversity, and resource depletion.
Severe
- Widespread, profound harm and severe distress affecting broad segments of the public.
- Profound damage across the business sector, leading to bankruptcies, major job losses, and a lasting negative impact on the economy.
- Comprehensive environmental destruction, leading to critical loss of biodiversity, irreversible ecosystem damage, and severe resource scarcity.
Compromising privacy due to the sensitivity, amount or source of the data being used by an AI system
Insignificant
- Insignificant data handling errors occur without compromising sensitive information.
- Incidents are quickly rectified, maintaining public trust in data security.
Minor
- Isolated exposure of limited sensitive data affects a small group of individuals.
- Swift actions taken to secure the data and prevent further incidents.
Moderate
- Breach of moderate amounts of sensitive data, leading to privacy concerns among the affected populace.
- Some individuals experience inconvenience and distress.
Major
- Serious misuse of sensitive private data affects a large segment of the population, leading to widespread privacy violations and a loss of public trust.
- Comprehensive measures are urgently required to secure data and address the privacy breaches.
Severe
- Significant potential to expose sensitive information of a vast number of individuals, causing severe harm, identity-theft risks; use of sensitive personal information in a way that is likely to draw public criticism with limited ability for individuals to choose how their information is used.
- Significant potential to harm trust in government-information handling with potential for lasting consequences.
Raising security concerns due to the sensitivity or classification of the data being used by an AI system
Insignificant
- Inconsequential security lapses occur without actual misuse of sensitive data.
- Quickly identified and corrected with no real harm done.
- These types of incidents may serve as prompts for reviewing security protocols.
Minor
- A limited security breach involves unauthorised access to protected data affecting a small number of records with minimal impact.
- Immediate actions secure the breach, and affected individuals are notified and supported.
- Incident is catalyst for review of security protocols.
Moderate
- Security incident leads to the compromise of a moderate volume of sensitive data, raising concerns over data protection and privacy.
- The breach necessitates a thorough investigation and enhanced security measures.
Major
- A significant security breach results in extensive unauthorised access to sensitive or protected data, causing considerable concern and distress among the public.
- Urgent security upgrades and support measures for impacted individuals are implemented to restore security and trust.
Severe
- A massive security breach exposes a vast amount of sensitive and protected data, leading to severe implications for national security, public safety, and individual privacy.
- This incident triggers an emergency response, including legal actions, a major overhaul of security systems, and long-term support for those affected.
Raising security concerns due to implementation, sourcing or characteristics of the AI system
Insignificant
- Inconsequential security concerns arise due to characteristics of the AI system, such as software bugs, which are promptly identified and fixed with no adverse effects on overall security.
- These issues may serve as lessons, leading to slight improvements in the system's security framework.
Minor
- Certain characteristics of the AI system lead to vulnerabilities that are exploited in a limited manner, causing minor security breaches.
- Immediate remediation measures are taken, and the system is updated to prevent similar issues.
Moderate
- A moderate security risk is realised when intrinsic features of the AI system allow for unintended access or data leaks.
- Incident affects a noticeable but contained component of the AI system.
- Prompts a comprehensive security review of the AI system and the implementation of more robust safeguards.
Major
- Significant security flaws in the AI system's design result in major breaches, compromising a large amount of data and severely affecting system integrity.
- Incident leads to an urgent overhaul of security measures and protocols, alongside efforts to mitigate the damage.
Severe
- Critical security vulnerabilities inherent to the AI system lead to widespread breaches, exposing vast quantities of sensitive data and jeopardising national security or public safety.
- The incident results in severe consequences, necessitating emergency responses, extensive system redesigns, and long-term efforts to recover from the breach and prevent recurrence.
Influencing decision-making that affects individuals, communities, groups, businesses or the environment
Insignificant
- Decisions lead to negligible errors, swiftly identified and corrected with no harm to the public, business operations or the environment.
- Incidents may serve as a learning opportunity for system improvement.
Minor
- Decisions result in minor inconveniences or errors affecting the public, business operations or finances or slight environmental impacts.
- All impacts reversible with prompt action.
Moderate
- Decisions cause moderate harm to the public, business operations or finances or noticeable environmental degradation.
- Targeted interventions are required to mitigate these effects.
Major
- Significant harm to the public, substantial business financial losses or operational disruptions, or significant environmental damage.
- Loss of confidence in government, operations, service delivery and partnerships.
- Significant harm to a wide range of businesses, resulting in substantial financial losses, layoffs, and long-term reputational damage.
- Compromises ecosystem wellbeing causing substantial pollution, loss of biodiversity, and resource depletion.
Severe
- AI's influence on critical decision-making processes leads to severe and widespread harm to public, business operations or finances or the environment.
- Potentially endangering lives or significantly impacting public safety, rights and trust.
- Causes massive job losses, undermining business economic stability and viability.
- Catastrophic loss of ecosystems, endangered species, and long-term ecological imbalance or severe resource depletion.
Posing a reputational risk or undermining public confidence in the government
Insignificant
- Isolated reputational issues arise, quickly addressed and explained.
- Causes negligible damage to public trust in government capabilities.
Minor
- Small-scale AI mishaps lead to brief public concern, slightly denting the government's reputation.
- Prompt clarification and corrective measures minimise the long-term impact on public confidence.
- Seen by the government as poor management.
Moderate
- Misapplications result in moderate public dissatisfaction and questioning of government oversight.
- Requires remedial actions to mend trust and address concerns.
- Seen by government and opposition as failed management.
Major
- Widespread public scepticism and criticism, majorly affecting the government's image.
- Requires substantial efforts to rebuild public confidence through transparency, accountability, and improvement of AI governance.
- High-profile negative stories; seen by government and opposition as a significant failure of management.
Severe
- Severe misuse or failure of AI systems leads to profound public distrust and criticism.
- Significantly undermining confidence in government effectiveness and integrity.
- Requires comprehensive, long-term strategies for rehabilitation of public trust, including systemic changes and ongoing engagement.
- Seen by government and opposition as a catastrophic failure of management.
- Minister expresses loss of confidence or trust in the agency.
Risk likelihood table
- Almost certain (91% and above): The risk is almost certain to eventuate within the foreseeable future.
- Likely (61–90%): The risk will probably eventuate within the foreseeable future.
- Possible (31–60%): The risk may eventuate within the foreseeable future.
- Unlikely (5–30%): The risk may eventuate at some time but is not likely to occur in the foreseeable future.
- Rare (less than 5%): The risk will only eventuate in exceptional circumstances or as a result of a combination of unusual events.
-
1. Basic information
1.1 AI use case profile
Complete the information below:
• Name of AI use case.
• Reference number.
• Lead agency.
• Assessment contact officer (name and email).
• Executive sponsor (name and email).
1.2 AI use case description
In plain language, briefly explain how you are using or intend to use AI. 200 words or less.
1.3 Type of AI technology
Briefly explain what type of AI technology you are using or intend to use. 100 words or less.
1.4 Lifecycle stage
These stages can take place in an iterative manner and are not necessarily sequential. They are adapted from the OECD’s definition of the AI system lifecycle. Refer to guidance for further information. Select only one.
Which of the following lifecycle stages best describes the current stage of your AI use case?
- Early experimentation (note: assessment not required).
- Design, data and models
- Verification and validation
- Deployment
- Operation and monitoring
- Retirement
1.5 Review date
Assessments must be reviewed when use cases either move to a different stage of their lifecycle or significant changes occur to the scope, function or operational context of the use case. Consult the guidance document and, if in doubt, contact the DTA.
Indicate the date or milestone that will trigger the next review of the AI use case.
1.6 Assessment review history
Record the review history for this assessment. Include the review dates and brief summaries of changes arising from reviews (50 words or less).
-
2. Purpose and expected benefits
-
3. Threshold assessment
3.1 Risk assessment
Using the risk matrix, determine the severity of each of the risks in the table below, accounting for any risk mitigations and treatments. Provide a rationale and an explanation of relevant risk controls that are planned or in place. The guidance document contains consequence and likelihood descriptors and other information to support the risk assessment.
The risk assessment should reflect the intended scope, function and risk controls of the AI use case. Keep the rationale for each risk rating clear and concise, aiming for no more than 200 words per risk.
Risk matrix (Likelihood/Consequence):
- Almost certain: Insignificant = Medium; Minor = Medium; Moderate = High; Major = High; Severe = High.
- Likely: Insignificant = Medium; Minor = Medium; Moderate = Medium; Major = High; Severe = High.
- Possible: Insignificant = Low; Minor = Medium; Moderate = Medium; Major = High; Severe = High.
- Unlikely: Insignificant = Low; Minor = Low; Moderate = Medium; Major = Medium; Severe = High.
- Rare: Insignificant = Low; Minor = Low; Moderate = Low; Major = Medium; Severe = Medium.
An illustrative lookup sketch showing how the matrix combines likelihood and consequence ratings appears after the list of risks below.
What is the risk (low, medium or high) of the use of AI:
- Negatively affecting public accessibility or inclusivity of government services?
- Unfairly discriminating against individuals, communities or groups?
- Perpetuating stereotyping or demeaning representations of individuals, communities or groups?
- Harming individuals, communities, groups, organisations or the environment?
- Raising privacy concerns due to the sensitivity, amount or source of the data being used by an AI system?
- Raising security concerns due to the sensitivity or classification of the data being used by an AI system?
- Raising security concerns due to the implementation, sourcing or characteristics of the AI system?
- Influencing decision-making that affects individuals, communities, groups, organisations or the environment?
- Posing a reputational risk or undermining public confidence in the government?
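To show how the matrix combines the likelihood and consequence ratings described in the guidance, here is a minimal Python sketch. The table and function names (RISK_MATRIX, rate_risk) are illustrative only and are not part of the assessment template.

```python
# Illustrative only: encodes the risk matrix above as a lookup table.
# The names RISK_MATRIX and rate_risk are hypothetical, not part of the template.

RISK_MATRIX = {
    "Almost certain": {"Insignificant": "Medium", "Minor": "Medium", "Moderate": "High",   "Major": "High",   "Severe": "High"},
    "Likely":         {"Insignificant": "Medium", "Minor": "Medium", "Moderate": "Medium", "Major": "High",   "Severe": "High"},
    "Possible":       {"Insignificant": "Low",    "Minor": "Medium", "Moderate": "Medium", "Major": "High",   "Severe": "High"},
    "Unlikely":       {"Insignificant": "Low",    "Minor": "Low",    "Moderate": "Medium", "Major": "Medium", "Severe": "High"},
    "Rare":           {"Insignificant": "Low",    "Minor": "Low",    "Moderate": "Low",    "Major": "Medium", "Severe": "Medium"},
}

def rate_risk(likelihood: str, consequence: str) -> str:
    """Return the risk level (Low, Medium or High) for a likelihood/consequence pair."""
    return RISK_MATRIX[likelihood][consequence]

# Example: a 'Possible' likelihood with a 'Major' consequence rates as High,
# which would require a full assessment under section 3.2.
print(rate_risk("Possible", "Major"))  # High
```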
3.2 Assessment contact officer recommendation
If the assessment contact officer is satisfied that all risks in the threshold assessment are low, then they may recommend that a full assessment is not needed and that the agency accept the low risk.
If one or more risks are medium or above, then a full assessment must be completed, unless you amend the scope, function or risk controls of the AI use case such that the assessment contact officer is satisfied that all risks in the threshold assessment are low.
You may decide not to accept the risk and not proceed with the AI use case.
The assessment contact officer recommendation should include:
- the statement ‘a full assessment is/is not necessary for this use case’
- comments (optional)
- name and position
- date.
3.3 Executive sponsor endorsement
The executive sponsor endorsement should include:
- the statement ‘I have reviewed the recommendation, am satisfied by the supporting analysis and agree that a full assessment is/is not necessary for this use case’
- comments (optional)
- name and position
- date.
-
4. Fairness
-
For each of the following questions, indicate either yes, no or N/A, and explain your answer.
4.1 Defining fairness
Do you have a clear definition of what constitutes a fair outcome in the context of your use of AI?
Where appropriate, you should consult relevant domain experts, affected parties and stakeholders to determine how to contextualise fairness for your use of AI. Consider inclusion and accessibility. Consult the guidance document for prompts and resources to assist you.
4.2 Measuring fairness
Do you have a way of measuring (quantitatively or qualitatively) the fairness of system outcomes?
Measuring fairness is an important step in identifying and mitigating fairness risks. A wide range of metrics are available to address various concepts of fairness. Consult the guidance document for resources to assist you.
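As one example of a quantitative fairness measure, the sketch below computes the difference in selection rates between demographic groups (a demographic-parity style check). The data, column names and threshold are hypothetical and would need to be chosen for your use case.

```python
# Illustrative only: a demographic-parity style check on system outcomes.
# The dataframe, column names and the 0.1 threshold are hypothetical examples.
import pandas as pd

outcomes = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0],
})

# Selection rate (share of positive outcomes) per group.
rates = outcomes.groupby("group")["approved"].mean()
parity_gap = rates.max() - rates.min()

print(rates.round(2).to_dict())          # {'A': 0.67, 'B': 0.33}
print(f"Parity gap: {parity_gap:.2f}")

# A gap above an agreed threshold could prompt further investigation.
if parity_gap > 0.1:
    print("Selection-rate gap exceeds threshold - review for potential unfairness.")
```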
-
5. Reliability and safety
-
For each of the following questions, indicate either yes, no or N/A, and explain your answer.
5.1 Data suitability
If your AI system requires the input of data to operate, or you are training or evaluating an AI model, can you explain why the chosen data is suitable for your use case?
Consider data quality and factors such as accuracy, timeliness, completeness, consistency, lineage, provenance and volume.
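Several of these data quality factors can be checked automatically before a dataset is accepted for training or validation. The sketch below is a minimal illustration; the file name, column names and thresholds are hypothetical and should be replaced with acceptance criteria appropriate to your use case.

```python
# Illustrative only: basic data-quality checks before training or validation.
# The file name, columns and thresholds are hypothetical examples.
import pandas as pd

df = pd.read_csv("training_data.csv", parse_dates=["collected_at"])

# Completeness: share of missing values per column.
missing_share = df.isna().mean()

# Duplication: proportion of fully duplicated rows.
duplicate_share = df.duplicated().mean()

# Timeliness: age of the most recent record, in days.
age_days = (pd.Timestamp.now() - df["collected_at"].max()).days

issues = []
if (missing_share > 0.05).any():
    issues.append("columns with more than 5% missing values")
if duplicate_share > 0.01:
    issues.append("more than 1% duplicated rows")
if age_days > 365:
    issues.append("most recent record is over a year old")

print(issues or "No data-quality issues found against these thresholds.")
```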
5.2 Indigenous data
If your AI system uses Indigenous data, including where any outputs relate to Indigenous people, have you ensured that your AI use case is consistent with the Framework for Governance of Indigenous Data?
Consider whether your use of Indigenous data and AI outputs is consistent with the expectations of Indigenous people, and the Framework for Governance of Indigenous Data (GID). See definition of Indigenous data in guidance material.
5.3 Suitability of procured AI model
If you are procuring an AI model, can you explain its suitability for your use case?
May include multiple models or a class of models. Includes using open-source models, application programming interfaces (APIs) or otherwise sourcing or adapting models. Factors to consider are outlined in guidance.
5.4 Testing
Outline any areas of concern in the results from testing. If testing has not yet occurred, outline the elements to be considered in the testing plan (for example, the model's accuracy).
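Where acceptance criteria have been agreed, testing against them can often be automated. The sketch below checks a model's accuracy on a held-out test set against a minimum threshold; the model, test data and the 0.90 threshold are hypothetical.

```python
# Illustrative only: checking a model against a pre-agreed acceptance criterion.
# The model, test data and the 0.90 threshold are hypothetical examples.
from sklearn.metrics import accuracy_score

ACCEPTANCE_THRESHOLD = 0.90  # agreed minimum accuracy for this use case

def meets_acceptance_criteria(model, X_test, y_test) -> bool:
    """Return True if accuracy on the held-out test set meets the threshold."""
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Test accuracy: {accuracy:.3f} (threshold {ACCEPTANCE_THRESHOLD})")
    return accuracy >= ACCEPTANCE_THRESHOLD
```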
5.5 Pilot
Have you conducted, or will you conduct, a pilot of your use case before deploying?
If answering ‘yes’, explain what you have learned or hope to learn in relation to reliability and safety and, if applicable, outline how you adjusted the use of AI.
5.6 Monitoring
Have you established a plan to monitor and evaluate the performance of your AI system?
If answering ‘yes’, explain how you will monitor and evaluate performance.
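Ongoing monitoring can be partly automated, for example by comparing recent performance with a validation baseline and escalating on material degradation. The sketch below is a minimal illustration; the baseline, tolerance and logger names are hypothetical.

```python
# Illustrative only: a periodic performance check for a deployed AI system.
# The baseline, tolerance and logger name are hypothetical examples.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_monitoring")

BASELINE_ACCURACY = 0.92   # accuracy observed during validation
TOLERANCE = 0.05           # acceptable drop before escalation

def check_performance(recent_accuracy: float) -> None:
    """Log an alert if recent accuracy has dropped materially below the baseline."""
    drop = BASELINE_ACCURACY - recent_accuracy
    if drop > TOLERANCE:
        logger.warning("Accuracy dropped by %.3f - escalate for review.", drop)
    else:
        logger.info("Accuracy within tolerance (drop of %.3f).", drop)

check_performance(0.85)  # triggers the warning path in this example
```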
5.7 Preparedness to intervene or disengage
Have you established clear processes for human intervention or safely disengaging the AI system where necessary (for example, if stakeholders raise valid concerns with insights or decisions or an unresolvable issue is identified)?
See guidance document for resources to assist you in establishing appropriate processes.
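One common pattern for intervention and disengagement combines an operational kill switch with a confidence gate that routes uncertain cases to a human reviewer. The sketch below illustrates the pattern; the flag, threshold and function names are hypothetical.

```python
# Illustrative only: a kill switch plus a confidence gate for human review.
# The flag, threshold and function names are hypothetical examples.

AI_ENABLED = True            # operational kill switch; set False to disengage the AI
REVIEW_THRESHOLD = 0.8       # below this confidence, a human decides

def route_case(ai_prediction, confidence: float):
    """Return the AI result only when the system is enabled and sufficiently confident."""
    if not AI_ENABLED:
        return "manual_process"          # AI disengaged entirely
    if confidence < REVIEW_THRESHOLD:
        return "human_review"            # human intervention required
    return ai_prediction                 # AI output used, subject to monitoring

print(route_case("approve", confidence=0.65))  # human_review
```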
-
6. Privacy protection and security
-
For each of the following questions, indicate either yes, no or N/A, and explain your answer.
6.1 Minimise and protect personal information
Are you satisfied that any collection, use or disclosure of personal information is necessary, reasonable and proportionate for your AI use case?
See guidance on data minimisation and privacy enhancing technologies.
6.2 Privacy assessment
Has the AI use case undergone a Privacy Threshold Assessment or Privacy Impact Assessment?
6.3 Authority to operate
Has the AI system been authorised or does it fall within an existing authority to operate in your environment, in accordance with Protective Security Policy Framework (PSPF) Policy 11: Robust ICT systems?
Engage with your agency’s IT Security Adviser and consider the latest security guidance and strategies for AI use (such as Engaging with AI from the Australian Signals Directorate).
-
7. Transparency and explainability