Talking with Conversational Agents in Collaborative Action

This one-day workshop intends to bring together both academics and industry practitioners to explore collaborative challenges in speech interaction. Recent improvements in speech recognition and computing power have led to conversational interfaces being introduced to many of the devices we use every day, such as smartphones, watches, and even televisions. These interfaces allow us to get things done, often by just speaking commands, relying on a reasonably well understood single-user model. While research on speech recognition is well established, the social implications of these interfaces remain underexplored, such as how we socialise, work, and play around such technologies, and how these might be better designed to support collaborative collocated talk-in-action. Moreover, the advent of new products such as the Amazon Echo and Google Home, which are positioned as supporting multi-user interaction in collocated environments such as the home, makes exploring the social and collaborative challenges around these products a timely topic. In the workshop, we will review current practices and reflect upon prior work on studying talk-in-action and collocated interaction. We wish to begin a dialogue that takes on the renewed interest in research on spoken interaction with devices, grounded in the existing practices of the CSCW community.


Introduction
Many of the recent personal mobile devices released to market, such as smartphones, tablets, and watches, have embraced the use of automatic speech recognition as a viable form of device interaction. The devices typically feature a speech interface, often referred to as a conversational agent or intelligent personal assistant, that embodies the idea of a virtual butler [14]. These interfaces listen to spoken commands and queries, and respond accordingly by performing a broad range of tasks on the user's behalf, such as providing facts and news, and setting reminders and alarms. Furthermore, the systems are often anthropomorphised by being given names (e.g. Cortana or Siri, see Figure 1 for screenshots) and endowed with humanlike behaviours [21] such as humour. However, recent research shows that despite grand promises made by manufacturers, existing commercial systems fail to meet users' expectations [10].
Amazon and Google have further embraced this trend by launching standalone hardware in the form of the Amazon Echo (see Figure 2) and Google Home. These devices are specifically designed to be placed in social spaces, such as on kitchen tables, within everyday settings such as the home. While these devices function in much the same way as their mobile equivalents, they rely entirely upon speech for interaction to support their broad range of abilities. Given the growing IoT trend, it is hardly surprising that these devices also support the ability to control connected domestic appliances such as lights, thermostats, and kettles.
Given the ever-expanding abilities of conversational agents, the scope for their use, and thus the range of social and collaborative contexts in which their use can become occasioned, is also expanding. This provides a renewed impetus for research in HCI and CSCW to explore the practical and social implications of the use of these systems, not just in the lab but also in everyday use in vivo. Recently, researchers have begun to explore the use of personal assistants in public settings such as cafés [15] and workplaces [11]. Inspired by this recent work, this workshop seeks to explore the use, study, and design of conversational agents in social and collaborative settings. In the following, we highlight some understudied features of conversational agent use in social settings, to topicalise and energise research activity in this space.

Conversational Agents in Social Settings
The growing pervasiveness of speech-enabled technologies revitalises existing research on the practices of face-to-face conversation, on interacting with speech-enabled technologies, and on the use of ubiquitous technologies in multi-device ecologies. To explore this space, the workshop invites contributions across a range of topical interests to examine and collect exemplars of current and future challenges the CSCW community faces in researching interactions with and around conversational agents. The broad and diverse sociotechnical topics that arise with new speech-enabled technologies are illustrated by two exemplar challenges relating to the transition of conversational agent technologies from touchscreen to screenless devices, and from single-user to multi-user.

From Touch to Speech
Interacting with personal assistants on mobile devices relies on both touching the device and speaking to it, allowing users to hold and interact with devices in tune with the prevalent interaction paradigm for touchscreen devices. However, with the transition to screenless devices (see Figure 2), fundamentally different qualities emerge that common visual-touch interaction does not occasion [15], related for instance to the potentially more 'natural' accountability of the ongoing interaction to everyone within earshot. By forming a more in-depth understanding of how people talk to agents in conversation, HCI designers can inform and tailor design [13], and potentially support collaborative collocated interactions [5] in different environments. In turn, how such mundane practice occurs with devices that feature no screen remains as yet unexplored.
To answer how researchers can uncover the social and technological challenges of interacting with conversational agents in multi-party settings, we can examine existing CSCW work on collocated interactions with technology. For example, numerous pieces explore the everyday social practice and the challenges around group interactions with personal devices in settings as diverse as the pub [16], around the television [19], and during family mealtimes [2,12], drawing upon analytic perspectives such as EMCA (Ethnomethodology and Conversation Analysis). This work serves as an important methodological grounding to understanding how people use speech-enabled technologies, how others can co-manage the use of the technology in a face-to-face setting, and of the broader social implications of the use of these systems and devices.

From Single-User to Multi-User
For devices that feature visual displays, the display may be used to communicate state to the user. However, in multi-party settings this state is likely not observable-reportable to those nearby, and therefore relies on 'the user' to account for the on-screen information themselves [15]. Collocated interactions research has long explored how to support multi-device and multi-user interactions for collaborative activity (e.g. [1,7,17]). As Wooffitt et al. remind us, engaging in dialogue is itself a co-operative effort in the negotiation of meaning [4:14], but how this cooperation can be embraced between multiple users while engaging with speech-enabled technologies remains an open question. Although speaking is naturally observable and reportable to all present to its production, it is also transient [20] and requires attention from members to listen and interpret as the action unfolds.
Devices that adopt a multi-user perspective in their design, such as Amazon Echo and Google Home, provide even greater deviation from accustomed mobile devices by eschewing displays altogether. Additionally, through the combination of multiple far-field microphones and speakers, and a tubular design, the devices support interactions by multiple users in all directions, even those who are not close to the device. This new wave of speech-only technologies encourages us to consider how their use becomes embedded within everyday talk. By revealing how interactions with speech-enabled technologies are cooperatively attended to by users and spectators, and how talking to a speech-enabled technology is collaboratively managed through talk-in-interaction in face-to-face settings, we can shape the design of future technologies. Given the significant potential that speech-only technologies possess to enrich collaborative work, play, and living, it seems particularly pertinent and timely for CSCW to explore the challenges involved.

Themes and Goals
We invite contributions related, but not limited, to any of the following:
In summary, we have briefly outlined underexplored concerns of speech-based interaction in social settings relating to recent technological trends, which have in turn raised numerous open-ended questions. By adopting methodical practice from existing CSCW work, many of these questions could be explored. The overarching goal of the workshop is to reflect on the insights and established practice of research across this diverse domain, to understand the impact of interacting with technology through speech and, even more so, how we can design and study such interactions.

Workshop Plan
The one-day workshop is structured into a series of segments, designed to introduce participants to each other's work, and to foster intellectually stimulating interaction. The day will begin with sessions devoted to acquainting attendees with each other and their work. Each participant will be asked to briefly present their position paper or poster to the group; this will be fast-paced and time-limited in relation to the number of submissions. We will follow this with activities around mapping the design space, and discussion of the key challenges raised within the submissions and through related work, to prepare for the development of opportunities and design ideas in the afternoon.
The afternoon will focus on breakout sessions to discuss the challenges that emerged in the morning. Each participant will write down the main challenges they face in their work and share their experience. They can share the main methodological choices, experiments, and approaches of their work. Main themes, trends, and areas will emerge, and organisers will take notes (cards, post-its). Participants, in groups, will reflect on and share their experience and knowledge, identifying CSCW opportunities and design ideas to be explored in each area previously discussed. This discussion will be materialised in a fictional scenario, and make use of paper prototyping materials. Participants will discuss the main elements present in a collaborative conversational system (task, context, single or multi-user, language, tone, content, etc.) and will create a scenario envisioning a design idea or an opportunity.
Towards the end of the day, we will reconvene to discuss and reflect on the outcomes of the workshop activities. This will include a demonstration of the breakout activities, facilitating and stimulating a discussion amongst participants and organisers. This session will allow attendees to consider and reflect upon the challenges and themes raised during the morning, and to explore how the CSCW community can respond to these challenges. The outcomes of the day, including this discussion, will be recorded to allow for follow-up activities. We will orient towards forming tangible outcomes of the interactive sessions of the workshop, including an agenda for research in this space. Follow-up activities from the workshop could include further related workshops, joint publications, and potentially a special issue of a journal.

Organisers
The organisers have recently co-organised a range of workshops that explored different topics within collocated and social interaction [3,6,8,9,18]. We seek to build on these recent experiences, as well as on academic and industrial practice and research on interacting with speech recognition systems.