Getting closer to Multimodal Interaction

Humans are complex beings. Every living being is complex in its own way, and that complexity is linked to the amount of data stored in its brain. Consider a puppy: it becomes an adult dog in less than 3 years and learns almost everything it needs for survival in that time. The same holds for other living beings whose life spans are short compared to ours. Next to the learning speed of a puppy, fawn, pony, or calf, humans are slow learners; a 2-3-year-old child cannot match the young of most of the animal kingdom. But humans beat almost every other living being at long-term learning, helped by a long life span. Without that, it would no doubt be a real-life version of Planet of the Apes.

Household activities
Image courtesy of Csanády Szilvia

Now comes the next great thing. Living in a sophisticated society, humans interact and communicate with other humans, namely parents, siblings, friends, and colleagues, or with pets for those living away from family. The interaction may serve a want or a need, like a baby crying for milk, or you standing in a grocery store with a cart full of items waiting to be billed. We perceive things through our various senses, and the brain decides what to do next. Standing in the queue, you think carefully about your next move: who is in front of you, and how much space lies between you and the next person. When you see a little space, you slowly move forward to get closer to the billing counter.

Shopping in supermarket

Let’s analyze this situation. You look for a gap and push the cart slowly forward or back depending on how you are positioned: the angle, any obstacle, the rail along the side, the partition that divides the counters. If your cart is heavy you push a bit harder, taking care not to ram the person in front of you, and if you do bump into them you apologize, either in advance or right after. You might also prepare to pay, which mostly means keeping your credit card in your front pocket so you don’t waste your time, the cashier’s time, or the time of the person behind you. This is the level of interaction and thought that goes into the small event of buying groceries.


Artwork
Image courtesy of Jagoda Jankowska

We could extend this to anything we do in the real world. Whether you are filling in a form, watering a plant, cooking your favorite meal, tuning your guitar, or driving in an unfamiliar area, you collaborate with things that help you get the job done. It is common nowadays to cook while checking a Google Home Hub, or to control a smart bulb, curtains, or windows. What these examples show is that we like to do more with less. Call it multitasking or smart automation; the idea is always to make things easier, faster, and simpler.

What is Multimodal Interaction or MMI? 
Multimodal Interaction, or multimodal human-computer interaction, refers to the “interaction with the virtual and physical environment through natural mode of communication”. Users are given multiple modes for interacting with the system: two or more combined input modes, which could be any combination of touch, gesture, gaze, speech, and facial expression. The goal is a simple, efficient, and pleasant experience for users while they perform their task.

How does Multimodal interaction work? 
Multimodal interaction is similar to any other interaction, except that the user has multiple options for interacting, which comes at a price in design and development effort. Users can pick any of the available modes at their convenience; the main idea is to offer a choice that suits the user’s situation. Suppose you are in the middle of an activity and want to multitask, continuing both activities without stopping either. For example, you are playing a chord on your guitar and want to move to the next sheet, or you are thinking aloud and dictating the report you would otherwise type.
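The guitar example can be sketched in code. The snippet below is a minimal illustration, not a real framework: the `SheetMusicViewer` class, the mode names, and the binding table are all hypothetical. It shows the core idea that several input modes can be bound to the same action, so the user picks whichever mode is free at that moment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class InputEvent:
    mode: str   # hypothetical mode name, e.g. "touch", "speech", "gesture"
    value: str  # mode-specific payload, e.g. a recognized phrase


class SheetMusicViewer:
    """Toy application state: which sheet is currently shown."""

    def __init__(self, pages: int) -> None:
        self.pages = pages
        self.current = 1

    def next_page(self) -> None:
        self.current = min(self.current + 1, self.pages)

    def previous_page(self) -> None:
        self.current = max(self.current - 1, 1)


def build_bindings(viewer: SheetMusicViewer) -> Dict[Tuple[str, str], Callable[[], None]]:
    # Several (mode, value) pairs point at the same action.
    return {
        ("touch", "swipe_left"): viewer.next_page,
        ("speech", "next page"): viewer.next_page,
        ("gesture", "foot_tap"): viewer.next_page,
        ("touch", "swipe_right"): viewer.previous_page,
        ("speech", "previous page"): viewer.previous_page,
    }


def handle(event: InputEvent, bindings: Dict[Tuple[str, str], Callable[[], None]]) -> bool:
    """Run the action bound to the event; return False if nothing is bound."""
    action = bindings.get((event.mode, event.value.strip().lower()))
    if action is None:
        return False
    action()
    return True
```

A guitarist with both hands busy could say "next page" or tap a foot pedal, while someone with free hands could swipe; all three end up calling the same `next_page` action.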

Multimodal interaction model
Artwork by Author

The table below shows the most commonly used interaction and perception modes for acquiring knowledge about the state of the system and taking action. The list covers general users and the most commonly used technology. For users with impairments, elderly users, and technology-specific devices, the list would be tailored to the specific user and the associated technology.

5 Senses
Table by Author

The main idea of multimodal interaction is to give users more freedom and control by providing more than one way to interact with the system. Constraining interaction to a single type is not always beneficial, because we do not know the user’s situation at any given instant. You could be driving, cooking, running, walking, exercising, showering, travelling, listening to something, looking at something, or have your hands occupied, and be unable to stop what you are doing.


Multimodal interaction & senses
Artwork by Author

The illustration above shows the possible interactions a user can have with the system, based on the available degrees of freedom for interaction, the limitations of the technology, and the situation of the user using the device (system).

The different interactions available are not meant to overwhelm users but to help them complete tasks successfully, keeping the experience simple, efficient, and pleasant, since we do not know the user’s context.

Do I really need multimodal interaction? 

Before asking whether we really need multimodal interaction, let’s understand how we interact with different devices and applications, depending on need, on a daily basis. Most devices are designed for multimodal interaction, be it a sophisticated computer, a smart TV, or something else, and multimodal interaction is constantly evolving. The list below gives an idea of the devices most of us interact with.


Stationary Connected Devices
• Devices - Desktop computers, home appliances with display panels, embedded devices, smart home hubs
• Connectivity - Wired networks, Wi-Fi, paired devices
• Portability - Users are accustomed to using these devices in the same location and setting on a habitual basis
• Interaction - Main interaction consists of a screen or monitor with an input device such as a mouse, keyboard, pen, remote, or touch screen. Quasi-standardized methods of voice interaction between similar device genres (desktop computers vs. connected hubs like Google Home/Amazon Alexa vs. smart thermostats).

Phones
• Devices - Android phones, iPhones, phablets, BlackBerry
• Connectivity - Cellular networks, Wi-Fi, paired devices
• Portability - Used on the go, mostly not stationary. Environmental context has a substantial impact on voice interactivity
• Interaction - Users are accustomed to using voice interaction. Allows interaction through visual, auditory, and tactile feedback. Interaction methods are fairly standardized across models.

Wearables
• Devices - Watches, fitness bands, smart shoes, earbuds
• Connectivity - Cellular networks, Wi-Fi, paired devices
• Portability - Used on the go, mostly not stationary
• Interaction - Users may be accustomed to using voice interaction; visual and tactile channels are more passive, with no explicit user interaction

Non-Stationary Computing Devices (Non-Phones) 
• Devices - Laptops, tablets, transponders, automobile infotainment systems
• Connectivity - Wireless networks, wired networks (not common), Wi-Fi, paired devices
• Portability - Fixed to limited mobility
• Interaction - Primary input mode is typically not voice; relies on tactile input and connected or integrated input devices


Based on interaction, the devices commonly seen are:

Screen-First devices
Devices that allow interaction through a GUI. These may be touch devices or devices that take input from a keyboard, mouse, pen, or digital or analog buttons, as well as embedded devices.

Voice Agents In Screen-First Devices
Devices that primarily use a GUI, with voice actions as an enhancement on top of it. Here the user interacts through both voice and the touchscreen.

Voice-Only Devices 
These devices have no visual display, and users rely on audio for both input and output. This limits them to specific functions; they are aimed at specific tasks, including standard uses such as checking the weather, reading mail, and playing music or movies.

Voice-First Devices
These devices accept voice commands as the primary input and come with an integrated screen, which lets users move seamlessly between touch and voice depending on their comfort. This is simpler and more efficient, and it reduces the user’s cognitive load while perceiving and interacting. These are examples of multimodal interaction that give users freedom and control.

Most devices are multimodal friendly, but it is up to you to decide whether your users want multimodal interaction, a graphical user interface, or a voice user interface, as each comes with its own pros and cons.

MMI Ecosystem 
The diagram below shows the architecture of an MMI ecosystem: the communication between a smart house, a car, and the devices the user interacts with, such as smartphones and wearables, through the web and the cloud.

Multimodal interaction Ecosystem
Image courtesy of 

The MMI Architecture, along with EMMA (Extensible Multi-Modal Annotations), provides a layer that virtualizes user interaction on top of generic Internet protocols such as WebSockets and HTTP, or on top of more specific protocols such as ECHONET. The diagram below shows the communication layers between the user, the multimodal interface, the device, the application, and the integrating technology.

Multimodal interaction Ecosystem
Image courtesy of 
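To give a feel for what an EMMA annotation looks like, here is a rough sketch that builds one with Python's standard `xml.etree.ElementTree`. The element and attribute names (`emma:interpretation`, `emma:medium`, `emma:mode`, `emma:confidence`, `emma:tokens`) follow my reading of the W3C EMMA 1.0 Recommendation and should be checked against the specification. The `command` payload element and the function name are assumptions made up for this illustration, since the payload is application-defined.

```python
import xml.etree.ElementTree as ET

# Namespace as given in the W3C EMMA 1.0 Recommendation.
EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)


def emma_interpretation(tokens: str, medium: str, mode: str,
                        confidence: float, command: str) -> str:
    """Wrap one recognized user input in an EMMA-style annotation."""
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(
        root,
        f"{{{EMMA_NS}}}interpretation",
        {
            "id": "int1",
            f"{{{EMMA_NS}}}medium": medium,          # e.g. "acoustic" or "tactile"
            f"{{{EMMA_NS}}}mode": mode,              # e.g. "voice" or "gui"
            f"{{{EMMA_NS}}}confidence": f"{confidence:.2f}",
            f"{{{EMMA_NS}}}tokens": tokens,          # what the recognizer heard or saw
        },
    )
    # Hypothetical application payload: what the system should do.
    ET.SubElement(interp, "command").text = command
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(emma_interpretation("next page", "acoustic", "voice", 0.82, "next_page"))
```

The point of such an annotation is that the application receives the same kind of document whether the input came from speech, touch, or a gesture, with the mode and confidence carried as metadata rather than baked into the application logic.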

Where is multimodal Interaction used? 
The answer to “where is multimodal interaction used?” is everywhere; it depends on whether the user wishes to use it and on the application the user is interacting with. Most general-purpose smartphone applications are multimodal, such as Google Maps, the Chrome browser, and the phone’s built-in applications. Most everyday gadgets are capable of handling multimodal interaction, be it desktops, laptops, smart TVs, thermostats, or IoT devices, and the list goes on. Manufacturing, healthcare, automobiles, hi-tech construction, enterprise IT solutions, and embedded digital devices are all places for MMI.

Strategy for using MMI

Diverse abilities. MMI lets users choose the most efficient mode of interaction among those offered by the system, depending on their physical condition and abilities.

Personalization. One input mode does not fit every context. With MMI, the interaction can be customized for a given task in an application as the user prefers.

Interaction patterns. Different users interact with an application differently, and designers should know whom they are designing for, given the wide range of users.

Independence. Multimodal interfaces should empower older users to independently interact with the technology, even when there is a specific impairment (for example hearing loss or reduced sight).

Technology reliability. Users should be able to rely on the multimodal technology, especially in the case of assistive technology. For this reason, multimodal processing should be accurate and robust.

Privacy & Context of use. Users have privacy and other concerns that depend on context: using gesture and voice commands in public spaces, using voice commands in noisy environments, mixing speech, gesture, and touch while interacting, and preferring visual over audio output.

Oviatt’s “Ten Myths of Multimodal Interaction” (Oviatt, 1999) offers useful insights for those researching and building multimodal systems, with a few especially apropos:

Myth: If you build a multimodal system, users will interact multimodally. Rather, users tend to intermix unimodal and multimodal interactions. Fortunately, multimodal interactions are often predictable based on the type of action being performed.


Myth: Multimodal input involves simultaneous signals. Multimodal signals often do not co-occur temporally, and much of multimodal interaction involves the sequential (rather than simultaneous) use of modalities.

Myth: Multimodal integration involves redundancy of content between modes. Complementarity of content may be more significant in multimodal systems than redundancy.

Myth: Enhanced efficiency is the main advantage of multimodal systems. Multimodal systems may increase efficiency, but not always. Their main advantages may be found in other aspects, such as decreased errors, increased flexibility, or increased user satisfaction.

Myth: Individual error-prone recognition technologies combine multimodally to produce even greater unreliability. In an appropriately flexible multimodal interface, people determine how to use the available input modes most effectively; mutual disambiguation of signals may contribute to a higher level of robustness.
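Two of these points, that modalities are often used one after the other and that their content tends to be complementary rather than redundant, can be illustrated with a small sketch. The code below is not taken from Oviatt's work; it is a simplified illustration in which a spoken command such as "turn this off" is paired with the nearest touch signal inside a time window. The `Signal` type, the mode names, and the 4-second window are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Signal:
    mode: str      # "speech" or "touch" in this sketch
    payload: str   # recognized phrase, or the id of the touched object
    t: float       # timestamp in seconds


def fuse(signals: List[Signal], window: float = 4.0) -> List[Dict[str, Optional[str]]]:
    """Pair each spoken command with the closest touch within `window` seconds.

    The signals do not need to overlap in time: a touch shortly before or
    after the utterance still supplies the missing target.
    """
    speech = [s for s in signals if s.mode == "speech"]
    touches = [s for s in signals if s.mode == "touch"]
    fused = []
    for cmd in speech:
        candidates = [p for p in touches if abs(p.t - cmd.t) <= window]
        target = min(candidates, key=lambda p: abs(p.t - cmd.t), default=None)
        fused.append({
            "command": cmd.payload,
            "target": target.payload if target else None,
        })
    return fused
```

Here speech carries the action and touch carries the object, so neither signal repeats the other. A command with no touch nearby comes back with `target` set to `None`, which the application could treat as a purely unimodal input.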

Reeves et al. (2004) defined the following guidelines for multimodal user interface design:

• Multimodal systems should be designed for the broadest range of users and contexts of use. Designers should support the best modality or combination of modalities anticipated in changing environments (for example, private office vs. driving a car).


• Designers should take care to address privacy and security issues in multimodal systems. For example, non-speech alternatives should be available in a public context to prevent others from overhearing private information or conversations.

• Maximize human cognitive and physical abilities, based on an understanding of users’ human information processing abilities and limitations.

• Modalities should be integrated in a manner compatible with user preferences, context, and system functionality. For example, match the output to acceptable user input style, such as constrained grammar or unconstrained natural language.

• Multimodal interfaces should adapt to the needs and abilities of different users, as well as different contexts of use. Individual differences (for example, age, preferences, skill, sensory or motor impairment) can be captured in a user profile and used to determine interface settings.

• Be consistent – in system output, presentation and prompts, enabling shortcuts, state switching, etc.

• Provide good error prevention and error handling; make functionality clear and easily discoverable.
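The guidelines on user profiles and privacy can be sketched as a small rule set. This is a hypothetical illustration and not part of Reeves et al.; the profile fields, the mode names, and the rules themselves are assumptions picked to mirror the examples above (impairments captured in a profile, non-speech alternatives in public).

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class UserProfile:
    hearing_impaired: bool = False
    low_vision: bool = False
    hands_busy: bool = False   # e.g. driving or cooking
    in_public: bool = False    # privacy-sensitive context


@dataclass
class InterfaceSettings:
    inputs: Set[str]
    outputs: Set[str]
    large_text: bool = False


def choose_settings(profile: UserProfile) -> InterfaceSettings:
    """Start from all modes and remove the ones that do not fit the user."""
    inputs = {"touch", "speech", "gesture"}
    outputs = {"visual", "audio", "haptic"}

    if profile.hearing_impaired:
        outputs.discard("audio")
    if profile.hands_busy:
        inputs.discard("touch")
    if profile.in_public:
        # Avoid being overheard: no spoken input, no spoken output.
        inputs.discard("speech")
        outputs.discard("audio")

    return InterfaceSettings(inputs, outputs, large_text=profile.low_vision)
```

A real system would likely rank modes rather than remove them outright, and would let the user override the result, but even this small version shows how a profile can drive interface settings.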


Limitations in multimodal interaction

Everything comes with limitations and trade-offs, and multimodal interaction is no exception. The limitations could include unknown user context, insufficient testing, technology constraints, information processing, integration, slips, delays, consistency, and scalability.

Complexity vs Simplicity. Providing users with every possible interaction might increase the complexity of designing a solution, as there are many valid options to consider.

Personalization vs Customization. Multimodal interaction can be tailored to the specific preferences or needs of the user. This process might end up in over-personalization of the interaction, making it difficult for the user to discover or experiment with alternative interaction modalities.

Independence vs Assistance. The cognitive effort required varies between users, so the interaction needs personalizing by user type; users with impairments and older users have different needs.

Ambiguity: Ambiguity arises when more than one interpretation of an input is possible. This normally happens when a gesture, speech command, or touch input overlaps with another, or when a single modality has more than one interpretation; it can be intended or unintended in the environment.

Ambiguity can be resolved by three methods:

Prevention. Requires users to follow predefined interaction behaviour matching the required input before a change of state is allowed.

A posteriori resolution. This method uses a mediation approach in which users can confirm, repeat, delay their reaction, undo, or repair.

Approximation resolution. Here ambiguity is removed using techniques such as fuzzy logic, Markov random fields, or Bayesian networks.
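As a rough illustration of how the last two methods can work together, the sketch below combines per-modality scores by multiplying them and normalizing. This is a naive stand-in for the Bayesian network approach named above, not an implementation of it. If the best interpretation is still not confident enough, it falls back to a posteriori mediation by asking the user to confirm. The function names, the 0.7 threshold, and the 0.001 floor for missing scores are assumptions made for this example.

```python
from typing import Dict, Tuple


def fuse_scores(per_mode: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Multiply each intent's probability across modes, then normalize.

    `per_mode` maps a mode name to that recognizer's probability per intent.
    An intent a recognizer did not report gets a small floor value.
    """
    if not per_mode:
        raise ValueError("at least one modality is required")
    intents = set().union(*per_mode.values())
    fused = {}
    for intent in intents:
        p = 1.0
        for scores in per_mode.values():
            p *= scores.get(intent, 1e-3)
        fused[intent] = p
    total = sum(fused.values())
    return {intent: p / total for intent, p in fused.items()}


def resolve(per_mode: Dict[str, Dict[str, float]],
            threshold: float = 0.7) -> Tuple[str, str]:
    """Return ("execute", intent) if confident, else ("confirm", intent)."""
    fused = fuse_scores(per_mode)
    best = max(fused, key=fused.get)
    decision = "execute" if fused[best] >= threshold else "confirm"
    return decision, best
```

For example, speech alone scoring 0.55 for `lights_on` and 0.45 for `lights_off` is too close to call, so the system would ask for confirmation. Adding a gesture recognizer that scores 0.8 for `lights_on` pushes the fused result above the threshold, which is the mutual disambiguation effect described in the myths above.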

The content above gives an understanding of how multimodal interaction can help users complete a given task effectively. MMI helps make interaction simple, fast, and efficient, giving users more freedom and control when interacting with different applications, devices, and technology. Designers need to determine the most intuitive and effective combinations of input and output modalities for different users, applications, and usage contexts, as well as how and when to best integrate those modalities.

About Author

UX Author Balaji C P

Balaji C P 
User Experience Specialist 


