5/5/17

Making Sense of Sensors – Types and Levels of Recognition

Sensors come in many types, ranging from inertial to proximity/location to audio/visual. Let’s go over three of these types (inertial, audio and visual) and describe how they generally work.

Inertial measurement unit

The accelerometer is the best place to start understanding inertial measurements. An accelerometer essentially measures the force (proper acceleration) along each axis. Such devices are typically referred to as 3-axis accelerometers since they provide force along the x, y and z axes. A gyroscope helps determine orientation by measuring rotation about a given axis. Figure 1 provides an illustration of the accelerometer and gyroscope data capture. The two together provide acceleration and orientation and may be referred to as a 6-axis sensor. Depending on the usage model, data from these sensors is typically captured at rates ranging from a few Hz to several kHz. Raw data from an accelerometer or gyroscope is typically noisy, so filters that smooth each reading over multiple data points are common. Such sensors can be applied to a wide variety of use cases, for example:

Understanding of position/orientation helps mobile phones today re-orient the screen in portrait or landscape mode and reverse direction as required.
These sensors also enable simple gestures to be recognized based on buffering continuous data and looking at the change in force and orientation. These gestures could be as simple as “shake” vs. “roll” vs. a “circle” motion along a certain axis.
Sensors are also used for navigation purposes, but the result is relative: they provide force in a given direction but do not provide an absolute location. This approach is typically known as dead reckoning, and it requires other sensor information to keep the noisy sensor data from causing significant drift in the relative position over time.
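The smoothing and screen re-orientation ideas above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular phone: the exponential filter, the `alpha` value, the function names, and the axis convention (y along the long edge of the screen, readings in g) are all assumptions made for the example.

```python
def low_pass(samples, alpha=0.2):
    """Exponentially smooth a sequence of (x, y, z) accelerometer samples.

    Each output blends the new reading with the previous smoothed value,
    so a single noisy sample cannot swing the result very far.
    """
    smoothed = []
    state = samples[0]
    for sample in samples:
        state = tuple(alpha * new + (1 - alpha) * old
                      for new, old in zip(sample, state))
        smoothed.append(state)
    return smoothed


def screen_orientation(accel):
    """Guess portrait vs. landscape from where gravity points in the screen plane.

    Assumed convention: y runs along the long edge of the screen.
    """
    x, y, _ = accel
    return "portrait" if abs(y) >= abs(x) else "landscape"


# Hypothetical noisy readings (in g) from a device held upright.
readings = [(0.05, 0.98, 0.02), (-0.30, 1.10, 0.10),
            (0.20, 0.85, -0.05), (0.02, 1.01, 0.00)]
filtered = low_pass(readings)
print(screen_orientation(filtered[-1]))
```

A smaller `alpha` smooths more aggressively but reacts more slowly when the device is actually rotated, which is the usual trade-off when tuning such a filter.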

Figure 1. Intro to accelerometer, gyroscope and IMU
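To see why dead reckoning drifts, consider a stationary device whose accelerometer reports a tiny constant offset. The sketch below integrates acceleration twice to obtain relative position; the bias value, the sampling rate, and the simple Euler integration are assumptions chosen for illustration only.

```python
def dead_reckon(accels, dt):
    """Integrate acceleration twice (simple Euler steps) to get relative position."""
    velocity = 0.0
    position = 0.0
    track = []
    for a in accels:
        velocity += a * dt          # acceleration -> velocity
        position += velocity * dt   # velocity -> position
        track.append(position)
    return track


# The device never moves, but the sensor reports a small constant bias.
bias = 0.01   # m/s^2, hypothetical sensor offset
dt = 0.01     # seconds per sample (100 Hz)
track = dead_reckon([bias] * 6000, dt)   # 60 seconds of data

print(f"apparent drift after 10 s: {track[999]:.2f} m")
print(f"apparent drift after 60 s: {track[5999]:.2f} m")
```

Because the error is integrated twice, it grows roughly with the square of elapsed time: about half a meter after 10 seconds and about 18 meters after a minute in this example. This is why an absolute reference from another sensor is needed to correct the estimate periodically.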

Audio Sensors

The typical audio sensor used in most devices is a standard microphone. Microphones convert sound to electrical signals and are prevalent in many applications, ranging from public address systems to movies to laptops to mobile, wearable, and IoT devices. In this section, we primarily focus on the microphones used in lower-end devices for IoT and wearable applications. Most of the microphones used in these devices are MEMS (microelectromechanical systems) sensors and may have an analog or a digital output. The output of the microphone is typically pulse-density modulated (PDM) or pulse-code modulated (PCM), and data is captured at bit depths from 4-bit to 64-bit, which can be tuned for signal-to-noise ratio and quality of the capture. The typical frequency response of microphones ranges from 20 Hz to 20 kHz. To improve the quality of the recording and recognition, two or three microphones are sometimes used, for example in mobile phones. The data from each of these microphones can then be processed together to reduce noise and deliver higher-quality output (e.g., voice clarity when making a phone call).
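The relationship between PDM and PCM can be made concrete with a toy example. In PDM, amplitude is encoded as the density of 1 bits in a fast one-bit stream; averaging blocks of those bits recovers multi-bit PCM samples. The sketch below uses a first-order sigma-delta modulator and a plain block average. Both are simplifying assumptions (real microphones and codecs typically use higher-order modulators and proper decimation filters), and the function names are hypothetical.

```python
def pdm_encode(samples):
    """First-order sigma-delta modulator: samples in [-1, 1] -> list of 0/1 bits."""
    bits = []
    acc = 0.0
    for x in samples:
        acc += x
        if acc >= 0:
            bits.append(1)
            acc -= 1.0   # feed back the +1 that was just emitted
        else:
            bits.append(0)
            acc += 1.0   # feed back the -1 that was just emitted
    return bits


def pdm_to_pcm(bits, decimation=64):
    """Average each block of bits (read as +1/-1) into one PCM sample."""
    pcm = []
    for start in range(0, len(bits) - decimation + 1, decimation):
        block = bits[start:start + decimation]
        pcm.append(sum(2 * b - 1 for b in block) / decimation)
    return pcm


level = 0.5                              # a constant input at half of full scale
bits = pdm_encode([level] * 1024)
pcm = pdm_to_pcm(bits)
print("fraction of 1 bits:", sum(bits) / len(bits))
print("first PCM samples:", pcm[:4])
```

For an input at half of full scale, about three quarters of the bits are 1, and each averaged block comes back close to 0.5. The decimation factor trades output sample rate against resolution, which is one of the knobs behind the bit-depth and quality tuning mentioned above.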

Audio sensors are used for not only capturing and recording content, but also for audio classification and speech recognition (see Figure 2). Here are some of the predominant usage scenarios for single or multiple microphones in mobile, wearable, and IoT systems:

Audio classification: A common IoT use case for microphones is to classify the environment that the sound was captured in. For example, capturing audio every so often from a microphone in a kitchen can provide information on what type of activity is currently going on—idle, dishwashing, water running in the sink, cooking, etc. Researchers are using machine listening techniques such as this to go as far as to disambiguate the different noises in the background. For example, audio captured from a phone could provide information about not only the foreground human voice but also what is happening in the background.
Voice Activity detection: Another common use case is voice activity detection. Here the focus is on attempting to determine whether there is voice in the captured audio. This is helpful when a phone or other device is completely off except for the microphone that is capturing audio at low rates. Once there is voice activity in the audio, then the audio subsystem is powered on and more processing (as described below) is done.
Speaker recognition: Speaker recognition, sometimes also referred to as voice recognition, attempts to determine who is speaking. This can be useful to identify the speaker in an audio transcript or identify the speaker as part of an authentication approach. When used as part of an authentication approach, it is important to differentiate between speaker identification (identifying one amongst multiple speakers) and speaker verification (determining that a specific speaker whose signature has been captured before was the one who spoke).
Keyword recognition: Keyword recognition is probably the simplest form of speech recognition, where the focus is to determine whether a particular word was uttered. Keyword recognition can be speaker-dependent (trained for a particular speaker) or speaker-independent (generally applicable to all speakers). Keyword recognition can also be generalized to keyphrase recognition, and both are typically used as triggers for additional activity such as starting a session of commands or bringing up an application.
Command and Control: Command and control refers to using a small set of phrases in speech recognition. For illustration, this could include a set of commands to control a toy car such as “move forward,” “move backward,” “go faster,” “go slower,” “turn right/left,” etc.
Large vocabulary continuous speech recognition (LVCSR): LVCSR and CSR in general refer to the continuous recognition of speech as it is fed into the speech recognition system. This typically involves a moderate to large vocabulary that is recognized. Of all of the above, this is the most challenging and computationally complex speech recognition problem and there have been a number of advances in this area recently enabling reduced error rates and better general usage.
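As a rough illustration of the voice activity detection scenario above, the sketch below flags frames whose short-time energy exceeds a threshold. Energy thresholding is only a simple baseline assumed here for clarity; practical detectors generally rely on richer features or trained models. The frame length, threshold, and synthetic signal (a tone standing in for speech) are all hypothetical.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def detect_voice(samples, frame_len=160, threshold=0.01):
    """Return one flag per frame: True where the energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(samples, frame_len)]


# Synthetic signal at 8 kHz: 40 ms of near-silence, then 40 ms of a louder tone.
rate = 8000
quiet = [0.001 * math.sin(2 * math.pi * 50 * n / rate) for n in range(320)]
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(320)]
flags = detect_voice(quiet + tone)
print(flags)
```

In a low-power design, a check this cheap could run continuously while the rest of the audio subsystem stays off, and the first True flag would be the cue to wake the heavier recognition stages.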

Figure 2. Audio sensors and recognition capability examples
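The difference between speaker identification and speaker verification can be shown with a small sketch. It assumes each speaker has already been reduced to a fixed-length feature vector (a "voiceprint"); how such vectors are computed is outside the scope of this example, and the vectors, names, and threshold below are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(sample, enrolled):
    """Identification: pick the closest of several enrolled speakers."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


def verify(sample, claimed_voiceprint, threshold=0.9):
    """Verification: accept or reject a single claimed identity."""
    return cosine(sample, claimed_voiceprint) >= threshold


# Hypothetical enrolled voiceprints and a new utterance to test.
enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(identify(sample, enrolled))
print(verify(sample, enrolled["alice"]))
print(verify(sample, enrolled["bob"]))
```

Identification always returns some enrolled speaker, even for an unknown voice, whereas verification can reject. That is why authentication flows usually depend on verification with a carefully chosen threshold.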

Building on the above techniques, more advanced capabilities that may be of interest to the reader include natural language processing and language translation. These capabilities are now emerging in different market solutions.

Visual Sensors

The most common visual sensor is the camera. Cameras are well understood since they have been around for ages for photographic purposes. In this book, we focus on their use as a visual sensor to recognize what a device can automatically see, understand and act upon.

Unlike inertial and audio data, visual data introduces a spatial dimension, since it can be 2D or 3D in nature. A camera can be used for instantaneous 2D capture (a still image) without any temporal information, or for video capture with rich temporal data spanning multiple frames over time. Generally, data can be captured at anything from extremely low fidelity (e.g., QVGA, which is 320x240, as a still image or at a few frames per second) to high fidelity (e.g., HD and 4K at 30 or more frames per second). The capture rate depends on the usage model, which needs to consider whether human consumption (as in replay) is the key requirement or whether machine recognition of some visual aspect is sufficient.
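A quick calculation shows how far apart these fidelity levels are in raw data volume. The sketch below assumes uncompressed frames at 3 bytes per pixel, takes "a few frames per second" to mean 5, and takes "4K" to mean 3840x2160; all three are assumptions for illustration.

```python
def raw_rate_bytes_per_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps


qvga = raw_rate_bytes_per_s(320, 240, fps=5)     # low-fidelity capture
uhd = raw_rate_bytes_per_s(3840, 2160, fps=30)   # 4K capture

print(f"QVGA at 5 fps: {qvga / 1e6:.2f} MB/s")
print(f"4K at 30 fps:  {uhd / 1e6:.2f} MB/s")
print(f"ratio:         {uhd // qvga}x")
```

Under these assumptions the 4K stream carries several hundred times more raw data than the QVGA stream, which is why a capture meant only for machine recognition is often run at much lower resolution and frame rate than one meant for human viewing.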

Visual recognition can be used for many different purposes ranging from object recognition, face recognition and scene recognition to similarity/anomaly detection, understanding motion or scale, and video summarization. Some examples of these are provided in Figure 3 and are listed below for illustration:

Object Recognition: Object recognition refers to the basic idea of identifying objects in an image and potentially matching them to a pre-existing database of objects that have been captured before. For example, an augmented reality application can identify a monument or tourist attraction in an image and provide additional information about this object to the viewer. Similarly, an object in a retail store can be recognized and additional information about health, price, and content can be provided to the user.
Face recognition: Face recognition includes detection of a face in an image as well as matching that face against a database to label the face accordingly. Face detection is useful by itself for digital photography to help the user take a better picture. Face recognition is useful for many purposes ranging from authentication (logging into a platform) to social networking applications (such as Facebook).
Gesture recognition: Gesture recognition refers to the recognition of static poses or moving gestures, either specific to the hand/arm or involving the whole human body. Recent game consoles commonly provide examples of this, where a player uses their hands and entire body to interact with the game. Static poses are easier to recognize than dynamic gestures, since they generally involve processing a still image and matching it against a pre-existing set of captured still gestures. Gesture recognition can also be user-dependent or user-independent, where the former requires the system to be trained for a specific user whereas the latter builds in enough modeling to accommodate any arbitrary user.
Scene recognition: Scene recognition is extremely complex and an ongoing research problem. The simplest form of scene recognition is to take an entire scene and match it against known ones. A moderate form of scene recognition involves identifying multiple objects, faces, and people in an image and using that information to determine the likely activity or context. The more complex form of scene recognition requires the system to differentiate two scenes accurately despite similar objects being in the same two scenes.
Similarity/Anomaly Detection: Anomaly detection is a common challenge in visual recognition especially in scenarios where cameras are used for surveillance, including home monitoring as well as traffic monitoring. Here, the key is to identify if any anomaly occurred which should trigger additional analysis. Such solutions focus on identifying a set of known signatures statically or dynamically and thereby determining if any significant changes in the frame have occurred since.
Video Summarization: Video summarization is a meta application that can use many of the above techniques in order to summarize the salient aspects of a long video stream. This includes scene changes, key scenarios and objects/characters that are the focus of the video. Video summarization enables the user to jump to specific parts of a video or quickly identify which video is being looked for amongst a set of existing videos.
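The anomaly detection idea above, comparing the current frame against a known signature and flagging significant change, can be sketched with simple frame differencing. This is one basic approach assumed for illustration, not a description of any specific surveillance product; the tiny 8x8 grayscale frames, the threshold, and the function names are hypothetical.

```python
def mean_abs_diff(frame, reference):
    """Average absolute per-pixel difference between two grayscale frames."""
    total = 0
    count = 0
    for row, ref_row in zip(frame, reference):
        for pixel, ref_pixel in zip(row, ref_row):
            total += abs(pixel - ref_pixel)
            count += 1
    return total / count


def is_anomalous(frame, reference, threshold=10.0):
    """Flag the frame when it differs from the reference by more than the threshold."""
    return mean_abs_diff(frame, reference) > threshold


# Reference signature: an empty, uniformly lit scene.
reference = [[100] * 8 for _ in range(8)]

# Same scene with slight lighting or sensor noise.
same_scene = [[101] * 8 for _ in range(8)]

# Same scene with a bright 4x4 object in the middle.
intruder = [row[:] for row in reference]
for r in range(2, 6):
    for c in range(2, 6):
        intruder[r][c] = 255

print(is_anomalous(same_scene, reference))
print(is_anomalous(intruder, reference))
```

A positive result here would only be a trigger: the frame could then be passed to the more expensive object, face, or scene recognition stages described in the list above.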

Figure 3. Visual sensors and usages

It should also be noted that the inertial, audio, and video sensors and recognition techniques can be used in conjunction for multi-modal recognition.

Pick up Omesh Tickoo's book, Making Sense of Sensors, to learn more about each of these recognition techniques independently, with accompanying examples of combining them.

About the Author

Omesh Tickoo is a research manager at Intel Labs. His team is currently active in the area of knowledge extraction from multi-modal sensor data centered around vision and speech. In his research career Omesh has made contributions to computer systems evolution in areas like wireless networks, platform partitioning and QoS, SoC architecture, Virtualization and Machine Learning. Omesh is also very active in fostering academic research with contributions toward organizing conferences, academic project mentoring and joint industry-academic research projects. He has authored 30+ conference and journal papers and filed 15+ patent applications. Omesh received his PhD from Rensselaer Polytechnic Institute in 2015.

This article is excerpted from Omesh Tickoo's book Making Sense of Sensors (ISBN 9781430265924).