This paper describes a new method that enables a service robot to understand spoken commands in a robust manner using off-the-shelf automatic speech recognition (ASR) systems and an encoder-decoder neural network with noise injection. In numerous instances, the understanding of spoken commands in the area of service robotics is modeled as a mapping of speech signals to a sequence of commands that can be understood and performed by a robot. In a conventional approach, speech signals are recognized, and semantic parsing is applied to infer the command sequence from the utterance. However, if errors occur during the process of speech recognition, a conventional semantic parsing method cannot be appropriately applied because most natural language processing methods do not recognize such errors. We propose the use of encoder-decoder neural networks, e.g., sequence to sequence, with noise injection. The noise is injected into phoneme sequences during the training phase of encoder-decoder neural network-based semantic parsing systems. We demonstrate that the use of neural networks with noise injection can mitigate the negative effects of speech recognition errors in understanding robot-directed speech commands, i.e., increase the performance of semantic parsing. We implemented the method and evaluated it using the commands given during a general purpose service robot (GPSR) task, such as a task applied in RoboCup@Home, which is a standard competition for the testing of service robots. The results of the experiment show that the proposed method, namely, sequence to sequence with noise injection (Seq2Seq-NI), outperforms the baseline methods. In addition, Seq2Seq-NI enables a robot to understand a spoken command even when the speech recognition by an off-the-shelf ASR system contains recognition errors. Moreover, in this paper we describe an experiment conducted to evaluate the influence of the injected noise and provide a discussion of the results. Speech recognition errors are significant in practical tasks provided by service robots.
In numerous types of human-robot interactions, it is assumed that the human user will initiate an interaction by giving a spoken command to a service robot at home, in an office, or in a factory. Many studies in the area of robotics and natural language processing (NLP) (Thomason et al., 2015; Misra et al., 2016; Xu et al., 2017) have been conducted to enable a robot to understand the linguistic commands given by human users. The spoken commands given by a human user are conventionally recognized and understood by a robot in the following manner: First, the robot recognizes a sentence spoken by a human user by applying an automatic speech recognition (ASR) system such as Google Cloud Speech-to-Text API, CMU Sphinx, or Julius. Next, the robot applies syntactic and semantic parsing and determines the sequence of commands that it is expected to carry out. The former part corresponds to the ASR task, and the latter corresponds to the NLP task. The syntactic and semantic parsing for service robots involves a mapping of a recognized sentence to a sequence of commands that is written in an artificial language that can be understood and carried out by the robots (Poon, 2013). An overview of this process is described in Figure 1.
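To make this mapping concrete, the following is a small hypothetical example (not drawn from the paper) of how a recognized sentence might be translated into such an artificial command language; the command names and argument structure are assumptions used only for illustration.

```python
# Hypothetical illustration of the sentence-to-command mapping performed by
# semantic parsing. The command names and argument structure below are
# assumptions made for this example, not the paper's actual command language.

recognized_sentence = "go to the kitchen and bring me the cup"

# A semantic parser maps the recognized sentence to a sequence of symbolic
# commands that the robot can execute directly.
command_sequence = [
    ("move", "kitchen"),     # navigate to the kitchen
    ("grasp", "cup"),        # pick up the cup
    ("move", "operator"),    # return to the person who gave the command
    ("hand_over", "cup"),    # deliver the cup
]

for action, argument in command_sequence:
    print(f"{action}({argument})")
```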
A practical and critical issue in this area is the inevitable occurrence of errors in the results of the speech recognition obtained by the ASR systems; although significant progress has been made in this field and the performance of such systems has improved considerably, speech recognition errors cannot be completely eliminated. By contrast, conventional studies in the area of NLP have tended to ignore the existence of speech recognition errors. Most methods of semantic parsing in NLP do not have the capability to resolve recognition errors in a sentence, and thus, a robot's understanding of a spoken command may be constrained. The understanding of robot-directed speech commands decreases further with an increase in the number of speech recognition errors.
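The following sketch illustrates the kind of noise injection described above: random phoneme-level substitutions, insertions, and deletions are applied to the training inputs of the encoder-decoder semantic parser. The uniform error model, phoneme inventory, and noise rate are assumptions made for this illustration; the exact noise distribution used by Seq2Seq-NI is not specified in this section.

```python
import random

# A minimal sketch of phoneme-level noise injection for data augmentation,
# assuming a uniform error model (substitution, insertion, or deletion with
# equal probability). The phoneme inventory and the noise rate are
# illustrative assumptions, not the configuration used in the paper.
PHONEMES = ["aa", "ae", "ah", "b", "ch", "d", "dh", "eh", "g",
            "ih", "k", "n", "ow", "t", "uw"]

def inject_noise(phonemes, noise_rate=0.1, rng=random):
    """Return a copy of `phonemes` with random ASR-like errors injected."""
    noisy = []
    for p in phonemes:
        if rng.random() < noise_rate:
            edit = rng.choice(["substitute", "insert", "delete"])
            if edit == "substitute":
                noisy.append(rng.choice(PHONEMES))
            elif edit == "insert":
                noisy.append(p)
                noisy.append(rng.choice(PHONEMES))
            # "delete": the phoneme is simply dropped
        else:
            noisy.append(p)
    return noisy

# During training, each (phoneme sequence, command sequence) pair would be
# perturbed on the fly so the encoder-decoder parser sees simulated
# recognition errors while the target command sequence stays unchanged.
clean = ["g", "ow", "t", "uw", "dh", "ah", "k", "ih", "ch", "ah", "n"]  # "go to the kitchen"
print(inject_noise(clean, noise_rate=0.2))
```

Because only the encoder inputs are perturbed while the target command sequences are left intact, a parser trained in this way learns to recover the intended command even from phoneme sequences that contain recognition errors.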