When we hear about artificial intelligence and machine learning, we imagine a computer that can think and learn on its own, almost like a human being.
We perceive the world through a combination of senses such as sight and hearing, but machines interpret the world through data processed by algorithms. When a machine "sees" a photo, it translates it into a set of numerical features that it uses to accomplish its task. If the task becomes more complicated, for example when sound accompanies the image, the AI can get lost.
"The main problem here is, how can a machine reconcile these different modalities? It's easy for us humans. We see a car and then we hear the sound of a passing car, and we know it's the same thing. But it's not so easy for machine learning," said Alexander Liu, a graduate student at the Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of a paper on the problem.
Liu and his colleagues have developed a machine learning method that represents data in a way that captures concepts shared between the visual and audio modalities. For example, their method can learn that the action of a child crying in a video is related to the "crying" in an audio clip. Using this knowledge, their machine learning model can identify where an action occurs in a video and label it.
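The core idea, a shared embedding space in which related video and audio clips end up close together, can be sketched in a few lines. This is a toy illustration only: the feature vectors below are made up, and in the actual method the mapping into the shared space would be learned (for example, with a contrastive objective), not hand-specified.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarity."""
    return v / np.linalg.norm(v)

# Hypothetical features already mapped into a shared embedding space.
# In a real system these would come from learned video and audio encoders.
video_embeddings = {
    "child_crying": np.array([0.9, 0.1, 0.0]),
    "car_passing":  np.array([0.0, 0.2, 0.9]),
}
audio_embeddings = {
    "crying_sound": np.array([0.8, 0.2, 0.1]),
    "engine_sound": np.array([0.1, 0.1, 0.95]),
}

def closest_audio(video_key):
    """Return the audio clip whose embedding is most similar (cosine) to the video."""
    v = normalize(video_embeddings[video_key])
    scores = {name: float(v @ normalize(a)) for name, a in audio_embeddings.items()}
    return max(scores, key=scores.get)

print(closest_audio("child_crying"))  # crying_sound
print(closest_audio("car_passing"))   # engine_sound
```

Because both modalities live in the same space, the same similarity test works in either direction: matching a sound to a video works just like matching a video to a sound.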
This method may one day help robots learn about concepts in the world by perceiving it more the way humans do. The model still has limitations that the researchers hope to overcome in future work. So far, their research has focused on data from two modalities at a time, but in the real world people are confronted with many data streams at once and process them without thinking. Machines have yet to learn this.