The ultimate goal of AI development is a system that can mimic the way humans think, learn, and make decisions. We humans learn automatically from the world around us. If AI can do the same, it will open the door to increasingly advanced innovations.
AI has recently come closer to that goal. Several tech companies and research labs have introduced technologies that enable AI to learn from video. Videos uploaded to social media capture events happening around the world, so learning from them lets AI learn by observation, much as humans do.
Today, we would like to introduce you to four technologies that attempt to give AI the ability to learn from video, using either datasets or self-supervised learning techniques.
The Moments in Time dataset by MIT-IBM Watson AI Lab
Datasets that teach AI to recognise actions in videos already exist, but models trained on them only understand each action as a whole. They can't explain the sub-actions that make it up. For humans, deconstructing an action is a piece of cake, but for AI it has been a big challenge. AI knows what a high jump is but doesn't understand that a high jump consists of basic actions: running, jumping, arching, falling, and landing. Training on video snippets labeled with basic actions, including actions identified by sound such as clapping, allows the development of multi-modal models that help AI understand these sub-actions.
Models trained on the Moments in Time dataset can also recognise the same action across several different environments. For example, opening a door, opening a curtain, and a dog opening its mouth are all categorised as ‘opening’ because they share the same kind of temporal-spatial transformation.
Symbol-Concept Association Network (SCAN)
In 2017, DeepMind trained AI to learn from videos on its own, without any human-labeled data. The inputs are still frames from the videos and 1-second audio clips taken from the same point in the video as each still. The algorithm consists of three separate neural networks: one for recognising images, one for recognising sounds, and one for matching images to sounds. When the model encounters a picture of a similar action, it pairs it with what it has learned.
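To make the idea of learning without human labels concrete, here is a minimal Python sketch of how training pairs for the image-and-sound matching task could be built. It is an illustration under assumptions, not DeepMind's code: the function name `make_correspondence_pairs`, the string placeholders standing in for frames and audio, and the choice to draw mismatched audio at random from another clip are all hypothetical.

```python
import random


def make_correspondence_pairs(clips, seed=0):
    """Build (frame, audio, label) training triples without human labels.

    `clips` is a list of (frame, audio) tuples, where the still frame and
    the 1-second audio clip come from the same point in a video.
    Label 1 means the frame and audio belong together, 0 means they do not.
    """
    if len(clips) < 2:
        raise ValueError("need at least two clips to form mismatched pairs")
    rng = random.Random(seed)
    pairs = []
    for i, (frame, audio) in enumerate(clips):
        # Positive example: frame and audio taken from the same moment.
        pairs.append((frame, audio, 1))
        # Negative example (assumption): audio borrowed from a different clip.
        j = rng.choice([k for k in range(len(clips)) if k != i])
        pairs.append((frame, clips[j][1], 0))
    return pairs


# Placeholder strings stand in for real image and audio data.
clips = [("frame_guitar", "audio_guitar"),
         ("frame_dog", "audio_dog"),
         ("frame_rain", "audio_rain")]
pairs = make_correspondence_pairs(clips)
```

The labels come for free from the videos themselves, which is what makes the approach self-supervised. A matching network would then be trained to predict the label of each pair.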
DeepMind also developed a neural network called SCAN. The system can learn a new concept and combine it with something familiar. For example, the system does not recognise apples by memorising a picture and comparing it with other images. Instead, it learns the typical size, shape, and color of apples. When it sees a photo of apples that differs from the ones it has seen before, it can still recognise them.
CLEVRER and NS-DR System by IBM
In the past, if you showed AI a video of a man hitting a ball with a bat and asked what would happen if he missed the ball, or which direction the ball would go, AI would not be able to answer. It could only recognise the objects and knew nothing about motion, gravity, or impact.
Researchers from IBM, MIT, Harvard, and DeepMind have introduced a new dataset called CLEVRER and a hybrid AI system, NS-DR. CLEVRER consists of videos of objects moving and colliding, paired with questions about them. To answer those questions, an AI agent needs the ability to recognise objects and events, model the dynamics and causal relations between them, and understand the symbolic logic behind the questions.
The researchers developed NS-DR because it requires less data than other models and suits CLEVRER’s limited, controlled environment. NS-DR combines a neural network with symbolic AI, the old-fashioned approach to AI built on symbolic reasoning. It draws on the strengths of both approaches and offsets the weaknesses of each.
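The division of labour in such a hybrid system can be pictured with a small Python sketch. This is not NS-DR itself: the object list and collision events below are written by hand as a stand-in for what a neural perception module might extract from a video, and the function names (`filter_objects`, `collision_partners`, `what_collides_with`) are hypothetical. The symbolic part then answers a question by running explicit, readable steps over that structured data.

```python
# Hand-written stand-in for the output of a neural perception module.
objects = [
    {"id": 0, "color": "red", "shape": "cube"},
    {"id": 1, "color": "blue", "shape": "sphere"},
    {"id": 2, "color": "green", "shape": "cylinder"},
]
collisions = [
    {"frame": 40, "pair": (0, 1)},
    {"frame": 85, "pair": (1, 2)},
]


def filter_objects(objs, **attrs):
    """Symbolic step: keep objects whose attributes match, e.g. color='red'."""
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]


def collision_partners(obj_id, events):
    """Symbolic step: ids of objects that collide with `obj_id`."""
    partners = []
    for event in events:
        a, b = event["pair"]
        if a == obj_id:
            partners.append(b)
        elif b == obj_id:
            partners.append(a)
    return partners


def what_collides_with(objs, events, **attrs):
    """Answer 'What collides with the <attrs> object?' step by step."""
    targets = filter_objects(objs, **attrs)
    if len(targets) != 1:
        return None  # The question does not refer to exactly one object.
    by_id = {o["id"]: o for o in objs}
    partners = collision_partners(targets[0]["id"], events)
    return [f'{by_id[p]["color"]} {by_id[p]["shape"]}' for p in partners]


answer = what_collides_with(objects, collisions, color="blue")
```

Because each step is an explicit operation on symbols, the reasoning chain can be inspected, and the logic needs no training data of its own. Only the perception stage has to be learned.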
Self-Supervised Learning from Videos by Facebook
Facebook has recently launched a new project called 'Learning from Videos', which builds self-supervised AI that learns automatically from videos uploaded publicly to its platform. This approach removes the need for human-labeled data and speeds up the training process. Because videos uploaded by Facebook users are culturally diverse, training AI on them should result in adaptive AI that keeps up with a fast-paced world.
Facebook applied this technology to Instagram Stories by adding the Auto Captions feature, which automatically generates subtitles for videos. It also used the technology in the recommendation system for Instagram Reels (the feature that lets users create short, creative video clips, similar to TikTok) to find better-matched videos for users.
Another feature Facebook has planned is Digital Memories, which lets users find videos using only a keyword phrase such as ‘Birthday Party’. The AI will go through every type of data and match ‘Birthday Party’ to people singing Happy Birthday, cakes, candles, and so on. Digital Memories is designed to feature in smart eyeglasses, another big project intended to help people capture and revisit memories from their own point of view.
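One simple way to picture this kind of keyword matching is the Python sketch below. It is purely illustrative and says nothing about how Facebook's system works: the concept table, the video names, and the `search_videos` function are all hypothetical, and a real system would learn these associations from audio, visuals, and text rather than look them up in a hand-written table.

```python
# Hypothetical mapping from a search phrase to related concepts.
RELATED_CONCEPTS = {
    "birthday party": {"happy birthday song", "cake", "candles"},
}

# Hypothetical per-video concepts, as if detected from audio and visuals.
VIDEOS = {
    "clip_001": {"cake", "candles", "happy birthday song"},
    "clip_002": {"beach", "waves"},
    "clip_003": {"cake", "kitchen"},
}


def search_videos(query, videos=VIDEOS, related=RELATED_CONCEPTS):
    """Rank videos by how many query-related concepts they contain."""
    wanted = related.get(query.lower(), set())
    scored = [(len(wanted & concepts), name)
              for name, concepts in videos.items()]
    # Keep only videos with at least one match, best match first.
    hits = [(score, name) for score, name in scored if score > 0]
    return [name for score, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


results = search_videos("Birthday Party")
```

Here the clip containing the song, the cake, and the candles ranks first, while a clip that only shows a cake ranks lower and the beach clip is left out.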