Multimodal AI Modeling is the Future, But It’s Also Pandora’s Box

What’s it going to take to create an AI that can understand? We keep trying to make computers that act like brains, but we don’t have an easy path to building computers that can understand things like brains do. And what happens if we get what we asked for — intelligent systems that are smarter and faster than we are?

One key goal of neural networks and AGI (artificial general intelligence) is to mimic the fluent, responsive functions of the human brain to process complex information in real time. We want the computer to understand what we want it to do. Right now, neural nets like GPT-3 and DALL-E can respond to natural-language queries and produce human-quality sentences, images, and even snippets of intelligible computer code. But they lack awareness. They don’t do well with subtext. They struggle to understand.

Understanding is, ironically, not all that well understood. To understand what’s going on around us, we often need to be aware of direct sensory input from all our senses, as well as remembered context, and we need to know what information to filter out. (The human body does the filtering part very well — so well, in fact, that until right this moment you probably weren’t consciously aware of your posture, or your tongue.) Scientists have made huge leaps in analyzing the connectomes of humans and other species. In the process we’ve learned that the brain integrates many different streams of information at once, all the while comparing them to memories already stored.

The discipline of building or teaching a neural net to have this kind of multifaceted awareness is called multimodal modeling. Tools like DALL-E are designed to generate images based on text descriptions, while CLIP (Contrastive Language-Image Pre-training) is intended to associate text and images more robustly than current AI models. Both are built by OpenAI, which writes:

Although deep learning has revolutionized computer vision, current approaches have several major problems: typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts; standard vision models are good at one task and one task only, and require significant effort to adapt to a new task; and models that perform well on benchmarks have disappointingly poor performance on stress tests.

Using these sophisticated models gives us unprecedented analytical and creative power. They’re in use in fields from medicine to architecture, retail, finance, law enforcement and more. But in the wrong hands, a multimodal neural net can create a lot of problems. Imagine an AI that can create deepfake videos with falsified metadata, while watching out for patterns it’s learned people will detect. What kind of tools might it take to counter such an adversary? But then, any power you award to yourself, you give to your enemy. It’s not a question of whether AI will be weaponized; it’s just when and how. It’s all the same ancient arms race.

I joke a lot about the machine uprising, because we’re both closer to it and much, much further away than it looks. Today’s AIs represent a glimmer of awareness, but like the brain cells in a Petri dish that recently made headlines for sort of playing Pong, the scope of their abilities is narrow and tightly constrained. Cells in a Petri dish aren’t conscious, let alone about to wrest control from humans. Similarly, the most powerful AIs of today are still, fundamentally, correlation engines rather than a robust, humanlike “intelligence.” The long-term target is called artificial general intelligence, and we’re just not there yet.

Humans are, debatably, the smartest creatures on the planet, so our kind of intelligence is the one we’re using as a yardstick. Human intelligence, though, is notoriously prone to bias, and the machine intelligences we’ve developed often reflect that bias in their work. Human bias is the reason facial recognition software designed for Silicon Valley struggles with faces of color. Human bias, not inherent machine hostility, is the reason predictive policing can so easily replicate historic patterns of discrimination, even when the explicit goal is to end them. In pop culture, the worst of the hostile AIs know just how humans will act, and how to outsmart us using our own tricks and biases against us.

Values matter to this bias problem, in a concrete sense. It’s such a low-tech part of the solution to a high-tech problem. Deepfakes are here, and to get software to identify deepfakes reliably, we are going to have to teach computers how to identify and understand us better — and there’s no putting that back in the box. As AI progresses, we’re going to leave our comfort zone over and over. It’s an inevitable consequence of making tools with near-human capabilities. What matters is how our values are reflected in the intelligent systems we create. Maybe we don’t need to worry about a robot uprising. Maybe the dystopia is coming from inside the house.
