Computer vision and robotics: Teaching machines to see and act 

Marc Blanchon
Jul 10, 2025

Robotics and computer vision are two complex fields that have existed for decades. Yet in the past ten years, things have shifted – and continue to evolve rapidly.

Robotics, once limited to basic automation and repeatable motions in isolated environments, is now expanding to address broader challenges. Traditional industrial robots operated at a safe distance, executing predefined tasks in static environments. 

Meanwhile, computer vision, once fragmented into subdomains like image processing, geometry, and optics, has undergone a transformation. The rise of artificial intelligence has unified these domains and propelled computer vision to the forefront of innovation. 

Today, a new convergence is taking shape — one that merges perception, reasoning, and physical action into integrated systems. This is the promise of Physical AI: the ability for machines not only to process information intelligently, but to act upon it in the real world. And at the heart of this evolution lies the rise of Vision-Language-Action (VLA) models — architectures that combine what a robot sees, what it understands through language, and how it decides to move or manipulate its environment accordingly. 

We’re already seeing early signs of this shift. For example, new-generation robots can now interpret a voice command like “pick up the red cable next to the panel,” visually locate the object in context, and perform the action — all thanks to VLA architectures that connect perception to natural language and motor execution. 
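To make that flow concrete, here is a minimal, purely illustrative sketch in Python of how such a pipeline hangs together. None of this is a real VLA model or vendor API: the Detection class, ground_instruction, and plan_pick functions are hypothetical stand-ins, with trivial string matching where a real system would use learned visual-language grounding.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical, simplified structures for illustration only.
@dataclass
class Detection:
    label: str                             # e.g. "red cable"
    position: tuple[float, float, float]   # (x, y, z) in the robot's workspace frame

def ground_instruction(instruction: str, detections: list[Detection]) -> Detection | None:
    """Toy language grounding: return the detection whose label best matches the instruction.
    A real VLA model grounds language jointly over pixels and tokens, not with string matching."""
    matches = [d for d in detections if d.label in instruction.lower()]
    return max(matches, key=lambda d: len(d.label), default=None)

def plan_pick(target: Detection) -> list[str]:
    """Stand-in for motion planning: emit a symbolic action sequence instead of joint commands."""
    return [f"move_to{target.position}", "close_gripper", "lift"]

# Toy perception output, as an object detector might produce it.
scene = [Detection("panel", (0.40, 0.10, 0.0)), Detection("red cable", (0.42, 0.18, 0.0))]
target = ground_instruction("Pick up the red cable next to the panel", scene)
if target is not None:
    print(plan_pick(target))  # ['move_to(0.42, 0.18, 0.0)', 'close_gripper', 'lift']
```

The interesting part is not any single step but the coupling: the same representation links what is heard, what is seen, and what is finally executed.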

In industrial settings, robots once confined to repetitive welding behind safety cages are now operating side by side with humans — navigating busy factory floors, identifying parts, adapting to shifting workflows, and contributing dynamically to production without the need for constant reprogramming. 

Though often treated as separate disciplines, robotics and vision are deeply intertwined. Today’s robotics is no longer just about repetition; it’s about adaptability in dynamic, unpredictable environments. And what better way to enable intelligent action than through perception? After all, it is commonly estimated that around 80% of the information the human brain processes comes through vision. It’s only logical to equip robots with powerful vision systems if we want them to act meaningfully in the world.

When vision meets movement 

The fusion of sight and motion is redefining how robots interact with the world around them. 

A robot that interacts intelligently and adapts to its environment relies primarily on its ability to perceive, interpret, and understand the world around it. Much like humans reconstruct their environment from limited focal information, vision systems must extract meaning from incomplete, noisy, and ambiguous data. 

In both humans and machines, vision is not passive — it’s an active process of interpretation, selection, and decision-making. And this principle applies directly to robotics. An efficient humanoid robot must incorporate biomimetic principles, enabling it to understand and act upon its surroundings as humans do. 

That’s why giving robots the ability to “see” is not just an enhancement — it’s a requirement for safe navigation, interaction, and decision-making. In collaborative environments, such as modern industrial settings where humans and robots coexist, real-time perception is essential to avoid collisions and adapt to changing conditions. 

We are moving from conventional robotics and siloed vision systems to intelligent robotics powered by integrated perception. Where traditional robots acted blindly within controlled environments, AI-driven robotics must now interpret complex scenes and operate in the real world — fluid, noisy, and often unpredictable. 

Applications across industries 

From factories to farms, vision-powered robots are reshaping work across every sector. 

Thanks to breakthroughs in both robotics and computer vision, it’s increasingly plausible to anticipate radical changes in how we design, manufacture, and operate across countless industries. 

Many tasks that are still carried out manually — repetitive, sometimes non-standard, and often labor-intensive — could be augmented or replaced by intelligent robots. For instance, repetitive part handling is physically demanding and costly. Delegating such tasks to machines allows humans to focus on less exhausting, more meaningful work. 

A more complex case is visual inspection. Today, for each inspection station, there’s a dedicated process — sometimes manual, sometimes automated, often a mix of both. But with computer vision and robotics, we can envision versatile, autonomous visual inspection systems capable of adapting across product types and conditions. 

And these examples extend well beyond quality control: think of hazardous operations, where robotic systems can keep people out of harm’s way, or round-the-clock tasks, where robots can operate continuously without the fatigue that leads to dangerous errors.

From perception to autonomy 

Seeing is just the beginning – true autonomy emerges when machines understand what they see. 

Attaching cameras to a robot and detecting a few objects doesn’t make it autonomous. While the progress in computer vision is undeniable, real autonomy lies in the transition from raw detection to contextual scene understanding. 

Detection allows a system to identify known elements — objects, markers, obstacles — typically in controlled environments. But the real world is rarely so clean. In industrial settings, in cities, or in natural environments, robots face variability, ambiguity, and noise. That’s where true autonomy begins: not just recognizing what’s in front of them, but understanding what it means, how it changes, and what to do about it. 

This shift requires a deeper integration of perception, cognition, and action. For example, in a fulfillment center scenario, a robot must move (as sketched after the list below) from: 

  • Identifying a box to understanding that it’s fragile and just fell off a conveyor belt 
  • Seeing a person to predicting their trajectory and adjusting behavior safely 
  • Detecting a machine to interpreting that it’s idle and requires assistance 
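As a toy illustration of that shift (assuming a working detector already exists, and with entirely hypothetical names), the sketch below keeps the raw labels unchanged and adds a contextual layer that attaches state and priority before any action is chosen:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    label: str                                   # what the detector reports: "box", "person", "machine"
    context: dict = field(default_factory=dict)  # extra cues: motion, history, machine state

def assess(obs: Observation) -> tuple[int, str]:
    """Hypothetical contextual layer: map a detection plus context to (priority, action).
    A lower priority number means more urgent."""
    if obs.label == "person" and obs.context.get("heading_towards_robot"):
        return 0, "slow_down_and_yield"
    if obs.label == "box" and obs.context.get("fell_off_conveyor"):
        return 1, "flag_fragile_item_for_repack"
    if obs.label == "machine" and obs.context.get("state") == "idle":
        return 2, "notify_operator"
    return 9, "continue_task"

scene = [
    Observation("box", {"fell_off_conveyor": True}),
    Observation("person", {"heading_towards_robot": True}),
    Observation("machine", {"state": "idle"}),
]
priority, action = min(assess(o) for o in scene)  # react to the most urgent observation first
print(priority, action)  # 0 slow_down_and_yield
```

The point is not the hand-written rules themselves (a deployed system would learn such behavior) but the structure: the same labels, enriched with context, lead to different actions.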

It’s about reasoning, prioritizing, and reacting in real time, based on complex visual input. And this isn’t just a matter of better algorithms — it requires: 

  • Multi-modal fusion (combining vision with sound, touch, or contextual data; see the sketch after this list) 
  • Learning on the edge (to adapt quickly to new situations without retraining centrally) 
  • Generalization (being able to apply learned behaviors to unseen environments) 
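
To illustrate the first of these points, here is a minimal late-fusion sketch: per-modality embeddings are simply concatenated before a shared decision head. Real systems typically learn the fusion step (with cross-attention, for example); the dimensions, encoders, and action names below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-modality encoders; in practice each would be a trained neural network.
vision_embedding = rng.standard_normal(128)  # e.g. from an image backbone
audio_embedding = rng.standard_normal(32)    # e.g. from a microphone array
touch_embedding = rng.standard_normal(16)    # e.g. from a force/torque sensor

# Late fusion: concatenate modality embeddings into a single feature vector.
fused = np.concatenate([vision_embedding, audio_embedding, touch_embedding])

# A randomly initialised linear layer standing in for the (normally trained) decision head.
weights = rng.standard_normal((3, fused.size))  # 3 hypothetical actions
scores = weights @ fused
action = ["continue", "slow_down", "stop"][int(np.argmax(scores))]
print(fused.shape, action)  # (176,) and whichever of the three actions scores highest
```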

In other words, we move from reactive systems to proactive agents capable of operating in the unknown. This is especially vital in dynamic or high-stakes environments — from co-working with humans on factory floors to exploring disaster zones or navigating crowded streets. 

Autonomy is not binary — it’s a spectrum. And the closer we get to human-like understanding of space, intent, and consequence, the more fluid, intelligent, and reliable robotic behavior becomes. 

Ultimately, perception is the lens, but autonomy is the leap. 

From seeing to thinking and doing: The rise of physical AI 

Perception alone is not enough — intelligent robots must connect vision, language, and action into one seamless cognitive loop. 

A new wave of intelligent robotics is taking shape — one where vision alone isn’t enough. The frontier is now Physical AI: systems that combine what a robot sees, what it understands, and what it does. At the heart of this evolution are Vision-Language-Action (VLA) models, which merge visual perception, natural language understanding, and physical execution into one unified architecture. This enables robots to go beyond detecting objects — they can now follow instructions, understand goals, and adapt their actions accordingly. 

These models open the door to more intuitive, adaptive robotics in factories, hospitals, and homes — creating machines that collaborate, learn, and act in complex environments. While still an emerging field, Physical AI is rapidly becoming the foundation of truly intelligent autonomy. 

Challenges in the loop 

More intelligence means more complexity – and a greater need for safety, ethics, and control. 

With increasing perceptual capabilities come significant challenges. One key issue is robustness: computer vision systems can be vulnerable to variations in lighting, background, and unexpected events. 

There’s also the challenge of trust and explainability. When robots make decisions based on complex visual input, humans must understand why and how those decisions are made — especially in safety-critical environments. 

Additionally, there’s a computational burden: processing high-resolution video streams in real time, running deep models at the edge, and doing so efficiently and sustainably is still an ongoing technical frontier. 

Moreover, and perhaps most importantly from an ethical perspective, we must ask: What tasks should we delegate to machines? How do we ensure that intelligent robots augment human work in responsible ways? 

Shaping the future together 

Empowering the next generation of robots starts with the choices we make today. 

The fusion of computer vision and robotics is one of the most promising frontiers in technological innovation. It offers a glimpse into a future where machines are not just tools but perceptive collaborators. 

To realize this future, organizations must invest not only in algorithms and hardware, but in talent, infrastructure, and governance. It requires cross-disciplinary collaboration — between engineers, ethicists, designers, and decision-makers. 

Those who act now — by embracing intelligent technologies, fostering experimentation, and building trust — will shape the future of robotics not as a distant vision, but as a practical, human-centered reality. 

Meet the author

Marc Blanchon

Computer Vision Specialist
Marc is a computer vision specialist and pre-sales architect at Hybrid Intelligence, Capgemini Engineering. With 9+ years of experience and a Ph.D., he leads technical teams in designing and industrializing AI-driven computer vision solutions across industries. He is passionate about AI and actively contributes to research, offer development, and pre-sales activities to support clients and innovation initiatives.