Search — hardware
Issues
25 matches

- nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
A Jetson AGX Orin report indicates ISP1 fails to power on at boot. This prevents expected camera/ISP functionality from being available after startup.
jetson · agx-orin · isp · boot · camera

- nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
A user reports two cameras on Jetson Orin Nano are not working. No further diagnostic information is included.
jetson · orin-nano · cameras · bringup · sensors

- nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
A Jetson AGX Orin Developer Kit is reported completely dead with no power response. No additional context is provided.
jetson · agx-orin · power · boot · devkit

- nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
A USB microphone reportedly cannot record properly on the Jetson Thor platform. No additional details are provided in the post.
jetson · thor · usb-audio · microphone · multimodal

- github:newton-physics/newton · 5/13/2026 · crashes-stability
A dexterous hand imported via URDF cannot grasp a bottle reliably; the bottle slides and cannot be lifted. The reporter notes the Franka example can lift the same object, implying a hand-specific contact/friction issue.
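If the root cause is contact material, a common first check in Isaac Lab is the friction configuration of the imported asset. A minimal sketch, assuming Isaac Lab's `RigidBodyMaterialCfg` (the values and application point are illustrative, not from the report):

```python
# Sketch: raise friction on the imported hand's contact bodies in Isaac Lab.
# Assumes isaaclab.sim's RigidBodyMaterialCfg; values are illustrative.
import isaaclab.sim as sim_utils

# If the bottle slides out during lift, too-low static/dynamic friction on the
# fingertip collision material is a frequent culprit for imported URDFs.
grip_material = sim_utils.RigidBodyMaterialCfg(
    static_friction=1.5,
    dynamic_friction=1.5,
    restitution=0.0,
)
# Apply this material in the asset's spawn configuration (e.g., the spawner's
# physics-material field) or bind it to the fingertip collision prims.
```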
crash · usd · rendering · hardware · manipulation · isaac-lab · newton · warp

- nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
A user reports random PCIe signals between an SBC and an NVMe device. The post contains no additional diagnostic content.
pcie · nvme · signal-integrity · jetson · carrier-board

- nvidia-forum:isaac-ros · 5/13/2026 · hardware-integration
A user asks about Nova Orin initialization for the Nova Carter Robot in the Isaac ROS forum. The post contains no additional details.
hardware

- nvidia-forum:isaac · 5/13/2026 · hardware-integration
A user asks about Nova Orin initialization for the Nova Carter Robot in the Isaac forum. The post contains no additional details.
hardware

- nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
When using a GMSL camera, image acquisition may fail midway. No additional details are provided.
gmsl · camera · image-capture · jetson · stability

- Jetson Orin NX Super 16GB not powering on after reverse polarity - D65/Q25 suspected (P3768-A04) [Blocker] · nvidia-forum:robotics-edge-computing · 5/13/2026 · hardware-integration
A Jetson Orin NX Super 16GB reportedly does not power on after a reverse-polarity power event, with component damage suspected at D65/Q25. This blocks device operation.
jetson · orin-nx · power · hardware-damage · bringup

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user reports a CAN0 data bitrate issue. No additional details are provided in the post.
canbus · can0 · bitrate · jetson · robotics-io

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user reports an EEPROM error on AGX T5000. No additional context is provided.
eeprom · agx · provisioning · jetson · hardware

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
Image capture from a newly integrated sensor reportedly fails on Jetson Linux 36.3.0. This indicates a capture-pipeline problem on that release.
jetson-linux · 36.3.0 · camera · image-capture · regression

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user reports a Jetson Orin Nano booting issue. No further details are included.
jetson · orin-nano · boot · stability · bringup

- github:NVIDIA/warp · 5/12/2026 · hardware-integration
Warp plans work to support GPU kernel launches with hardware-coherent CPU memory, pinned CPU arrays, and peer GPU arrays when directly addressable. The issue emphasizes preserving clear diagnostics for invalid cases.
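For context, a sketch of the usage the issue would enable: launching a CUDA kernel directly over a pinned host array. This illustrates the proposed behavior, not current Warp semantics:

```python
# Sketch of the usage the issue targets; hedged: current Warp versions
# generally expect a device-side copy rather than launching over host memory.
import warp as wp

wp.init()

@wp.kernel
def scale(a: wp.array(dtype=float), s: float):
    i = wp.tid()
    a[i] = a[i] * s

# Pinned (page-locked) host allocation; today typically a staging buffer.
a = wp.zeros(1024, dtype=float, device="cpu", pinned=True)

# Proposed: allow this launch when the GPU can address the host memory
# directly, and fail with a clear diagnostic in the invalid cases.
wp.launch(scale, dim=a.shape[0], inputs=[a, 2.0], device="cuda:0")
```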
hardware · docs · warp

- Orin Nano Super Dev Kit - MSS SDRAM init failure (err 0x48480112) - module previously functional [Blocker] · nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
Orin Nano Super Dev Kit shows an MSS SDRAM init failure (err 0x48480112) though the module was previously functional. This prevents the system from booting normally.
jetson · orin-nano · sdram · boot · init-failure

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user requests DRAM supplier consistency (Hynix) for Orin NX 16GB. This indicates concerns about BOM variability impacting deployments.
jetson · orin-nx · dram · supply-chain · fleet

- Carrier-Board PCIe [Friction] · nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user asks about carrier-board PCIe. The post has no additional details.
carrier-board · pcie · jetson · hardware-design · integration

- MGBE to ethernet RJ45 [Friction] · nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user asks about connecting MGBE to an ethernet RJ45. No further information is included.
mgbe · ethernet · rj45 · jetson · carrier-board

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user asks how to display DDR memory information, such as the manufacturer. This is a diagnostics/visibility request.
jetson · ddr · memory · diagnostics · manufacturing

- github:isaac-sim/IsaacLab · 5/12/2026 · asset-pipeline
According to the report, relative texture paths do not work in the IsaacLab Beta, even when the image sits in the same folder as the USD file; loading the asset through IsaacLab code triggers errors.
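Until the loader handles this, one workaround is to rewrite relative texture references to absolute paths before loading. A minimal sketch using the pxr USD API; the `inputs:file` attribute follows the usual UsdUVTexture convention, and the file name is hypothetical:

```python
import os
from pxr import Sdf, Usd, UsdShade

stage = Usd.Stage.Open("my_asset.usd")  # hypothetical file
root = os.path.dirname(stage.GetRootLayer().realPath)

for prim in stage.Traverse():
    shader = UsdShade.Shader(prim)
    if not shader:
        continue
    attr = prim.GetAttribute("inputs:file")  # UsdUVTexture's file input
    if attr and attr.HasAuthoredValue():
        asset = attr.Get()
        if asset and asset.path and not os.path.isabs(asset.path):
            # Anchor the relative path at the USD layer's directory.
            attr.Set(Sdf.AssetPath(os.path.join(root, asset.path)))

stage.GetRootLayer().Save()
```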
usd · rendering · hardware · isaac-sim · isaac-lab

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A USB 3.0 device is recognized as a USB 2.0 device. This reduces bandwidth and can break high-throughput peripherals.
jetson · usb3 · usb2 · link-speed · peripherals

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A Jetson Orin is reported as not exiting recovery mode. This blocks normal boot and device provisioning.
jetson · orin · recovery-mode · flashing · boot

- Orin nx 16G boot failed [Blocker] · nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user reports that an Orin NX 16GB fails to boot. This prevents use of the module.
jetson · orin-nx · boot-failure · bringup · reliability

- nvidia-forum:robotics-edge-computing · 5/12/2026 · hardware-integration
A user reports a first-boot issue with a Jetson Orin Nano Developer Kit. This blocks initial setup.
jetson · orin-nano · devkit · first-boot · setup
Papers
16 matches

- A Prototyping Framework for Distributed Control of Multi-Robot Systems · arXiv:2605.15049 · 5/14/2026 · Junaid Ahmed Memon, Allan Andre Do Nascimento, Kostas Margellos, Antonis Papachristodoulou
This paper presents a prototyping framework for distributed control of multi-robot systems, aimed at bridging theory and practical testing of distributed optimization algorithms. Using the Single Program, Multiple Data (SPMD) paradigm, the framework emulates distributed control on a single computer, with each core running the same algorithm using local states and neighbour-to-neighbour communication. We demonstrate the framework on a four-quadrotor position-swapping task using a non-cooperative game-theoretic distributed algorithm. Computational time and trajectory data are compared across the supported dynamics levels: a point-mass model, a high-fidelity quadrotor model, and an experimental hardware testbed using Crazyflie quadcopters. The results show that the framework provides a low-cost and accessible approach for validating distributed algorithms.
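The SPMD pattern is easy to picture in plain Python: every worker runs the same program and communicates only with its graph neighbours. A toy sketch (not the paper's code) of distributed averaging over a ring of four agents:

```python
# Toy SPMD emulation: one process per robot, neighbour-to-neighbour pipes only.
import multiprocessing as mp

def robot(rank, x0, left, right, steps=50):
    x = x0
    for _ in range(steps):
        left.send(x); right.send(x)                  # share state with neighbours
        x = (x + left.recv() + right.recv()) / 3.0   # consensus-style update
    print(f"robot {rank}: x = {x:.4f}")              # all converge to the mean

if __name__ == "__main__":
    n = 4
    # One duplex pipe per ring edge; pipes[i] links robot i and robot (i+1) % n.
    pipes = [mp.Pipe() for _ in range(n)]
    procs = []
    for i in range(n):
        left_end = pipes[(i - 1) % n][1]   # edge to the left neighbour
        right_end = pipes[i][0]            # edge to the right neighbour
        procs.append(mp.Process(target=robot, args=(i, float(i), left_end, right_end)))
    for p in procs: p.start()
    for p in procs: p.join()
```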
- Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model · arXiv:2605.14950 · 5/14/2026 · Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu …
Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
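The abstract does not spell out the form of the depth-aware modulation; one common reading is FiLM-style modulation, where depth features predict a per-channel scale and shift for the vision-language tokens. A hedged PyTorch sketch of that pattern (names and dimensions are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class DepthAwareModulation(nn.Module):
    """FiLM-style modulation: depth features gate vision-language tokens."""
    def __init__(self, vl_dim: int, depth_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from depth features.
        self.to_gamma_beta = nn.Linear(depth_dim, 2 * vl_dim)

    def forward(self, vl_tokens: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # vl_tokens: (B, N, vl_dim); depth_feat: (B, depth_dim)
        gamma, beta = self.to_gamma_beta(depth_feat).chunk(2, dim=-1)
        return vl_tokens * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# Usage sketch with made-up dimensions:
mod = DepthAwareModulation(vl_dim=768, depth_dim=256)
out = mod(torch.randn(2, 196, 768), torch.randn(2, 256))
```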
deployment · manipulation · perception · vla

- Manipulation Planning for Construction Activities with Repetitive Tasks · arXiv:2605.13754 · 5/13/2026 · Wangyi Liu, Dasharadhan Mahalingam, Fanru Gao, Ci-Jyun Liang …
In this paper, we study the problem of manipulation skill acquisition for performing construction activities consisting of repetitive tasks (e.g., building a wall or installing ceiling tiles). Our approach involves setting up a simulated construction activity in a Virtual Reality (VR) environment, where the user can provide demonstrations of the object manipulation skills needed to perform the construction activity. We then exploit the screw geometry of motion to approximate the demonstrated motion as a sequence of constant screw motions. For performing the construction activity, we generate the sequence of manipulation task instances and then compute the joint space motion plan corresponding to each instance using Screw Linear Interpolation (ScLERP) and Resolved Motion Rate Control (RMRC). We evaluate our framework by executing two representative construction tasks: constructing brick walls and installing multiple ceiling tiles. Each task is performed using only a single demonstration, a pick-and-place action for the bricks, and a single ceiling tile installation. Our experiments with a 7-DoF robot in both simulation and hardware demonstrate that the approach generalizes robustly to arbitrarily long construction activities that involve repetitive motions and demand precision, even when provided with just one demonstration. For instance, we can construct walls of arbitrary layout and length by leveraging a single demonstration of placing one brick on top of another.
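ScLERP itself is compact: the relative rigid transform is a constant screw, recovered with a matrix log and re-applied fractionally. A minimal sketch on homogeneous 4x4 poses (an illustration of the interpolation, not the paper's implementation):

```python
import numpy as np
from scipy.linalg import expm, logm

def sclerp(T0: np.ndarray, T1: np.ndarray, t: float) -> np.ndarray:
    """Screw Linear Interpolation between 4x4 poses T0 and T1, t in [0, 1].

    The relative motion log(T0^-1 @ T1) is a twist; scaling it by t and
    exponentiating traces a constant screw motion from T0 to T1.
    """
    rel = np.linalg.inv(T0) @ T1
    return T0 @ np.real(expm(t * logm(rel)))  # np.real guards numerical residue

# Re-sample one demonstrated motion along its screw (hypothetical target pose):
T0, T1 = np.eye(4), np.eye(4)
T1[:3, 3] = [0.2, 0.0, 0.1]
path = [sclerp(T0, T1, t) for t in np.linspace(0.0, 1.0, 10)]
```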
manipulation

- Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles · arXiv:2605.13751 · 5/13/2026 · Yizhuo Xiao, Haotian Yan, Ying Wang, Zhongpan Zhu …
Establishing trustworthy safety assurance for autonomous driving systems (ADSs) requires evidence that failures arise from avoidable system deficiencies rather than unavoidable traffic conflicts. Current adversarial simulation methods can efficiently expose collisions, but generally lack mechanisms to distinguish these fundamentally different failure modes. Here we present CARS (Context-Aware, Responsibility-attributed Scenario generation), a framework that integrates responsibility attribution directly into adversarial scenario generation. CARS combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation to construct collision scenarios that are both physically feasible and diagnostically attributable. Across benchmark datasets spanning heterogeneous national traffic environments, CARS consistently discovers feasible collision scenarios with high attribution rates under multiple regulation-prescribed careful and competent driver models. By coupling adversarial generation with normative responsibility assessment, CARS moves simulation testing beyond collision discovery toward the construction of interpretable, regulation-aligned safety evidence for scalable ADS validation.
crash · rl · hardware

- Real-Time Whole-Body Teleoperation of a Humanoid Robot Using IMU-Based Motion Capture with Sim2Sim and Sim2Real Validation · arXiv:2605.12347 · 5/12/2026 · Hamza Ahmed Durrani, Suleman Khan
Stable, low-latency whole-body teleoperation of humanoid robots is an open research challenge, complicated by kinematic mismatches between human and robot morphologies, accumulated inertial sensor noise, non-trivial control latency, and persistent sim-to-real transfer gaps. This paper presents a complete real-time whole-body teleoperation system that maps human motion, recorded with a Virdyn IMU-based full-body motion capture suit, directly onto a Unitree G1 humanoid robot. We introduce a custom motion-processing, kinematic retargeting, and control pipeline engineered for continuous, low-latency operation without any offline buffering or learning-based components. The system is first validated in simulation using the MuJoCo physics model of the Unitree G1 (sim2sim), and then deployed without modification on the physical platform (sim2real). Experimental results demonstrate stable, synchronized reproduction of a broad motion repertoire, including walking, standing, sitting, turning, bowing, and coordinated expressive full-body gestures. This work establishes a practical, scalable framework for whole-body humanoid teleoperation using commodity wearable motion capture hardware.
sim2real · locomotion · sensors · mujoco · humanoid · unitree

- A Proprioceptive-Only Benchmark for Quadruped State Estimation: ATE, RPE, and Runtime Trade-offs Between Filters and Smoothers · arXiv:2605.11674 · 5/12/2026 · Ylenia Nisticò, João Carlos Virgolino Soares, Joan Solà, Claudio Semini
We compare three state-of-the-art proprioceptive state estimators for quadruped robots: MUSE [1], the Invariant Extended Kalman Filter (IEKF) [2], and the Invariant Smoother (IS) [3], on the CYN-1 sequence of the GrandTour Dataset [4]. Our goal is to give practitioners clear guidance on accuracy and computation time: we report long-term accuracy (Absolute Trajectory Error, ATE), short-term accuracy (translational and rotational Relative Pose Error, RPE), and per-update computation time on a fixed hardware/software stack. On this dataset, RPEs are broadly similar across methods, while IEKF and IS achieve a lower ATE than MUSE. Runtime results highlight the accuracy-latency trade-offs across the three approaches. In the discussion, we outline the evaluation choices used to ensure a fair comparison and analyze factors that influence short-horizon metrics. Overall, this study provides a concise snapshot of accuracy and cost, helping readers choose an estimator that fits their application constraints, with all evaluation code and documentation released open-source at https://github.com/iit-DLSLab/state_estimation_benchmark for full reproducibility.
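For reference, the two headline metrics in a simplified positions-only form (the usual SE(3) alignment and rotational RPE are omitted for brevity):

```python
import numpy as np

def ate(gt_xyz: np.ndarray, est_xyz: np.ndarray) -> float:
    """Absolute Trajectory Error: RMSE of per-pose position error, inputs (N, 3)."""
    err = np.linalg.norm(gt_xyz - est_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def rpe_trans(gt_xyz: np.ndarray, est_xyz: np.ndarray, delta: int = 10) -> float:
    """Translational Relative Pose Error over a fixed frame offset `delta`."""
    d_gt = gt_xyz[delta:] - gt_xyz[:-delta]       # ground-truth relative motion
    d_est = est_xyz[delta:] - est_xyz[:-delta]    # estimated relative motion
    err = np.linalg.norm(d_gt - d_est, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```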
locomotion · docs

- RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning · arXiv:2605.11564 · 5/12/2026 · Pablo Ortega-Kral, Eliot Xing, Arthur Bucker, Vernon Luk …
Despite recent efforts to collect multi-task, multi-embodiment datasets, to design recipes for training Vision-Language-Action models (VLAs), and to showcase these models on different robot platforms, generalist cross-embodiment robot capabilities remain a largely elusive ideal. Progress is limited by fragmented infrastructure: most robot code is highly specific to the exact setup the user decided on, which adds major overhead when attempting to reuse, recycle, or share artifacts between users. We present RIO (Robot I/O), an open source Python framework that provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across diverse hardware platforms and morphologies. RIO provides abstractions that enable users to make any choice and to switch between them, with minimal reconfiguration effort. We validate RIO on VLA deployment workflows across three morphologies (single-arm, bimanual, humanoid) and four hardware platforms with varying grippers and cameras. Using teleoperated data collected with RIO, we fine-tune state-of-the-art VLAs including $π_{0.5}$ and GR00T on household tasks such as pick-and-place, folding, and bowl scrubbing. By open sourcing all our efforts, we hope the community can accelerate their pace of robot learning on real-world robot hardware. Additional details at: https://robot-i-o.github.io
rl · deployment · manipulation · groot · humanoid · vla

- Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations · arXiv:2605.11485 · 5/12/2026 · Lasse Peters, Laura Ferranti, Javier Alonso-Mora, Andrea Bajcsy
Imitation learning powered by generative models has proven effective for modeling complex single-agent behaviors. However, teaching multi-agent systems, like multiple arms or vehicles, to coordinate through imitation learning is hindered by a fundamental data bottleneck: as the joint state-action space grows exponentially with the number of agents, collecting a sufficient amount of coordinated multi-agent demonstrations becomes extremely costly. In this work, we ask: how can we leverage single-agent demonstration data to learn multi-agent policies? We present Coordinated Diffusion (CoDi), a framework that couples independently trained single-agent diffusion policies through a user-defined multi-agent cost function, without requiring any coordinated demonstrations. We derive a new diffusion-based sampling scheme wherein the diffusion score function decomposes into independent, single-agent pre-trained base policies plus a cost-driven guidance term that coordinates these base policies into cohesive multi-agent behavior. We show that this guidance term can be estimated in a gradient-free manner, making CoDi applicable to black-box, non-differentiable cost functions without additional training. Theoretically and empirically, we analyze the conditions under which this composition can faithfully approximate a target multi-agent behavior. We find a complementary role for demonstration data versus the cost function: single-agent demonstrations must cover the support of the desired multi-agent behavior, while the cost function must promote desired behavior from this product of single-agent policies. Our results in simulation and hardware experiments of a two-arm manipulation task show that CoDi discovers robust coordinated behavior from single-agent data, is more data-efficient than multi-agent baselines, and highlights the importance of joint guidance, base policy support, and cost design.
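The sampling idea can be sketched independently of the paper's details: the joint score stacks the single-agent scores, plus a guidance term from the coordination cost whose gradient is estimated from zeroth-order samples so the cost can stay a black box. Illustrative only; the estimator below is an assumption standing in for CoDi's exact scheme:

```python
import numpy as np

def guided_joint_score(xs, agent_scores, cost, sigma=0.05, k=32, weight=1.0):
    """xs: list of per-agent state vectors; agent_scores: per-agent score fns.

    Returns per-agent score estimates for one denoising step: base
    single-agent scores minus a zeroth-order estimate of grad(cost).
    """
    base = [s(x) for s, x in zip(agent_scores, xs)]
    joint = np.concatenate(xs)
    # Gradient-free (evolution-strategies style) estimate of the cost gradient,
    # usable with black-box, non-differentiable cost functions.
    grad = np.zeros_like(joint)
    c0 = cost(joint)
    for _ in range(k):
        eps = np.random.randn(*joint.shape)
        grad += eps * (cost(joint + sigma * eps) - c0)
    grad /= (k * sigma)
    # Split the guidance back into per-agent pieces and combine.
    pieces = np.split(grad, np.cumsum([len(x) for x in xs])[:-1])
    return [b - weight * g for b, g in zip(base, pieces)]
```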
rl · manipulation · multi-agent

- VRA: Grounding Discrete-Time Joint Acceleration in Voltage-Constrained Actuation · arXiv:2605.10696 · 5/11/2026 · Lingwei Zhang, Jiaming Wang, Tianlin Zhang, Zhitao Song …
Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits. However, under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable constraints. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.
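The underlying constraint is standard DC-motor physics: back-EMF consumes voltage headroom at speed, shrinking the torque (and hence acceleration) that remains realizable. A simplified worked example for a single rigid joint (a textbook motor model, not the paper's formulation):

```python
def accel_bounds(omega, V_max, R, k_t, k_e, J, tau_load=0.0):
    """Acceleration range realizable under a voltage limit.

    Motor model: V = R*i + k_e*omega, tau = k_t*i, tau = J*a + tau_load.
    Solving for a at V = +/- V_max gives the realizable interval.
    """
    a_hi = (k_t * ( V_max - k_e * omega) / R - tau_load) / J
    a_lo = (k_t * (-V_max - k_e * omega) / R - tau_load) / J
    return a_lo, a_hi

# At high speed the upper bound collapses even though a kinematic limit
# might still permit a much larger commanded acceleration:
print(accel_bounds(omega=0.0,   V_max=24.0, R=0.5, k_t=0.1, k_e=0.1, J=0.01))
print(accel_bounds(omega=200.0, V_max=24.0, R=0.5, k_t=0.1, k_e=0.1, J=0.01))
```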
locomotion

- Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation · arXiv:2605.10457 · 5/11/2026 · Rabin Gajmer, Joonas Haapala, Zoltan Beck
Real-time Light Detection And Ranging (LiDAR) simulation must find, per emitted ray, the closest intersecting triangle even in dynamic scenes containing large numbers of moving and deformable objects. Dominant acceleration-structure approaches require rebuilding each frame for dynamic geometry -- a cost that compounds directly with scene dynamics and cannot be amortized regardless of how little actually changed. This paper presents the Gajmer Ray-Casting Algorithm (GRCA), which inverts the question: instead of asking "what does each ray hit?", it asks "which rays can each triangle possibly hit?" GRCA geometrically models spinning LiDAR emitters as rotation-traced cones or planes and uses each triangle's emitter-centric apparent area to cull, per triangle, which channels and the rays within those channels can possibly reach it -- without any acceleration structure. GRCA is compute-based and vendor-agnostic by design, targeting highly dynamic, high-resolution simultaneous multi-sensor simulation. At its core, GRCA is a general-purpose ray-casting algorithm: the emitter-centric inversion applies to any setting where rays originate from a known position, not only LiDAR. Benchmarks evaluate 2-8 simultaneous 128x4096-ray LiDARs (360°/180°) over complex dynamic scenes -- with just two sensors casting ~1M rays per frame. With range culling inactive, GRCA reaches up to 7.97x over hardware-accelerated OptiX (GPU) and 14.55x over Embree (CPU). Two independent extensions further boost performance even in the most complex scene (~22M triangles, ~9M of which are dynamic, 8 LiDARs): range culling at realistic deployment ranges (10-100m) reaches up to 7.02x GPU and 9.33x CPU; a hybrid pipeline -- GRCA for dynamic geometry, OptiX/Embree for static -- reaches up to 10.5x GPU and 19.2x CPU.
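The emitter-centric test is straightforward to sketch for a spinning multi-channel LiDAR: project a triangle's vertices into the emitter's (azimuth, elevation) frame and keep only the channels and azimuth columns inside its apparent angular extent. A toy version of that culling step (not GRCA's data structures):

```python
import numpy as np

def candidate_rays(tri, emitter, channel_elevs, n_azimuth):
    """Cull rays by a triangle's apparent angular extent at the emitter.

    tri: (3, 3) vertices; emitter: (3,) position;
    channel_elevs: (C,) per-channel elevation angles in radians.
    Returns (channel indices, azimuth column indices) that can hit the triangle.
    Triangles crossing the +/- pi azimuth seam are ignored for brevity.
    """
    d = tri - emitter
    az = np.arctan2(d[:, 1], d[:, 0])
    el = np.arctan2(d[:, 2], np.linalg.norm(d[:, :2], axis=1))
    chans = np.nonzero((channel_elevs >= el.min()) & (channel_elevs <= el.max()))[0]
    cols = np.arange(int(np.floor(az.min() / (2 * np.pi) * n_azimuth)),
                     int(np.ceil(az.max() / (2 * np.pi) * n_azimuth)) + 1) % n_azimuth
    return chans, cols
```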
deployment · sensors

- Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving · arXiv:2605.10034 · 5/11/2026 · Aron Distelzweig, Faris Janjoš, Andreas Look, Anna Rothenhäusler …
Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
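For reference, the Intelligent Driver Model that the benchmark stress-tests beyond is a one-line car-following law. A textbook implementation (not BehaviorBench code):

```python
import numpy as np

def idm_accel(v, v_lead, gap, v0=30.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4):
    """Intelligent Driver Model acceleration (SI units).

    v: ego speed, v_lead: leader speed, gap: bumper-to-bumper distance.
    """
    dv = v - v_lead                                         # closing speed
    s_star = s0 + v * T + v * dv / (2 * np.sqrt(a_max * b)) # desired gap
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

# Ego at 25 m/s closing on a 20 m/s leader 30 m ahead -> strong braking:
print(idm_accel(v=25.0, v_lead=20.0, gap=30.0))
```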
crash · rl · hardware · multi-agent

- Muninn: Your Trajectory Diffusion Model But Faster · arXiv:2605.09999 · 5/11/2026 · Gokul Puthumanaillam, Hao Jiang, Ruben Hernandez, Jose Fuentes …
Diffusion-based trajectory planners can synthesize rich, multimodal robot motions, but their iterative denoising makes online planning and control prohibitively slow. Existing accelerations either modify the sampler or compress the network--sacrificing plan quality or requiring retraining without accounting for downstream control risk. We address the problem of making diffusion-based trajectory planners fast enough for real-time robot use without retraining the model or sacrificing trajectory quality, and in a way that works across diverse state-space diffusion architectures. Our key insight is that diffusion trajectory planners expose two signals we can exploit: a cheap probe of how their internal trajectory representation changes across steps, and analytic coefficients that describe how denoiser errors affect the sampler's state update. By calibrating the first signal against the second on offline runs, we obtain a per-step score that upper-bounds how far the final trajectory can deviate when we reuse a cached denoiser output, and we treat this bound as an uncertainty budget that we can spend over the denoising process. Building on this insight, we present Muninn, a training-free caching wrapper that tracks this uncertainty budget during sampling and, at each diffusion step, chooses between reusing a cached denoiser output when the predicted deviation is small and recomputing the denoiser when it is not. Across standard benchmarks Muninn delivers up to 4.6x wall-clock speedups across several trajectory diffusion models by reducing denoiser evaluations, while preserving task performance and safety metrics. Muninn further certifies that cached rollouts remain within a specified distance of their full-compute counterparts, and we validate these gains in real-time closed-loop navigation and manipulation hardware deployments. Project page: https://github.com/gokulp01/Muninn.
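The caching loop is simple in the abstract: track a deviation budget and reuse the cached denoiser output whenever a cheap probe says the bound allows it. A schematic sketch; the probe, coefficients, and budget policy here are assumptions standing in for Muninn's calibrated bounds:

```python
def cached_denoise(x, steps, denoiser, sampler_step, probe, coeff, budget):
    """Schematic training-free caching wrapper for a diffusion sampler.

    probe(x, t): cheap scalar estimate of how much the internal trajectory
    representation changed; coeff[t]: analytic factor mapping denoiser error
    to final-trajectory deviation; budget: total deviation allowance.
    """
    cached_eps = None
    spent = 0.0
    for t in reversed(range(steps)):
        # Predicted deviation if the cached output were reused at this step.
        predicted = coeff[t] * probe(x, t) if cached_eps is not None else float("inf")
        if cached_eps is not None and spent + predicted <= budget:
            eps = cached_eps            # reuse: skip the expensive denoiser
            spent += predicted          # spend part of the uncertainty budget
        else:
            eps = denoiser(x, t)        # recompute and refresh the cache
            cached_eps = eps
        x = sampler_step(x, eps, t)
    return x
```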
manipulation

- Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts · arXiv:2605.09055 · 5/9/2026 · Quilee Simeon, Justin M. Wei, Yile Fan
Recent agentic-robotics systems, from Code-as-Policies to modern vision-language-action (VLA) foundation models, presuppose that drivers, SDKs, or ROS-style primitives for the target hardware already exist. Writing those primitives is the dominant engineering cost of bringing up new hardware for agent control. We present Octopus Protocol, a system that collapses that cost to a single shell command. Given only raw OS access and a language-model API key, a coding agent executes a five-stage pipeline--PROBE, IDENTIFY, INTERFACE, SERVE, DEPLOY--to discover connected devices, infer their capabilities, generate a Model Context Protocol (MCP) server with typed tools, and deploy it as a live HTTP endpoint. A persistent daemon then monitors the system, heals broken code, and perceives physical state through the camera tools it generated for itself. Two architectural principles make this work: protocols are prompts, not code, and the coding agent is the runtime. We validate the system on three heterogeneous platforms (PC/WSL, Apple Silicon macOS, Raspberry Pi 4) and on a commercial 6-DOF robotic arm with USB camera feedback. One command onboards the hardware in ~10-15 minutes and exposes up to 30 MCP tools; an MCP-compliant client then performs closed-loop visual-motor control through tools no human wrote.
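The artifact the pipeline emits is an ordinary MCP server with typed tools. For scale, a minimal hand-written equivalent using the official Python MCP SDK's FastMCP helper; the device tools are hypothetical stand-ins for what the coding agent would synthesize:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("robot-arm")  # the kind of server the INTERFACE/SERVE stages emit

@mcp.tool()
def move_joint(joint: int, degrees: float) -> str:
    """Move one arm joint by a relative angle (hypothetical driver stub)."""
    # A generated server would call the probed serial/USB driver here.
    return f"joint {joint} moved {degrees} deg"

@mcp.tool()
def capture_frame() -> str:
    """Grab a camera frame for closed-loop visual feedback (stub)."""
    return "frame captured"

if __name__ == "__main__":
    mcp.run()  # serve the typed tools to any MCP-compliant client
```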
deployment · sensors · vla

- MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware · arXiv:2605.05945 · 5/7/2026 · Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor, Pratyush Patnaik …
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour-plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are threefold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
sensors · foundation-model · vla

- CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation · arXiv:2605.02600 · 5/4/2026 · Berk Çiçek, Mert K. Er, Özgür S. Öğüz
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.
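The control side is standard MPPI; what CoRAL changes is where the cost comes from (an LLM synthesizes and iteratively revises it). A minimal MPPI sketch with the cost left as a plug-in callable, the interface the paper's decoupling implies (details assumed, not taken from the paper):

```python
import numpy as np

def mppi(x0, u_nom, dynamics, cost, horizon, n_samples=256, noise=0.3, lam=1.0):
    """One MPPI update. `cost` is a plug-in, e.g. synthesized by an LLM.

    x0: initial state; u_nom: (H, m) nominal controls;
    dynamics(x, u) -> next state; cost(traj) -> scalar.
    """
    du = noise * np.random.randn(n_samples, horizon, u_nom.shape[1])
    costs = np.empty(n_samples)
    for k in range(n_samples):
        x, traj = x0, [x0]
        for t in range(horizon):
            x = dynamics(x, u_nom[t] + du[k, t])  # roll out a perturbed plan
            traj.append(x)
        costs[k] = cost(np.stack(traj))
    # Softmin weighting of the sampled perturbations.
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return u_nom + np.einsum("k,khm->hm", w, du)
```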
manipulation · vla

- Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion · arXiv:2605.01477 · 5/2/2026 · Jeffrin Sam, Nguyen Khang, Yara Mahmoud, Miguel Altamirano Cabrera …
We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings for semantic stopping. We pretrain on the RECON outdoor navigation dataset and fine-tune on 203 Unitree G1 humanoid episodes collected in Isaac Sim to calibrate velocity dynamics. A single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion on a real Unitree G1 in unseen indoor environments under open-loop execution, while operating at 40--47 Hz. We evaluate Action Agent across three embodiments: a Unitree G1 humanoid (real hardware), a drone, and a wheeled mobile robot (Isaac Sim), demonstrating that decoupling trajectory imagination from execution yields a scalable and embodiment-aware paradigm for language-guided navigation.
isaac-sim · humanoid · unitree