Search — deployment
Issues
25 matches
- nvidia-forum:isaac-ros | 5/15/2026 | docs-onboarding
User reports they cannot install the ROS2 Jazzy package `ros-jazzy-isaac-ros-foundationpose` in an Isaac ROS environment. Without the package, they can’t proceed with FoundationPose workflows.
Tags: isaac-ros, ros2-jazzy, foundationpose, installation, packaging, linux
- Isaac Sim 5.1.0 crashes shortly after startup on Windows Server 2025 with RTX Pro 6000 Blackwell [Blocker] | nvidia-forum:simulation | 5/13/2026 | crashes-stability
Report that Isaac Sim 5.1.0 crashes shortly after startup on Windows Server 2025 with an RTX Pro 6000 Blackwell GPU. User cannot run the simulator in this environment.
Tags: isaac-sim, 5-1-0, windows-server-2025, rtx-pro-6000, blackwell, startup-crash, drivers
- nvidia-forum:robotics-edge-computing | 5/13/2026 | hardware-integration
A Jetson AGX Orin report indicates ISP1 fails to power on at boot. This prevents expected camera/ISP functionality from being available after startup.
Tags: jetson, agx-orin, isp, boot, camera
- nvidia-forum:robotics-edge-computing | 5/13/2026 | hardware-integration
A user reports two cameras on Jetson Orin Nano are not working. No further diagnostic information is included.
Tags: jetson, orin-nano, cameras, bringup, sensors
- nvidia-forum:robotics-edge-computing | 5/13/2026 | deployment
A user cannot download/install PyTorch with CUDA support on Jetson Orin NX (JetPack 6.0) due to network issues. This blocks setting up a CUDA-enabled ML stack on device.
Tags: jetson, orin-nx, jetpack-6, pytorch, cuda, network
- nvidia-forum:robotics-edge-computing | 5/13/2026 | docs-onboarding
A user asks about Nova Orin initialization for the Nova Carter Robot. The post contains no additional details.
Tags: nova, orin, carter, initialization, onboarding
- nvidia-forum:robotics-edge-computing | 5/12/2026 | hardware-integration
A user requests DRAM supplier consistency (Hynix) for Orin NX 16GB. This indicates concerns about BOM variability impacting deployments.
Tags: jetson, orin-nx, dram, supply-chain, fleet
- nvidia-forum:robotics-edge-computing | 5/12/2026 | deployment
User reports that mmapi encoder dequeue capture is not blocked. This suggests unexpected non-blocking behavior in the encoder pipeline.
Tags: jetson, multimedia, mmapi, encoder, video-pipeline
- nvidia-forum:robotics-edge-computing | 5/12/2026 | crashes-stability
On AGX Thor / JetPack 7.1, UEFI leaves DCE mailbox state dirty, causing RmInitAdapter failure on specific units. This can prevent GPU driver initialization and system usability.
Tags: jetson, thor, jetpack-7-1, uefi, rminitadapter, gpu-init
- nvidia-forum:robotics-edge-computing | 5/12/2026 | hardware-integration
User reports difficulty booting from Thor IGX Mini. This blocks platform setup and testing.
Tags: jetson, thor, igx-mini, boot, storage-boot
- nvidia-forum:robotics-edge-computing | 5/12/2026 | deployment
Jetson Orin Nano Super shows RTSP stream lag, high RAM usage, and slow face detection. This suggests performance and memory efficiency issues in a common edge AI workload.
Tags: jetson, orin-nano-super, rtsp, memory-usage, performance, face-detection
- nvidia-forum:robotics-edge-computing | 5/11/2026 | hardware-integration
User cannot control GPIO on JetPack 6.2.1. This blocks basic hardware I/O functionality.
Tags: jetson, jetpack-6-2-1, gpio, drivers, hardware-io
- nvidia-forum:robotics-edge-computing | 5/11/2026 | deployment
User reports intermittent flash failures when using cloned images on Jetson. This creates unreliable provisioning and blocks repeatable deployment.
Tags: jetson, flashing, cloned-images, manufacturing, provisioning, reliability
- nvidia-forum:robotics-edge-computing | 5/11/2026 | deployment
User asks (in Chinese) how to modify the default boot order on a system adapted for JetPack 6.1.2. This is a boot configuration question impacting deployment.
Tags: jetson, jetpack-6-1-2, boot-order, uefi, deployment
- nvidia-forum:robotics-edge-computing | 5/11/2026 | deployment
PKC key revocation reportedly does not work on L4T R36.5.0. This undermines secure provisioning and lifecycle management.
Tags: jetson, l4t-r36-5-0, security, pkc, key-revocation, secure-boot
- github:isaac-sim/IsaacLab | 5/11/2026 | rendering
In a CloudXR + OpenXR setup, frames stream correctly but inbound messages and hand-tracking poses are silently dropped between client and Isaac Sim’s OpenXR plugin. This blocks teleop commands and hand tracking for interactive workflows.
Tags: rendering, hardware, deployment, integration, isaac-sim, isaac-lab
- nvidia-forum:robotics-edge-computing | 5/11/2026 | deployment
User has an issue creating an application partition using remaining SSD space during the flash/build process. This blocks desired storage layout for deployment.
Tags: jetson, flashing, partitioning, ssd, deployment, storage
- Nv-tegra.nvidia.com down? [Blocker] | nvidia-forum:robotics-edge-computing | 5/11/2026 | deployment
User reports nv-tegra.nvidia.com may be down. If true, this blocks access to resources needed for Jetson development and deployment.
Tags: jetson, nv-tegra, outage, downloads, infrastructure
- nvidia-forum:simulation | 5/11/2026 | deployment
User cannot start IsaacSim 5.1.0 because of the ROS2 Bridge. This prevents launching the simulator with ROS integration enabled.
Tags: deployment, isaac-sim
- github:isaac-sim/IsaacLab | 5/9/2026 | synthetic-data
Tags: synthetic-data, rl, deployment, docs, integration, feature-request, isaac-sim, isaac-lab
- github:isaac-sim/IsaacSim | 5/8/2026 | training-infra
Tags: rl, usd, rendering, hardware, deployment, locomotion, isaac-sim, unitree
- github:newton-physics/newton | 5/8/2026 | deployment
Tags: deployment, newton, warp
- github:NVIDIA/warp | 5/7/2026 | deployment
Tags: deployment, docs, warp
- github:isaac-sim/IsaacSim | 5/7/2026 | asset-pipeline
Tags: usd, hardware, deployment, isaac-sim, isaac-lab
- [WebRTC Streaming Client 1.1.5] Entering IP:Port in Server Field Fails to Connect — Plain IP Works [Pain] | nvidia-forum:simulation | 5/7/2026 | deployment
Tags: deployment
Papers
25 matches
- Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model | 2605.14950 | 5/14/2026 | Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu …
Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
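As a rough illustration of the depth-aware modulation step described above, here is a minimal FiLM-style sketch in which a compact depth feature predicts a per-channel scale and shift for the vision-language tokens; the module name, dimensions, and gating form are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of depth-aware modulation: compact depth features
# predict per-channel scale/shift applied to vision-language tokens.
import torch
import torch.nn as nn

class DepthAwareModulation(nn.Module):
    def __init__(self, token_dim=768, depth_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(depth_dim, token_dim)
        self.to_shift = nn.Linear(depth_dim, token_dim)

    def forward(self, vl_tokens, depth_feat):
        # vl_tokens: (B, N, token_dim); depth_feat: (B, depth_dim)
        scale = self.to_scale(depth_feat).unsqueeze(1)   # (B, 1, token_dim)
        shift = self.to_shift(depth_feat).unsqueeze(1)
        return vl_tokens * (1 + scale) + shift

tokens = torch.randn(2, 64, 768)
depth = torch.randn(2, 128)
print(DepthAwareModulation()(tokens, depth).shape)       # torch.Size([2, 64, 768])
```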
Tags: deployment, manipulation, perception, vla
- Learning Cross-Coupled and Regime Dependent Dynamics for Aerial Manipulation | 2605.14805 | 5/14/2026 | Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy …
Accurate dynamics models are critical for aerial manipulators operating under complex tasks such as payload transport. However, modeling these systems remains fundamentally challenging due to strong quadrotor-manipulator coupling, delayed aerodynamic interactions, and regime-dependent dynamics variations arising from payload changes and manipulator reconfiguration. These effects produce residual dynamics that are simultaneously cross-coupled, history-dependent, and nonstationary, causing both analytical models and purely offline learned models to degrade during deployment. To address these challenges, we propose a structured encoder-decoder framework for adaptive residual dynamics learning in aerial manipulators. The proposed nonlinear latent encoder captures cross-variable coupling and temporal dependencies from state-input histories, while a lightweight linear latent decoder enables online adaptation under regime-dependent nonstationary dynamics. The linear-in-parameter decoder structure permits closed-form Bayesian adaptation together with consistency-driven covariance inflation, enabling rapid and stable adaptation to both transient and slowly varying dynamics changes while remaining compatible with real-time model predictive control (MPC). Experimental results on a real aerial manipulation platform demonstrate improved residual prediction accuracy, faster adaptation under changing operating conditions, and enhanced MPC-based trajectory tracking performance. These results highlight the importance of jointly modeling coupled temporal dynamics and deployment-time nonstationarity for reliable aerial manipulation.
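The linear-in-parameter decoder with closed-form Bayesian adaptation can be pictured as a recursive least-squares / Kalman-style update. The sketch below assumes a scalar output, Gaussian noise, and a constant inflation factor standing in for the paper's consistency-driven covariance inflation; all names are illustrative.

```python
# Minimal sketch of closed-form Bayesian adaptation for a linear-in-parameter
# decoder y ≈ phi(z) @ theta. Not the paper's exact formulation.
import numpy as np

def bayes_update(theta, P, phi, y, noise_var=0.05, inflate=1.02):
    P = P * inflate                           # simplified covariance inflation
    S = phi @ P @ phi + noise_var             # innovation variance (scalar output)
    K = (P @ phi) / S                         # Kalman-style gain
    theta = theta + K * (y - phi @ theta)     # posterior mean
    P = P - np.outer(K, phi @ P)              # posterior covariance
    return theta, P

theta, P = np.zeros(4), np.eye(4)
phi = np.array([0.3, -1.2, 0.8, 0.1])         # latent features for one step
theta, P = bayes_update(theta, P, phi, y=0.7)
print(theta)
```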
Tags: deployment, manipulation
- SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis | 2605.14700 | 5/14/2026 | Ben Wei Lim, Minh Duc Le, Thang Truong, Thanh Nguyen Canh
Generating robot simulation environments remains a major bottleneck in simulation-based robot learning. Constructing a training-ready MuJoCo scene typically requires expertise in 3D asset modeling, MJCF specification, spatial layout, collision avoidance, and robot-model integration. We present SR-Platform, a production-deployed agentic system that converts free-form natural language descriptions into executable, physically valid MuJoCo environments. SR-Platform decomposes scene synthesis into four stages: an LLM-based orchestrator that converts user intent into a structured scene plan; an asset forge that retrieves cached assets or generates new 3D geometry through LLM-to-CadQuery synthesis; a layout architect that assigns object poses and verifies industrial constraints; and a bridge layer that assembles the final MJCF scene and merges the selected robot model. The system is deployed as a nine-service Docker stack with WebSocket progress streaming, MinIO-backed mesh storage, Qdrant-based semantic asset retrieval, Redis job state, and InfluxDB telemetry. Using 30 days of production telemetry covering 611 successful LLM calls, SR-Platform generates five-object scenes with a median end-to-end latency of approximately 50 s, while cache-accelerated scenes complete in approximately 30-40 s. The asset forge shows an 11.3% first-attempt retry rate with automatic recovery, and cached asset retrieval removes per-object LLM calls for previously generated object types. These results show that agentic scene synthesis can reduce the manual effort required to create diverse robot training environments, enabling users to produce executable MuJoCo scenes from plain English prompts in under one minute.
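For a sense of what the orchestrator stage might hand to the asset forge and layout architect, here is a hypothetical structured scene plan; the schema, field names, and values are illustrative assumptions, not SR-Platform's actual contract.

```python
# Hypothetical scene plan an LLM orchestrator might emit from a user prompt.
scene_plan = {
    "robot": "ur5e",                                  # merged in by the bridge layer
    "objects": [
        {"name": "pallet", "source": "cache"},        # reuse a cached asset
        {"name": "blue_tote", "source": "generate",   # new geometry via CadQuery
         "prompt": "open-top plastic tote, 40x30x20 cm"},
    ],
    "layout_constraints": ["objects_on_table", "no_collision"],
}

for obj in scene_plan["objects"]:
    print(obj["name"], "->", obj["source"])
```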
Tags: crash, usd, deployment, integration, mujoco
- Ergodic Imitation for Adaptive Exploration around Demonstrations | 2605.13996 | 5/13/2026 | Ziyi Xu, Cem Bilaloglu, Yiming Li, Sylvain Calinon
In robotics, a common challenge in imitation learning is the mismatch between training and deployment conditions, caused, for example, by environmental changes or imperfect observation and control. When a robot follows a nominal trajectory under such mismatch, it may become stuck and fail to complete the task. This calls for adaptive online exploration strategies that remain grounded in demonstrations. To this end, we propose an adaptive ergodic imitation approach that constructs a target distribution from the geometry of the retrieved demonstrations and uses it to generate trajectories that adaptively interpolate between tracking and exploration. Our method extends ergodic control beyond its traditional role in area-coverage and search by incorporating demonstrations into a retrieval-based receding-horizon framework for adaptive imitation.
Tags: deployment
- Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs | 2605.13778 | 5/13/2026 | Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang …
Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
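The speculative replanning loop can be sketched as: run a cheap draft, verify it against the main model, and fall back to full inference when verification fails. The function names, verification score, and threshold below are placeholders, not the FLASH implementation.

```python
# Hedged sketch of speculative replanning with a draft model and a fallback.
import numpy as np

def speculative_step(obs, draft_policy, verify_fn, full_policy, tol=0.1):
    draft_actions = draft_policy(obs)            # fast draft (~ms)
    if verify_fn(obs, draft_actions) < tol:      # main-model check of the draft
        return draft_actions                     # accept speculative result
    return full_policy(obs)                      # phase-aware fallback: full inference

# toy stand-ins for the three components
draft = lambda o: np.zeros(7)
verify = lambda o, a: float(np.abs(a).mean())
full = lambda o: np.ones(7) * 0.01
print(speculative_step(np.zeros(10), draft, verify, full))
```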
Tags: deployment, vla
- TinySDP: Real Time Semidefinite Optimization for Certifiable and Agile Edge Robotics | 2605.13748 | 5/13/2026 | Ishaan Mahajan, Jon Arrizabalaga, Andrea Grillo, Fausto Vega …
Semidefinite programming (SDP) provides a principled framework for convex relaxations of nonconvex geometric constraints in motion planning, yet existing solvers are too computationally expensive for real-time control, particularly on resource-constrained embedded systems. To address this gap, we introduce TinySDP, the first semidefinite programming solver designed for embedded systems, enabling real-time model-predictive control (MPC) on microcontrollers for problems with nonconvex obstacle constraints. Our approach integrates positive-semidefinite cone projections into a cached-Riccati-based ADMM solver, leveraging computational structure for embedded tractability. We pair this solver with an a posteriori rank-1 certificate that converts relaxed solutions into explicit geometric guarantees at each timestep. On challenging benchmarks, e.g., cul-de-sac and dynamic obstacle avoidance scenarios that induce failures in local methods, TinySDP achieves collision-free navigation with up to 73% shorter paths than state-of-the-art baselines. We validate our approach on a Crazyflie quadrotor, demonstrating that semidefinite constraints can be enforced at real-time rates for agile embedded robotics.
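The core ADMM ingredient is the projection onto the positive-semidefinite cone, which the sketch below performs via eigendecomposition and eigenvalue clipping; this is the standard textbook projection, not TinySDP's embedded solver code.

```python
# Standard PSD cone projection used inside ADMM-style SDP solvers.
import numpy as np

def project_psd(X):
    X = 0.5 * (X + X.T)                 # enforce symmetry
    w, V = np.linalg.eigh(X)            # eigendecomposition
    w = np.clip(w, 0.0, None)           # drop negative eigenvalues
    return (V * w) @ V.T                # reassemble the nearest PSD matrix

A = np.array([[1.0, 2.0], [2.0, -3.0]])
print(project_psd(A))
```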
Tags: crash, rl, deployment
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models | 2605.13632 | 5/13/2026 | Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang …
In this paper, we propose GTA-VLA(Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/
Tags: rl, deployment, vla
- Uncertainty-Aware 3D Position Refinement for Multi-UAV Systems | 2605.13500 | 5/13/2026 | Hosam Alamleh, Damir Pulatov
Reliable real-time 3D localization is essential for multi-UAV navigation, collision avoidance, and coordinated flight, yet onboard estimates can degrade under GNSS multipath, non-line-of-sight reception, vertical drift, and intentional interference. This paper presents a decentralized, lightweight 3D position-refinement layer that improves robustness by fusing each Unmanned Aerial Vehicle (UAV)'s local estimate with neighbor-shared state summaries and inter-UAV range or proximity constraints. The method performs uncertainty-aware neighborhood fusion by weighting each UAV's prior according to its reported covariance and weighting neighbor constraints according to link quality, ranging uncertainty, and a learned trust score. To support practical deployment, the framework explicitly handles cold start and temporary localization loss by inflating or substituting weak priors, allowing trusted neighborhood constraints to bootstrap and stabilize estimates until absolute sensing recovers. To mitigate the impact of faulty or malicious participants, each UAV applies a local range-consistency check, smoothed over time, to down-weight or exclude neighbors whose reported positions are incompatible with observed inter-UAV distances. Simulation experiments with 10 UAVs in a 3D volume show that the proposed refinement substantially reduces mean localization error during cold start, remains competitive after local estimators stabilize, and maintains lower error as the fraction of malicious nodes increases compared with fusion without trust. These results suggest that the approach can serve as a practical resilience layer for swarm operation in challenging environments.
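Uncertainty-aware neighborhood fusion of this kind can be approximated by information-form averaging with trust-scaled weights; the weighting rule and variable names below are illustrative assumptions rather than the paper's exact update.

```python
# Sketch of covariance- and trust-weighted fusion of 3D position estimates.
import numpy as np

def fuse(estimates, covariances, trust):
    # estimates: list of 3D positions; covariances: list of 3x3; trust: floats in [0, 1]
    info, mean = np.zeros((3, 3)), np.zeros(3)
    for x, P, t in zip(estimates, covariances, trust):
        W = t * np.linalg.inv(P)        # trust-scaled information matrix
        info += W
        mean += W @ x
    P_fused = np.linalg.inv(info)
    return P_fused @ mean, P_fused

x_own = np.array([0.0, 0.0, 10.0])      # weak own prior (e.g. during cold start)
x_nbr = np.array([0.5, -0.2, 10.3])     # neighbor-derived constraint
pos, P = fuse([x_own, x_nbr], [np.eye(3) * 4.0, np.eye(3) * 1.0], [1.0, 0.8])
print(pos)
```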
Tags: crash, deployment, multi-agent
- BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning | 2605.13382 | 5/13/2026 | Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen …
While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. Experimental results demonstrate that our BlockVLA achieves a 3.3$\times$ inference acceleration over standard discrete diffusion baselines. Furthermore, our model exhibits superior training efficiency, with success rates converging substantially faster than baselines, a gain that is particularly pronounced in complex, long-horizon tasks, where BlockVLA achieves significant performance gains in the early stages of training. This work establishes Block Diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.
Tags: rl, deployment, vla
- What Limits Vision-and-Language Navigation? | 2605.13328 | 5/13/2026 | Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li …
Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.
Tags: rendering, deployment
- Calibration-Free Gas Source Localization with Mobile Robots: Source Term Estimation Based on Concentration Measurement Ranking | 2605.13208 | 5/13/2026 | Wanting Jin, Agatha Duranceau, İzzet Kağan Erünsal, Alcherio Martinoli
Efficient Gas Source Localization (GSL) in real-world settings is crucial, especially in emergency scenarios. Mobile robots equipped with low-cost, in-situ gas sensors offer a safer alternative to human inspection in hazardous environments. Probabilistic algorithms enhance GSL efficiency with scattered gas measurements by comparing gas concentration measurements gathered by robots to physical dispersion models. However, accurately deriving gas concentrations from data acquired with low-cost sensors is challenging due to the nonlinear sensor response, environmental dependencies (e.g., humidity, temperature, and other gas influences), and robot motion. Mitigating these disturbance factors requires frequent sensor calibration in controlled environments, which is often impractical for real-world deployments. To overcome these issues, we propose a novel feature extraction algorithm that leverages the relative ranking of gas measurements within the dynamically accumulated dataset. By comparing the rank differences between gathered and modeled values, we estimate the probabilistic distribution of source locations across the entire environment. We validate our approach in high-fidelity simulations and physical experiments, demonstrating consistent localization accuracy with uncalibrated gas sensors. Compared to existing methods, our technique eliminates the need for gas sensor calibration, making it well-suited for real-world applications.
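The ranking idea can be sketched by comparing the rank order of gathered measurements against the rank order predicted by a dispersion model for a candidate source, so uncalibrated absolute sensor values cancel out; the mismatch score below is an illustrative choice, not the paper's estimator.

```python
# Rank-based consistency check between measured and modeled concentrations.
import numpy as np

def rank_mismatch(measured, modeled):
    r_meas = np.argsort(np.argsort(measured))    # ranks of gathered measurements
    r_model = np.argsort(np.argsort(modeled))    # ranks under the candidate source
    return np.abs(r_meas - r_model).mean()       # lower = more consistent ordering

meas = np.array([0.2, 0.9, 0.4, 0.1])            # raw, uncalibrated readings
model_candidate = np.array([1.0, 5.0, 2.0, 0.5]) # dispersion-model predictions
print(rank_mismatch(meas, model_candidate))      # 0.0: same ordering
```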
- What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models | 2605.13105 | 5/13/2026 | Yuanfang Peng, Jingjing Fu, Chuheng Zhang, Li Zhao …
Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together, these objectives turn visual variants from mere observation diversity into behavior-level guidance on policy responses during RL fine-tuning. We evaluate on ManiSkill3 across two representative VLA architectures, OpenVLA and $π_{0.5}$, under diverse out-of-distribution visual shifts including unseen distractors, texture changes, target object pose variation, viewpoint shifts, and lighting changes. Our method consistently improves over standard PPO, achieving average improvements of 16.62% on $π_{0.5}$ and 9.10% on OpenVLA. Notably, ablations further show generalization across visual shifts: invariance guidance learned from distractor and texture variants transfers to target-pose and lighting shifts, while adding sensitivity guidance on target-pose variants further improves robustness to nuisance shifts, highlighting the broader transferability of behavior-level RL guidance.
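A minimal sketch of the two paired auxiliary objectives: pull action outputs together for a task-preserving visual variant and keep them apart for a task-altering one. Deterministic action means and MSE/hinge surrogates stand in for the distributional terms used alongside PPO; weights and margin are placeholders.

```python
# Illustrative paired invariance/sensitivity auxiliary loss (not the paper's exact form).
import torch
import torch.nn.functional as F

def paired_aux_loss(mu, mu_preserve, mu_alter, w_inv=1.0, w_sens=0.1, margin=0.5):
    inv = F.mse_loss(mu, mu_preserve)                   # invariance: match actions
    sens = F.relu(margin - F.mse_loss(mu, mu_alter))    # sensitivity: keep actions apart
    return w_inv * inv + w_sens * sens

mu = torch.randn(8, 7)                                  # action means, original observation
loss = paired_aux_loss(mu, mu + 0.01 * torch.randn(8, 7), mu + torch.randn(8, 7))
print(loss.item())
```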
Tags: rl, rendering, deployment, manipulation, vla
- TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion | 2605.12220 | 5/12/2026 | Mohammad Khoshkdahan, Alexey Vinel
Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
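The three-height-band BEV encoding can be sketched as binning LiDAR points into a 2D occupancy grid with one channel per height band; the grid extent, resolution, and band edges below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a three-height-band BEV occupancy encoding for a LiDAR point cloud.
import numpy as np

def triband_bev(points, x_range=(0, 70), y_range=(-35, 35), res=0.1,
                bands=(-0.5, 1.0, 2.5, 4.0)):
    H = int((y_range[1] - y_range[0]) / res)
    W = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((3, H, W), dtype=np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    xi = ((x[keep] - x_range[0]) / res).astype(int)
    yi = ((y[keep] - y_range[0]) / res).astype(int)
    zi = np.digitize(z[keep], bands) - 1          # which height band, 0..2
    ok = (zi >= 0) & (zi < 3)
    bev[zi[ok], yi[ok], xi[ok]] = 1.0             # occupancy per band
    return bev

pts = np.random.uniform([0, -35, -0.5], [70, 35, 4.0], size=(1000, 3))
print(triband_bev(pts).shape)                     # (3, 700, 700)
```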
Tags: deployment, sensors, perception
- Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete | 2605.12160 | 5/12/2026 | Joonha Park, Jiseung Jeong, Taesik Gong
Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator-rendered target-object segmentation masks and applied as a per-patch reweighting of the next step's image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark suite, Premover reduces mean wall-clock time from 34.0 to 29.4 seconds, a 13.6% reduction, while matching the full-prompt baseline's success rate (95.1% vs. 95.0%); naive premoving, by contrast, collapses to 66.4%.
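A hedged sketch of the focus-map reweighting and readiness gate: image and language features projected into a shared space produce a per-patch score that reweights the next step's image tokens, and a scalar threshold decides when to start acting. Shapes and the sigmoid gating are assumptions, not the paper's heads.

```python
# Illustrative focus-map reweighting with a readiness threshold.
import torch

def focus_reweight(img_tokens, img_proj, lang_proj, readiness_thresh=0.6):
    # img_proj: (B, N, d), lang_proj: (B, T, d) in a shared space
    scores = torch.einsum("bnd,btd->bnt", img_proj, lang_proj).max(dim=-1).values
    focus = torch.sigmoid(scores)                      # (B, N) per-patch weight
    reweighted = img_tokens * focus.unsqueeze(-1)      # reweight next step's tokens
    ready = focus.max(dim=-1).values > readiness_thresh
    return reweighted, ready

tok = torch.randn(1, 196, 768)
rw, ready = focus_reweight(tok, torch.randn(1, 196, 64), torch.randn(1, 12, 64))
print(rw.shape, ready)
```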
Tags: rl, deployment, perception, vla
- Control of Fully Actuated Aerial Vehicles: A Comparison of Model-based and Sensor-based Dynamic Inversion | 2605.12071 | 5/12/2026 | Ali Sidar Yilmaz, Buday Turan, Lukas Pries, Markus Ryll
Fully actuated multirotor platforms decouple translational force generation from vehicle attitude, enabling independent control of position and orientation and shifting performance limitations from attitude authority to actuator dynamics and control effectiveness. This paper compares a model-based nonlinear dynamic inversion controller (geometric NDI) with a sensor-based incremental dynamic inversion controller (INDI) on a fixed-tilt fully actuated hexarotor. Both controllers share an identical outer-loop structure and are both executed at 500 Hz; therefore, performance differences can be attributed primarily to the inversion strategy. Controller performance is evaluated in five experiments covering attitude step tracking under nominal conditions and under a 50% mismatch in the rotor force coefficient, hover disturbance rejection under an external lateral load, waypoint tracking in the presence of wind gust disturbances, reduced control frequency, and injected sensor degradation. The results show that INDI offers clear advantages under parameter mismatch, gust disturbances, and sensor degradation, and maintains lower position errors across the controller-frequency sweep. However, its advantages are not universal: geometric NDI yields better attitude tracking at reduced control frequencies. To the authors' best knowledge, this work presents the first experimental validation of a full pose tracking INDI controller with decoupled translational and rotational dynamics. These findings highlight the trade-off between measurement-based and model-based inversion for robust control and rapid deployment of fully actuated UAVs.
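For reference, the incremental dynamic inversion (INDI) idea in its textbook form computes the control increment from measured acceleration rather than a full model, u = u_prev + G⁺(nu − a_meas); the matrix sizes below are illustrative for a fully actuated hexarotor and this is not the authors' controller.

```python
# Textbook-style INDI control increment (illustrative, not the paper's controller).
import numpy as np

def indi_step(u_prev, a_meas, nu, G):
    # G: control effectiveness (6 accelerations x 6 actuators), nu: commanded accel
    return u_prev + np.linalg.pinv(G) @ (nu - a_meas)

G = np.eye(6)                                    # placeholder effectiveness matrix
u = indi_step(np.zeros(6),
              a_meas=np.array([0, 0, -9.81, 0, 0, 0]),
              nu=np.array([0, 0, -9.0, 0, 0, 0]),
              G=G)
print(u)
```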
Tags: deployment
- See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model | 2605.11817 | 5/12/2026 | Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia …
Vision-Language-Action (VLA) models have shown remarkable promise in robotics manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform, which validate the lowest feasible visual token count reported to date, demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in success rate. The code is available at https://github.com/Fediory/Grid-Sampler.
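A minimal sketch of continuous, differentiable token resampling using bilinear interpolation (torch.nn.functional.grid_sample); the coordinate source and number of kept tokens are assumptions for illustration, not GridS itself.

```python
# Read out features at predicted (x, y) coordinates with differentiable interpolation.
import torch
import torch.nn.functional as F

def resample_tokens(feat_map, coords):
    # feat_map: (B, C, H, W); coords: (B, K, 2) in [-1, 1]
    grid = coords.unsqueeze(2)                        # (B, K, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=False)
    return sampled.squeeze(-1).transpose(1, 2)        # (B, K, C) compact token set

feat = torch.randn(1, 256, 16, 16)
coords = torch.rand(1, 24, 2) * 2 - 1                 # 24 kept tokens (<10% of 256 patches)
print(resample_tokens(feat, coords).shape)            # torch.Size([1, 24, 256])
```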
Tags: deployment, manipulation, vla
- Nautilus: From One Prompt to Plug-and-Play Robot Learning | 2605.11665 | 5/12/2026 | Yufeng Jin, Jianfei Guo, Xiaogang Jia, Yu Deng …
Robot learning research is fragmented across policy families, benchmark suites, and real robots; each implementation is entangled with the others in a complex combination matrix, making it an engineering nightmare to port any single element. General-purpose coding agents may occasionally bridge specific setups, but cannot close this gap at scale because they lack the procedural priors and validation practices that characterize robotics research workflows. We propose NAUTILUS, an open-source harness that turns a single user prompt -- for example, "Evaluate policy A with benchmark B" -- into ready-to-use reproduction, evaluation, fine-tuning, and deployment workflows. NAUTILUS provides: plug-and-play agent skill sets with distilled priors from robotics research; typed contracts among policies, simulators/benchmarks, and real-world robots; unified interfaces and execution environments; and a trustworthy agentic coding workflow with explicit, automated validation, and testing at each milestone. NAUTILUS can not only automatically generate the required adapters and containers for existing implementations, but also wrap and onboard new or user-provided policies, simulators/benchmarks, and robots, all connected via a uniform interface. This expands cross-validation coverage without hand-written glue code. Like a nautilus shell that grows by adding chambers, NAUTILUS scales by extending its execution in chambered units, making it a research harness for scalability rather than a hand-curated framework, and aiming to reduce the engineering burden of cross-family reproduction and evaluation in the ever-growing robot learning ecosystem.
Tags: rl, deployment
- RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning | 2605.11564 | 5/12/2026 | Pablo Ortega-Kral, Eliot Xing, Arthur Bucker, Vernon Luk …
Despite recent efforts to collect multi-task, multi-embodiment datasets, to design recipes for training Vision-Language-Action models (VLAs), and to showcase these models on different robot platforms, generalist cross-embodiment robot capabilities remain a largely elusive ideal. Progress is limited by fragmented infrastructure: most robot code is highly specific to the exact setup the user decided on, which adds major overhead when attempting to reuse, recycle, or share artifacts between users. We present RIO (Robot I/O), an open source Python framework that provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across diverse hardware platforms and morphologies. RIO provides abstractions that enable users to make any choice and to switch between them, with minimal reconfiguration effort. We validate RIO on VLA deployment workflows across three morphologies (single-arm, bimanual, humanoid) and four hardware platforms with varying grippers and cameras. Using teleoperated data collected with RIO, we fine-tune state-of-the-art VLAs including $π_{0.5}$ and GR00T on household tasks such as pick-and-place, folding, and bowl scrubbing. By open sourcing all our efforts, we hope the community can accelerate their pace of robot learning on real-world robot hardware. Additional details at: https://robot-i-o.github.io
Tags: rl, deployment, manipulation, groot, humanoid, vla
- Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation | 2605.11479 | 5/11/2026 | Hao Wang, Joshua Bowden, Colton Crosby, Somil Bansal
Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, the task progression of evaluation rollouts is often non-monotonic as policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods that rely on Bellman equations and the principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and a cloth folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
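One illustrative reading of a liveness-style backup for sparse success signals is the backward recursion V_t = max(success_t, gamma * V_{t+1}), which stays conservative when a rollout is truncated; this is an interpretation of the abstract, not the paper's exact operator.

```python
# Hedged sketch of a liveness-style value backup over a finite rollout.
def liveness_values(success_flags, gamma=0.99):
    V, nxt = [0.0] * len(success_flags), 0.0
    for t in reversed(range(len(success_flags))):
        # value = best discounted chance of reaching success later in the rollout
        V[t] = max(float(success_flags[t]), gamma * nxt)
        nxt = V[t]
    return V

print(liveness_values([0, 0, 0, 1, 0]))   # values rise toward the success step
```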
Tags: rl, deployment, manipulation
- MAGS-SLAM: Monocular Multi-Agent Gaussian Splatting SLAM for Geometrically and Photometrically Consistent Reconstruction | 2605.10760 | 5/11/2026 | Zhihao Cao, Qi Shao, Shuhao Zhai, Jing Zhang …
Collaborative photorealistic 3D reconstruction from multiple agents enables rapid large-scale scene capture for virtual production and cooperative multi-robot exploration. While recent 3D Gaussian Splatting (3DGS) SLAM algorithms can generate high-fidelity real-time mapping, most of the existing multi-agent Gaussian SLAM methods still rely on RGB-D sensors to obtain metric depth and simplify cross-agent alignment, which limits deployment on lightweight, low-cost, or power-constrained robotic platforms. To address this challenge, we propose MAGS-SLAM, the first RGB-only multi-agent 3DGS SLAM framework for collaborative scene reconstruction. Each agent independently builds local monocular Gaussian submaps and transmits compact submap summaries rather than raw observations or dense maps. To facilitate robust collaboration in the presence of monocular scale ambiguity, our framework integrates compact submap communication, geometry- and appearance-aware loop verification, and occupancy-aware Gaussian fusion, enabling coherent global reconstruction without active depth sensors. We further introduce the ReplicaMultiagent Plus benchmark for evaluating collaborative Gaussian SLAM. Intensive experiments on synthetic and real-world datasets show that MAGS-SLAM achieves competitive tracking accuracy and comparable or superior rendering quality to state-of-the-art RGB-D collaborative Gaussian SLAM methods while relying only on RGB images.
Tags: rendering, deployment, perception, multi-agent
- ObjView-Bench: Rethinking Difficulty and Deployment for Object-Centric View Planning | 2605.10707 | 5/11/2026 | Sicong Pan, Hao Hu, Xuying Huang, Benno Wingender …
Object-centric view planning is a core component of active geometric 3D reconstruction in robotics, yet existing evaluations often conflate object complexity, planning difficulty, budget assumptions, and physical reachability constraints. As a result, conclusions drawn from idealized view-planning evaluations may not reliably predict performance under realistic reconstruction settings. We introduce ObjView-Bench, an evaluation framework for rethinking difficulty and deployment in object-centric view planning. First, we disentangle three quantities underlying view-planning evaluation: omnidirectional self-occlusion as an object-side attribute, observation saturation difficulty, and protocol-dependent planning difficulty defined through a set-cover formulation. This separation supports controlled dataset construction, analysis of slow-saturation objects, and a case study showing that planning difficulty-aware sampling can improve learned view planners. Second, we design deployment-oriented evaluation protocols that reveal how budget regimes and reachable-view constraints alter method behavior. Across classical, learned, and hybrid planners, ObjView-Bench shows that difficulty, budget, and reachability constraints substantially change method rankings and failure modes.
Tags: deployment
- Embodied AI in Action: Insights from SAE World Congress 2026 on Safety, Trust, Robotics, and Real-World Deployment | 2605.10653 | 5/11/2026 | Jan-Mou Li, Paul Schmitt, Wei Tong, Majed Mohammed …
Embodied artificial intelligence is rapidly moving from research into real-world systems such as autonomous vehicles, mobile robots, and industrial machines. As these systems become more capable of perceiving, deciding, and acting in dynamic environments, they also introduce new challenges in safety, trust, governance, and operational reliability. This white paper summarizes key insights from the SAE World Congress 2026 panel session "Embodied AI in Action", which brought together experts from automotive, robotics, artificial intelligence, and safety engineering. The discussion highlighted the need to treat embodied AI as a systems challenge requiring engineering rigor, lifecycle governance, human-centered design, and evolving standards. The paper provides practical perspectives for executives, policymakers, and technical leaders seeking to adopt embodied AI responsibly. The panel reached broad agreement that long-term success will depend not only on advances in AI capability, but equally on safe and trustworthy deployment.
Tags: deployment
- Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation | 2605.10457 | 5/11/2026 | Rabin Gajmer, Joonas Haapala, Zoltan Beck
Real-time Light Detection And Ranging (LiDAR) simulation must find, per emitted ray, the closest intersecting triangle even in dynamic scenes containing large numbers of moving and deformable objects. Dominant acceleration-structure approaches require rebuilding each frame for dynamic geometry -- a cost that compounds directly with scene dynamics and cannot be amortized regardless of how little actually changed. This paper presents the Gajmer Ray-Casting Algorithm (GRCA), which inverts the question: instead of asking what does each ray hit? it asks which rays can each triangle possibly hit? GRCA geometrically models spinning LiDAR emitters as rotation-traced cones or planes and uses each triangle's emitter-centric apparent area to cull, per triangle, which channels and the rays within those channels can possibly reach it -- without any acceleration structure. GRCA is compute-based and vendor-agnostic by design, targeting highly dynamic, high-resolution simultaneous multi-sensor simulation. At its core, GRCA is a general-purpose ray-casting algorithm: the emitter-centric inversion applies to any setting where rays originate from a known position, not only LiDAR. Benchmarks evaluate 2-8 simultaneous 128x4096-ray LiDARs (360deg/180deg) over complex dynamic scenes -- with just two sensors casting ~1M rays per frame. With range culling inactive, GRCA reaches up to 7.97x over hardware-accelerated OptiX (GPU) and 14.55x over Embree (CPU). Two independent extensions further boost performance even in the most complex scene (~22M triangles, ~9M of which are dynamic, 8 LiDARs): range culling at realistic deployment ranges (10-100m) reaches up to 7.02x GPU and 9.33x CPU; a hybrid pipeline -- GRCA for dynamic geometry, OptiX/Embree for static -- reaches up to 10.5x GPU and 19.2x CPU.
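The emitter-centric inversion can be sketched per triangle: compute the triangle's apparent elevation span as seen from the emitter and keep only the LiDAR channels that fall inside it; the channel layout and vertex-bounding simplification below are assumptions, not GRCA's actual culling tests.

```python
# Sketch of per-triangle channel culling for a spinning multi-beam LiDAR.
import numpy as np

def candidate_channels(tri, emitter, elev_angles_deg):
    v = tri - emitter                                   # (3, 3) vertex offsets
    elev = np.degrees(np.arctan2(v[:, 2], np.linalg.norm(v[:, :2], axis=1)))
    lo, hi = elev.min(), elev.max()                     # apparent elevation span
    return [i for i, a in enumerate(elev_angles_deg) if lo <= a <= hi]

channels = np.linspace(-25, 15, 128)                    # 128-beam spinning LiDAR (assumed)
tri = np.array([[10.0, 0.0, 0.0], [10.0, 1.0, 0.0], [10.0, 0.5, 1.0]])
print(candidate_channels(tri, emitter=np.zeros(3), elev_angles_deg=channels))
```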
Tags: deployment, sensors
- Increasing the Efficiency of DETR for Maritime High-Resolution Images | 2605.10269 | 5/11/2026 | Tinsae Yehuala, Hao Cheng, Ville Lehtola
Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.
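Background token pruning of the kind described can be sketched as scoring each token and keeping only the top-k before the decoder; the scoring function here is a placeholder, not the paper's learned head.

```python
# Keep the highest-scoring tokens and drop likely background regions.
import torch

def prune_tokens(tokens, scores, keep_ratio=0.5):
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                       # (B, k) kept positions
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

tok = torch.randn(2, 1024, 256)                               # (B, N, C) tokenized image
kept = prune_tokens(tok, scores=tok.norm(dim=-1))             # placeholder saliency score
print(kept.shape)                                             # torch.Size([2, 512, 256])
```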
Tags: deployment, locomotion, perception
- Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation | 2605.10210 | 5/11/2026 | Federico Pizzolato, Francesco Pasti, Nicola Bellotto
Terrain segmentation is a fundamental capability for autonomous mobile robots operating in unstructured outdoor environments. However, state-of-the-art models are incompatible with the memory and compute constraints typical of microcontrollers, limiting scalable deployment in small robotics platforms. To address this gap, we develop a complete framework for robust binary terrain segmentation on a low-cost microcontroller. At the core of our approach we design Nano-U, a highly compact binary segmentation network with a few thousand parameters. To compensate for the network's minimal capacity, we train Nano-U via Quantization-Aware Distillation (QAD), combining knowledge distillation and quantization-aware training. This allows the final quantized model to achieve excellent results on the Botanic Garden dataset and to perform very well on TinyAgri, a custom agricultural field dataset with more challenging scenes. We deploy the quantized Nano-U on a commodity microcontroller by extending MicroFlow, a compiler-based inference engine for TinyML implemented in Rust. By eliminating interpreter overhead and dynamic memory allocation, the quantized model executes on an ESP32-S3 with a minimal memory footprint and low latency. This compiler-based execution demonstrates a viable and energy-efficient solution for perception on low-cost robotic platforms.
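A sketch of Quantization-Aware Distillation: a fake-quantized (straight-through) student is supervised by both the ground-truth mask and a float teacher. The loss weighting, distillation term, and quantizer are illustrative assumptions, not Nano-U's training recipe.

```python
# Illustrative QAD loss with straight-through fake quantization.
import torch
import torch.nn.functional as F

def fake_quant(x, bits=8):
    # quantize on the forward pass, pass gradients straight through
    scale = x.detach().abs().max() / (2 ** (bits - 1) - 1)
    q = torch.clamp(torch.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    return x + (q - x).detach()

def qad_loss(student_logits, teacher_logits, mask, alpha=0.5):
    hard = F.binary_cross_entropy_with_logits(student_logits, mask)   # ground-truth term
    soft = F.mse_loss(student_logits, teacher_logits)                 # distillation term
    return (1 - alpha) * hard + alpha * soft

s = fake_quant(torch.randn(2, 1, 64, 64, requires_grad=True))         # quantized student output
t = torch.randn(2, 1, 64, 64)                                         # float teacher output
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
qad_loss(s, t, mask).backward()
```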
Tags: deployment, perception