Search — rl
Issues
25 matches

- github:google-deepmind/mujoco | 5/14/2026 | rendering
Request to add native point cloud import and visualization (e.g., PLY/PCD) to the MuJoCo viewer as an efficient vertex rendering overlay. Current workaround renders each vertex as a sphere, which is inefficient for ground-truth verification.
tags: mujoco, point-cloud, pcd, ply, viewer, debugging, rendering

- Isaac Sim 5.1.0 crashes shortly after startup on Windows Server 2025 with RTX Pro 6000 Blackwell [Blocker] | nvidia-forum:simulation | 5/13/2026 | crashes-stability
Report that Isaac Sim 5.1.0 crashes shortly after startup on Windows Server 2025 with an RTX Pro 6000 Blackwell GPU. User cannot run the simulator in this environment.
tags: isaac-sim, 5-1-0, windows-server-2025, rtx-pro-6000, blackwell, startup-crash, drivers

- github:newton-physics/newton | 5/13/2026 | integration
Newton’s `add_joint_free()` allows parent bodies other than the world, but MuJoCo requires the parent to be the world. They want to manage the discrepancy with a warning (sketched below) to avoid confusing behavior differences.
tags: newton-physics, mujoco, api-compat, joints, migration, warnings
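A minimal sketch of the warning this issue asks for, assuming a sentinel-style world id; the helper name, sentinel value, and call site below are illustrative, not Newton's actual add_joint_free() signature:

    import warnings

    WORLD = -1  # illustrative sentinel for "parent is the world body"

    def check_free_joint_parent(parent: int) -> None:
        """Warn when a free joint is parented to a non-world body."""
        if parent != WORLD:
            warnings.warn(
                "free joint parented to a non-world body; MuJoCo requires "
                "the parent to be the world, so the two engines will "
                "disagree on this model",
                UserWarning,
                stacklevel=2,
            )

    check_free_joint_parent(parent=3)  # emits the warning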

- github:newton-physics/newton | 5/13/2026 | training-infra
Newton wants reusable automation to run a release-candidate validation matrix (OS, Python, CUDA/driver, etc.) for sign-off. Today this is tracked and run manually; they want a consistent launch and reporting workflow for RC branches/tags/commits (sketched below).
tags: newton-physics, release-engineering, rc-testing, test-matrix, cuda, drivers, automation
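One way to make such a matrix reproducible is to expand it programmatically and push every cell through the same launcher and report path; a minimal sketch with made-up matrix values and a hypothetical rc_validate.py entry point, not the project's actual tooling:

    import itertools
    import subprocess

    # Illustrative dimensions; the real sign-off matrix lives with the project.
    MATRIX = {
        "os": ["ubuntu-22.04", "windows-2022"],
        "python": ["3.10", "3.12"],
        "cuda": ["12.4", "12.8"],
    }

    def run_cell(ref: str, cell: dict) -> bool:
        # rc_validate.py is a hypothetical per-cell entry point.
        cmd = ["python", "rc_validate.py", "--ref", ref]
        for key, value in cell.items():
            cmd += [f"--{key}", value]
        return subprocess.run(cmd).returncode == 0

    def run_matrix(ref: str) -> None:
        keys = list(MATRIX)
        for values in itertools.product(*MATRIX.values()):
            cell = dict(zip(keys, values))
            print(cell, "->", "PASS" if run_cell(ref, cell) else "FAIL")

    run_matrix("v1.2.0-rc1")  # RC branch/tag/commit under validation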

- github:newton-physics/newton | 5/13/2026 | rendering
The cloth_franka example simulates in centimeters but only partially converts data back to meters for visualization. Debug overlays like COM markers and joint/contact arrows appear meters away due to a cm→m mismatch (conversion sketch below).
tags: rendering, newton
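The usual fix pattern is to route every world-space debug quantity through one conversion point before it reaches the renderer; a self-contained sketch of the cm→m scaling (names are generic, not the example's actual code):

    CM_TO_M = 0.01  # simulation units (cm) -> render units (m)

    def to_render_units(points_cm):
        """Scale world-space debug points from cm to m for visualization."""
        return [tuple(c * CM_TO_M for c in p) for p in points_cm]

    # If COM markers or contact arrows skip this conversion while the mesh
    # is converted, they land ~100x too far from the origin, as reported.
    print(to_render_units([(120.0, 0.0, 35.0)]))  # [(1.2, 0.0, 0.35)]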

- github:isaac-sim/IsaacLab | 5/13/2026 | crashes-stability
TacSL force-field readings in an Isaac Lab demo do not increase smoothly with stepped applied normal force and instead appear irregular. The user reports Isaac Lab main branch with Isaac Sim 5.1.
tags: crash, usd, rendering, sensors, isaac-sim, isaac-lab

- nvidia-forum:robotics-edge-computing | 5/13/2026 | hardware-integration
A USB microphone reportedly cannot record properly on the Jetson Thor platform. No additional details are provided in the post.
tags: jetson, thor, usb-audio, microphone, multimodal

- github:newton-physics/newton | 5/13/2026 | crashes-stability
A dexterous hand imported via URDF fails to grasp and lift a bottle; the object slides and remains unliftable. The same bottle can be lifted using a Franka example, suggesting contact/friction or grasp modeling differences for the hand.
tags: crash, usd, rendering, manipulation, isaac-lab, newton

- github:newton-physics/newton | 5/13/2026 | crashes-stability
A dexterous hand imported via URDF cannot grasp a bottle reliably; the bottle slides and cannot be lifted. The reporter notes the Franka example can lift the same object, implying a hand-specific contact/friction issue.
tags: crash, usd, rendering, hardware, manipulation, isaac-lab, newton, warp

- nvidia-forum:simulation | 5/12/2026 | other
MultiMeshRayCaster outputs (pos_w, ray_hits_w) reportedly fail to update after dynamic teleportation in _apply_action, which suggests stale raycast data after state changes; a force-refresh workaround sketch follows.
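A workaround worth testing is to force a sensor refresh right after the teleport; a fragment assuming Isaac Lab's SensorBase-style update(dt, force_recompute=...) API, with hypothetical handle names, to be verified against the installed version:

    # Inside the task's _apply_action(), after writing the teleported pose
    # back to simulation (handle names here are hypothetical):
    self.robot.write_root_pose_to_sim(teleport_pose)

    # Ray-cast sensors cache their last result between updates; request a
    # recompute so pos_w / ray_hits_w reflect the new pose, not stale buffers.
    self.scene.sensors["multi_mesh_ray_caster"].update(
        self.physics_dt, force_recompute=True
    )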

- github:newton-physics/newton | 5/12/2026 | other
Newton's scheduled nightly workflow failed specifically in the Warp nightly test suite while other suites passed. The issue references the failing GitHub Actions runs and logs.
tags: newton, warp

- github:isaac-sim/IsaacLab | 5/12/2026 | asset-pipeline
According to the report, relative texture paths do not work in the IsaacLab Beta, even when the image is in the same folder as the USD file. Loading via IsaacLab code triggers errors.
tags: usd, rendering, hardware, isaac-sim, isaac-lab

- nvidia-forum:robotics-edge-computing | 5/12/2026 | hardware-integration
User reports a first-boot issue with a Jetson Orin Nano dev kit. This blocks initial setup.
tags: jetson, orin-nano, devkit, first-boot, setup

- Xavier NX new DRAM modules won’t flash/boot with L4T 32.7.1 — request PCN overlay / recommended path [Blocker] | nvidia-forum:robotics-edge-computing | 5/12/2026 | hardware-integration
Xavier NX with new DRAM modules won’t flash/boot with L4T 32.7.1; user requests PCN overlay or recommended path. This blocks bringing up new modules on existing software.
tags: jetson, xavier-nx, dram, l4t-32-7-1, flash, boot

- github:isaac-sim/IsaacLab | 5/12/2026 | training-infra
Proposal to integrate DiffRL into IsaacLab via an isaaclab_diffrl extension and the Mineral algorithm library. It targets the Direct workflow and Newton (Warp) backend to enable end-to-end backprop through physics for algorithms like SHAC and APG/BPTT (toy sketch below).
tags: rl, integration, isaac-sim, isaac-lab, newton, warp
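For context, the SHAC/APG-style objective this proposal targets differentiates a short-horizon return through the simulator itself; a toy PyTorch sketch under that assumption, with a linear stand-in for the physics step (nothing here is the proposed isaaclab_diffrl API):

    import torch

    def reward(state, action):
        # Toy reward: stay near the origin with small actions.
        return -(state.pow(2).sum() + 0.01 * action.pow(2).sum())

    def rollout_loss(policy, dynamics, state, horizon=16):
        """Short-horizon analytic policy gradient (APG/BPTT-style):
        gradients flow through every simulated state into the policy."""
        total = 0.0
        for _ in range(horizon):
            action = policy(state)
            state = dynamics(state, action)   # differentiable "physics" step
            total = total + reward(state, action)
        return -total                          # minimize negative return

    policy = torch.nn.Linear(4, 2)
    dynamics = lambda s, a: s + 0.1 * torch.tanh(torch.cat([a, a], dim=-1))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

    loss = rollout_loss(policy, dynamics, torch.zeros(4))
    loss.backward()                            # end-to-end backprop
    opt.step()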

- github:isaac-sim/IsaacLab | 5/11/2026 | rendering
In a CloudXR + OpenXR setup, frames stream correctly but inbound messages and hand-tracking poses are silently dropped between client and Isaac Sim’s OpenXR plugin. This blocks teleop commands and hand tracking for interactive workflows.
tags: rendering, hardware, deployment, integration, isaac-sim, isaac-lab

- github:newton-physics/newton | 5/11/2026 | training-infra
Kamino examples were added but are missing required README entries and screenshots on the release branch. This reduces discoverability and fails documented contribution requirements.
tags: rl, manipulation, newton

- github:newton-physics/newton | 5/10/2026 | other
Mouse dragging in cable examples introduces overly large forces, possibly from large torques applied to segments. This makes interactive manipulation unreliable.

- github:newton-physics/newton | 5/10/2026 | other
Request to assign a uniform color to cable segments by default because current visualization is overly colorful. This is a usability/visual quality improvement.

- github:newton-physics/newton | 5/10/2026 | asset-pipeline
Running the full Newton test suite occasionally triggers out-of-bounds errors, seen in some CUDA example tests. It is hard to reproduce in isolation, suggesting an order- or resource-dependent issue.
tags: usd, hardware, newton, warp

- github:NVIDIA/warp | 5/10/2026 | hardware-integration
Some array-rooted composite augmented assignments and matrix row writes either fail codegen or silently drop gradients on current Warp main. This breaks differentiable programs that rely on these write patterns; a repro-style sketch follows.
tags: hardware, warp
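A repro-style sketch of the write pattern at issue, using standard Warp tape usage to check whether gradients survive an in-kernel augmented assignment; the exact failing cases are enumerated in the issue itself:

    import warp as wp

    wp.init()

    @wp.kernel
    def scale_accumulate(x: wp.array(dtype=float), y: wp.array(dtype=float)):
        i = wp.tid()
        # Array-rooted composite augmented assignment: the kind of write
        # reported to fail codegen or silently drop gradients.
        y[i] += 2.0 * x[i]

    x = wp.array([1.0, 2.0, 3.0], dtype=float, requires_grad=True)
    y = wp.zeros(3, dtype=float, requires_grad=True)

    tape = wp.Tape()
    with tape:
        wp.launch(scale_accumulate, dim=3, inputs=[x, y])

    # Seed dL/dy = 1 and pull gradients back; a correct backward pass
    # should report x.grad == [2, 2, 2].
    tape.backward(grads={y: wp.ones(3, dtype=float)})
    print(x.grad)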

- github:isaac-sim/IsaacLab | 5/9/2026 | training-infra
tags: rl, rendering, hardware, docs, integration, isaac-sim, isaac-lab

- github:isaac-sim/IsaacLab | 5/9/2026 | synthetic-data
tags: synthetic-data, rl, deployment, docs, integration, feature-request, isaac-sim, isaac-lab

- github:isaac-sim/IsaacLab | 5/8/2026 | other
tags: isaac-sim, isaac-lab

- github:isaac-sim/IsaacSim | 5/8/2026 | training-infra
tags: rl, usd, rendering, hardware, deployment, locomotion, isaac-sim, unitree

Papers
25 matches

- Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction | arXiv:2605.15157 | 5/14/2026 | Zhuohang Li, Liqun Huang, Wei Xu, Zhengming Zhu …
Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human takeover data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the takeover moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with direct teleoperation takeover, HandITL reduces takeover jitter by 99.8% and preserves robust post-takeover manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect intervention data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.
tags: rl, manipulation, vla
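The abstract does not spell out HandITL's blending mechanism, but the general idea of removing command jumps at takeover can be illustrated by ramping between policy and human hand commands; a generic sketch only, not the paper's method:

    import numpy as np

    def blended_command(q_policy, q_human, t_since_takeover, ramp_s=0.5):
        """Ramp from the policy's joint targets to the human's over ramp_s
        seconds so the hand configuration never jumps discontinuously.
        Illustrative only; HandITL's actual scheme is in the paper."""
        beta = min(t_since_takeover / ramp_s, 1.0)  # 0 -> policy, 1 -> human
        return (1.0 - beta) * q_policy + beta * q_human

    q_policy = np.zeros(22)                 # e.g., 22-DoF hand joint targets
    q_human = np.full(22, 0.3)
    print(blended_command(q_policy, q_human, t_since_takeover=0.1))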

- Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action | arXiv:2605.15153 | 5/14/2026 | Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding …
We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.
tags: foundation-model

- CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios | arXiv:2605.15122 | 5/14/2026 | Michael Baumgartner, David Müller, Agon Serifi, Ruben Grandia …
Robust state estimation for highly dynamic motion of legged robots remains challenging, especially in dynamic, contact-rich scenarios. Traditional approaches often rely on binary contact states that fail to capture the nuances of partial contact or directional slippage. This paper presents CoCo-InEKF, a differentiable invariant extended Kalman filter that utilizes continuous contact velocity covariances instead of binary contact states. These learned covariances allow the method to dynamically modulate contact confidence, accounting for more nuanced conditions ranging from firm contact to directional slippage or no contact. To predict these covariances for a set of predefined contact candidate points, we employ a lightweight neural network trained end-to-end using a state-error loss. This approach eliminates the need for heuristic ground-truth contact labels. In addition, we propose an automated contact candidate selection procedure and demonstrate that our method is insensitive to their exact placement. Experiments on a bipedal robot demonstrate a superior accuracy-efficiency tradeoff for linear velocity estimation, as well as improved filter consistency compared to baseline methods. This enables the robust execution of challenging motions, including dancing and complex ground interactions, both in simulation and in the real world.
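One way to read the learned-covariance idea: each contact candidate contributes a zero-velocity pseudo-measurement whose noise is predicted rather than gated on a binary flag, so the Kalman gain downweights slipping contacts automatically. A sketch of that measurement model under standard filtering assumptions (notation ours, not the paper's):

    % Zero-velocity pseudo-measurement for contact candidate i:
    % firm contact -> small predicted variance (strong correction);
    % slip or no contact -> large variance (correction suppressed).
    \[
      z_i = v^{(c_i)} = 0 + n_i, \qquad n_i \sim \mathcal{N}(0, R_i), \qquad
      R_i = \operatorname{diag}\big(f_\theta(\text{proprioceptive features})\big)
    \]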

- Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model | arXiv:2605.14950 | 5/14/2026 | Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu …
Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
tags: deployment, manipulation, perception, vla

- Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations | arXiv:2605.14937 | 5/14/2026 | Jonathan Spieler, Angel Villar-Corrales, Sven Behnke
Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
tags: synthetic-data, rl, manipulation, perception, world-model
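Gradient-based MPC through a differentiable world model, as described above, amounts to optimizing an action sequence by backpropagating a planning cost through rollouts; a minimal PyTorch sketch with a toy linear model standing in for the learned slot dynamics:

    import torch

    def plan(dynamics, state, goal, horizon=10, steps=50, lr=0.1):
        """Optimize an action sequence by gradient descent through a
        differentiable dynamics model (toy stand-in for Slot-MPC)."""
        actions = torch.zeros(horizon, 2, requires_grad=True)
        opt = torch.optim.Adam([actions], lr=lr)
        for _ in range(steps):
            s, cost = state, 0.0
            for a in actions:
                s = dynamics(s, a)                   # predicted next state
                cost = cost + (s - goal).pow(2).sum()
            opt.zero_grad()
            cost.backward()
            opt.step()
        return actions.detach()[0]  # execute first action, then replan

    dynamics = lambda s, a: s + 0.1 * a              # toy dynamics
    print(plan(dynamics, torch.zeros(2), torch.tensor([1.0, -0.5])))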

- Chrono-Gymnasium: An Open-Source, Gymnasium-Compatible Distributed Simulation Framework | arXiv:2605.14911 | 5/14/2026 | Bocheng Zou, Harry Zhang, Khailanii Slaton, Jingquan Wang …
High-fidelity physics simulation is essential for closing the sim-to-real gap in robotics and complex mechanical systems. However, the computational overhead of high-fidelity engines often limits their use in data-intensive tasks like Reinforcement Learning (RL) and global optimization. We introduce Chrono-Gymnasium, a distributed computing framework that scales the high-fidelity multi-body dynamics of Project Chrono across large-scale computing clusters. Built upon the Ray framework, Chrono-Gymnasium provides a standardized Gymnasium interface, enabling seamless integration with modern machine learning libraries while providing built-in synchronization and messaging primitives for distributed execution. We demonstrate the framework's capabilities through two distinct case studies: (1) the training of an RL agent for autonomous robotic navigation in complex terrains, and (2) the Bayesian Optimization of a planetary lander's design parameters to ensure landing stability. Our results show that Chrono-Gymnasium reduces wall-clock time for high-fidelity simulations without sacrificing physical accuracy, offering a scalable path for the design and control of complex robotic systems.
tags: rl, env-api, integration
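The point of a standardized Gymnasium interface is that any such environment drives the same control loop; a sketch of that loop where the environment id is hypothetical but the reset/step API is Gymnasium's:

    import gymnasium as gym

    # "ChronoNav-v0" is a made-up id standing in for a Chrono-Gymnasium env.
    env = gym.make("ChronoNav-v0")

    obs, info = env.reset(seed=0)
    done = False
    while not done:
        action = env.action_space.sample()           # replace with a policy
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    env.close()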

- CaMeRL: Collision-Aware and Memory-Enhanced Reinforcement Learning for UAV Navigation in Multi-Scale Obstacle Environments | arXiv:2605.14810 | 5/14/2026 | Hong Hong, Feiyu Liao, Yongheng Liang, Boning Zhang …
In obstacle avoidance navigation of unmanned aerial vehicles (UAVs), variations in obstacle scale have received markedly less attention than obstacle number or density. Existing methods typically extract purely geometric features from single-frame depth observations. Such representations tend to neglect small obstacles and lose spatial context under occlusions caused by large obstacles, leading to noticeable degradation in environments with multi-scale obstacles. To address this issue, we propose CaMeRL, a Collision-aware and Memory-enhanced Reinforcement Learning framework for UAV navigation. The collision-aware latent representation encodes risk-sensitive depth cues to preserve fine-grained obstacle structures, thereby improving sensitivity to small obstacles. The temporal memory module integrates observations across frames, mitigating partial observability caused by large-obstacle occlusions. We evaluate CaMeRL with multi-scale obstacles, including ultra-small and extra-large obstacle settings. Results show that CaMeRL outperforms state-of-the-art baselines across all scales, with success rate gains of 0.48 and 0.28 in the ultra-small and extra-large settings, respectively. More importantly, CaMeRL achieves reliable navigation in cluttered outdoor environments.
tags: crash, rl

- EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding | arXiv:2605.14742 | 5/14/2026 | Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao …
Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
tags: rl

- SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization | arXiv:2605.14704 | 5/14/2026 | Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su …
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
tags: integration

- DSSP: Diffusion State Space Policy with Full-History Encoding | arXiv:2605.14598 | 5/14/2026 | Zhiyuan Guan, Jianshu Hu, Han Fang, Yunpeng Jiang …
Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.
tags: rl, manipulation

- DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration | arXiv:2605.14526 | 5/14/2026 | Shih-Yu Lai, Sung-Han Tien, Jui-I Huang, Yen-Chen Tseng …
Differentiable simulation of soft bodies is a foundation for system identification, trajectory optimization, and Real2Sim transfer. Yet, existing methods such as the differentiable Projective Dynamics (DiffPD) struggle when faced with heterogeneous materials with extreme stiffness contrasts, hyperelasticity under large deformations, and contact-rich interactions, which are common scenarios in the real world. We present DiffPhD, a unified GPU-accelerated differentiable Projective Dynamics framework for heterogeneous materials that tackles these intertwined challenges simultaneously. Our key insight is a careful integration of: (i) stiffness-aware projective weights to embed heterogeneity into the global system; (ii) trust-region eigenvalue filtering lifted to the backward pass for stable hyperelastic gradients and a type-II Anderson Acceleration scheme with dual-gate convergence to stabilize forward iteration under large stiffness contrasts; and (iii) a unified GPU pipeline that reuses a single sparse factor across forward, backward, and contact computations, with stiffness-amplified Rayleigh damping folded into the same factor for heterogeneity-aware dissipation at zero recurring cost. DiffPhD achieves strict gradient accuracy while delivering up to an order-of-magnitude speedup over prior differentiable solvers on heterogeneous, hyperelastic, contact-rich benchmarks. Crucially, this speedup does not come at the cost of stability: DiffPhD remains convergent on stiffness contrasts up to 100x where prior PD solvers degrade. This unlocks end-to-end gradient-based optimization on regimes previously bottlenecked by either solver fragility or per-iteration cost: shell-joint composite creatures, soft characters wielding stiff weapons, and soft-gripper robotic manipulation, all handled within a single forward-backward pass.
tags: rendering, manipulation, integration
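For context, solvers in this family alternate per-constraint local projections with a global linear solve whose matrix stays constant; the classical projective-dynamics global step that stiffness-aware weights w_i would plug into (Bouaziz et al., 2014; the paper's heterogeneous variant may differ):

    % h = time step, M = mass matrix, y = inertial prediction,
    % S_i selectors, A_i/B_i constraint matrices, p_i local projections:
    \[
      \Big(\tfrac{M}{h^2} + \sum_i w_i\, S_i^\top A_i^\top A_i S_i\Big)\, x
      = \tfrac{M}{h^2}\, y + \sum_i w_i\, S_i^\top A_i^\top B_i\, p_i
    \]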

- Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control | arXiv:2605.14417 | 5/14/2026 | Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan …
Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
tags: rl, locomotion, humanoid

- Energy-Efficient Quadruped Locomotion with Compliant Feet | arXiv:2605.14411 | 5/14/2026 | Pramod Pal, Shishir Kolathaya, Ashitava Ghosal
Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In experiments on the developed quadruped, the energy consumption for the intermediate-stiffness spring is lower by ~17% compared to a very stiff or a very flexible spring incorporated in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.
tags: rl, locomotion, integration
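Mechanical energy per meter, the metric quoted above, is commonly computed from positive mechanical joint power integrated over a run; one standard definition (the paper may treat negative work differently):

    % Traversal of length d and duration T; tau_j, qdot_j = joint torque
    % and velocity. Negative (regenerated) work is clipped to zero here.
    \[
      E/d = \frac{1}{d}\int_0^{T} \sum_j \max\big(\tau_j(t)\,\dot q_j(t),\, 0\big)\, dt
    \]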

- Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion | arXiv:2605.14396 | 5/14/2026 | Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis …
Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings, safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes neighboring the ground truth with the same road topology that nevertheless mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80-84% of the time (vs. 97-99% for clean nuScenes), while AdvPatch only 0-9%. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

- Distill: Uncovering the True Intent behind Human-Robot Communication | arXiv:2605.14262 | 5/13/2026 | Ting Li, David Porfirio
As robots become increasingly integrated into everyday environments, intuitive communication paradigms such as natural language and end-user programming have become indispensable for specifying autonomous robot behavior. However, these mechanisms are ineffective at fully capturing user intent: natural language is imprecise and ambiguous, whereas end-user programming can be overly specific. As a result, understanding what users truly mean when they interact with robots remains a central challenge for human-AI communication systems. To address this issue, we propose the Distill approach for human-robot communication interfaces. Given a task specification provided by the user, Distill (1) removes unnecessary steps; (2) generalizes the meaning behind individual steps; and (3) relaxes ordering constraints between steps. We implemented Distill on a web interface and, through a crowdsourcing study, demonstrated its ability to elicit and refine user intent from initial task specifications.

- MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving | arXiv:2605.14201 | 5/13/2026 | Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han …
Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.
tags: rl, multi-agent, vla

- Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation | arXiv:2605.14174 | 5/13/2026 | Qisong He, Xinmiao Huang, Jinwei Hu, Zhuoyun Li …
Safe navigation for mobile robots demands policies that remain reliable under the high-consequence perception uncertainty of cluttered environments. Yet most existing safe reinforcement learning (RL) methods assess safety through average cumulative cost. Such metrics can mask dangerous tail-risk behaviors. To address this, we propose a framework that trains risk-sensitive policies through Conditional Value-at-Risk (CVaR) constrained optimization on an off-policy TD3 backbone and evaluates their safety margins post-training through neural network reachability verification. During training, the policy is optimized under CVaR constraints on cumulative costs, promoting sensitivity to high-cost tail outcomes rather than average behavior alone. After training, we compute action reachable sets under bounded observation uncertainty using Taylor Model analysis, yielding a safety rate metric that quantifies the proportion of evaluated states at which the policy's reachable action set remains within prescribed safety margins. A key finding is that policies trained with CVaR constraints maintain larger safety margins from obstacles across evaluated states. This makes them significantly more amenable to formal reachability verification. Experiments across ten navigation scenarios and six baselines show that our method achieves a 98.3% success rate, the highest safety verification rate among all compared methods, while revealing that average cost rankings and reachability-based safety rankings can diverge. This indicates that reachability verification captures risks which are missed by empirical cost metrics alone. We further validate our approach on a physical Clearpath Jackal robot, demonstrating successful sim-to-real transfer.
tags: rl, perception
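For reference, the CVaR constraint used here penalizes the tail of the cumulative-cost distribution rather than its mean; the standard Rockafellar-Uryasev form at level alpha is:

    % Expected cost in the worst (1 - alpha) tail of the cost Z, with
    % the policy constrained to keep that tail expectation below d:
    \[
      \mathrm{CVaR}_\alpha(Z) = \min_{\nu \in \mathbb{R}}
        \Big\{ \nu + \tfrac{1}{1-\alpha}\, \mathbb{E}\big[(Z-\nu)_+\big] \Big\},
      \qquad \mathrm{CVaR}_\alpha(Z) \le d .
    \]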

- SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection | arXiv:2605.14110 | 5/13/2026 | Sandro Papais, Lezhou Feng, Charles Cossette, Lingting Ge
Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.
tags: perception

- WarmPrior: Straightening Flow-Matching Policies with Temporal Priors | arXiv:2605.13959 | 5/13/2026 | Sinjae Kang, Chanyoung Kim, Kaixin Wang, Li Zhao …
Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.
tags: rl, manipulation
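In flow matching, the policy learns a velocity field along straight interpolants between a source sample and the target action; WarmPrior's change, per the abstract, is only where the source x_0 comes from (notation ours):

    % Linear interpolant and conditional flow-matching regression target:
    \[
      x_t = (1-t)\,x_0 + t\,x_1, \qquad
      \mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
        \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2
    \]
    % Standard choice: x_0 ~ N(0, I). WarmPrior instead draws x_0 from a
    % prior built on recent action history, straightening the x_0 -> x_1
    % paths the learned field must follow.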

- OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation | arXiv:2605.13815 | 5/13/2026 | Youquan Liu, Weidong Yang, Ao Liang, Xiang Xu …
LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.
tags: synthetic-data, locomotion, sensors, perception

- Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs | arXiv:2605.13778 | 5/13/2026 | Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang …
Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
tags: deployment, vla

- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data | arXiv:2605.13775 | 5/13/2026 | Harold Haodong Chen, Sirui Chen, Yingjie Xu, Wenhang Ge …
The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds--a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.
tags: rl, manipulation

- Learning Responsibility-Attributed Adversarial Scenarios for Testing Autonomous Vehicles | arXiv:2605.13751 | 5/13/2026 | Yizhuo Xiao, Haotian Yan, Ying Wang, Zhongpan Zhu …
Establishing trustworthy safety assurance for autonomous driving systems (ADSs) requires evidence that failures arise from avoidable system deficiencies rather than unavoidable traffic conflicts. Current adversarial simulation methods can efficiently expose collisions, but generally lack mechanisms to distinguish these fundamentally different failure modes. Here we present CARS (Context-Aware, Responsibility-attributed Scenario generation), a framework that integrates responsibility attribution directly into adversarial scenario generation. CARS combines context-aware adversary selection with a generative adversarial policy optimized in closed-loop simulation to construct collision scenarios that are both physically feasible and diagnostically attributable. Across benchmark datasets spanning heterogeneous national traffic environments, CARS consistently discovers feasible collision scenarios with high attribution rates under multiple regulation-prescribed careful and competent driver models. By coupling adversarial generation with normative responsibility assessment, CARS moves simulation testing beyond collision discovery toward the construction of interpretable, regulation-aligned safety evidence for scalable ADS validation.
tags: crash, rl, hardware

- TinySDP: Real Time Semidefinite Optimization for Certifiable and Agile Edge Robotics | arXiv:2605.13748 | 5/13/2026 | Ishaan Mahajan, Jon Arrizabalaga, Andrea Grillo, Fausto Vega …
Semidefinite programming (SDP) provides a principled framework for convex relaxations of nonconvex geometric constraints in motion planning, yet existing solvers are too computationally expensive for real-time control, particularly on resource-constrained embedded systems. To address this gap, we introduce TinySDP, the first semidefinite programming solver designed for embedded systems, enabling real-time model-predictive control (MPC) on microcontrollers for problems with nonconvex obstacle constraints. Our approach integrates positive-semidefinite cone projections into a cached-Riccati-based ADMM solver, leveraging computational structure for embedded tractability. We pair this solver with an a posteriori rank-1 certificate that converts relaxed solutions into explicit geometric guarantees at each timestep. On challenging benchmarks, e.g., cul-de-sac and dynamic obstacle avoidance scenarios that induce failures in local methods, TinySDP achieves collision-free navigation with up to 73% shorter paths than state-of-the-art baselines. We validate our approach on a Crazyflie quadrotor, demonstrating that semidefinite constraints can be enforced at real-time rates for agile embedded robotics.
tags: crash, rl, deployment
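The PSD-cone projection inside such an ADMM iteration is an eigendecomposition with negative eigenvalues clipped to zero; a minimal NumPy sketch of that step (the solver's embedded, cached-factor implementation will differ):

    import numpy as np

    def project_psd(X):
        """Project a symmetric matrix onto the PSD cone."""
        X = 0.5 * (X + X.T)                    # symmetrize against round-off
        w, V = np.linalg.eigh(X)               # eigendecompose
        return (V * np.maximum(w, 0.0)) @ V.T  # clip and reassemble

    X = np.array([[1.0, 2.0], [2.0, -1.0]])    # indefinite example
    print(np.linalg.eigvalsh(project_psd(X)))  # eigenvalues now >= 0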

- Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels | arXiv:2605.13665 | 5/13/2026 | Amir Hossain Raj, Dibyendu Das, Xuesu Xiao
Quadruped robots demonstrate exceptional potential for navigating complex terrain in critical applications such as search and rescue missions and infrastructure inspection. However, autonomous traversal of confined 3D environments, including tunnels, caves, and collapsed structures, remains a significant challenge. Existing methods often struggle with rigid gait patterns, limited adaptability to diverse geometries, and reliance on oversimplified environmental assumptions. This paper introduces a Reinforcement Learning (RL) framework that combines procedural environment generation with policy distillation to enable robust locomotion across various tunnel configurations. Our approach leverages a teacher-student training paradigm where specialized expert policies, trained on procedurally generated tunnel geometries, transfer their knowledge to a unified student policy. This strategy eliminates the need for complex reward shaping in end-to-end RL training, simplifying the process by breaking down complicated tasks into smaller, more manageable components that are easier for the robot to learn. By synthesizing diverse tunnel structures during training and distilling navigation strategies into a generalizable policy, our method achieves consistent traversal across complex spatial constraints where conventional approaches fail. We demonstrate through both simulation and real-world experiments that our method enables quadruped robots to successfully traverse challenging confined tunnel environments.
tags: rl, locomotion