new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 8

The Impact of Environment Configurations on the Stability of AI-Enabled Systems

Nowadays, software systems tend to include Artificial Intelligence (AI) components. Changes in the operational environment have been known to negatively impact the stability of AI-enabled software systems by causing unintended changes in behavior. However, how an environment configuration impacts the behavior of such systems has yet to be explored. Understanding and quantifying the degree of instability caused by different environment settings can help practitioners decide the best environment configuration for the most stable AI systems. To achieve this goal, we performed experiments with eight different combinations of three key environment variables (operating system, Python version, and CPU architecture) on 30 open-source AI-enabled systems using the Travis CI platform. We determine the existence and the degree of instability introduced by each configuration using three metrics: the output of an AI component of the system (model performance), the time required to build and run the system (processing time), and the cost associated with building and running the system (expense). Our results indicate that changes in environment configurations lead to instability across all three metrics; however, it is observed more frequently with respect to processing time and expense rather than model performance. For example, between Linux and MacOS, instability is observed in 23\%, 96.67\%, and 100\% of the studied projects in model performance, processing time, and expense, respectively. Our findings underscore the importance of identifying the optimal combination of configuration settings to mitigate drops in model performance and reduce the processing time and expense before deploying an AI-enabled system.

  • 5 authors
·
Aug 5, 2024

Dynamical Linear Bandits

In many real-world sequential decision-making problems, an action does not immediately reflect on the feedback and spreads its effects over a long time frame. For instance, in online advertising, investing in a platform produces an instantaneous increase of awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on: how fast the awareness grows, its vanishing effects, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with the possibility of delayed and aggregated feedback, without a particular structure on how an action propagates in the future, disregarding possible dynamical effects. In this paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an extension of the linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to linear dynamics, affected by the performed action too. We start by introducing the setting, discussing the notion of optimal policy, and deriving an expected regret lower bound. Then, we provide an optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers an expected regret of order mathcal{O} Big( d sqrt{T}{(1-rho)^{3/2}} Big), where rho is a measure of the stability of the system, and d is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines.

  • 3 authors
·
Nov 16, 2022

Questioning the Stability of Visual Question Answering

Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.

  • 3 authors
·
Nov 14, 2025

Disengagement Cause-and-Effect Relationships Extraction Using an NLP Pipeline

The advancement in machine learning and artificial intelligence is promoting the testing and deployment of autonomous vehicles (AVs) on public roads. The California Department of Motor Vehicles (CA DMV) has launched the Autonomous Vehicle Tester Program, which collects and releases reports related to Autonomous Vehicle Disengagement (AVD) from autonomous driving. Understanding the causes of AVD is critical to improving the safety and stability of the AV system and provide guidance for AV testing and deployment. In this work, a scalable end-to-end pipeline is constructed to collect, process, model, and analyze the disengagement reports released from 2014 to 2020 using natural language processing deep transfer learning. The analysis of disengagement data using taxonomy, visualization and statistical tests revealed the trends of AV testing, categorized cause frequency, and significant relationships between causes and effects of AVD. We found that (1) manufacturers tested AVs intensively during the Spring and/or Winter, (2) test drivers initiated more than 80% of the disengagement while more than 75% of the disengagement were led by errors in perception, localization & mapping, planning and control of the AV system itself, and (3) there was a significant relationship between the initiator of AVD and the cause category. This study serves as a successful practice of deep transfer learning using pre-trained models and generates a consolidated disengagement database allowing further investigation for other researchers.

  • 3 authors
·
Nov 5, 2021

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues at inference time, such as hallucinations, content skipping or speech repetitions. In this work, we introduce a new self-supervised Voice Conversion (VC) architecture which can be used to learn to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, creating speaker-disentangled representations. Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and the style of the speech only from the text, similarly to humans, while the speaker identity is provided by the decoder of the VC model. Results show that LLMs trained over speaker-disentangled self-supervised representations provide an improvement of 4.7pp in speaker similarity over SOTA entangled representations, and a word error rate (WER) 5.4pp lower. Furthermore, they achieve higher naturalness than human recordings of the LibriTTS test-other dataset. Finally, we show that using explicit reference embedding negatively impacts intelligibility (stability), with WER increasing by 14pp compared to the model that only uses text to infer the style.

  • 9 authors
·
Feb 5, 2024

Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored. i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7\% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.

  • 6 authors
·
Dec 15, 2025

YOCO: You Only Calibrate Once for Accurate Extrinsic Parameter in LiDAR-Camera Systems

In a multi-sensor fusion system composed of cameras and LiDAR, precise extrinsic calibration contributes to the system's long-term stability and accurate perception of the environment. However, methods based on extracting and registering corresponding points still face challenges in terms of automation and precision. This paper proposes a novel fully automatic extrinsic calibration method for LiDAR-camera systems that circumvents the need for corresponding point registration. In our approach, a novel algorithm to extract required LiDAR correspondence point is proposed. This method can effectively filter out irrelevant points by computing the orientation of plane point clouds and extracting points by applying distance- and density-based thresholds. We avoid the need for corresponding point registration by introducing extrinsic parameters between the LiDAR and camera into the projection of extracted points and constructing co-planar constraints. These parameters are then optimized to solve for the extrinsic. We validated our method across multiple sets of LiDAR-camera systems. In synthetic experiments, our method demonstrates superior performance compared to current calibration techniques. Real-world data experiments further confirm the precision and robustness of the proposed algorithm, with average rotation and translation calibration errors between LiDAR and camera of less than 0.05 degree and 0.015m, respectively. This method enables automatic and accurate extrinsic calibration in a single one step, emphasizing the potential of calibration algorithms beyond using corresponding point registration to enhance the automation and precision of LiDAR-camera system calibration.

  • 4 authors
·
Jul 25, 2024

TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning

Travel planning is a natural real-world task to test large language models (LLMs) planning and tool-use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of clear evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for fully real-world travel planning. We collect user queries, user profile and tools from real scenarios, and construct three subtasks-Single-Turn, Multi-Turn, and Unsolvable-to evaluate agent's three core capabilities in real settings: (1) solving problems autonomously, (2) interacting with users over multiple turns to refine requirements, and (3) recognizing the limits of own abilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. Agents can combine these tools to solve most practical travel planning problems, and our systematic verification demonstrates the stability of the proposed benchmark. We further evaluate multiple LLMs on TravelBench and conduct an in-depth analysis of their behaviors and performance. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning.\footnote{Our code and data will be available after internal review.

  • 7 authors
·
Dec 27, 2025

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a ``specialization tax,'' exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations~(e.g., paraphrasing, typos) and malicious adversarial attacks~(e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.

The Apache Point Observatory Galactic Evolution Experiment (APOGEE) Spectrographs

We describe the design and performance of the near-infrared (1.51--1.70 micron), fiber-fed, multi-object (300 fibers), high resolution (R = lambda/delta lambda ~ 22,500) spectrograph built for the Apache Point Observatory Galactic Evolution Experiment (APOGEE). APOGEE is a survey of ~ 10^5 red giant stars that systematically sampled all Milky Way populations (bulge, disk, and halo) to study the Galaxy's chemical and kinematical history. It was part of the Sloan Digital Sky Survey III (SDSS-III) from 2011 -- 2014 using the 2.5 m Sloan Foundation Telescope at Apache Point Observatory, New Mexico. The APOGEE-2 survey is now using the spectrograph as part of SDSS-IV, as well as a second spectrograph, a close copy of the first, operating at the 2.5 m du Pont Telescope at Las Campanas Observatory in Chile. Although several fiber-fed, multi-object, high resolution spectrographs have been built for visual wavelength spectroscopy, the APOGEE spectrograph is one of the first such instruments built for observations in the near-infrared. The instrument's successful development was enabled by several key innovations, including a "gang connector" to allow simultaneous connections of 300 fibers; hermetically sealed feedthroughs to allow fibers to pass through the cryostat wall continuously; the first cryogenically deployed mosaic volume phase holographic grating; and a large refractive camera that includes mono-crystalline silicon and fused silica elements with diameters as large as ~ 400 mm. This paper contains a comprehensive description of all aspects of the instrument including the fiber system, optics and opto-mechanics, detector arrays, mechanics and cryogenics, instrument control, calibration system, optical performance and stability, lessons learned, and design changes for the second instrument.

  • 89 authors
·
Feb 3, 2019

Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.

  • 8 authors
·
Nov 6, 2025

Hybrid Systems Neural Control with Region-of-Attraction Planner

Hybrid systems are prevalent in robotics. However, ensuring the stability of hybrid systems is challenging due to sophisticated continuous and discrete dynamics. A system with all its system modes stable can still be unstable. Hence special treatments are required at mode switchings to stabilize the system. In this work, we propose a hierarchical, neural network (NN)-based method to control general hybrid systems. For each system mode, we first learn an NN Lyapunov function and an NN controller to ensure the states within the region of attraction (RoA) can be stabilized. Then an RoA NN estimator is learned across different modes. Upon mode switching, we propose a differentiable planner to ensure the states after switching can land in next mode's RoA, hence stabilizing the hybrid system. We provide novel theoretical stability guarantees and conduct experiments in car tracking control, pogobot navigation, and bipedal walker locomotion. Our method only requires 0.25X of the training time as needed by other learning-based methods. With low running time (10-50X faster than model predictive control (MPC)), our controller achieves a higher stability/success rate over other baselines such as MPC, reinforcement learning (RL), common Lyapunov methods (CLF), linear quadratic regulator (LQR), quadratic programming (QP) and Hamilton-Jacobian-based methods (HJB). The project page is on https://mit-realm.github.io/hybrid-clf.

  • 2 authors
·
Mar 18, 2023

A Topological and Operator Algebraic Framework for Asynchronous Lattice Dynamical Systems

I introduce a novel mathematical framework integrating topological dynamics, operator algebras, and ergodic geometry to study lattices of asynchronous metric dynamical systems. Each node in the lattice carries an internal flow represented by a one-parameter family of operators, evolving on its own time scale. I formalize stratified state spaces capturing multiple levels of synchronized behavior, define an asynchronous evolution metric that quantifies phase-offset distances between subsystems, and characterize emergent coherent topologies arising when subsystems synchronize. Within this framework, I develop formal operators for the evolution of each subsystem and give precise conditions under which phase-aligned synchronization occurs across the lattice. The main results include: (1) the existence and uniqueness of coherent (synchronized) states under a contractive coupling condition, (2) stability of these coherent states and criteria for their emergence as a collective phase transition in a continuous operator topology, and (3) the influence of symmetries, with group-invariant coupling leading to flow-invariant synchrony subspaces and structured cluster dynamics. Proofs are given for each theorem, demonstrating full mathematical rigor. In a final section, I discuss hypothetical applications of this framework to symbolic lattice systems (e.g. subshifts), to invariant group actions on dynamical lattices, and to operator fields over stratified manifolds in the spirit of noncommutative geometry. Throughout, I write in the first person to emphasize the exploratory nature of this work. The paper avoids any reference to cosmology or observers, focusing instead on clean, formal mathematics suitable for a broad array of dynamical systems.

  • 1 authors
·
May 14, 2025

STORI: A Benchmark and Taxonomy for Stochastic Environments

Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real-world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non-stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well-defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic-ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five-type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state-of-the-art model-based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.

  • 3 authors
·
Sep 1, 2025

Of the People, By the Algorithm: How AI Transforms Democratic Representation

This review examines how AI technologies are transforming democratic representation, focusing on citizen participation and algorithmic decision-making. The analysis reveals that AI technologies are reshaping democratic processes in fundamental ways: enabling mass-scale deliberation, changing how citizens access and engage with political information, and transforming how representatives make and implement decisions. While AI offers unprecedented opportunities for enhancing democratic participation and governance efficiency, it also presents significant challenges to democratic legitimacy and accountability. Social media platforms' AI-driven algorithms currently mediate much political discourse, creating concerns about information manipulation and privacy. Large Language Models introduce both epistemic challenges and potential tools for improving democratic dialogue. The emergence of Mass Online Deliberation platforms suggests possibilities for scaling up meaningful citizen participation, while Algorithmic Decision-Making systems promise more efficient policy implementation but face limitations in handling complex political trade-offs. As these systems become prevalent, representatives may assume the role of architects of automated decision frameworks, responsible for guiding the translation of politically contested concepts into technical parameters and metrics. Advanced deliberation platforms offering real-time insights into citizen preferences will challenge traditional representative independence and discretion to interpret public will. The institutional integration of these participation mechanisms requires frameworks that balance the benefits with democratic stability through hybrid systems weighting different forms of democratic expression.

  • 1 authors
·
Aug 26, 2025

Synergistic Fusion of Multi-Source Knowledge via Evidence Theory for High-Entropy Alloy Discovery

Discovering novel high-entropy alloys (HEAs) with desirable properties is challenging due to the vast compositional space and complex phase formation mechanisms. Efficient exploration of this space requires a strategic approach that integrates heterogeneous knowledge sources. Here, we propose a framework that systematically combines knowledge extracted from computational material datasets with domain knowledge distilled from scientific literature using large language models (LLMs). A central feature of this approach is the explicit consideration of element substitutability, identifying chemically similar elements that can be interchanged to potentially stabilize desired HEAs. Dempster-Shafer theory, a mathematical framework for reasoning under uncertainty, is employed to model and combine substitutabilities based on aggregated evidence from multiple sources. The framework predicts the phase stability of candidate HEA compositions and is systematically evaluated on both quaternary alloy systems, demonstrating superior performance compared to baseline machine learning models and methods reliant on single-source evidence in cross-validation experiments. By leveraging multi-source knowledge, the framework retains robust predictive power even when key elements are absent from the training data, underscoring its potential for knowledge transfer and extrapolation. Furthermore, the enhanced interpretability of the methodology offers insights into the fundamental factors governing HEA formation. Overall, this work provides a promising strategy for accelerating HEA discovery by integrating computational and textual knowledge sources, enabling efficient exploration of vast compositional spaces with improved generalization and interpretability.

  • 9 authors
·
Feb 20, 2025

Transient Stability Analysis with Physics-Informed Neural Networks

We explore the possibility to use physics-informed neural networks to drastically accelerate the solution of ordinary differential-algebraic equations that govern the power system dynamics. When it comes to transient stability assessment, the traditionally applied methods either carry a significant computational burden, require model simplifications, or use overly conservative surrogate models. Conventional neural networks can circumvent these limitations but are faced with high demand of high-quality training datasets, while they ignore the underlying governing equations. Physics-informed neural networks are different: they incorporate the power system differential algebraic equations directly into the neural network training and drastically reduce the need for training data. This paper takes a deep dive into the performance of physics-informed neural networks for power system transient stability assessment. Introducing a new neural network training procedure to facilitate a thorough comparison, we explore how physics-informed neural networks compare with conventional differential-algebraic solvers and classical neural networks in terms of computation time, requirements in data, and prediction accuracy. We illustrate the findings on the Kundur two-area system, and assess the opportunities and challenges of physics-informed neural networks to serve as a transient stability analysis tool, highlighting possible pathways to further develop this method.

  • 3 authors
·
Mar 14, 2023

Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model

Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end-to-end models that exhibit unacceptable response delays. Moreover, the absence of real-world benchmarks and holistic metrics hinders progress in the field. This paper presents a comprehensive frame-work to overcome these limitations. We first introduce SID-Bench, the first benchmark for semantic-aware interruption detection built entirely from real-world human dialogues. To provide a rigorous assessment of the responsiveness-robustness trade-off, we propose the Average Penalty Time (APT) metric, which assigns a temporal cost to both false alarms and late responses. Building on this framework, we design an LLM-based detection model optimized through a novel training paradigm to capture subtle semantic cues of intent. Experimental results show that our model significantly outperforms mainstream baselines, achieving a nearly threefold reduction in APT. By successfully resolving the long-standing tension between speed and stability, our work establishes a new state-of-the-art for intelligent interruption handling in SDS. To facilitate future research, SID-Bench and the associated code are available at: https://github.com/xkx-hub/SID-bench.

  • 5 authors
·
Mar 24

Unsupervised Domain Adaptive Detection with Network Stability Analysis

Domain adaptive detection aims to improve the generality of a detector, learned from the labeled source domain, on the unlabeled target domain. In this work, drawing inspiration from the concept of stability from the control theory that a robust system requires to remain consistent both externally and internally regardless of disturbances, we propose a novel framework that achieves unsupervised domain adaptive detection through stability analysis. In specific, we treat discrepancies between images and regions from different domains as disturbances, and introduce a novel simple but effective Network Stability Analysis (NSA) framework that considers various disturbances for domain adaptation. Particularly, we explore three types of perturbations including heavy and light image-level disturbances and instancelevel disturbance. For each type, NSA performs external consistency analysis on the outputs from raw and perturbed images and/or internal consistency analysis on their features, using teacher-student models. By integrating NSA into Faster R-CNN, we immediately achieve state-of-the-art results. In particular, we set a new record of 52.7% mAP on Cityscapes-to-FoggyCityscapes, showing the potential of NSA for domain adaptive detection. It is worth noticing, our NSA is designed for general purpose, and thus applicable to one-stage detection model (e.g., FCOS) besides the adopted one, as shown by experiments. https://github.com/tiankongzhang/NSA.

  • 4 authors
·
Aug 16, 2023

ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems

Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction.

  • 9 authors
·
Mar 19, 2024

Exploring Parameter-Efficient Fine-Tuning to Enable Foundation Models in Federated Learning

Federated learning (FL) has emerged as a promising paradigm for enabling the collaborative training of models without centralized access to the raw data on local devices. In the typical FL paradigm (e.g., FedAvg), model weights are sent to and from the server each round to participating clients. Recently, the use of small pre-trained models has been shown to be effective in federated learning optimization and improving convergence. However, recent state-of-the-art pre-trained models are getting more capable but also have more parameters, known as the "Foundation Models." In conventional FL, sharing the enormous model weights can quickly put a massive communication burden on the system, especially if more capable models are employed. Can we find a solution to enable those strong and readily available pre-trained models in FL to achieve excellent performance while simultaneously reducing the communication burden? To this end, we investigate the use of parameter-efficient fine-tuning in federated learning and thus introduce a new framework: FedPEFT. Specifically, we systemically evaluate the performance of FedPEFT across a variety of client stability, data distribution, and differential privacy settings. By only locally tuning and globally sharing a small portion of the model weights, significant reductions in the total communication overhead can be achieved while maintaining competitive or even better performance in a wide range of federated learning scenarios, providing insight into a new paradigm for practical and effective federated systems.

  • 5 authors
·
Oct 4, 2022

Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics

Learning to simulate complex physical systems from data has emerged as a promising way to overcome the limitations of traditional numerical solvers, which often require prohibitive computational costs for high-fidelity solutions. Recent Graph Neural Simulators (GNSs) accelerate simulations by learning dynamics on graph-structured data, yet often struggle to capture long-range interactions and suffer from error accumulation under autoregressive rollouts. To address these challenges, we propose Information-preserving Graph Neural Simulators (IGNS), a graph-based neural simulator built on the principles of Hamiltonian dynamics. This structure guarantees preservation of information across the graph, while extending to port-Hamiltonian systems allows the model to capture a broader class of dynamics, including non-conservative effects. IGNS further incorporates a warmup phase to initialize global context, geometric encoding to handle irregular meshes, and a multi-step training objective that facilitates PDE matching, where the trajectory produced by integrating the port-Hamiltonian core aligns with the ground-truth trajectory, thereby reducing rollout error. To evaluate these properties systematically, we introduce new benchmarks that target long-range dependencies and challenging external forcing scenarios. Across all tasks, IGNS consistently outperforms state-of-the-art GNSs, achieving higher accuracy and stability under challenging and complex dynamical systems. Our project page: https://thobotics.github.io/neural_pde_matching.

  • 7 authors
·
Nov 11, 2025

HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention

Predicting the trajectories of road agents is essential for autonomous driving systems. The recent mainstream methods follow a static paradigm, which predicts the future trajectory by using a fixed duration of historical frames. These methods make the predictions independently even at adjacent time steps, which leads to potential instability and temporal inconsistency. As successive time steps have largely overlapping historical frames, their forecasting should have intrinsic correlation, such as overlapping predicted trajectories should be consistent, or be different but share the same motion goal depending on the road situation. Motivated by this, in this work, we introduce HPNet, a novel dynamic trajectory forecasting method. Aiming for stable and accurate trajectory forecasting, our method leverages not only historical frames including maps and agent states, but also historical predictions. Specifically, we newly design a Historical Prediction Attention module to automatically encode the dynamic relationship between successive predictions. Besides, it also extends the attention range beyond the currently visible window benefitting from the use of historical predictions. The proposed Historical Prediction Attention together with the Agent Attention and Mode Attention is further formulated as the Triple Factorized Attention module, serving as the core design of HPNet.Experiments on the Argoverse and INTERACTION datasets show that HPNet achieves state-of-the-art performance, and generates accurate and stable future trajectories. Our code are available at https://github.com/XiaolongTang23/HPNet.

  • 6 authors
·
Apr 9, 2024

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within the block. Specifically, we leverage the temporal stability of expert activations during the diffusion process within the block and introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4times and 1.5times throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.

  • 5 authors
·
May 18 1

The Role of Deep Learning in Advancing Proactive Cybersecurity Measures for Smart Grid Networks: A Survey

As smart grids (SG) increasingly rely on advanced technologies like sensors and communication systems for efficient energy generation, distribution, and consumption, they become enticing targets for sophisticated cyberattacks. These evolving threats demand robust security measures to maintain the stability and resilience of modern energy systems. While extensive research has been conducted, a comprehensive exploration of proactive cyber defense strategies utilizing Deep Learning (DL) in {SG} remains scarce in the literature. This survey bridges this gap, studying the latest DL techniques for proactive cyber defense. The survey begins with an overview of related works and our distinct contributions, followed by an examination of SG infrastructure. Next, we classify various cyber defense techniques into reactive and proactive categories. A significant focus is placed on DL-enabled proactive defenses, where we provide a comprehensive taxonomy of DL approaches, highlighting their roles and relevance in the proactive security of SG. Subsequently, we analyze the most significant DL-based methods currently in use. Further, we explore Moving Target Defense, a proactive defense strategy, and its interactions with DL methodologies. We then provide an overview of benchmark datasets used in this domain to substantiate the discourse.{ This is followed by a critical discussion on their practical implications and broader impact on cybersecurity in Smart Grids.} The survey finally lists the challenges associated with deploying DL-based security systems within SG, followed by an outlook on future developments in this key field.

  • 3 authors
·
Jan 11, 2024

How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods

Creating test collections for offline retrieval evaluation requires human effort to judge documents' relevance. This expensive activity motivated much work in developing methods for constructing benchmarks with fewer assessment costs. In this respect, adjudication methods actively decide both which documents and the order in which experts review them, in order to better exploit the assessment budget or to lower it. Researchers evaluate the quality of those methods by measuring the correlation between the known gold ranking of systems under the full collection and the observed ranking of systems under the lower-cost one. This traditional analysis ignores whether and how the low-cost judgements impact on the statistically significant differences among systems with respect to the full collection. We fill this void by proposing a novel methodology to evaluate how the low-cost adjudication methods preserve the pairwise significant differences between systems as the full collection. In other terms, while traditional approaches look for stability in answering the question "is system A better than system B?", our proposed approach looks for stability in answering the question "is system A significantly better than system B?", which is the ultimate questions researchers need to answer to guarantee the generalisability of their results. Among other results, we found that the best methods in terms of ranking of systems correlation do not always match those preserving statistical significance.

  • 3 authors
·
Aug 18, 2023

Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

Unlearning in large language models (LLMs) aims to remove harmful training data while preserving overall utility. However, we find that existing methods often hallucinate, generate abnormal token sequences, or behave inconsistently, raising safety and trust concerns. According to prior literature on LLM honesty, such behaviors are often associated with dishonesty. This motivates us to investigate the notion of honesty in the context of model unlearning. We propose a formal definition of unlearning honesty, which includes: (1) preserving both utility and honesty on retained knowledge, and (2) ensuring effective forgetting while encouraging the model to acknowledge its limitations and respond consistently to questions related to forgotten knowledge. To systematically evaluate the honesty of unlearning, we introduce a suite of metrics that cover utility, honesty on the retained set, effectiveness of forgetting, rejection rate and refusal stability in Q&A and MCQ settings. Evaluating 9 methods across 3 mainstream families shows that all current methods fail to meet these standards. After experimental and theoretical analyses, we present ReVa, a representation-alignment procedure that fine-tunes feature-randomized unlearned models to better acknowledge forgotten knowledge. On Q&A tasks from the forget set, ReVa achieves the highest rejection rate after two rounds of interaction, nearly doubling the performance of the second-best method. Remarkably, It also improves honesty on the retained set. We release our data and code at https://github.com/renjiegu.

  • 4 authors
·
May 8

MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems

A growing body of multi-agent studies with LLMs explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without explicit knowledge of the payoff structure or how individual actions translate into long-run outcomes, relying instead on heuristics, communication, and enforcement. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom's principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a 2times2 grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.

  • 5 authors
·
Oct 16, 2025

Improving Robustness of Tabular Retrieval via Representational Stability

Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as csv, tsv, html, markdown, and ddl, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat serialization embedding as noisy views of a shared semantic signal and use its centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across MPNet, BGE-M3, ReasonIR, and SPLADE. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval{https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval}.

  • 5 authors
·
Apr 26 2

Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition

Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric aimed at assessing the consistency of model explanations, ensuring that models consistently reflect the intended objectives and consistency under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. These experiments are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model's behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.

  • 4 authors
·
Apr 5

Intelligent Load Balancing in Cloud Computer Systems

Cloud computing is an established technology allowing users to share resources on a large scale, never before seen in IT history. A cloud system connects multiple individual servers in order to process related tasks in several environments at the same time. Clouds are typically more cost-effective than single computers of comparable computing performance. The sheer physical size of the system itself means that thousands of machines may be involved. The focus of this research was to design a strategy to dynamically allocate tasks without overloading Cloud nodes which would result in system stability being maintained at minimum cost. This research has added the following new contributions to the state of knowledge: (i) a novel taxonomy and categorisation of three classes of schedulers, namely OS-level, Cluster and Big Data, which highlight their unique evolution and underline their different objectives; (ii) an abstract model of cloud resources utilisation is specified, including multiple types of resources and consideration of task migration costs; (iii) a virtual machine live migration was experimented with in order to create a formula which estimates the network traffic generated by this process; (iv) a high-fidelity Cloud workload simulator, based on a month-long workload traces from Google's computing cells, was created; (v) two possible approaches to resource management were proposed and examined in the practical part of the manuscript: the centralised metaheuristic load balancer and the decentralised agent-based system. The project involved extensive experiments run on the University of Westminster HPC cluster, and the promising results are presented together with detailed discussions and a conclusion.

  • 1 authors
·
Sep 22, 2025

Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

In personalized marketing, uplift models estimate incremental effects by modeling how customer behavior changes under alternative treatments. However, real-world data often exhibit biases - such as selection bias, spillover effects, and unobserved confounding - which adversely affect both estimation accuracy and metric validity. Despite the importance of bias-aware assessment, a lack of systematic studies persists. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets lack counterfactual ground truth, rendering direct metric validation infeasible. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking, effectively bridging the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that: (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) evaluation metric stability is linked to mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and metrics. Code will be released upon acceptance.

  • 3 authors
·
Mar 20

PhononBench:A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation-illustrated here by band-gap conditioning with MatterGen--the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench

Agent Identity URI Scheme: Topology-Independent Naming and Capability-Based Discovery for Multi-Agent Systems

Multi-agent systems face a fundamental architectural flaw: agent identity is bound to network location. When agents migrate between providers, scale across instances, or federate across organizations, URI-based identity schemes break references, fragment audit trails, and require centralized coordination. We propose the agent:// URI scheme, which decouples identity from topology through three orthogonal components: a trust root establishing organizational authority, a hierarchical capability path enabling semantic discovery, and a sortable unique identifier providing stable reference. The scheme enables capability-based discovery through DHT key derivation, where queries return agents by what they do rather than where they are. Trust-root scoping prevents cross-organization pollution while permitting federation when desired. Cryptographic attestation via PASETO tokens binds capability claims to agent identity, enabling verification without real-time contact with the issuing authority. We evaluate the scheme across four dimensions: capability expressiveness (100% coverage on 369 production tools with zero collision), discovery precision (F1=1.0 across 10,000 agents), identity stability (formal proofs of migration invariance), and performance (all operations under 5 microseconds). The agent:// URI scheme provides a formally-specified, practically-evaluated foundation for decentralized agent identity and capability-based discovery.

  • 1 authors
·
Jan 20

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market ("The Crash"), where firms amplify price volatility until the market collapses, and (2) Sybil Deception in a C2C market ("The Lemon Market"), where a single deceptive agent controlling multiple coordinated seller identities floods the market with fraudulent listings, eroding trust and consumer welfare. We evaluate frontier and open-weight models across both scenarios and find that models largely fail to self-regulate, with failure severity varying by model rather than by size. We propose economically aligned harnesses, Stabilizing Firms and Skeptical Guardians, that improve outcomes but remain fragile under harder market conditions. To close this gap, we train agents with REINFORCE++ using an adaptive curriculum, producing a 9B model that outperforms all evaluated frontier and open-weight models. We propose the Economic Alignment Score (EAS), a 4-component scalar metric aggregating stability, integrity, welfare, and profitability, enabling direct cross-model comparison. Our results show that economic alignment is orthogonal to general capability and can be directly trained with targeted RL.

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.

  • 7 authors
·
Jan 3, 2025 2

Geometric Stability: The Missing Axis of Representations

Analysis of learned representations has a blind spot: it focuses on similarity, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce geometric stability, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present Shesha, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated (ρapprox 0.01) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2times more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability (ρ= 0.89-0.96); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying how reliably systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.

  • 1 authors
·
Jan 14 2

From Syntax to Semantics: Geometric Stability as the Missing Axis of Perturbation Biology

The capacity to precisely edit genomes has outpaced our ability to predict the consequences. A cell can be genetically perfect and therapeutically useless: edited exactly as intended, yet unstable, drifting toward unintended fates, or selected for properties that compromise safety. This paradox reflects a deeper gap in how we evaluate biological intervention. Current frameworks excel at measuring what was done to a cell but remain blind to what the cell has become. We argue that this blindness stems from treating cells as collections of independent variables rather than as dynamical systems occupying positions on high-dimensional state manifolds. Drawing on Waddington's epigenetic landscape, we propose geometric stability as a missing axis of evaluation: the directional coherence of cellular responses to perturbation. This metric distinguishes interventions that guide cells coherently toward stable states from those that scatter them across the state manifold. Validation across diverse perturbation datasets reveals that geometric stability captures regulatory architecture invisible to conventional metrics, discriminating pleiotropic master regulators from lineage-specific factors without prior biological annotation. As precision medicine increasingly relies on cellular reprogramming, the question shifts from ``did the intervention occur?'' to ``is the resulting state stable?'' Geometric stability provides a framework for answering.

  • 1 authors
·
Apr 24

A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.

  • 47 authors
·
Jan 18, 2023

The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an "effective reward" aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.

  • 1 authors
·
Jul 27, 2025

Real-Time Structural Deflection Estimation in Hydraulically Actuated Systems Using 3D Flexible Multibody Simulation and DNNs

The precision, stability, and performance of lightweight high-strength steel structures in heavy machinery is affected by their highly nonlinear dynamics. This, in turn, makes control more difficult, simulation more computationally intensive, and achieving real-time autonomy, using standard approaches, impossible. Machine learning through data-driven, physics-informed and physics-inspired networks, however, promises more computationally efficient and accurate solutions to nonlinear dynamic problems. This study proposes a novel framework that has been developed to estimate real-time structural deflection in hydraulically actuated three-dimensional systems. It is based on SLIDE, a machine-learning-based method to estimate dynamic responses of mechanical systems subjected to forced excitations.~Further, an algorithm is introduced for the data acquisition from a hydraulically actuated system using randomized initial configurations and hydraulic pressures.~The new framework was tested on a hydraulically actuated flexible boom with various sensor combinations and lifting various payloads. The neural network was successfully trained in less time using standard parameters from PyTorch, ADAM optimizer, the various sensor inputs, and minimal output data. The SLIDE-trained neural network accelerated deflection estimation solutions by a factor of 10^7 in reference to flexible multibody simulation batches and provided reasonable accuracy. These results support the studies goal of providing robust, real-time solutions for control, robotic manipulators, structural health monitoring, and automation problems.

  • 6 authors
·
Mar 10, 2025

Geometric coherence of single-cell CRISPR perturbations reveals regulatory architecture and predicts cellular stress

Genome engineering has achieved remarkable sequence-level precision, yet predicting the transcriptomic state that a cell will occupy after perturbation remains an open problem. Single-cell CRISPR screens measure how far cells move from their unperturbed state, but this effect magnitude ignores a fundamental question: do the cells move together? Two perturbations with identical magnitude can produce qualitatively different outcomes if one drives cells coherently along a shared trajectory while the other scatters them across expression space. We introduce a geometric stability metric, Shesha, that quantifies the directional coherence of single-cell perturbation responses as the mean cosine similarity between individual cell shift vectors and the mean perturbation direction. Across five CRISPR datasets (2,200+ perturbations spanning CRISPRa, CRISPRi, and pooled screens), stability correlates strongly with effect magnitude (Spearman ρ=0.75-0.97), with a calibrated cross-dataset correlation of 0.97. Crucially, discordant cases where the two metrics decouple expose regulatory architecture: pleiotropic master regulators such as CEBPA and GATA1 pay a "geometric tax," producing large but incoherent shifts, while lineage-specific factors such as KLF1 produce tightly coordinated responses. After controlling for magnitude, geometric instability is independently associated with elevated chaperone activation (HSPA5/BiP; ρ_{partial}=-0.34 and -0.21 across datasets), and the high-stability/high-stress quadrant is systematically depleted. The magnitude-stability relationship persists in scGPT foundation model embeddings, confirming it is a property of biological state space rather than linear projection. Perturbation stability provides a complementary axis for hit prioritization in screens, phenotypic quality control in cell manufacturing, and evaluation of in silico perturbation predictions.

  • 1 authors
·
Apr 16 2

What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradient, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found in: https://github.com/MingLiiii/Layer_Gradient.

  • 3 authors
·
Oct 31, 2024 4

Kling-MotionControl Technical Report

Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.

KlingTeam Kling Team
·
Mar 3 1

Generative Modeling of Regular and Irregular Time Series Data via Koopman VAEs

Generating realistic time series data is important for many engineering and scientific applications. Existing work tackles this problem using generative adversarial networks (GANs). However, GANs are often unstable during training, and they can suffer from mode collapse. While variational autoencoders (VAEs) are known to be more robust to these issues, they are (surprisingly) less often considered for time series generation. In this work, we introduce Koopman VAE (KVAE), a new generative framework that is based on a novel design for the model prior, and that can be optimized for either regular and irregular training data. Inspired by Koopman theory, we represent the latent conditional prior dynamics using a linear map. Our approach enhances generative modeling with two desired features: (i) incorporating domain knowledge can be achieved by leverageing spectral tools that prescribe constraints on the eigenvalues of the linear map; and (ii) studying the qualitative behavior and stablity of the system can be performed using tools from dynamical systems theory. Our results show that KVAE outperforms state-of-the-art GAN and VAE methods across several challenging synthetic and real-world time series generation benchmarks. Whether trained on regular or irregular data, KVAE generates time series that improve both discriminative and predictive metrics. We also present visual evidence suggesting that KVAE learns probability density functions that better approximate empirical ground truth distributions.

  • 5 authors
·
Oct 4, 2023

Left/Right Brain, human motor control and the implications for robotics

Neural Network movement controllers promise a variety of advantages over conventional control methods however they are not widely adopted due to their inability to produce reliably precise movements. This research explores a bilateral neural network architecture as a control system for motor tasks. We aimed to achieve hemispheric specialisation similar to what is observed in humans across different tasks; the dominant system (usually the right hand, left hemisphere) excels at tasks involving coordination and efficiency of movement, and the non-dominant system performs better at tasks requiring positional stability. Specialisation was achieved by training the hemispheres with different loss functions tailored toward the expected behaviour of the respective hemispheres. We compared bilateral models with and without specialised hemispheres, with and without inter-hemispheric connectivity (representing the biological Corpus Callosum), and unilateral models with and without specialisation. The models were trained and tested on two tasks common in the human motor control literature: the random reach task, suited to the dominant system, a model with better coordination, and the hold position task, suited to the non-dominant system, a model with more stable movement. Each system out-performed the non-favoured system in its preferred task. For both tasks, a bilateral model outperforms the 'non-preferred' hand, and is as good or better than the 'preferred' hand. The Corpus Callosum tends to improve performance, but not always for the specialised models.

  • 4 authors
·
Jan 25, 2024

Routine: A Structural Planning Framework for LLM Agent System in Enterprise

The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT-4o's performance. These results highlight Routine's effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.

  • 16 authors
·
Jul 18, 2025

Bayesian Physics-informed Neural Networks for System Identification of Inverter-dominated Power Systems

While the uncertainty in generation and demand increases, accurately estimating the dynamic characteristics of power systems becomes crucial for employing the appropriate control actions to maintain their stability. In our previous work, we have shown that Bayesian Physics-informed Neural Networks (BPINNs) outperform conventional system identification methods in identifying the power system dynamic behavior under measurement noise. This paper takes the next natural step and addresses the more significant challenge, exploring how BPINN perform in estimating power system dynamics under increasing uncertainty from many Inverter-based Resources (IBRs) connected to the grid. These introduce a different type of uncertainty, compared to noisy measurements. The BPINN combines the advantages of Physics-informed Neural Networks (PINNs), such as inverse problem applicability, with Bayesian approaches for uncertainty quantification. We explore the BPINN performance on a wide range of systems, starting from a single machine infinite bus (SMIB) system and 3-bus system to extract important insights, to the 14-bus CIGRE distribution grid, and the large IEEE 118-bus system. We also investigate approaches that can accelerate the BPINN training, such as pretraining and transfer learning. Throughout this paper, we show that in presence of uncertainty, the BPINN achieves orders of magnitude lower errors than the widely popular method for system identification SINDy and significantly lower errors than PINN, while transfer learning helps reduce training time by up to 80 %.

  • 4 authors
·
Mar 20, 2024

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or capability during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs agents' birth-death operations on MAS, including edge edit, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The codes are released at https://github.com/chenxu2-gif/TacoMAS-MultiAgent.

  • 7 authors
·
May 9 2

Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

Agentic systems involved in high-stake decision-making under adversarial pressure need formal guarantees not offered by existing approaches. Motivated by the operational needs of security operations centers (SOCs) that must configure endpoint detection and response (EDR) policies under adversarial pressure, we present a tool-mediated architecture: LLM agents use deterministic tools (Stackelberg best-response, Bayesian observer updates, attack-graph primitives) and select from finite action catalogs enforced at the tool-output interface. A composite Lyapunov function machine-checked in Lean 4 with zero sorry certifies controllability, observability from asymmetric sensor data, and Input-to-State Stability (ISS) robustness under intelligent adversarial disturbance, with two corollaries extending the certificate to any controller or adversary from the catalogs. On 282 real enterprise attack graphs, the claims hold with margin. On paired offensive/defensive telemetry, a tool-mediated Claude Sonnet 4 controller reduces the attacker's expected payoff (game value) by 59% relative to a deterministic greedy baseline, with zero variance across 40 runs at four temperatures. A Claude Haiku 4.5 controller converges to suboptimal game values but stays catalog-bounded over an additional 40 runs, demonstrating that architectural stability is not dependent on the controller capability. The LLM agent's non-determinism furthers creative exploration of strategies, while the tool-mediated architecture ensures system stability.

  • 8 authors
·
May 3

MemEvolve: Meta-Evolution of Agent Memory Systems

Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

  • 8 authors
·
Dec 21, 2025 2

The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.

  • 6 authors
·
Feb 3 1

Interpreting Agentic Systems: Beyond Model Explanations to System-Level Accountability

Agentic systems have transformed how Large Language Models (LLMs) can be leveraged to create autonomous systems with goal-directed behaviors, consisting of multi-step planning and the ability to interact with different environments. These systems differ fundamentally from traditional machine learning models, both in architecture and deployment, introducing unique AI safety challenges, including goal misalignment, compounding decision errors, and coordination risks among interacting agents, that necessitate embedding interpretability and explainability by design to ensure traceability and accountability across their autonomous behaviors. Current interpretability techniques, developed primarily for static models, show limitations when applied to agentic systems. The temporal dynamics, compounding decisions, and context-dependent behaviors of agentic systems demand new analytical approaches. This paper assesses the suitability and limitations of existing interpretability methods in the context of agentic systems, identifying gaps in their capacity to provide meaningful insight into agent decision-making. We propose future directions for developing interpretability techniques specifically designed for agentic systems, pinpointing where interpretability is required to embed oversight mechanisms across the agent lifecycle from goal formation, through environmental interaction, to outcome evaluation. These advances are essential to ensure the safe and accountable deployment of agentic AI systems.

  • 6 authors
·
Jan 23

Question answering systems for health professionals at the point of care -- a systematic review

Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. Materials and methods: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology and forward and backward citations on 7th February 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. Results: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. Discussion: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.

  • 9 authors
·
Jan 24, 2024

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency. Generation must wait until the longest output in the batch is completed before model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.57times training speedup compared to the best synchronous systems with the same number of GPUs and matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.

  • 13 authors
·
May 30, 2025 2

Comparing Software Developers with ChatGPT: An Empirical Investigation

The advent of automation in particular Software Engineering (SE) tasks has transitioned from theory to reality. Numerous scholarly articles have documented the successful application of Artificial Intelligence to address issues in areas such as project management, modeling, testing, and development. A recent innovation is the introduction of ChatGPT, an ML-infused chatbot, touted as a resource proficient in generating programming codes and formulating software testing strategies for developers and testers respectively. Although there is speculation that AI-based computation can increase productivity and even substitute software engineers in software development, there is currently a lack of empirical evidence to verify this. Moreover, despite the primary focus on enhancing the accuracy of AI systems, non-functional requirements including energy efficiency, vulnerability, fairness (i.e., human bias), and safety frequently receive insufficient attention. This paper posits that a comprehensive comparison of software engineers and AI-based solutions, considering various evaluation criteria, is pivotal in fostering human-machine collaboration, enhancing the reliability of AI-based methods, and understanding task suitability for humans or AI. Furthermore, it facilitates the effective implementation of cooperative work structures and human-in-the-loop processes. This paper conducts an empirical investigation, contrasting the performance of software engineers and AI systems, like ChatGPT, across different evaluation metrics. The empirical study includes a case of assessing ChatGPT-generated code versus code produced by developers and uploaded in Leetcode.

  • 3 authors
·
May 19, 2023

Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.

netflix Netflix
·
Oct 22, 2025 2

Leslie Population Models in Predator-prey and Competitive populations: theory and applications by machine learning

We introduce a new predator-prey model by replacing the growth and predation constant by a square matrix, and the population density as a population vector. The classical Lotka-Volterra model describes a population that either modulates or converges. Stability analysis of such models have been extensively studied by the works of Merdan (https://doi.org/10.1016/j.chaos.2007.06.062). The new model adds complexity by introducing an age group structure where the population of each age group evolves as prescribed by the Leslie matrix. The added complexity changes the behavior of the model such that the population either displays roughly an exponential growth or decay. We first provide an exact equation that describes a time evolution and use analytic techniques to obtain an approximate growth factor. We also discuss the variants of the Leslie model, i.e., the complex value predator-prey model and the competitive model. We then prove the Last Species Standing theorem that determines the dominant population in the large time limit. The recursive structure of the model denies the application of simple regression. We discuss a machine learning scheme that allows an admissible fit for the population evolution of Paramecium Aurelia and Paramecium Caudatum. Another potential avenue to simplify the computation is to use the machinery of quantum operators. We demonstrate the potential of this approach by computing the Hamiltonian of a simple Leslie system.

  • 5 authors
·
Dec 20, 2024

"I May Not Have Articulated Myself Clearly": Diagnosing Dynamic Instability in LLM Reasoning at Inference Time

Reasoning failures in large language models (LLMs) are typically measured only at the end of a generation, yet many failures manifest as a process-level breakdown: the model "loses the thread" mid-reasoning. We study whether such breakdowns are detectable from inference-time observables available in standard APIs (token log probabilities), without any training or fine-tuning. We define a simple instability signal that combines consecutive-step distributional shift (JSD) and uncertainty (entropy), summarize each trace by its peak instability strength, and show that this signal reliably predicts failure. Across GSM8K and HotpotQA, instability strength predicts wrong answers with above-chance AUC and yields monotonic bucket-level accuracy decline at scale across model sizes. Crucially, we show that instability is not uniformly harmful: early instability can reflect subsequent stabilization and a correct final answer (corrective instability), whereas late instability is more often followed by failure (destructive instability), even at comparable peak magnitudes, indicating that recoverability depends not only on how strongly the distribution changes but also on when such changes occur relative to the remaining decoding horizon. The method is model-agnostic, training-free, and reproducible, and is presented as a diagnostic lens rather than a corrective or control mechanism.

  • 4 authors
·
Feb 2 3

Avoiding tipping points in fisheries management through Gaussian Process Dynamic Programming

Model uncertainty and limited data are fundamental challenges to robust management of human intervention in a natural system. These challenges are acutely highlighted by concerns that many ecological systems may contain tipping points, such as Allee population sizes. Before a collapse, we do not know where the tipping points lie, if they exist at all. Hence, we know neither a complete model of the system dynamics nor do we have access to data in some large region of state-space where such a tipping point might exist. We illustrate how a Bayesian Non-Parametric (BNP) approach using a Gaussian Process (GP) prior provides a flexible representation of this inherent uncertainty. We embed GPs in a Stochastic Dynamic Programming (SDP) framework in order to make robust management predictions with both model uncertainty and limited data. We use simulations to evaluate this approach as compared with the standard approach of using model selection to choose from a set of candidate models. We find that model selection erroneously favors models without tipping points -- leading to harvest policies that guarantee extinction. The GPDP performs nearly as well as the true model and significantly outperforms standard approaches. We illustrate this using examples of simulated single-species dynamics, where the standard model selection approach should be most effective, and find that it still fails to account for uncertainty appropriately and leads to population crashes, while management based on the GPDP does not, since it does not underestimate the uncertainty outside of the observed data.

  • 3 authors
·
Dec 27, 2014