See how Qualcomm AI Research continues to innovate from edge to cloud
What you should know:
- At Qualcomm AI Research, we are advancing AI to make its core capabilities — perception, reasoning and action — ubiquitous across devices.
- During the annual NeurIPS conference, we showcase our leadership in cutting-edge language, vision, multimodal reasoning and foundational machine learning (ML) research.
- Through hands-on demos, workshops and talks, we feature our innovative AI applications and technologies, efficient generative AI models, low-power computer vision and power-efficient neural networks.
Neural Information Processing Systems (NeurIPS), the premier machine learning conference, returns this year with an impressive 25% acceptance rate. We’re thrilled to engage with attendees directly, showcasing our research and demos.
2025 is notable for the remarkable advancements in generative artificial intelligence (GenAI), and Qualcomm Technologies is at the forefront of bringing these capabilities to edge devices as well as the cloud. Our Qualcomm AI Research team is dedicated to pushing the boundaries of AI/ML and translating these innovations into real-world applications.
We’re proud to have 17 papers (10 main conference and 7 workshop papers) and nine technology demos accepted at the conference, with our demos making up 45% of the total EXPO demonstrations. We can’t wait to connect with you in sunny San Diego, the city where we are headquartered, starting December 2. Be sure to visit us at booth #1503, explore our EXPO demos, workshops and talk panel, and join our poster sessions to learn more about our work.
Get what’s next in AI and computing
Advancing AI with breakthrough ideas
At prestigious academic conferences like NeurIPS, groundbreaking papers serve as a key channel for sharing innovative and influential AI research with the broader community. I want to spotlight some of our accepted papers and key themes that are pushing the boundaries of machine learning.
This year at NeurIPS, our team will present a diverse set of topics in language, multimodal reasoning, image/video generation and machine learning foundations. Together, they reflect a common goal: making AI systems more efficient, trustworthy and capable of tackling complex real‑world challenges. From faster large language models to robust multimodal search, from benchmarks for generative vision to theoretical advances in recurrent neural networks (RNNs), our work spans the spectrum of modern AI research.
Next‑gen LLMs
Large language models (LLMs) are evolving quickly and becoming ubiquitous, so we are constantly striving to make them faster, more efficient, more reliable and deployable across many devices. OmniDraft: Cross Vocabulary Online Adaptive Drafter for On-device Speculative Decoding reimagines speculative decoding by introducing a universal drafter that can adapt to any target model on the fly. This innovation enables faster responses, lower costs and greater flexibility for on-device AI, with experiments showing speed improvements of up to two times.
Complementing efficiency gains, KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments addresses the challenge of long conversations overwhelming memory. By retaining only the most unique pieces of context instead of relying on heavy attention scores, KeyDiff allows models to handle longer prompts, run faster and use less memory while barely sacrificing accuracy.
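The underlying idea, pruning the KV cache by key-to-key similarity instead of attention scores, can be sketched as below. The scoring rule and function name are illustrative, not KeyDiff's exact algorithm:

```python
import numpy as np

def keydiff_evict(keys, keep):
    """Keep the `keep` most distinctive cached keys, judged purely by
    pairwise cosine similarity (no attention scores needed).
    keys: (n_tokens, head_dim). Returns indices of keys to retain.
    Illustrative sketch, not the paper's exact scoring rule."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = k @ k.T                          # pairwise cosine similarity
    redundancy = sim.sum(axis=1)           # high = similar to many other keys
    return np.argsort(redundancy)[:keep]   # most distinctive keys first
```

Keys that are near-duplicates of many others contribute little unique context, so evicting them first frees memory with minimal accuracy loss.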
Diffusion for LLMs promises even faster generation speeds, and our team is exploring in-context learning for this paradigm. Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models offers new insights into how training objectives influence inductive biases. Our findings show that diffusion models trained with a non-autoregressive loss perform comparably to autoregressive models while enabling faster inference. However, they still display a recency bias and a preference for left-to-right, initial-to-later token processing, though less strongly, suggesting that diffusion models inherit some inductive biases despite their distinct training approach.
To strengthen reasoning, Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG introduces a structured approach to multihop reasoning by reusing common reasoning patterns and adding a smart “stopper” that knows when enough information has been gathered, making complex answers faster, cheaper and more reliable.
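The retrieve-reason-stop pattern can be sketched as below. The corpus, the template chaining and the stopping check are all toy stand-ins for the paper's learned components:

```python
# Toy sketch of structured multi-hop RAG with an early stopper: each hop's
# query is built from the previous hop's answer, and retrieval halts as soon
# as a hop fails or the chain completes. Corpus and queries are illustrative.

CORPUS = {
    "capital of France": "Paris",
    "river in Paris": "Seine",
}

def retrieve(query):
    return CORPUS.get(query)

def answer(hop_templates, max_hops=5):
    fact = ""
    for hop, template in enumerate(hop_templates):
        if hop >= max_hops:
            break
        query = template.format(fact) if "{}" in template else template
        fact = retrieve(query)
        if fact is None:        # stopper: no more evidence to gain, halt early
            return None
    return fact
```

Reusing a small set of such reasoning templates, plus knowing when to stop retrieving, is what makes multi-hop answers cheaper and faster.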
Finally, Analyzing and Improving Chain of Thought Monitorability investigates why monitors that check reasoning traces for bias or harmfulness often fail. We identify information gaps and elicitation errors as key issues and propose new training strategies, one that rewards clearer reasoning traces and another that uses information theory to align outputs with reasoning, both of which significantly improve monitor accuracy and resilience.
Together, these AI research papers for next-generation LLMs showcase techniques that advance performance, efficiency, reasoning and trustworthiness.
Multimodal AI
AI systems increasingly need to understand and connect across text, images and video, and our work advances this frontier. Generalized Contrastive Learning (GCL): Better Search Across Text and Images introduces a training method that allows models to compare across multiple modalities at once, enabling more universal retrieval without specialized datasets. By teaching models to align text and images simultaneously, GCL gives AI a more general “search sense” that works across diverse content.
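The standard building block behind this kind of cross-modal alignment is a symmetric contrastive (InfoNCE) loss over paired embeddings, sketched below. This is the common CLIP-style formulation; GCL's multi-field generalization is not shown:

```python
import numpy as np

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/image embeddings.
    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pulls them together and pushes mismatched pairs apart."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature        # similarity of every text/image pair
    n = len(t)

    def xent(lg):                         # cross-entropy with diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return (xent(logits) + xent(logits.T)) / 2   # text->image and image->text
```

A well-aligned batch (true pairs most similar) yields a lower loss than a misaligned one, which is exactly the signal that trains a shared "search sense" across modalities.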
Building on this, Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? explores interactive assistance through the Qualcomm Interactive Cooking dataset, which contains videos of people making and correcting mistakes while following recipes. Using this dataset, we tested current multimodal models and introduced a system designed to provide live, streaming guidance, an early step toward AI that can coach users in real time.
To improve perception and trustworthiness, Attention Guided Alignment in Vision Language Models addresses hallucinations in vision-language systems by showing that poor alignment between image regions and text is the root cause. Our framework guides attention to the correct image regions using advanced segmentation tools, resulting in more accurate and trustworthy descriptions.
Extending multimodal methods into autonomous driving, Distilling Multi-modal Large Language Models for Autonomous Driving presents a framework that distills the knowledge of a multimodal LLM into a lighter vision-based planner. This system maintains efficiency while benefiting from broader reasoning power, cutting trajectory errors by 44% in rare long-tail scenarios and achieving state-of-the-art performance on popular benchmarks.
Finally, Leveraging Probabilistic Modeling for Robust E2E Autonomous Driving across Domains introduces a framework that formulates the joint probabilistic distribution over tokens encoding ego and surrounding vehicle information. Instantiated with a Gaussian process, it learns basis tokens with corresponding trajectories that span diverse driving scenarios, enabling robust adaptation to new domains and significantly outperforming direct fine‑tuning without extra inference cost.
Collectively, these AI research papers demonstrate how multimodal AI can improve search, guidance, perception and autonomous decision-making.
Image and video generation
Generative vision and perception remain some of the hardest challenges in AI, and our papers tackle them head on. MultiHuman Testbench: Raising the Bar for Multi Person Image Generation provides the first dedicated benchmark for evaluating how well models can create realistic images of multiple people, each with distinct faces, poses and actions. With thousands of diverse faces and prompts, plus metrics to measure identity, action and alignment, this benchmark sets a new standard for advancing multi-human image generation.
To advance generative methods, Improved Training for Shortcut Models introduces a unified training framework that resolves key challenges of shortcut models, such as frequency bias and guidance inconsistency. By conditioning the network on both the current noise level and the desired step size, shortcut models can predict multiple timesteps in a single forward pass, significantly accelerating the generation process. With innovations like dynamic control, wavelet loss and twin EMA strategies, our framework makes shortcut models competitive again, delivering sharper and more reliable image generation across both one‑step and multi‑step sampling.
Extending generative vision into autonomous driving, ODG: Smarter 3D Scene Understanding for Self Driving Cars proposes a dual Gaussian approach that splits driving scenes into static and dynamic parts, capturing stationary structures such as buildings and moving objects like cars or pedestrians more effectively. The result is sharper, faster and more accurate 3D predictions, giving self-driving systems a clearer view of the road ahead.
Together, these AI research papers advance the state of visual AI by setting new benchmarks, improving generative training and enabling smarter scene understanding.
Foundations of ML
Beyond applications, our work also strengthens foundational ML research, where we explore fundamental limitations or capabilities from a theoretical standpoint to understand what AI/ML can and can’t do. Non‑exchangeable Conformal Prediction with Optimal Transport addresses the problem of distribution shift, where test data differs from training data. By leveraging optimal transport, our paper shows how uncertainty can be estimated and corrected even without knowing exactly how the data changes, making predictions more reliable in complex, real-world settings.
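For context, plain split conformal prediction under the exchangeability assumption can be sketched in a few lines; the paper's optimal-transport correction for when that assumption breaks is not shown:

```python
import numpy as np

def split_conformal_width(cal_residuals, alpha=0.1):
    """Plain split conformal prediction: from calibration residuals
    |y - y_hat|, return the half-width q such that intervals y_hat +/- q
    cover the truth with probability ~(1 - alpha), assuming calibration and
    test data are exchangeable. Illustrative baseline, not the paper's
    non-exchangeable OT-based method."""
    n = len(cal_residuals)
    rank = int(np.ceil((n + 1) * (1 - alpha)))   # conformal quantile index
    return np.sort(cal_residuals)[min(rank, n) - 1]
```

With 100 calibration residuals and alpha = 0.1, the method picks the 91st smallest residual as the interval half-width; distribution shift invalidates this guarantee, which is the gap the optimal-transport correction addresses.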
Revisiting Bi‑Linear State Transitions in RNNs rethinks the role of recurrent neural networks, traditionally seen as “memory keepers.” We show that hidden units are active computational players, and by revisiting bilinear operations, where inputs and hidden states interact multiplicatively, we argue that these are a natural fit for tasks that require tracking evolving states. Our work even maps out a hierarchy of complexity, placing popular models like Mamba at the simpler end, offering a fresh perspective on how RNNs actually think.
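The multiplicative interaction can be sketched as h_{t+1} = (sum_i x_t[i] * B_i) @ h_t: the input selects which transition acts on the hidden state. With permutation matrices as the B_i (a toy choice for illustration), such a recurrence exactly tracks an evolving discrete state:

```python
import numpy as np

# Hedged sketch of a bilinear state transition. The two transition matrices
# below are illustrative: symbol 0 leaves the state alone, symbol 1 swaps
# the two state slots, so the recurrence tracks a tiny evolving state.

B = np.stack([
    np.eye(2),                           # input symbol 0: keep state
    np.array([[0.0, 1.0], [1.0, 0.0]]),  # input symbol 1: swap state slots
])

def bilinear_rnn(inputs, h0):
    # h_{t+1} = (sum_i x_t[i] * B_i) @ h_t : the input multiplicatively
    # selects the transition applied to the hidden state
    h = np.asarray(h0, dtype=float)
    for x in inputs:                     # each x is one-hot over the B matrices
        h = (x[0] * B[0] + x[1] * B[1]) @ h
    return h
```

An additive RNN must approximate this kind of state tracking, whereas the bilinear form represents it exactly, which is the intuition behind the paper's complexity hierarchy.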
Finally, Latency NMS Attacks: Is It Real Life or Is It Just Fantasy? examines longstanding concerns about latency attacks in computer vision systems. Using our EVADE evaluation framework, we show that these attacks barely affect performance under realistic conditions: slowdowns don’t transfer across models, remain within acceptable limits, and can be easily defended against. In short, while they look alarming on paper, NMS latency attacks aren’t much of a real‑world threat.
Together, these AI research papers advance the foundations of ML by making predictions more robust, offering new theoretical insights into RNNs and clarifying the true risks of adversarial attacks.
Our NeurIPS papers highlight a broad vision for the future of AI: systems that are faster and more efficient, capable of reasoning across modalities, grounded in trustworthy perception and built on solid theoretical foundations. By addressing challenges from long‑context memory to multimodal search, from generative vision benchmarks to distribution shift, our work contributes to building AI that is not only powerful but also practical and reliable.
EXPO talk and workshops at the frontier
We’re excited to share that we’ll be presenting a talk and workshops at the NeurIPS Expo this year. Our talk, Recent Developments in Embodied AI, explores how real-world interactions pose unique challenges for AI systems, since they naturally require a deep understanding of the physical world and its inhabitants. The talk provides an in-depth discussion of embodied AI, with a focus on recent advances based on multimodal LLMs. It explains how end-to-end training has made it possible to instill key aspects of real-world common sense in a model, unlocking ambitious applications such as generalist robot control and real-world visual interaction, like chatbots that can see and hear.
Our workshop, Large-Scale Real-World Physical AI Systems, covers the latest research and best practices in industrial physical AI from leaders in the domain. It discusses emerging technologies such as VLA-based foundation models, the AI data flywheel and cross-embodiment learning for physical AI.
Generative AI is evolving from offline, single-modality models into interactive agentic systems that perceive, decide and act in the real world. Our other workshop, AI Assistants in the Wild: Agents, Adaptation, and Memory-Augmented Deployment, explores how we build generative agents that are not only efficient and responsive but also able to accumulate, recall and adapt based on personal memory over time. It aims to bring together perspectives from generative modeling, agentic learning, efficient model design and memory systems to close the gap between lab-scale prototypes and real-world deployment.
Advanced technology demonstrations paint what’s coming next
It’s important for us to provide live, real-world demonstrations in an interactive environment to complement our cutting-edge research publications. We showcase our AI research, including examples in on-device generative AI, full-stack AI optimization, new AI applications and efficient AI inference. Here’s a quick description of our accepted EXPO demos, which can be categorized as platform optimizations, image generation and advanced reasoning.
Platform optimization demos
- Mobile video diffusion transformers: Diffusion transformers (DiTs) for text/image-to-video generation require huge memory and computation cost due to the quadratic attention over thousands of video tokens. We demonstrate the first DiT designed to run on low-power NPUs in mobile devices, such as phones and laptops. With full-stack AI optimization, our implementation generates 48 frames at 1024×640 resolution within 8 seconds on a phone powered by a Snapdragon 8 Elite Gen 5 processor with Qualcomm Hexagon NPU.
- Disaggregated LLM serving on AI accelerators: LLM inference typically involves two distinct stages: prefill and decode. The prefill stage is compute bound, while the decode stage is memory bound. This demo showcases disaggregated serving on Qualcomm Cloud AI 100 Ultra Card, a power-efficient AI inference accelerator, to deliver significant improvements in time to first token (TTFT) and overall throughput.
- Parallel generation with verification on device: Efficiently generating and verifying multiple responses from LLMs directly on device is a major challenge. Our solution addresses this by leveraging multi‑stream execution graphs and parallel LLM generation, unlocking the benefits of test‑time scaling within a unified framework for joint generation and verification. Running on a phone powered by Snapdragon 8 Elite Gen 5, this approach reduces memory movement, minimizes latency and optimizes the selection of high‑quality responses. The result is more effective, safety‑enabled and personalized on‑device LLM inference, bringing advanced AI capabilities closer to everyday use.
- Efficient LiDAR processing with AI models leveraging heterogeneous compute: Running large perception models efficiently on device is challenging. This demo showcases heterogeneous compute execution of a LiDAR model running on a Snapdragon processor. The LiDAR processing, specifically a 3D sparse convolution (SpConv3D) network, runs on the Qualcomm Adreno GPU, while the Region Proposal Network (RPN) executes on the Hexagon NPU. This division of tasks across specialized processors reduces on-device inference latency and maximizes overall efficiency.
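The prefill/decode split behind the disaggregated-serving demo can be sketched as below. The "model" arithmetic is a toy stand-in; in real systems the KV cache is transferred between separate accelerator pools:

```python
# Toy sketch of disaggregated LLM serving: a compute-bound "prefill" worker
# processes the full prompt once and hands its KV cache to a memory-bound
# "decode" worker that streams tokens. Worker internals are illustrative.

def prefill(prompt_tokens):
    kv_cache = list(prompt_tokens)            # stand-in for attention KV state
    first_token = sum(prompt_tokens) % 100    # stand-in for the model's output
    return first_token, kv_cache

def decode(first_token, kv_cache, n_new):
    out = [first_token]
    for _ in range(n_new - 1):
        nxt = (out[-1] + len(kv_cache)) % 100  # stand-in: uses cache + last token
        kv_cache.append(nxt)
        out.append(nxt)
    return out

def serve(prompt_tokens, n_new):
    tok, cache = prefill(prompt_tokens)   # runs on the compute-optimized pool
    return decode(tok, cache, n_new)      # runs on the memory-optimized pool
```

Separating the two stages lets each pool be sized and scheduled for its own bottleneck, which is what improves TTFT and throughput.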
Image generation demos
- Generating group photos of multiple people from text and reference images: Reference-based multi-human image generation is emerging as a critical capability for personalization, synthetic data creation and benchmarking generative models. Existing models often fail to preserve identities or maintain spatial fidelity, which limits their applicability for real-world scenarios such as social content creation or training vision systems. Our demo addresses these challenges by showcasing a state-of-the-art system to produce a high-quality image featuring all participants in context.
- SwiftEdit: Fast text-guided image editing via one-step diffusion on a mobile device: Existing text-guided image editing methods fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. Our demo shows our one-step diffusion image editing model, SwiftEdit, interactively editing a user’s source image based on text prompts.
Multimodal and reasoning demos
- Multimodal AI forensic search for video surveillance: Video surveillance often involves sifting through hours of footage across multiple cameras to find specific targets or events. To address this challenge, we introduce ForeSea, a novel AI forensic search framework designed to support rich multimodal queries, combining both text and images, and returning timestamped key events. Built on our new AI Forensic‑QA benchmark, ForeSea demonstrates significant gains, achieving an 8.6% accuracy improvement and a 6.9% intersection-over-union (IoU) boost over strong baselines.
- Soft prompts for on-device content moderation: LLMs can sometimes generate unsafe or toxic outputs when prompted harmfully. With our proposed TV‑DiSP framework, we showcase the first seamless on‑device integration of a safety‑aligned LLM using efficient soft prompt distillation. This design allows a mobile device to run a quantized LLM equipped with learned soft prompts to moderate harmful content in real time, with only a minimal increase in inference cost. The result is over 15% safety gains, proving that advanced safety alignment can be both practical and lightweight for on‑device AI.
- Reasoning through multimodal end-to-end decision transformer networks and vision language action (VLA) models: Complex automated driving use cases involving vulnerable road users and other actors on the road can be challenging to handle, especially for modular AI approaches. This demonstration showcases the live output and visualization capabilities of an edge-integrated end-to-end VLA model for path planning scenarios. By harnessing raw multimodal sensor inputs, including visual and voice data, the VLA model processes information in real time to generate safe, explainable and repeatable driving trajectories.
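The soft-prompt mechanism behind the content-moderation demo can be sketched as below: a small matrix of learned "virtual token" embeddings is prepended to the input embeddings of a frozen model. Shapes and names are illustrative, not the TV‑DiSP implementation:

```python
import numpy as np

# Minimal soft-prompt sketch: N_SOFT learned embedding vectors (trained
# during distillation, zeros here as a placeholder) are prepended to the
# token embeddings, steering a frozen model at negligible inference cost.

EMBED_DIM = 8
N_SOFT = 4

soft_prompt = np.zeros((N_SOFT, EMBED_DIM))   # learned parameters in practice

def with_soft_prompt(token_embeddings):
    """token_embeddings: (seq_len, EMBED_DIM) -> (N_SOFT + seq_len, EMBED_DIM)."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)
```

Because only these few extra vectors are added per request, the quantized base model stays unchanged and the inference-cost overhead is minimal.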
Additional demos and talks at our booth
Beyond our EXPO demos, we are hosting a variety of additional demos across research topics in our booth, categorized as: efficient LLMs and reasoning, multimodal AI, visual content generation, AI for automotive, computer vision, AI for XR, AI for cloud and AI for developers with Qualcomm AI Hub.
For example, our efficient LLMs and reasoning demos include running a 3B-parameter model at over 200 tokens per second on a phone, using a dynamic budget to reduce reasoning tokens, and parallel reasoning with verification. Our generative and multimodal AI demos include a one‑step text‑to‑image diffusion model with negative prompts that delivers high‑fidelity images 30 times faster, a compact language‑vision model for edge deployment and personalized fitness assistance through smart glasses.
Our computer vision demos include open‑vocabulary object detection on phones, human analytics for next‑gen UI, high‑quality traffic scene simulation and 3D Gaussian splatting on devices powered by Snapdragon.
Finally, in our booth, we are presenting 11 spotlight talks by our AI researchers, ranging from advanced quantization to reasoning, embodied AI and personalization.
At Qualcomm Technologies, we pioneer groundbreaking research and extend its impact across various devices and industries, enabling our vision to drive intelligent computing everywhere. Qualcomm AI Research collaborates closely with the rest of the company to seamlessly integrate cutting-edge AI advancements into our products. This collaboration accelerates the transition from laboratory research to real-world applications, enriching our lives with innovative AI solutions.
https://www.qualcomm.com/news/onq/2025/12/qualcomm-ai-research-neurips-2025

