Milestone Launches Vision Language Model (VLM)

Singapore, December 23, 2025 — Milestone Systems, a world leader in data-driven video technology, today released an advanced vision language model (VLM) specialising in traffic understanding and powered by NVIDIA Cosmos Reason. The VLM powers two new products: a Video Summarisation tool for XProtect® Video Management Software and a VLM as a Service for third party integrations.

Video Summarisation for XProtect allows users to search summaries from visual data and automates reporting

Modern video systems capture large volumes of data, yet reviewing footage is still mostly done by hand and takes a lot of time. Milestone Systems’ new Video Summarisation tool, a generative AI-powered plug-in for the XProtect Smart Client, gives users and operators a specialised product that automates operator tasks, saves critical time, and greatly reduces false alarm fatigue. According to early data, video summarisation may reduce operator false alarm fatigue by up to 30%.

The Video Summarisation tool analyses camera video and describes what is happening. Users send a brief video clip and a prompt outlining their request, and the model returns a text summary in a matter of seconds.

Key capabilities:

  • Convert video segments into structured text summaries inside XProtect Smart Client
  • Search summaries based on video content, rather than timestamps or manual tagging
  • Bookmark and filter summaries to streamline review workflows
  • Integrate seamlessly with existing XProtect event and rule logic to trigger automated summaries based on specific alarms or alerts
  • Focus attention on valid events by filtering out irrelevant motion or noise
  • Access customised, sovereign VLMs per region, starting with the US and EU, with more regions to follow

The Video Summarisation tool can be downloaded for free and installed in the XProtect Smart Client in a matter of minutes. Users pay only when the VLM is prompted.

VLM as a Service for developers: Add production-ready video intelligence to any application

With Milestone’s Hafnia VLM as a Service (VLMaaS), developers, integrators and partners get API access to production-ready video intelligence built on NVIDIA’s latest technology and fine-tuned on responsibly sourced data.

VLMaaS adds generative AI to any existing solution, regardless of the level of analytics currently in place, and enables developers to quickly create AI-powered solutions without having to set up, optimise, or manage their own AI systems. Whether testing a minimum viable product (MVP) or growing a platform, this makes it quick and easy to build sophisticated video intelligence capabilities into apps.

VLMaaS can greatly accelerate AI and analytics development, requiring up to 70 times less labour than fine-tuning a VLM to achieve the same results.

Key capabilities:

  • Access a high-accuracy vision language model, fine-tuned on traffic-optimised data and built on NVIDIA Cosmos Reason
  • Follow prompt-based instructions for traffic-related operations
  • API-first delivery: simple integration via HTTPS
  • Fine-tuned models for US and EU markets, with more regions to follow
  • Designed to build standalone solutions or integrate with the Milestone product portfolio
  • 100% responsibly sourced training data with auditable data lineage, GDPR- and EU AI Act-compliant, used for the fine-tuning of the model
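
For developers wondering what "a video clip plus a prompt over HTTPS" might look like in practice, the following is a minimal sketch only. The announcement does not publish the actual endpoint, authentication scheme, or request schema, so the URL, the field names `prompt` and `video_base64`, and the bearer-token header are all hypothetical placeholders chosen for illustration. The code builds the request but does not send it.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: the real VLMaaS URL and schema are not given in the announcement.
EXAMPLE_ENDPOINT = "https://vlm.example.invalid/v1/summaries"


def build_summary_request(clip: bytes, prompt: str, api_key: str,
                          endpoint: str = EXAMPLE_ENDPOINT) -> urllib.request.Request:
    """Package a short video clip and a text prompt as an HTTPS POST request."""
    if not endpoint.startswith("https://"):
        raise ValueError("endpoint must use HTTPS")
    payload = {
        "prompt": prompt,  # assumed field name
        "video_base64": base64.b64encode(clip).decode("ascii"),  # assumed field name
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_summary_request(b"\x00\x01fake-clip",
                                "Describe any traffic incidents.", "demo-key")
    print(req.get_method(), req.full_url)
    # Sending would be: urllib.request.urlopen(req, timeout=30)
```

Because pricing is described as pay-per-use, keeping request construction separate from sending, as in this sketch, makes it easier to inspect or log each call before it is made.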

Pricing for the VLMaaS is pay-per-use (based on API calls) – no large upfront investments or custom training costs. Sign up for early access at https://hafnia.milestonesys.com/.

Andrew Burnett, Acting Chief Technology Officer, Milestone Systems, said, “With the Vision Language Model as a Service and Video Summarization for XProtect, we’re tackling some of the most challenging bottlenecks: video overload and time-consuming manual work. Operators get immediate insight directly within XProtect; builders get API‑first access to production‑ready intelligence without bespoke training or heavy infrastructure. Because this model is specialized for real-world traffic video and fine-tuned on responsibly sourced data, customers can trust the results, deploy with confidence, and enhance all existing solutions in place. It’s the fastest, most advanced and impactful path to turning video into actionable outcomes.”

XProtect customers like the cities of Genoa, Italy, and Dubuque, Iowa, US, are excited to use these new capabilities, leading the way in adopting advanced video intelligence solutions to enhance traffic management.

Built on responsible AI, powered by real-world data

The two new offerings are powered by Milestone’s Hafnia VLM, which has been fine-tuned on 75,000 hours of responsibly sourced, real-world video data from either Europe or the US, using NVIDIA Cosmos Curator for data preparation and running either on cloud infrastructure or regional data centers. Leveraging NVIDIA Cosmos Reason VLM and Milestone’s data for fine-tuning makes it one of the most advanced video AI platforms in the industry.