Workplace injuries remain one of the most consistent causes of serious harm and lost time on construction sites and in factories. Regulators across the EU and beyond mandate hard hats in these environments, yet enforcement still depends on supervisors walking the floor and noticing infractions in real time.
At the same time, virtually every modern industrial facility is equipped with CCTV cameras. Dozens of them, sometimes hundreds, streaming around the clock to a control room or a forgotten DVR humming in a closet. They record diligently. Almost no one watches them in real time.
The reality is that CCTV in industrial settings is a forensic tool, not a preventive one. After an accident, someone scrolls back through hours of footage to figure out what happened. Before the accident, the footage exists in a kind of permanent fog: it sees everything and notices nothing.
That is why we built Safety Guard AI: a computer-vision system that analyzes industrial CCTV streams in real time, detects when a worker enters the frame without a hard hat, alerts the right people the moment it happens, and turns months of previously passive recordings into a queryable safety dataset.
Here is how we built it, our tech stack, and the lessons we learned.
The Problem
A typical mid-sized plant has dozens of cameras feeding monitor walls that no single human can meaningfully observe. Even where a guard is assigned to the wall, fatigue and split attention mean the most relevant frames almost always pass unnoticed. The footage is recorded, but nothing is acted on.
Our goal was to build a system with the following features:
- Real-time detection: Flag missing hard hats immediately upon a worker appearing in the frame.
- Multi-stream throughput: Process multiple concurrent camera feeds.
- Low end-to-end latency: Minimize glass-to-glass delay, so the operator's dashboard reflects what is happening on the floor right now, not thirty seconds ago.
- High detection accuracy: Achieve target mAP on the helmet-detection task with usable performance at HD resolution.
- Auditable evidence: Persist every detected violation with timestamp, camera identifier, duration, and a frame still that holds up as evidence.
- Historical audit capability: Run the same pipeline against archived footage on demand.
- Cost-efficient GPU usage: Keep cloud GPU spend predictable and elastic.
Tech Stack
We dedicated the first phase of the project to research and stack selection. Here is what we chose and why:
Cloud Provider: Azure Cloud
Our development environment runs on Azure DevOps, so integrating with Azure Cloud was the path of least friction for storage, container registry, secret management, and static frontend hosting.
ML Environment: Azure ML Workspace
We use Azure ML for managing trained models and their evaluation results. Its deep integration with the rest of our Azure footprint means versioning, lineage, and artifact handoff to deployment all live in one place rather than being stitched together by hand.
App Server: RunPod
The non-obvious choice in our stack. RunPod offers cost-effective on-demand access to NVIDIA GPUs in the specific class we needed. Since a model compiled with TensorRT is optimized for a specific GPU architecture and must be deployed on matching hardware to maintain performance, the entire backend runs on RunPod rather than on a more general cloud GPU service.
Gateway: Cloudflare
Free at our traffic volume, which is not the main reason we picked it. The decisive factor was that RunPod assigns dynamic service URLs, and we needed a stable custom domain in front of them. Cloudflare gave us that mapping, plus traffic management and DDoS protection, essentially as a side effect.
Backend: Python + FastAPI + SQLAlchemy
We aimed for a single language across the project, and Python's dominance in model training and the computer-vision ecosystem made it the natural choice. FastAPI has become the default framework for modern Python services. SQLAlchemy is the most widely adopted ORM in the Python ecosystem, with the largest community and the most mature production tooling.
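As an illustration, here is a minimal sketch of what a violation record and a filterable list endpoint look like in this stack. The table and field names are simplified stand-ins, not our production schema:

```python
from datetime import datetime
from fastapi import FastAPI
from sqlalchemy import create_engine, select, String, DateTime, Float
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

class Base(DeclarativeBase):
    pass

class Violation(Base):
    __tablename__ = "violations"
    id: Mapped[int] = mapped_column(primary_key=True)
    camera_id: Mapped[str] = mapped_column(String(64))
    started_at: Mapped[datetime] = mapped_column(DateTime)
    duration_s: Mapped[float] = mapped_column(Float)
    frame_path: Mapped[str] = mapped_column(String(256))  # evidence still

engine = create_engine("sqlite:///violations.db")
Base.metadata.create_all(engine)
app = FastAPI()

@app.get("/violations")
def list_violations(camera_id: str | None = None):
    # Optional camera filter; the real dashboard also filters by time range.
    with Session(engine) as session:
        stmt = select(Violation)
        if camera_id:
            stmt = stmt.where(Violation.camera_id == camera_id)
        return [
            {"id": v.id, "camera": v.camera_id,
             "started_at": v.started_at.isoformat(), "duration_s": v.duration_s}
            for v in session.scalars(stmt)
        ]
```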
Database: SQLite + Turso
Turso is a free, developer-friendly database service built on SQLite (via libSQL) that gave us rapid integration and database branching.
Video processing: pyAV
Ease of use, native Python integration, and the fact that it wraps FFmpeg - the industry standard for video decoding and encoding. For a pipeline that deals with multiple concurrent streams in different codecs and frame rates, pyAV gave us reliable primitives without forcing us to drop down to FFmpeg's CLI for every operation.
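For a flavor of why this works well, here is a minimal PyAV decode loop; the RTSP transport option is an assumption about the camera source:

```python
import av

def bgr_frames(url: str):
    # Open the camera feed; FFmpeg (via PyAV) negotiates the codec for us.
    container = av.open(url, options={"rtsp_transport": "tcp"})
    for frame in container.decode(video=0):
        # Convert to a numpy array in BGR order, ready for the detector.
        yield frame.to_ndarray(format="bgr24")
```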
Detection model: RF-DETR
Cost-effective, high-performing, and fast. It is supported by an active community with a low barrier to entry, which mattered for our fine-tuning iterations.
Tracking & visualization: Supervision
Part of the RF-DETR ecosystem and open-source. We use Supervision for object tracking across frames and for rendering bounding boxes onto outgoing video.
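A condensed sketch of this step, assuming Supervision's ByteTrack tracker and box/label annotators (the function name is ours):

```python
import supervision as sv

tracker = sv.ByteTrack()
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

def track_and_annotate(frame, detections: sv.Detections):
    # Assign stable tracker IDs so the same worker keeps one identity
    # across frames, then draw boxes and labels onto the outgoing frame.
    tracked = tracker.update_with_detections(detections)
    frame = box_annotator.annotate(scene=frame, detections=tracked)
    return label_annotator.annotate(scene=frame, detections=tracked)
```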
Inference runtime: TensorRT
The industry standard for high-performance inference on NVIDIA GPUs. TensorRT compilation is what makes it feasible to run the model on multiple concurrent streams on a single instance.
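For context, building an architecture-specific engine from an ONNX export looks roughly like this with the TensorRT 8.x Python API (file names are placeholders). The resulting plan file is only valid on the GPU architecture it was built on, which is what couples our deployment to specific RunPod hardware:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("rfdetr.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # half precision for throughput
serialized_engine = builder.build_serialized_network(network, config)
with open("rfdetr.plan", "wb") as f:
    f.write(serialized_engine)
```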
Frontend: TypeScript / React + Chakra UI
React is the industry standard for modern web applications, with the largest ecosystem and community in frontend development. We paired it with Chakra UI for rapid, smooth implementation of the operator dashboard.
The Architecture
Safety Guard AI is, at its core, a real-time video analytics pipeline with a persistence layer and a dashboard. The architecture is organized around two flows: streams come in, get analyzed, and go back out annotated; violations are extracted, persisted, and surfaced to the dashboard.
The Processing Pipeline:
- Stream ingest: Our service connects to each industrial camera and pulls its feed.
- Decoding: pyAV decodes each stream into a frame queue, handling codec variation and frame-rate normalization per camera.
- Frame extraction: Frames are pulled from the decode queue at the rate the inference stage can sustain, so under load we drop frames deliberately rather than letting the queue grow without bound.
- Inference: RF-DETR, compiled to TensorRT and pinned to the target GPU architecture, runs detection on sampled video frames. The model is fine-tuned for hard-hat detection.
- Tracking: Supervision tracks detected objects across frames, which lets us reason about violation duration rather than alerting on every individual frame in which a worker is visible without a helmet.
- Annotation: Bounding boxes and labels are rendered onto the frames using Supervision's drawing utilities.
- HLS streaming: Annotated frames are re-encoded and delivered back to the operator's browser as an HLS stream, which gives us a delivery protocol that works through standard web infrastructure without requiring custom client software.
- Persistence: When a tracked, helmet-less worker remains in frame past a configured duration threshold, we persist a violation record - frame still, timestamp, duration, camera identifier - to Turso, where it becomes queryable from the dashboard, the historical-audit views, and the PDF exports. A sketch of the duration-threshold logic follows this list.
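Here is a simplified sketch of that duration-threshold logic. The threshold value and the persistence stub are illustrative, not production code:

```python
import time

VIOLATION_THRESHOLD_S = 3.0        # assumption: configurable per deployment
first_seen: dict[int, float] = {}  # tracker_id -> first helmet-less sighting
reported: set[int] = set()

def persist_violation(tracker_id: int, duration_s: float) -> None:
    # Stand-in for the real write to Turso (frame still, timestamp, camera id).
    print(f"violation: track {tracker_id}, {duration_s:.1f}s without a helmet")

def check_violations(helmetless_ids: set[int], now: float | None = None) -> None:
    now = time.monotonic() if now is None else now
    for tid in helmetless_ids:
        started = first_seen.setdefault(tid, now)
        if tid not in reported and now - started >= VIOLATION_THRESHOLD_S:
            reported.add(tid)
            persist_violation(tid, duration_s=now - started)
    # Reset tracks that put a helmet on or left the frame.
    for tid in list(first_seen):
        if tid not in helmetless_ids:
            first_seen.pop(tid)
            reported.discard(tid)
```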
The Dashboard:
The user opens the dashboard and sees their cameras live, with bounding boxes drawn around workers and helmets in real time. The moment a worker enters the frame without a hard hat and the violation is confirmed past the duration threshold, the violation is logged automatically.
From the dashboard, the operator can drill into the violation list - every detected event in one place, filterable by camera and by time range. Each entry shows date and time, duration, the camera that recorded it, and an evidence frame. Clicking through opens the single-violation detail view: the same frame at full size, the violation type, the precise timestamp, the duration, and the camera name.
Two further capabilities round out the workflow. The first is a historical audit: the same detection pipeline can be run against archived footage on demand. The second is reporting. The operator can view aggregate statistics and export the entire view as a PDF.
Technical Challenge
The most significant engineering hurdle we faced was achieving real-time, multi-stream video processing within a Python-based architecture on cloud GPUs.
The Challenge
Python is excellent for ML, but it presents challenges when you need to sustain high-throughput, parallel video pipelines. We needed to process multiple concurrent HD streams on a single GPU without causing significant "glass-to-glass" latency. This was further complicated by hardware coupling: a model optimized via TensorRT for a specific NVIDIA card must be deployed on identical hardware to maintain performance.
The Solution
Our solution was a four-pronged approach to infrastructure and optimization. First, we leveraged TensorRT-optimized inference to ensure we were squeezing every bit of performance out of the GPU. Second, we built custom Docker images specifically tuned to our target RunPod instances, guaranteeing hardware parity between our training and production environments. Third, we implemented pyAV-based parallel pipelines, allowing us to handle frame decoding and encoding in separate processes to prevent the Python interpreter from choking the video flow. Finally, we placed Cloudflare in front of our dynamic RunPod endpoints. This allowed us to maintain a stable custom domain and manage traffic effectively, shielding the backend from the instability of dynamic URLs and providing an extra layer of security.
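To make the third prong concrete, here is a simplified sketch of the pattern: decoding runs in its own process, and a small bounded queue between decoder and inference drops the oldest frame under backpressure instead of letting latency grow. The camera URL and queue size are illustrative:

```python
import multiprocessing as mp
import queue

def decode_worker(url: str, frames: mp.Queue) -> None:
    import av  # imported in the child process, outside the parent's GIL
    for frame in av.open(url).decode(video=0):
        img = frame.to_ndarray(format="bgr24")
        try:
            frames.put_nowait(img)
        except queue.Full:
            # Inference is behind: discard the oldest frame, keep the newest.
            try:
                frames.get_nowait()
            except queue.Empty:
                pass
            frames.put_nowait(img)

if __name__ == "__main__":
    frames: mp.Queue = mp.Queue(maxsize=8)  # small bound keeps latency low
    mp.Process(target=decode_worker,
               args=("rtsp://camera-01/stream", frames),
               daemon=True).start()
    while True:
        img = frames.get()  # near-newest frame; hand off to TensorRT inference
```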
Project Timeline
Safety Guard AI was delivered over five months, totaling roughly 2,000 working hours across a five-person team - one architect, three developers, and one DevOps engineer.
Month 1 - Requirements Analysis & Architecture Design
We took the project's business concept and translated it into technical requirements: throughput targets, latency budgets, accuracy thresholds, and evidence-retention rules. In parallel, we ran model selection, evaluated cloud GPU providers, and prototyped the runtime environment. By the end of the month, we had committed to RF-DETR, TensorRT, pyAV, RunPod, and the broader stack - and we had a written architecture document that the rest of the team built against.
Months 2 and 3 - Implementation
The core build phase. We brought up the ingestion pipeline, the inference runtime, the persistence layer, the operator dashboard, the historical-audit mode, the violation list and detail views, and the PDF export.
Month 4 - Optimization
Performance testing and tuning. We profiled the per-stage cost in the pipeline, pushed throughput up, pulled latency down, and stress-tested the system under multi-stream load to find where it breaks.
Month 5 - Deployment
Production release on RunPod fronted by Cloudflare. Infrastructure hardening, observability, monitoring, and alerting on the system itself.
Conclusions
We make it a standard practice to wrap up our work with a deep-dive retrospective. Safety Guard AI gave us plenty to examine.
Key Challenges Faced
- Real-time video processing at scale: Multi-stream, low-latency video in Python is not a solved problem, and most of our hardest work went into making it stable rather than into the model itself.
- Runtime environment configuration: Aligning TensorRT, the CUDA toolchain, our base Docker images, and the specific RunPod GPU instance class consumed more time than any one of us would have predicted.
- Software architecture inside Python: Building the system in Python required us to make key design decisions, including choosing the web framework, structuring dependency injection, designing the persistence layer, and separating concerns between the streaming and application layers.
- Streaming protocol selection and cloud GPU integration: Picking HLS, mapping it through Cloudflare to dynamic RunPod URLs, and getting consistent behavior through the whole chain took several iterations.
- Local development hardware: macOS does not support the CUDA toolchain, which excluded part of the team from heavy local iteration.
- Parallel programming in a video pipeline: Concurrency, backpressure, frame drops, codec edge cases, and the interaction between the tracker and the persistence layer collectively produced a long tail of subtle bugs that only show up under sustained load.
- Model selection trade-offs: Project requirements constrained the model space more than we initially acknowledged. We had to balance accuracy, inference cost, and hardware affinity.
What We Would Do Differently
- Reconsider Python for the streaming layer specifically: Python carried the model and the pipeline well, but the web-streaming layer in front of HLS was the part of the system where it gave us the most pain. We are not abandoning Python, but we would consider using a different technology stack for certain components.
- Move from Scrum to Kanban for R&D-heavy work: We ran one-week Scrum sprints and watched task carry-over rates climb whenever the work touched computer-vision research. Innovative tasks resist tight estimation, and forcing them into sprint cadences produced friction without producing better outcomes. Kanban would have served the project better.
- Develop a better estimation approach for innovative tasks: Related to the above. We need a way to budget exploratory work that does not pretend it is the same shape as feature delivery.
- Set up standardized QA for system performance earlier: We did performance testing well, but late. A standardized verification process - run continuously rather than as a separate phase - would have surfaced issues sooner and prevented at least one architectural detour.
- Be honest about what we did not hit: The project was completed significantly later than originally planned, and frame-processing speed has not yet hit the benchmarks we targeted at kickoff.
What Went Well
- RF-DETR was the right model choice: The balance of cost, speed, and accuracy held up across our evaluation, and the active community made fine-tuning iterations smoother than the alternatives we considered.
- Chakra UI delivered the dashboard smoothly: Of every component in the system, the frontend was the most predictable to build.
- RunPod was cost-effective for this workload profile: On-demand GPU access at the class we needed kept the bill predictable, and we avoided the overheads of more general-purpose cloud GPU services we evaluated.
- Cloudflare elegantly solved the dynamic-URL problem: Putting Cloudflare in front of RunPod gave us a stable custom domain, traffic management, and DDoS protection essentially for free at our traffic level - a rare case where free-tier capability lined up exactly with what production needed.
- The R&D bets paid off: Custom fine-tuning of the detection model and HLS as the delivery protocol were both early decisions that we revisited under stress and ultimately kept.
Summary
Safety Guard AI demonstrates that specialized, fine-tuned computer vision can transform an existing CCTV installation into an active safety system without requiring new cameras, new control rooms, or new headcount.
Technical Execution
- 5-Person Team - 1 architect, 3 developers, and 1 DevOps engineer, delivering the full project lifecycle from research through production.
- 5 Months total time from requirements analysis to a production-ready MVP.
- 2,000 Working Hours invested in research, iterative development, optimization, and deployment.
The Results
- Throughput (FPS): The system achieves a processing rate of 30 frames per second per stream, ensuring fluid motion analysis.
- Camera Density: Optimized hardware utilization allows for 5 concurrent camera feeds to be processed on a single GPU instance.
- Single-image Inference Latency: Model inference on a single frame takes 6 ms on the target GPU.
- Model Accuracy: The fine-tuned RF-DETR model achieved 91.3% mAP (Mean Average Precision), meeting the target detection reliability.
By replacing passive recording with active detection, evidence collection, and reporting, Safety Guard AI is a 24/7 digital supervisor that never blinks. We designed and built the whole system end-to-end - from the model and the runtime to the dashboard and the deployment topology - and we are continuing to evolve it.
