At the core of IREX.ai's product is a highly optimized module for real-time video stream inference using computer vision algorithms – from image recognition to object detection and classification. A key differentiator of the platform is its ability to efficiently run neural network inference on CPUs, particularly Intel Xeon Skylake and Cascade Lake, without requiring GPUs.
Moving Away from GPUs: The Path to Efficient Infrastructure
Traditionally, the AI industry, especially in computer vision, has relied on graphics processing units (GPUs) for both training and inference. However, deploying models to production on GPUs can be expensive and technically challenging. IREX.ai made a strategic bet on CPU-based inference, and that bet has proven to be a winning approach.
A critical step in this direction was the adoption of Post-Training Quantization: converting weights and activations from float32 to int8 without retraining the model. Even in the early stages, quantization reduced model sizes by a factor of four and eliminated the GPU dependency. This simplified the infrastructure, enabled CPU-only operation, and significantly cut costs.
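The float32-to-int8 conversion can be illustrated with a minimal NumPy sketch of affine (asymmetric) post-training quantization. The tensor sizes and helper names here are illustrative only, not IREX.ai's actual pipeline:

```python
import numpy as np

def quantize_tensor(x: np.ndarray):
    """Affine (asymmetric) post-training quantization of a float32 tensor to int8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0  # spread the observed range over 256 int8 levels
    zero_point = int(round(-128.0 - x_min / scale))  # offset that maps x_min to -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_tensor(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Approximate reconstruction of the original float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)  # toy weight matrix

q, scale, zp = quantize_tensor(weights)

print(weights.nbytes // q.nbytes)  # → 4: int8 storage is one quarter of float32
# reconstruction error stays on the order of one quantization step (`scale`)
max_err = np.abs(dequantize_tensor(q, scale, zp) - weights).max()
```

Because the scale and zero point are computed from the already-trained weights, no retraining or labeled data is needed; this is what makes the approach "post-training."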
Optimization Results
Following quantization, CPU loads decreased by 10-12%, resulting in annual savings of tens of thousands of dollars on cloud platforms like AWS, Azure, and GCP. Additionally, stable inference with high video stream throughput enabled seamless scaling without compromising performance or quality.
As a result, the IREX.ai platform has successfully expanded into new markets, including the US, UK, Egypt, UAE, and others. This growth has been driven by a combination of technical optimization and a flexible, scalable architecture.
Technical Perspective: How the Optimizations Work
Neural networks can contain tens or even hundreds of millions of parameters in float32 format, requiring gigabytes of memory and substantial computational power. IREX.ai uses models such as YOLO, VGG, and ResNet50, which were not originally designed for CPU inference.
Quantization, particularly Post-Training Quantization, reduces the precision of these weights from float32 to int8, cutting model size by a factor of four and speeding up processing. This approach is particularly valuable in real-time video stream scenarios, where speed, stability, and efficiency are crucial.
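To sketch why int8 also speeds up compute (not just storage), the toy example below runs a matrix-vector product entirely in integer arithmetic, accumulating in int32 and applying a single float rescale at the end. This is the general pattern that int8-capable CPU instruction sets accelerate; the code itself is a plain NumPy illustration, not the platform's inference engine:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray):
    """Symmetric per-tensor quantization: zero point fixed at 0, scale only."""
    scale = float(np.abs(x).max()) / 127.0  # map the largest magnitude to 127
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)        # toy activation vector
w = rng.standard_normal((64, 64)).astype(np.float32)  # toy weight matrix

qx, sx = quantize_symmetric(x)
qw, sw = quantize_symmetric(w)

# The dot products run entirely in integer arithmetic; accumulating in int32
# avoids overflow, and one float multiply rescales the result at the end.
acc = qw.astype(np.int32) @ qx.astype(np.int32)
y_quant = acc.astype(np.float32) * (sw * sx)

y_ref = w @ x  # float32 reference
rel_err = np.abs(y_quant - y_ref).max() / np.abs(y_ref).max()
```

Keeping the inner loop in int8/int32 is what lets vectorized CPU instructions process several times more values per cycle than float32, while the rescale preserves accuracy to within a small relative error.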
Scaling Through Engineering
IREX.ai's experience demonstrates that scaling doesn't always require additional computational power. Sometimes, a different perspective on the problem, leveraging optimizations like quantization, pruning, or clustering, can be just as effective.
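For instance, magnitude pruning, one of the optimizations mentioned above, can be sketched in a few lines: weights with the smallest absolute values are zeroed out, leaving a sparse model. The sparsity level and tensor here are arbitrary examples, not values from IREX.ai's models:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so roughly `sparsity` of them are 0."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

pruned = magnitude_prune(w, sparsity=0.5)
print(round(float((pruned == 0).mean()), 2))  # → 0.5: half the weights removed
```

Sparse weights compress well and, with a sparse-aware runtime, skip multiplications by zero entirely; in practice pruning is usually followed by a short fine-tuning pass to recover any lost accuracy.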
Thanks to its optimized architecture and GPU independence, the platform is now deployed in high-impact projects – from detecting weapons in public spaces to locating missing persons. In collaboration with international partners, IREX.ai is developing solutions that make cities smarter and society safer.
Looking Ahead
The team continues to advance its infrastructure, exploring edge inference, AutoML, and other cutting-edge technologies. However, the shift to CPU inference and the adoption of quantization remain the foundation for sustainable growth and international expansion.