When Less is More: Surprising Gains from Label-Aware Quantization

Halıcıoğlu Data Science Institute, UC San Diego

Introduction

What is Label-Aware Quantization (LAQ)?

Post-training quantization (PTQ) techniques typically calibrate on data drawn from the same distribution as the training data to preserve accuracy under memory constraints. LAQ, in contrast, calibrates the quantizer using data only from the subset of labels the deployed model will actually need to classify.

Why LAQ?

Edge deployments often need to distinguish only a handful of classes. Calibrating the quantizer on that subset alone may permit more aggressive compression while preserving accuracy on the task that actually matters.

Greedy Path-Following Quantization (GPFQ): the PTQ algorithm used here. GPFQ quantizes a layer's weights sequentially, greedily choosing each quantized weight so that the quantized pre-activations track the full-precision pre-activations on the calibration data (a simplified sketch follows).
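A minimal NumPy sketch of the GPFQ update for a single fully connected layer, assuming a fixed symmetric alphabet and one neuron at a time; the function and variable names are illustrative, not the implementation used in the experiments.

```python
import numpy as np

def gpfq_quantize_neuron(w, X, alphabet):
    """Greedily quantize one neuron's weights w (shape [d]) so the quantized
    running pre-activation tracks the full-precision one on calibration data.

    X: calibration activations, shape [m, d]; alphabet: allowed quantized values.
    """
    q = np.zeros_like(w)
    u = np.zeros(X.shape[0])                 # accumulated error between FP and quantized pre-activations
    for t in range(w.shape[0]):
        x_t = X[:, t]
        denom = x_t @ x_t + 1e-12
        target = x_t @ (u + w[t] * x_t) / denom                  # greedy choice that best cancels the error
        q[t] = alphabet[np.argmin(np.abs(alphabet - target))]    # round to the nearest alphabet element
        u = u + (w[t] - q[t]) * x_t                              # update the running error
    return q

# Illustrative 4-bit symmetric alphabet scaled to the weight range
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                # 16 neurons, 64 inputs
X = rng.normal(size=(256, 64))               # calibration activations (drawn from the label subset under LAQ)
alphabet = np.linspace(-1, 1, 2**4 - 1) * np.abs(W).max()
W_q = np.vstack([gpfq_quantize_neuron(w, X, alphabet) for w in W])
```

Under LAQ, the calibration activations X come only from the chosen label subset rather than the full training distribution.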

Research Question: Does LAQ boost CNN performance on subset classification tasks?

Dataset

CIFAR-100: 100 classes of 32×32 color images, with 500 training and 100 test images per class.
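A minimal torchvision sketch of loading CIFAR-100 and restricting it to a 10-class subset; the ImageNet-style preprocessing and the specific class indices are placeholder assumptions, and the actual subsets come from the selection procedure in Methods.

```python
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

# ImageNet-style preprocessing so pretrained backbones (e.g. ResNet-50) can be reused (an assumption)
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR100(root="./data", train=False, download=True, transform=transform)

# Placeholder class indices; the real 10-class subsets are chosen by the greedy procedure in Methods
subset_classes = {3, 11, 24, 37, 42, 55, 61, 78, 85, 97}
keep = [i for i, y in enumerate(train_set.targets) if y in subset_classes]
train_loader = DataLoader(Subset(train_set, keep), batch_size=64, shuffle=True)
```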

Dataset Visualization
Figure 1: Visualizing Class Centers

Methods

Subset Generation (From Section 3):

  1. Feature Extraction (FE): Flattened output of convolutional layers from a pre-trained ResNet-50 (2048 dimensions).
  2. Dimensionality Reduction (DR): UMAP preserves cluster structure & location (2 dimensions).
  3. Inter-Class Distance (ICD): KL divergence (Gaussian closed-form) for every unique pair of classes.
  4. Subset Selection: Greedily selecting 10 classes based on inter-class similarity, so the three subsets are spread along the median-distance axis (see the code sketch after Figure 2):
    • Similar Classes: Low Median Distance
    • Dissimilar Classes: High Median Distance
    • Random Classes: Intermediate Median Distance
Data Processing
Figure 2: Data Preprocessing & Generation
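A rough sketch of the four subset-generation steps, assuming umap-learn for the 2-D embedding and an ImageNet-pretrained ResNet-50 backbone; the greedy selection shown (growing the subset by median KL distance to the classes already chosen) is one plausible reading of the procedure, and all helper names are illustrative.

```python
import numpy as np
import torch
import umap                                  # umap-learn package
from torchvision import models

# 1. Feature Extraction: 2048-d vectors from a pretrained ResNet-50 with its classifier removed
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x))
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# 2. Dimensionality Reduction: UMAP down to 2 dimensions, preserving cluster structure
def reduce_2d(features):
    return umap.UMAP(n_components=2, random_state=0).fit_transform(features)

# 3. Inter-Class Distance: closed-form KL divergence between per-class 2-D Gaussians
def gaussian_kl(mu0, cov0, mu1, cov1):
    k = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def pairwise_kl(embedding, labels):
    classes = np.unique(labels)
    stats = {c: (embedding[labels == c].mean(0), np.cov(embedding[labels == c].T)) for c in classes}
    D = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            if i != j:
                D[i, j] = gaussian_kl(*stats[a], *stats[b])
    return classes, D

# 4. Subset Selection: greedily grow a 10-class subset by median distance to the classes already chosen
def greedy_subset(classes, D, k=10, mode="similar"):
    chosen = [0]                              # arbitrary seed class (an illustrative choice)
    while len(chosen) < k:
        remaining = [i for i in range(len(classes)) if i not in chosen]
        scores = [np.median(D[i, chosen]) for i in remaining]
        pick = remaining[int(np.argmin(scores) if mode == "similar" else np.argmax(scores))]
        chosen.append(pick)
    return classes[chosen]
```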

Model Variations (From Section 4):

  • Original: Uses pre-trained weights.
  • Quant: Quantized using the training split of the subset.
  • Fine-Tuned (FT): Fine-tuned using the training split of the subset.
  • FT + Quant: Fine-tuned first, then quantized (the sketch after Figure 3 shows how the four variants relate).
Workflow
Figure 3: Experimental Setup
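One way the four variants could be assembled from a single pretrained backbone, assuming a GPFQ-style quantize routine (sketched in the Introduction) and a standard fine-tuning loop; function names and hyperparameters are illustrative.

```python
import copy
import torch

def fine_tune(model, loader, epochs=5, lr=1e-4):
    """Fine-tune on the subset's training split (illustrative hyperparameters)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.eval()

def quantize(model, calib_loader, bits=4):
    """Placeholder for GPFQ-style PTQ calibrated on the subset's training split."""
    # In the full pipeline, each layer's weights would be replaced by their greedy
    # path-following quantization (see the GPFQ sketch in the Introduction).
    return model

def build_variants(pretrained_model, train_loader):
    """Return the four experimental conditions built from one pretrained backbone."""
    variants = {
        "Original": pretrained_model,
        "Quant": quantize(copy.deepcopy(pretrained_model), train_loader),
        "FT": fine_tune(copy.deepcopy(pretrained_model), train_loader),
    }
    variants["FT + Quant"] = quantize(copy.deepcopy(variants["FT"]), train_loader)
    return variants
```

The same four variants are built for each architecture (ResNet-50, VGG16, GoogLeNet, MobileNetV2) and for each class subset.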

Results

Evaluation Method: accuracy reported on All Classes and on the Subset Classes.

Conclusion

Our study demonstrates that the effectiveness of label-aware quantization is strongly influenced by inter-class similarity: across all tested architectures, accuracy was better maintained when quantizing for subsets of dissimilar classes. While ResNet-50, VGG16, and GoogLeNet showed strong resilience to 4-bit precision reduction, MobileNetV2 was significantly more sensitive, indicating that architecture design strongly affects how well a model tolerates quantization. We also observed that fine-tuning provides greater benefits for similar-class tasks, whereas direct label-aware quantization performs best on distinct classes. These findings offer practical guidance for resource-constrained environments: developers can apply label-aware quantization directly for tasks involving distinct objects, and fine-tune before quantizing when the target classes are similar. This enables substantial model compression while maintaining high accuracy on specific subtasks, underscoring the viability of neural network quantization for resource-constrained edge devices.