CoffeeA: The Data Was the Bug, Rebuilding Coffee Leaf Disease Classification from the Ground Up
Keywords:
coffee leaf diseases, DINOv2, K-fold ensemble, data deduplication, bootstrap confidence intervals, probability calibration, Vision Transformer, precision agriculture
Abstract
Reported accuracies for coffee leaf disease classification on public benchmarks are frequently overstated because the underlying datasets retain a considerable number of near-duplicate samples that are not filtered before partitioning into training and test subsets. Motivated by this gap, the present study addresses a single guiding research question: once stringent data quality control has been enforced, which elements of a classification pipeline are responsible for genuine, statistically defensible accuracy gains? Our answer is CoffeeA, a framework that contributes the following four reproducible elements. (1) Five publicly available coffee corpora (totalling 58,383 raw images) are unified and processed through a two-stage deduplication scheme, first an exact MD5 byte hash and then a perceptual dHash at 17×16 resolution (256-bit), retaining 6,294 genuinely unique images and discarding 89.2% of the raw images as redundant; one striking observation is that JMuBEN preserves only 2.5% of its original images as truly unique. (2) A K-fold (k = 5) ensemble of DINOv2 ViT-L/14 backbones is built using selective fine-tuning restricted to the last three Transformer blocks, on-the-fly switching between Mixup and CutMix, EMA weight averaging, and label smoothing. (3) A 12-pass multi-scale Test-Time Augmentation procedure is applied to the K-fold ensemble and then paired with single-parameter temperature scaling (T* = 0.55); together they cut Expected Calibration Error by 93.3% (from 0.0833 down to 0.0056) without affecting top-1 accuracy. (4) Every reported
number is accompanied by 95% bootstrap confidence intervals together with two-tailed exact McNemar tests across seven
baseline configurations. The proposed K-fold ensemble (without TTA) reaches 97.88% accuracy and 97.85% macro-F1 (95% CI:
[96.84%, 98.83%]), a statistically significant improvement relative to the frozen+Ensemble ELM reference (McNemar p =
0.0052). A finding that runs against common expectations is that, on deduplicated data, K-fold fine-tuning is the only component
contributing a meaningful gain (+1.69 pp accuracy / +1.37 pp F1), whereas handcrafted descriptors and multi-scale TTA are
essentially neutral for classification accuracy.
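
As an illustration of the two-stage deduplication in element (1), the following is a minimal sketch and not the authors' implementation. It assumes each image is available as a tuple of raw file bytes plus a 2-D list of grayscale values; the nearest-neighbour resize, the helper names (`resize_nearest`, `dhash`, `hamming`, `deduplicate`), and the `max_distance` threshold are all hypothetical choices made for the example, since the abstract does not state how near-duplicates are matched. Shrinking to 17 columns by 16 rows and comparing horizontal neighbours yields 16 × 16 = 256 bits, which agrees with the hash size quoted above.

```python
import hashlib


def resize_nearest(gray, width, height):
    """Nearest-neighbour resize of a 2-D list of grayscale values (assumed resampler)."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(gray, hash_size=16):
    """Difference hash: shrink to (hash_size + 1) x hash_size, compare horizontal neighbours."""
    small = resize_nearest(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits  # integer holding hash_size * hash_size bits (256 by default)


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def deduplicate(items, max_distance=0):
    """items: iterable of (name, raw_bytes, gray). Returns the names that are kept.

    Stage 1 drops byte-identical files via MD5; stage 2 drops images whose dHash
    lies within max_distance bits of an already kept image (threshold is an assumption).
    """
    seen_md5 = set()
    kept_hashes = []
    kept_names = []
    for name, raw, gray in items:
        digest = hashlib.md5(raw).hexdigest()
        if digest in seen_md5:
            continue  # exact byte-level duplicate
        seen_md5.add(digest)
        h = dhash(gray)
        if any(hamming(h, k) <= max_distance for k in kept_hashes):
            continue  # perceptual near-duplicate
        kept_hashes.append(h)
        kept_names.append(name)
    return kept_names
```

The linear scan over `kept_hashes` is quadratic in the number of images; it is kept for readability, and a corpus of this size would likely call for an index over the hashes. Deduplication of this kind only prevents leakage if it runs before the train/test split.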
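
The calibration step in element (3) can be illustrated with a small sketch of single-parameter temperature scaling and Expected Calibration Error. This is a generic version written under stated assumptions, not the paper's code: the 15 equal-width confidence bins, the grid search over T, and the function names are hypothetical choices for the example. Because dividing all logits by the same positive T never changes their ordering, the predicted class and therefore top-1 accuracy are unaffected, consistent with the claim above; a fitted value below 1, such as the reported T* = 0.55, sharpens the predicted probabilities.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, labels, temperature):
    """Mean negative log-likelihood at a given temperature (log-sum-exp form)."""
    loss = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / temperature for v in z]
        m = max(scaled)
        lse = m + math.log(sum(math.exp(v - m) for v in scaled))
        loss += lse - scaled[y]
    return loss / len(labels)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature minimising validation NLL over a grid (assumed search)."""
    grid = grid or [0.05 * i for i in range(1, 101)]  # 0.05 .. 5.00
    return min(grid, key=lambda t: nll(logits, labels, t))


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: sample-weighted mean gap between confidence and accuracy per bin."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, summed confidence, hits
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(pred == y)
    n = len(labels)
    return sum(abs(conf_sum - hits) / n for count, conf_sum, hits in bins if count)
```

In practice the temperature would be fitted on held-out validation logits and then applied unchanged to the test logits before ECE is measured.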
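
For the statistical protocol in element (4), the sketch below shows one common way to compute a percentile bootstrap confidence interval for accuracy and a two-tailed exact McNemar test from per-sample correctness vectors. It is an illustration under assumptions: the number of resamples, the percentile method, the fixed seed, and the function names are hypothetical, as the abstract does not specify them. The exact test treats the discordant pairs as Binomial(n, 0.5) and doubles the smaller tail, capped at 1.

```python
import random
from math import comb


def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy; `correct` is a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mcnemar_exact(correct_a, correct_b):
    """Two-tailed exact McNemar p-value for two classifiers on the same samples."""
    # b: A right and B wrong; c: A wrong and B right (the discordant pairs).
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Both functions operate on paired per-image outcomes, so the compared models must be evaluated on the same deduplicated test set in the same order.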