We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this article, we propose an alternative smoothed-Bayesian approach, inspired by existing BNP methods but designed to overcome their computational limitations when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by precise theoretical guarantees, including unbiasedness and optimality under a Bayesian framework within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Beraha, M., Favaro, S., Sesia, M. (2025). A smoothed-Bayesian approach to frequency recovery from sketched data. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1-21 [10.1080/01621459.2025.2490302].

A smoothed-Bayesian approach to frequency recovery from sketched data

Beraha, Mario;
2025

Abstract

We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this article, we propose an alternative smoothed-Bayesian approach, inspired by existing BNP methods but designed to overcome their computational limitations when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by precise theoretical guarantees, including unbiasedness and optimality under a Bayesian framework within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
Articolo in rivista - Articolo scientifico
Nonparametric estimation; Normalized random measures; Random hashing; Smoothed estimation; Worst-case analysis;
English
24-giu-2025
2025
1
21
embargoed_20260421
Beraha, M., Favaro, S., Sesia, M. (2025). A smoothed-Bayesian approach to frequency recovery from sketched data. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1-21 [10.1080/01621459.2025.2490302].
File in questo prodotto:
File Dimensione Formato  
Beraha-2025-J Am Stat Ass-AAM.pdf

embargo fino al 21/04/2026

Tipologia di allegato: Author’s Accepted Manuscript, AAM (Post-print)
Licenza: Creative Commons
Dimensione 1.5 MB
Formato Adobe PDF
1.5 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/551722
Citazioni
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
Social impact