Bicocca Open Archive

We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this article, we propose an alternative smoothed-Bayesian approach, inspired by existing BNP methods but designed to overcome their computational limitations when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by precise theoretical guarantees, including unbiasedness and optimality under a Bayesian framework within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Beraha, M., Favaro, S., Sesia, M. (2025). A smoothed-Bayesian approach to frequency recovery from sketched data. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1-21 [10.1080/01621459.2025.2490302].

A smoothed-Bayesian approach to frequency recovery from sketched data

Beraha, Mario;Favaro, Stefano;Sesia, Matteo

2025

Abstract

We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this article, we propose an alternative smoothed-Bayesian approach, inspired by existing BNP methods but designed to overcome their computational limitations when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by precise theoretical guarantees, including unbiasedness and optimality under a Bayesian framework within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on multi-view learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Scheda breve

Scheda completa

Scheda completa (DC)

	Sottotipologia
	
				Articolo in rivista - Articolo scientifico
			
	Parole chiave
	
				Nonparametric estimation; Normalized random measures; Random hashing; Smoothed estimation; Worst-case analysis;
			
	Lingua del contenuto
	
				English
			
	Data ahead of print o Data prima pubblicazione Online
	
				24-giu-2025
			
	Data di pubblicazione
	
				2025
			
	Rivista
	
				JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
			
	Pagina iniziale
	
				1
			
	Pagina finale
	
				21
			
	DOI dell'articolo
	
				https://dx.doi.org/10.1080/01621459.2025.2490302
			
	Fulltext
	
				embargoed_20260421
			
	Citazione
	
				Beraha, M., Favaro, S., Sesia, M. (2025). A smoothed-Bayesian approach to frequency recovery from sketched data. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1-21 [10.1080/01621459.2025.2490302].
			
	Appare nelle tipologie:
	
				01 - Articolo su rivista

File in questo prodotto:

File	Dimensione	Formato
Beraha-2025-J Am Stat Ass-AAM.pdf embargo fino al 21/04/2026 Tipologia di allegato: Author’s Accepted Manuscript, AAM (Post-print) Licenza: Creative Commons Dimensione 1.5 MB Formato Adobe PDF Visualizza/Apri Richiedi una copia	1.5 MB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10281/551722

Citazioni

0

0

Social impact