Fundamentals of Task-Agnostic Data Valuation

AAAI 2023

M. M. Amiri, F. Berdoz, R. Raskar

ETH Zurich, Switzerland; MIT, USA

data-valuation · data-markets · federated-learning · privacy

Abstract

We study the problem of valuing the data of a data owner (seller) for a data seeker (buyer). Data valuation is often carried out for a specific task, assuming a particular utility metric such as test accuracy on a validation set, which may not exist in practice. In this work, we focus on task-agnostic data valuation without any validation requirements. The data buyer has access to a limited amount of data and seeks more samples from a data seller. We formulate the problem as estimating the differences in the statistical properties of the seller’s data with respect to the baseline data available at the buyer. We capture these statistical differences through second-moment statistics, measuring the diversity and relevance of the seller’s data for the buyer.

Overview

Diagram of the buyer-seller-broker interaction for task-agnostic data valuation, showing how principal component directions are shared to compute diversity and relevance metrics.
Figure 1: The data valuation protocol. A buyer shares principal component directions (eigenvectors) with sellers via a broker. Each seller projects their data statistics onto these directions. The broker computes diversity and relevance scores without either party revealing raw data or variance values.

Data valuation in machine learning typically requires a specific downstream task and a validation set, neither of which may be available in practice. This paper proposes a task-agnostic approach to data valuation that estimates the value of a seller’s data for a buyer based on statistical properties alone.

How it works

The method captures differences in data distributions through second-moment statistics (covariance matrices):

  1. The buyer computes PCA on their dataset and shares the principal component directions (eigenvectors) with sellers through a broker.
  2. Each seller projects their covariance matrix onto these directions to compute variance estimates along each component.
  3. The broker computes two complementary metrics:
    • Diversity: Measures how much the seller’s data covers directions underrepresented in the buyer’s data (geometric mean of normalized absolute variance differences).
    • Relevance: Measures how much the seller’s data overlaps with the buyer’s data distribution (geometric mean of normalized minimum variances).
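The first two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation: the function names are invented here, and the choice of `numpy.linalg.eigh` for the buyer's PCA is one of several equivalent options.

```python
import numpy as np

def buyer_directions(X_buyer):
    """Step 1: the buyer computes PCA on its own data and shares only the
    eigenvector directions; the eigenvalues (its variances) stay private."""
    Xc = X_buyer - X_buyer.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are directions
    return eigvecs, eigvals

def seller_projected_variances(X_seller, eigvecs):
    """Step 2: the seller projects its own covariance matrix onto the buyer's
    shared directions, obtaining one variance estimate per direction."""
    Xc = X_seller - X_seller.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    # Variance along each shared direction u_i is the quadratic form u_i^T Σ u_i.
    return np.sum(eigvecs * (cov @ eigvecs), axis=0)
```

Only the eigenvectors cross the broker in step 1, and only projected variances cross in step 2, which is what keeps the raw data and the buyer's variance values private.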

These metrics satisfy the constraint Diversity + Relevance ≤ 1, are bounded in [0, 1], and align with intuitive expectations: identical data yields Diversity = 0 and Relevance = 1, while completely different data yields Diversity = 1 and Relevance = 0.
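A minimal sketch of the two metrics, under one plausible normalization (an assumption of this sketch, not necessarily the paper's exact formula): divide each direction's variance difference and minimum by the larger of the two variances, so the per-direction diversity and relevance terms sum to exactly 1. Superadditivity of the geometric mean then gives Diversity + Relevance ≤ 1, matching the stated constraint.

```python
import numpy as np

def diversity_relevance(v_buyer, v_seller, eps=1e-12):
    """Diversity and relevance from per-direction variances.

    Assumes non-degenerate (positive) variances. Normalizing by the
    per-direction maximum variance is an assumption of this sketch; it makes
    d_i + r_i = 1 in every direction, so the geometric means satisfy
    diversity + relevance <= 1.
    """
    v_b = np.asarray(v_buyer, dtype=float)
    v_s = np.asarray(v_seller, dtype=float)
    m = np.maximum(v_b, v_s)
    d = np.abs(v_b - v_s) / m       # normalized absolute variance difference
    r = np.minimum(v_b, v_s) / m    # normalized overlap (minimum variance)
    # Geometric mean; eps guards log(0) when a term vanishes exactly.
    gm = lambda x: float(np.exp(np.mean(np.log(x + eps))))
    return gm(d), gm(r)

# Identical variances: diversity near 0, relevance near 1.
print(diversity_relevance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# Very different variances push diversity toward 1 and relevance toward 0.
print(diversity_relevance([1.0, 1.0], [1e6, 1e-6]))
```

The geometric mean (rather than the arithmetic mean) makes both scores sensitive to every direction: a single direction with zero overlap drives relevance to zero, regardless of how well the other directions match.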

Geometric illustration of diversity and relevance in 2D, showing how buyer and seller data distributions project onto principal components.
Figure 2: Geometric intuition for diversity and relevance. The buyer's and seller's data distributions are compared along the buyer's principal component directions. Diversity captures variance in orthogonal directions; relevance captures overlap along shared directions.

Results

Diversity and relevance scores for three sellers with different 2D Gaussian distributions relative to a buyer.
Figure 3: Diversity vs. relevance for three sellers. Seller 1 (similar to buyer) has high relevance and low diversity. Seller 3 (different distribution) has high diversity and low relevance. Seller 2 falls in between.

Key properties

  • Privacy-preserving: Only eigenvector directions are shared; raw data and variance values remain private.
  • Robust against manipulation: Sellers cannot inflate their scores without knowing the buyer’s variance values.
  • Task-agnostic: No downstream task or validation set is required.
  • Theoretically grounded: Diversity and relevance satisfy four formally proved monotonicity properties with respect to variance changes.

Key takeaway: Data can be valued without a specific task or validation set by comparing statistical properties (diversity and relevance) between buyer and seller datasets. The protocol preserves privacy and is robust against manipulation.

Citation

@inproceedings{amiri2023fundamentals,
  author = {Amiri, M. M. and Berdoz, F. and Raskar, R.},
  title = {{Fundamentals of Task-Agnostic Data Valuation}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year = {2023}
}