Fundamentals of Task-Agnostic Data Valuation
AAAI 2023
ETH Zurich, Switzerland; MIT, USA
Abstract
We study valuing the data of a data owner/seller for a data seeker/buyer. Data valuation is often carried out for a specific task assuming a particular utility metric, such as test accuracy on a validation set, that may not exist in practice. In this work, we focus on task-agnostic data valuation without any validation requirements. The data buyer has access to a limited amount of data and seeks more data samples from a data seller. We formulate the problem as estimating the differences in the statistical properties of the data at the seller with respect to the baseline data available at the buyer. We capture these statistical differences through second-moment statistics by measuring the diversity and relevance of the seller’s data for the buyer.
Overview

Data valuation in machine learning typically requires a specific downstream task and a validation set, neither of which may be available in practice. This paper proposes a task-agnostic approach to data valuation that estimates the value of a seller’s data for a buyer based on statistical properties alone.
How it works
The method captures differences in data distributions through second-moment statistics (covariance matrices):
- The buyer computes PCA on their dataset and shares the principal component directions (eigenvectors) with sellers through a broker.
- Each seller projects their covariance matrix onto these directions to compute variance estimates along each component.
- The broker computes two complementary metrics:
  - Diversity: Measures how much the seller’s data covers directions underrepresented in the buyer’s data (geometric mean of normalized absolute variance differences).
  - Relevance: Measures how much the seller’s data overlaps with the buyer’s data distribution (geometric mean of normalized minimum variances).
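The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the data are mean-centered before computing covariances, and each per-component term is normalized by the larger of the two variances. That normalization is an assumption, chosen because it keeps every term in [0, 1].

```python
import numpy as np


def buyer_pca(X_buyer, k):
    """Buyer side: top-k principal directions and the variances along them."""
    Xc = X_buyer - X_buyer.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    return eigvals[order], eigvecs[:, order]


def seller_variances(X_seller, directions):
    """Seller side: variance of the seller's data along each shared direction."""
    Xc = X_seller - X_seller.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    # u_i^T C u_i for every column u_i of `directions`
    return np.einsum("ji,jk,ki->i", directions, cov, directions)


def geometric_mean(terms):
    """Geometric mean of non-negative terms."""
    return float(np.prod(terms) ** (1.0 / len(terms)))


def diversity_relevance(buyer_var, seller_var, eps=1e-12):
    """Broker side: combine the two variance vectors into the two scores.

    Assumption: each term is normalized by max(buyer_var, seller_var).
    """
    larger = np.maximum(np.maximum(buyer_var, seller_var), eps)
    diversity_terms = np.abs(buyer_var - seller_var) / larger
    relevance_terms = np.minimum(buyer_var, seller_var) / larger
    return geometric_mean(diversity_terms), geometric_mean(relevance_terms)
```

For example, `diversity_relevance(lam, seller_variances(X_seller, U))` with `lam, U = buyer_pca(X_buyer, k)` returns the pair of scores. Passing the buyer's own dataset as `X_seller` gives a diversity close to 0 and a relevance close to 1.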
These metrics satisfy the constraint Diversity + Relevance ≤ 1, are bounded in [0, 1], and align with intuitive expectations: identical data yields Diversity = 0 and Relevance = 1, while completely different data yields Diversity = 1 and Relevance = 0.
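The bound Diversity + Relevance ≤ 1 can be checked with a short derivation. The notation here is assumed for illustration: $\lambda_i$ is the buyer's variance along component $i$, $\hat{\lambda}_i$ is the seller's variance along the same direction, $k$ is the number of components, and each term is normalized by $\max(\lambda_i, \hat{\lambda}_i)$.

```latex
\begin{align*}
d_i &= \frac{|\lambda_i - \hat{\lambda}_i|}{\max(\lambda_i, \hat{\lambda}_i)},
\qquad
r_i = \frac{\min(\lambda_i, \hat{\lambda}_i)}{\max(\lambda_i, \hat{\lambda}_i)},
\qquad
d_i + r_i = 1, \\
\text{Diversity} &= \Big(\prod_{i=1}^{k} d_i\Big)^{1/k},
\qquad
\text{Relevance} = \Big(\prod_{i=1}^{k} r_i\Big)^{1/k}, \\
\text{Diversity} + \text{Relevance}
&\le \Big(\prod_{i=1}^{k} (d_i + r_i)\Big)^{1/k} = 1 .
\end{align*}
```

The last line uses the superadditivity of the geometric mean for non-negative terms. The two extreme cases follow directly: if $\hat{\lambda}_i = \lambda_i$ for all $i$, every $d_i = 0$ and every $r_i = 1$; if $\hat{\lambda}_i = 0$ for all $i$ while $\lambda_i > 0$, every $d_i = 1$ and every $r_i = 0$.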

Results

Key properties
- Privacy-preserving: Only eigenvector directions are shared; raw data and variance values remain private.
- Robust against manipulation: Sellers cannot inflate their scores without knowing the buyer’s variance values.
- Task-agnostic: No downstream task or validation set is required.
- Theoretically grounded: Diversity and relevance satisfy four formally proved monotonicity properties with respect to variance changes.
Key takeaway: Data can be valued without a specific task or validation set by comparing statistical properties (diversity and relevance) between buyer and seller datasets. The protocol preserves privacy and is robust against manipulation.
Citation
@inproceedings{amiri2023fundamentals,
author = {Amiri, M. M. and Berdoz, F. and Raskar, R.},
title = {{Fundamentals of Task-Agnostic Data Valuation}},
booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
year = {2023}
}