Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, no single estimator dominates the others, because an estimator's accuracy can vary greatly with the characteristics of the OPE task at hand, such as the evaluation policy, the number of actions, and the noise level. Data-driven estimator selection is therefore becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this problem of estimator selection for OPE for the first time. In particular, we enable estimator selection that adapts to a given OPE task by appropriately subsampling the available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves estimator selection compared to a non-adaptive heuristic. The complete version with a technical appendix is available on arXiv: http://arxiv.org/abs/2211.13904.
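To make the idea of subsampling logged data into pseudo policies concrete, the following is a minimal sketch for a context-free bandit, not the procedure proposed in the paper. It assumes a hand-picked subsampling rule `rho` (a hypothetical name: the probability of sending a logged sample with action `a` to the pseudo-evaluation set), whereas the abstract only states that the subsampling is chosen "appropriately". The helper names `ips`, `snips`, and `select_estimator` are likewise illustrative. Samples routed to the pseudo-evaluation set follow a policy proportional to `pi_b[a] * rho[a]`, so their average reward can serve as a stand-in for the unavailable ground truth when candidate estimators are scored on the remaining data.

```python
import random

def ips(rewards, weights):
    """Inverse propensity scoring: mean of importance-weighted rewards."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)

def snips(rewards, weights):
    """Self-normalized IPS: weighted reward sum divided by the weight sum."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

def select_estimator(actions, rewards, pi_b, rho, estimators, n_splits=20, seed=0):
    """Score estimators on random pseudo splits; each rho[a] must lie strictly in (0, 1)."""
    rng = random.Random(seed)
    n_actions = len(pi_b)
    # Pseudo policies induced by the subsampling rule (assumption of this sketch).
    z_e = sum(pi_b[a] * rho[a] for a in range(n_actions))
    pseudo_e = [pi_b[a] * rho[a] / z_e for a in range(n_actions)]
    pseudo_b = [pi_b[a] * (1 - rho[a]) / (1 - z_e) for a in range(n_actions)]
    sq_err = {name: 0.0 for name in estimators}
    used = 0
    for _ in range(n_splits):
        eval_r, beh_a, beh_r = [], [], []
        for a, r in zip(actions, rewards):
            if rng.random() < rho[a]:
                eval_r.append(r)          # pseudo-evaluation sample
            else:
                beh_a.append(a)           # pseudo-behavior sample
                beh_r.append(r)
        if not eval_r or not beh_r:
            continue                      # skip degenerate splits
        # On-policy average reward of the pseudo-evaluation policy.
        target = sum(eval_r) / len(eval_r)
        weights = [pseudo_e[a] / pseudo_b[a] for a in beh_a]
        for name, est in estimators.items():
            sq_err[name] += (est(beh_r, weights) - target) ** 2
        used += 1
    if used == 0:
        raise ValueError("no usable split; check rho and the data size")
    mse = {name: s / used for name, s in sq_err.items()}
    return min(mse, key=mse.get), mse

# Usage on synthetic logged data (all numbers below are made up).
data_rng = random.Random(1)
pi_b = [0.5, 0.3, 0.2]
mean_reward = [0.2, 0.5, 0.8]
actions = data_rng.choices(range(3), weights=pi_b, k=2000)
rewards = [1.0 if data_rng.random() < mean_reward[a] else 0.0 for a in actions]
rho = [0.2, 0.5, 0.8]
best, mse = select_estimator(actions, rewards, pi_b, rho, {"IPS": ips, "SNIPS": snips})
print(best, mse)
```

Because `rho` is fixed by hand here, the pseudo policies need not resemble the real behavior and evaluation policies; making the split reflect the actual OPE task is what makes selection adaptive, and this sketch does not attempt that.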
Off-policy evaluation (OPE), or offline evaluation in general, evaluates the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems. Since many OPE estimators have been proposed and some of them have hyperparameters to be tuned, practitioners face an emerging challenge in selecting and tuning OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult because the current experimental procedure evaluates and compares the estimators’ accuracy on a narrow set of hyperparameters and evaluation policies. As a result, we cannot know which estimator is safe and reliable to use in general practice. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure that evaluates OPE estimators’ sensitivity to the choice of hyperparameters and to possible changes in evaluation policies in an interpretable manner. We also build open-source Python software, pyIEOE, to streamline evaluation with the IEOE protocol. With this software, researchers can use IEOE to compare different OPE estimators in their research, and practitioners can select an appropriate estimator for a given practical situation. Using the IEOE procedure, we then perform an extensive re-evaluation of a wide variety of existing estimators on public datasets. We show that, surprisingly, simple estimators with fewer hyperparameters are more reliable than advanced estimators, because advanced estimators need environment-specific hyperparameter tuning to perform well. Finally, we apply IEOE to real-world e-commerce platform data and demonstrate how to use our protocol in practice.
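As a rough illustration of what evaluating an estimator's sensitivity to hyperparameters and evaluation policies can look like, the sketch below repeatedly draws a random evaluation policy and a random hyperparameter, records the squared error of the estimate, and summarizes the resulting error distribution. It is an assumption-laden toy, not the IEOE protocol or the pyIEOE API: it uses a synthetic context-free bandit so that the true policy value is known, a clipped IPS estimator as the example estimator, and hypothetical helper names (`clipped_ips`, `random_policy`, `sensitivity_profile`, `empirical_cdf`). The choice of summary (an empirical CDF of squared errors) is also an assumption of this sketch.

```python
import random

def clipped_ips(actions, rewards, pi_b, pi_e, clip):
    """IPS with importance weights clipped at `clip` (the hyperparameter)."""
    total = 0.0
    for a, r in zip(actions, rewards):
        total += min(pi_e[a] / pi_b[a], clip) * r
    return total / len(actions)

def random_policy(rng, n_actions):
    """Draw a random probability vector over actions."""
    raw = [rng.random() + 1e-3 for _ in range(n_actions)]
    total = sum(raw)
    return [x / total for x in raw]

def sensitivity_profile(pi_b, mean_reward, clips, n_trials=200, n_logs=500, seed=0):
    """Sorted squared errors over random evaluation policies and hyperparameters."""
    rng = random.Random(seed)
    n_actions = len(pi_b)
    errors = []
    for _ in range(n_trials):
        pi_e = random_policy(rng, n_actions)   # random evaluation policy
        clip = rng.choice(clips)               # random hyperparameter value
        actions = rng.choices(range(n_actions), weights=pi_b, k=n_logs)
        rewards = [1.0 if rng.random() < mean_reward[a] else 0.0 for a in actions]
        # True value is known only because the environment is synthetic.
        truth = sum(p * m for p, m in zip(pi_e, mean_reward))
        estimate = clipped_ips(actions, rewards, pi_b, pi_e, clip)
        errors.append((estimate - truth) ** 2)
    return sorted(errors)

def empirical_cdf(sorted_errors, threshold):
    """Fraction of trials whose squared error is at most `threshold`."""
    return sum(e <= threshold for e in sorted_errors) / len(sorted_errors)

# Usage: profile one estimator over a small hyperparameter set (made-up numbers).
profile = sensitivity_profile(
    pi_b=[0.5, 0.3, 0.2],
    mean_reward=[0.2, 0.5, 0.8],
    clips=[1.0, 5.0, 50.0],
)
print("share of trials with squared error <= 0.01:", empirical_cdf(profile, 0.01))
print("worst squared error:", profile[-1])
```

An estimator whose error distribution stays concentrated near zero across these random draws is less dependent on careful tuning, which is the kind of reliability the abstract refers to.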