Author(s): Hector Gonzalez Lopez; Majid Niazkar; Jaroslav Mysiak; Carlos Dionisio Perez Blanco
Keywords: Hydroclimatic variables; Precipitation; Temperature; Machine learning; Missing data imputation
Abstract: Ground-based observations are essential for hydrological modelling, yet station records often contain missing values. Reanalysis products provide more continuous series, although they may be biased. This study evaluates eight machine learning (ML) models for imputing missing temperature and precipitation records in the Tormes catchment (Spain), using CFSR and ERA5 as predictors. The tested models are Multiple Linear Regression (MLR), Decision Tree Regression, Random Forest Regression, Support Vector Regression, K-Nearest Neighbors, AdaBoost, Gradient Boosting Regressor, and XGBoost. Results show that both MLR and XGBoost impute temperature accurately, whereas precipitation remains more difficult because of its nonlinearity and the high frequency of zero values. Additional strategies, including time lags, seasonal information, hybrid SARIMA-ML, logarithmic transformation, and SHAP-based feature selection, did not improve results over the direct application of XGBoost. Overall, XGBoost provided the best performance for precipitation imputation, while simpler models were also effective for temperature.
Year: 2026
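
Illustrative sketch (not the authors' code): the imputation setup summarized in the abstract can be outlined as training a regressor on days where the station value is observed, with reanalysis series as predictors, and then predicting the missing days. Everything below is an assumption made for illustration: the data are synthetic, the names `impute_with_reanalysis`, `cfsr_like`, and `era5_like` are hypothetical, and scikit-learn's `LinearRegression` and `GradientBoostingRegressor` stand in for the MLR and boosting models named in the abstract. XGBoost, the study's preprocessing, and its validation scheme are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression


def impute_with_reanalysis(station, predictors, model):
    """Fill NaNs in `station` with predictions from `model`.

    The model is fitted only on rows where the station value is observed.
    (Hypothetical helper, not taken from the paper.)
    """
    station = np.asarray(station, dtype=float)
    predictors = np.asarray(predictors, dtype=float)
    missing = np.isnan(station)
    model.fit(predictors[~missing], station[~missing])
    filled = station.copy()
    if missing.any():
        filled[missing] = model.predict(predictors[missing])
    return filled


# Synthetic daily temperature (assumption: a seasonal cycle plus noise).
rng = np.random.default_rng(0)
n_days = 730
day = np.arange(n_days)
truth = 12.0 + 10.0 * np.sin(2.0 * np.pi * day / 365.25) + rng.normal(0.0, 1.5, n_days)

# Two biased, noisy "reanalysis-like" predictors (synthetic stand-ins).
cfsr_like = truth + 1.5 + rng.normal(0.0, 1.0, n_days)
era5_like = 0.9 * truth - 0.5 + rng.normal(0.0, 0.8, n_days)
X = np.column_stack([cfsr_like, era5_like])

# Remove roughly 20% of the station record at random to mimic gaps.
gaps = rng.random(n_days) < 0.2
station = truth.copy()
station[gaps] = np.nan

models = {
    "MLR": LinearRegression(),
    "GBR": GradientBoostingRegressor(random_state=0),
}
results = {}
filled_series = {}
for name, model in models.items():
    filled = impute_with_reanalysis(station, X, model)
    filled_series[name] = filled
    # Score only the artificially removed days, where the truth is known.
    rmse = float(np.sqrt(np.mean((filled[gaps] - truth[gaps]) ** 2)))
    results[name] = rmse
    print(f"{name}: RMSE on withheld days = {rmse:.2f}")
```

Scoring on artificially withheld days is one simple way to compare models when the true values of real gaps are unknown. For precipitation, the many zero values mentioned in the abstract would likely call for a different treatment than this temperature-style example.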