The Temporal Fusion Transformer (TFT) is an advanced model for time series forecasting. One of its key components is the Variable Selection Network (VSN), which is specifically designed to automatically identify and focus on the most relevant features within a dataset. It achieves this by assigning learned weights to each input variable, effectively highlighting which features contribute most to the predictive task.
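To make the idea of learned weights more concrete, here is a minimal sketch of the weighting step only. It assumes the weights come from a softmax over per-variable scores. The variable names, scores, and inputs are hypothetical and hard-coded, whereas a real VSN learns the scores with small neural networks and works on embedded inputs.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-variable scores (assumption: in a real VSN these are
# produced by a learned network, not hard-coded).
variable_scores = {"value_a": 2.0, "value_b": 0.5, "value_c": -1.0}
weights = dict(zip(variable_scores, softmax(list(variable_scores.values()))))

# Hypothetical inputs for one time step, one number per variable.
inputs = {"value_a": 0.3, "value_b": -1.2, "value_c": 4.0}

# The selected representation is the weighted sum of the inputs.
combined = sum(weights[name] * inputs[name] for name in inputs)

# Print the variables from most to least important.
for name, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {weight:.3f}")
```

Because the weights sum to 1, they can be read as each variable’s share of the model’s attention, which is what makes them usable as an importance ranking.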
This VSN-based approach will be our second reduction technique. We’ll implement it using PyTorch Forecasting, which allows us to leverage the Variable Selection Network from the TFT model.
We’ll use a basic configuration. Our goal isn’t to create the highest-performing model possible, but rather to identify the most relevant features while using minimal resources.
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.metrics import QuantileLoss
from lightning.pytorch.callbacks import EarlyStopping
import lightning.pytorch as pl
import torch

pl.seed_everything(42)
max_encoder_length = 32
max_prediction_length = 1
VAL_SIZE = .2
VARIABLES_IMPORTANCE = .8
model_data_feature_sel = initial_model_train.join(stationary_df_train)
model_data_feature_sel = model_data_feature_sel.join(pca_df_train)
model_data_feature_sel['price'] = model_data_feature_sel['price'].astype(float)
model_data_feature_sel['y'] = model_data_feature_sel['price'].pct_change()
model_data_feature_sel = model_data_feature_sel.iloc[1:].reset_index(drop=True)
model_data_feature_sel['group'] = 'spy'
model_data_feature_sel['time_idx'] = range(len(model_data_feature_sel))
train_size_vsn = int((1-VAL_SIZE)*len(model_data_feature_sel))
train_data_feature = model_data_feature_sel[:train_size_vsn]
val_data_feature = model_data_feature_sel[train_size_vsn:]
unknown_reals_origin = [col for col in model_data_feature_sel.columns if col.startswith('value_')] + ['y']
timeseries_config = {
    "time_idx": "time_idx",
    "target": "y",
    "group_ids": ["group"],
    "max_encoder_length": max_encoder_length,
    "max_prediction_length": max_prediction_length,
    "time_varying_unknown_reals": unknown_reals_origin,
    "add_relative_time_idx": True,
    "add_target_scales": True,
    "add_encoder_length": True
}
training_ts = TimeSeriesDataSet(
    train_data_feature,
    **timeseries_config
)
The VARIABLES_IMPORTANCE threshold is set to 0.8, which means we’ll keep the highest-ranked features that together account for 80% of the total importance assigned by the Variable Selection Network (VSN). For more information about the Temporal Fusion Transformer (TFT) and its parameters, please refer to the documentation.
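As a rough illustration of how a cumulative importance threshold works, the sketch below ranks features by weight and keeps them until their running share exceeds the threshold. The function name and the example weights are hypothetical, and plain Python is used instead of tensors to keep it self-contained.

```python
def select_by_cumulative_importance(importances, threshold=0.8):
    """Return the top-ranked names whose cumulative share first exceeds threshold."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, weight in ranked:
        selected.append(name)
        cumulative += weight / total  # normalize so the shares sum to 1
        if cumulative > threshold:
            break
    return selected


# Made-up importance weights, for illustration only.
example = {"y": 0.55, "value_a": 0.20, "value_b": 0.15, "value_c": 0.10}
print(select_by_cumulative_importance(example, threshold=0.8))
# → ['y', 'value_a', 'value_b']
```

Note that when a single variable carries more than 80% of the weight on its own, this rule keeps only that one variable.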
Next, we’ll train the TFT model.
if torch.cuda.is_available():
    accelerator = 'gpu'
    num_workers = 2
else:
    accelerator = 'auto'
    num_workers = 0

# Build the validation set with the same encoders and scalers as the training set
validation = TimeSeriesDataSet.from_dataset(training_ts, val_data_feature, predict=True, stop_randomization=True)
train_dataloader = training_ts.to_dataloader(train=True, batch_size=64, num_workers=num_workers)
val_dataloader = validation.to_dataloader(train=False, batch_size=64*5, num_workers=num_workers)

tft = TemporalFusionTransformer.from_dataset(
    training_ts,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=2,
    dropout=0.1,
    loss=QuantileLoss()
)
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-5, patience=5, verbose=False, mode="min")
trainer = pl.Trainer(max_epochs=20, accelerator=accelerator, gradient_clip_val=.5, callbacks=[early_stop_callback])
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader
)
We deliberately set max_epochs=20 so the model doesn’t train for too long. Additionally, we implemented an early_stop_callback that halts training if the model shows no improvement for 5 consecutive epochs (patience=5).
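The stopping rule described above can be sketched in a few lines. This is a simplified illustration of patience-based early stopping, not the library’s actual implementation. The function name and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience=5, min_delta=1e-5, max_epochs=20):
    """Return the number of epochs that run before training halts."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best - min_delta:
            # Counts as an improvement: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Made-up validation losses: the best value is reached at epoch 3.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.75, 0.76]
print(epochs_until_stop(losses))
# → 8
```

In this example the loss stops improving after epoch 3, so training halts five epochs later, at epoch 8, well before the 20-epoch cap.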
Finally, using the best model obtained, we select the most important features as determined by the VSN, keeping those that together exceed the 80% importance threshold.
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

raw_predictions = best_tft.predict(val_dataloader, mode="raw", return_x=True)

def get_top_encoder_variables(best_tft, interpretation):
    encoder_importances = interpretation["encoder_variables"]
    # Normalize so the weights sum to 1 before applying the cumulative threshold
    # (this changes nothing when they already sum to 1)
    encoder_importances = encoder_importances / encoder_importances.sum()
    sorted_importances, indices = torch.sort(encoder_importances, descending=True)
    cumulative_importances = torch.cumsum(sorted_importances, dim=0)
    # Index of the first variable at which the cumulative importance exceeds the threshold
    threshold_index = torch.where(cumulative_importances > VARIABLES_IMPORTANCE)[0][0]
    top_variables = [best_tft.encoder_variables[i] for i in indices[:threshold_index+1]]
    # relative_time_idx is added by the dataset, so it is not one of our features
    if 'relative_time_idx' in top_variables:
        top_variables.remove('relative_time_idx')
    return top_variables

interpretation = best_tft.interpret_output(raw_predictions.output, reduction="sum")
top_encoder_vars = get_top_encoder_variables(best_tft, interpretation)
print(f"\nOriginal number of features: {stationary_df_train.shape[1]}")
print(f"Number of features after Variable Selection Network (VSN): {len(top_encoder_vars)}\n")
The original dataset contained 438 features, which were reduced to just 1 feature after applying the VSN method! This drastic reduction suggests several possibilities:
- Many of the original features may have been redundant.
- The feature selection process may have oversimplified the data.
- Using only the target variable’s historical values (an autoregressive approach) might perform as well as, or potentially better than, models incorporating exogenous variables.
In this final section, we compare our reduction techniques applied to our model. Each method is tested while maintaining identical model configurations, varying only the features subjected to reduction.
We’ll use TiDE, a compact state-of-the-art MLP-based encoder-decoder model. We’ll use the implementation provided by NeuralForecast. Any NeuralForecast model would work as long as it supports historical exogenous variables.
We’ll train and test three models using daily SPY (S&P 500 ETF) data. All three models will share the same:
- Train-test split ratio
- Hyperparameters
- Single time series (SPY)
- Forecasting horizon of 1 step ahead
The only difference between the models will be the feature reduction technique. That’s it!
- First model: Original features (no feature reduction)
- Second model: Feature reduction using PCA
- Third model: Feature reduction using VSN
This setup allows us to isolate the impact of each feature reduction technique on model performance.
First, we train the three models with the same configuration except for the features.
from neuralforecast.models import TiDE
from neuralforecast import NeuralForecast

train_data = initial_model_train.join(stationary_df_train)
train_data = train_data.join(pca_df_train)
test_data = initial_model_test.join(stationary_df_test)
test_data = test_data.join(pca_df_test)
hist_exog_list_origin = [col for col in train_data.columns if col.startswith('value_')] + ['y']
hist_exog_list_pca = [col for col in train_data.columns if col.startswith('PC')] + ['y']
hist_exog_list_vsn = top_encoder_vars
tide_params = {
    "h": 1,
    "input_size": 32,
    "scaler_type": "robust",
    "max_steps": 500,
    "val_check_steps": 20,
    "early_stop_patience_steps": 5
}
model_original = TiDE(
    **tide_params,
    hist_exog_list=hist_exog_list_origin,
)
model_pca = TiDE(
    **tide_params,
    hist_exog_list=hist_exog_list_pca,
)
model_vsn = TiDE(
    **tide_params,
    hist_exog_list=hist_exog_list_vsn,
)
nf = NeuralForecast(
    models=[model_original, model_pca, model_vsn],
    freq='D'
)
# train_size is assumed to have been defined earlier in the article
val_size = int(train_size*VAL_SIZE)
nf.fit(df=train_data, val_size=val_size, use_init_models=True)
Then, we make the predictions.
from tabulate import tabulate
import numpy as np
import pandas as pd

y_hat_test_ret = pd.DataFrame()
current_train_data = train_data.copy()

# Forecast the first test step from the training data alone
y_hat_ret = nf.predict(current_train_data)
y_hat_test_ret = pd.concat([y_hat_test_ret, y_hat_ret.iloc[[-1]]])

# Walk forward: append one observed test row at a time, then forecast the next step
for i in range(len(test_data) - 1):
    combined_data = pd.concat([current_train_data, test_data.iloc[[i]]])
    y_hat_ret = nf.predict(combined_data)
    y_hat_test_ret = pd.concat([y_hat_test_ret, y_hat_ret.iloc[[-1]]])
    current_train_data = combined_data

# NeuralForecast names the three TiDE models 'TiDE', 'TiDE1', and 'TiDE2'
predicted_returns_original = y_hat_test_ret['TiDE'].values
predicted_returns_pca = y_hat_test_ret['TiDE1'].values
predicted_returns_vsn = y_hat_test_ret['TiDE2'].values

predicted_prices_original = []
predicted_prices_pca = []
predicted_prices_vsn = []

# Convert each predicted return into a price using the last observed price
for i in range(len(predicted_returns_pca)):
    if i == 0:
        last_true_price = train_data['price'].iloc[-1]
    else:
        last_true_price = test_data['price'].iloc[i-1]
    predicted_prices_original.append(last_true_price * (1 + predicted_returns_original[i]))
    predicted_prices_pca.append(last_true_price * (1 + predicted_returns_pca[i]))
    predicted_prices_vsn.append(last_true_price * (1 + predicted_returns_vsn[i]))

true_values = test_data['price']
methods = ['Original', 'PCA', 'VSN']
predicted_prices = [predicted_prices_original, predicted_prices_pca, predicted_prices_vsn]

results = []
for method, prices in zip(methods, predicted_prices):
    mse = np.mean((np.array(prices) - true_values)**2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(np.array(prices) - true_values))
    results.append([method, mse, rmse, mae])

headers = ["Method", "MSE", "RMSE", "MAE"]
table = tabulate(results, headers=headers, floatfmt=".4f", tablefmt="grid")
print("\nPrediction Errors Comparison:")
print(table)
with open("prediction_errors_comparison.txt", "w") as f:
    f.write("Prediction Errors Comparison:\n")
    f.write(table)
We forecast the daily returns using the model, then convert these back to prices. This approach allows us to calculate prediction errors using prices and compare the actual prices to the forecasted prices in a plot.
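The conversion from returns back to prices can be illustrated on its own with a small, dependency-free sketch. The helper names and all numbers below are made up for illustration and are not taken from the SPY data.

```python
import math


def returns_to_prices(last_train_price, true_prices, predicted_returns):
    """One-step-ahead price forecasts built from predicted returns.

    Each forecast starts from the last observed price, so errors do not
    compound across the test period.
    """
    previous_prices = [last_train_price] + list(true_prices[:-1])
    return [p * (1 + r) for p, r in zip(previous_prices, predicted_returns)]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up numbers purely for illustration.
last_train_price = 100.0
true_prices = [101.0, 102.0, 100.0]
predicted_returns = [0.02, 0.0, -0.01]

predicted_prices = returns_to_prices(last_train_price, true_prices, predicted_returns)
print([round(p, 2) for p in predicted_prices])
print(f"RMSE: {rmse(predicted_prices, true_prices):.4f}")
print(f"MAE: {mae(predicted_prices, true_prices):.4f}")
```

Anchoring every forecast to the previous true price is a design choice worth keeping in mind when reading the error table: a model that always predicts a return close to zero will still produce prices that track the true series closely.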
The similar performance of the TiDE model across both original and reduced feature sets reveals a crucial insight: feature reduction did not lead to improved predictions as one might expect. This suggests several potential issues:
- Information loss: despite aiming to preserve essential data, the dimensionality reduction techniques may have discarded information relevant to the prediction task, which would explain the lack of improvement with fewer features.
- Generalization struggles: consistent performance across feature sets indicates the model’s difficulty in capturing underlying patterns, regardless of feature count.
- Complexity overkill: similar results with fewer features suggest TiDE’s sophisticated architecture may be unnecessarily complex. A simpler model, like ARIMA, could potentially perform just as well.
Next, let’s examine the charts to see if we can observe any significant differences among the three forecasting methods and the actual prices.
import matplotlib.pyplot as plt

# One chart per feature set: (predicted prices, chart title, output file)
charts = [
    (predicted_prices_original, 'SPY Price Forecast Using All Original Features', 'spy_forecast_chart_original.png'),
    (predicted_prices_pca, 'SPY Price Forecast Using PCA Dimensionality Reduction', 'spy_forecast_chart_pca.png'),
    (predicted_prices_vsn, 'SPY Price Forecast Using VSN', 'spy_forecast_chart_vsn.png'),
]
for prices, title, filename in charts:
    plt.figure(figsize=(12, 6))
    plt.plot(train_data['ds'], train_data['price'], label='Training Data', color='blue')
    plt.plot(test_data['ds'], true_values, label='True Prices', color='green')
    plt.plot(test_data['ds'], prices, label='Predicted Prices', color='red')
    plt.legend()
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('SPY Price')
    plt.savefig(filename, dpi=300, bbox_inches='tight')
    plt.close()
The difference between true and predicted prices appears consistent across all three models, with no noticeable variation in performance between them.
We did it! We explored the importance of feature reduction in time series analysis and provided a practical implementation guide:
- Feature reduction aims to simplify models while maintaining predictive power. Benefits include reduced complexity, improved generalization, easier interpretation, and computational efficiency.
- We demonstrated two reduction techniques using FRED data:
  - Principal Component Analysis (PCA), a linear dimensionality reduction method, reduced features from 438 to 76 while retaining 90% of explained variance.
  - Variable Selection Network (VSN) from the Temporal Fusion Transformer, a non-linear approach, drastically reduced features to just 1 using an 80% cumulative importance threshold.
- Evaluation using TiDE models showed similar performance across original and reduced feature sets, suggesting feature reduction may not always improve forecasting performance. This could be due to information loss during reduction, the model’s difficulty in capturing underlying patterns, or the possibility that a simpler model might be equally effective for this particular forecasting task.
On a final note, we didn’t explore all feature reduction techniques, such as SHAP (SHapley Additive exPlanations), which provides a unified measure of feature importance across various model types. Even if we didn’t improve our model, it’s still better to perform feature curation and compare performance across different reduction methods. This approach helps ensure you’re not discarding valuable information while optimizing your model’s efficiency and interpretability.
In future articles, we’ll apply these feature reduction techniques to more complex models, comparing their impact on performance and interpretability. Stay tuned!
Ready to put these concepts into action? You can find the complete code implementation here.