Machine Learning on Posit Open Source

torch Ecosystem Updates

Daniel Falbel — Thu, 30 Apr 2026 00:00:00 +0000

We’ve just published a new round of CRAN releases across the torch ecosystem. Here’s a tour of what’s new in each package.

torch v0.17.0

The most exciting new feature, still experimental, is support for the cudatoolkit packages. With it, you no longer need a global CUDA toolkit installation to use torch on the GPU.

You can now do:

install.packages(
  "cuda12.8", 
  repos = c("https://mlverse.r-universe.dev", "https://cloud.r-project.org")
)
install.packages("torch")

The {cuda12.8} package bundles all the CUDA runtime libraries, and torch finds and uses it by default. See more details in the installation docs.
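
As a quick check after installation, the sketch below asks torch whether a CUDA device is visible and falls back to the CPU otherwise. It assumes torch and its LibTorch binaries are already installed; the variable names and the tensor shape are arbitrary choices for illustration.

```r
library(torch)

# Use the GPU when torch can see one, otherwise fall back to the CPU.
device_name <- if (cuda_is_available()) "cuda" else "cpu"

# Create a small tensor on the chosen device and run one operation on it.
x <- torch_randn(2, 3, device = torch_device(device_name))
total <- x$sum()

print(device_name)
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installation docs are the place to look for troubleshooting steps.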

We also highlight the update to LibTorch v2.8.0 led by Troy Hernandez (#1419).

Additionally, this release includes many small bug fixes and small additions to the API. See the full release notes in the changelog.

torchvision v0.9.0

torchvision provides datasets, model architectures, and image transformations for computer vision. This is a big release with new models, datasets, and many improvements — largely driven by community contributors.

New models:

model_maskrcnn_resnet50_fpn() and model_maskrcnn_resnet50_fpn_v2() for instance segmentation.
model_convnext_*_detection() for object detection (tiny/small/base).
model_convnext_*_fcn() and model_convnext_*_upernet() for semantic segmentation (tiny/small/base).

New datasets and features:

vggface2_dataset() for loading the VGGFace2 dataset.
New coco_segmentation_dataset(), split from coco_detection_dataset(), reducing memory usage by ~50%.
Collection dataset catalog with search_collection(), get_collection_catalog(), and list_collection_datasets() for discovering and exploring datasets.
New visualization utilities draw_segmentation_masks() and vision_make_grid().

See the full release notes in the changelog.

A huge thank you to the community contributors who made this release possible: @cregouby, @ANAMASGARD, @Chandraveersingh1717, @DerrickUnleashed, and @srishtiii28.

Other releases

Most of the other packages don’t have significant changes; their releases add minor improvements to docs and CI infrastructure, plus CRAN-related updates.

luz v0.5.2 — Higher-level API for torch with a Keras-like interface for training neural networks.
hfhub v0.1.2 — Download and cache files from Hugging Face Hub repositories, making it easy to use pretrained models and datasets from R.
tok v0.2.2 — Fast tokenizers for R, powered by the Hugging Face Tokenizers library written in Rust. Supports BPE, WordPiece, and other tokenization algorithms.
torchdatasets v0.3.2 — Extra ready-to-use datasets for torch, complementing the built-in datasets in torchvision.
safetensors v0.2.1 — Read and write the Safetensors file format, a safe and fast format for storing and loading tensors.
tfevents v0.0.5 — Write event files compatible with TensorBoard from R for experiment tracking and visualization.
wav v0.2.0 — Read and write WAV files in R.

New maintainer

We’re excited to welcome Tomasz Kalinowski as the new maintainer of torch and the broader mlverse ecosystem.

tidymodels Cheatsheets

Edgar Ruiz — Wed, 29 Apr 2026 00:00:00 +0000

After almost 8 years, tidymodels finally has its first cheatsheets, and not just one, but two! The first one, covering data preprocessing with recipes, was released a couple of months ago. Today, we are delighted to announce a second cheatsheet, this time focusing on modeling with parsnip.

Both cheatsheets have a dedicated HTML version on the Posit Open Source site, so you can browse and search them without opening a PDF. In this post we’ll walk through what each one covers, starting with the newest.

Create Models with parsnip

The "Create Models with parsnip" cheatsheet

The cheatsheet is organized into three main parts: an introduction to parsnip’s basics, a catalog of all models available through the package, and a hands-on operations reference for fitting and inspecting models. The basics section introduces how parsnip provides a single, unified interface for defining and fitting models, regardless of the underlying package powering them.
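
To make the "single, unified interface" idea concrete, here is a minimal sketch that fits the same parsnip specification with two different engines. It assumes the parsnip package is installed; the formula and the mtcars data are arbitrary choices for illustration and do not come from the cheatsheet.

```r
library(parsnip)

# One model specification, two interchangeable engines.
spec <- linear_reg()

fit_lm  <- spec |> set_engine("lm")  |> fit(mpg ~ wt + hp, data = mtcars)
fit_glm <- spec |> set_engine("glm") |> fit(mpg ~ wt + hp, data = mtcars)

# predict() returns a tibble with a .pred column for either engine.
preds_lm  <- predict(fit_lm,  new_data = head(mtcars))
preds_glm <- predict(fit_glm, new_data = head(mtcars))
```

Only the set_engine() call changes between the two fits; the fitting and prediction code stays the same.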

Model catalog

The largest section of the cheatsheet catalogs all models available through parsnip, grouped by use case:

Classification only: models for binary and multiclass prediction. It also includes probability-based classification using Bayes’ theorem and models for ordinal responses.
Regression only: models for predicting continuous numeric outcomes, from standard linear regression to generalized linear models for count data.
General use: a versatile mix of model types that work for both classification and regression, including decision trees, nearest neighbors, neural networks, and spline-based approaches.
Discriminant analysis: models that estimate the distribution of predictors separately for each class and use Bayes’ theorem to assign probabilities, available in linear, quadratic, flexible, and regularized variants.
Ensemble methods: models that combine many individual learners into a stronger prediction, including random forests, gradient boosting, bagged trees, and Bayesian additive regression trees.
Support Vector Machines: models that find an optimal boundary between classes, or fit a robust regression, using linear, polynomial, or radial kernel functions.
Feature rules: models that extract simple, human-readable rules from tree ensembles and use them as the basis for prediction.
Survival models: models for time-to-event data, covering both proportional hazards and fully parametric approaches.

One design choice in particular makes this section much easier to navigate: pills. Each model’s compatible engines and supported modes are shown as small, visually distinct tags, so you can see at a glance which mode a given engine supports, without having to read through the description text. Each mode is encoded in the pill with a number: Classification (1), Regression (2), Censored Regression (3), and Quantile Regression (4). A legend mapping each number to its mode is available at the top of page one.

Engine pills show the name and supported modes of each engine at a glance

And true to the R cheatsheet tradition, individual models or groups of related models are paired with small illustrations, thoughtfully designed for visual impact to aid recall. Each one attempts to accurately represent the function or functions it accompanies, making them a genuine navigation aid rather than decoration, especially when you have a vague memory of “that tree-based ensemble that used Bayesian analysis” and need to scan quickly.

Operations

The last section covers the practical workflow of fitting and using a model. Each function is paired with a quick runnable example, and the examples build on each other starting from the two lines of code right below the section title, making it easy to follow the full workflow from model specification to results.

Explore the parsnip cheatsheet

Preprocessing Data with recipes

The "Preprocessing Data with recipes" cheatsheet

After a quick Basics section covering the core workflow, the vast majority of the cheatsheet is dedicated to step_*() functions, the building blocks of any recipe, before finishing with role and type management.

Step catalog

The steps are organized into groups based on what they do, each listed with its arguments and a short description:

Filters: steps for removing variables that are sparse, zero-variance, linearly dependent, highly correlated, or missing too many values
In-place Transformations: basis functions (splines, polynomials), discretization, and normalization steps
Imputation: steps for filling in missing values, ranging from simple statistical substitution to model-based approaches
Encodings: type converters (e.g. factor to string, numeric to factor), value converters, and other factor-handling steps
Dummy Variables: one-hot and binary encoding, text pattern matching, and conversion helpers
Multivariate Transformations: signal extraction (PCA, ICA, PLS, and friends) and centroid-based distance measures
Date & Time: steps for converting date and datetime columns into usable numeric or factor features
Row operations: sampling, shuffling, slicing, and removing rows with missing values
Other: interaction terms, renaming, rolling window statistics, geographic distances, and ratios
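
As a small illustration of how steps from these groups chain together, the sketch below combines one filter and one normalization step. It assumes the recipes package is installed; the data set and the choice of steps are ours, not the cheatsheet's.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  # Filter: drop any zero-variance predictors.
  step_zv(all_predictors()) |>
  # In-place transformation: center and scale the numeric predictors.
  step_normalize(all_numeric_predictors())

# prep() estimates the required statistics; bake() applies them.
baked <- rec |> prep() |> bake(new_data = NULL)
```

After baking, each numeric predictor has mean 0 and standard deviation 1 on the training data.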

As with the parsnip cheatsheet, each group of steps is paired with small, thoughtfully designed illustrations to help you visually locate a step family when scanning.

Role & type

The last section focuses on the selection and management of variable roles and types within the recipe. The selection side covers ways to target variables by their role (outcome, predictor, or any custom role) as well as by their type (numeric, factor, logical, and so on), including a handy set of convenience selectors for the most common combinations. The management side shows how to add, update, and remove roles, giving you fine-grained control over how each variable participates in the recipe.
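
A minimal sketch of role management and role-aware selection, assuming the recipes package is installed. The "id" role name and the use of mtcars are illustrative choices.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  # cyl stays in the data but is no longer treated as a predictor.
  update_role(cyl, new_role = "id") |>
  # Selects by role and type: only the remaining numeric predictors.
  step_normalize(all_numeric_predictors())

# summary() lists each variable with its current role.
roles <- summary(rec)
roles[roles$variable %in% c("mpg", "cyl", "wt"), c("variable", "role")]
```

Because cyl now carries a custom role, all_numeric_predictors() skips it and it passes through the recipe unchanged.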

Easily find the right selector function with this at-a-glance guide

Explore the recipes cheatsheet

Need them on the go? Print them!

A lot of care went into ensuring both cheatsheets hold up when printed, particularly in black and white. We know that many folks print cheatsheets to keep at their desk for quick reference, and we wanted to make sure they remain fully usable in that medium. That meant making sure font sizes and weights stay legible on paper, that the illustrations remain perceptible without color, and that contrast levels are strong enough that no text ends up too pale to read or too heavy to parse. Accessibility in print mattered to us just as much as clarity on screen.

New tidymodels cheatsheets are fully readable when printed

New tidymodels Releases for April 2026

Max Kuhn — Mon, 27 Apr 2026 00:00:00 +0000

We’ve released a sequence of tidymodels packages over the last few weeks: dials (1.4.3), parsnip (1.5.0), tune (2.1.0), yardstick (1.4.0), and tidymodels (1.5.0). You can install them via:

# tidymodels installs all of the new versions
# install.packages("pak")  # if pak is not installed yet
pak::pak("tidymodels")

Here are links to the NEWS files for each package:

Let’s first talk about the two biggest updates enabled by this group of releases, then we’ll cover some of the other changes for each package.

Ordered Outcomes

parsnip has a new model type, ordinal_reg(), analogous to multinom_reg(), for fitting various generalized linear models with ordered class levels.

The ordered package by Cory Brunson is now on CRAN. This contains the specific engine code for these models, including:

ordinal_reg(): three engines: "polr", "ordinalNet", and "vglm".
gen_additive_mod(): "vgam"
decision_tree(): "rpartScore"
rand_forest(): "ordinalForest"

These models can be fitted, tuned, and evaluated with tidymodels. For the evaluation, we’ve added a specific performance metric for ordered categories: the ranked probability score (RPS). The function ranked_prob_score() is in the new yardstick release and requires an ordered factor for the outcome.
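
To give a feel for what the ranked probability score rewards, here is a base R sketch for a single observation. It compares cumulative predicted probabilities with the cumulative indicator of the observed class. Dividing by K - 1 is an assumed normalization and may not match what ranked_prob_score() does internally; the function name ranked_prob() is hypothetical.

```r
# Ranked probability score for one observation with K ordered classes.
# probs: predicted class probabilities, in class order.
# truth: position of the observed class (1 = lowest level).
# Dividing by K - 1 is an assumption made for this sketch.
ranked_prob <- function(probs, truth) {
  k <- length(probs)
  observed <- as.numeric(seq_len(k) >= truth)  # cumulative indicator
  sum((cumsum(probs) - observed)^2) / (k - 1)
}

# Three ordered classes (low < medium < high); the observed class is "high".
near_miss <- ranked_prob(c(0.1, 0.8, 0.1), truth = 3)  # mass next to "high"
far_miss  <- ranked_prob(c(0.8, 0.1, 0.1), truth = 3)  # mass far from "high"
perfect   <- ranked_prob(c(0.0, 0.0, 1.0), truth = 3)
```

Both wrong predictions put the same probability on the true class, but the one whose mass sits closer to it in the ordering gets the smaller (better) score, which is the property that makes the metric suitable for ordered outcomes.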

Quantile Regression

We previously reported that parsnip supports quantile regression models. With the latest set of releases, new boosting and neural network engines are available, and these models can now be tuned and evaluated using a relevant metric. yardstick now includes the weighted interval score (Bracher et al., 2021) to evaluate the quality of the quantile predictions.
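
For intuition, the weighted interval score of a single observation can be worked out by hand. The base R sketch below follows the definition in Bracher et al. for a median plus two central intervals, and checks it against the equivalent form based on the pinball (quantile) loss. The numbers are made up, the helper names are hypothetical, and yardstick's implementation may differ in details such as weighting or missing-value handling.

```r
# Pinball (quantile) loss of predicting q at quantile level tau.
pinball <- function(y, q, tau) (y - q) * (tau - (y < q))

# Interval score of a central (1 - alpha) interval [lower, upper].
interval_score <- function(y, lower, upper, alpha) {
  (upper - lower) +
    (2 / alpha) * pmax(lower - y, 0) +
    (2 / alpha) * pmax(y - upper, 0)
}

y <- 10                                        # observed value
quantile_levels <- c(0.05, 0.25, 0.5, 0.75, 0.95)
q <- c(4, 8, 9, 11, 15)                        # predicted quantiles

# Form 1: weighted absolute error of the median plus weighted interval scores.
alphas <- c(0.5, 0.1)                          # 50% and 90% central intervals
wis_interval <- (
  0.5 * abs(y - q[3]) +
    (alphas[1] / 2) * interval_score(y, q[2], q[4], alphas[1]) +
    (alphas[2] / 2) * interval_score(y, q[1], q[5], alphas[2])
) / (length(alphas) + 0.5)

# Form 2: the average of twice the pinball loss over all quantile levels.
wis_pinball <- mean(2 * pinball(y, q, quantile_levels))
```

Both forms evaluate to 0.72 for these numbers. Wide intervals and observations that fall outside them both increase the score, so smaller values are better.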

Here’s a simple one-dimensional example using the Ames data; we’ll predict the sale price as a function of latitude. To start, let’s make a training/test split, generate some resamples, and plot the training data.

library(tidymodels)
# We'll also need the qrnn package for the neural network engine

set.seed(1215)
ames_split <-
  ames |>
  select(Latitude, Sale_Price) |>
  initial_split(strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rs <- vfold_cv(ames_train, strata = Sale_Price)

ames_train |>
  ggplot(aes(Latitude, Sale_Price)) +
  geom_point(alpha = 1 / 5) +
  geom_smooth(se = FALSE) +
  labs(x = "Latitude", y = "Sale Price (USD)")

Note that we almost always model these data with a log transformation on the outcome due to its inherent skewness. That helps us avoid making negative predictions, be more robust to overly influential points (i.e., locations with very large sale prices), and stabilize the variance. However, we don’t necessarily have to do that with quantile regression. The objective functions used to estimate parameters do not impose requirements on the normality of the data or the homogeneity of the residual variance. For this analysis, let’s stick with the original units of the outcome (USD).
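
One way to see why no transformation is needed: the usual quantile regression objective is the pinball loss, and minimizing it recovers a quantile regardless of how skewed the data are. The base R sketch below uses simulated log-normal values as a stand-in for prices (not the Ames data) and a constant prediction instead of a model.

```r
set.seed(1)
y <- rlnorm(5000)  # skewed, strictly positive stand-in for prices
tau <- 0.9         # quantile level of interest

# Average pinball loss when predicting the constant q for every observation.
avg_pinball <- function(q) mean((y - q) * (tau - (y < q)))

# The loss is convex in q, so a one-dimensional search is enough.
best_q <- optimize(avg_pinball, interval = range(y))$minimum

c(minimizer = best_q, sample_quantile = unname(quantile(y, tau)))
```

The minimizer lands close to the empirical 0.9 quantile even though the data are far from normal.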

There are a few engines for quantile regression, and we’ll use a neural network model. To get started, the quantiles to be predicted need to be specified. We make a model specification with a few additions:

# Pre-defined quantiles of interest
qnt_lvls <- c(0.05, 0.25, 0.5, 0.75, 0.95)

nnet_spec <-
  mlp(hidden_units = tune(), penalty = tune(), epochs = 10) |>
  # Set the quantile levels with the mode:
  set_mode("quantile regression", quantile_levels = qnt_lvls) |>
  # A new engine for quantile regression with neural networks via the
  # qrnn package. We'll add an engine argument to specify the
  # optimization method for training the model:
  set_engine("qrnn", method = "adam")

# Scale the single predictor to help the model initialize its
# parameters.
nnet_rec <- recipe(Sale_Price ~ ., data = ames_train) |>
  step_normalize(all_predictors())

nnet_wflow <- workflow(nnet_rec, nnet_spec)

From there, we can use any of our tuning functions to optimize the number of hidden units and the amount of weight decay. By default, the weighted interval score is used for this particular mode.

We’ll consider 25 tuning parameter candidates to optimize model performance.

set.seed(971)
nnet_res <-
  nnet_wflow |>
  tune_grid(
    resamples = ames_rs,
    grid = 25,
    control = control_grid(save_workflow = TRUE)
  )

We can get the performance metric and visualize which tuning parameter combinations have the smallest weighted interval score:

nnet_mtr <- collect_metrics(nnet_res)

nnet_mtr |>
  ggplot(aes(penalty, hidden_units, size = mean)) +
  geom_point() +
  scale_x_log10() +
  coord_fixed(ratio = 1) +
  labs(x = "Penalty", y = "# Hidden Units", size = "WIS")

select_best(nnet_res, metric = "weighted_interval_score")

# A tibble: 1 × 3
  hidden_units     penalty .config         
         <int>       <dbl> <chr>           
1           10 0.000000215 pre0_mod24_post0

The model appears to prefer a smaller penalty and more hidden units.

It’s hard to conceptualize how well the model functions with just these numbers. To show that the metric does select good models, let’s fit the best, median, and worst models and see how they look on the test set.

set.seed(8281)
best_model <- fit_best(nnet_res)

set.seed(8281)
worst_model <-
  nnet_mtr |>
  slice_max(mean, n = 1) |>
  select(hidden_units, penalty) |>
  finalize_workflow(nnet_wflow, parameters = _) |>
  fit(ames_train)

set.seed(8281)
mid_model <-
  nnet_mtr |>
  # Since we have an odd number of grid points:
  filter(mean == median(mean)) |>
  select(hidden_units, penalty) |>
  finalize_workflow(nnet_wflow, parameters = _) |>
  fit(ames_train)

Now let’s plot the results. We’ll color the predicted quantiles: black indicates the predicted median sale price, orange lines indicate the inner quartiles, and smoky periwinkle lines indicate the 0.05 and 0.95 quantiles (which could serve as 90% prediction intervals).

bind_rows(
  best_model |> augment(ames_test) |> mutate(Model = "Best Results"),
  mid_model |> augment(ames_test) |> mutate(Model = "Meh Results"),
  worst_model |> augment(ames_test) |> mutate(Model = "Worst Results")
) |>
  mutate(
    .pred_quantile = map(.pred_quantile, ~ as_tibble(.x))
  ) |>
  unnest(.pred_quantile) |>
  arrange(Latitude) |>
  ggplot(aes(Latitude)) +
  geom_point(aes(y = Sale_Price), alpha = 1 / 30, cex = 3 / 4) +
  geom_path(
    aes(
      y = .pred_quantile,
      group = .quantile_levels,
      col = factor(.quantile_levels)
    ),
    show.legend = FALSE,
    linewidth = 1
  ) +
  scale_color_manual(
    values = c("#8785B2FF", "#D95F30FF", "black", "#D95F30FF", "#8785B2FF")
  ) +
  facet_wrap(~Model)

These plots show that configurations with very large score values have poor fits (linear in this case). The “meh” model is nonlinear but not responsive enough to the data’s ups and downs. The best model, with more hidden units and a low penalty, appears to be flexible enough to model the data well.
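
Since the 0.05 and 0.95 quantiles can serve as a 90% prediction interval, one simple sanity check is the empirical coverage: the share of observations that fall between the two predicted quantiles. The sketch below uses simulated normal data and the true quantiles as stand-ins for model predictions, so it only illustrates the calculation, not the Ames results.

```r
set.seed(2)
n <- 10000
y <- rnorm(n, mean = 100, sd = 15)

# Stand-ins for predicted quantiles: the true 0.05 and 0.95 quantiles.
lower <- qnorm(0.05, mean = 100, sd = 15)
upper <- qnorm(0.95, mean = 100, sd = 15)

# Share of observations inside the interval; should be close to 0.90.
coverage <- mean(y >= lower & y <= upper)
coverage
```

With real predictions, lower and upper would be vectors (one pair per row), and the same expression applies.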

We’ll have more metrics in yardstick that can use quantile predictions in the future. For example, we can extend the ones that we have, such as rmse() or rsq(), to use a predicted value from the center of the predictive distribution, such as the 0.5 quantile.

Now we’ll describe various other improvements in the recently released versions.

dials

The latest dials release contains several new parameters for new-ish models in parsnip. For the ordinal_reg() models, dials now contains ordinal_link() and odds_link(). For tab_pfn(), dials contains num_estimators(), softmax_temperature(), balance_probabilities(), average_before_softmax(), and training_set_limit().

The other user-facing changes were related to input checking and related error messages. The most prominent example is that parameters() and the grid_*() functions now give more information in the error message when non-parameter objects are passed in: which inputs aren’t a parameter object and what they are instead.

@corybrunson , @daltonkw , @hfrick , @jeroenjanssens , @topepo , and @vmikk contributed to the package since the last release.

yardstick

Beyond the two new metrics ranked_prob_score() and weighted_interval_score() described above, this release adds a further 8 metrics.

Three new regression metrics:

mse() — mean squared error (the squared counterpart to the existing rmse()).
rmse_relative() — root mean squared error normalized by the observed value range.
gini_coef() — normalized Gini coefficient.
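
As a worked example of the first two definitions, here is the arithmetic in base R on made-up numbers. This follows the descriptions above (squared counterpart of RMSE, and RMSE divided by the observed range) rather than yardstick's source code.

```r
truth    <- c(3.0, 5.0, 2.5, 7.0)
estimate <- c(2.5, 5.0, 4.0, 8.0)

mse_value  <- mean((truth - estimate)^2)              # 0.875
rmse_value <- sqrt(mse_value)
rmse_rel   <- rmse_value / (max(truth) - min(truth))  # RMSE / observed range
```

Because it is scaled by the range of the outcome, the relative RMSE is unitless and easier to compare across outcomes measured on different scales.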

Three new classification metrics:

fall_out() — false positive rate (1 − specificity).
miss_rate() — false negative rate (1 − sensitivity).
markedness() — predictive power of a classifier, computed as PPV + NPV − 1.
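
The three definitions above can be checked by hand from confusion-matrix counts. The counts below are made up for illustration.

```r
# Counts from a hypothetical binary confusion matrix.
tp <- 40; fn <- 10; fp <- 5; tn <- 45

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv <- tp / (tp + fp)   # positive predictive value
npv <- tn / (tn + fn)   # negative predictive value

fall_out_value   <- 1 - specificity   # false positive rate
miss_rate_value  <- 1 - sensitivity   # false negative rate
markedness_value <- ppv + npv - 1
```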

Two new probability-based classification metrics:

roc_dist() — Euclidean distance from the perfect-classifier corner of ROC space.
sedi() — Symmetric Extremal Dependence Index.

In addition to these new metrics, we have also updated the documentation of all metrics. Now each metric shows the formula used to calculate it, as well as the valid values it can produce.

We also have pages that list all metrics of the same type. These can be found with ?class-metrics, ?numeric-metrics or linked within each metric documentation.

We are thankful to the developers who contributed to this version: @abichat, @astamm, @corybrunson, @DarioS, @EmilHvitfeldt, @FvD, @hfrick, @JavOrraca, @jeroenjanssens, @jkylearmstrong-temple, @mle2718, @nathant181, @SimonDedman, @topepo, and @tripartio.

parsnip

Version 1.5.0 of parsnip had a variety of changes. Besides the additions for the two new model types shown above:

We enabled case weight usage for the "nnet" engines of mlp() and bag_mlp() as well as for the "dbarts" engine of bart().

Many of the other changes are most likely to be noticed by developers:

The interface for declaring tunable parameters has been simplified and is the same for main arguments as well as engine parameters. Also, these values can now be set inside extension packages.
We now export the generics for predict_quantile(), predict_class(), predict_classprob(), and predict_hazard().
format_predictions() is a new unified function for formatting prediction outputs, consolidating the logic from the individual format_*() functions. The individual functions format_num(), format_class(), format_classprobs(), format_time(), format_survival(), format_linear_pred(), and format_hazard() are now deprecated.

Thanks to those who contributed to parsnip since the last release: @CeresBarros , @corybrunson , @EmilHvitfeldt , @hfrick , @iamYannC , @jack-davison , @jameslamb , @martinju , and @topepo .

tune

The core functionality of tune is to do all the model fitting (including pre- and postprocessing) and performance evaluation across various resamples and tuning parameter combinations. For grid search, we could take the full parameter grid, splice one parameter combination into the workflow at a time, and run with it. That can be pretty inefficient, though, so tune applies a few optimizations to all that fitting and evaluating:

For preprocessing, we do it once per resample (per preprocessing parameter combination) and then evaluate all model candidates on it. This avoids repeating the same preprocessing unnecessarily.
For model fitting, we make use of what Max calls “the submodel trick”: for certain models, like a boosted tree, you can use a submodel to make predictions without having to refit the model. A boosted tree ensemble fitted with 20 trees can be used to make predictions for any number of trees up to the 20 used for fitting. That allows us to evaluate different tuning parameter candidates (here, the number of trees) without having to refit the model.

When we added postprocessing, we temporarily disabled this trick (to ensure we got the integration right); now we’ve brought it back. We make use of this speedup for both the main model and the calibration model.
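
The submodel trick is easiest to see in a toy implementation. The base R sketch below is not how tune or any engine is implemented; it is a hand-rolled boosting loop with one-split trees (stumps), written only to show that a single fit with 20 trees yields predictions for 5, 10, or 20 trees without refitting. All function names are hypothetical.

```r
# Fit a one-split tree ("stump") to residuals r: pick the cut point on x that
# minimizes squared error and predict the mean residual on each side.
fit_stump <- function(x, r) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between values
  sse <- vapply(cuts, function(cut) {
    left <- r[x < cut]
    right <- r[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  cut <- cuts[which.min(sse)]
  list(cut = cut, left = mean(r[x < cut]), right = mean(r[x >= cut]))
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$cut, stump$left, stump$right)
}

# Fit the ensemble once, with the largest number of trees we care about.
boost <- function(x, y, trees, rate = 0.3) {
  pred <- rep(mean(y), length(y))
  stumps <- vector("list", trees)
  for (i in seq_len(trees)) {
    stumps[[i]] <- fit_stump(x, y - pred)
    pred <- pred + rate * predict_stump(stumps[[i]], x)
  }
  list(init = mean(y), rate = rate, stumps = stumps)
}

# Predict with only the first `trees` stumps: no refitting required.
predict_boost <- function(model, x, trees) {
  pred <- rep(model$init, length(x))
  for (stump in model$stumps[seq_len(trees)]) {
    pred <- pred + model$rate * predict_stump(stump, x)
  }
  pred
}

set.seed(3)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

model <- boost(x, y, trees = 20)  # one fit ...
rmse_by_trees <- sapply(c(5, 10, 20), function(k) {
  sqrt(mean((y - predict_boost(model, x, trees = k))^2))  # ... three candidates
})
```

Evaluating three values of the number of trees costs one model fit plus three cheap prediction passes, which is the saving tune exploits during grid search.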

Another big update is that the Gaussian process model package was changed from GPfit to GauPro because the former is no longer actively maintained. There are some differences:

Fit diagnostics are computed and reported. If the fit quality is poor, an “uncertainty sample” that is furthest away from the existing data is used as the new candidate.
The GP no longer uses binary indicators for qualitative predictors. Instead, a “categorical kernel” is used for those parameter columns. Fewer starting values are required with this change.
For numeric predictors, the Matern 3/2 kernel is always used.
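
For reference, the Matern 3/2 correlation function has a simple closed form. The sketch below uses the textbook expression with a single length-scale; how GauPro parameterizes the kernel internally may differ, and matern32() is a hypothetical helper name.

```r
# Matern 3/2 correlation between two points at distance r,
# with length-scale ell (textbook form; assumed, not taken from GauPro).
matern32 <- function(r, ell = 1) {
  s <- sqrt(3) * abs(r) / ell
  (1 + s) * exp(-s)
}

# Correlation is 1 at distance zero and decays smoothly with distance.
corr <- matern32(c(0, 0.5, 1, 2))
corr
```

In the tuning context, "distance" is measured between tuning parameter candidates, so nearby candidates are assumed to have similar performance.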

Some other changes of note:

When calculating resampling estimates, we can now use a weighted mean based on the number of rows in the assessment set thanks to Tyler Burch. You can opt in to this using the new add_resample_weights() function. See ?calculate_resample_weights.
The warning threshold when checking the size of a workflow is now a parameter to the control functions and has a new default of 100MB.
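
The effect of weighting by assessment set size is easy to see with made-up numbers. This base R sketch only illustrates the arithmetic; how tune computes and normalizes the weights internally is not shown here.

```r
# Hypothetical per-fold RMSE values and assessment set sizes.
fold_rmse <- c(0.40, 0.50, 0.60)
n_assess  <- c(100, 100, 50)

unweighted <- mean(fold_rmse)                     # every fold counts equally
weighted   <- weighted.mean(fold_rmse, n_assess)  # larger folds count more
```

Here the unweighted estimate is 0.50 and the weighted one is 0.48, because the fold with the worst score is also the smallest.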

Some bug fixes:

Models with submodel parameters would train all calibration models on predictions from a single submodel value instead of the correct value for each submodel. We sorted this out.
We fixed a bug for cases where we tune a grid without a model parameter but with a postprocessing parameter.
Another bug was fixed for augment() when using last_fit() objects.

Thanks to the following contributors: @edgararuiz, @EmilHvitfeldt, @hfrick, @jeroenjanssens, @jjcurtin, @mikewolfe, @mthulin, @ncalliencsu, @rvalieris, @StevenWallaert, @tjburch, and @topepo.

finetune

This release was mostly focused on internal changes to support the new version of tune.

tidymodels

A basic release that updates the version numbers to require the latest releases of the core packages.

tabpfn 0.1.0

Max Kuhn — Tue, 31 Mar 2026 00:00:00 +0000

We’re stoked to announce the release of tabpfn 0.1.0. TabPFN is a precompiled deep learning Python model for prediction. The R package tabpfn is an interface to this model via reticulate.

You can install it from CRAN with:

install.packages("tabpfn")

What is TabPFN?

The “tab” means tabular, which is code for everyday rectangular data structures that we find in csv files and databases.

The “pfn” is more complicated – it stands for “prior fitted network”. The model is trained on fully synthetic datasets. The developers created a complex graph model that can simulate a wide variety of data-generating methods, including correlation structures, distributional skewness, missing-data mechanisms, interactions, latent variables, and more. It can also simulate random supervised relationships linking potential predictors to the outcome data. The training process for the model simulated a very large number of these data sets that, in effect, constitute a “training set data point”. For example, during training, if a batch size of 64 was used, that means 64 randomly generated datasets were used in that iteration.

From these data sets, a complex deep learning model is created that captures a huge number of possible relationships. The model is sophisticated enough and trained in a manner that allows it to effectively emulate Bayesian estimation.

When we use the pre-trained model, our training set matters, even though there is no new estimation. The model includes an attention mechanism that “primes the model” by focusing on the types of relationships in your training data. In that way, the pre-fitted network is deliberately biased to effectively predict our new samples. This leads to in-context learning.

And it works; in fact, it works really well.

License for the Underlying Model

PriorLabs created TabPFN. Version 2.5 of the model, which contained several improvements, requires an API key for accessing the model parameters. Without one, an error occurs:

This model is gated and requires you to accept its terms. Please follow these steps:
1. Visit https://huggingface.co/Prior-Labs/tabpfn_2_5 in your browser and accept the terms of use.
2. Log in to your Hugging Face account via the command line by running: hf auth login
(Alternatively, you can set the HF_TOKEN environment variable with a read token).

The license includes provisions for “Non-Commercial Use Only” if you are just trying it out.

Instructions for installing the package and obtaining the API key are in the package’s manual.
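
Since the error message mentions the HF_TOKEN environment variable, a small guard at the top of a script can report a missing token early. This is a base R sketch with a hypothetical helper name; it only checks whether the variable is visible to the R session and does not validate the token.

```r
# TRUE when a non-empty Hugging Face token is visible to this R session.
has_hf_token <- function() nzchar(Sys.getenv("HF_TOKEN"))

if (!has_hf_token()) {
  message("HF_TOKEN is not set; see the package manual for how to obtain one.")
}
```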

Also, the model is most efficient when a GPU is available (by an order of magnitude or two). This may seem obvious to anyone already working with deep learning models, but it is a fairly new requirement for those strictly working with traditional tabular data models.

Usage

The syntax is idiomatic R: it supports fitting interfaces via data frames/vectors, formulas, and recipes. The standard R predict() method is used for prediction; augment() is also available.

When evaluating pre-trained models, there is a possibility that they may have memorized well-known datasets (e.g., Ames housing, Palmer penguins). TabPFN isn’t trained that way, but just in case we are worried about that, we’ll use lesser-known data. Worley (1987) derived a mechanistic model for the flow rate of liquids from two aquifers positioned vertically (i.e., the “upper” and “lower” aquifers). We’ll generate some of that data and add completely noisy predictors to increase the difficulty. The outcome is very skewed, so we’ll log that too.

Additionally, we’ll load the tidymodels library for simulation, data splitting, and visualization.

library(tabpfn)
library(tidymodels)
library(probably)

set.seed(17)
aquifier_data <-
  sim_regression(2000, method = "worley_1987") |>
  bind_cols(sim_noise(2000, 50)) |>
  mutate(outcome = log10(outcome))

We’ll use a stratified 3:1 training and testing split:

set.seed(8223)
aquifier_split <- initial_split(aquifier_data, strata = outcome)
aquifier_split

## <Training/Testing/Total>
## <1500/500/2000>

aquifier_train <- training(aquifier_split)
aquifier_test  <- testing(aquifier_split)

and “fit” the model:

tab_fit <- tab_pfn(outcome ~ ., data = aquifier_train)

Again, the model does not actually fit anything new. This computes the embeddings for the training set data and stores them for the prediction stage.

To make predictions, predict() returns the model’s results. As previously mentioned, a GPU is not strictly required for these computations. However, if more than a trivial amount of data are being predicted, execution time can be very long.

Since we’ll want to evaluate and plot the data, we’ll use augment(), which just runs predict() and binds the results to the data being predicted:

tab_pred <- augment(tab_fit, aquifier_test)

How does it work?

tab_pred |> metrics(outcome, .pred)

## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.104 
## 2 rsq     standard      0.937 
## 3 mae     standard      0.0829

tab_pred |> cal_plot_regression(outcome, .pred)

That looks good, especially with no training.

Next Steps

There is a lot more functionality to add to the package, including additional prediction types and interpretability tools. Many of these are available in extensions.

We’ll also add a new parsnip model type for TabPFN and other integrations with tidymodels in the summer.

Acknowledgements

A huge thanks to Tomasz Kalinowski and Daniel Falbel for their support on this and all of their hard work on reticulate and torch.

Thanks also to the contributors to date: @frankiethull, @mthulin, and @t-kalinowski.

orbital 0.4.0

Emil Hvitfeldt — Mon, 12 Jan 2026 00:00:00 +0000

We’re over the moon to announce the release of orbital 0.4.0. orbital lets you predict in databases using tidymodels workflows.

You can install it from CRAN with:

install.packages("orbital")

This blog post will cover the highlights, which are post processing support and the new show_query() method.

You can see a full list of changes in the release notes.

Post processing support

The biggest improvement in this version is that orbital() now works for supported tailor methods. See the vignette for a list of all supported post-processors.

Let’s start by fitting a classification model on the penguins data set, using {xgboost} as the engine. We will be showcasing using an adjustment that only works on binary classification and will thus recode species to have levels "Adelie" and "not_Adelie".

penguins$species <- forcats::fct_recode(
  penguins$species,
  not_Adelie = "Chinstrap",
  not_Adelie = "Gentoo"
)

After we have modified the data, we set up a simple workflow, with a preprocessor using recipes and the model specification using parsnip.

We also set up a post processor using the tailor package. We add a single adjustment, adjust_equivocal_zone(), which applies an equivocal zone to our binary classification model: predictions that are too close to the threshold are labeled "[EQ]" instead of being assigned a class. Setting the argument value = 0.2 means that any prediction with a predicted probability between 0.3 and 0.7 will be predicted as "[EQ]".

rec_spec <- recipe(species ~ ., data = penguins) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_zv(all_predictors())

lr_spec <- boost_tree(tree_depth = 1, trees = 5) |>
  set_mode("classification") |>
  set_engine("xgboost")

tlr_spec <- tailor() |>
  adjust_equivocal_zone(value = 0.2)

wf_spec <- workflow(rec_spec, lr_spec, tlr_spec)
wf_fit <- fit(wf_spec, data = penguins)
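
The equivocal-zone rule described above can be sketched in a few lines of base R. This is an illustration of the logic with a default threshold of 0.5, not tailor's implementation; the function name is hypothetical and the handling of probabilities that fall exactly on the zone boundaries is an assumption.

```r
# Label predictions whose event probability is within `value` of the
# threshold as "[EQ]"; otherwise assign the class as usual.
equivocal_zone <- function(prob, value = 0.2, threshold = 0.5,
                           event = "Adelie", other = "not_Adelie") {
  ifelse(
    prob > threshold + value, event,
    ifelse(prob < threshold - value, other, "[EQ]")
  )
}

labels <- equivocal_zone(c(0.845, 0.291, 0.483))
labels
```

For these three probabilities the labels are "Adelie", "not_Adelie", and "[EQ]".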

With this fitted workflow object, we can call orbital() on it to create an orbital object. Notice that for adjust_equivocal_zone() to work, we need to set type = c("class", "prob") as both are required for the adjust_equivocal_zone() transformation.

orbital_obj <- orbital(wf_fit, type = c("class", "prob"))
orbital_obj
#> 
#> ── orbital Object ───────────────────────────────────────────────────────
#> • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, ...
#> • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201 ...
#> • .pred_class = dplyr::case_when(1 - 1/(1 + exp(dplyr::case_when(b ...
#> • .pred_Adelie = 1 - 1/(1 + exp(dplyr::case_when(bill_length_mm < ...
#> • .pred_not_Adelie = 1 - (1 - 1/(1 + exp(dplyr::case_when(bill_len ...
#> • .pred_class = dplyr::case_when( .pred_Adelie > 0.5 + 0.2 ~ 'Adel ...
#> ─────────────────────────────────────────────────────────────────────────
#> 6 equations in total.

This object contains all the information needed to produce predictions, which we can generate with predict().

preds <- predict(orbital_obj, penguins)
preds
#> # A tibble: 344 × 3
#>    .pred_class .pred_Adelie .pred_not_Adelie
#>                              
#>  1 Adelie             0.845            0.155
#>  2 Adelie             0.845            0.155
#>  3 Adelie             0.845            0.155
#>  4 not_Adelie         0.291            0.709
#>  5 Adelie             0.845            0.155
#>  6 Adelie             0.845            0.155
#>  7 Adelie             0.845            0.155
#>  8 Adelie             0.845            0.155
#>  9 Adelie             0.845            0.155
#> 10 Adelie             0.845            0.155
#> # ℹ 334 more rows

The predictions are working; however, the first few rows show no evidence that adjust_equivocal_zone() has had an effect. A call to count() reveals that 15 observations land in the equivocal zone.

count(preds, .pred_class)
#> # A tibble: 3 × 2
#>   .pred_class     n
#>          
#> 1 Adelie        144
#> 2 [EQ]           15
#> 3 not_Adelie    185

Filtering to those rows confirms that their predicted probabilities all fall between 0.3 and 0.7.

filter(preds, .pred_class == '[EQ]')
#> # A tibble: 15 × 3
#>    .pred_class .pred_Adelie .pred_not_Adelie
#>                              
#>  1 [EQ]               0.483            0.517
#>  2 [EQ]               0.483            0.517
#>  3 [EQ]               0.483            0.517
#>  4 [EQ]               0.483            0.517
#>  5 [EQ]               0.483            0.517
#>  6 [EQ]               0.483            0.517
#>  7 [EQ]               0.483            0.517
#>  8 [EQ]               0.348            0.652
#>  9 [EQ]               0.348            0.652
#> 10 [EQ]               0.348            0.652
#> 11 [EQ]               0.348            0.652
#> 12 [EQ]               0.348            0.652
#> 13 [EQ]               0.483            0.517
#> 14 [EQ]               0.483            0.517
#> 15 [EQ]               0.483            0.517
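To make the rule concrete, here is a minimal base R sketch of the labeling logic that appears in the last equation of the orbital object (.pred_Adelie > 0.5 + 0.2, .pred_Adelie < 0.5 - 0.2, otherwise "[EQ]"). The helper equivocal_label() is a hypothetical name used only for illustration; it is not part of tailor or orbital.

```r
# Hypothetical helper for illustration; not a tailor or orbital function.
# Mirrors the case_when() shown in the orbital object:
#   prob > threshold + buffer  -> first level
#   prob < threshold - buffer  -> second level
#   otherwise                  -> "[EQ]"
equivocal_label <- function(prob_event, threshold = 0.5, buffer = 0.2,
                            levels = c("Adelie", "not_Adelie")) {
  ifelse(
    prob_event > threshold + buffer, levels[1],
    ifelse(prob_event < threshold - buffer, levels[2], "[EQ]")
  )
}

# Probabilities taken from the printed predictions
probs <- c(0.845, 0.291, 0.483, 0.348)
labels <- equivocal_label(probs)
labels
```

The first two probabilities fall outside the zone and keep a class label, while 0.483 and 0.348 sit between 0.3 and 0.7 and are labeled "[EQ]".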

New show_query method

One of the main purposes of orbital is to allow for predictions in databases.

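library(DBI)
library(RSQLite)

# RSQLite's dbConnect() takes the database location via `dbname`
con_sqlite <- dbConnect(SQLite(), dbname = ":memory:")
# copy_to() comes from dplyr and relies on dbplyr for database connections
penguins_sqlite <- copy_to(con_sqlite, penguins, name = "penguins_table")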

Having set up a database, we could have used orbital_sql() to show what the SQL query would look like. However, its output consists of SQL fragments, so it isn’t immediately ready to be pasted into its own file for quick testing.

The show_query() method has been implemented to see exactly what the generated SQL looks like.

show_query(orbital_obj, con_sqlite)
#> CASE WHEN ((`bill_length_mm` IS NULL)) THEN 43.9219298245614 WHEN NOT ((`bill_length_mm` IS NULL)) THEN `bill_length_mm` END AS bill_length_mm
#> CASE WHEN ((`flipper_length_mm` IS NULL)) THEN 201.0 WHEN NOT ((`flipper_length_mm` IS NULL)) THEN `flipper_length_mm` END AS flipper_length_mm
#> CASE
#> WHEN ((1.0 - 1.0 / (1.0 + EXP(((((CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.627138138
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)
#> END + CASE
#> WHEN (`bill_length_mm` < 43.2999992) THEN 0.425288886
#> WHEN ((`bill_length_mm` >= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)
#> END) + CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.380251437
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)
#> END) + CASE
#> WHEN (`bill_length_mm` < 44.4000015) THEN 0.286071777
#> WHEN ((`bill_length_mm` >= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)
#> END) + CASE
#> WHEN (`flipper_length_mm` < 203.0) THEN 0.209298179
#> WHEN ((`flipper_length_mm` >= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)
#> END) + LOG(0.44186047 / (1.0 - 0.44186047))))) > 0.5) THEN 'Adelie'
#> ELSE 'not_Adelie'
#> END AS .pred_class
#> 1.0 - 1.0 / (1.0 + EXP(((((CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.627138138
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)
#> END + CASE
#> WHEN (`bill_length_mm` < 43.2999992) THEN 0.425288886
#> WHEN ((`bill_length_mm` >= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)
#> END) + CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.380251437
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)
#> END) + CASE
#> WHEN (`bill_length_mm` < 44.4000015) THEN 0.286071777
#> WHEN ((`bill_length_mm` >= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)
#> END) + CASE
#> WHEN (`flipper_length_mm` < 203.0) THEN 0.209298179
#> WHEN ((`flipper_length_mm` >= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)
#> END) + LOG(0.44186047 / (1.0 - 0.44186047)))) AS .pred_Adelie
#> 1.0 - (1.0 - 1.0 / (1.0 + EXP(((((CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.627138138
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)
#> END + CASE
#> WHEN (`bill_length_mm` < 43.2999992) THEN 0.425288886
#> WHEN ((`bill_length_mm` >= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)
#> END) + CASE
#> WHEN (`bill_length_mm` < 42.4000015) THEN 0.380251437
#> WHEN ((`bill_length_mm` >= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)
#> END) + CASE
#> WHEN (`bill_length_mm` < 44.4000015) THEN 0.286071777
#> WHEN ((`bill_length_mm` >= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)
#> END) + CASE
#> WHEN (`flipper_length_mm` < 203.0) THEN 0.209298179
#> WHEN ((`flipper_length_mm` >= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)
#> END) + LOG(0.44186047 / (1.0 - 0.44186047))))) AS .pred_not_Adelie
#> CASE
#> WHEN (`.pred_Adelie` > (0.5 + 0.2)) THEN 'Adelie'
#> WHEN (`.pred_Adelie` < (0.5 - 0.2)) THEN 'not_Adelie'
#> ELSE '[EQ]'
#> END AS .pred_class
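The generated SQL can be read as plain arithmetic: each boosted tree contributes one leaf value, the values are summed together with the base score on the logit scale, and the result is passed through a logistic transformation. The base R sketch below re-implements that arithmetic with the split points and leaf values copied from the query. The function names leaf() and pred_adelie() and the example measurements are assumptions made for illustration, not part of orbital.

```r
# Pick a leaf value: go left when the value is known and below the split,
# otherwise (including missing values) go right, as in the generated SQL.
leaf <- function(x, split, left, right) {
  ifelse(!is.na(x) & x < split, left, right)
}

# Split points and leaf values copied from the show_query() output
pred_adelie <- function(bill_length_mm, flipper_length_mm) {
  margin <-
    leaf(bill_length_mm, 42.4000015, 0.627138138, -0.449751347) +
    leaf(bill_length_mm, 43.2999992, 0.425288886, -0.398178101) +
    leaf(bill_length_mm, 42.4000015, 0.380251437, -0.306771189) +
    leaf(bill_length_mm, 44.4000015, 0.286071777, -0.330096036) +
    leaf(flipper_length_mm, 203.0, 0.209298179, -0.348002464) +
    log(0.44186047 / (1 - 0.44186047)) # base score on the logit scale
  1 - 1 / (1 + exp(margin))
}

# Illustrative measurements (assumed values: a short bill and a short flipper)
p <- pred_adelie(bill_length_mm = 39.1, flipper_length_mm = 181)
round(p, 3)
```

With these inputs every tree takes its left branch, and the result should round to about 0.845, the value that dominates the .pred_Adelie column in the earlier predictions.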

Acknowledgements

A big thank you to all the people who have contributed to orbital since the release of v0.4.0:

@EmilHvitfeldt , @frankiethull , @jeroenjanssens , and @topepo .

tidymodels & xgboost

Emil Hvitfeldt — Mon, 15 Dec 2025 00:00:00 +0000

The xgboost library recently had a big CRAN release, jumping from version 1.7.11.1 to 3.1.2.1. We at the tidymodels team have been following the development and have done our best to ensure that your experience is unaffected by this release.

In addition to all the new features and improvements that are now available for users relying on CRAN versions of packages, there are also a few breaking changes, specifically between versions 1.x and 2.x of the xgboost library. The xgboost team has kindly provided a migration guide for how to update your code if you are upgrading from before version 2.x.

If you are using xgboost purely through tidymodels via functions like parsnip::boost_tree() and embed::step_discretize_xgb() , you should not need to change anything, as we have updated our packages to work with both the new and old versions of xgboost. If you are having any issues, please let us know by filing an issue for the affected package.

We look forward to integrating parsnip more deeply with these new features, such as support for categorical predictors and quantile regression .

Here are the packages that we’ve updated or helped the maintainers update

tidypredict 1.0.0

Emil Hvitfeldt — Wed, 10 Dec 2025 00:00:00 +0000

We’re tickled pink to announce the release of version 1.0.0 of tidypredict . The main goal of tidypredict is to enable running predictions inside databases. It reads the model, extracts the components needed to calculate the prediction, and then creates an R formula that can be translated into SQL.

You can install it from CRAN with:

install.packages("tidypredict")

This blog post highlights the most important changes in this release, including faster computations for tree-based models, more efficient tree representations, glmnet model support, and a change in how random forests are handled. You can see a full list of changes in the release notes .

library(tidypredict)

Improved output for random forest models

In previous versions of tidypredict, tidypredict_fit() would return a list of expressions, one for each tree, when applied to random forest models. This didn’t align with what is returned for other types of models. In version 1.0.0, this has been changed to produce a single, combined expression that reflects how predictions should be made.

This is technically a breaking change, but one we believe is worthwhile, as it provides a more consistent output for tidypredict_fit() and hides the technical details about how to combine trees from different packages.
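As a rough illustration of what a single, combined expression means, the base R sketch below merges two per-tree expressions into one. It assumes a regression forest whose prediction is the average of its trees; the tree expressions are made up, and the way tidypredict actually combines trees depends on the modeling package.

```r
# Two made-up per-tree expressions, standing in for the old list output
tree_exprs <- list(
  quote(ifelse(cyl <= 4, 26.7, 17.0)),
  quote(ifelse(wt <= 3, 24.0, 16.5))
)

# Sum the tree expressions, then divide by the number of trees
# (assumption: the forest predicts the mean of its trees)
summed <- Reduce(function(a, b) call("+", a, b), tree_exprs)
combined <- call("/", summed, length(tree_exprs))

combined
# Evaluate the single expression on the first row of mtcars
pred <- eval(combined, mtcars[1, ])
pred
```

Because the result is one expression rather than a list, downstream code can translate or evaluate it without knowing how the individual trees should be combined.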

Faster parsing of trees

The parsing of xgboost, partykit, and ranger models should now be substantially faster than before. Examples have been shown to be 10 to 200 times faster. Please note that larger models, more trees, and deeper trees still take some time to parse.

More efficient tree expressions

All trees, whether they are a single tree or part of a collection of trees, such as in boosted trees or random forests, are encoded as case_when() statements by tidypredict. For example, consider the following tree.

model <- partykit::ctree(mpg ~ am + cyl, data = mtcars)
model
#> 
#> Model formula:
#> mpg ~ am + cyl
#> 
#> Fitted party:
#> [1] root
#> |   [2] cyl <= 4: 26.664 (n = 11, err = 203.4)
#> |   [3] cyl > 4
#> |   |   [4] cyl <= 6: 19.743 (n = 7, err = 12.7)
#> |   |   [5] cyl > 6: 15.100 (n = 14, err = 85.2)
#> 
#> Number of inner nodes:    2
#> Number of terminal nodes: 3

This tree would previously have been turned into the following case_when() statement.

case_when(
 cyl <= 4 ~ 26.6636363636364,
 cyl <= 6 & cyl > 4 ~ 19.7428571428571, 
 cyl > 6 & cyl > 4 ~ 15.1
)

With this new update, we have taken advantage of the .default argument whenever possible, which should lead to faster predictions, as we no longer need to calculate redundant conditionals.

tidypredict_fit(model)
#> case_when(cyl <= 4 ~ 26.6636363636364, cyl <= 6 & cyl > 4 ~ 19.7428571428571, 
#>     .default = 15.1)

Glmnet support

We now support the glmnet package. This package provides generalized linear models with lasso or elasticnet regularization.

The primary restriction when using a glmnet model with tidypredict is that the model must have been fitted with the lambda argument set to a single value.

model <- glmnet::glmnet(mtcars[, -1], mtcars$mpg, lambda = 0.01)

tidypredict_fit(model)
#> 13.0081464696679 + (cyl * -0.0773532164346008) + (disp * 0.00969507138358544) + 
#>     (hp * -0.0192462098902709) + (drat * 0.816753237688302) + 
#>     (wt * -3.41564341709663) + (qsec * 0.758580151032383) + (vs * 
#>     0.277874296242861) + (am * 2.47356523820533) + (gear * 0.645144527527598) + 
#>     (carb * -0.300886812079305)

glmnet() computes a collection of models using many sets of penalty values. This can be very efficient, but for tidypredict, we need to predict with a single penalty. Note how, as we increase the penalty, the extracted expression correctly removes terms with coefficients of 0 instead of leaving them as (disp * 0).

model <- glmnet::glmnet(mtcars[, -1], mtcars$mpg, lambda = 1)

tidypredict_fit(model)
#> 35.3137765116027 + (cyl * -0.871451193824228) + (hp * -0.0101173960249783) + 
#>     (wt * -2.59443677687505)
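The term-dropping behavior can be sketched in base R. The example below builds an unevaluated expression from a named coefficient vector and skips any coefficient that is exactly 0. The coefficients are rounded from the lambda = 1 output, the zero for disp is added for illustration, and the code is an assumption about the general idea rather than tidypredict's actual implementation.

```r
# Coefficients rounded from the lambda = 1 output; the zero for `disp`
# stands in for a term that the penalty shrank to exactly 0
coefs <- c("(Intercept)" = 35.31, cyl = -0.87, disp = 0, hp = -0.01, wt = -2.59)

intercept <- coefs[["(Intercept)"]]
slopes <- coefs[names(coefs) != "(Intercept)"]
slopes <- slopes[slopes != 0] # drop terms with a zero coefficient

# Build `intercept + (var * coef) + ...` as an unevaluated R expression
term_calls <- lapply(
  names(slopes),
  function(nm) call("*", as.name(nm), slopes[[nm]])
)
fit_expr <- Reduce(function(a, b) call("+", a, b), term_calls, intercept)

fit_expr
pred <- eval(fit_expr, list(cyl = 6, hp = 110, wt = 2.62))
pred
```

Only cyl, hp, and wt appear in the resulting expression, so a database never has to compute a multiplication by zero.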

tidypredict is used as the primary parser for models employed by the orbital package. This means that all the changes seen in this post also take effect when using orbital with tidymodels workflows, such as when using parsnip::linear_reg() with engine = "glmnet".

Acknowledgements

A big thank you to all the folks who helped make this release happen: @EmilHvitfeldt and @jeroenjanssens .

Two New tidymodels Packages

Frances Lin — Sat, 22 Nov 2025 00:00:00 +0000

We’re very chuffed to announce the release of two new modeling packages: filtro and important.

You can install them from CRAN with:

install.packages(c("filtro", "important"))

This blog post will introduce both.

filtro

Feature selection is an important step in building machine learning models that are robust and reliable. By keeping only the most relevant predictors, we can reduce overfitting, improve model performance, and speed up computation.

filtro is a set of low-level tidy tools designed for filter-based supervised feature selection. filtro makes it easy to score, rank, and select features using a wide range of statistical and model-based metrics. The scoring metrics include p-values, correlation, random forest feature importance, information gain, and more.

With filtro, we can quickly rank the variables and select either the top proportion or the top number of features that best contribute to our model. It also supports multi-parameter optimization via desirability functions . filtro is a standalone tool, but it integrates with other packages, allowing it to be used within the tidymodels workflows.

Currently, filtro implements a total of six filters. Like other elements of the framework, filtro is also extensible, so you can add a score that we haven’t implemented yet. You can read more on how to do this on tidymodels.org .

The available score class objects are:

##  [1] "score_aov_fstat"          "score_aov_pval"          
##  [3] "score_cor_pearson"        "score_cor_spearman"      
##  [5] "score_gain_ratio"         "score_imp_rf"            
##  [7] "score_imp_rf_conditional" "score_imp_rf_oblique"    
##  [9] "score_info_gain"          "score_roc_auc"           
## [11] "score_sym_uncert"         "score_xtab_pval_chisq"   
## [13] "score_xtab_pval_fisher"

Let’s look at an example. Kuhn and Johnson (2013) described a data set where 176 samples were collected from a chemical manufacturing process. The goal is to predict process yield. Predictors are continuous, count, and categorical; some are correlated, and some contain missing values.

Let’s create an initial split of the data (which are in the modeldata package):

library(tidymodels)
library(filtro)

set.seed(1)
yield_split <- initial_split(modeldata::chem_proc_yield)
yield_split

## 
## <132/44/176>

yield_train <- training(yield_split)
yield_test <- testing(yield_split)

We’d like to estimate the strength of the relationship between these 57 predictors and the process yield. We’ll quantify that in two ways. First is the old-fashioned Spearman rank correlation statistic. We can estimate these values and rank them by the absolute value of the correlations. We can also measure their value using a random forest variable importance. One quality of the predictors is that their values are correlated, so there may be some value in using an oblique random forest model. This creates a collection of tree-based models with splits that are linear combinations of the selected predictors.

To estimate the scores, we use the score objects contained in the package along with the fit() method:

yield_rank_res <-
  score_cor_spearman |>
  fit(yield ~ ., data = yield_train)

# The object contains the statistics:
yield_rank_res@results |> 
  arrange(desc(abs(score)))

## # A tibble: 57 × 4
##    name          score outcome predictor      
##                           
##  1 cor_spearman  0.655 yield   man_proc_32    
##  2 cor_spearman -0.537 yield   man_proc_36    
##  3 cor_spearman  0.519 yield   bio_material_03
##  4 cor_spearman  0.502 yield   bio_material_06
##  5 cor_spearman  0.491 yield   man_proc_09    
##  6 cor_spearman  0.478 yield   bio_material_02
##  7 cor_spearman  0.446 yield   man_proc_33    
##  8 cor_spearman  0.421 yield   bio_material_12
##  9 cor_spearman -0.420 yield   man_proc_13    
## 10 cor_spearman  0.412 yield   bio_material_04
## # ℹ 47 more rows
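For intuition about what this score measures, here is a base R sketch that computes a Spearman rank correlation between each predictor and the outcome and then ranks predictors by absolute value. It uses the built-in mtcars data with mpg as the outcome, and it is a conceptual stand-in, not filtro's implementation.

```r
# Treat mpg as the outcome and every other column as a predictor
predictors <- setdiff(names(mtcars), "mpg")

# One Spearman rank correlation per predictor
scores <- vapply(
  predictors,
  function(p) cor(mtcars[[p]], mtcars$mpg, method = "spearman"),
  numeric(1)
)

# Rank by the absolute value of the correlation, largest first
ranked <- data.frame(predictor = predictors, score = unname(scores))
ranked <- ranked[order(abs(ranked$score), decreasing = TRUE), ]
head(ranked, 3)
```

The sign of the score is kept in the table, but the ordering ignores it, just as arrange(desc(abs(score))) does in the filtro results.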

To score via a random forest model, we only need to switch out the score object:

yield_rf_res <-
  score_imp_rf_oblique |>
  fit(yield ~ ., data = yield_train)

yield_rf_res@results |> 
  arrange(desc(abs(score)))

## # A tibble: 57 × 4
##    name            score outcome predictor      
##                             
##  1 imp_rf_oblique 0.128  yield   man_proc_32    
##  2 imp_rf_oblique 0.0697 yield   man_proc_36    
##  3 imp_rf_oblique 0.0670 yield   man_proc_17    
##  4 imp_rf_oblique 0.0644 yield   man_proc_09    
##  5 imp_rf_oblique 0.0612 yield   man_proc_13    
##  6 imp_rf_oblique 0.0446 yield   bio_material_03
##  7 imp_rf_oblique 0.0315 yield   man_proc_33    
##  8 imp_rf_oblique 0.0263 yield   man_proc_11    
##  9 imp_rf_oblique 0.0263 yield   bio_material_04
## 10 imp_rf_oblique 0.0262 yield   bio_material_06
## # ℹ 47 more rows

We should probably combine the scores and do a joint ranking. To combine the two sets of statistics:

class_score_list <- list(yield_rank_res, yield_rf_res) |>
  bind_scores()

class_score_list

## # A tibble: 57 × 4
##    outcome predictor       cor_spearman imp_rf_oblique
##                                   
##  1 yield   bio_material_01        0.404        0.0178 
##  2 yield   bio_material_02        0.478        0.0190 
##  3 yield   bio_material_03        0.519        0.0446 
##  4 yield   bio_material_04        0.412        0.0263 
##  5 yield   bio_material_05        0.116        0.00639
##  6 yield   bio_material_06        0.502        0.0262 
##  7 yield   bio_material_07       -0.101        0.00151
##  8 yield   bio_material_08        0.369        0.00714
##  9 yield   bio_material_09        0.109        0.0122 
## 10 yield   bio_material_10        0.214        0.00998
## # ℹ 47 more rows

We can accomplish a joint ranking via desirability functions. Here, we set goals for each score (i.e., maximize, minimize, etc.). The algorithm rescales their values and uses a geometric mean for an overall ranking. The desirability2 package has some nice tools for this. Here’s how we do it:

library(desirability2)
class_score_list |>
  show_best_desirability_prop(
    maximize(cor_spearman, low = 0.25, high = 1),
    maximize(imp_rf_oblique, scale = 2)
  ) |> 
  arrange(desc(.d_overall)) |> 
  select(-starts_with(".d_max_"))

## # A tibble: 57 × 5
##    outcome predictor       cor_spearman imp_rf_oblique .d_overall
##                                         
##  1 yield   man_proc_32            0.655         0.128      0.735 
##  2 yield   man_proc_09            0.491         0.0644     0.291 
##  3 yield   bio_material_03        0.519         0.0446     0.217 
##  4 yield   man_proc_33            0.446         0.0315     0.134 
##  5 yield   bio_material_06        0.502         0.0262     0.129 
##  6 yield   bio_material_04        0.412         0.0263     0.104 
##  7 yield   bio_material_02        0.478         0.0190     0.0926
##  8 yield   bio_material_01        0.404         0.0178     0.0719
##  9 yield   bio_material_11        0.381         0.0194     0.0714
## 10 yield   man_proc_12            0.391         0.0183     0.0705
## # ℹ 47 more rows

Using the scale = 2 option puts more weight on the random forest results.
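The rescale-then-average idea can be sketched in a few lines of base R. The helper d_max() below is a hypothetical name implementing the classic "larger is better" desirability form, ((x - low) / (high - low))^scale clipped to [0, 1]. We assume that low and high default to the observed range when they are not supplied; desirability2's exact defaults may differ. The three rows of scores are copied from the tables above.

```r
# Hypothetical helper: "maximize" desirability, clipped to the unit interval
d_max <- function(x, low = min(x), high = max(x), scale = 1) {
  d <- (x - low) / (high - low)
  d <- pmin(pmax(d, 0), 1)
  d^scale
}

# Scores for man_proc_32, man_proc_09, and bio_material_05
cor_spearman <- c(0.655, 0.491, 0.116)
imp_rf_oblique <- c(0.128, 0.0644, 0.00639)

d_cor <- d_max(cor_spearman, low = 0.25, high = 1)
d_imp <- d_max(imp_rf_oblique, scale = 2)

# Overall desirability: geometric mean of the individual desirabilities
d_overall <- sqrt(d_cor * d_imp)
round(d_overall, 3)
```

Under these assumptions the top predictor gets sqrt(0.54 * 1), or roughly 0.735, which agrees with the .d_overall value reported for man_proc_32. The other rows will not match exactly because the toy vector does not contain the full range of importance scores. A predictor whose correlation falls below low = 0.25 gets an overall desirability of 0 no matter how high its importance is.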

It is unlikely that users will work with filtro directly; it is much better to incorporate these feature selection tools inside a model workflow (as we will see below).

Now that we’ve looked at filtro, next up is the important package (yes, this is what we named it).

important

The important package does two things. First, it provides yet another tool for calculating random forest-like permutation importance scores. We highly value other packages that perform these same calculations (such as DALEX and vip ). Our rationale for creating another package for this is that we’ve developed interfaces for censored regression, including dynamic metrics such as Brier scores or ROC curves that evaluate models at a specific time point. These dynamic methods aren’t available in other packages, and the peculiarities of these metrics make them difficult to incorporate into existing frameworks.

Other niceties about importance scores are that any metric from the yardstick package can be used, and we have optimized parallel processing for the underlying computations. For the latter feature, we support the future and mirai packages for parallel processing.

important also has three recipe steps for supervised feature selection (similar to what Steven Pawley did with his colino package ). The steps are:

Let’s look at the last one, which mirrors our analysis above.

library(important)
goals <-
  desirability(
    maximize(cor_spearman, low = 0.25, high = 1),
    maximize(imp_rf_oblique, scale = 2)
  )

yield_rec <-
  recipe(yield ~ ., data = yield_train) |>
  step_impute_knn(all_predictors(), neighbors = 10) |>
  step_predictor_desirability(
    all_predictors(),
    score = goals,
    prop_terms = 1 / 10
  )
yield_rec

## 
## ── Recipe ───────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 57
## 
## ── Operations
## • K-nearest neighbor imputation for: all_predictors()
## • Feature selection via desirability functions (`cor_spearman`
##   and `imp_rf_oblique`) on: all_predictors()

When combined with a specific model, we can tune the number of neighbors as well as the proportion of predictors retained (10% above).

prep() will do the appropriate estimation steps:

trained_rec <- prep(yield_rec)

Which 10% of the predictors were retained? The tidy() method can list the scores and their rankings:

scores <- tidy(trained_rec, number = 2)
scores |>
  arrange(desc(.d_overall)) |>
  select(-starts_with(".d_max_"), -id)

## # A tibble: 57 × 5
##    terms           removed cor_spearman imp_rf_oblique .d_overall
##                                         
##  1 man_proc_32     FALSE          0.655         0.128       0.735
##  2 man_proc_36     FALSE         -0.530         0.0668      0.325
##  3 man_proc_09     FALSE          0.491         0.0673      0.304
##  4 man_proc_13     FALSE         -0.420         0.0725      0.275
##  5 bio_material_03 FALSE          0.519         0.0517      0.249
##  6 bio_material_06 TRUE           0.502         0.0445      0.210
##  7 man_proc_17     TRUE          -0.303         0.0749      0.158
##  8 man_proc_33     TRUE           0.443         0.0374      0.156
##  9 bio_material_02 TRUE           0.478         0.0330      0.151
## 10 bio_material_04 TRUE           0.412         0.0347      0.133
## # ℹ 47 more rows

# What percentage was removed?
mean(scores$removed * 100)

## [1] 91.22807

Summary

Both filtro and important deliver a feature that has been highly ranked in our tidymodels user surveys: supervised feature selection. filtro contains the underlying framework and important provides recipe steps that can be used in a workflow.

Q3 2025 tidymodels digest

Emil Hvitfeldt — Tue, 18 Nov 2025 00:00:00 +0000

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused.

Since our last update we have had some larger releases that you can read about in these posts.

This post will update you on which packages have changed and the improvements you should know about that haven’t been covered in the above posts.

Here’s a list of the packages and their News sections:

Let’s look at a few specific updates.

Quiet linear svm models

Previously, fitting a linear SVM model with the kernlab engine printed a message that you could not avoid.

library(parsnip)
library(modeldata)

res <- 
  svm_linear(mode = "classification", engine = "kernlab") |> 
  fit(Class ~ ., data = two_class_dat)
#>  Setting default kernel parameters

This message by itself was not that useful, and there was no reasonable way to turn it off. We have silenced this message to hopefully alleviate some of the noise that came from using this method.

library(parsnip)
library(modeldata)
#> 
#> Attaching package: 'modeldata'
#> The following object is masked from 'package:datasets':
#> 
#>     penguins

res <- 
  svm_linear(mode = "classification", engine = "kernlab") |> 
  fit(Class ~ ., data = two_class_dat)
res
#> parsnip model object
#> 
#> Support Vector Machine object of class "ksvm" 
#> 
#> SV type: C-svc  (classification) 
#>  parameter : cost C = 1 
#> 
#> Linear (vanilla) kernel function. 
#> 
#> Number of Support Vectors : 361 
#> 
#> Objective Function Value : -357.1487 
#> Training error : 0.178255 
#> Probability model included.

Fewer numeric overflow issues in brulee

The brulee package has been improved to try to help avoid numeric overflow in the loss functions. The following things have been done to help deal with this type of issue.

Starting values were transitioned to using a Gaussian distribution (instead of uniform) with a smaller standard deviation.
The results always contain the initial results to use as a fallback if there is overflow during the first epoch.
brulee_mlp() has two additional parameters, grad_value_clip and grad_norm_clip, that prevent issues.
The warning was changed to “Early stopping occurred at epoch {X} due to numerical overflow of the loss function.”
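Gradient clipping generally comes in two flavors: clipping each gradient element to a fixed range, or rescaling the whole gradient when its norm is too large. The base R sketch below shows both ideas. The function names clip_value() and clip_norm() are hypothetical, and whether brulee's clipping parameters map exactly onto these two operations is an assumption here.

```r
# Clip every gradient element into [-clip, clip]
clip_value <- function(grad, clip) {
  pmin(pmax(grad, -clip), clip)
}

# Rescale the gradient so that its Euclidean norm is at most max_norm
clip_norm <- function(grad, max_norm) {
  grad_norm <- sqrt(sum(grad^2))
  if (grad_norm > max_norm) grad * (max_norm / grad_norm) else grad
}

grad <- c(3, -4) # an illustrative gradient with norm 5
by_value <- clip_value(grad, clip = 1)
by_norm <- clip_norm(grad, max_norm = 1)
by_value
by_norm
```

Value clipping changes the direction of the update (here to c(1, -1)), while norm clipping keeps the direction and only shrinks the step (here to c(0.6, -0.8)). Either way, a single extreme gradient can no longer push the loss into overflow.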

Additional torch optimizers in brulee

Several additional optimizers have been added: "ADAMw", "Adadelta", "Adagrad", and "RMSprop". Previously, the options were "SGD" and "LBFGS".

Acknowledgements

We want to sincerely thank everyone who contributed to these packages since their previous versions:

dials: @brendad8 , @hfrick , @topepo , and @Wander03 .
parsnip: @chillerb , @EmilHvitfeldt , @jmgirard , @topepo , and @ZWael .
rsample: @abichat , @hfrick , @mkiang , and @vincentarelbundock .
recipes: @EmilHvitfeldt , @SimonDedman , and @topepo .
probably: @abichat , @ayueme , @dchiu911 , @EmilHvitfeldt , @frankiethull , @gaborcsardi , @hfrick , @Jeffrothschild , @jgaeb , @jrwinget , @mark-burdon , @martinhulin , @simonpcouch , @teunbrand , @topepo , @wjakethompson , and @yellowbridge .
brulee: @genec1 , @talegari , and @topepo .

tune version 2.0.0

Max Kuhn — Wed, 05 Nov 2025 00:00:00 +0000

We’re very chuffed to announce the release of tune 2.0.0. tune is a package that can be used to resample models and/or optimize their tuning parameters.

You can install it from CRAN with:

install.packages("tune")

This blog post will describe the two major updates to the package. You can see a full list of changes in the release notes .

The two big improvements to the package are new parallel processing features and postprocessing.

Using future or mirai for parallel processing

Historically, we’ve used the foreach package to run calculations in parallel. Sadly, that package is no longer under active development. We’ve been progressively moving away from it, and as of this version, it is deprecated. In its place, we’ve added functionality for the future and mirai packages.

Previously, you would load a foreach parallel backend package, such as doParallel, doMC, or doFuture, and then register it. For example:

library(doParallel)
cl <- makePSOCKcluster()
registerDoParallel(cl)

Instead, you can use the future package via:

library(future)
plan("multisession")

or the mirai package by using

library(mirai)
num_cores <- 4 # illustrative value; set this to the number of workers you want
daemons(num_cores)

Each of these is configurable to run in various ways, such as on remote servers.

tidymodels.org and the tune pkgdown site have more information to help users switch away from foreach.

Tuning your postprocessor

A postprocessor is an operation that modifies model predictions. For example, if your classifier can separate classes but its probability estimates are not accurate enough, you can add a calibrator operation that can attempt to adjust those probability estimates. Another good example is for binary classifiers, where the default threshold for classifying a prediction as an event can be adjusted based on its corresponding probability estimate.

Currently, we’ve enabled postprocessing using the tailor package . The operations that are currently available:

adjust_numeric_calibration(): Estimate and apply a calibration model for regression problems.
adjust_numeric_range(): Truncate the range of predictions.
adjust_probability_calibration(): Estimate and apply a calibration model for classification problems.
adjust_probability_threshold(): Convert binary class probabilities to hard class predictions using different thresholds.
adjust_equivocal_zone(): Decline to predict a sample if its strongest class probability is low.
adjust_predictions_custom(): A general mutate()-like adjustment.

If the operations have arguments, these can be tuned in the same way as the preprocessors (e.g., a recipe) or the supervised model. For example, let’s tune the probability threshold for a random forest classifier.

We’ll simulate some data with a class imbalance:

library(tidymodels)

set.seed(296)
sim_data <- sim_classification(2000, intercept = -12)
sim_data |> count(class)

## # A tibble: 2 × 2
##   class       n
##      
## 1 class_1   234
## 2 class_2  1766

We’ll resample them via 10-fold cross-validation:

sim_rs <- vfold_cv(sim_data, strata = class)

We define a tailor object that tags the class probability threshold for optimization:

tlr_spec <- 
  tailor() |> 
  adjust_probability_threshold(threshold = tune())

We also specify a random forest that uses its default tuning parameters:

rf_spec <- rand_forest(mode = "classification")
rf_thrsh_wflow <- workflow(class ~ ., rf_spec, tlr_spec)
rf_thrsh_wflow

## ══ Workflow ════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## Postprocessor: tailor
## 
## ── Preprocessor ────────────────────────────────────────────────────────
## class ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger 
## 
## 
## ── Postprocessor ───────────────────────────────────────────────────────
## 
## ── tailor ──────────────────────────────────────────────────────────────
## A binary postprocessor with 1 adjustment:
## 
## • Adjust probability threshold to optimized value.

With a class imbalance, the default 50% threshold yields high specificity but low sensitivity. When we alter the threshold, those numbers will change, and we can select the best trade-off for our application. Let’s tune the workflow:

cls_mtr <- metric_set(roc_auc, sensitivity, specificity)

# To run all resamples in parallel:
mirai::daemons(10)

set.seed(985)
rf_thrsh_res <- 
  rf_thrsh_wflow |> 
  tune_grid(
    resamples = sim_rs,
    grid = tibble(threshold = seq(0, 0.6, by = 0.01)),
    metrics = cls_mtr
  )

Let’s visualize the results:

autoplot(rf_thrsh_res) + lims(y = 0:1)

We can see that we can improve sensitivity by reducing the threshold. The rate of decay in specificity is slow compared to the gain in sensitivity until thresholds less than 10% are used. The area under the ROC curve is constant across thresholds since it only uses the estimated class probabilities, which are unaffected by the threshold.
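The trade-off can be reproduced by hand on a tiny example. The base R sketch below uses ten made-up probabilities for a rare event and a hypothetical helper, threshold_metrics(), to compute sensitivity and specificity at three thresholds. None of these values come from the tuning results.

```r
# Hypothetical helper: sensitivity and specificity at a given threshold
threshold_metrics <- function(prob_event, is_event, threshold) {
  pred_event <- prob_event >= threshold
  c(
    sensitivity = sum(pred_event & is_event) / sum(is_event),
    specificity = sum(!pred_event & !is_event) / sum(!is_event)
  )
}

# Made-up data: 3 events and 7 non-events with estimated event probabilities
is_event <- c(TRUE, TRUE, TRUE, rep(FALSE, 7))
prob_event <- c(0.62, 0.35, 0.18, 0.40, 0.22, 0.15, 0.10, 0.08, 0.05, 0.02)

thresholds <- c(0.5, 0.3, 0.1)
res <- t(sapply(thresholds, function(th) {
  threshold_metrics(prob_event, is_event, th)
}))
cbind(threshold = thresholds, res)
```

Lowering the threshold from 0.5 to 0.1 raises sensitivity from 1/3 to 1 in this toy data while specificity falls from 1 to 3/7, the same kind of exchange that the tuning results quantify across resamples.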

We’ve taken great pains to avoid redundant calculations. In this example, for each resample, a single random forest model is trained, and then the postprocessing grid is evaluated. This conditional execution strategy is used to fit the fewest possible preprocessors, models, and postprocessors.
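The idea of fitting once per resample and then reusing the predictions for every postprocessing candidate can be sketched in base R. This is only an illustration of the strategy, not tune's implementation: the logistic regression, the simulated data, and the fold assignment are all assumptions made for the example.

```r
# Hypothetical data and a logistic regression as a stand-in model
set.seed(2)
dat <- data.frame(x = rnorm(200))
dat$y <- rbinom(200, size = 1, prob = plogis(dat$x - 1))

folds <- sample(rep(1:5, length.out = nrow(dat)))
thresholds <- seq(0.1, 0.5, by = 0.1)

n_fits <- 0
results <- list()

for (k in 1:5) {
  train <- dat[folds != k, ]
  test <- dat[folds == k, ]

  # The expensive step happens once per resample
  fit <- glm(y ~ x, data = train, family = binomial())
  n_fits <- n_fits + 1
  p <- predict(fit, newdata = test, type = "response")

  # The cheap postprocessing grid reuses the same predictions
  for (th in thresholds) {
    pred <- as.integer(p >= th)
    results[[length(results) + 1]] <- data.frame(
      fold = k,
      threshold = th,
      accuracy = mean(pred == test$y)
    )
  }
}

results <- do.call(rbind, results)
n_fits
nrow(results)
```

Five models are fit in total, yet all 25 fold-by-threshold combinations get evaluated.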

For this classification example, recent updates to the desirability2 package can enable you to jointly find the best sensitivity/specificity trade-off using the threshold parameter and model calibration/separation using other parameters.

We’ll add more examples and tutorials to tidymodels.org to showcase what we can do with postprocessing.

What’s next

This has been a race towards posit::conf(2025). Our focus had to be on the two big features for this release (since we taught workshops that use them). There are a few other relatively minor issues to address as the year closes.

One is to swap the package that we currently use for Gaussian Processes in Bayesian optimization from the GPfit package to the GauPro package. The former is not actively supported, and the latter has a few features that we’d love to have. Specifically, better kernel methods for non-numeric tuning parameters (e.g., the type of activation function used in neural networks). Hopefully, we’ll have another planned release before the end of the year.

Another near-future development goal is to have comprehensive integration for quantile regression models. We’ve added a few parsnip engines already and will expand the support in yardstick and tune.

Acknowledgements

We’d like to thank everyone who contributed since the previous version: @3styleJam , @Diyar0D , @EmilHvitfeldt , @hfrick , @MatthieuStigler , @MattJEM , @mthulin , @tjburch , and @topepo .

mall 0.2.0

Edgar Ruiz — Tue, 19 Aug 2025 00:00:00 +0000

mall uses Large Language Models (LLM) to run Natural Language Processing (NLP) operations against your data. This package is available for both R and Python. Version 0.2.0 has been released to CRAN and PyPI respectively.

In R, you can install the latest version with:

install.packages("mall")

In Python, with:

pip install mlverse-mall

This release expands the number of LLM providers you can use with mall. Also, in Python it introduces the option to run the NLP operations over string vectors, and in R, it enables support for ‘parallelized’ requests.

It is also very exciting to announce a brand new cheatsheet for this package. It is available in print (PDF) and HTML format!

More LLM providers

The biggest highlight of this release is the ability to use external LLM providers such as OpenAI , Gemini and Anthropic . Instead of writing an integration for each provider one by one, mall uses specialized integration packages to act as intermediaries.

In R, mall uses the ellmer package to integrate with a variety of LLM providers . To access the new feature, first create a chat connection, and then pass that connection to llm_use(). Here is an example of connecting and using OpenAI:

install.packages("ellmer")

library(mall)
library(ellmer)

chat <- chat_openai()
#> Using model = "gpt-4.1".

llm_use(chat, .cache = "_my_cache")
#> 
#> ── mall session object 
#> Backend: ellmer
#> LLM session: model:gpt-4.1
#> R session: cache_folder:_my_cache

In Python, mall uses chatlas as the integration point with the LLM. chatlas also integrates with several LLM providers . To use, first instantiate a chatlas chat connection class, and then pass that to the Polars data frame via the .llm.use() function:

pip install chatlas

import mall
from chatlas import ChatOpenAI

chat = ChatOpenAI()

data = mall.MallData
reviews = data.reviews

reviews.llm.use(chat)
#> {'backend': 'chatlas', 'chat': 
#> , '_cache': '_mall_cache'}

Connecting mall to external LLM providers introduces a consideration of cost. Most providers charge for the use of their API, so there is a potential that a large table, with long texts, could be an expensive operation.

Parallel requests (R only)

A new feature introduced in ellmer 0.3.0 makes it possible to submit multiple prompts in parallel, rather than in sequence. This makes it faster, and potentially cheaper, to process a table. If the provider supports this feature, ellmer is able to leverage it via the parallel_chat() function. Gemini and OpenAI support the feature.

In the new release of mall, the integration with ellmer has been specially written to take advantage of parallel chat. The internals have been re-written to submit the NLP-specific instructions as a system message in order to reduce the size of each prompt. Additionally, the cache system has also been re-tooled to support batched requests.

NLP operations without a table

Since its initial version, mall has provided the ability for R users to perform the NLP operations over a string vector, in other words, without needing a table. Starting with the new release, mall also provides this same functionality in its Python version.

mall can process vectors contained in a list object. To use, initialize a new LLMVec class object with either an Ollama model, or a chatlas Chat object, and then access the same NLP functions as the Polars extension.

# Initialize a Chat object
from chatlas import ChatOllama
chat = ChatOllama(model = "llama3.2")

# Pass it to a new LLMVec
from mall import LLMVec
llm = LLMVec(chat)    

Access the functions via the new LLMVec object, and pass the text to be processed.

llm.sentiment(["I am happy", "I am sad"])
#> ['positive', 'negative']

llm.translate(["Este es el mejor dia!"], "english")
#> ['This is the best day!']

For more information visit the reference page: LLMVec

New cheatsheet

The brand new official cheatsheet is now available from Posit: Natural Language processing using LLMs in R/Python . Its main feature is that one side of the page is dedicated to the R version, and the other side of the page to the Python version.

A web page version is also available in the official cheatsheet site here . It takes advantage of the tab feature that lets you select between R and Python explanations and examples.

recipes 1.3.0

Emil Hvitfeldt — Mon, 28 Apr 2025 00:00:00 +0000

We’re thrilled to announce the release of recipes 1.3.0. recipes lets you create a pipeable sequence of feature engineering steps.

You can install it from CRAN with:

install.packages("recipes")

This blog post will walk through some of the highlights of this release, which include changes to how strings_as_factors is specified, the deprecation of step_select(), a new contrasts argument for step_dummy(), and improvements for step_impute_bag().

You can see a full list of changes in the release notes .

Let’s first load the package:

library(recipes)

`strings_as_factors`

Recipes by default convert predictor strings to factors, and the option for that is located in prep() . This caused an issue when you wanted to set strings_as_factors = FALSE for a recipe that is used somewhere else like in a workflow.

This is no longer an issue as we have moved the argument to recipe() itself. We are at the same time deprecating the use of strings_as_factors when used in prep() . Here is an example:

library(modeldata)
tate_text
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>                                                        
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

We are loading the modeldata package to get tate_text which has a character column title. If we don’t do anything then it turns into a factor.

recipe(~., data = tate_text) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>                                                        
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

But we can set strings_as_factors = FALSE in recipe() and the column will stay a character column.

recipe(~., data = tate_text, strings_as_factors = FALSE) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>                                                        
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

This change should also make pragmatic sense, as whether you want to turn strings into factors is something that should be encoded into the recipe itself.

Deprecating `step_select()`

We have started the process of deprecating step_select() . Given the number of issues people are having with the step and the fact that it doesn’t play well with workflows we think this is the right call.

There are two main use cases where step_select() was used: removing variables, and selecting variables. Variables were removed by using - in step_select():

recipe(mpg ~ ., mtcars) |>
  step_select(-starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#>      cyl    hp    wt  qsec    vs    am  gear  carb   mpg
#>            
#>  1     6   110  2.62  16.5     0     1     4     4  21  
#>  2     6   110  2.88  17.0     0     1     4     4  21  
#>  3     4    93  2.32  18.6     1     1     4     1  22.8
#>  4     6   110  3.22  19.4     1     0     3     1  21.4
#>  5     8   175  3.44  17.0     0     0     3     2  18.7
#>  6     6   105  3.46  20.2     1     0     3     1  18.1
#>  7     8   245  3.57  15.8     0     0     3     4  14.3
#>  8     4    62  3.19  20       1     0     4     2  24.4
#>  9     4    95  3.15  22.9     1     0     4     2  22.8
#> 10     6   123  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

These use cases can seamlessly be converted to use step_rm() without the - for the same result.

recipe(mpg ~ ., mtcars) |>
  step_rm(starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#>      cyl    hp    wt  qsec    vs    am  gear  carb   mpg
#>            
#>  1     6   110  2.62  16.5     0     1     4     4  21  
#>  2     6   110  2.88  17.0     0     1     4     4  21  
#>  3     4    93  2.32  18.6     1     1     4     1  22.8
#>  4     6   110  3.22  19.4     1     0     3     1  21.4
#>  5     8   175  3.44  17.0     0     0     3     2  18.7
#>  6     6   105  3.46  20.2     1     0     3     1  18.1
#>  7     8   245  3.57  15.8     0     0     3     4  14.3
#>  8     4    62  3.19  20       1     0     4     2  24.4
#>  9     4    95  3.15  22.9     1     0     4     2  22.8
#> 10     6   123  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

For selecting variables there are two cases. The first is as a tool to select which variables to use in our model. We recommend that you use select() to do that before passing the data into the recipe() . This matters because recipes are tighter with respect to their input types , so it helps to pass only the data you need.

If you need to do the selection after another step takes effect you should still be able to do so, by using step_rm() in the following manner.

step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>))

`step_dummy()` contrasts argument

Contrasts such as contr.treatment() and contr.poly() are used in step_dummy() to determine how the steps should translate categorical values into one or more numeric columns. Traditionally the contrasts were set using options() like so:

options(contrasts = c(unordered = "contr.poly", ordered = "contr.poly"))

recipe(~species + island, penguins) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#>    species_Chinstrap species_Gentoo island_Dream island_Torgersen
#>                                              
#>  1            -0.707          0.408        0.707            0.408
#>  2            -0.707          0.408        0.707            0.408
#>  3            -0.707          0.408        0.707            0.408
#>  4            -0.707          0.408        0.707            0.408
#>  5            -0.707          0.408        0.707            0.408
#>  6            -0.707          0.408        0.707            0.408
#>  7            -0.707          0.408        0.707            0.408
#>  8            -0.707          0.408        0.707            0.408
#>  9            -0.707          0.408        0.707            0.408
#> 10            -0.707          0.408        0.707            0.408
#> # ℹ 334 more rows

The issue with this approach is that it pulls from options() when it needs it instead of storing the information. This means that if you put this recipe in production you will need to set the option in the production environment to match that of the training environment.
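The dependence on session options can be reproduced without recipes. The sketch below uses base R's model.matrix() as a stand-in for the dummy-variable step (an assumption made for illustration; it is not how step_dummy() is implemented) to show the same code producing different columns under different options, and an explicitly passed contrast removing that dependence.

```r
f <- factor(c("a", "b", "c"))

# "Training" session: treatment contrasts for unordered factors
old <- options(contrasts = c(unordered = "contr.treatment", ordered = "contr.poly"))
m_treatment <- model.matrix(~ f)

# "Production" session: same code, different option, different columns
options(contrasts = c(unordered = "contr.poly", ordered = "contr.poly"))
m_poly_option <- model.matrix(~ f)

# Restore whatever the options were before
options(old)

# Passing the contrast explicitly no longer depends on options()
m_poly_explicit <- model.matrix(~ f, contrasts.arg = list(f = "contr.poly"))

colnames(m_treatment)
colnames(m_poly_option)
```

Storing the contrast alongside the preprocessing object, rather than reading it from the session, is what makes the result reproducible across environments.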

To fix this issue we have given step_dummy() an argument contrasts that works in much the same way. You simply specify the contrast you want and it will be stored in the object for easy deployment.

recipe(~species + island, penguins) |>
  step_dummy(
    all_nominal_predictors(), contrasts = "contr.poly") |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#>    species_Chinstrap species_Gentoo island_Dream island_Torgersen
#>                                              
#>  1            -0.707          0.408        0.707            0.408
#>  2            -0.707          0.408        0.707            0.408
#>  3            -0.707          0.408        0.707            0.408
#>  4            -0.707          0.408        0.707            0.408
#>  5            -0.707          0.408        0.707            0.408
#>  6            -0.707          0.408        0.707            0.408
#>  7            -0.707          0.408        0.707            0.408
#>  8            -0.707          0.408        0.707            0.408
#>  9            -0.707          0.408        0.707            0.408
#> 10            -0.707          0.408        0.707            0.408
#> # ℹ 334 more rows

If you are using a contrast from an external package, such as hardhat::contr_one_hot(), you will need to load the package with library(hardhat) in the environment you are working in and set contrasts = "contr_one_hot". You will also need to call library(hardhat) in any production environment where you use this recipe.

tidyselect can be used everywhere

Several steps such as step_pls() and step_impute_bag() require the selection of more than just the affected columns. step_pls() needs you to select an outcome variable and step_impute_bag() needs you to select which variables to impute with, impute_with, if you don’t want to use all predictors. Previously these needed to be strings or use special selectors like imp_vars() . You don’t have to do that anymore. You can now use tidyselect in these arguments too.

recipe(mpg ~ ., mtcars) |>
  step_pls(all_predictors(), outcome = mpg) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 3
#>      mpg   PLS1   PLS2
#>        
#>  1  21    0.693  0.895
#>  2  21    0.650  0.654
#>  3  22.8  2.78   0.378
#>  4  21.4  0.210 -0.368
#>  5  18.7 -1.95   0.845
#>  6  18.1  0.137 -0.624
#>  7  14.3 -2.77   0.364
#>  8  24.4  1.81  -1.30 
#>  9  22.8  2.12  -1.95 
#> 10  19.2  0.531 -1.51 
#> # ℹ 22 more rows

Arguments that allow for multiple selections now work with recipes selectors like all_numeric_predictors() and has_role() .

recipe(mpg ~ ., mtcars) |>
  step_impute_bag(all_predictors(), impute_with = has_role("predictor")) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 11
#>      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   mpg
#>              
#>  1     6  160    110  3.9   2.62  16.5     0     1     4     4  21  
#>  2     6  160    110  3.9   2.88  17.0     0     1     4     4  21  
#>  3     4  108     93  3.85  2.32  18.6     1     1     4     1  22.8
#>  4     6  258    110  3.08  3.22  19.4     1     0     3     1  21.4
#>  5     8  360    175  3.15  3.44  17.0     0     0     3     2  18.7
#>  6     6  225    105  2.76  3.46  20.2     1     0     3     1  18.1
#>  7     8  360    245  3.21  3.57  15.8     0     0     3     4  14.3
#>  8     4  147.    62  3.69  3.19  20       1     0     4     2  24.4
#>  9     4  141.    95  3.92  3.15  22.9     1     0     4     2  22.8
#> 10     6  168.   123  3.92  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

These changes are backwards compatible meaning that the old ways still work with minimal warnings.

`step_impute_bag()` now takes up less memory

We have another benefit for users of step_impute_bag() . For each variable it imputes on, it fits a bagged tree model, which is then used to predict with for imputation. It was noticed that these models had a larger memory footprint than was needed. This has been remedied, so now there should be a noticeable decrease in size for recipes with step_impute_bag() .

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_impute_bag(starts_with("Lot_"), impute_with = all_numeric_predictors()) |>
  prep()

lobstr::obj_size(rec)
#> 20.23 MB

This recipe used to take up over 75 MB and now takes up about 20 MB.

Acknowledgements

Many thanks to all the people who contributed to recipes since the last release!

@chillerb , @dshemetov , @EmilHvitfeldt , @kevbaer , @nhward , @regisely , and @topepo .

rsample 1.3.0

Hannah Frick — Thu, 03 Apr 2025 00:00:00 +0000

We’re thrilled to announce the release of rsample 1.3.0. rsample makes it easy to create resamples for assessing model performance. It is part of the tidymodels framework, a collection of R packages for modeling and machine learning using tidyverse principles.

You can install it from CRAN with:

install.packages("rsample")

This blog post will walk you through the more flexible grouping for calculating bootstrap confidence intervals and highlight the contributions made by participants of the tidyverse developer day.

You can see a full list of changes in the release notes .

library(rsample)

Flexible grouping for bootstrap intervals

Resampling allows you to get an understanding of the variability of an estimate, e.g., a summary statistic of your data. If you want to lean on statistical theory and get confidence intervals for your estimate, you can reach for the bootstrap resampling scheme: calculating your summary statistic on the bootstrap samples enables you to calculate confidence intervals around your point estimate.

rsample contains a family of int_*() functions to calculate bootstrap confidence intervals of different flavors: percentile intervals, “BCa” intervals, and bootstrap-t intervals. If you want to dive into the technical details, Chapter 11 of CASI is a good place to start.
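As a rough illustration of the percentile flavor, here is a base R sketch that does not use rsample. The data are hypothetical, and the interval is simply the empirical quantiles of the bootstrap distribution; the int_*() functions should be preferred in practice.

```r
set.seed(1)
# Hypothetical skewed data standing in for, e.g., delivery times
x <- rexp(200, rate = 1 / 20)

# Resample with replacement and recompute the statistic each time
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# Percentile interval: empirical quantiles of the bootstrap distribution
alpha <- 0.1
ci <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
ci
```

With alpha = 0.1 this gives a 90% interval, bounded by the 5th and 95th percentiles of the bootstrapped means.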

You can calculate the confidence intervals based on a grouping in your data. However, so far, rsample would only let you provide a single grouping variable. With this release, we are extending this functionality to allow a more flexible grouping.

The motivating application for us was to be able to calculate confidence intervals around multiple model performance metrics, including dynamic metrics for time-to-event models which depend on an evaluation time point. So in this case, the metric is one grouping variable and the evaluation time another. But let’s pull back complexity for an example of how the new rsample functionality works!

We have a dataset with delivery times for orders containing one or more items. We’ll do some data wrangling with it, so we are also loading dplyr.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(deliveries, package = "modeldata")

deliveries
#> # A tibble: 10,012 × 31
#>    time_to_delivery  hour day   distance item_01 item_02 item_03 item_04 item_05
#>                                    
#>  1             16.1  11.9 Thu       3.15       0       0       2       0       0
#>  2             22.9  19.2 Tue       3.69       0       0       0       0       0
#>  3             30.3  18.4 Fri       2.06       0       0       0       0       1
#>  4             33.4  15.8 Thu       5.97       0       0       0       0       0
#>  5             27.2  19.6 Fri       2.52       0       0       0       1       0
#>  6             19.6  13.0 Sat       3.35       1       0       0       1       0
#>  7             22.1  15.5 Sun       2.46       0       0       1       1       0
#>  8             26.6  17.0 Thu       2.21       0       0       1       0       0
#>  9             30.8  16.7 Fri       2.62       0       0       0       0       0
#> 10             17.4  11.9 Sun       2.75       0       2       1       0       0
#> # ℹ 10,002 more rows
#> # ℹ 22 more variables: item_06 , item_07 , item_08 ,
#> #   item_09 , item_10 , item_11 , item_12 , item_13 ,
#> #   item_14 , item_15 , item_16 , item_17 , item_18 ,
#> #   item_19 , item_20 , item_21 , item_22 , item_23 ,
#> #   item_24 , item_25 , item_26 , item_27

Instead of fitting a whole model here, we are calculating a straightforward summary statistic for how much delivery time increases if an item is included in the order. So the item is one grouping factor. As a second one, we are using whether the order was delivered on a weekday or a weekend. Let’s start by making that weekend indicator and reshaping the data to make it easier to calculate our summary statistic.

Note that the name for the weekend indicator column, .weekend, starts with a dot. That is important as it is the convention to signal to rsample that this is an additional grouping variable.

item_data <- deliveries %>%
  mutate(.weekend = ifelse(day %in% c("Sat", "Sun"), "weekend", "weekday")) %>%
  select(time_to_delivery, .weekend, starts_with("item")) %>%
  tidyr::pivot_longer(starts_with("item"), names_to = "item", values_to = "value")

Next, we are making a small function that calculates the ratio of average delivery times with and without the item included in the order, as an estimate of how much a specific item in an order increases the delivery time.

relative_increase <- function(data) {
  data %>%
    mutate(includes_item = value > 0) %>%
    summarize(
      has = mean(time_to_delivery[includes_item]),
      has_not = mean(time_to_delivery[!includes_item]),
      .by = c(item, .weekend)
    ) %>%
    mutate(estimate = has / has_not) %>%
    select(term = item, .weekend, estimate)
}

We can calculate that on our entire dataset.

relative_increase(item_data)
#> # A tibble: 54 × 3
#>    term    .weekend estimate
#>              
#>  1 item_01 weekday      1.07
#>  2 item_02 weekday      1.02
#>  3 item_03 weekday      1.02
#>  4 item_04 weekday      1.00
#>  5 item_05 weekday      1.00
#>  6 item_06 weekday      1.01
#>  7 item_07 weekday      1.03
#>  8 item_08 weekday      1.01
#>  9 item_09 weekday      1.01
#> 10 item_10 weekday      1.06
#> # ℹ 44 more rows

This is fine, but what we really want here is to get confidence intervals around these estimates!

So let’s make bootstrap samples and calculate our statistic on those.

set.seed(1)
item_bootstrap <- bootstraps(item_data, times = 1000)

item_stats <-
  item_bootstrap %>%
  mutate(stats = purrr::map(splits, ~ analysis(.x) %>% relative_increase()))

Now we have everything we need to calculate the confidence intervals, stashed into the tibbles in the stats column: an estimate, a term (the primary grouping variable), and our additional grouping variable .weekend, starting with a dot. What’s left to do is call one of the int_*() functions and specify which column contains the statistics. Here, we’ll calculate percentile intervals with int_pctl() .

item_ci <- int_pctl(item_stats, statistics = stats, alpha = 0.1)
item_ci
#> # A tibble: 54 × 7
#>    term    .weekend .lower .estimate .upper .alpha .method   
#>                           
#>  1 item_01 weekday   1.05      1.07    1.09    0.1 percentile
#>  2 item_01 weekend   1.04      1.07    1.10    0.1 percentile
#>  3 item_02 weekday   1.00      1.02    1.03    0.1 percentile
#>  4 item_02 weekend   0.996     1.01    1.03    0.1 percentile
#>  5 item_03 weekday   1.01      1.02    1.04    0.1 percentile
#>  6 item_03 weekend   0.970     0.990   1.01    0.1 percentile
#>  7 item_04 weekday   0.989     1.00    1.02    0.1 percentile
#>  8 item_04 weekend   0.998     1.02    1.03    0.1 percentile
#>  9 item_05 weekday   0.987     1.00    1.02    0.1 percentile
#> 10 item_05 weekend   0.982     1.00    1.03    0.1 percentile
#> # ℹ 44 more rows

Tidyverse developer day

At the tidyverse developer day after posit::conf, rsample got a lot of love in the form of contributions by various community members. People improved documentation and examples, moved deprecations along, tightened checks to support good practice, and upgraded errors and warnings, both in style and content. None of these changes are flashy new features but all of them are essential to rsample working well!

So for example, leave-one-out (LOO) cross-validation is not a great choice of resampling scheme in most situations. From Tidy modeling with R :

For anything but pathologically small samples, LOO is computationally excessive, and it may not have good statistical properties.

It was possible, however, to create implicit LOO samples by using vfold_cv() with the number of folds set to the number of rows in the data. With a dev day contribution, this now errors:

vfold_cv(mtcars, v = nrow(mtcars))
#> Error in `vfold_cv()`:
#> ! Leave-one-out cross-validation is not supported by this function.
#> ✖ You set `v` to `nrow(data)`, which would result in a leave-one-out
#>   cross-validation.
#> ℹ Use `loo_cv()` in this case.

This is to make users pause and consider if this is a good choice for their dataset. If you require LOO, you can still use loo_cv() .

Error messages in general have been a focus of ours across various tidymodels packages, and rsample is no exception. We opened a bunch of issues to tackle all of rsample, and all got closed! Some of these changes are purely internal, upgrading manual formatting to let the cli package do the work. While the error message in most cases doesn’t look different, the formatting is a great deal more consistent.

For some error messages, the additional functionality in cli makes it easy to improve readability. This error message used to be one block of text; now it comes as three bullet points.

permutations(mtcars, everything())
#> Error in `permutations()`:
#> ! You have selected all columns to permute.
#> ℹ This effectively reorders the rows in the original data without changing the
#>   data structure.
#> → Please select fewer columns to permute.

Changes like these are super helpful to users and developers alike. A big thank you to all the contributors!

Acknowledgements

Many thanks to all the people who contributed to rsample since the last release!

@agmurray , @brshallo , @ccani007 , @dicook , @Dpananos , @EmilHvitfeldt , @gaborcsardi , @gregor-fausto , @hfrick , @JamesHWade , @jttoivon , @krz , @laurabrianna , @malcolmbarrett , @MatthieuStigler , @msberends , @nmercadeb , @PriKalra , @seb09 , @simonpcouch , @topepo , @ZWael , and @zz77zz .

Improved sparsity support in tidymodels

Emil Hvitfeldt — Wed, 19 Mar 2025 00:00:00 +0000

Photo by Oliver Olah on Unsplash

We’re stoked to announce tidymodels now fully supports sparse data from end to end. We have been working on this for over 5 years . This is an extension of the work we have done previously with blueprints, which would carry the data sparsely some of the way.

You will need recipes 1.2.0 , parsnip 1.3.0 , workflows 1.2.0 or later for this to work.

What are sparse data?

The term sparse data refers to a data set containing many zeroes. Sparse data appears in all kinds of fields and can be produced by a number of preprocessing methods. The reason why we care about sparse data is because of how computers store numbers. A 32-bit integer value takes 4 bytes to store. An array of 10 32-bit integers takes 40 bytes, and so on. This happens because each value is written down.

A sparse representation instead stores the locations and values of the non-zero entries. Suppose we have the following vector with 20 entries:

c(0, 0, 1, 0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

It could be represented sparsely using the 3 components positions = c(3, 5, 8), values = c(1, 3, 7), and length = 20. Now, we have seven values to represent a vector of 20 elements. Since some modeling tasks contain even sparser data, this type of representation starts to show real benefits in terms of execution time and memory consumption.
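The positions/values/length idea can be sketched in a few lines of base R. The list layout and the to_dense() helper are hypothetical names used only for illustration; they are not the internal representation used by tidymodels.

```r
x <- c(0, 0, 1, 0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Keep only where the non-zero entries are, what they are, and the total length
sparse <- list(
  positions = which(x != 0),
  values = x[x != 0],
  length = length(x)
)

# Hypothetical helper: rebuild the dense vector from the sparse pieces
to_dense <- function(s) {
  out <- numeric(s$length)
  out[s$positions] <- s$values
  out
}

# Numbers stored in the sparse form versus the dense form
n_stored <- length(sparse$positions) + length(sparse$values) + 1
c(sparse = n_stored, dense = length(x))
```

The round trip through to_dense() recovers the original vector exactly, so nothing is lost by dropping the zeroes.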

The tidymodels set of packages has undergone several internal changes to allow it to represent data sparsely internally when it would be beneficial. These changes allow you to fit models that contain sparse data faster and more memory efficiently than before. Moreover, it allows you to fit models previously not possible due to them not fitting in memory.
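To get a feel for the memory difference, the sketch below compares a dense matrix with its sparse counterpart from the Matrix package. The simulated matrix (1% non-zero entries) is an assumption for illustration, and actual savings depend on how sparse your data are.

```r
library(Matrix)

set.seed(1)
# Hypothetical 1000 x 100 matrix with 1% non-zero entries
dense <- matrix(0, nrow = 1000, ncol = 100)
dense[sample(length(dense), 1000)] <- 1

sparse <- as(dense, "sparseMatrix")

dense_bytes <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
c(dense = dense_bytes, sparse = sparse_bytes)
```

The dense matrix stores all 100,000 cells, while the sparse one stores only the 1,000 non-zero entries plus their index information.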

Sparse matrix support

The first benefit of these changes is that recipe(), prep(), bake(), fit(), and predict() now accept sparse matrices created using the Matrix package.

The permeability_qsar data set from the modeldata package contains quite a lot of zeroes in the predictors, so we will use it as a demonstration. We start by coercing it into a sparse matrix.

library(tidymodels)
library(Matrix)
permeability_sparse <- as(as.matrix(permeability_qsar), "sparseMatrix")

We can now use this sparse matrix in our code the same way as a dense matrix or data frame:

rec_spec <- recipe(permeability ~ ., data = permeability_sparse) |>
  step_zv(all_predictors())

mod_spec <- boost_tree("regression", "xgboost")

wf_spec <- workflow(rec_spec, mod_spec)

Model training has the usual syntax:

wf_fit <- fit(wf_spec, permeability_sparse)

as does prediction:

predict(wf_fit, permeability_sparse)
#> # A tibble: 165 × 1
#>     .pred
#>     <dbl>
#>  1 10.5  
#>  2  1.50 
#>  3 13.1  
#>  4  1.10 
#>  5  1.25 
#>  6  0.738
#>  7 29.3  
#>  8  2.44 
#>  9 36.3  
#> 10  4.31 
#> # ℹ 155 more rows

Note that only some models/engines work well with sparse data. These are all listed at https://www.tidymodels.org/find/sparse/. If the model doesn’t support sparse data, the data will be coerced into the default non-sparse representation and used as usual.

With a few exceptions, it should work like any other data set. However, this approach has two main limitations. The first is that we are limited to regression tasks since the outcome has to be numeric to be part of the sparse matrix.

The second limitation is that it only works with non-formula methods for parsnip and workflows. This means that you can use a recipe with add_recipe() or select variables directly with add_variables() when using a workflow. And you need to use fit_xy() instead of fit() when using a parsnip object by itself.
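
Because the formula interface is unavailable here, the outcome and the predictors are passed separately. The sketch below shows one way to make that split on a small, made-up sparse matrix (the column names and values are invented for illustration). The commented-out last line indicates where fit_xy() would come in, assuming a parsnip model specification called `mod_spec`.

```r
library(Matrix)

# Toy sparse matrix: one outcome column and three mostly-zero predictors
toy_sparse <- sparseMatrix(
  i = c(1, 2, 3, 4, 2, 4),
  j = c(1, 1, 1, 1, 2, 4),
  x = c(1.5, 0.2, 3.1, 0.7, 1, 1),
  dims = c(4, 4),
  dimnames = list(NULL, c("outcome", "pred_a", "pred_b", "pred_c"))
)

# Predictors stay sparse; the outcome is pulled out as a plain numeric vector
is_outcome <- colnames(toy_sparse) == "outcome"
x <- toy_sparse[, !is_outcome, drop = FALSE]
y <- toy_sparse[, "outcome"]

# With parsnip loaded, the model would then be fit without a formula:
# fit_xy(mod_spec, x = x, y = y)
```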

If this is of interest, we also have a post on https://www.tidymodels.org/ about using sparse matrices in tidymodels.

Sparse data from recipes steps

Where this sparsity support really starts to shine is when the recipe we use generates sparse data. The relevant steps come in two flavors: sparsity-creating steps and sparsity-preserving steps. Both are listed at https://www.tidymodels.org/find/sparse/.

Some steps like step_dummy(), step_indicate_na(), and textrecipes::step_tf() will almost always produce a lot of zeroes. We take advantage of that by generating their output sparsely when it is beneficial. If these steps end up producing sparse vectors, we want to make sure the sparsity is preserved. A couple of handfuls of steps, such as step_impute_mean() and step_scale(), have been updated to work efficiently with sparse vectors. Both types of steps are detailed in the above-linked list of compatible methods.

What this means in practice is that if you use a model/engine that supports sparse data and have a recipe that produces enough sparse data, then the steps will switch to producing sparse data, using a new sparse data format to store the data (when appropriate) as the recipe is being processed. Then, if the model can accept sparse objects, we convert the data from our new sparse format to a standard sparse matrix object. This increases performance when possible while preserving performance otherwise.

Below is a simple recipe using the ames data set. step_dummy() is applied to all the categorical predictors, leading to a significant amount of zeroes.

rec_spec <- recipe(Sale_Price ~ ., data = ames) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

mod_spec <- boost_tree("regression", "xgboost")

wf_spec <- workflow(rec_spec, mod_spec)

When we go to fit it now, it takes around 125ms and allocates 37.2MB. Before these changes, it would take around 335ms and allocate 67.5MB.

wf_fit <- fit(wf_spec, ames)

We see similar speedups when we predict: around 20ms and 25.2MB now, compared to around 60ms and 55.6MB before.

predict(wf_fit, ames)
#> # A tibble: 2,930 × 1
#>      .pred
#>      <dbl>
#>  1 208649.
#>  2 115339.
#>  3 148634.
#>  4 239770.
#>  5 190082.
#>  6 184604.
#>  7 208572.
#>  8 177403 
#>  9 261000.
#> 10 198604.
#> # ℹ 2,920 more rows

These improvements are tightly related to memory allocation, which depends on the sparsity of the data set produced by the recipe. This is why it is hard to say how much benefit you will see. We have seen improvements of orders of magnitude, both in terms of time and memory allocation. We have also been able to fit models where previously the data was too big to fit in memory.

Please see the post on tidymodels.org, which goes into more detail about when you are likely to benefit from this and how to change your recipes and workflows to take full advantage of this new feature.

There is also a post on https://www.tidymodels.org/ going into a bit more detail about how to use recipes to produce sparse data.

Q1 2025 tidymodels digest

Max Kuhn — Thu, 27 Feb 2025 00:00:00 +0000

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

We’ve sent a steady stream of tidymodels packages to CRAN recently. We usually release them in batches since many of our packages are tightly coupled with one another. Internally, this process is referred to as the “cascade” of CRAN submissions.

The post will update you on which packages have changed and the major improvements you should know about.

Here’s a list of the packages and their News sections:

Let’s look at a few specific updates.

Improvements in errors and warnings

A group effort was made to improve our error and warning messages across many packages. This started with an internal “upkeep week” (which ended up being 3-4 weeks) and concluded at the Tidy Dev Day in Seattle after posit::conf(2024).

The goal was to use new tools in the cli and rlang packages to make messages more informative than they used to be. For example, using:

tidy(pca_extract_trained, number = 3, type = "variances")

used to result in the error message:

Error in `match.arg()`:
! 'arg' should be one of "coef", "variance"

The new system references the function that you called and not the underlying base R function that actually errored. It also suggests a solution:

Error in `tidy()`:
! `type` must be one of "coef" or "variance", not "variances".
i Did you mean "variance"?
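
As a hedged illustration of the kind of tooling involved, rlang::arg_match() performs this sort of check. The wrapper function below, `my_tidy()`, is made up for the example and is not the actual tidy() method.

```r
library(rlang)

# Hypothetical stand-in for a function with a `type` argument
my_tidy <- function(type = c("coef", "variance")) {
  # arg_match() validates `type` against the default choices and, on failure,
  # reports the error as coming from my_tidy() rather than from a helper
  type <- arg_match(type)
  type
}

# A valid value is returned unchanged
my_tidy("variance")

# A misspelled value is caught; capture the message instead of stopping
msg <- tryCatch(my_tidy("variances"), error = function(e) conditionMessage(e))
cat(msg)
```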

The rlang package created a set of standalone files that contain high-quality type checkers and related functions. This also improves the information that users get from an error. For example, using an inappropriate formula value in fit(linear_reg(), "boop", mtcars), the old message was:

Error in `fit()`:
! The `formula` argument must be a formula, but it is a <character>.

and now you see:

Error in `fit()`:
! `formula` must be a formula, not the string "boop".

This was a lot of work and we still aren’t finished. Two events helped us get as far as we did.

First, Simon Couch made the chores package (its previous name was “pal”), which enabled us to use AI tools to solve small-scope problems, such as converting old rlang error code to use the new cli syntax. I can’t overstate how much of a speed-up this was for us.

Second, at developer day, many external folks pitched in to make pull requests from a list of issues:

Organizing Tidy Dev Day issues.

I love these sessions for many reasons, but mostly because we meet users and contributors to our packages in person and work with them on specific tasks.

There is a lot more to do here; we have a lot of secondary packages that would benefit from these improvements too.

Quantile regression in parsnip

One big update in parsnip was a new modeling mode of "quantile regression". Daniel McDonald and Ryan Tibshirani provided much of the impetus for this work, based on their disease modeling framework.

You can generate quantile predictions by first creating a model specification, which includes the quantiles that you want to predict:

library(tidymodels)
tidymodels_prefer()

ames <- 
  modeldata::ames |> 
  mutate(Sale_Price = log10(Sale_Price)) |> 
  select(Sale_Price, Latitude)

quant_spec <- 
  linear_reg() |> 
  set_engine("quantreg") |> 
  set_mode("quantile regression", quantile_levels = c(0.1, 0.5, 0.9))
quant_spec

## Linear Regression Model Specification (quantile regression)
## 
## Computational engine: quantreg

## Quantile levels: 0.1, 0.5, and 0.9.

We’ll add some spline terms via a recipe and fit the model:

spline_rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_spline_natural(Latitude, deg_free = 10)

quant_fit <- 
  workflow(spline_rec, quant_spec) |> 
  fit(data = ames)

quant_fit

## ══ Workflow [trained] ═════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_spline_natural()
## 
## ── Model ──────────────────────────────────────────────────────────────
## Call:
## quantreg::rq(formula = ..y ~ ., tau = quantile_levels, data = data)
## 
## Coefficients:
##               tau= 0.1    tau= 0.5    tau= 0.9
## (Intercept) 4.71981123  5.07728741  5.25221335
## Latitude_01 1.22409173  0.70928577  0.79000849
## Latitude_02 0.19561816  0.04937750  0.02832633
## Latitude_03 0.16616065  0.02045910  0.14730573
## Latitude_04 0.30583648  0.08489487  0.15595080
## Latitude_05 0.21663212  0.02016258 -0.01110625
## Latitude_06 0.33541228  0.12005254  0.03006777
## Latitude_07 0.47732205  0.09146728  0.17394021
## Latitude_08 0.24028784  0.30450058  0.26144584
## Latitude_09 0.05840312 -0.14733781 -0.11911843
## Latitude_10 1.52800673  0.95994216  1.21750501
## 
## Degrees of freedom: 2930 total; 2919 residual

For prediction, tidymodels always returns a data frame with as many rows as the input data set (here: ames). The result for quantile predictions is a special vctrs class:

quant_pred <- predict(quant_fit, ames) 
quant_pred |> slice(1:4)

## # A tibble: 4 × 1
##   .pred_quantile
##        <qtls(3)>
## 1         [5.33]
## 2         [5.33]
## 3         [5.33]
## 4         [5.31]

class(quant_pred$.pred_quantile)

## [1] "quantile_pred" "vctrs_vctr"    "list"

where the output [5.31] shows the middle quantile.

We can expand the set of quantile predictions so that there are three rows for each source row in ames. There’s also an integer column called .row so that we can merge the data with the source data:

quant_pred$.pred_quantile[1]

## 
## [1] [5.33]
## # Quantile levels: 0.1 0.5 0.9

as_tibble(quant_pred$.pred_quantile[1])

## # A tibble: 3 × 3
##   .pred_quantile .quantile_levels  .row
##            <dbl>            <dbl> <int>
## 1           5.08              0.1     1
## 2           5.33              0.5     1
## 3           5.52              0.9     1

Here are the predicted quantile values:

quant_pred$.pred_quantile |> 
  as_tibble() |> 
  full_join(ames |> add_rowindex(), by = ".row") |> 
  arrange(Latitude) |> 
  ggplot(aes(x = Latitude)) + 
  geom_point(data = ames, aes(y = Sale_Price), alpha = 1 / 5) +
  geom_line(aes(y = .pred_quantile, col = format(.quantile_levels)), 
            show.legend = FALSE, linewidth = 1.5) 

10%, 50%, and 90% quantile predictions.

For now, the new mode does not have many engines. We need to implement some performance statistics in the yardstick package before integrating these models into the whole tidymodels ecosystem.

In other news, we’ve added some additional neural network models based on some improvements in the brulee package. Namely, two-layer networks can be tuned for feed-forward networks on tabular data (using torch).

One other improvement has been simmering for a long time: the ability to exploit sparse data structures better. We’ve improved our fit() interfaces for the few model engines that can use sparsely encoded data. There is much more to come on this in a few months, especially around recipes, so stay tuned.

Finally, we’ve created a set of checklists that can be used when creating new models or engines. These are very helpful, even for us, since there are a lot of minutiae to remember.

Parallelism in tune

This was a small maintenance release mostly related to parallel processing. Up to now, tune facilitated parallelism using the foreach package. That package is mature but not actively developed, so we have been slowly moving toward using the future package(s).

The first step in this journey was to keep using foreach internally (while leaning toward future) and to encourage users to stop invoking the foreach package directly and, instead, load and use the future package.

We’re now moving folks into the second stage. tune will now raise a warning when:

A parallel backend has been registered with foreach, and
No plan() has been specified with future.

This will allow users to transition their existing code to only future and allow us to update existing documentation and training materials.
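
In user code, the transition might look roughly like the sketch below. The number of workers and the doParallel call mentioned in the comment are assumptions for illustration; your existing backend may differ.

```r
library(future)

# Previously: a foreach backend, e.g. doParallel::registerDoParallel(cores = 2)
# Now: declare a future plan instead
plan(multisession, workers = 2)

# ... tuning or resampling code would run here ...

# Return to sequential processing when finished
plan(sequential)
```

Setting the plan back to sequential at the end is optional, but it avoids leaving background R sessions running.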

We anticipate that the third stage, removing foreach entirely, will occur sometime before posit::conf(2025) in September.

Things to look forward to

We are working hard on a few major initiatives that we plan on showing off at posit::conf(2025).

First is integrated support for sparse data. The emphasis is on “data” because users can use a data frame of sparse vectors or the usual sparse matrix format. This is a big deal because it does not force you to convert non-numeric data into a numeric matrix format. Again, we’ll discuss this more in the future, but you should be able to use sparse data frames in parsnip, recipes, tune, etc.

The second initiative is the longstanding goal of adding postprocessing to tidymodels. Just as you can add a preprocessor to a model workflow, you will be able to add a set of postprocessing adjustments to the predictions your model generates. See our previous post for a sneak peek.

Finally, this year’s summer internship focuses on supervised feature selection methods. We’ll also have releases (and probably another package) for these tools.

These should come to fruition (and CRAN) before or around August 2025.

Acknowledgements

We want to sincerely thank everyone who contributed to these packages since their previous versions:

@AlbertoImg , @asb2111 , @balraadjsings , @bcjaeger , @beansrowning , @BrennanAntone , @cheryldietrich , @chillerb , @conarr5 , @corybrunson , @dajmcdon , @davidrsch , @Edgar-Zamora , @EmilHvitfeldt , @gaborcsardi , @gimholte , @grantmcdermott , @grouptheory , @hfrick , @ilaria-kode , @JamesHWade , @jesusherranz , @jkylearmstrong , @joranE , @joscani , @Joscelinrocha , @josho88 , @joshuagi , @JosiahParry , @jrosell , @jrwinget , @KarlKoe , @kscott-1 , @lilykoff , @lionel- , @LouisMPenrod , @luisDVA , @marcelglueck , @marcozanotti , @martaalcalde , @mattwarkentin , @mihem , @mitchellmanware , @naokiohno , @nhward , @npelikan , @obgeneralao , @owenjonesuob , @pbhogale , @Peter4801 , @pgg1309 , @reisner , @rfsaldanha , @rkb965 , @RobLBaker , @RodDalBen , @SantiagoD999 , @shum461 , @simonpcouch , @szimmer , @talegari , @therealjpetereit , @topepo , @walkerjameschris , and @ZWael

orbital 0.3.0

Emil Hvitfeldt — Mon, 13 Jan 2025 00:00:00 +0000

We’re thrilled to announce the release of orbital 0.3.0. orbital lets you predict in databases using tidymodels workflows.

You can install it from CRAN with:

install.packages("orbital")

This blog post will cover the highlights, which are classification support and the new augment method.

You can see a full list of changes in the release notes.

Classification support

The biggest improvement in this version is that orbital() now works for supported classification models. See the vignette for a list of all supported models.

Let’s start by fitting a classification model on the penguins data set, using {xgboost} as the engine.

rec_spec <- recipe(species ~ ., data = penguins) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_zv(all_predictors())

lr_spec <- boost_tree() |>
  set_mode("classification") |>
  set_engine("xgboost")

wf_spec <- workflow(rec_spec, lr_spec)
wf_fit <- fit(wf_spec, data = penguins)

With this fitted workflow object, we can call orbital() on it to create an orbital object.

orbital_obj <- orbital(wf_fit)
orbital_obj
#> 
#> ── orbital Object ──────────────────────────────────────────────────────────────
#> • island = dplyr::if_else(is.na(island), "unknown", island)
#> • sex = dplyr::if_else(is.na(sex), "unknown", sex)
#> • island_Dream = as.numeric(island == "Dream")
#> • island_Torgersen = as.numeric(island == "Torgersen")
#> • sex_male = as.numeric(sex == "male")
#> • sex_unknown = as.numeric(sex == "unknown")
#> • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, bill_l ...
#> • bill_depth_mm = dplyr::if_else(is.na(bill_depth_mm), 17.15117, bill_dep ...
#> • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201, flipp ...
#> • body_mass_g = dplyr::if_else(is.na(body_mass_g), 4202, body_mass_g)
#> • island_Dream = dplyr::if_else(is.na(island_Dream), 0.3604651, island_Dr ...
#> • island_Torgersen = dplyr::if_else(is.na(island_Torgersen), 0.1511628, i ...
#> • sex_male = dplyr::if_else(is.na(sex_male), 0.4883721, sex_male)
#> • sex_unknown = dplyr::if_else(is.na(sex_unknown), 0.03197674, sex_unknow ...
#> • Adelie = 0 + dplyr::case_when((bill_depth_mm < 15.1 | is.na(bill_depth_ ...
#> • Chinstrap = 0 + dplyr::case_when((island_Dream < 0.5 | is.na(island_Dre ...
#> • Gentoo = 0 + dplyr::case_when((bill_depth_mm < 15.95 | is.na(bill_depth ...
#> • .pred_class = dplyr::case_when(Adelie > Chinstrap & Adelie > Gentoo ~ " ...
#> ────────────────────────────────────────────────────────────────────────────────
#> 18 equations in total.

This object contains all the information that is needed to produce predictions, which we can generate with predict().

predict(orbital_obj, penguins)
#> # A tibble: 344 × 1
#>    .pred_class
#>    <chr>
#>  1 Adelie     
#>  2 Adelie     
#>  3 Adelie     
#>  4 Adelie     
#>  5 Adelie     
#>  6 Adelie     
#>  7 Adelie     
#>  8 Adelie     
#>  9 Adelie     
#> 10 Adelie     
#> # ℹ 334 more rows

The main thing to note here is that the orbital package produces character vectors instead of factors. This is done as a unifying approach since many databases don’t have factor types.
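
If downstream code expects a factor, the character predictions can be converted back after collecting them. This is a small base R sketch; the vectors are typed by hand here rather than taken from a real prediction.

```r
# Character predictions, as orbital would return them
pred_class <- c("Adelie", "Gentoo", "Adelie", "Chinstrap")

# Levels of the outcome used for training (typed by hand for this example)
species_levels <- c("Adelie", "Chinstrap", "Gentoo")

# Convert to a factor with a fixed, known set of levels
pred_factor <- factor(pred_class, levels = species_levels)
levels(pred_factor)
```

Fixing the levels explicitly keeps classes that happen not to be predicted from silently disappearing.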

Speaking of databases, you can predict() on an orbital object using tables from databases. Below we create an ephemeral in-memory RSQLite database.

library(DBI)
library(RSQLite)

con_sqlite <- dbConnect(SQLite(), dbname = ":memory:")
penguins_sqlite <- copy_to(con_sqlite, penguins, name = "penguins_table")

And we can predict with it like normal. All the calculations are sent to the database for execution.

predict(orbital_obj, penguins_sqlite)
#> # Source:   SQL [?? x 1]
#> # Database: sqlite 3.47.1 []
#>    .pred_class
#>    <chr>
#>  1 Adelie     
#>  2 Adelie     
#>  3 Adelie     
#>  4 Adelie     
#>  5 Adelie     
#>  6 Adelie     
#>  7 Adelie     
#>  8 Adelie     
#>  9 Adelie     
#> 10 Adelie     
#> # ℹ more rows

This works the same with many types of databases.

Classification is different from regression in part because it comes with multiple prediction types. The above example showed the default, which is hard classification. You can set the type of prediction you want with the type argument to orbital(). For classification models, possible options are "class" and "prob".

orbital_obj_prob <- orbital(wf_fit, type = c("class", "prob"))
orbital_obj_prob
#> 
#> ── orbital Object ──────────────────────────────────────────────────────────────
#> • island = dplyr::if_else(is.na(island), "unknown", island)
#> • sex = dplyr::if_else(is.na(sex), "unknown", sex)
#> • island_Dream = as.numeric(island == "Dream")
#> • island_Torgersen = as.numeric(island == "Torgersen")
#> • sex_male = as.numeric(sex == "male")
#> • sex_unknown = as.numeric(sex == "unknown")
#> • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, bill_l ...
#> • bill_depth_mm = dplyr::if_else(is.na(bill_depth_mm), 17.15117, bill_dep ...
#> • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201, flipp ...
#> • body_mass_g = dplyr::if_else(is.na(body_mass_g), 4202, body_mass_g)
#> • island_Dream = dplyr::if_else(is.na(island_Dream), 0.3604651, island_Dr ...
#> • island_Torgersen = dplyr::if_else(is.na(island_Torgersen), 0.1511628, i ...
#> • sex_male = dplyr::if_else(is.na(sex_male), 0.4883721, sex_male)
#> • sex_unknown = dplyr::if_else(is.na(sex_unknown), 0.03197674, sex_unknow ...
#> • Adelie = 0 + dplyr::case_when((bill_depth_mm < 15.1 | is.na(bill_depth_ ...
#> • Chinstrap = 0 + dplyr::case_when((island_Dream < 0.5 | is.na(island_Dre ...
#> • Gentoo = 0 + dplyr::case_when((bill_depth_mm < 15.95 | is.na(bill_depth ...
#> • .pred_class = dplyr::case_when(Adelie > Chinstrap & Adelie > Gentoo ~ " ...
#> • norm = exp(Adelie) + exp(Chinstrap) + exp(Gentoo)
#> • .pred_Adelie = exp(Adelie) / norm
#> • .pred_Chinstrap = exp(Chinstrap) / norm
#> • .pred_Gentoo = exp(Gentoo) / norm
#> ────────────────────────────────────────────────────────────────────────────────
#> 22 equations in total.

Notice how we can select both "class" and "prob". The predictions now include both hard and soft class predictions.

predict(orbital_obj_prob, penguins)
#> # A tibble: 344 × 4
#>    .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo
#>    <chr>              <dbl>           <dbl>        <dbl>
#>  1 Adelie             0.989         0.00554      0.00560
#>  2 Adelie             0.989         0.00554      0.00560
#>  3 Adelie             0.989         0.00554      0.00560
#>  4 Adelie             0.709         0.0245       0.267  
#>  5 Adelie             0.989         0.00554      0.00560
#>  6 Adelie             0.989         0.00554      0.00560
#>  7 Adelie             0.989         0.00554      0.00560
#>  8 Adelie             0.989         0.00554      0.00560
#>  9 Adelie             0.979         0.00549      0.0158 
#> 10 Adelie             0.980         0.00559      0.0148 
#> # ℹ 334 more rows

That works equally well in databases.

predict(orbital_obj_prob, penguins_sqlite)
#> # Source:   SQL [?? x 4]
#> # Database: sqlite 3.47.1 []
#>    .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo
#>    <chr>              <dbl>           <dbl>        <dbl>
#>  1 Adelie             0.989         0.00554      0.00560
#>  2 Adelie             0.989         0.00554      0.00560
#>  3 Adelie             0.989         0.00554      0.00560
#>  4 Adelie             0.709         0.0245       0.267  
#>  5 Adelie             0.989         0.00554      0.00560
#>  6 Adelie             0.989         0.00554      0.00560
#>  7 Adelie             0.989         0.00554      0.00560
#>  8 Adelie             0.989         0.00554      0.00560
#>  9 Adelie             0.979         0.00549      0.0158 
#> 10 Adelie             0.980         0.00559      0.0148 
#> # ℹ more rows
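
The last few equations printed in the probability-enabled orbital object (`norm` and the `.pred_*` lines) amount to a softmax over the per-class scores. Here is a small base R sketch of that arithmetic; the scores are made up and merely stand in for the values the Adelie, Chinstrap, and Gentoo equations would produce for one row.

```r
# Made-up per-class scores for a single penguin
scores <- c(Adelie = 2.1, Chinstrap = -1.3, Gentoo = -0.8)

# Same arithmetic as the `norm` and `.pred_*` equations
norm <- sum(exp(scores))
probs <- exp(scores) / norm

# Hard class prediction: the class with the largest score
pred_class <- names(scores)[which.max(scores)]

probs
pred_class
```

Since the expressions use only exp(), addition, division, and comparisons, they translate readily to SQL.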

New augment method

The users of tidymodels have found the augment() function to be a handy tool. This function performs predictions and returns them alongside the original data set.

This release adds augment() support for orbital objects.

augment(orbital_obj, penguins)
#> # A tibble: 344 × 8
#>    .pred_class species island    bill_length_mm bill_depth_mm flipper_length_mm
#>    <chr>       <fct>   <fct>              <dbl>         <dbl>             <int>
#>  1 Adelie      Adelie  Torgersen           39.1          18.7               181
#>  2 Adelie      Adelie  Torgersen           39.5          17.4               186
#>  3 Adelie      Adelie  Torgersen           40.3          18                 195
#>  4 Adelie      Adelie  Torgersen           NA            NA                  NA
#>  5 Adelie      Adelie  Torgersen           36.7          19.3               193
#>  6 Adelie      Adelie  Torgersen           39.3          20.6               190
#>  7 Adelie      Adelie  Torgersen           38.9          17.8               181
#>  8 Adelie      Adelie  Torgersen           39.2          19.6               195
#>  9 Adelie      Adelie  Torgersen           34.1          18.1               193
#> 10 Adelie      Adelie  Torgersen           42            20.2               190
#> # ℹ 334 more rows
#> # ℹ 2 more variables: body_mass_g <int>, sex <fct>

The function works for most databases, but for technical reasons it doesn’t work with all of them. It has been confirmed not to work with Spark databases or Arrow tables.

augment(orbital_obj, penguins_sqlite)
#> # Source:   SQL [?? x 8]
#> # Database: sqlite 3.47.1 []
#>    .pred_class species island    bill_length_mm bill_depth_mm flipper_length_mm
#>    <chr>       <chr>   <chr>              <dbl>         <dbl>             <int>
#>  1 Adelie      Adelie  Torgersen           39.1          18.7               181
#>  2 Adelie      Adelie  Torgersen           39.5          17.4               186
#>  3 Adelie      Adelie  Torgersen           40.3          18                 195
#>  4 Adelie      Adelie  Torgersen           NA            NA                  NA
#>  5 Adelie      Adelie  Torgersen           36.7          19.3               193
#>  6 Adelie      Adelie  Torgersen           39.3          20.6               190
#>  7 Adelie      Adelie  Torgersen           38.9          17.8               181
#>  8 Adelie      Adelie  Torgersen           39.2          19.6               195
#>  9 Adelie      Adelie  Torgersen           34.1          18.1               193
#> 10 Adelie      Adelie  Torgersen           42            20.2               190
#> # ℹ more rows
#> # ℹ 2 more variables: body_mass_g <int>, sex <chr>

Acknowledgements

A big thank you to all the people who have contributed to orbital since the previous release:

@EmilHvitfeldt , @joscani , @jrosell , @npelikan , and @szimmer .

Introducing mall for R...and Python

Edgar Ruiz — Wed, 30 Oct 2024 00:00:00 +0000

The beginning

A few months ago, while working on the Databricks with R workshop, I came across some of their custom SQL functions. These particular functions are prefixed with “ai_”, and they run NLP with a simple SQL call:

> SELECT ai_analyze_sentiment('I am happy');
  positive

> SELECT ai_analyze_sentiment('I am sad');
  negative

This was a revelation to me. It showcased a new way to use LLMs in our daily work as analysts. To date, I had primarily employed LLMs for code completion and development tasks. However, this new approach focuses on using LLMs directly against our data instead.

My first reaction was to try and access the custom functions via R. With dbplyr we can access SQL functions in R, and it was great to see them work:

orders |>
  mutate(
    sentiment = ai_analyze_sentiment(o_comment)
  )
#> # Source:   SQL [6 x 2]
#>   o_comment                   sentiment
#>   <chr>                       <chr>
#> 1 ", pending theodolites …    neutral  
#> 2 "uriously special foxes …   neutral  
#> 3 "sleep. courts after the …  neutral  
#> 4 "ess foxes may sleep …      neutral  
#> 5 "ts wake blithely unusual … mixed    
#> 6 "hins sleep. fluffily …     neutral

One downside of this integration is that even though accessible through R, we require a live connection to Databricks in order to utilize an LLM in this manner, thereby limiting the number of people who can benefit from it.

According to their documentation, Databricks is leveraging the Llama 3.1 70B model. While this is a highly effective Large Language Model, its enormous size poses a significant challenge for most users’ machines, making it impractical to run on standard hardware.

Reaching viability

LLM development has been accelerating at a rapid pace. Initially, only online Large Language Models (LLMs) were viable for daily use. This sparked concerns among companies hesitant to share their data externally. Moreover, the cost of using LLMs online can be substantial: per-token charges can add up quickly.

The ideal solution would be to integrate an LLM into our own systems, requiring three essential components:

A model that can fit comfortably in memory
A model that achieves sufficient accuracy for NLP tasks
An intuitive interface between the model and the user’s laptop

Until this past year, having all three of these elements was nearly impossible. Models capable of fitting in memory were either inaccurate or excessively slow. However, recent advancements, such as Llama from Meta and cross-platform interaction engines like Ollama, have made it feasible to deploy these models, offering a promising solution for companies looking to integrate LLMs into their workflows.

The project

This project started as an exploration, driven by my interest in leveraging a “general-purpose” LLM to produce results comparable to those from Databricks AI functions. The primary challenge was determining how much setup and preparation would be required for such a model to deliver reliable and consistent results.

Without access to a design document or open-source code, I relied solely on the LLM’s output as a testing ground. This presented several obstacles, including the numerous options available for fine-tuning the model. Even within prompt engineering, the possibilities are vast. To ensure the model was not too specialized or focused on a specific subject or outcome, I needed to strike a delicate balance between accuracy and generality.

Fortunately, after conducting extensive testing, I discovered that a simple “one-shot” prompt yielded the best results. By “best,” I mean that the answers were both accurate for a given row and consistent across multiple rows. Consistency was crucial, as it meant providing answers that were one of the specified options (positive, negative, or neutral), without any additional explanations.

The following is an example of a prompt that worked reliably against Llama 3.2:

>>> You are a helpful sentiment engine. Return only one of the 
... following answers: positive, negative, neutral. No capitalization. 
... No explanations. The answer is based on the following text: 
... I am happy
positive
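
To make the structure of such a prompt concrete, here is a small base R sketch that assembles it from the text and the allowed answers, and checks whether a reply is one of those answers. Both helper functions are hypothetical and written for this illustration; they are not part of mall.

```r
# Hypothetical helper: assembles a prompt like the one shown above
sentiment_prompt <- function(text,
                             options = c("positive", "negative", "neutral")) {
  paste0(
    "You are a helpful sentiment engine. Return only one of the ",
    "following answers: ", paste(options, collapse = ", "), ". ",
    "No capitalization. No explanations. ",
    "The answer is based on the following text: ", text
  )
}

# Hypothetical check that a model's reply is one of the allowed answers
is_valid_answer <- function(answer,
                            options = c("positive", "negative", "neutral")) {
  trimws(answer) %in% options
}

cat(sentiment_prompt("I am happy"))

# A bare label passes; a reply with extra explanation does not
is_valid_answer("positive")
is_valid_answer("Positive, because the text is upbeat")
```

A check of this kind is one way to express the consistency requirement described earlier: a reply counts only if it is exactly one of the specified options.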

As a side note, my attempts to submit multiple rows at once proved unsuccessful. In fact, I spent a significant amount of time exploring different approaches, such as submitting 10 or 2 rows simultaneously, formatting them in JSON or CSV formats. The results were often inconsistent, and it didn’t seem to accelerate the process enough to be worth the effort.

Once I became comfortable with the approach, the next step was wrapping the functionality within an R package.

The approach

One of my goals was to make the mall package as “ergonomic” as possible. In other words, I wanted to ensure that using the package in R and Python integrates seamlessly with how data analysts use their preferred language on a daily basis.

For R, this was relatively straightforward. I simply needed to verify that the functions worked well with pipes (%>% and |>) and could be easily incorporated into packages like those in the tidyverse:

reviews |> 
  llm_sentiment(review) |> 
  filter(.sentiment == "positive") |> 
  select(review) 
#>                                                               review
#> 1 This has been the best TV I've ever used. Great screen, and sound.

However, Python is not my native language, which meant that I had to adapt my thinking about data manipulation. Specifically, I learned that in Python, objects (like pandas DataFrames) “contain” transformation functions by design.

This insight led me to investigate whether the pandas API allows for extensions, and fortunately, it does! After exploring the possibilities, I decided to start with Polars, which allowed me to extend its API by creating a new namespace. This simple addition enables users to easily access the necessary functions:

>>> import polars as pl
>>> import mall
>>> df = pl.DataFrame(dict(x = ["I am happy", "I am sad"]))
>>> df.llm.sentiment("x")
shape: (2, 2)
┌────────────┬───────────┐
│ x          ┆ sentiment │
│ ---        ┆ ---       │
│ str        ┆ str       │
╞════════════╪═══════════╡
│ I am happy ┆ positive  │
│ I am sad   ┆ negative  │
└────────────┴───────────┘

By keeping all the new functions within the llm namespace, it becomes very easy for users to find and utilize the ones they need.

What’s next

I think it will be easier to know what is to come for mall once the community uses it and provides feedback. I anticipate that adding more LLM back ends will be the main request. The other likely enhancement concerns new models: as updated models become available, the prompts may need to be updated for a given model. I experienced this going from Llama 3.1 to Llama 3.2, when one of the prompts needed a tweak. The package is structured so that future tweaks like that will be additions to the package, not replacements of the prompts, so as to retain backwards compatibility.

This is the first time I have written an article about the history and structure of a project. This particular effort was so unique, because of the R + Python and LLM aspects of it, that I figured it was worth sharing.

If you wish to learn more about mall, feel free to visit its official site: https://mlverse.github.io/mall/

Postprocessing is coming to tidymodels

Simon Couch — Tue, 08 Oct 2024 00:00:00 +0000

We’re bristling with elation to share about a set of upcoming features for postprocessing with tidymodels. Postprocessors refine predictions outputted from machine learning models to improve predictive performance or better satisfy distributional limitations. The developmental versions of many tidymodels core packages include changes to support postprocessors, and we’re ready to share about our work and hear the community’s thoughts on our progress so far.

Postprocessing support with tidymodels hasn’t yet made it to CRAN, but you can install the needed versions of tidymodels packages with the following code.

pak::pak(
  paste0(
    "tidymodels/",
    c("tune", "workflows", "rsample", "tailor")
  )
)

Now, we load packages with those developmental versions installed.

library(tidymodels)
library(probably)
library(tailor)

Existing tidymodels users might have spotted something funky already; who is this tailor character?

Meet tailor👋

The tailor package introduces tailor objects, which compose iterative adjustments to model predictions. tailor is to postprocessing as recipes is to preprocessing; applying your mental model of recipes to tailor should get you a good bit of the way there.

Tool	Applied to...	Initialize with...	Composes...	Train with...	Predict with...
recipes	Training data	`recipe()`	`step_*()`s	`prep()`	`bake()`
tailor	Model predictions	`tailor()`	`adjust_*()`ments	`fit()`	`predict()`

First, users can initialize a tailor object with tailor() .

tailor()
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A postprocessor with 0 adjustments.

Tailors compose “adjustments,” analogous to steps from the recipes package.

tailor() %>%
  adjust_probability_threshold(threshold = .7)
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A binary postprocessor with 1 adjustment:
#> 
#> • Adjust probability threshold to 0.7.

As an example, we’ll apply this tailor to the two_class_example data made available after loading tidymodels.

head(two_class_example)
#>    truth      Class1       Class2 predicted
#> 1 Class2 0.003589243 0.9964107574    Class2
#> 2 Class1 0.678621054 0.3213789460    Class1
#> 3 Class2 0.110893522 0.8891064779    Class2
#> 4 Class1 0.735161703 0.2648382969    Class1
#> 5 Class2 0.016239960 0.9837600397    Class2
#> 6 Class1 0.999275071 0.0007249286    Class1

This data gives the true value of an outcome variable truth as well as predicted probabilities (Class1 and Class2). The hard class predictions, in predicted, are "Class1" if the probability assigned to "Class1" is above .5, and "Class2" otherwise.

The model predicts "Class1" more often than it does "Class2".

two_class_example %>% count(predicted)
#>   predicted   n
#> 1    Class1 277
#> 2    Class2 223

If we wanted the model to predict "Class2" more often, we could increase the probability threshold assigned to "Class1" above which the hard class prediction will be "Class1". In the tailor package, this adjustment is implemented in adjust_probability_threshold() , which can be situated in a tailor object.

tlr <-
  tailor() %>%
  adjust_probability_threshold(threshold = .7)

tlr
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A binary postprocessor with 1 adjustment:
#> 
#> • Adjust probability threshold to 0.7.

tailors must be fitted before they can predict on new data. For adjustments like adjust_probability_threshold() , there’s no training that actually happens at the fit() step besides recording the name and type of relevant variables. For other adjustments, like numeric calibration with adjust_numeric_calibration() , parameters are actually estimated at the fit() stage and separate data should be used to train the postprocessor and evaluate its performance. More on this in Tailors in context .

In this case, though, we can fit() on the whole dataset. The resulting object is still a tailor, but is now flagged as trained.

tlr_trained <- fit(
  tlr,
  two_class_example,
  outcome = truth,
  estimate = predicted,
  probabilities = c(Class1, Class2)
)

tlr_trained
#> 
#> ── tailor ──────────────────────────────────────────────────────────────────────
#> A binary postprocessor with 1 adjustment:
#> 
#> • Adjust probability threshold to 0.7. [trained]

When used with a model workflow via add_tailor() , the arguments to fit() a tailor will be set automatically. Generally, as in recipes, we recommend that users add tailors to model workflows for training and prediction rather than using them standalone for greater ease of use and to prevent data leakage, but tailors are totally functional by themselves, too.

Now, when passed new data, the trained tailor will determine the outputted class based on whether the probability assigned to the level "Class1" is above .7, resulting in more predictions of "Class2" than before.

predict(tlr_trained, two_class_example) %>% count(predicted)
#> # A tibble: 2 × 2
#>   predicted     n
#>   <fct>     <int>
#> 1 Class1      236
#> 2 Class2      264
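
To make the effect of the threshold concrete, here is a minimal base R sketch of the rule described above, applied to the six Class1 probabilities printed by head(two_class_example). The helper name hard_class() is hypothetical and is not part of tailor; it only mirrors the logic in the text.

```r
# Class1 probabilities copied from head(two_class_example) above
class1_prob <- c(0.003589243, 0.678621054, 0.110893522,
                 0.735161703, 0.016239960, 0.999275071)

# Hypothetical helper (not a tailor function): predict "Class1" only when
# its probability is above the threshold, and "Class2" otherwise.
hard_class <- function(prob, threshold = 0.5) {
  factor(ifelse(prob > threshold, "Class1", "Class2"),
         levels = c("Class1", "Class2"))
}

table(hard_class(class1_prob, threshold = 0.5))
table(hard_class(class1_prob, threshold = 0.7))
```

Raising the threshold from .5 to .7 flips the second row (probability 0.679) from "Class1" to "Class2"; the other five rows are unchanged.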

Changing the probability threshold is one of many possible adjustments available in tailor.

For probabilities: calibration
For transformation of probabilities to hard class predictions: thresholds , equivocal zones
For numeric outcomes: calibration , range

Support for tailors is now plumbed through workflows (via add_tailor() ) and tune, and rsample includes a set of infrastructural changes to prevent data leakage behind the scenes. That said, we haven’t yet implemented support for tuning parameters in tailors, but we plan to implement that before this functionality heads to CRAN.

Tailors in context

As an example, let’s model a study of food delivery times in minutes (i.e., the time from the initial order to receiving the food) for a single restaurant. The deliveries data is available upon loading the tidymodels meta-package.

data(deliveries)

# split into training and testing sets
set.seed(1)
delivery_split <- initial_split(deliveries)
delivery_train <- training(delivery_split)
delivery_test  <- testing(delivery_split)

# resample the training set using 10-fold cross-validation
set.seed(1)
delivery_folds <- vfold_cv(delivery_train)

# print out the training set
delivery_train
#> # A tibble: 7,509 × 31
#>    time_to_delivery  hour day   distance item_01 item_02 item_03 item_04 item_05
#>               <dbl> <dbl> <fct>    <dbl>   <int>   <int>   <int>   <int>   <int>
#>  1             21.2  16.1 Tue       3.02       0       0       0       0       0
#>  2             17.9  12.4 Sun       3.37       0       0       0       0       0
#>  3             22.4  14.2 Fri       2.59       0       0       0       0       0
#>  4             30.9  19.1 Sat       2.77       0       0       0       0       0
#>  5             30.1  16.5 Fri       2.05       0       0       0       1       0
#>  6             35.3  14.7 Sat       4.57       0       0       2       1       1
#>  7             13.1  11.5 Sat       2.09       0       0       0       0       0
#>  8             18.3  13.4 Tue       2.35       0       2       1       0       0
#>  9             25.2  20.5 Sat       2.43       0       0       0       1       0
#> 10             30.7  16.7 Fri       2.24       0       0       0       1       0
#> # ℹ 7,499 more rows
#> # ℹ 22 more variables: item_06 <int>, item_07 <int>, item_08 <int>,
#> #   item_09 <int>, item_10 <int>, item_11 <int>, item_12 <int>, item_13 <int>,
#> #   item_14 <int>, item_15 <int>, item_16 <int>, item_17 <int>, item_18 <int>,
#> #   item_19 <int>, item_20 <int>, item_21 <int>, item_22 <int>, item_23 <int>,
#> #   item_24 <int>, item_25 <int>, item_26 <int>, item_27 <int>

Let’s deliberately define a regression model that has poor predicted values: a boosted tree with only three ensemble members.

delivery_wflow <-
  workflow() %>%
  add_formula(time_to_delivery ~ .) %>%
  add_model(boost_tree(mode = "regression", trees = 3))

Evaluating against resamples:

set.seed(1)
delivery_res <- 
  fit_resamples(
    delivery_wflow, 
    delivery_folds, 
    control = control_resamples(save_pred = TRUE)
  )

The $R^2$ looks quite strong!

collect_metrics(delivery_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 rmse    standard   9.52     10 0.0533  Preprocessor1_Model1
#> 2 rsq     standard   0.853    10 0.00357 Preprocessor1_Model1

Let’s take a closer look at the predictions, though. How well are they calibrated? We can use the cal_plot_regression() helper from the probably package to put together a quick diagnostic plot.

collect_predictions(delivery_res) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

Ooof.

In comes tailor! Numeric calibration can help address the correlated errors here. We can add a tailor to our existing workflow to “bump up” predictions towards their true value.

delivery_wflow_improved <-
  delivery_wflow %>%
  add_tailor(tailor() %>% adjust_numeric_calibration())

The resampling code looks the same from here.

set.seed(1)
delivery_res_improved <- 
  fit_resamples(
    delivery_wflow_improved, 
    delivery_folds, 
    control = control_resamples(save_pred = TRUE)
  )

Checking out the same plot reveals a much better fit!

collect_predictions(delivery_res_improved) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

There’s actually some tricky data leakage prevention happening under the hood here. When you add tailors to a workflow and fit them with tune, this is all taken care of for you. If you’re interested in using tailors outside of that context, check out this documentation section in add_tailor().
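
To give a rough sense of what numeric calibration and the leakage precaution involve, here is a base R sketch on simulated data. It is not tailor's implementation: the simulated "model" (predictions compressed toward the mean), the variable names, and the use of a plain lm() as the calibrator are all assumptions made for illustration. The key point is that the calibrator is estimated on rows that are separate from the rows used to evaluate it.

```r
set.seed(1)

# Simulated outcome and deliberately compressed predictions (an assumption
# standing in for an under-fit model such as a three-tree boosted ensemble)
n     <- 1000
truth <- rnorm(n, mean = 30, sd = 8)
pred  <- 30 + 0.4 * (truth - 30) + rnorm(n, sd = 1)

# Separate rows for estimating the calibrator and for evaluating it
calib_rows <- seq_len(n) <= 500
calib   <- data.frame(truth = truth[calib_rows],  pred = pred[calib_rows])
holdout <- data.frame(truth = truth[!calib_rows], pred = pred[!calib_rows])

# A simple linear calibrator: regress the observed values on the predictions
calibrator <- lm(truth ~ pred, data = calib)
holdout$pred_calibrated <- predict(calibrator, newdata = holdout)

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rsq  <- function(truth, estimate) cor(truth, estimate)^2

c(rmse_raw        = rmse(holdout$truth, holdout$pred),
  rmse_calibrated = rmse(holdout$truth, holdout$pred_calibrated))
c(rsq_raw        = rsq(holdout$truth, holdout$pred),
  rsq_calibrated = rsq(holdout$truth, holdout$pred_calibrated))
```

Because this calibrator is a linear rescaling, the squared correlation is unchanged while the RMSE drops, which mirrors why a strong $R^2$ alone did not reveal the poor calibration above.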

What’s to come

We’re excited about how this work is shaping up and would love to hear yall’s thoughts on what we’ve brought together so far. Please do comment on our social media posts about this blog entry or leave issues on the tailor GitHub repository and let us know what you think!

Before these changes head out to CRAN, we’ll also be implementing tuning functionality for postprocessors. You’ll be able to tag arguments like adjust_probability_threshold(threshold) or adjust_probability_calibration(method) with tune() to optimize across several values. Besides that, post-processing with tidymodels should “just work” on the developmental versions of our packages—let us know if you come across anything wonky.

Acknowledgements

Postprocessing support has been a longstanding feature request across many of our repositories; we’re grateful for the community discussions there for shaping this work. Additionally, we thank Ryan Tibshirani and Daniel McDonald for fruitful discussions on how we might scope these features.

recipes 1.1.0

Emil Hvitfeldt — Mon, 08 Jul 2024 00:00:00 +0000

We’re thrilled to announce the release of recipes 1.1.0. recipes lets you create a pipeable sequence of feature engineering steps.

You can install it from CRAN with:

install.packages("recipes")

This blog post will go over some of the bigger changes in this release: improvements in column type checking, support for more data types in recipes, support for long formulas, and better errors for misspelled argument names.

You can see a full list of changes in the release notes .

Column type checking

A longtime issue in recipes came from the fact that recipes didn’t keep a prototype (ptype) of the data it was specified with. This would cause unexpected things to happen or uninformative error messages to appear if different data was used to prep() than was used to create the recipe() .

Every recipe you create starts with a call to recipe() . In the below example, we create a recipe where x2 starts by being a character vector, but the recipe is prepped where x2 is a numeric vector. This didn’t produce any warnings or errors, silently doing something unintended.

data_template <- tibble(
  outcome = rnorm(10), 
  x1 = rnorm(10), 
  x2 = sample(letters, 10, T)
)

rec <- recipe(outcome ~ ., data_template) %>%
  step_bin2factor(all_numeric_predictors())

data_training <- tibble(outcome = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

prep(rec, training = data_training)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x1 | Trained

Now, we get an error detailing how the data is different.

data_template <- tibble(outcome = rnorm(10), x1 = rnorm(10), x2 = sample(letters, 10, T))

rec <- recipe(outcome ~ ., data_template) %>%
  step_bin2factor(all_numeric_predictors())

data_training <- tibble(outcome = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

prep(rec, training = data_training)
#> Error in `prep()`:
#> ✖ The following variable has the wrong class:
#> • `x2` must have class <character>, not <numeric>.

Note that recipes created before version 1.1.0 don’t contain any ptype information, and will not undergo checking. Rerunning the code to create the recipe will add ptype information to the recipe.
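
Conceptually, the check amounts to remembering the column classes seen by recipe() and comparing them with the data passed to prep(). The base R sketch below illustrates that idea only; the helper names record_ptype() and check_ptype() are hypothetical and do not reflect how recipes implements the check internally.

```r
# Hypothetical helpers illustrating the idea of a prototype (ptype) check;
# these are not recipes functions.
record_ptype <- function(data) {
  vapply(data, function(col) class(col)[1], character(1))
}

check_ptype <- function(new_data, ptype) {
  new_classes <- record_ptype(new_data[names(ptype)])
  mismatched  <- names(ptype)[new_classes != ptype]
  if (length(mismatched) > 0) {
    stop(
      "Wrong class for: ",
      paste0(mismatched, " (expected ", ptype[mismatched],
             ", got ", new_classes[mismatched], ")", collapse = "; "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# x2 is a character column in the template but numeric in the training data
data_template <- data.frame(outcome = 1, x1 = 1, x2 = "a", stringsAsFactors = FALSE)
data_training <- data.frame(outcome = 1, x1 = 1, x2 = 1)

ptype  <- record_ptype(data_template)
result <- tryCatch(check_ptype(data_training, ptype), error = conditionMessage)
result
```

Storing only the first class of each column keeps the sketch short; a real prototype would also need to track details such as factor levels.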

Input checking in `recipe()`

We have relaxed the requirements of data frames, while making feedback more helpful when something goes wrong.

The data was previously passed through model.frame() inside the recipe, which restricted what could be handled. Previously prohibited input included data frames with list-columns or sf data frames. Both of these are now supported, as long as they are a data.frame object.

data_listcolumn <- tibble(
  y = 1:4,
  x = list(1:3, 4:6, 3:1, 1:10)
)

recipe(y ~ ., data = data_listcolumn)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
pathshp <- system.file("shape/nc.shp", package = "sf")
data_sf <- st_read(pathshp, quiet = TRUE)

recipe(AREA ~ ., data = data_sf)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 14

We are excited to see what people can do with these new options.

Another way to tell a recipe what variables should be included and what roles they should have is to use add_role() and update_role() . But if you were not careful, you could end up in situations where the same variable is labeled as both the outcome and predictor.

# previously, this didn't raise any error or warning
recipe(mtcars) |>
  update_role(everything(), new_role = "predictor") |>
  add_role("mpg", new_role = "outcome")
#> Error in `add_role()`:
#> ! `mpg` cannot get "outcome" role as it already has role "predictor".

This error can be avoided by using update_role() instead of add_role() .

recipe(mtcars) |>
  update_role(everything(), new_role = "predictor") |>
  update_role("mpg", new_role = "outcome")
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10

Long formulas in `recipe()`

Related to the changes we saw above, we now fully support very long formulas without hitting a C stack usage error.

data_wide <- matrix(1:10000, ncol = 10000)
data_wide <- as.data.frame(data_wide)
names(data_wide) <- c(paste0("x", 1:10000))

long_formula <- as.formula(paste("~ ", paste(names(data_wide), collapse = " + ")))

recipe(long_formula, data_wide)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> predictor: 10000

Better error for misspelled argument names

If you have used recipes long enough you are very likely to have run into the following error.

recipe(mpg ~ ., data = mtcars) |>
  step_pca(all_numeric_predictors(), number = 4) |>
  prep()
#> Error in `step_pca()`:
#> Caused by error in `prep()`:
#> ! Can't rename variables in this context.

The first time you saw it, it didn’t make much sense. Hopefully, you figured out that step_pca() doesn’t have a number argument, and instead uses num_comp to determine the number of principal components to return. This confusion will be a thing of the past as we now include this improved error message.

recipe(mpg ~ ., data = mtcars) |>
  step_pca(all_numeric_predictors(), number = 4) |>
  prep()
#> Error in `step_pca()`:
#> Caused by error in `prep()` at recipes/R/recipe.R:479:9:
#> ! The following argument was specified but do not exist: `number`.
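
For comparison, the call that this error points you toward uses num_comp, the argument step_pca() provides for the number of components. A short sketch, assuming recipes is installed (the object name rec_pca is only for illustration):

```r
library(recipes)

# Same recipe as above, with the intended argument name `num_comp`
rec_pca <- recipe(mpg ~ ., data = mtcars) |>
  step_pca(all_numeric_predictors(), num_comp = 4) |>
  prep()

# The numeric predictors are replaced by four principal component columns
baked_pca <- bake(rec_pca, new_data = NULL)
names(baked_pca)
```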

Quality of life increases in `step_dummy()`

I would imagine that one of the most used steps is step_dummy() . We have improved the errors and warnings it spits out when things go sideways.

If you apply step_dummy() to a variable that contains a lot of levels, it will produce a lot of columns, and the resulting object may not fit in memory. This can lead to the following error.

data_id <- tibble(
  id = as.character(1:100000), 
  x1 = rnorm(100000), 
  x2 = sample(letters, 100000, TRUE)
)

recipe(~ ., data = data_id) |>
  step_dummy(all_nominal_predictors()) |>
  prep()
#> Error: vector memory exhausted (limit reached?)

Instead, you now get a more helpful error message.

data_id <- tibble(
  id = as.character(1:100000), 
  x1 = rnorm(100000), 
  x2 = sample(letters, 100000, TRUE)
)

recipe(~ ., data = data_id) |>
  step_dummy(all_nominal_predictors()) |>
  prep()
#> Error in `step_dummy()`:
#> Caused by error:
#> ! `id` contains too many levels (100000), which would result in a
#>   data.frame too large to fit in memory.
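
A back-of-the-envelope calculation shows why this fails. The sketch below assumes each nominal column with k levels expands into k - 1 indicator columns and that every cell is stored as an 8-byte double in a dense data frame; both are simplifying assumptions for illustration, not a description of how recipes computes its limit.

```r
n_rows   <- 100000
n_levels <- c(id = 100000, x2 = 26)  # distinct values per nominal column

# k - 1 indicator columns per nominal column (simplifying assumption)
n_dummy_cols <- sum(n_levels - 1)

# Dense storage at 8 bytes per cell (assumption), reported in gigabytes
approx_gb <- n_rows * n_dummy_cols * 8 / 1e9

n_dummy_cols
approx_gb
```

Under these assumptions the id column alone accounts for 99,999 of the 100,024 indicator columns, and the dense result would need roughly 80 GB.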

Likewise, you will get helpful warnings if step_dummy() gets an NA or unseen values.

data_train <- tibble(x = c("a", "b"))
data_unseen <- tibble(x = "c")

rec_spec <- recipe(~., data = data_train) %>%
  step_dummy(x) %>%
  prep()

rec_spec %>%
  bake(data_unseen)
#> Warning: ! There are new levels in `x`: "c".
#> ℹ Consider using step_novel() (`?recipes::step_novel()`) before `step_dummy()`
#>   to handle unseen values.
#> # A tibble: 1 × 1
#>     x_b
#>   
#> 1    NA

data_na <- tibble(x = NA)

rec_spec %>%
  bake(data_na)
#> Warning: ! There are new levels in `x`: NA.
#> ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before
#>   `step_dummy()` to handle missing values.
#> # A tibble: 1 × 1
#>     x_b
#>   
#> 1    NA
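
Following the suggestion in the first warning, one way to handle unseen values is to place step_novel() before step_dummy(). A short sketch, assuming recipes is installed and reusing the toy data from above (the object names rec_novel and baked_unseen are only for illustration):

```r
library(recipes)

data_train  <- data.frame(x = c("a", "b"))
data_unseen <- data.frame(x = "c")

# step_novel() assigns values not seen during prep() to a dedicated level,
# so step_dummy() no longer meets an unknown level at bake() time
rec_novel <- recipe(~ ., data = data_train) |>
  step_novel(x) |>
  step_dummy(x) |>
  prep()

baked_unseen <- bake(rec_novel, new_data = data_unseen)
baked_unseen
```

The same pattern applies to missing values, with step_unknown() taking the place of step_novel().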

Acknowledgements

A big thank you to all the people who have contributed to recipes since the release of v1.0.10:

@brynhum , @DemetriPananos , @diegoperoni , @EmilHvitfeldt , @JiahuaQu , @joranE , @nhward , @olivroy , and @simonpcouch .

Chocolate Chocolate Chip Cookies

preheat oven 350°F

1/3c butter
1/2 + 1/3c sugar

mix until fluffy

1 tsp vanilla
1 egg

mix until combined

1/2c cocoa
1/2 tsp baking soda
1c flour

mix until combined

3/4c chocolate chips

bake for about 8 mins, depending on size! they will crack on top, but still be soft.

bonsai 0.3.0

Simon Couch — Tue, 25 Jun 2024 00:00:00 +0000

We’re brimming with glee to announce the release of bonsai 0.3.0. bonsai is a parsnip extension package for tree-based models, and includes support for random forest and gradient-boosted tree frameworks like partykit and LightGBM. This most recent release of the package introduces support for the "aorsf" engine, which implements accelerated oblique random forests (Jaeger et al. 2022, Jaeger et al. 2024).

You can install it from CRAN with:

install.packages("bonsai")

This blog post will demonstrate a modeling workflow where the benefits of using oblique random forests shine through.

You can see a full list of changes in the release notes .

library(tidymodels)
library(bonsai)
library(plsmod)
library(corrr)

The `meats` data

The modeldata package, loaded automatically with the tidymodels meta-package, includes several example datasets to demonstrate modeling problems. We’ll make use of a dataset called meats in this post. Each row is a measurement of a sample of finely chopped meat.

meats
#> # A tibble: 215 × 103
#>    x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010 x_011 x_012 x_013
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.63  2.63  2.63  2.63  2.64
#>  2  2.83  2.84  2.84  2.85  2.85  2.86  2.86  2.87  2.87  2.88  2.88  2.89  2.90
#>  3  2.58  2.58  2.59  2.59  2.59  2.59  2.59  2.60  2.60  2.60  2.60  2.61  2.61
#>  4  2.82  2.82  2.83  2.83  2.83  2.83  2.83  2.84  2.84  2.84  2.84  2.85  2.85
#>  5  2.79  2.79  2.79  2.79  2.80  2.80  2.80  2.80  2.81  2.81  2.81  2.82  2.82
#>  6  3.01  3.02  3.02  3.03  3.03  3.04  3.04  3.05  3.06  3.06  3.07  3.08  3.09
#>  7  2.99  2.99  3.00  3.01  3.01  3.02  3.02  3.03  3.04  3.04  3.05  3.06  3.07
#>  8  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.54  2.54  2.54  2.54  2.54
#>  9  3.27  3.28  3.29  3.29  3.30  3.31  3.31  3.32  3.33  3.33  3.34  3.35  3.36
#> 10  3.40  3.41  3.41  3.42  3.43  3.43  3.44  3.45  3.46  3.47  3.48  3.48  3.49
#> # ℹ 205 more rows
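#> # ℹ 90 more variables: x_014 <dbl>, x_015 <dbl>, x_016 <dbl>, x_017 <dbl>,
#> #   x_018 <dbl>, x_019 <dbl>, x_020 <dbl>, x_021 <dbl>, x_022 <dbl>,
#> #   x_023 <dbl>, x_024 <dbl>, x_025 <dbl>, x_026 <dbl>, x_027 <dbl>,
#> #   x_028 <dbl>, x_029 <dbl>, x_030 <dbl>, x_031 <dbl>, x_032 <dbl>,
#> #   x_033 <dbl>, x_034 <dbl>, x_035 <dbl>, x_036 <dbl>, x_037 <dbl>,
#> #   x_038 <dbl>, x_039 <dbl>, x_040 <dbl>, x_041 <dbl>, x_042 <dbl>, …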

From that dataset’s documentation:

These data are recorded on a Tecator Infratec Food and Feed Analyzer… For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.

We’ll try to predict the protein content, as a percentage, using the absorbance measurements.

Before we take a further look, let’s split up our data. I’ll first select off two other possible outcome variables and, after splitting into training and testing sets, resample the data using 5-fold cross-validation with 2 repeats.

meats <- meats %>% select(-water, -fat)

set.seed(1)
meats_split <- initial_split(meats)
meats_train <- training(meats_split)
meats_test <- testing(meats_split)
meats_folds <- vfold_cv(meats_train, v = 5, repeats = 2)

The tricky parts of this modeling problem are that:

There are few observations to work with (215 total).
The 100 absorbance measurements are highly correlated with one another.

Visualizing that correlation:

meats_train %>%
  correlate() %>%
  autoplot() +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank())
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Almost all of these pairwise correlations between predictors are near 1; the exceptions are the correlations between the last variable and every other variable. That last variable with weaker correlation values? It’s the outcome.

Baseline models

There are several existing model implementations in tidymodels that are resilient to highly correlated predictors. The first one I’d probably reach for is an elastic net: an interpolation of the LASSO and Ridge regularized linear regression models. Evaluating that modeling approach against resamples:

# define a regularized linear model
spec_lr <- 
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# try out different penalization approaches
res_lr <- tune_grid(spec_lr, protein ~ ., meats_folds)

show_best(res_lr, metric = "rmse")
#> # A tibble: 5 × 8
#>         penalty mixture .metric .estimator  mean     n std_err .config          
#>           <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 0.0000324       0.668 rmse    standard    1.24    10  0.0516 Preprocessor1_Mo…
#> 2 0.00000000524   0.440 rmse    standard    1.25    10  0.0548 Preprocessor1_Mo…
#> 3 0.000000461     0.839 rmse    standard    1.26    10  0.0538 Preprocessor1_Mo…
#> 4 0.00000550      0.965 rmse    standard    1.26    10  0.0540 Preprocessor1_Mo…
#> 5 0.0000000489    0.281 rmse    standard    1.26    10  0.0534 Preprocessor1_Mo…
show_best(res_lr, metric = "rsq")
#> # A tibble: 5 × 8
#>         penalty mixture .metric .estimator  mean     n std_err .config          
#>           <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 0.0000324       0.668 rsq     standard   0.849    10  0.0126 Preprocessor1_Mo…
#> 2 0.00000000524   0.440 rsq     standard   0.848    10  0.0128 Preprocessor1_Mo…
#> 3 0.000000461     0.839 rsq     standard   0.846    10  0.0114 Preprocessor1_Mo…
#> 4 0.00000550      0.965 rsq     standard   0.846    10  0.0111 Preprocessor1_Mo…
#> 5 0.0000000489    0.281 rsq     standard   0.846    10  0.0126 Preprocessor1_Mo…

That best RMSE value of 1.24 gives us a baseline to work with, and the best R-squared 0.85 seems like a good start.
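
For reference, the two tuned arguments map onto the elastic net objective as parameterized by glmnet for a Gaussian response, where penalty plays the role of $\lambda$ and mixture the role of $\alpha$:

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^\top\beta\right)^2
  + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  + \alpha\,\lVert\beta\rVert_1\right]
```

Setting mixture = 1 gives the LASSO, mixture = 0 gives ridge regression, and values in between interpolate the two penalties.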

Many tree-based model implementations in tidymodels generally handle correlated predictors well. Just to be apples-to-apples with "aorsf", let’s use a different random forest engine to get a better sense for baseline performance:

spec_rf <- 
  rand_forest(mtry = tune(), min_n = tune()) %>%
  # this is the default engine, but for consistency's sake:
  set_engine("ranger") %>%
  set_mode("regression")

res_rf <- tune_grid(spec_rf, protein ~ ., meats_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry

show_best(res_rf, metric = "rmse")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    96     4 rmse    standard    2.37    10  0.0905 Preprocessor1_Model08
#> 2    41     6 rmse    standard    2.39    10  0.0883 Preprocessor1_Model01
#> 3    88    10 rmse    standard    2.43    10  0.0816 Preprocessor1_Model06
#> 4    79    17 rmse    standard    2.51    10  0.0740 Preprocessor1_Model07
#> 5    27    18 rmse    standard    2.52    10  0.0778 Preprocessor1_Model04
show_best(res_rf, metric = "rsq")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    96     4 rsq     standard   0.424    10  0.0385 Preprocessor1_Model08
#> 2    41     6 rsq     standard   0.409    10  0.0394 Preprocessor1_Model01
#> 3    88    10 rsq     standard   0.387    10  0.0365 Preprocessor1_Model06
#> 4    79    17 rsq     standard   0.353    10  0.0404 Preprocessor1_Model07
#> 5    27    18 rsq     standard   0.346    10  0.0397 Preprocessor1_Model04

Not so hot. Just to show I’m not making a straw man here, I’ll evaluate a few more alternative modeling approaches behind the curtain and print out their best performance metrics:

Gradient boosted tree with LightGBM. Best RMSE: 2.34. Best R-squared: 0.43.
Partial least squares regression. Best RMSE: 1.39. Best R-squared: 0.81.
Support vector machine. Best RMSE: 2.28. Best R-squared: 0.46.

This is a tricky one.

Introducing accelerated oblique random forests

The 0.3.0 release of bonsai introduces support for accelerated oblique random forests via the "aorsf" engine for classification and regression in tidymodels. (Tidy survival modelers might note that we already support "aorsf" for censored regression via the censored parsnip extension package!)

Unlike trees in conventional random forests, which create splits using thresholds based on individual predictors (e.g. x_001 > 3), oblique random forests use linear combinations of predictors to create splits (e.g. x_001 + 2 * x_002 > 7.5) and have been shown to improve predictive performance relative to conventional random forests for a variety of applications (Menze et al. 2011). “Oblique” references the appearance of decision boundaries when a set of splits is plotted; I’ve grabbed a visual from the aorsf README that demonstrates:

In the above, we’d like to separate the purple dots from the orange squares. A tree in a traditional random forest, represented on the left, can only generate splits based on one of X1 or X2 at a time. A tree in an oblique random forest, represented on the right, can consider both X1 and X2 in creating decision boundaries, often resulting in stronger predictive performance.
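
A toy base R sketch of the same idea follows. It is not how aorsf finds splits: the grid of points, the class rule, and the two candidate splits are assumptions chosen so that the true boundary is diagonal.

```r
# Toy grid where the class depends on a diagonal boundary (x1 + x2 > 0)
grid <- expand.grid(x1 = seq(-1, 1, by = 0.5), x2 = seq(-1, 1, by = 0.5))
grid$class <- grid$x1 + grid$x2 > 0

# Axis-aligned split: a threshold on a single predictor
axis_split <- grid$x1 > 0

# Oblique split: a threshold on a linear combination of predictors
oblique_split <- 1 * grid$x1 + 1 * grid$x2 > 0

acc_axis    <- mean(axis_split == grid$class)     # 19 of 25 points agree
acc_oblique <- mean(oblique_split == grid$class)  # all 25 points agree

c(axis = acc_axis, oblique = acc_oblique)
```

A single axis-aligned threshold misclassifies points near the diagonal, and a conventional tree needs a staircase of further splits to approximate it; the oblique split captures the boundary in one step.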

Where does the “accelerated” come from? Generally, finding optimal oblique splits is computationally more intensive than finding single-predictor splits. The aorsf package uses something called “Newton Raphson scoring”—the same algorithm under the hood in the survival package—to identify splits based on linear combinations of predictor variables. This approach speeds up that process greatly, resulting in fit times that are analogous to implementations of traditional random forests in R (and hundreds of times faster than existing oblique random forest implementations, Jaeger et al. 2024).

The code to tune this model with the "aorsf" engine is the same as for "ranger", except we switch out the engine argument to set_engine() :

spec_aorsf <- 
  rand_forest(
    mtry = tune(),
    min_n = tune()
  ) %>%
  set_engine("aorsf") %>%
  set_mode("regression")

res_aorsf <- tune_grid(spec_aorsf, protein ~ ., meats_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry

show_best(res_aorsf, metric = "rmse")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    87    11 rmse    standard   0.786    10  0.0370 Preprocessor1_Model02
#> 2    98     8 rmse    standard   0.789    10  0.0363 Preprocessor1_Model10
#> 3    48     5 rmse    standard   0.793    10  0.0363 Preprocessor1_Model01
#> 4    16    17 rmse    standard   0.803    10  0.0325 Preprocessor1_Model09
#> 5    31    18 rmse    standard   0.813    10  0.0359 Preprocessor1_Model05
show_best(res_aorsf, metric = "rsq")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config              
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1    48     5 rsq     standard   0.946    10 0.00446 Preprocessor1_Model01
#> 2    98     8 rsq     standard   0.945    10 0.00482 Preprocessor1_Model10
#> 3    87    11 rsq     standard   0.945    10 0.00484 Preprocessor1_Model02
#> 4    16    17 rsq     standard   0.941    10 0.00370 Preprocessor1_Model09
#> 5    31    18 rsq     standard   0.940    10 0.00547 Preprocessor1_Model05

Holy smokes. The best RMSE from aorsf is 0.79, much more performant than the previous best RMSE from the elastic net with a value of 1.24, and the best R-squared is 0.95, much stronger than the previous best (also from the elastic net) of 0.85.

Especially if your modeling problems involve few samples of many, highly correlated predictors, give the "aorsf" modeling engine a whirl in your workflows and let us know what you think!

References

Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, Jaime L. Speiser, Matthew W. Segar, Ambarish Pandey, Nicholas M. Pajewski. 2024. “Accelerated and Interpretable Oblique Random Survival Forests.” Journal of Computational and Graphical Statistics 33.1: 192-207.

Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, and Nicholas M. Pajewski. 2022. “aorsf: An R package for Supervised Learning Using the Oblique Random Survival Forest.” The Journal of Open Source Software.

Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe, and Fred A. Hamprecht. 2011. “On Oblique Random Forests.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases: 453–469. Springer.

Acknowledgements

Thank you to @bcjaeger , the aorsf author, for doing most of the work to implement aorsf support in bonsai. Thank you to @hfrick , @joranE , @jrosell , @nipnipj , @p-schaefer , @seb-mueller , and @tcovert for their contributions on the bonsai repository since version 0.2.1.
