<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Machine Learning on Posit Open Source</title>
    <link>https://opensource.posit.co/topics/machine-learning/</link>
    <description>Recent content in Machine Learning on Posit Open Source</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 30 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://opensource.posit.co/topics/machine-learning/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>torch Ecosystem Updates</title>
      <link>https://opensource.posit.co/blog/2026-04-30_torch-ecosystem-updates-2026/</link>
      <pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2026-04-30_torch-ecosystem-updates-2026/</guid>
      <dc:creator>Daniel Falbel</dc:creator>
      <dc:creator>Tomasz Kalinowski</dc:creator><description><![CDATA[<p>We&rsquo;ve just published a new round of CRAN releases across the <a href="https://github.com/mlverse/torch" target="_blank" rel="noopener">torch</a> ecosystem. Here&rsquo;s a tour of what&rsquo;s new in each package.</p>
<h2 id="torch-v0170">torch v0.17.0
</h2>
<p>The most exciting new feature, still experimental, is support for the <a href="https://github.com/mlverse/cudatoolkit" target="_blank" rel="noopener">cudatoolkit</a> packages. With these, you no longer need a global CUDA toolkit installation to use torch on the GPU.</p>
<p>You can now do:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="s">&#34;cuda12.8&#34;</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl">  <span class="n">repos</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;https://mlverse.r-universe.dev&#34;</span><span class="p">,</span> <span class="s">&#34;https://cloud.r-project.org&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;torch&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>The <code>{cuda12.8}</code> package bundles all the CUDA runtime libraries, and torch finds and uses them by default. See more details in the <a href="https://torch.mlverse.org/docs/articles/installation#cudatoolkit" target="_blank" rel="noopener">installation docs</a>.</p>
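<p>As a quick sanity check after installing, a sketch like the following should report whether torch can see a GPU. It assumes torch and its backend libraries are already installed and falls back to the CPU otherwise; the variable names are ours, for illustration only.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(torch)

# TRUE when torch can see a usable CUDA runtime and GPU
has_gpu &lt;- cuda_is_available()

# Fall back to the CPU so the snippet also runs on machines without a GPU
device &lt;- if (has_gpu) &#34;cuda&#34; else &#34;cpu&#34;

# Create a small tensor on the chosen device and confirm where it lives
x &lt;- torch_randn(2, 3, device = device)
x$device</code></pre>
</div>
</div></div>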
<p>We also highlight the update to LibTorch v2.8.0, led by <a href="https://github.com/TroyHernandez" target="_blank" rel="noopener">Troy Hernandez</a> (<a href="https://github.com/mlverse/torch/pull/1419" target="_blank" rel="noopener">#1419</a>).</p>
<p>Additionally, this release includes many small bug fixes and small additions to the API. See the full release notes in the <a href="https://torch.mlverse.org/docs/news/#torch-0170" target="_blank" rel="noopener">changelog</a>.</p>
<h2 id="torchvision-v090">torchvision v0.9.0
</h2>
<p><a href="https://github.com/mlverse/torchvision" target="_blank" rel="noopener">torchvision</a> provides datasets, model architectures, and image transformations for computer vision. This is a big release with new models, datasets, and many improvements, largely driven by community contributors.</p>
<h3 id="new-models">New models:
</h3>
<ul>
<li><code>model_maskrcnn_resnet50_fpn()</code> and <code>model_maskrcnn_resnet50_fpn_v2()</code> for instance segmentation.</li>
<li><code>model_convnext_*_detection()</code> for object detection (tiny/small/base).</li>
<li><code>model_convnext_*_fcn()</code> and <code>model_convnext_*_upernet()</code> for semantic segmentation (tiny/small/base).</li>
</ul>
<h3 id="new-datasets-and-features">New datasets and features:
</h3>
<ul>
<li><code>vggface2_dataset()</code> for loading the VGGFace2 dataset.</li>
<li>New <code>coco_segmentation_dataset()</code>, split from <code>coco_detection_dataset()</code>, reducing memory usage by ~50%.</li>
<li>Collection dataset catalog with <code>search_collection()</code>, <code>get_collection_catalog()</code>, and <code>list_collection_datasets()</code> for discovering and exploring datasets.</li>
<li>New visualization utilities <code>draw_segmentation_masks()</code> and <code>vision_make_grid()</code>.</li>
</ul>
<p>See the full release notes in the <a href="https://github.com/mlverse/torchvision/releases/tag/v0.9.0" target="_blank" rel="noopener">changelog</a>.</p>
<p>A huge thank you to the community contributors who made this release possible: <a href="https://github.com/cregouby" target="_blank" rel="noopener">@cregouby</a>, <a href="https://github.com/ANAMASGARD" target="_blank" rel="noopener">@ANAMASGARD</a>, <a href="https://github.com/Chandraveersingh1717" target="_blank" rel="noopener">@Chandraveersingh1717</a>, <a href="https://github.com/DerrickUnleashed" target="_blank" rel="noopener">@DerrickUnleashed</a>, and <a href="https://github.com/srishtiii28" target="_blank" rel="noopener">@srishtiii28</a>.</p>
<h2 id="other-releases">Other releases
</h2>
<p>Most of the other packages don&rsquo;t have significant changes; their releases bring minor improvements to docs and CI infrastructure, along with CRAN-related updates.</p>
<ul>
<li><strong><a href="https://github.com/mlverse/luz/releases/tag/v0.5.2" target="_blank" rel="noopener">luz</a></strong> v0.5.2: Higher-level API for torch with a Keras-like interface for training neural networks.</li>
<li><strong><a href="https://github.com/mlverse/hfhub/releases/tag/v0.1.2" target="_blank" rel="noopener">hfhub</a></strong> v0.1.2: Download and cache files from Hugging Face Hub repositories, making it easy to use pretrained models and datasets from R.</li>
<li><strong><a href="https://github.com/mlverse/tok/releases/tag/v0.2.2" target="_blank" rel="noopener">tok</a></strong> v0.2.2: Fast tokenizers for R, powered by the Hugging Face Tokenizers library written in Rust. Supports BPE, WordPiece, and other tokenization algorithms.</li>
<li><strong><a href="https://github.com/mlverse/torchdatasets/releases/tag/v0.3.2" target="_blank" rel="noopener">torchdatasets</a></strong> v0.3.2: Extra ready-to-use datasets for torch, complementing the built-in datasets in torchvision.</li>
<li><strong><a href="https://github.com/mlverse/safetensors/releases/tag/v0.2.1" target="_blank" rel="noopener">safetensors</a></strong> v0.2.1: Read and write the Safetensors file format, a safe and fast format for storing and loading tensors.</li>
<li><strong><a href="https://github.com/mlverse/tfevents/releases/tag/v0.0.5" target="_blank" rel="noopener">tfevents</a></strong> v0.0.5: Write event files compatible with TensorBoard from R for experiment tracking and visualization.</li>
<li><strong><a href="https://github.com/mlverse/wav/releases/tag/v0.2.0" target="_blank" rel="noopener">wav</a></strong> v0.2.0: Read and write WAV files in R.</li>
</ul>
<h2 id="new-maintainer">New maintainer
</h2>
<p>We&rsquo;re excited to welcome <a href="https://github.com/t-kalinowski" target="_blank" rel="noopener">Tomasz Kalinowski</a> as the new maintainer of torch and the broader mlverse ecosystem.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2026-04-30_torch-ecosystem-updates-2026/thumbnail.svg" length="5539" type="image/svg&#43;xml" />
    </item>
    <item>
      <title>tidymodels Cheatsheets</title>
      <link>https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/</link>
      <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/</guid>
      <dc:creator>Edgar Ruiz</dc:creator><description><![CDATA[<p>After almost 8 years, tidymodels finally has its first cheatsheets, and not just one, but two! The <a href="https://opensource.posit.co/resources/cheatsheets/ml-preprocessing-data/">first one</a>, covering data preprocessing with <code>recipes</code>, was released a couple of months ago. Today, we are delighted to announce <a href="https://opensource.posit.co/resources/cheatsheets/ml-create-models/">a second cheatsheet</a>, this time focusing on modeling with <code>parsnip</code>.</p>
<p>Both cheatsheets have a dedicated HTML version on the Posit Open Source site, so you can browse and search them without opening a PDF. In this post we&rsquo;ll walk through what each one covers, starting with the newest.</p>
<h2 id="create-models-with-parsnip">Create Models with <strong>parsnip</strong>
</h2>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/tidymodels-cheatsheets-parsnip.png"
      alt="Both pages of the Create Models with parsnip cheatsheet side by side, showing sections for Basics, Legends, Classification Only, Regression Only, General Use, Discriminant, Ensemble, Support Vector Machine, Feature Rules, Survival, and Operations."  title="The \&#34;Create Models with parsnip\&#34; cheatsheet — click to enlarge" 
      loading="lazy"
    ><figcaption class="text-sm text-center text-gray-500">The &quot;Create Models with parsnip&quot; cheatsheet — click to enlarge</figcaption>
  </figure></div>
</p>
<p>The cheatsheet is organized into three main parts: an introduction to parsnip&rsquo;s basics, a catalog of all models available through the package, and a hands-on operations reference for fitting and inspecting models. The basics section introduces how parsnip provides a single, unified interface for defining and fitting models, regardless of the underlying package powering them.</p>
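<p>To illustrate what that unified interface looks like in practice, here is a minimal sketch, not taken from the cheatsheet itself. It assumes the parsnip and rpart packages are installed and uses the built-in <code>mtcars</code> data; the object names are ours.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(parsnip)

# Two different model types, each backed by a different engine package
lm_spec &lt;- linear_reg() |&gt; set_engine(&#34;lm&#34;)
tree_spec &lt;- decision_tree(mode = &#34;regression&#34;) |&gt; set_engine(&#34;rpart&#34;)

# The fit() and predict() calls are identical for both specifications
lm_fit &lt;- fit(lm_spec, mpg ~ wt + hp, data = mtcars)
tree_fit &lt;- fit(tree_spec, mpg ~ wt + hp, data = mtcars)

# Both return a tibble with a .pred column, one row per new observation
lm_pred &lt;- predict(lm_fit, new_data = head(mtcars, 3))
tree_pred &lt;- predict(tree_fit, new_data = head(mtcars, 3))</code></pre>
</div>
</div></div>
<p>Swapping the engine or the model type changes only the specification; the fitting and prediction code stays the same.</p>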
<h3 id="model-catalog">Model catalog
</h3>
<p>The largest section of the cheatsheet catalogs all models available through parsnip, grouped by use case:</p>
<ul>
<li><strong>Classification only:</strong> models for binary and multiclass prediction. It also includes probability-based classification using Bayes&rsquo; theorem and models for ordinal responses.</li>
<li><strong>Regression only:</strong> models for predicting continuous numeric outcomes, from standard linear regression to generalized linear models for count data.</li>
<li><strong>General use:</strong> a versatile mix of model types that work for both classification and regression, including decision trees, nearest neighbors, neural networks, and spline-based approaches.</li>
<li><strong>Discriminant analysis:</strong> models that estimate the distribution of predictors separately for each class and use Bayes&rsquo; theorem to assign probabilities, available in linear, quadratic, flexible, and regularized variants.</li>
<li><strong>Ensemble methods:</strong> models that combine many individual learners into a stronger prediction, including random forests, gradient boosting, bagged trees, and Bayesian additive regression trees.</li>
<li><strong>Support Vector Machines:</strong> models that find an optimal boundary between classes, or fit a robust regression, using linear, polynomial, or radial kernel functions.</li>
<li><strong>Feature rules:</strong> models that extract simple, human-readable rules from tree ensembles and use them as the basis for prediction.</li>
<li><strong>Survival models:</strong> models for time-to-event data, covering both proportional hazards and fully parametric approaches.</li>
</ul>
<div class="grid gap-12 items-start mt-12 md:grid-cols-[3fr_2fr] ">
  
  
    
    
    
      <div class="prose max-w-none ">One design choice in particular makes this section much easier to navigate: <strong>pills</strong>. Each model&rsquo;s compatible engines and supported modes are shown as small, visually distinct tags, so you can see at a glance which mode a given engine supports, without having to read through the description text. Each mode is encoded in the pill with a number: Classification (1), Regression (2), Censored Regression (3), and Quantile Regression (4). A legend mapping each number to its mode is available at the top of page one.</div>
    
  
    
    
    
      <div class="prose max-w-none "><div class="not-prose"><figure>
      <img class="h-auto max-w-full rounded-lg"
        src="https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/tidymodels-cheatsheets-pills.png"
        alt="A close-up of the decision_tree() entry in the parsnip cheatsheet, showing engine pills labeled partykit, rpart, and spark with mode support numbers. Annotations point out the engine name, the modes each engine supports, and the total number of engines available."  title="Engine pills show the name and supported modes of each engine at a glance" 
        loading="lazy"
      ><figcaption class="text-sm text-center text-gray-500">Engine pills show the name and supported modes of each engine at a glance</figcaption>
    </figure></div></div>
    
  
</div>

<p>And true to the R cheatsheet tradition, individual models or groups of related models are paired with <strong>small illustrations</strong>, thoughtfully designed for visual impact to aid recall. Each one attempts to accurately represent the function or functions it accompanies, which makes the illustrations a genuine navigation aid rather than decoration, especially when you have a vague memory of &ldquo;that tree-based ensemble that used Bayesian analysis&rdquo; and need to scan quickly.</p>
<h3 id="operations">Operations
</h3>
<p>The last section covers the practical workflow of fitting and using a model. Each function is paired with a <strong>quick runnable example</strong>, and the examples build on each other starting from the two lines of code right below the section title, making it easy to follow the full workflow from model specification to results.</p>
<div class="text-right">
<a href="https://opensource.posit.co/resources/cheatsheets/ml-create-models/"
  class="btn-shortcode inline-flex mb-5 mr-5 items-center px-4 py-3 text-sm leading-5 gap-2 rounded-lg bg-blue-500 text-white font-semibold align-middle hover:bg-blue-600 transition no-underline">Explore the parsnip cheatsheet<span class="icon-[boxicons--arrow-right] w-4 h-4 flex-none"></span></a>

</div>
<h2 id="preprocessing-data-with-recipes">Preprocessing Data with <strong>recipes</strong>
</h2>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/tidymodels-cheatsheets-recipes.png"
      alt="Both pages of the Preprocessing Data with recipes cheatsheet side by side, showing sections for Basics, Filters, In-place Transformations, Imputation, Encodings, Dummy Variables, Multivariate Transformations, Date and Time, Row operations, Other, and Role and type."  title="The \&#34;Preprocessing Data with recipes\&#34; cheatsheet — click to enlarge" 
      loading="lazy"
    ><figcaption class="text-sm text-center text-gray-500">The &quot;Preprocessing Data with recipes&quot; cheatsheet — click to enlarge</figcaption>
  </figure></div>
</p>
<p>After a quick Basics section covering the core workflow, the vast majority of the cheatsheet is dedicated to <code>step_*()</code> functions, the building blocks of any recipe, before finishing with role and type management.</p>
<h3 id="step-catalog">Step catalog
</h3>
<p>The steps are organized into groups based on what they do, each listed with its arguments and a short description:</p>
<ul>
<li><strong>Filters:</strong> steps for removing variables that are sparse, zero-variance, linearly dependent, highly correlated, or missing too many values</li>
<li><strong>In-place Transformations:</strong> basis functions (splines, polynomials), discretization, and normalization steps</li>
<li><strong>Imputation:</strong> steps for filling in missing values, ranging from simple statistical substitution to model-based approaches</li>
<li><strong>Encodings:</strong> type converters (e.g. factor to string, numeric to factor), value converters, and other factor-handling steps</li>
<li><strong>Dummy Variables:</strong> one-hot and binary encoding, text pattern matching, and conversion helpers</li>
<li><strong>Multivariate Transformations:</strong> signal extraction (PCA, ICA, PLS, and friends) and centroid-based distance measures</li>
<li><strong>Date &amp; Time:</strong> steps for converting date and datetime columns into usable numeric or factor features</li>
<li><strong>Row operations:</strong> sampling, shuffling, slicing, and removing rows with missing values</li>
<li><strong>Other:</strong> interaction terms, renaming, rolling window statistics, geographic distances, and ratios</li>
</ul>
<p>As with the parsnip cheatsheet, each group of steps is paired with <strong>small, thoughtfully designed illustrations</strong> to help you visually locate a step family when scanning.</p>
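<p>As a rough illustration of how steps from several of these groups chain together, here is a small sketch of our own, not an excerpt from the cheatsheet. It assumes the recipes package is installed and uses the built-in <code>mtcars</code> data.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(recipes)

rec &lt;- recipe(mpg ~ ., data = mtcars) |&gt;
  # Filter: drop any zero-variance predictors
  step_zv(all_predictors()) |&gt;
  # In-place transformation: center and scale the numeric predictors
  step_normalize(all_numeric_predictors()) |&gt;
  # Multivariate transformation: replace the predictors with two components
  step_pca(all_numeric_predictors(), num_comp = 2)

# prep() estimates each step from the training data; bake() applies them
baked &lt;- rec |&gt; prep() |&gt; bake(new_data = NULL)
names(baked)</code></pre>
</div>
</div></div>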
<h3 id="role--type">Role &amp; type
</h3>
<div class="grid gap-12 items-start mt-12 md:grid-cols-[3fr_2fr] ">
  
  
    
    
    
      <div class="prose max-w-none ">The last section focuses on the selection and management of variable roles and types within the recipe. The selection side covers ways to target variables by their role (outcome, predictor, or any custom role) as well as by their type (numeric, factor, logical, and so on), including a handy set of convenience selectors for the most common combinations. The management side shows how to add, update, and remove roles, giving you fine-grained control over how each variable participates in the recipe.</div>
    
  
    
    
    
      <div class="prose max-w-none "><div class="not-prose"><figure>
      <img class="h-auto max-w-full rounded-lg"
        src="https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/tidymodels-cheatsheets-selectors.png"
        alt="A close-up of the Role and type section of the recipes cheatsheet, showing the Selectors and Convenience Selectors subsections. Each convenience selector function is listed alongside colored pills indicating which variable types it targets."  title="Easily find the right selector function with this at-a-glance guide" 
        loading="lazy"
      ><figcaption class="text-sm text-center text-gray-500">Easily find the right selector function with this at-a-glance guide</figcaption>
    </figure></div></div>
    
  
</div>
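<p>To make the role and selector ideas concrete, here is a short sketch under our own assumptions: the recipes package is installed, and we add a hypothetical <code>model</code> identifier column to <code>mtcars</code> purely for illustration.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r">library(recipes)

cars &lt;- mtcars
# Hypothetical identifier column, added only for this example
cars$model &lt;- rownames(mtcars)

rec &lt;- recipe(mpg ~ ., data = cars) |&gt;
  # Management: keep the column in the data, but not as a predictor
  update_role(model, new_role = &#34;id&#34;) |&gt;
  # Selection: a convenience selector combining role and type
  step_normalize(all_numeric_predictors())

# One row per variable, with its type and current role
role_info &lt;- summary(rec)
role_info</code></pre>
</div>
</div></div>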

<br/>
<div class="text-right">
<a href="https://opensource.posit.co/resources/cheatsheets/ml-preprocessing-data/"
  class="btn-shortcode inline-flex mb-5 mr-5 items-center px-4 py-3 text-sm leading-5 gap-2 rounded-lg bg-blue-500 text-white font-semibold align-middle hover:bg-blue-600 transition no-underline">Explore the recipes cheatsheet<span class="icon-[boxicons--arrow-right] w-4 h-4 flex-none"></span></a>

</div>
<h2 id="need-them-on-the-go-print-them">Need them on the go? Print them!
</h2>
<p>A lot of care went into ensuring both cheatsheets hold up when printed, particularly in black and white. We know that many folks print cheatsheets to keep at their desk for quick reference, and we wanted to make sure they remain fully usable in that medium. That meant making sure font sizes and weights stay legible on paper, that the illustrations remain perceptible without color, and that contrast levels are strong enough that no text ends up too pale to read or too heavy to parse. Accessibility in print mattered to us just as much as clarity on screen.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/tidymodels-cheatsheets-bw.png"
      alt="Both cheatsheets shown side by side in black and white, demonstrating that the text, illustrations, and layout remain clear and readable without color."  title="New tidymodels cheatsheets are fully readable when printed" 
      loading="lazy"
    ><figcaption class="text-sm text-center text-gray-500">New tidymodels cheatsheets are fully readable when printed</figcaption>
  </figure></div>
</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2026-04-29_tidymodels-cheatsheets/tidymodels-cheatsheets.png" length="1674954" type="image/png" />
    </item>
    <item>
      <title>New tidymodels Releases for April 2026</title>
      <link>https://opensource.posit.co/blog/2026-04-27_tidymodels-april-2026/</link>
      <pubDate>Mon, 27 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2026-04-27_tidymodels-april-2026/</guid>
      <dc:creator>Max Kuhn</dc:creator>
      <dc:creator>Hannah Frick</dc:creator>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[<p>We&rsquo;ve released a sequence of tidymodels packages over the last few weeks: dials (1.4.3), parsnip (1.5.0), tune (2.1.0), yardstick (1.4.0), and tidymodels (1.5.0). You can install them via:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># tidymodels installs all of the new versions</span>
</span></span><span class="line"><span class="cl"><span class="nf">require</span><span class="p">(</span><span class="n">pak</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">pak</span><span class="o">::</span><span class="nf">pak</span><span class="p">(</span><span class="s">&#34;tidymodels&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Here are links to the NEWS files for each package:</p>
<ul>
<li><a href="https://dials.tidymodels.org/news/index.html#dials-143" target="_blank" rel="noopener">dials</a></li>
<li><a href="https://parsnip.tidymodels.org/news/index.html#parsnip-150" target="_blank" rel="noopener">parsnip</a></li>
<li><a href="https://tune.tidymodels.org/news/index.html#tune-210" target="_blank" rel="noopener">tune</a></li>
<li><a href="https://yardstick.tidymodels.org/news/index.html#yardstick-140" target="_blank" rel="noopener">yardstick</a></li>
<li><a href="https://finetune.tidymodels.org/news/index.html#finetune-130" target="_blank" rel="noopener">finetune</a></li>
<li><a href="https://tidymodels.tidymodels.org/news/index.html#tidymodels-150" target="_blank" rel="noopener">tidymodels</a></li>
</ul>
<p>Let&rsquo;s first talk about the two biggest updates enabled by this group of releases, then we&rsquo;ll cover some of the other changes for each package.</p>
<h2 id="ordered-outcomes">Ordered Outcomes
</h2>
<p>parsnip has a new model type, <code>ordinal_reg()</code>, analogous to <code>multinom_reg()</code>, for fitting various generalized linear models with ordered class levels.</p>
<p>The <a href="https://github.com/corybrunson/ordered" target="_blank" rel="noopener">ordered package by Cory Brunson</a> is now on CRAN. It contains the engine code for these models, including:</p>
<ul>
<li><code>ordinal_reg()</code>: <code>&quot;polr&quot;</code>, <code>&quot;ordinalNet&quot;</code>, and <code>&quot;vglm&quot;</code></li>
<li><code>gen_additive_mod()</code>: <code>&quot;vgam&quot;</code></li>
<li><code>decision_tree()</code>: <code>&quot;rpartScore&quot;</code></li>
<li><code>rand_forest()</code>: <code>&quot;ordinalForest&quot;</code></li>
</ul>
<p>These models can be fitted, tuned, and evaluated with tidymodels. For evaluation, we&rsquo;ve added a performance metric specific to ordered categories: the <a href="https://aml4td.org/chapters/cls-metrics.html#sec-ordered-categories" target="_blank" rel="noopener">ranked probability score</a> (RPS). The function <code>ranked_prob_score()</code> is in the new yardstick release and requires an ordered factor for the outcome.</p>
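<p>To give a feel for the metric, here is a minimal base R sketch of one common, unnormalized definition of the RPS for a single observation: the sum of squared differences between the cumulative predicted probabilities and the cumulative observed indicator. The helper <code>rps_one()</code> is hypothetical and written only for illustration; <code>ranked_prob_score()</code> in yardstick may scale or aggregate the score differently.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"># Hypothetical helper: unnormalized RPS for one observation
# probs:    predicted class probabilities, in the order of the class levels
# observed: integer position of the true class within those levels
rps_one &lt;- function(probs, observed) {
  outcome &lt;- as.numeric(seq_along(probs) == observed)
  sum((cumsum(probs) - cumsum(outcome))^2)
}

# Most of the probability sits on the true (middle) class: small score
close_score &lt;- rps_one(c(0.1, 0.6, 0.3), observed = 2)

# Most of the probability sits two levels away from the true class: larger score
far_score &lt;- rps_one(c(0.1, 0.1, 0.8), observed = 1)</code></pre>
</div>
</div></div>
<p>Because the score works on cumulative probabilities, a prediction that misses by two levels is penalized more than one that misses by a single level, which is what makes it suited to ordered outcomes.</p>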
<h2 id="quantile-regression">Quantile Regression
</h2>
<p>We <a href="https://tidyverse.org/blog/2025/02/tidymodels-2025-q1/#quantile-regression-in-parsnip" target="_blank" rel="noopener">previously reported</a> that parsnip supports quantile regression models. With the latest set of releases, <a href="https://parsnip.tidymodels.org/news/index.html#quantile-regression-1-5-0" target="_blank" rel="noopener">new boosting and neural network engines</a> are available, and these models can now be tuned and evaluated using a relevant metric. yardstick now includes the <em>weighted interval score</em> (<a href="https://doi.org/10.1371/journal.pcbi.1008618" target="_blank" rel="noopener">Bracher <em>et al.</em> (2021)</a>) to evaluate the quality of the quantile predictions.</p>
<p>Here&rsquo;s a simple one-dimensional example using the Ames data; we&rsquo;ll predict the sale price as a function of latitude. To start, let&rsquo;s make a training/test split, generate some resamples, and plot the training data.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># We&#39;ll also need the qrnn package for the neural network engine</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">1215</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ames_split</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="n">ames</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">Latitude</span><span class="p">,</span> <span class="n">Sale_Price</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">initial_split</span><span class="p">(</span><span class="n">strata</span> <span class="o">=</span> <span class="n">Sale_Price</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ames_train</span> <span class="o">&lt;-</span> <span class="nf">training</span><span class="p">(</span><span class="n">ames_split</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ames_test</span> <span class="o">&lt;-</span> <span class="nf">testing</span><span class="p">(</span><span class="n">ames_split</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ames_rs</span> <span class="o">&lt;-</span> <span class="nf">vfold_cv</span><span class="p">(</span><span class="n">ames_train</span><span class="p">,</span> <span class="n">strata</span> <span class="o">=</span> <span class="n">Sale_Price</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">ames_train</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ggplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">Latitude</span><span class="p">,</span> <span class="n">Sale_Price</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_point</span><span class="p">(</span><span class="n">alpha</span> <span class="o">=</span> <span class="m">1</span> <span class="o">/</span> <span class="m">5</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_smooth</span><span class="p">(</span><span class="n">se</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">labs</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Latitude&#34;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;Sale Price (USD)&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<img src="https://opensource.posit.co/blog/2026-04-27_tidymodels-april-2026/index.markdown_strict_files/figure-markdown_strict/split-ames-1.png" style="width:90.0%" data-fig-align="center" />
<p>Note that we almost always model these data with a log transformation on the outcome due to its inherent skewness. That helps us avoid making negative predictions, be more robust to overly influential points (i.e., locations with very large sale prices), and stabilize the variance. <em>However</em>, we don&rsquo;t necessarily have to do that with quantile regression. The objective functions used to estimate parameters do not require the data to be normally distributed or the residuals to have constant variance. For this analysis, let&rsquo;s stick with the original units of the outcome (USD).</p>
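<p>As a rough illustration of why the objective function makes no distributional assumption, here is a minimal base R sketch of the pinball (check) loss that quantile regression typically minimizes. The helper <code>pinball_loss()</code> and the simulated skewed outcome are hypothetical and are not part of tidymodels or qrnn; the sketch only shows that minimizing this loss over a constant prediction lands near the sample quantile.</p>
<pre><code class="language-r" data-lang="r"># Pinball (check) loss for a single quantile level tau.
pinball_loss &lt;- function(truth, estimate, tau) {
  err &lt;- truth - estimate
  mean(pmax(tau * err, (tau - 1) * err))
}

# A right-skewed toy outcome (simulated; not the Ames sale prices).
set.seed(1)
y &lt;- rexp(1000, rate = 1 / 180000)

# Find the constant prediction that minimizes the loss for tau = 0.95.
fit &lt;- optimize(
  function(q) pinball_loss(y, q, tau = 0.95),
  interval = range(y)
)

# Compare the minimizer with the empirical 0.95 quantile.
c(minimizer = fit$minimum, empirical = unname(quantile(y, 0.95)))
</code></pre>
<p>The two values should be close even though the simulated outcome is strongly skewed and no transformation was applied.</p>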
<p>There are a few engines for quantile regression, and we&rsquo;ll use a neural network model. To get started, the quantiles to be predicted need to be specified. We make a model specification with a few additions:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># Pre-defined quantiles of interest</span>
</span></span><span class="line"><span class="cl"><span class="n">qnt_lvls</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.05</span><span class="p">,</span> <span class="m">0.25</span><span class="p">,</span> <span class="m">0.5</span><span class="p">,</span> <span class="m">0.75</span><span class="p">,</span> <span class="m">0.95</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">nnet_spec</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="nf">mlp</span><span class="p">(</span><span class="n">hidden_units</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">(),</span> <span class="n">penalty</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">(),</span> <span class="n">epochs</span> <span class="o">=</span> <span class="m">10</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># Set the quantile levels with the mode:</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;quantile regression&#34;</span><span class="p">,</span> <span class="n">quantile_levels</span> <span class="o">=</span> <span class="n">qnt_lvls</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># A new engine for quantile regression with neural networks via the</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># qrnn package. We&#39;ll add an engine argument to specify the</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># optimization method for training the model:</span>
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;qrnn&#34;</span><span class="p">,</span> <span class="n">method</span> <span class="o">=</span> <span class="s">&#34;adam&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Scale the single predictor to help the model initialize its</span>
</span></span><span class="line"><span class="cl"><span class="c1"># parameters.</span>
</span></span><span class="line"><span class="cl"><span class="n">nnet_rec</span> <span class="o">&lt;-</span> <span class="nf">recipe</span><span class="p">(</span><span class="n">Sale_Price</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">ames_train</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_normalize</span><span class="p">(</span><span class="nf">all_predictors</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">nnet_wflow</span> <span class="o">&lt;-</span> <span class="nf">workflow</span><span class="p">(</span><span class="n">nnet_rec</span><span class="p">,</span> <span class="n">nnet_spec</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>From there, we can use any of our tuning functions to optimize the number of hidden units and the amount of weight decay. By default, the weighted interval score is used for this particular mode.</p>
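<p>For intuition about that metric, the sketch below uses one common formulation of the weighted interval score: twice the average pinball loss across the requested quantile levels. The helpers <code>pinball()</code> and <code>wis_one()</code> and the predicted quantiles are hypothetical, and yardstick&rsquo;s implementation may differ in its details (for example, in how it handles missing quantile levels).</p>
<pre><code class="language-r" data-lang="r"># Pinball loss of each predicted quantile at its level tau.
pinball &lt;- function(truth, estimate, tau) {
  err &lt;- truth - estimate
  pmax(tau * err, (tau - 1) * err)
}

# Score for one observation: twice the mean pinball loss over all levels.
wis_one &lt;- function(truth, estimates, quantile_levels) {
  2 * mean(pinball(truth, estimates, quantile_levels))
}

qnt_lvls &lt;- c(0.05, 0.25, 0.5, 0.75, 0.95)

# Hypothetical predicted quantiles for one house that sold for 100 (in $1000s).
preds &lt;- c(60, 85, 95, 110, 150)
score_good &lt;- wis_one(truth = 100, estimates = preds, quantile_levels = qnt_lvls)

# The same quantiles shifted away from the observed value score worse.
score_bad &lt;- wis_one(truth = 100, estimates = preds + 40, quantile_levels = qnt_lvls)

c(good = score_good, bad = score_bad)
</code></pre>
<p>Smaller values are better: the score rewards quantiles that are both well centered and no wider than they need to be.</p>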
<p>We&rsquo;ll consider 25 tuning parameter candidates to optimize model performance.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">971</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">nnet_res</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="n">nnet_wflow</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">tune_grid</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">resamples</span> <span class="o">=</span> <span class="n">ames_rs</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">grid</span> <span class="o">=</span> <span class="m">25</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">control</span> <span class="o">=</span> <span class="nf">control_grid</span><span class="p">(</span><span class="n">save_workflow</span> <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>We can get the performance metric and visualize which tuning parameter combinations have the smallest weighted interval score:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">nnet_mtr</span> <span class="o">&lt;-</span> <span class="nf">collect_metrics</span><span class="p">(</span><span class="n">nnet_res</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">nnet_mtr</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ggplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">penalty</span><span class="p">,</span> <span class="n">hidden_units</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">mean</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_point</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">scale_x_log10</span><span class="p">()</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">coord_fixed</span><span class="p">(</span><span class="n">ratio</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">labs</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">&#34;Penalty&#34;</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">&#34;# Hidden Units&#34;</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="s">&#34;WIS&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<img src="https://opensource.posit.co/blog/2026-04-27_tidymodels-april-2026/index.markdown_strict_files/figure-markdown_strict/quantile-tune-res-1.png" data-fig-align="center" width="480" />
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">select_best</span><span class="p">(</span><span class="n">nnet_res</span><span class="p">,</span> <span class="n">metric</span> <span class="o">=</span> <span class="s">&#34;weighted_interval_score&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<pre><code># A tibble: 1 × 3
  hidden_units     penalty .config         
         &lt;int&gt;       &lt;dbl&gt; &lt;chr&gt;           
1           10 0.000000215 pre0_mod24_post0
</code></pre>
<p>The model appears to prefer a smaller penalty and more hidden units.</p>
<p>It&rsquo;s hard to conceptualize how well the model functions with just these numbers. To show that the metric does select good models, let&rsquo;s fit the best, median, and worst models and see how they look on the test set.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">8281</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">best_model</span> <span class="o">&lt;-</span> <span class="nf">fit_best</span><span class="p">(</span><span class="n">nnet_res</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">8281</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">worst_model</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="n">nnet_mtr</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">slice_max</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="m">1</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">hidden_units</span><span class="p">,</span> <span class="n">penalty</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">finalize_workflow</span><span class="p">(</span><span class="n">nnet_wflow</span><span class="p">,</span> <span class="n">parameters</span> <span class="o">=</span> <span class="n">_</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">ames_train</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">8281</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mid_model</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="n">nnet_mtr</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="c1"># Since we have an odd number of grid points:</span>
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">mean</span> <span class="o">==</span> <span class="nf">median</span><span class="p">(</span><span class="n">mean</span><span class="p">))</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">hidden_units</span><span class="p">,</span> <span class="n">penalty</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">finalize_workflow</span><span class="p">(</span><span class="n">nnet_wflow</span><span class="p">,</span> <span class="n">parameters</span> <span class="o">=</span> <span class="n">_</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">ames_train</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Now let&rsquo;s plot the results. We&rsquo;ll color the predicted quantiles: black indicates the predicted median sale price, orange lines indicate the inner quartiles, and smoky periwinkle lines indicate the 0.05 and 0.95 quantiles (which could serve as 90% prediction intervals).</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">bind_rows</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">best_model</span> <span class="o">|&gt;</span> <span class="nf">augment</span><span class="p">(</span><span class="n">ames_test</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Model</span> <span class="o">=</span> <span class="s">&#34;Best Results&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">mid_model</span> <span class="o">|&gt;</span> <span class="nf">augment</span><span class="p">(</span><span class="n">ames_test</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Model</span> <span class="o">=</span> <span class="s">&#34;Meh Results&#34;</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">  <span class="n">worst_model</span> <span class="o">|&gt;</span> <span class="nf">augment</span><span class="p">(</span><span class="n">ames_test</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="nf">mutate</span><span class="p">(</span><span class="n">Model</span> <span class="o">=</span> <span class="s">&#34;Worst Results&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">mutate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">.pred_quantile</span> <span class="o">=</span> <span class="nf">map</span><span class="p">(</span><span class="n">.pred_quantile</span><span class="p">,</span> <span class="o">~</span> <span class="nf">as_tibble</span><span class="p">(</span><span class="n">.x</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">unnest</span><span class="p">(</span><span class="n">.pred_quantile</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="n">Latitude</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">ggplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">Latitude</span><span class="p">))</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_point</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">Sale_Price</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">1</span> <span class="o">/</span> <span class="m">30</span><span class="p">,</span> <span class="n">cex</span> <span class="o">=</span> <span class="m">3</span> <span class="o">/</span> <span class="m">4</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_path</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">aes</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">      <span class="n">y</span> <span class="o">=</span> <span class="n">.pred_quantile</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="n">group</span> <span class="o">=</span> <span class="n">.quantile_levels</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="n">col</span> <span class="o">=</span> <span class="nf">factor</span><span class="p">(</span><span class="n">.quantile_levels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="n">show.legend</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">linewidth</span> <span class="o">=</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">scale_color_manual</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">values</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="s">&#34;#8785B2FF&#34;</span><span class="p">,</span> <span class="s">&#34;#D95F30FF&#34;</span><span class="p">,</span> <span class="s">&#34;black&#34;</span><span class="p">,</span> <span class="s">&#34;#D95F30FF&#34;</span><span class="p">,</span> <span class="s">&#34;#8785B2FF&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">Model</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<img src="https://opensource.posit.co/blog/2026-04-27_tidymodels-april-2026/index.markdown_strict_files/figure-markdown_strict/fit-plots-1.png" data-fig-align="center" width="576" />
<p>These plots show that configurations with very large score values have poor fits (linear in this case). The &ldquo;meh&rdquo; model is nonlinear but not responsive enough to the data&rsquo;s ups and downs. The best model, with more hidden units and a low penalty, appears to be flexible enough to model the data well.</p>
<p>We&rsquo;ll have more metrics in yardstick that can use quantile predictions in the future. For example, we can extend the ones that we have, such as <code>rmse()</code> or <code>rsq()</code>, to use a predicted value from the center of the predictive distribution, such as the 0.5 quantile.</p>
<p>Now we&rsquo;ll describe various other improvements in the recently released versions.</p>
<h2 id="dials">dials
</h2>
<p>The latest dials release contains several new parameters for new-ish models in parsnip. For <code>ordinal_reg()</code> models, dials now contains <code>ordinal_link()</code> and <code>odds_link()</code>. For <code>tab_pfn()</code> models, dials contains <code>num_estimators()</code>, <code>softmax_temperature()</code>, <code>balance_probabilities()</code>, <code>average_before_softmax()</code>, and <code>training_set_limit()</code>.</p>
<p>The other user-facing changes were related to input checking and related error messages. The most prominent example is that <code>parameters()</code> and the <code>grid_*()</code> functions now give more information in the error message when non-parameter objects are passed in: which inputs aren&rsquo;t a parameter object and what they are instead.</p>
<p>






<a href="https://github.com/corybrunson" target="_blank" rel="noopener">@corybrunson</a>
, 






<a href="https://github.com/daltonkw" target="_blank" rel="noopener">@daltonkw</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/jeroenjanssens" target="_blank" rel="noopener">@jeroenjanssens</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, and 






<a href="https://github.com/vmikk" target="_blank" rel="noopener">@vmikk</a>
 contributed to the package since the last release.</p>
<h2 id="yardstick">yardstick
</h2>
<p>Beyond the two new metrics <code>ranked_prob_score()</code> and <code>weighted_interval_score()</code> described above, this release adds a further 8 metrics.</p>
<p>Three new regression metrics:</p>
<ul>
<li><code>mse()</code> &mdash; mean squared error (the squared counterpart to the existing <code>rmse()</code>).</li>
<li><code>rmse_relative()</code> &mdash; root mean squared error normalized by the observed value range.</li>
<li><code>gini_coef()</code> &mdash; normalized Gini coefficient.</li>
</ul>
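<p>As a quick sanity check on the first two definitions, here is a base R sketch that computes them by hand from the descriptions above. The helper names <code>mse_by_hand()</code> and <code>rmse_relative_by_hand()</code> and the toy vectors are hypothetical; the yardstick functions remain the reference, and their handling of edge cases (missing values, case weights) is not reproduced here.</p>
<pre><code class="language-r" data-lang="r"># Mean squared error: the average squared difference.
mse_by_hand &lt;- function(truth, estimate) {
  mean((truth - estimate)^2)
}

# RMSE divided by the range of the observed values (assumed definition,
# based on the description "normalized by the observed value range").
rmse_relative_by_hand &lt;- function(truth, estimate) {
  sqrt(mse_by_hand(truth, estimate)) / diff(range(truth))
}

truth &lt;- c(10, 20, 30, 40)
estimate &lt;- c(12, 18, 33, 37)

mse_val &lt;- mse_by_hand(truth, estimate)
rel_val &lt;- rmse_relative_by_hand(truth, estimate)
c(mse = mse_val, rmse_relative = rel_val)
</code></pre>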
<p>Three new classification metrics:</p>
<ul>
<li><code>fall_out()</code> &mdash; false positive rate (1 − specificity).</li>
<li><code>miss_rate()</code> &mdash; false negative rate (1 − sensitivity).</li>
<li><code>markedness()</code> &mdash; predictive power of a classifier, computed as PPV + NPV − 1.</li>
</ul>
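<p>The three definitions above can be checked by hand from the cells of a two-class confusion matrix. The counts below are made up for illustration, and the variable names are hypothetical rather than yardstick arguments.</p>
<pre><code class="language-r" data-lang="r"># Hypothetical confusion matrix counts for a two-class problem.
tp &lt;- 40  # true positives
fn &lt;- 10  # false negatives
fp &lt;- 5   # false positives
tn &lt;- 45  # true negatives

sensitivity &lt;- tp / (tp + fn)
specificity &lt;- tn / (tn + fp)
ppv &lt;- tp / (tp + fp)  # positive predictive value
npv &lt;- tn / (tn + fn)  # negative predictive value

fall_out_val &lt;- 1 - specificity   # false positive rate
miss_rate_val &lt;- 1 - sensitivity  # false negative rate
markedness_val &lt;- ppv + npv - 1

c(
  fall_out = fall_out_val,
  miss_rate = miss_rate_val,
  markedness = markedness_val
)
</code></pre>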
<p>Two new probability-based classification metrics:</p>
<ul>
<li><code>roc_dist()</code> &mdash; Euclidean distance from the perfect-classifier corner of ROC space.</li>
<li><code>sedi()</code> &mdash; Symmetric Extremal Dependence Index.</li>
</ul>
<p>In addition to these new metrics, we have also updated the documentation of all metrics. Each metric now shows the formula used to calculate it, as well as the valid values it can produce.</p>
<p>We also have pages that list all metrics of the same type. These can be found with 






<a href="https://yardstick.tidymodels.org/reference/class-metrics.html" target="_blank" rel="noopener">?class-metrics</a>
, 






<a href="https://yardstick.tidymodels.org/reference/numeric-metrics.html" target="_blank" rel="noopener">?numeric-metrics</a>
 or through the links in each metric&rsquo;s documentation.</p>
<p>We are thankful to the developers who contributed to this version: 






<a href="https://github.com/abichat" target="_blank" rel="noopener">@abichat</a>
, 






<a href="https://github.com/astamm" target="_blank" rel="noopener">@astamm</a>
, 






<a href="https://github.com/corybrunson" target="_blank" rel="noopener">@corybrunson</a>
, 






<a href="https://github.com/DarioS" target="_blank" rel="noopener">@DarioS</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/FvD" target="_blank" rel="noopener">@FvD</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/JavOrraca" target="_blank" rel="noopener">@JavOrraca</a>
, 






<a href="https://github.com/jeroenjanssens" target="_blank" rel="noopener">@jeroenjanssens</a>
, 






<a href="https://github.com/jkylearmstrong-temple" target="_blank" rel="noopener">@jkylearmstrong-temple</a>
, 






<a href="https://github.com/mle2718" target="_blank" rel="noopener">@mle2718</a>
, 






<a href="https://github.com/nathant181" target="_blank" rel="noopener">@nathant181</a>
, 






<a href="https://github.com/SimonDedman" target="_blank" rel="noopener">@SimonDedman</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, and 






<a href="https://github.com/tripartio" target="_blank" rel="noopener">@tripartio</a>
.</p>
<h2 id="parsnip">parsnip
</h2>
<p>Version 1.5.0 of parsnip had a variety of changes. Besides the additions for the two new model types shown above:</p>
<ul>
<li>We enabled case weight usage for the <code>&quot;nnet&quot;</code> engines of <code>mlp()</code> and <code>bag_mlp()</code> as well as for the <code>&quot;dbarts&quot;</code> engine of <code>bart()</code>.</li>
</ul>
<p>Many of the other changes are most likely to be noticed by developers:</p>
<ul>
<li>
<p>The interface for declaring tunable parameters has been simplified and is the same for main arguments as well as engine parameters. Also, these values can now be set inside extension packages.</p>
</li>
<li>
<p>We now export the generics for <code>predict_quantile()</code>, <code>predict_class()</code>, <code>predict_classprob()</code>, and <code>predict_hazard()</code>.</p>
</li>
<li>
<p><code>format_predictions()</code> is a new unified function for formatting prediction outputs, consolidating the logic from the individual <code>format_*()</code> functions. The individual functions <code>format_num()</code>, <code>format_class()</code>, <code>format_classprobs()</code>, <code>format_time()</code>, <code>format_survival()</code>, <code>format_linear_pred()</code>, and <code>format_hazard()</code> are now deprecated.</p>
</li>
</ul>
<p>Thanks to those who contributed to parsnip since the last release: 






<a href="https://github.com/CeresBarros" target="_blank" rel="noopener">@CeresBarros</a>
, 






<a href="https://github.com/corybrunson" target="_blank" rel="noopener">@corybrunson</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/iamYannC" target="_blank" rel="noopener">@iamYannC</a>
, 






<a href="https://github.com/jack-davison" target="_blank" rel="noopener">@jack-davison</a>
, 






<a href="https://github.com/jameslamb" target="_blank" rel="noopener">@jameslamb</a>
, 






<a href="https://github.com/martinju" target="_blank" rel="noopener">@martinju</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</p>
<h2 id="tune">tune
</h2>
<p>The core functionality of tune is to do all the model fitting (including pre- and postprocessing) and performance evaluation across various resamples and tuning parameter combinations. For grid search, we could take the full parameter grid, splice one parameter combination into the workflow at a time, and run with it. That can be pretty inefficient, though, so tune applies a few optimizations to how all that fitting and evaluating is done.</p>
<p>For preprocessing, we do it once per resample (and per preprocessing parameter combination) and then evaluate all model candidates on it. This avoids repeating the same preprocessing multiple times.</p>
<p>For model fitting, we make use of what Max calls <a href="https://parsnip.tidymodels.org/articles/Submodels.html" target="_blank" rel="noopener">&ldquo;the submodel trick&rdquo;</a>: for certain models, like a boosted tree, you can use <em>a submodel</em> to make predictions without having to refit the model. A boosted tree ensemble fitted with 20 trees can be used to make predictions for any number of trees up to the 20 used for fitting. That allows us to evaluate different tuning parameter candidates for, here, the number of trees, without having to refit the model. When we added postprocessing, we temporarily disabled this (to ensure we got the integration right); now we&rsquo;ve brought it back. We make use of this speedup for both the main model and the calibration model.</p>
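<p>To see why a submodel comes for free with boosting, here is a toy base R sketch. It is not how tune or any parsnip engine is implemented: the functions <code>fit_stump()</code>, <code>boost()</code>, and <code>predict_boost()</code> are hypothetical, and the booster is a deliberately simple one that adds regression stumps fit to residuals. Because each stump depends only on the stumps before it, the first five stumps of a 20-stump fit are the same as a separate five-stump fit.</p>
<pre><code class="language-r" data-lang="r"># Fit one regression stump (a single split on x) to the residuals r.
fit_stump &lt;- function(x, r) {
  xs &lt;- sort(unique(x))
  splits &lt;- (head(xs, -1) + tail(xs, -1)) / 2
  sse &lt;- vapply(splits, function(s) {
    left &lt;- r[x &lt;= s]
    right &lt;- r[x &gt; s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  s &lt;- splits[which.min(sse)]
  list(split = s, left = mean(r[x &lt;= s]), right = mean(r[x &gt; s]))
}

predict_stump &lt;- function(stump, x) {
  ifelse(x &lt;= stump$split, stump$left, stump$right)
}

# Boost n_trees stumps with a fixed learning rate.
boost &lt;- function(x, y, n_trees, rate = 0.3) {
  pred &lt;- rep(mean(y), length(y))
  stumps &lt;- vector("list", n_trees)
  for (i in seq_len(n_trees)) {
    new_stump &lt;- fit_stump(x, y - pred)
    pred &lt;- pred + rate * predict_stump(new_stump, x)
    stumps[i] &lt;- list(new_stump)
  }
  list(init = mean(y), stumps = stumps, rate = rate)
}

# Predict with only the first `trees` stumps: no refitting needed.
predict_boost &lt;- function(model, x, trees) {
  pred &lt;- rep(model$init, length(x))
  for (stump in model$stumps[seq_len(trees)]) {
    pred &lt;- pred + model$rate * predict_stump(stump, x)
  }
  pred
}

set.seed(42)
x &lt;- runif(200)
y &lt;- sin(2 * pi * x) + rnorm(200, sd = 0.2)

full &lt;- boost(x, y, n_trees = 20)
small &lt;- boost(x, y, n_trees = 5)

# The 5-tree submodel of the 20-tree fit matches a separate 5-tree fit.
all.equal(predict_boost(full, x, trees = 5), predict_boost(small, x, trees = 5))
</code></pre>
<p>A grid over the number of trees therefore needs only one fit per resample, at the largest value in the grid.</p>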
<p>Another big update is that the Gaussian process model package was changed from GPfit to GauPro because the former is no longer actively maintained. There are some differences:</p>
<ul>
<li>
<p>Fit diagnostics are computed and reported. If the fit quality is poor, an &ldquo;uncertainty sample&rdquo; that is furthest away from the existing data is used as the new candidate.</p>
</li>
<li>
<p>The GP no longer uses binary indicators for qualitative predictors. Instead, a &ldquo;categorical kernel&rdquo; is used for those parameter columns. Fewer starting values are required with this change.</p>
</li>
<li>
<p>For numeric predictors, the Matern 3/2 kernel is always used.</p>
</li>
</ul>
<p>Some other changes of note:</p>
<ul>
<li>
<p>When calculating resampling estimates, we can now use a weighted mean based on the number of rows in the assessment set, thanks to Tyler Burch. You can opt in to this using the new <code>add_resample_weights()</code> function. See <code>?calculate_resample_weights</code>.</p>
</li>
<li>
<p>The warning threshold used when checking the size of a workflow is now a parameter to the control functions and has a new default of 100MB.</p>
</li>
</ul>
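<p>To make the weighted resampling estimate concrete, here is a minimal base R sketch. The metric values and assessment-set sizes are invented for illustration, and this is not tune&rsquo;s internal code; it only shows how a row-weighted mean differs from the plain mean:</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r"># Hypothetical per-resample RMSE values and assessment-set sizes
rmse_by_resample &lt;- c(0.50, 0.60, 0.40)
assessment_rows  &lt;- c(100, 50, 150)

# Unweighted estimate: every resample counts equally
unweighted_est &lt;- mean(rmse_by_resample)

# Weighted estimate: resamples with more assessment rows count for more
weighted_est &lt;- weighted.mean(rmse_by_resample, w = assessment_rows)

c(unweighted = unweighted_est, weighted = weighted_est)</code></pre></div>
<p>The two estimates diverge most when assessment sets differ a lot in size, which is when weighting is likely to matter.</p>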
<p>Some bug fixes:</p>
<ul>
<li>
<p>Models with submodel parameters would train all calibration models on predictions from a single submodel value instead of the correct value for each submodel. We sorted this out.</p>
</li>
<li>
<p>We fixed a bug for cases where we tune a grid without a model parameter but with a postprocessing parameter.</p>
</li>
<li>
<p>Another bug was fixed for <code>augment()</code> when using <code>last_fit()</code> objects.</p>
</li>
</ul>
<p>Thanks to the following contributors: 






<a href="https://github.com/edgararuiz" target="_blank" rel="noopener">@edgararuiz</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/jeroenjanssens" target="_blank" rel="noopener">@jeroenjanssens</a>
, 






<a href="https://github.com/jjcurtin" target="_blank" rel="noopener">@jjcurtin</a>
, 






<a href="https://github.com/mikewolfe" target="_blank" rel="noopener">@mikewolfe</a>
, 






<a href="https://github.com/mthulin" target="_blank" rel="noopener">@mthulin</a>
, 






<a href="https://github.com/ncalliencsu" target="_blank" rel="noopener">@ncalliencsu</a>
, 






<a href="https://github.com/rvalieris" target="_blank" rel="noopener">@rvalieris</a>
, 






<a href="https://github.com/StevenWallaert" target="_blank" rel="noopener">@StevenWallaert</a>
, 






<a href="https://github.com/tjburch" target="_blank" rel="noopener">@tjburch</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</p>
<h2 id="finetune">finetune
</h2>
<p>This release was mostly focused on internal changes to support the new version of tune.</p>
<h2 id="tidymodels">tidymodels
</h2>
<p>A basic release that updates the version numbers to require the latest releases of the core packages.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2026-04-27_tidymodels-april-2026/2026-april-tidymodels.jpg" length="491241" type="image/jpeg" />
    </item>
    <item>
      <title>tabpfn 0.1.0</title>
      <link>https://opensource.posit.co/blog/2026-03-31_tabpfn-0-1-0/</link>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2026-03-31_tabpfn-0-1-0/</guid>
      <dc:creator>Max Kuhn</dc:creator><description><![CDATA[
<p>We&rsquo;re stoked to announce the release of 






<a href="https://tabpfn.tidymodels.org/" target="_blank" rel="noopener">tabpfn</a>
 0.1.0. 






<a href="https://github.com/PriorLabs/TabPFN" target="_blank" rel="noopener">TabPFN</a>
 is a precompiled deep learning Python model for prediction. The <em>R package tabpfn</em> is an interface to this model via reticulate.</p>
<p>You can install it from CRAN with:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;tabpfn&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<h2 id="what-is-tabpfn">What is TabPFN?
</h2>
<p>The &ldquo;tab&rdquo; means <em>tabular</em>, which is code for everyday rectangular data structures that we find in csv files and databases.</p>
<p>The &ldquo;pfn&rdquo; is more complicated &ndash; it stands for &ldquo;prior fitted network&rdquo;. The model is trained on fully synthetic datasets. The developers created a complex graph model that can simulate a wide variety of data-generating methods, including correlation structures, distributional skewness, missing-data mechanisms, interactions, latent variables, and more. It can also simulate random supervised relationships linking potential predictors to the outcome data. The training process for the model simulated a very large number of these data sets that, in effect, constitute a &ldquo;training set data point&rdquo;. For example, during training, if a batch size of 64 was used, that means 64 randomly generated datasets were used in that iteration.</p>
<p>From these data sets, a complex deep learning model is created that captures a huge number of possible relationships. The model is sophisticated enough and trained in a manner that allows it to effectively emulate Bayesian estimation.</p>
<p>When we use the pre-trained model, our training set matters, even though there is no new estimation. The model includes an 






<a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29" target="_blank" rel="noopener">attention mechanism</a>
 that &ldquo;primes the model&rdquo; by focusing on the types of relationships in your training data. In that way, the pre-fitted network is deliberately biased to effectively predict our new samples. This leads to 






<a href="https://scholar.google.com/scholar?as_sdt=0%2C7&amp;as_vis=1&amp;q=%22in&#43;context&#43;learning%22" target="_blank" rel="noopener">in-context learning</a>
.</p>
<p>And it works; in fact, it works really well.</p>
<h2 id="license-for-the-underyling-model">License for the Underlying Model
</h2>
<p>






<a href="https://priorlabs.ai/" target="_blank" rel="noopener">PriorLabs</a>
 created TabPFN. Version 2.5 of the model, which contains several improvements, requires an API key to access the model parameters. Without one, an error occurs:</p>
<blockquote>
<p>This model is gated and requires you to accept its terms.  Please follow these steps: 1. Visit 






<a href="https://huggingface.co/Prior-Labs/tabpfn_2_5" target="_blank" rel="noopener">https://huggingface.co/Prior-Labs/tabpfn_2_5</a>
 in your browser and accept the terms of use. 2. Log in to your Hugging Face account via the command line by running: hf auth login (Alternatively, you can set the <code>HF_TOKEN</code> environment variable with a read token).</p>
</blockquote>
<p>The license includes provisions for &ldquo;Non-Commercial Use Only&rdquo; if you are just trying it out.</p>
<p>Instructions for installing the package and obtaining the API key are in the 


  
  
  





<a href="https://tabpfn.tidymodels.org/reference/tab_pfn.html#license-requirements" target="_blank" rel="noopener">package&rsquo;s manual</a>
.</p>
<p>Also, the model is most efficient when a GPU is available (by an order of magnitude or two). This may seem obvious to anyone already working with deep learning models, but it is a fairly new requirement for those strictly working with traditional tabular data models.</p>
<h2 id="usage">Usage
</h2>
<p>The syntax is idiomatic R: it supports fitting interfaces via data frames/vectors, formulas, and recipes. The standard R <code>predict()</code> method is used for prediction, and <code>augment()</code> is also available.</p>
<p>When evaluating pre-trained models, there is a possibility that they may have memorized well-known datasets (e.g., Ames housing, Palmer penguins). TabPFN isn&rsquo;t trained that way, but just in case we are worried about that, we&rsquo;ll use lesser-known data. 






<a href="https://scholar.google.com/scholar?as_sdt=0%2C7&amp;q=Worley%2C&#43;B.&#43;A.&#43;%281987%29.&#43;%22Deterministic&#43;uncertainty&#43;analysis%22" target="_blank" rel="noopener">Worley (1987)</a>
 derived a mechanistic model for the flow rate of liquids from two aquifers positioned vertically (i.e., the &ldquo;upper&rdquo; and &ldquo;lower&rdquo; aquifers). We&rsquo;ll generate some of that data and add completely noisy predictors to increase the difficulty. The outcome is very skewed, so we&rsquo;ll log that too.</p>
<p>Additionally, we&rsquo;ll load the tidymodels library for simulation, data splitting, and visualization.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tabpfn</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">probably</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">17</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">aquifier_data</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl"> <span class="nf">sim_regression</span><span class="p">(</span><span class="m">2000</span><span class="p">,</span>  <span class="n">method</span> <span class="o">=</span> <span class="s">&#34;worley_1987&#34;</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl"> <span class="nf">bind_cols</span><span class="p">(</span><span class="nf">sim_noise</span><span class="p">(</span><span class="m">2000</span><span class="p">,</span> <span class="m">50</span><span class="p">))</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl"> <span class="nf">mutate</span><span class="p">(</span><span class="n">outcome</span> <span class="o">=</span> <span class="nf">log10</span><span class="p">(</span><span class="n">outcome</span><span class="p">))</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>We&rsquo;ll use a stratified 3:1 training and testing split:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">8223</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">aquifier_split</span> <span class="o">&lt;-</span> <span class="nf">initial_split</span><span class="p">(</span><span class="n">aquifier_data</span><span class="p">,</span> <span class="n">strata</span> <span class="o">=</span> <span class="n">outcome</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">aquifier_split</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## &lt;Training/Testing/Total&gt;
## &lt;1500/500/2000&gt;</code></pre></div>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">aquifier_train</span> <span class="o">&lt;-</span> <span class="nf">training</span><span class="p">(</span><span class="n">aquifier_split</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">aquifier_test</span>  <span class="o">&lt;-</span> <span class="nf">testing</span><span class="p">(</span><span class="n">aquifier_split</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>and &ldquo;fit&rdquo; the model:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tab_fit</span> <span class="o">&lt;-</span> <span class="nf">tab_pfn</span><span class="p">(</span><span class="n">outcome</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">aquifier_train</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Again, the model does not actually fit anything new. This computes the embeddings for the training set data and stores them for the prediction stage.</p>
<p>To make predictions, <code>predict()</code> returns the model&rsquo;s results. As previously mentioned, a GPU is not strictly required for these computations. However, without one, execution time can be very long if more than a trivial amount of data is being predicted.</p>
<p>Since we&rsquo;ll want to evaluate and plot the data, we&rsquo;ll use <code>augment()</code>, which just runs <code>predict()</code> and binds the results to the data being predicted:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tab_pred</span> <span class="o">&lt;-</span> <span class="nf">augment</span><span class="p">(</span><span class="n">tab_fit</span><span class="p">,</span> <span class="n">aquifier_test</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>How well does it work?</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tab_pred</span> <span class="o">|&gt;</span> <span class="nf">metrics</span><span class="p">(</span><span class="n">outcome</span><span class="p">,</span> <span class="n">.pred</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
## 1 rmse    standard      0.104 
## 2 rsq     standard      0.937 
## 3 mae     standard      0.0829</code></pre></div>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tab_pred</span> <span class="o">|&gt;</span> <span class="nf">cal_plot_regression</span><span class="p">(</span><span class="n">outcome</span><span class="p">,</span> <span class="n">.pred</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2026-03-31_tabpfn-0-1-0/figure/unnamed-chunk-6-1.png"
      alt="plot of chunk unnamed-chunk-6" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>That looks good, especially with no training.</p>
<h2 id="next-steps">Next Steps
</h2>
<p>There is a lot more functionality to add to the package, including additional prediction types and interpretability tools. Many of these are available in 






<a href="https://github.com/priorlabs/tabpfn-extensions" target="_blank" rel="noopener">extensions</a>
.</p>
<p>We&rsquo;ll also add a new parsnip model type for TabPFN and other integrations with tidymodels in the summer.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>A huge thanks to Tomasz Kalinowski and Daniel Falbel for their support on this and all of their hard work on reticulate and torch.</p>
<p>Thanks also to the contributors to date: 






<a href="https://github.com/frankiethull" target="_blank" rel="noopener">@frankiethull</a>
, 






<a href="https://github.com/mthulin" target="_blank" rel="noopener">@mthulin</a>
, and 






<a href="https://github.com/t-kalinowski" target="_blank" rel="noopener">@t-kalinowski</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2026-03-31_tabpfn-0-1-0/thumbnail-wd.jpg" length="57422" type="image/jpeg" />
    </item>
    <item>
      <title>orbital 0.4.0</title>
      <link>https://opensource.posit.co/blog/2026-01-12_orbital-0-4-0/</link>
      <pubDate>Mon, 12 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2026-01-12_orbital-0-4-0/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[
<p>We&rsquo;re over the moon to announce the release of 






<a href="https://orbital.tidymodels.org/" target="_blank" rel="noopener">orbital</a>
 0.4.0. orbital lets you predict in databases using tidymodels workflows.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"orbital"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post will cover the highlights, which are post processing support and the new <code>show_query()</code> method.</p>
<p>You can see a full list of changes in the 


  
  
  





<a href="https://orbital.tidymodels.org/news/index.html#orbital-040" target="_blank" rel="noopener">release notes</a>
.</p>
<h2 id="post-processing-support">Post processing support
</h2>
<p>The biggest improvement in this version is that 






<a href="https://orbital.tidymodels.org/reference/orbital.html" target="_blank" rel="noopener"><code>orbital()</code></a>
 now works for supported 






<a href="https://tailor.tidymodels.org/" target="_blank" rel="noopener">tailor</a>
 methods. See the 


  
  
  





<a href="https://orbital.tidymodels.org/articles/supported-models.html#tailor-adjustments" target="_blank" rel="noopener">vignette</a>
 for a list of all supported post-processors.</p>
<p>Let&rsquo;s start by fitting a classification model on the <code>penguins</code> data set, using {xgboost} as the engine. We will be showcasing using an adjustment that only works on binary classification and will thus recode <code>species</code> to have levels <code>&quot;Adelie&quot;</code> and <code>&quot;not_Adelie&quot;</code>.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>penguins</span><span class='o'>$</span><span class='nv'>species</span> <span class='o'>&lt;-</span> <span class='nf'>forcats</span><span class='nf'>::</span><span class='nf'><a href='https://forcats.tidyverse.org/reference/fct_recode.html'>fct_recode</a></span><span class='o'>(</span></span>
<span> <span class='nv'>penguins</span><span class='o'>$</span><span class='nv'>species</span>,</span>
<span> not_Adelie <span class='o'>=</span> <span class='s'>"Chinstrap"</span>, not_Adelie <span class='o'>=</span> <span class='s'>"Gentoo"</span></span>
<span><span class='o'>)</span></span></code></pre>
</div>
<p>After we have modified the data, we set up a simple workflow, with a preprocessor using recipes and the model specification using parsnip.</p>
<p>We also set up a post processor using the tailor package. A single adjustment is added with <code>adjust_equivocal_zone()</code>. This applies an equivocal zone to our binary classification model, flagging predictions that are too close to the threshold by labeling them as <code>&quot;[EQ]&quot;</code>. Setting the argument <code>value = 0.2</code> means that any prediction with a predicted probability between 0.3 and 0.7 will be labeled <code>&quot;[EQ]&quot;</code> instead.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>rec_spec</span> <span class='o'>&lt;-</span> <span class='nf'>recipe</span><span class='o'>(</span><span class='nv'>species</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>penguins</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_unknown</span><span class='o'>(</span><span class='nf'>all_nominal_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_dummy</span><span class='o'>(</span><span class='nf'>all_nominal_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_impute_mean</span><span class='o'>(</span><span class='nf'>all_numeric_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_zv</span><span class='o'>(</span><span class='nf'>all_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>lr_spec</span> <span class='o'>&lt;-</span> <span class='nf'>boost_tree</span><span class='o'>(</span>tree_depth <span class='o'>=</span> <span class='m'>1</span>, trees <span class='o'>=</span> <span class='m'>5</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>set_mode</span><span class='o'>(</span><span class='s'>"classification"</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>set_engine</span><span class='o'>(</span><span class='s'>"xgboost"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>tlr_spec</span> <span class='o'>&lt;-</span> <span class='nf'>tailor</span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>adjust_equivocal_zone</span><span class='o'>(</span>value <span class='o'>=</span> <span class='m'>0.2</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>wf_spec</span> <span class='o'>&lt;-</span> <span class='nf'>workflow</span><span class='o'>(</span><span class='nv'>rec_spec</span>, <span class='nv'>lr_spec</span>, <span class='nv'>tlr_spec</span><span class='o'>)</span></span>
<span><span class='nv'>wf_fit</span> <span class='o'>&lt;-</span> <span class='nf'>fit</span><span class='o'>(</span><span class='nv'>wf_spec</span>, data <span class='o'>=</span> <span class='nv'>penguins</span><span class='o'>)</span></span></code></pre>
</div>
<p>With this fitted workflow object, we can call 






<a href="https://orbital.tidymodels.org/reference/orbital.html" target="_blank" rel="noopener"><code>orbital()</code></a>
 on it to create an orbital object. Notice that for <code>adjust_equivocal_zone()</code> to work, we need to set <code>type = c(&quot;class&quot;, &quot;prob&quot;)</code> as both are required for the <code>adjust_equivocal_zone()</code> transformation.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>orbital_obj</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://orbital.tidymodels.org/reference/orbital.html'>orbital</a></span><span class='o'>(</span><span class='nv'>wf_fit</span>, type <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"class"</span>, <span class='s'>"prob"</span><span class='o'>)</span><span class='o'>)</span></span>
<span><span class='nv'>orbital_obj</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>orbital Object</span> <span style='color: #00BBBB;'>───────────────────────────────────────────────────────</span></span></span>
<span><span class='c'>#&gt; • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, ...</span></span>
<span><span class='c'>#&gt; • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201 ...</span></span>
<span><span class='c'>#&gt; • .pred_class = dplyr::case_when(1 - 1/(1 + exp(dplyr::case_when(b ...</span></span>
<span><span class='c'>#&gt; • .pred_Adelie = 1 - 1/(1 + exp(dplyr::case_when(bill_length_mm &lt; ...</span></span>
<span><span class='c'>#&gt; • .pred_not_Adelie = 1 - (1 - 1/(1 + exp(dplyr::case_when(bill_len ...</span></span>
<span><span class='c'>#&gt; • .pred_class = dplyr::case_when( .pred_Adelie &gt; 0.5 + 0.2 ~ 'Adel ...</span></span>
<span><span class='c'>#&gt; ─────────────────────────────────────────────────────────────────────────</span></span>
<span><span class='c'>#&gt; 6 equations in total.</span></span>
<span></span></code></pre>
</div>
<p>This object contains all the information that is needed to produce predictions, which we can generate with 






<a href="https://rdrr.io/r/stats/predict.html" target="_blank" rel="noopener"><code>predict()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>preds</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>orbital_obj</span>, <span class='nv'>penguins</span><span class='o'>)</span></span>
<span><span class='nv'>preds</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 344 × 3</span></span></span>
<span><span class='c'>#&gt;    .pred_class .pred_Adelie .pred_not_Adelie</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>            <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> not_Adelie         0.291            0.709</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie             0.845            0.155</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 334 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>The predictions are working; however, the first few rows don&rsquo;t show any evidence that <code>adjust_equivocal_zone()</code> is working. A call to <code>count()</code> reveals that 15 observations land in the equivocal zone.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'>count</span><span class='o'>(</span><span class='nv'>preds</span>, <span class='nv'>.pred_class</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 3 × 2</span></span></span>
<span><span class='c'>#&gt;   .pred_class     n</span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>       <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span> Adelie        144</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span> [EQ]           15</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span> not_Adelie    185</span></span>
<span></span></code></pre>
</div>
<p>And we can further verify that they are correct.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/filter.html'>filter</a></span><span class='o'>(</span><span class='nv'>preds</span>, <span class='nv'>.pred_class</span> <span class='o'>==</span> <span class='s'>'[EQ]'</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 15 × 3</span></span></span>
<span><span class='c'>#&gt;    .pred_class .pred_Adelie .pred_not_Adelie</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>            <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> [EQ]               0.348            0.652</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> [EQ]               0.348            0.652</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> [EQ]               0.348            0.652</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>11</span> [EQ]               0.348            0.652</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>12</span> [EQ]               0.348            0.652</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>13</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>14</span> [EQ]               0.483            0.517</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>15</span> [EQ]               0.483            0.517</span></span>
<span></span></code></pre>
</div>
<h2 id="new-show_query-method">New show_query method
</h2>
<p>One of the main purposes of orbital is to allow for predictions in databases.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://dbi.r-dbi.org'>DBI</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://rsqlite.r-dbi.org'>RSQLite</a></span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>con_sqlite</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://dbi.r-dbi.org/reference/dbConnect.html'>dbConnect</a></span><span class='o'>(</span><span class='nf'><a href='https://rsqlite.r-dbi.org/reference/SQLite.html'>SQLite</a></span><span class='o'>(</span><span class='o'>)</span>, dbname <span class='o'>=</span> <span class='s'>":memory:"</span><span class='o'>)</span></span>
<span><span class='nv'>penguins_sqlite</span> <span class='o'>&lt;-</span> <span class='nf'>copy_to</span><span class='o'>(</span><span class='nv'>con_sqlite</span>, <span class='nv'>penguins</span>, name <span class='o'>=</span> <span class='s'>"penguins_table"</span><span class='o'>)</span></span></code></pre>
</div>
<p>Having set up a database, we could have used 
<a href="https://orbital.tidymodels.org/reference/orbital_sql.html" target="_blank" rel="noopener"><code>orbital_sql()</code></a>
 to show what the SQL query would look like. That works for quick testing, but the output can&rsquo;t be pasted directly into its own file because of the <code>&lt;SQL&gt;</code> fragments it contains.</p>
<p>The <code>show_query()</code> method has been implemented to see exactly what the generated SQL looks like.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'>show_query</span><span class='o'>(</span><span class='nv'>orbital_obj</span>, <span class='nv'>con_sqlite</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; CASE WHEN ((`bill_length_mm` IS NULL)) THEN 43.9219298245614 WHEN NOT ((`bill_length_mm` IS NULL)) THEN `bill_length_mm` END AS bill_length_mm</span></span>
<span><span class='c'>#&gt; CASE WHEN ((`flipper_length_mm` IS NULL)) THEN 201.0 WHEN NOT ((`flipper_length_mm` IS NULL)) THEN `flipper_length_mm` END AS flipper_length_mm</span></span>
<span><span class='c'>#&gt; CASE</span></span>
<span><span class='c'>#&gt; WHEN ((1.0 - 1.0 / (1.0 + EXP(((((CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 42.4000015) THEN 0.627138138</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)</span></span>
<span><span class='c'>#&gt; END + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 43.2999992) THEN 0.425288886</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 42.4000015) THEN 0.380251437</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 44.4000015) THEN 0.286071777</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`flipper_length_mm` &lt; 203.0) THEN 0.209298179</span></span>
<span><span class='c'>#&gt; WHEN ((`flipper_length_mm` &gt;= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)</span></span>
<span><span class='c'>#&gt; END) + LOG(0.44186047 / (1.0 - 0.44186047))))) &gt; 0.5) THEN 'Adelie'</span></span>
<span><span class='c'>#&gt; ELSE 'not_Adelie'</span></span>
<span><span class='c'>#&gt; END AS .pred_class</span></span>
<span><span class='c'>#&gt; 1.0 - 1.0 / (1.0 + EXP(((((CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 42.4000015) THEN 0.627138138</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)</span></span>
<span><span class='c'>#&gt; END + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 43.2999992) THEN 0.425288886</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 42.4000015) THEN 0.380251437</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 44.4000015) THEN 0.286071777</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`flipper_length_mm` &lt; 203.0) THEN 0.209298179</span></span>
<span><span class='c'>#&gt; WHEN ((`flipper_length_mm` &gt;= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)</span></span>
<span><span class='c'>#&gt; END) + LOG(0.44186047 / (1.0 - 0.44186047)))) AS .pred_Adelie</span></span>
<span><span class='c'>#&gt; 1.0 - (1.0 - 1.0 / (1.0 + EXP(((((CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 42.4000015) THEN 0.627138138</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.449751347)</span></span>
<span><span class='c'>#&gt; END + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 43.2999992) THEN 0.425288886</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 43.2999992 OR (`bill_length_mm` IS NULL))) THEN (-0.398178101)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 42.4000015) THEN 0.380251437</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 42.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.306771189)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`bill_length_mm` &lt; 44.4000015) THEN 0.286071777</span></span>
<span><span class='c'>#&gt; WHEN ((`bill_length_mm` &gt;= 44.4000015 OR (`bill_length_mm` IS NULL))) THEN (-0.330096036)</span></span>
<span><span class='c'>#&gt; END) + CASE</span></span>
<span><span class='c'>#&gt; WHEN (`flipper_length_mm` &lt; 203.0) THEN 0.209298179</span></span>
<span><span class='c'>#&gt; WHEN ((`flipper_length_mm` &gt;= 203.0 OR (`flipper_length_mm` IS NULL))) THEN (-0.348002464)</span></span>
<span><span class='c'>#&gt; END) + LOG(0.44186047 / (1.0 - 0.44186047))))) AS .pred_not_Adelie</span></span>
<span><span class='c'>#&gt; CASE</span></span>
<span><span class='c'>#&gt; WHEN (`.pred_Adelie` &gt; (0.5 + 0.2)) THEN 'Adelie'</span></span>
<span><span class='c'>#&gt; WHEN (`.pred_Adelie` &lt; (0.5 - 0.2)) THEN 'not_Adelie'</span></span>
<span><span class='c'>#&gt; ELSE '[EQ]'</span></span>
<span><span class='c'>#&gt; END AS .pred_class</span></span>
<span></span></code></pre>
</div>
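<p>The final <code>CASE</code> statement in the query is where the equivocal zone is applied. As a rough illustration only (this is not orbital&rsquo;s internal code), the same rule can be sketched in base R. The helper name <code>equivocal_class()</code> and its arguments are hypothetical; the threshold of 0.5 and buffer of 0.2 are taken from the SQL above.</p>

```r
# Hypothetical helper mirroring the last CASE statement of the generated SQL.
# `threshold` and `buffer` are illustrative argument names, not orbital's API.
equivocal_class <- function(prob_adelie, threshold = 0.5, buffer = 0.2) {
  ifelse(
    prob_adelie > threshold + buffer, "Adelie",
    ifelse(prob_adelie < threshold - buffer, "not_Adelie", "[EQ]")
  )
}

# Two probabilities from the filtered rows above, plus two clear-cut cases
eq_check <- equivocal_class(c(0.483, 0.348, 0.9, 0.1))
eq_check
```

<p>The probabilities 0.483 and 0.348 seen in the filtered rows both fall between 0.3 and 0.7, which is why those rows are labelled <code>[EQ]</code>.</p>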
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>A big thank you to all the people who have contributed to orbital since the release of v0.4.0:</p>
<p>






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/frankiethull" target="_blank" rel="noopener">@frankiethull</a>
, 






<a href="https://github.com/jeroenjanssens" target="_blank" rel="noopener">@jeroenjanssens</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2026-01-12_orbital-0-4-0/thumbnail-wd.jpg" length="493114" type="image/jpeg" />
    </item>
    <item>
      <title>tidymodels &amp; xgboost</title>
      <link>https://opensource.posit.co/blog/2025-12-15_tidymodels-xgboost/</link>
      <pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-12-15_tidymodels-xgboost/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[<p>The 






<a href="https://xgboost.readthedocs.io/en/stable/r_docs/R-package/docs/index.html" target="_blank" rel="noopener">xgboost</a>
 library has recently gotten a big CRAN release, jumping from version 1.7.11.1 to 3.1.2.1. We on the tidymodels team have been following the development and have done our best to ensure that your experience is unaffected by this release.</p>
<p>In addition to all the new features and improvements that are now available for users relying on CRAN versions of packages, there are also a few breaking changes, specifically between versions 1.x and 2.x of the xgboost library. The xgboost team has kindly provided a 






<a href="https://xgboost.readthedocs.io/en/stable/R-package/migration_guide.html" target="_blank" rel="noopener">migration guide</a>
 explaining how to update your code if you are upgrading from a version before 2.x.</p>
<p>If you are using xgboost purely through tidymodels via functions like 






<a href="https://parsnip.tidymodels.org/reference/boost_tree.html" target="_blank" rel="noopener"><code>parsnip::boost_tree()</code></a>
 and 






<a href="https://embed.tidymodels.org/reference/step_discretize_xgb.html" target="_blank" rel="noopener"><code>embed::step_discretize_xgb()</code></a>
, you should not need to change anything, as we have updated our packages to work with both the new and old versions of xgboost. If you are having any issues, please let us know by filing an issue for the affected package.</p>
<p>We look forward to integrating parsnip more deeply into these new changes, such as support for 






<a href="https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html" target="_blank" rel="noopener">categorical predictors</a>
 and 


  
  
  





<a href="https://xgboost.readthedocs.io/en/stable/python/examples/quantile_regression.html#quantile-regression" target="_blank" rel="noopener">quantile regression</a>
.</p>
<p>Here are the packages that we&rsquo;ve updated or helped the maintainers update:</p>
<ul>
<li>


  
  
  





<a href="https://rstudio.github.io/bundle/dev/news/index.html#bundle-013" target="_blank" rel="noopener">bundle</a>
</li>
<li>


  
  
  





<a href="https://butcher.tidymodels.org/news/index.html#butcher-040" target="_blank" rel="noopener">butcher</a>
</li>
<li>


  
  
  





<a href="https://embed.tidymodels.org/news/index.html#embed-121" target="_blank" rel="noopener">embed</a>
</li>
<li>






<a href="https://github.com/tidymodels/lime/releases/tag/v0.5.4" target="_blank" rel="noopener">lime</a>
</li>
<li>






<a href="https://business-science.github.io/modeltime/" target="_blank" rel="noopener">modeltime</a>
</li>
<li>


  
  
  





<a href="https://orbital.tidymodels.org/news/index.html#orbital-041" target="_blank" rel="noopener">orbital</a>
</li>
<li>


  
  
  





<a href="https://parsnip.tidymodels.org/news/index.html#parsnip-140" target="_blank" rel="noopener">parsnip</a>
</li>
<li>


  
  
  





<a href="https://tidypredict.tidymodels.org/news/index.html#tidypredict-100" target="_blank" rel="noopener">tidypredict</a>
</li>
<li>


  
  
  





<a href="https://rstudio.github.io/vetiver-r/dev/news/index.html#vetiver-027" target="_blank" rel="noopener">vetiver</a>
</li>
<li>






<a href="https://github.com/holub008/xrf/releases/tag/0.3.0" target="_blank" rel="noopener">xrf</a>
</li>
</ul>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-12-15_tidymodels-xgboost/thumbnail-wd.jpg" length="283800" type="image/jpeg" />
    </item>
    <item>
      <title>tidypredict 1.0.0</title>
      <link>https://opensource.posit.co/blog/2025-12-10_tidypredict-1-0-0/</link>
      <pubDate>Wed, 10 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-12-10_tidypredict-1-0-0/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[
<p>We&rsquo;re tickled pink to announce the release of version 1.0.0 of 






<a href="https://tidypredict.tidymodels.org/" target="_blank" rel="noopener">tidypredict</a>
. The main goal of tidypredict is to enable running predictions inside databases. It reads the model, extracts the components needed to calculate the prediction, and then creates an R formula that can be translated into SQL.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"tidypredict"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post highlights the most important changes in this release, including faster computations for tree-based models, more efficient tree representations, glmnet model support, and a change in how random forests are handled. You can see a full list of changes in the 


  
  
  





<a href="https://tidypredict.tidymodels.org/news/index.html#tidypredict-100" target="_blank" rel="noopener">release notes</a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://tidypredict.tidymodels.org'>tidypredict</a></span><span class='o'>)</span></span></code></pre>
</div>
<h2 id="improved-output-for-random-forest-models">Improved output for random forest models
</h2>
<p>In the previous version of tidypredict, 






<a href="https://tidypredict.tidymodels.org/reference/tidypredict_fit.html" target="_blank" rel="noopener"><code>tidypredict_fit()</code></a>
 would return a list of expressions, one for each tree, when applied to random forest models. This didn&rsquo;t align with what is returned by other types of models. In version 1.0.0, this has been changed to produce a single, combined expression that reflects how predictions should be made.</p>
<p>This is technically a breaking change, but one we believe is worthwhile, as it provides a more consistent output for 






<a href="https://tidypredict.tidymodels.org/reference/tidypredict_fit.html" target="_blank" rel="noopener"><code>tidypredict_fit()</code></a>
 and hides the technical details about how to combine trees from different packages.</p>
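<p>To make the idea of a single, combined expression concrete, here is a minimal sketch in base R. It is not tidypredict&rsquo;s implementation, and real output will look different: the two trees are made up, and averaging the per-tree predictions is an assumption that reflects how regression forests are usually combined.</p>

```r
# Two made-up regression trees, each written as a single R expression.
tree_exprs <- list(
  quote(ifelse(cyl <= 4, 27, 17)),
  quote(ifelse(wt < 3, 25, 16))
)

# Sum the per-tree expressions, then divide by the number of trees,
# giving one expression that averages the trees.
summed <- Reduce(function(a, b) call("+", a, b), tree_exprs)
combined <- call("/", summed, length(tree_exprs))

# The combined expression can be evaluated directly against a data frame.
forest_pred <- eval(combined, envir = mtcars)
head(forest_pred)
```

<p>A single expression like this can be handed to anything that accepts one expression per prediction, without the caller needing to know how the trees should be combined.</p>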
<h2 id="faster-parsing-of-trees">Faster parsing of trees
</h2>
<p>The parsing of xgboost, partykit, and ranger models should now be substantially faster than before; examples have been shown to parse 10 to 200 times faster. Please note that larger models, with more or deeper trees, still take some time to parse.</p>
<h2 id="more-efficient-tree-expressions">More efficient tree expressions
</h2>
<p>All trees, whether they are a single tree or part of a collection of trees, such as in boosted trees or random forests, are encoded as <code>case_when()</code> statements by tidypredict. Take the following tree:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>model</span> <span class='o'>&lt;-</span> <span class='nf'>partykit</span><span class='nf'>::</span><span class='nf'><a href='https://rdrr.io/pkg/partykit/man/ctree.html'>ctree</a></span><span class='o'>(</span><span class='nv'>mpg</span> <span class='o'>~</span> <span class='nv'>am</span> <span class='o'>+</span> <span class='nv'>cyl</span>, data <span class='o'>=</span> <span class='nv'>mtcars</span><span class='o'>)</span></span>
<span><span class='nv'>model</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Model formula:</span></span>
<span><span class='c'>#&gt; mpg ~ am + cyl</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Fitted party:</span></span>
<span><span class='c'>#&gt; [1] root</span></span>
<span><span class='c'>#&gt; |   [2] cyl &lt;= 4: 26.664 (n = 11, err = 203.4)</span></span>
<span><span class='c'>#&gt; |   [3] cyl &gt; 4</span></span>
<span><span class='c'>#&gt; |   |   [4] cyl &lt;= 6: 19.743 (n = 7, err = 12.7)</span></span>
<span><span class='c'>#&gt; |   |   [5] cyl &gt; 6: 15.100 (n = 14, err = 85.2)</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Number of inner nodes:    2</span></span>
<span><span class='c'>#&gt; Number of terminal nodes: 3</span></span>
<span></span></code></pre>
</div>
<p>Previously, this tree would be turned into the following <code>case_when()</code> statement.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">case_when</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"> <span class="n">cyl</span> <span class="o">&lt;=</span> <span class="m">4</span> <span class="o">~</span> <span class="m">26.6636363636364</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"> <span class="n">cyl</span> <span class="o">&lt;=</span> <span class="m">6</span> <span class="o">&amp;</span> <span class="n">cyl</span> <span class="o">&gt;</span> <span class="m">4</span> <span class="o">~</span> <span class="m">19.7428571428571</span><span class="p">,</span> 
</span></span><span class="line"><span class="cl"> <span class="n">cyl</span> <span class="o">&gt;</span> <span class="m">6</span> <span class="o">&amp;</span> <span class="n">cyl</span> <span class="o">&gt;</span> <span class="m">4</span> <span class="o">~=</span> <span class="m">15.1</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>With this new update, we have taken advantage of the <code>.default</code> argument whenever possible, which should lead to faster predictions, as we no longer need to calculate redundant conditionals.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://tidypredict.tidymodels.org/reference/tidypredict_fit.html'>tidypredict_fit</a></span><span class='o'>(</span><span class='nv'>model</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; case_when(cyl &lt;= 4 ~ 26.6636363636364, cyl &lt;= 6 &amp; cyl &gt; 4 ~ 19.7428571428571, </span></span>
<span><span class='c'>#&gt;     .default = 15.1)</span></span>
<span></span></code></pre>
</div>
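<p>Since the point of these expressions is to run inside a database, it may help to see one translated. The sketch below assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed; it copies the expression shown above by hand rather than calling tidypredict, and the table name <code>mtcars_table</code> is made up for the example.</p>

```r
library(dplyr)
library(dbplyr)

# In-memory SQLite database holding a copy of mtcars
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
mtcars_db <- copy_to(con, mtcars, name = "mtcars_table")

# The expression printed above, written out by hand
tree_expr <- quote(
  case_when(
    cyl <= 4 ~ 26.6636363636364,
    cyl <= 6 & cyl > 4 ~ 19.7428571428571,
    .default = 15.1
  )
)

# Inject the expression into mutate(); dbplyr translates it to SQL
preds_db <- mutate(mtcars_db, .pred = !!tree_expr)
show_query(preds_db)

# Run the query in the database and bring the results back into R
preds <- collect(preds_db)
DBI::dbDisconnect(con)
```

<p>In the translated query, the <code>.default</code> value should appear as the <code>ELSE</code> branch of a <code>CASE WHEN</code> statement, so the database doesn&rsquo;t have to evaluate a third condition.</p>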
<h2 id="glmnet-support">Glmnet support
</h2>
<p>We now support the glmnet package. This package provides generalized linear models with lasso or elasticnet regularization.</p>
<p>The primary restriction when using a glmnet model with tidypredict is that the model must have been fitted with the <code>lambda</code> argument set to a single value.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>model</span> <span class='o'>&lt;-</span> <span class='nf'>glmnet</span><span class='nf'>::</span><span class='nf'><a href='https://glmnet.stanford.edu/reference/glmnet.html'>glmnet</a></span><span class='o'>(</span><span class='nv'>mtcars</span><span class='o'>[</span>, <span class='o'>-</span><span class='m'>1</span><span class='o'>]</span>, <span class='nv'>mtcars</span><span class='o'>$</span><span class='nv'>mpg</span>, lambda <span class='o'>=</span> <span class='m'>0.01</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://tidypredict.tidymodels.org/reference/tidypredict_fit.html'>tidypredict_fit</a></span><span class='o'>(</span><span class='nv'>model</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; 13.0081464696679 + (cyl * -0.0773532164346008) + (disp * 0.00969507138358544) + </span></span>
<span><span class='c'>#&gt;     (hp * -0.0192462098902709) + (drat * 0.816753237688302) + </span></span>
<span><span class='c'>#&gt;     (wt * -3.41564341709663) + (qsec * 0.758580151032383) + (vs * </span></span>
<span><span class='c'>#&gt;     0.277874296242861) + (am * 2.47356523820533) + (gear * 0.645144527527598) + </span></span>
<span><span class='c'>#&gt;     (carb * -0.300886812079305)</span></span>
<span></span></code></pre>
</div>
<p><code>glmnet()</code> computes a collection of models using many sets of penalty values. This can be very efficient, but for tidypredict, we need to predict with a single penalty. Note how, as we increase the penalty, the extracted expression correctly removes terms with coefficients of <code>0</code> instead of leaving them as <code>(disp * 0)</code>.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>model</span> <span class='o'>&lt;-</span> <span class='nf'>glmnet</span><span class='nf'>::</span><span class='nf'><a href='https://glmnet.stanford.edu/reference/glmnet.html'>glmnet</a></span><span class='o'>(</span><span class='nv'>mtcars</span><span class='o'>[</span>, <span class='o'>-</span><span class='m'>1</span><span class='o'>]</span>, <span class='nv'>mtcars</span><span class='o'>$</span><span class='nv'>mpg</span>, lambda <span class='o'>=</span> <span class='m'>1</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://tidypredict.tidymodels.org/reference/tidypredict_fit.html'>tidypredict_fit</a></span><span class='o'>(</span><span class='nv'>model</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; 35.3137765116027 + (cyl * -0.871451193824228) + (hp * -0.0101173960249783) + </span></span>
<span><span class='c'>#&gt;     (wt * -2.59443677687505)</span></span>
<span></span></code></pre>
</div>
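<p>The result is an ordinary R expression, so it can be inspected and evaluated like any other. As a small base R sketch, the expression below is copied by hand from the output above rather than produced by fitting the model again; the object names are made up for the example.</p>

```r
# Expression copied from the lambda = 1 output above
glmnet_expr <- quote(
  35.3137765116027 + (cyl * -0.871451193824228) + (hp * -0.0101173960249783) +
    (wt * -2.59443677687505)
)

# Only the predictors with non-zero coefficients appear in the expression
used_vars <- all.vars(glmnet_expr)
used_vars

# Evaluate the expression with the columns of mtcars as variables
glmnet_pred <- eval(glmnet_expr, envir = mtcars)
head(glmnet_pred)
```

<p>Because only <code>cyl</code>, <code>hp</code>, and <code>wt</code> remain, a database running the translated expression only needs to read those three columns.</p>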
<p>tidypredict is used as the primary parser for models employed by the 






<a href="https://orbital.tidymodels.org/" target="_blank" rel="noopener">orbital</a>
 package. This means that all the changes seen in this post also take effect when using orbital with tidymodels workflows, such as when using 






<a href="https://parsnip.tidymodels.org/reference/linear_reg.html" target="_blank" rel="noopener"><code>parsnip::linear_reg()</code></a>
 with <code>engine = &quot;glmnet&quot;</code>.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>A big thank you to all the folks who helped make this release happen: 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
and 






<a href="https://github.com/jeroenjanssens" target="_blank" rel="noopener">@jeroenjanssens</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-12-10_tidypredict-1-0-0/thumbnail-wd.jpg" length="315661" type="image/jpeg" />
    </item>
    <item>
      <title>Two New tidymodels Packages</title>
      <link>https://opensource.posit.co/blog/2025-11-22_two-new-tidymodels-packages/</link>
      <pubDate>Sat, 22 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-11-22_two-new-tidymodels-packages/</guid>
      <dc:creator>Frances Lin</dc:creator>
      <dc:creator>Max Kuhn</dc:creator><description><![CDATA[
<p>We&rsquo;re very chuffed to announce the release of <em>two</em> new modeling packages: filtro and important.</p>
<p>You can install them from CRAN with:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s">&#34;filtro&#34;</span><span class="p">,</span> <span class="s">&#34;important&#34;</span><span class="p">))</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>This blog post will introduce both.</p>
<h2 id="filtro">filtro
</h2>
<p>Feature selection is an important step in building machine learning models that are robust and reliable. By keeping only the most relevant predictors, we can reduce overfitting, improve model performance, and speed up computation.</p>
<p>






<a href="https://filtro.tidymodels.org/" target="_blank" rel="noopener">filtro</a>
 is a low-level set of tidy tools designed for filter-based supervised feature selection. filtro makes it easy to score, rank, and select features using a wide range of statistical and model-based metrics. The scoring metrics include p-values, correlation, random forest feature importance, information gain, and more.</p>
<p>With filtro, we can quickly rank the variables and select either the top proportion or the top number of features that best contribute to our model. It also supports 






<a href="https://scholar.google.com/scholar?hl=en&amp;as_sdt=0%2C7&amp;q=%22multi-parameter&#43;optimization%22&amp;btnG=" target="_blank" rel="noopener">multi-parameter optimization</a>
 via 






<a href="https://scholar.google.com/scholar?hl=en&amp;as_sdt=0%2C7&amp;q=%22desirability&#43;functions%22&amp;btnG=" target="_blank" rel="noopener">desirability functions</a>
. filtro is a standalone tool, but it integrates with other packages, allowing it to be used within the tidymodels workflows.</p>
<p>Currently, filtro implements a total of six filters. Like other elements of the framework, filtro is also extensible if you want to use a score we haven&rsquo;t implemented yet. You can read more on how to do this on 






<a href="https://www.tidymodels.org/learn/develop/filtro/" target="_blank" rel="noopener">tidymodels.org</a>
.</p>
<p>The available score class objects are:</p>
<div class="code-block"><pre tabindex="0"><code>##  [1] &#34;score_aov_fstat&#34;          &#34;score_aov_pval&#34;          
##  [3] &#34;score_cor_pearson&#34;        &#34;score_cor_spearman&#34;      
##  [5] &#34;score_gain_ratio&#34;         &#34;score_imp_rf&#34;            
##  [7] &#34;score_imp_rf_conditional&#34; &#34;score_imp_rf_oblique&#34;    
##  [9] &#34;score_info_gain&#34;          &#34;score_roc_auc&#34;           
## [11] &#34;score_sym_uncert&#34;         &#34;score_xtab_pval_chisq&#34;   
## [13] &#34;score_xtab_pval_fisher&#34;</code></pre></div>
<p>Let&rsquo;s look at an example. 






<a href="https://www.google.com/search?q=Kuhn&#43;and&#43;Johnson&#43;Applied&#43;Predictive&#43;Modeling&#43;2013" target="_blank" rel="noopener">Kuhn and Johnson (2013)</a>
 described a data set where 176 samples were collected from a chemical manufacturing process. The goal is to predict process yield. Predictors are continuous, count, and categorical; some are correlated, and some contain missing values.</p>
<p>Let’s create an initial split of the data (which are in the modeldata package):</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">filtro</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">yield_split</span> <span class="o">&lt;-</span> <span class="nf">initial_split</span><span class="p">(</span><span class="n">modeldata</span><span class="o">::</span><span class="n">chem_proc_yield</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">yield_split</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## &lt;Training/Testing/Total&gt;
## &lt;132/44/176&gt;</code></pre></div>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">yield_train</span> <span class="o">&lt;-</span> <span class="nf">training</span><span class="p">(</span><span class="n">yield_split</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">yield_test</span> <span class="o">&lt;-</span> <span class="nf">testing</span><span class="p">(</span><span class="n">yield_split</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>We’d like to estimate the strength of the relationship between these 57 predictors and the process yield. We’ll quantify that in two ways. First is the old-fashioned 






<a href="https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient" target="_blank" rel="noopener">Spearman rank correlation</a>
 statistic. We can estimate these values and rank the predictors by the absolute value of their correlations. Second, we can measure predictor importance with a random forest. Because many of these predictors are correlated with one another, there may be some value in using an <em>oblique</em> random forest model, which builds a collection of tree-based models whose splits are linear combinations of the selected predictors.</p>
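<p>As a quick reminder of what the first statistic measures, the Spearman correlation is the Pearson correlation computed on ranks. Here is a small base R check on made-up vectors (not the yield data); the comparison should return <code>TRUE</code>:</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r"># Made-up data for illustration only
x &lt;- c(1.2, 3.4, 2.2, 8.9, 5.1)
y &lt;- c(2.0, 9.5, 12.0, 40.0, 3.1)

# Spearman correlation, computed directly
spearman &lt;- cor(x, y, method = &#34;spearman&#34;)

# The same quantity as a Pearson correlation of the ranks
pearson_on_ranks &lt;- cor(rank(x), rank(y))

all.equal(spearman, pearson_on_ranks)</code></pre></div>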
<p>To estimate the scores, we use the score objects contained in the package along with the <code>fit()</code> method:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">yield_rank_res</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="n">score_cor_spearman</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">yield</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">yield_train</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># The object contains the statistics:</span>
</span></span><span class="line"><span class="cl"><span class="n">yield_rank_res</span><span class="o">@</span><span class="n">results</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">score</span><span class="p">)))</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 57 × 4
##    name          score outcome predictor      
##    &lt;chr&gt;         &lt;dbl&gt; &lt;chr&gt;   &lt;chr&gt;          
##  1 cor_spearman  0.655 yield   man_proc_32    
##  2 cor_spearman -0.537 yield   man_proc_36    
##  3 cor_spearman  0.519 yield   bio_material_03
##  4 cor_spearman  0.502 yield   bio_material_06
##  5 cor_spearman  0.491 yield   man_proc_09    
##  6 cor_spearman  0.478 yield   bio_material_02
##  7 cor_spearman  0.446 yield   man_proc_33    
##  8 cor_spearman  0.421 yield   bio_material_12
##  9 cor_spearman -0.420 yield   man_proc_13    
## 10 cor_spearman  0.412 yield   bio_material_04
## # ℹ 47 more rows</code></pre></div>
<p>To score via a random forest model, we only need to switch out the score object:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">yield_rf_res</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="n">score_imp_rf_oblique</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">yield</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">yield_train</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">yield_rf_res</span><span class="o">@</span><span class="n">results</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">score</span><span class="p">)))</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 57 × 4
##    name            score outcome predictor      
##    &lt;chr&gt;           &lt;dbl&gt; &lt;chr&gt;   &lt;chr&gt;          
##  1 imp_rf_oblique 0.128  yield   man_proc_32    
##  2 imp_rf_oblique 0.0697 yield   man_proc_36    
##  3 imp_rf_oblique 0.0670 yield   man_proc_17    
##  4 imp_rf_oblique 0.0644 yield   man_proc_09    
##  5 imp_rf_oblique 0.0612 yield   man_proc_13    
##  6 imp_rf_oblique 0.0446 yield   bio_material_03
##  7 imp_rf_oblique 0.0315 yield   man_proc_33    
##  8 imp_rf_oblique 0.0263 yield   man_proc_11    
##  9 imp_rf_oblique 0.0263 yield   bio_material_04
## 10 imp_rf_oblique 0.0262 yield   bio_material_06
## # ℹ 47 more rows</code></pre></div>
<p>We should probably combine the scores and do a joint ranking. To combine the two sets of statistics:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">class_score_list</span> <span class="o">&lt;-</span> <span class="nf">list</span><span class="p">(</span><span class="n">yield_rank_res</span><span class="p">,</span> <span class="n">yield_rf_res</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">bind_scores</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">class_score_list</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 57 × 4
##    outcome predictor       cor_spearman imp_rf_oblique
##    &lt;chr&gt;   &lt;chr&gt;                  &lt;dbl&gt;          &lt;dbl&gt;
##  1 yield   bio_material_01        0.404        0.0178 
##  2 yield   bio_material_02        0.478        0.0190 
##  3 yield   bio_material_03        0.519        0.0446 
##  4 yield   bio_material_04        0.412        0.0263 
##  5 yield   bio_material_05        0.116        0.00639
##  6 yield   bio_material_06        0.502        0.0262 
##  7 yield   bio_material_07       -0.101        0.00151
##  8 yield   bio_material_08        0.369        0.00714
##  9 yield   bio_material_09        0.109        0.0122 
## 10 yield   bio_material_10        0.214        0.00998
## # ℹ 47 more rows</code></pre></div>
<p>We can accomplish a joint ranking via desirability functions. Here, we set goals for each score (i.e., maximize, minimize, etc.). The algorithm rescales their values and uses a geometric mean for an overall ranking. The desirability2 package has some nice tools for this. Here&rsquo;s how we do it:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">desirability2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">class_score_list</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">show_best_desirability_prop</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">maximize</span><span class="p">(</span><span class="n">cor_spearman</span><span class="p">,</span> <span class="n">low</span> <span class="o">=</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">high</span> <span class="o">=</span> <span class="m">1</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">maximize</span><span class="p">(</span><span class="n">imp_rf_oblique</span><span class="p">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">.d_overall</span><span class="p">))</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="o">-</span><span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;.d_max_&#34;</span><span class="p">))</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 57 × 5
##    outcome predictor       cor_spearman imp_rf_oblique .d_overall
##    &lt;chr&gt;   &lt;chr&gt;                  &lt;dbl&gt;          &lt;dbl&gt;      &lt;dbl&gt;
##  1 yield   man_proc_32            0.655         0.128      0.735 
##  2 yield   man_proc_09            0.491         0.0644     0.291 
##  3 yield   bio_material_03        0.519         0.0446     0.217 
##  4 yield   man_proc_33            0.446         0.0315     0.134 
##  5 yield   bio_material_06        0.502         0.0262     0.129 
##  6 yield   bio_material_04        0.412         0.0263     0.104 
##  7 yield   bio_material_02        0.478         0.0190     0.0926
##  8 yield   bio_material_01        0.404         0.0178     0.0719
##  9 yield   bio_material_11        0.381         0.0194     0.0714
## 10 yield   man_proc_12            0.391         0.0183     0.0705
## # ℹ 47 more rows</code></pre></div>
<p>Using the <code>scale = 2</code> option puts more weight on the random forest results.</p>
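<p>To make the rescaling concrete, here is a minimal base R sketch of the arithmetic for the top-ranked predictor. It is an illustration rather than the desirability2 implementation: the helper <code>d_maximize()</code> is hypothetical, and we assume a linear ramp between <code>low</code> and <code>high</code>, with the unspecified limits for <code>imp_rf_oblique</code> taken from the observed range (so the largest importance gets a desirability of 1).</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r"># Hypothetical helper: linear &#34;maximize&#34; desirability, raised to `scale`
d_maximize &lt;- function(x, low, high, scale = 1) {
  d &lt;- (x - low) / (high - low)
  d &lt;- pmin(pmax(d, 0), 1) # clamp to [0, 1]
  d^scale
}

# Rounded values printed above for man_proc_32
d_cor &lt;- d_maximize(0.655, low = 0.25, high = 1)

# Assumption: 0.128 is the largest importance, so its desirability is 1
d_imp &lt;- 1

# The overall score is the geometric mean of the two desirabilities
d_overall &lt;- sqrt(d_cor * d_imp)
d_overall

# With scale = 2, a half-way value is squared and so counts for less
d_maximize(0.5, low = 0, high = 1, scale = 2)</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## [1] 0.7348469
## [1] 0.25</code></pre></div>
<p>Under these assumptions, the first value agrees with the <code>.d_overall</code> of 0.735 reported for <code>man_proc_32</code>.</p>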
<p>It is unlikely that users will work with filtro directly; it is much better to incorporate these feature selection tools inside a model workflow (as we will see below).</p>
<p>Now that we&rsquo;ve looked at filtro, next up is the important package (yes, this is what we named it).</p>
<h2 id="important">important
</h2>
<p>The 






<a href="https://important.tidymodels.org/" target="_blank" rel="noopener">important</a>
 package does two things. First, it provides yet another tool for calculating random forest-like permutation importance scores. We highly value other packages that perform these same calculations (such as 






<a href="https://modeloriented.github.io/DALEX/" target="_blank" rel="noopener">DALEX</a>
 and 






<a href="https://github.com/koalaverse/vip/" target="_blank" rel="noopener">vip</a>
). Our rationale for creating another package for this is that we&rsquo;ve developed interfaces for censored regression, including dynamic metrics such as Brier scores or ROC curves that evaluate models at a specific time point. These dynamic methods aren&rsquo;t available in other packages, and the peculiarities of these metrics make them difficult to incorporate into existing frameworks.</p>
<p>Other niceties about importance scores are that any metric from the yardstick package can be used, and we have optimized parallel processing for the underlying computations. For the latter feature, we support the future and mirai packages for parallel processing.</p>
<p>important also has three recipe steps for supervised feature selection (similar to what Steven Pawley did with his 






<a href="https://stevenpawley.github.io/colino/" target="_blank" rel="noopener">colino package</a>
). The steps are:</p>
<ul>
<li>






<a href="https://important.tidymodels.org/reference/step_predictor_best.html" target="_blank" rel="noopener"><code>step_predictor_best()</code></a>
</li>
<li>






<a href="https://important.tidymodels.org/reference/step_predictor_retain.html" target="_blank" rel="noopener"><code>step_predictor_retain()</code></a>
</li>
<li>






<a href="https://important.tidymodels.org/reference/step_predictor_desirability.html" target="_blank" rel="noopener"><code>step_predictor_desirability()</code></a>
</li>
</ul>
<p>Let&rsquo;s look at the last one, which mirrors our analysis above.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">important</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">goals</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="nf">desirability</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">maximize</span><span class="p">(</span><span class="n">cor_spearman</span><span class="p">,</span> <span class="n">low</span> <span class="o">=</span> <span class="m">0.25</span><span class="p">,</span> <span class="n">high</span> <span class="o">=</span> <span class="m">1</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">    <span class="nf">maximize</span><span class="p">(</span><span class="n">imp_rf_oblique</span><span class="p">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">yield_rec</span> <span class="o">&lt;-</span>
</span></span><span class="line"><span class="cl">  <span class="nf">recipe</span><span class="p">(</span><span class="n">yield</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">yield_train</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_impute_knn</span><span class="p">(</span><span class="nf">all_predictors</span><span class="p">(),</span> <span class="n">neighbors</span> <span class="o">=</span> <span class="m">10</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_predictor_desirability</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nf">all_predictors</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">    <span class="n">score</span> <span class="o">=</span> <span class="n">goals</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">prop_terms</span> <span class="o">=</span> <span class="m">1</span> <span class="o">/</span> <span class="m">10</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">yield_rec</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## </code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## ── Recipe ───────────────────────────────────────────────────────</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## </code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## ── Inputs</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## Number of variables by role</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## outcome:    1
## predictor: 57</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## </code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## ── Operations</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## • K-nearest neighbor imputation for: all_predictors()</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## • Feature selection via desirability functions (`cor_spearman`
##   and `imp_rf_oblique`) on: all_predictors()</code></pre></div>
<p>When combined with a specific model, we can tune the number of neighbors as well as the proportion of predictors retained (10% above).</p>
<p><code>prep()</code> will do the appropriate estimation steps:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">trained_rec</span> <span class="o">&lt;-</span> <span class="nf">prep</span><span class="p">(</span><span class="n">yield_rec</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Which 10% of the predictors were retained? The <code>tidy()</code> method can list the scores and their rankings:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">scores</span> <span class="o">&lt;-</span> <span class="nf">tidy</span><span class="p">(</span><span class="n">trained_rec</span><span class="p">,</span> <span class="n">number</span> <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">scores</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="nf">desc</span><span class="p">(</span><span class="n">.d_overall</span><span class="p">))</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="o">-</span><span class="nf">starts_with</span><span class="p">(</span><span class="s">&#34;.d_max_&#34;</span><span class="p">),</span> <span class="o">-</span><span class="n">id</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 57 × 5
##    terms           removed cor_spearman imp_rf_oblique .d_overall
##    &lt;chr&gt;           &lt;lgl&gt;          &lt;dbl&gt;          &lt;dbl&gt;      &lt;dbl&gt;
##  1 man_proc_32     FALSE          0.655         0.128       0.735
##  2 man_proc_36     FALSE         -0.530         0.0668      0.325
##  3 man_proc_09     FALSE          0.491         0.0673      0.304
##  4 man_proc_13     FALSE         -0.420         0.0725      0.275
##  5 bio_material_03 FALSE          0.519         0.0517      0.249
##  6 bio_material_06 TRUE           0.502         0.0445      0.210
##  7 man_proc_17     TRUE          -0.303         0.0749      0.158
##  8 man_proc_33     TRUE           0.443         0.0374      0.156
##  9 bio_material_02 TRUE           0.478         0.0330      0.151
## 10 bio_material_04 TRUE           0.412         0.0347      0.133
## # ℹ 47 more rows</code></pre></div>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="c1"># What percentage was removed?</span>
</span></span><span class="line"><span class="cl"><span class="nf">mean</span><span class="p">(</span><span class="n">scores</span><span class="o">$</span><span class="n">removed</span> <span class="o">*</span> <span class="m">100</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## [1] 91.22807</code></pre></div>
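<p>For reference, that figure corresponds to 52 of the 57 predictors being removed, which leaves the five rows flagged <code>removed = FALSE</code> in the table above. The same arithmetic by hand:</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r">n_predictors &lt;- 57
n_removed &lt;- 52 # the five top-ranked predictors are kept

pct_removed &lt;- 100 * n_removed / n_predictors
pct_removed</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## [1] 91.22807</code></pre></div>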
<h2 id="summary">Summary
</h2>
<p>Both filtro and important deliver a feature that has been highly ranked in our tidymodels user surveys: supervised feature selection. filtro contains the underlying framework, and important provides recipe steps that can be used in a workflow.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-11-22_two-new-tidymodels-packages/thumbnail-wd.jpg" length="97105" type="image/jpeg" />
    </item>
    <item>
      <title>Q3 2025 tidymodels digest</title>
      <link>https://opensource.posit.co/blog/2025-11-18_tidymodels-2025-q3/</link>
      <pubDate>Tue, 18 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-11-18_tidymodels-2025-q3/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[
<p>The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.</p>
<p>Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what&rsquo;s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused.</p>
<p>Since our last update we have had some larger releases that you can read about in these posts.</p>
<ul>
<li>






  
  

<a href="https://opensource.posit.co/blog/2025-11-05_tune-2/">tune 2.0.0</a>
</li>
<li>






  
  

<a href="https://opensource.posit.co/blog/2025-04-28_recipes-1-3-0/">recipes 1.3.0</a>
</li>
<li>






  
  

<a href="https://opensource.posit.co/blog/2025-04-03_rsample-1-3-0/">rsample 1.3.0</a>
</li>
<li>






  
  

<a href="https://opensource.posit.co/blog/2025-03-19_tidymodels-sparsity/">improved sparsity support in tidymodels</a>
</li>
</ul>
<p>This post will update you on which packages have changed and on the improvements you should know about that haven&rsquo;t been covered in the above posts.</p>
<p>Here&rsquo;s a list of the packages and their News sections:</p>
<ul>
<li>






<a href="https://dials.tidymodels.org/news/index.html" target="_blank" rel="noopener">dials</a>
</li>
<li>






<a href="https://parsnip.tidymodels.org/news/index.html" target="_blank" rel="noopener">parsnip</a>
</li>
<li>






<a href="https://rsample.tidymodels.org/news/index.html" target="_blank" rel="noopener">rsample</a>
</li>
<li>






<a href="https://recipes.tidymodels.org/news/index.html" target="_blank" rel="noopener">recipes</a>
</li>
<li>






<a href="https://probably.tidymodels.org/news/index.html" target="_blank" rel="noopener">probably</a>
</li>
<li>






<a href="https://brulee.tidymodels.org/news/index.html" target="_blank" rel="noopener">brulee</a>
</li>
</ul>
<p>Let&rsquo;s look at a few specific updates.</p>
<h2 id="quiet-linear-svm-models">Quiet linear SVM models
</h2>
<p>Previously, fitting a linear SVM model produced a message that you could not suppress.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">parsnip</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">modeldata</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">res</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">svm_linear</span><span class="p">(</span><span class="n">mode</span> <span class="o">=</span> <span class="s">&#34;classification&#34;</span><span class="p">,</span> <span class="n">engine</span> <span class="o">=</span> <span class="s">&#34;kernlab&#34;</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">Class</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">two_class_dat</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;  Setting default kernel parameters</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>This message was not very useful on its own, and there was no reasonable way to turn it off. We have silenced it to reduce some of the noise that came from using this method.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://github.com/tidymodels/parsnip'>parsnip</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://modeldata.tidymodels.org'>modeldata</a></span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Attaching package: 'modeldata'</span></span>
<span></span><span><span class='c'>#&gt; The following object is masked from 'package:datasets':</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt;     penguins</span></span>
<span></span><span></span>
<span><span class='nv'>res</span> <span class='o'>&lt;-</span> </span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/svm_linear.html'>svm_linear</a></span><span class='o'>(</span>mode <span class='o'>=</span> <span class='s'>"classification"</span>, engine <span class='o'>=</span> <span class='s'>"kernlab"</span><span class='o'>)</span> <span class='o'>|&gt;</span> </span>
<span>  <span class='nf'><a href='https://generics.r-lib.org/reference/fit.html'>fit</a></span><span class='o'>(</span><span class='nv'>Class</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>two_class_dat</span><span class='o'>)</span></span>
<span><span class='nv'>res</span></span>
<span><span class='c'>#&gt; parsnip model object</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Support Vector Machine object of class "ksvm" </span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; SV type: C-svc  (classification) </span></span>
<span><span class='c'>#&gt;  parameter : cost C = 1 </span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Linear (vanilla) kernel function. </span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Number of Support Vectors : 361 </span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Objective Function Value : -357.1487 </span></span>
<span><span class='c'>#&gt; Training error : 0.178255 </span></span>
<span><span class='c'>#&gt; Probability model included.</span></span>
<span></span></code></pre>
</div>
<h2 id="fewer-numeric-overflow-issues-in-brulee">Fewer numeric overflow issues in brulee
</h2>
<p>The brulee package has been improved to help avoid numeric overflow in the loss functions. The following changes were made to deal with this type of issue.</p>
<ul>
<li>
<p>Starting values are now drawn from a Gaussian distribution (instead of a uniform one) with a smaller standard deviation.</p>
</li>
<li>
<p>The results always contain the initial results to use as a fallback if there is overflow during the first epoch.</p>
</li>
<li>
<p><code>brulee_mlp()</code> has two additional parameters, <code>grad_value_clip</code> and <code>grad_norm_clip</code>, that help prevent these issues.</p>
</li>
<li>
<p>The warning was changed to &ldquo;Early stopping occurred at epoch {X} due to numerical overflow of the loss function.&rdquo;</p>
</li>
</ul>
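<p>As a rough illustration of the two usual kinds of gradient clipping, by value and by norm, here is a base R sketch. It is not brulee&rsquo;s internal code; the helpers <code>clip_by_value()</code> and <code>clip_by_norm()</code> and the example gradient are hypothetical.</p>
<div class="code-block"><pre tabindex="0"><code># Hypothetical helpers, for illustration only (not part of brulee or torch)
clip_by_value &lt;- function(grad, limit) {
  # Cap each element separately to the interval [-limit, limit]
  pmin(pmax(grad, -limit), limit)
}

clip_by_norm &lt;- function(grad, max_norm) {
  # Rescale the whole vector if its L2 norm exceeds max_norm
  nrm &lt;- sqrt(sum(grad^2))
  if (nrm &gt; max_norm) grad * (max_norm / nrm) else grad
}

# A made-up gradient with one very large element
grad &lt;- c(0.5, -30, 4)

clip_by_value(grad, limit = 5)
clip_by_norm(grad, max_norm = 1)</code></pre></div>
<p>Clipping by value caps each element on its own, while clipping by norm shrinks the whole vector and keeps its direction. Either way, a single extreme gradient can no longer push the parameters far enough to overflow the loss.</p>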
<h2 id="additional-torch-optimizers-in-brulee">Additional torch optimizers in brulee
</h2>
<p>Several additional optimizers have been added: <code>&quot;ADAMw&quot;</code>, <code>&quot;Adadelta&quot;</code>, <code>&quot;Adagrad&quot;</code>, and <code>&quot;RMSprop&quot;</code>. Previously, the options were <code>&quot;SGD&quot;</code> and <code>&quot;LBFGS&quot;</code>.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>We want to sincerely thank everyone who contributed to these packages since their previous versions:</p>
<ul>
<li>dials: 






<a href="https://github.com/brendad8" target="_blank" rel="noopener">@brendad8</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, and 






<a href="https://github.com/Wander03" target="_blank" rel="noopener">@Wander03</a>
.</li>
<li>parsnip: 






<a href="https://github.com/chillerb" target="_blank" rel="noopener">@chillerb</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/jmgirard" target="_blank" rel="noopener">@jmgirard</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, and 






<a href="https://github.com/ZWael" target="_blank" rel="noopener">@ZWael</a>
.</li>
<li>rsample: 






<a href="https://github.com/abichat" target="_blank" rel="noopener">@abichat</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/mkiang" target="_blank" rel="noopener">@mkiang</a>
, and 






<a href="https://github.com/vincentarelbundock" target="_blank" rel="noopener">@vincentarelbundock</a>
.</li>
<li>recipes: 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/SimonDedman" target="_blank" rel="noopener">@SimonDedman</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</li>
<li>probably: 






<a href="https://github.com/abichat" target="_blank" rel="noopener">@abichat</a>
, 






<a href="https://github.com/ayueme" target="_blank" rel="noopener">@ayueme</a>
, 






<a href="https://github.com/dchiu911" target="_blank" rel="noopener">@dchiu911</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/frankiethull" target="_blank" rel="noopener">@frankiethull</a>
, 






<a href="https://github.com/gaborcsardi" target="_blank" rel="noopener">@gaborcsardi</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/Jeffrothschild" target="_blank" rel="noopener">@Jeffrothschild</a>
, 






<a href="https://github.com/jgaeb" target="_blank" rel="noopener">@jgaeb</a>
, 






<a href="https://github.com/jrwinget" target="_blank" rel="noopener">@jrwinget</a>
, 






<a href="https://github.com/mark-burdon" target="_blank" rel="noopener">@mark-burdon</a>
, 






<a href="https://github.com/martinhulin" target="_blank" rel="noopener">@martinhulin</a>
, 






<a href="https://github.com/simonpcouch" target="_blank" rel="noopener">@simonpcouch</a>
, 






<a href="https://github.com/teunbrand" target="_blank" rel="noopener">@teunbrand</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, 






<a href="https://github.com/wjakethompson" target="_blank" rel="noopener">@wjakethompson</a>
, and 






<a href="https://github.com/yellowbridge" target="_blank" rel="noopener">@yellowbridge</a>
.</li>
<li>brulee: 






<a href="https://github.com/genec1" target="_blank" rel="noopener">@genec1</a>
, 






<a href="https://github.com/talegari" target="_blank" rel="noopener">@talegari</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</li>
</ul>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-11-18_tidymodels-2025-q3/thumbnail-wd.jpg" length="838271" type="image/jpeg" />
    </item>
    <item>
      <title>tune version 2.0.0</title>
      <link>https://opensource.posit.co/blog/2025-11-05_tune-2/</link>
      <pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-11-05_tune-2/</guid>
      <dc:creator>Max Kuhn</dc:creator>
      <dc:creator>Simon Couch</dc:creator>
      <dc:creator>Emil Hvitfeldt</dc:creator>
      <dc:creator>Hannah Frick</dc:creator><description><![CDATA[<p>We&rsquo;re very chuffed to announce the release of 






<a href="https://tune.tidymodels.org" target="_blank" rel="noopener">tune</a>
 <strong>2.0.0</strong>. tune is a package that can be used to resample models and/or optimize their tuning parameters.</p>
<p>You can install it from CRAN with:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;tune&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>This blog post will describe the two major updates to the package. You can see a full list of changes in the 


  
  
  





<a href="https://tune.tidymodels.org/news/index.html#tune-200" target="_blank" rel="noopener">release notes</a>
.</p>
<p>The two big improvements to the package are new parallel processing features and postprocessing.</p>
<h2 id="using-future-or-mirai-for-parallel-processing">Using future or mirai for parallel processing
</h2>
<p>


  
  
  





  
  

<a href="https://opensource.posit.co/blog/2024-04-18_tune-1-2-0/#modernized-support-for-parallel-processing">Historically</a>
, we&rsquo;ve used the foreach package to run calculations in parallel. Sadly, that package is no longer under active development. We&rsquo;ve been 


  
  
  





  
  

<a href="https://opensource.posit.co/blog/2024-04-18_tune-1-2-0/#modernized-support-for-parallel-processing">progressively moving away</a>
 from it, and as of this version, it is deprecated. In its place, we&rsquo;ve added functionality for the 






<a href="https://future.futureverse.org/" target="_blank" rel="noopener">future</a>
 and 






<a href="https://mirai.r-lib.org/" target="_blank" rel="noopener">mirai</a>
 packages.</p>
<p>Previously, you would load a foreach parallel backend package, such as doParallel, doMC, or doFuture, and then register it. For example:</p>
<div class="code-block"><pre tabindex="0"><code>library(doParallel)
cl &lt;- makePSOCKcluster()
registerDoParallel(cl)</code></pre></div>
<p>Instead, you can use the future package via:</p>
<div class="code-block"><pre tabindex="0"><code>library(future)
plan(&#34;multisession&#34;)</code></pre></div>
<p>or the mirai package by using</p>
<div class="code-block"><pre tabindex="0"><code>library(mirai)
# num_cores: the number of worker processes to start
daemons(num_cores)</code></pre></div>
<p>Each of these is configurable to run in various ways, such as on remote servers.</p>
<p>


  
  
  





<a href="https://tune.tidymodels.org/articles/extras/optimizations.html#foreach-legacy" target="_blank" rel="noopener">tidymodels.org</a>
 and the tune 






<a href="https://tune.tidymodels.org/reference/parallelism.html" target="_blank" rel="noopener">pkgdown site</a>
 have more information to help users switch away from foreach.</p>
<h2 id="tuning-your-postprocessor">Tuning your postprocessor
</h2>
<p>A postprocessor is an operation that modifies model predictions. For example, if your classifier can separate classes but its probability estimates are not accurate enough, you can add a <em>calibrator</em> operation that attempts to adjust those probability estimates. Another good example is for binary classifiers, where the default threshold for classifying a prediction as an event can be adjusted based on its corresponding probability estimate.</p>
<p>Currently, we&rsquo;ve enabled postprocessing using the 






  
  

<a href="https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/">tailor package</a>
. The operations currently available are:</p>
<ul>
<li><code>adjust_numeric_calibration()</code>: Estimate and apply a calibration model for regression problems.</li>
<li><code>adjust_numeric_range()</code>: Truncate the range of predictions.</li>
<li><code>adjust_probability_calibration()</code>: Estimate and apply a calibration model for classification problems.</li>
<li><code>adjust_probability_threshold()</code>: Convert binary class probabilities to hard class predictions using different thresholds.</li>
<li><code>adjust_equivocal_zone()</code>: <em>Decline</em> to predict a sample if its strongest class probability is low.</li>
<li><code>adjust_predictions_custom()</code>: A general <code>mutate()</code>-like adjustment.</li>
</ul>
<p>If the operations have arguments, these can be tuned in the same way as the preprocessors (e.g., a recipe) or the supervised model. For example, let&rsquo;s tune the probability threshold for a random forest classifier.</p>
<p>We&rsquo;ll simulate some data with a class imbalance:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">296</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sim_data</span> <span class="o">&lt;-</span> <span class="nf">sim_classification</span><span class="p">(</span><span class="m">2000</span><span class="p">,</span> <span class="n">intercept</span> <span class="o">=</span> <span class="m">-12</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">sim_data</span> <span class="o">|&gt;</span> <span class="nf">count</span><span class="p">(</span><span class="n">class</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 2 × 2
##   class       n
##   &lt;fct&gt;   &lt;int&gt;
## 1 class_1   234
## 2 class_2  1766</code></pre></div>
<p>We&rsquo;ll resample them via 10-fold cross-validation:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">sim_rs</span> <span class="o">&lt;-</span> <span class="nf">vfold_cv</span><span class="p">(</span><span class="n">sim_data</span><span class="p">,</span> <span class="n">strata</span> <span class="o">=</span> <span class="n">class</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>We define a tailor object that tags the class probability threshold for optimization:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">tlr_spec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">tailor</span><span class="p">()</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">adjust_probability_threshold</span><span class="p">(</span><span class="n">threshold</span> <span class="o">=</span> <span class="nf">tune</span><span class="p">())</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>We also specify a random forest that uses its default tuning parameters:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">rf_spec</span> <span class="o">&lt;-</span> <span class="nf">rand_forest</span><span class="p">(</span><span class="n">mode</span> <span class="o">=</span> <span class="s">&#34;classification&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">rf_thrsh_wflow</span> <span class="o">&lt;-</span> <span class="nf">workflow</span><span class="p">(</span><span class="n">class</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">rf_spec</span><span class="p">,</span> <span class="n">tlr_spec</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">rf_thrsh_wflow</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## ══ Workflow ════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: rand_forest()
## Postprocessor: tailor
## 
## ── Preprocessor ────────────────────────────────────────────────────────
## class ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger 
## 
## 
## ── Postprocessor ───────────────────────────────────────────────────────
## 
## ── tailor ──────────────────────────────────────────────────────────────
## A binary postprocessor with 1 adjustment:
## 
## • Adjust probability threshold to optimized value.</code></pre></div>
<p>With a class imbalance, the default 50% threshold yields high specificity but low sensitivity. When we alter the threshold, those numbers will change, and we can select the best trade-off for our application. Let&rsquo;s tune the workflow:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">cls_mtr</span> <span class="o">&lt;-</span> <span class="nf">metric_set</span><span class="p">(</span><span class="n">brier_class</span><span class="p">,</span> <span class="n">sensitivity</span><span class="p">,</span> <span class="n">specificity</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># To run all resamples in parallel:</span>
</span></span><span class="line"><span class="cl"><span class="n">mirai</span><span class="o">::</span><span class="nf">daemons</span><span class="p">(</span><span class="m">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">set.seed</span><span class="p">(</span><span class="m">985</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">rf_thrsh_res</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="n">rf_thrsh_wflow</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">tune_grid</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">resamples</span> <span class="o">=</span> <span class="n">sim_rs</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">grid</span> <span class="o">=</span> <span class="nf">tibble</span><span class="p">(</span><span class="n">threshold</span> <span class="o">=</span> <span class="nf">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0.6</span><span class="p">,</span> <span class="n">by</span> <span class="o">=</span> <span class="m">0.01</span><span class="p">)),</span>
</span></span><span class="line"><span class="cl">    <span class="n">metrics</span> <span class="o">=</span> <span class="n">cls_mtr</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Let&rsquo;s visualize the results:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">autoplot</span><span class="p">(</span><span class="n">rf_thrsh_res</span><span class="p">)</span> <span class="o">+</span> <span class="nf">lims</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="m">0</span><span class="o">:</span><span class="m">1</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2025-11-05_tune-2/figure/autoplot-1.png"
      alt="plot of chunk autoplot" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>The plot shows that we can improve sensitivity by <em>reducing</em> the threshold. Specificity decays slowly compared to the gain in sensitivity until the threshold drops below 10%. The Brier score is constant over the threshold since it only uses the estimated class probabilities, which are unaffected by the threshold.</p>
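<p>To make the trade-off concrete, here is a small base R sketch that computes sensitivity and specificity by hand at a few thresholds. It is separate from the tuning results; the data are made up and the helper <code>sens_spec()</code> is hypothetical.</p>
<div class="code-block"><pre tabindex="0"><code># Made-up data: 3 events (1) and 7 non-events (0),
# each with a predicted probability of being an event
truth &lt;- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob  &lt;- c(0.70, 0.35, 0.15, 0.40, 0.20, 0.10, 0.08, 0.05, 0.03, 0.01)

# Hypothetical helper: hard predictions at a threshold, then both metrics
sens_spec &lt;- function(threshold) {
  pred &lt;- as.integer(prob &gt;= threshold)
  c(
    threshold   = threshold,
    sensitivity = sum(pred == 1 &amp; truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 &amp; truth == 0) / sum(truth == 0)
  )
}

# One row per threshold
t(sapply(c(0.5, 0.3, 0.1), sens_spec))</code></pre></div>
<p>In this toy example, lowering the threshold from 0.5 to 0.1 raises sensitivity from 1/3 to 1, while specificity falls from 1 to 4/7. The predicted probabilities themselves never change.</p>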
<p>We&rsquo;ve taken great pains to avoid redundant calculations. In this example, for each resample, a single random forest model is trained, and then the postprocessing grid is evaluated. This <em>conditional execution</em> strategy is used to fit the fewest possible preprocessors, models, and postprocessors.</p>
<p>For this classification example, recent updates to the 


  
  
  





<a href="https://desirability2.tidymodels.org/#using-with-the-tune-package" target="_blank" rel="noopener">desirability2</a>
 package can enable you to jointly find the best sensitivity/specificity trade-off using the threshold parameter <em>and</em> model calibration/separation using other parameters.</p>
<p>We&rsquo;ll add more examples and tutorials to tidymodels.org to showcase what we can do with postprocessing.</p>
<h2 id="whats-next">What&rsquo;s next
</h2>
<p>This release was a race towards posit::conf(2025). Our focus had to be on its two big features (since we taught workshops that use them). There are a few other relatively minor issues to address as the year closes.</p>
<p>One is to swap the package that we currently use for Gaussian Processes in Bayesian optimization from the GPfit package to the 






<a href="https://github.com/CollinErickson/GauPro" target="_blank" rel="noopener">GauPro</a>
 package. The former is not actively supported, and the latter has a few features that we&rsquo;d love to have, specifically better kernel methods for non-numeric tuning parameters (e.g., the type of activation function used in neural networks). Hopefully, we&rsquo;ll have another planned release before the end of the year.</p>
<p>Another near-future development goal is to have comprehensive integration for quantile regression models. We&rsquo;ve added a few parsnip engines already and will expand the support in yardstick and tune.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>We&rsquo;d like to thank everyone who contributed since the previous version: 






<a href="https://github.com/3styleJam" target="_blank" rel="noopener">@3styleJam</a>
, 






<a href="https://github.com/Diyar0D" target="_blank" rel="noopener">@Diyar0D</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/MatthieuStigler" target="_blank" rel="noopener">@MatthieuStigler</a>
, 






<a href="https://github.com/MattJEM" target="_blank" rel="noopener">@MattJEM</a>
, 






<a href="https://github.com/mthulin" target="_blank" rel="noopener">@mthulin</a>
, 






<a href="https://github.com/tjburch" target="_blank" rel="noopener">@tjburch</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-11-05_tune-2/thumbnail-wd.jpg" length="359031" type="image/jpeg" />
    </item>
    <item>
      <title>mall 0.2.0</title>
      <link>https://opensource.posit.co/blog/2025-08-19_edgarmall02/</link>
      <pubDate>Tue, 19 Aug 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-08-19_edgarmall02/</guid>
      <dc:creator>Edgar Ruiz</dc:creator><description><![CDATA[<p>






<a href="https://mlverse.github.io/mall/" target="_blank" rel="noopener">mall</a>
 uses Large Language Models (LLM) to run
Natural Language Processing (NLP) operations against your data. This package
is available for both R and Python. Version 0.2.0 has been released to







<a href="https://cran.r-project.org/web/packages/mall/index.html" target="_blank" rel="noopener">CRAN</a>
 and







<a href="https://pypi.org/project/mlverse-mall/" target="_blank" rel="noopener">PyPI</a>
 respectively.</p>
<p>In R, you can install the latest version with:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;mall&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>In Python, with:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">pip</span> <span class="n">install</span> <span class="n">mlverse</span><span class="o">-</span><span class="n">mall</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>This release expands the number of LLM providers you can use with <code>mall</code>. Also,
in Python it introduces the option to run the NLP operations over string vectors,
and in R, it enables support for &lsquo;parallelized&rsquo; requests.</p>
<p>It is also very exciting to announce a brand new cheatsheet for this package. It
is available in print (PDF) and HTML format!</p>
<h2 id="more-llm-providers">More LLM providers
</h2>
<p>The biggest highlight of this release is the ability to use external LLM
providers such as 






<a href="https://openai.com/" target="_blank" rel="noopener">OpenAI</a>
, 






<a href="https://gemini.google.com/" target="_blank" rel="noopener">Gemini</a>

and 






<a href="https://www.anthropic.com/" target="_blank" rel="noopener">Anthropic</a>
. Instead of writing an integration for
each provider one by one, <code>mall</code> uses specialized integration packages to act as
intermediaries.</p>
<p>In R, <code>mall</code> uses the 






<a href="https://ellmer.tidyverse.org/index.html" target="_blank" rel="noopener"><code>ellmer</code></a>
 package
to integrate with 


  
  
  





<a href="https://ellmer.tidyverse.org/reference/index.html#chatbots" target="_blank" rel="noopener">a variety of LLM providers</a>
.
To access the new feature, first create a chat connection, and then pass that
connection to <code>llm_use()</code>. Here is an example of connecting and using OpenAI:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">install.packages</span><span class="p">(</span><span class="s">&#34;ellmer&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">mall</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">ellmer</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">&lt;-</span> <span class="nf">chat_openai</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Using model = &#34;gpt-4.1&#34;.</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">llm_use</span><span class="p">(</span><span class="n">chat</span><span class="p">,</span> <span class="n">.cache</span> <span class="o">=</span> <span class="s">&#34;_my_cache&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── mall session object </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Backend: ellmerLLM session: model:gpt-4.1R session: cache_folder:_my_cache</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>In Python, <code>mall</code> uses 






<a href="https://posit-dev.github.io/chatlas/" target="_blank" rel="noopener"><code>chatlas</code></a>
 as
the integration point with the LLM. <code>chatlas</code> also integrates with



  
  
  





<a href="https://posit-dev.github.io/chatlas/reference/#chat-model-providers" target="_blank" rel="noopener">several LLM providers</a>
.
To use, first instantiate a <code>chatlas</code> chat connection class, and then pass that
to the 






<a href="https://pola.rs/" target="_blank" rel="noopener">Polars</a>
 data frame via the <code>&lt;DF&gt;.llm.use()</code> function:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># In a terminal first: pip install chatlas</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">mall</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">chatlas</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">data</span> <span class="o">=</span> <span class="n">mall</span><span class="o">.</span><span class="n">MallData</span>
</span></span><span class="line"><span class="cl"><span class="n">reviews</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">reviews</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">reviews</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="n">chat</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; {&#39;backend&#39;: &#39;chatlas&#39;, &#39;chat&#39;: &lt;Chat OpenAI/gpt-4.1 turns=0 tokens=0/0 $0.0&gt;</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; , &#39;_cache&#39;: &#39;_mall_cache&#39;}</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Connecting <code>mall</code> to external LLM providers introduces a consideration of cost.
Most providers charge for the use of their API, so processing a large table
with long texts could be expensive.</p>
<h2 id="parallel-requests-r-only">Parallel requests (R only)
</h2>
<p>A new feature introduced in 






  
  

<a href="https://opensource.posit.co/blog/2025-07-25_ellmer-0-3-0/"><code>ellmer</code> 0.3.0</a>

makes it possible to submit multiple prompts in parallel, rather than in sequence.
This makes it faster, and potentially cheaper, to process a table. If the provider
supports this feature, <code>ellmer</code> is able to leverage it via the







<a href="https://ellmer.tidyverse.org/reference/parallel_chat.html" target="_blank" rel="noopener"><code>parallel_chat()</code></a>

function. Gemini and OpenAI support the feature.</p>
<p>In the new release of <code>mall</code>, the integration with <code>ellmer</code> has been specially
written to take advantage of parallel chat. The internals have been rewritten to
submit the NLP-specific instructions as a system message in order to
reduce the size of each prompt. Additionally, the cache system has been
re-tooled to support batched requests.</p>
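<p>To illustrate the general idea of a cache that cooperates with batched requests, here is a minimal Python sketch. It is not how <code>mall</code> or <code>ellmer</code> implement it: <code>fake_batch_llm()</code> and <code>cached_batch()</code> are hypothetical names, and the backend is a stand-in that never contacts a provider. The instructions are kept separate from the text, the cache is checked per text, and only the misses are sent together in one batch.</p>

```python
# Sketch of a cache that works with batched requests.
# `fake_batch_llm` stands in for a provider call; it is hypothetical.

def fake_batch_llm(system_message, texts):
    """Pretend backend: one call receives many texts, returns one label each."""
    return ["positive" if "happy" in t else "negative" for t in texts]


def cached_batch(system_message, texts, cache, backend=fake_batch_llm):
    """Answer `texts` from `cache`, sending only the misses as one batch."""
    # The cache key pairs the shared instructions with the individual text.
    keys = [(system_message, t) for t in texts]
    # Keep first-seen order and drop duplicates among the misses.
    misses = list(dict.fromkeys(k for k in keys if k not in cache))
    if misses:
        answers = backend(system_message, [text for _, text in misses])
        cache.update(zip(misses, answers))
    return [cache[k] for k in keys]


cache = {}
instructions = "Classify the sentiment of the text as positive or negative."
first = cached_batch(instructions, ["I am happy", "I am sad"], cache)
second = cached_batch(instructions, ["I am sad", "I am so happy"], cache)
print(first)
print(second)
print(len(cache))
```

<p>In the second call only the new text reaches the backend, because the other one is already in the cache.</p>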
<h2 id="nlp-operations-without-a-table">NLP operations without a table
</h2>
<p>Since its initial version, <code>mall</code> has provided the ability for R users to perform
the NLP operations over a string vector, in other words, without needing a table.
Starting with the new release, <code>mall</code> also provides this same functionality
in its Python version.</p>
<p><code>mall</code> can process vectors contained in a <code>list</code> object. To use it, initialize a
new <code>LLMVec</code> class object with either an Ollama model or a <code>chatlas</code> <code>Chat</code>
object, and then access the same NLP functions as the Polars extension.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Initialize a Chat object</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">chatlas</span> <span class="kn">import</span> <span class="n">ChatOllama</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="n">ChatOllama</span><span class="p">(</span><span class="n">model</span> <span class="o">=</span> <span class="s2">&#34;llama3.2&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Pass it to a new LLMVec</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">mall</span> <span class="kn">import</span> <span class="n">LLMVec</span>
</span></span><span class="line"><span class="cl"><span class="n">llm</span> <span class="o">=</span> <span class="n">LLMVec</span><span class="p">(</span><span class="n">chat</span><span class="p">)</span>    </span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Access the functions via the new LLMVec object, and pass the text to be processed.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">llm</span><span class="o">.</span><span class="n">sentiment</span><span class="p">([</span><span class="s2">&#34;I am happy&#34;</span><span class="p">,</span> <span class="s2">&#34;I am sad&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; [&#39;positive&#39;, &#39;negative&#39;]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">llm</span><span class="o">.</span><span class="n">translate</span><span class="p">([</span><span class="s2">&#34;Este es el mejor dia!&#34;</span><span class="p">],</span> <span class="s2">&#34;english&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; [&#39;This is the best day!&#39;]</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>For more information, visit the reference page: 






<a href="https://mlverse.github.io/mall/reference/LlmVec.html" target="_blank" rel="noopener">LLMVec</a>
</p>
<h2 id="new-cheatsheet">New cheatsheet
</h2>
<p>The brand new official cheatsheet is now available from Posit:







<a href="https://rstudio.github.io/cheatsheets/nlp-with-llms.pdf" target="_blank" rel="noopener">Natural Language processing using LLMs in R/Python</a>
.
Its main feature is that one side of the page is dedicated to the R version,
and the other side of the page to the Python version.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2025-08-19_edgarmall02/images/cheatsheet.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<p>A web page version is also available on the official cheatsheet site







<a href="https://rstudio.github.io/cheatsheets/html/nlp-with-llms.html" target="_blank" rel="noopener">here</a>
. It takes
advantage of the tab feature that lets you select between R and Python
explanations and examples.</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2025-08-19_edgarmall02/images/html-cheatsheet.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-08-19_edgarmall02/thumbnail.png" length="690897" type="image/png" />
    </item>
    <item>
      <title>recipes 1.3.0</title>
      <link>https://opensource.posit.co/blog/2025-04-28_recipes-1-3-0/</link>
      <pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-04-28_recipes-1-3-0/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[
<p>We&rsquo;re thrilled to announce the release of 






<a href="https://recipes.tidymodels.org/" target="_blank" rel="noopener">recipes</a>
 1.3.0. recipes lets you create a pipeable sequence of feature engineering steps.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"recipes"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post will walk through some of the highlights of this release, which includes changes to how <code>strings_as_factors</code> are specified, deprecation of 






<a href="https://recipes.tidymodels.org/reference/step_select.html" target="_blank" rel="noopener"><code>step_select()</code></a>
, new <code>contrasts</code> argument for 






<a href="https://recipes.tidymodels.org/reference/step_dummy.html" target="_blank" rel="noopener"><code>step_dummy()</code></a>
, and improvements for 






<a href="https://recipes.tidymodels.org/reference/step_impute_bag.html" target="_blank" rel="noopener"><code>step_impute_bag()</code></a>
.</p>
<p>You can see a full list of changes in the 


  
  
  





<a href="https://recipes.tidymodels.org/news/index.html#recipes-130" target="_blank" rel="noopener">release notes</a>
.</p>
<p>Let&rsquo;s first load the package:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://github.com/tidymodels/recipes'>recipes</a></span><span class='o'>)</span></span></code></pre>
</div>
<h2 id="strings_as_factors"><code>strings_as_factors</code>
</h2>
<p>By default, recipes convert predictor strings to factors, and until now the option for that was located in 






<a href="https://recipes.tidymodels.org/reference/prep.html" target="_blank" rel="noopener"><code>prep()</code></a>
. This caused an issue when you wanted to set <code>strings_as_factors = FALSE</code> for a recipe that is used somewhere else like in a workflow.</p>
<p>This is no longer an issue as we have moved the argument to 






<a href="https://recipes.tidymodels.org/reference/recipe.html" target="_blank" rel="noopener"><code>recipe()</code></a>
 itself. We are at the same time deprecating the use of <code>strings_as_factors</code> when used in 






<a href="https://recipes.tidymodels.org/reference/prep.html" target="_blank" rel="noopener"><code>prep()</code></a>
. Here is an example:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://modeldata.tidymodels.org'>modeldata</a></span><span class='o'>)</span></span>
<span><span class='nv'>tate_text</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 4,284 × 5</span></span></span>
<span><span class='c'>#&gt;        id artist             title                                  medium  year</span></span>
<span><span class='c'>#&gt;     <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>                                  <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>  <span style='text-decoration: underline;'>21</span>926 Absalon            Proposals for a Habitat                Video…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>  <span style='text-decoration: underline;'>20</span>472 Auerbach, Frank    Michael                                Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>  <span style='text-decoration: underline;'>20</span>474 Auerbach, Frank    Geoffrey                               Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>  <span style='text-decoration: underline;'>20</span>473 Auerbach, Frank    Jake                                   Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>  <span style='text-decoration: underline;'>20</span>513 Auerbach, Frank    To the Studios                         Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>  <span style='text-decoration: underline;'>21</span>389 Ayres, OBE Gillian Phaëthon                               Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> <span style='text-decoration: underline;'>121</span>187 Barlow, Phyllida   Untitled                               Acryl…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>  <span style='text-decoration: underline;'>19</span>455 Baselitz, Georg    Green VIII                             Woodc…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>  <span style='text-decoration: underline;'>20</span>938 Beattie, Basil     Present Bound                          Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> <span style='text-decoration: underline;'>105</span>941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 4,274 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>We are loading the modeldata package to get <code>tate_text</code>, which has a character column <code>title</code>. If we don&rsquo;t do anything, it turns into a factor.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='o'>~</span><span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>tate_text</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span><span class='nv'>tate_text</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 4,284 × 5</span></span></span>
<span><span class='c'>#&gt;        id artist             title                                  medium  year</span></span>
<span><span class='c'>#&gt;     <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>                                  <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>  <span style='text-decoration: underline;'>21</span>926 Absalon            Proposals for a Habitat                Video…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>  <span style='text-decoration: underline;'>20</span>472 Auerbach, Frank    Michael                                Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>  <span style='text-decoration: underline;'>20</span>474 Auerbach, Frank    Geoffrey                               Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>  <span style='text-decoration: underline;'>20</span>473 Auerbach, Frank    Jake                                   Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>  <span style='text-decoration: underline;'>20</span>513 Auerbach, Frank    To the Studios                         Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>  <span style='text-decoration: underline;'>21</span>389 Ayres, OBE Gillian Phaëthon                               Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> <span style='text-decoration: underline;'>121</span>187 Barlow, Phyllida   Untitled                               Acryl…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>  <span style='text-decoration: underline;'>19</span>455 Baselitz, Georg    Green VIII                             Woodc…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>  <span style='text-decoration: underline;'>20</span>938 Beattie, Basil     Present Bound                          Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> <span style='text-decoration: underline;'>105</span>941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 4,274 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>But we can set <code>strings_as_factors = FALSE</code> in 






<a href="https://recipes.tidymodels.org/reference/recipe.html" target="_blank" rel="noopener"><code>recipe()</code></a>
 and it will no longer be converted.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='o'>~</span><span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>tate_text</span>, strings_as_factors <span class='o'>=</span> <span class='kc'>FALSE</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span><span class='nv'>tate_text</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 4,284 × 5</span></span></span>
<span><span class='c'>#&gt;        id artist             title                                  medium  year</span></span>
<span><span class='c'>#&gt;     <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>                                  <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>  <span style='text-decoration: underline;'>21</span>926 Absalon            Proposals for a Habitat                Video…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>  <span style='text-decoration: underline;'>20</span>472 Auerbach, Frank    Michael                                Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>  <span style='text-decoration: underline;'>20</span>474 Auerbach, Frank    Geoffrey                               Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>  <span style='text-decoration: underline;'>20</span>473 Auerbach, Frank    Jake                                   Etchi…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>  <span style='text-decoration: underline;'>20</span>513 Auerbach, Frank    To the Studios                         Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>  <span style='text-decoration: underline;'>21</span>389 Ayres, OBE Gillian Phaëthon                               Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> <span style='text-decoration: underline;'>121</span>187 Barlow, Phyllida   Untitled                               Acryl…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>  <span style='text-decoration: underline;'>19</span>455 Baselitz, Georg    Green VIII                             Woodc…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>  <span style='text-decoration: underline;'>20</span>938 Beattie, Basil     Present Bound                          Oil p…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> <span style='text-decoration: underline;'>105</span>941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  <span style='text-decoration: underline;'>1</span>990</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 4,274 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>This change should also make pragmatic sense, as whether you want to turn strings into factors is something that should be encoded into the recipe itself.</p>
<h2 id="deprecating-step_select">Deprecating <code>step_select()</code>
</h2>
<p>We have started the process of deprecating 






<a href="https://recipes.tidymodels.org/reference/step_select.html" target="_blank" rel="noopener"><code>step_select()</code></a>
. Given the number of issues people are having with the step, and the fact that it doesn&rsquo;t play well with workflows, we think this is the right call.</p>
<p>There are two main use cases where 






<a href="https://recipes.tidymodels.org/reference/step_select.html" target="_blank" rel="noopener"><code>step_select()</code></a>
 was used: removing variables, and selecting variables. Variables were removed by using <code>-</code> in 






<a href="https://recipes.tidymodels.org/reference/step_select.html" target="_blank" rel="noopener"><code>step_select()</code></a>
:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mpg</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_select.html'>step_select</a></span><span class='o'>(</span><span class='o'>-</span><span class='nf'><a href='https://tidyselect.r-lib.org/reference/starts_with.html'>starts_with</a></span><span class='o'>(</span><span class='s'>"d"</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span>new_data <span class='o'>=</span> <span class='kc'>NULL</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 32 × 9</span></span></span>
<span><span class='c'>#&gt;      cyl    hp    wt  qsec    vs    am  gear  carb   mpg</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>     6   110  2.62  16.5     0     1     4     4  21  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>     6   110  2.88  17.0     0     1     4     4  21  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>     4    93  2.32  18.6     1     1     4     1  22.8</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>     6   110  3.22  19.4     1     0     3     1  21.4</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>     8   175  3.44  17.0     0     0     3     2  18.7</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>     6   105  3.46  20.2     1     0     3     1  18.1</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>     8   245  3.57  15.8     0     0     3     4  14.3</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>     4    62  3.19  20       1     0     4     2  24.4</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>     4    95  3.15  22.9     1     0     4     2  22.8</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>     6   123  3.44  18.3     1     0     4     4  19.2</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 22 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>These use cases can seamlessly be converted to use 






<a href="https://recipes.tidymodels.org/reference/step_rm.html" target="_blank" rel="noopener"><code>step_rm()</code></a>
 without the <code>-</code> for the same result.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mpg</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_rm.html'>step_rm</a></span><span class='o'>(</span><span class='nf'><a href='https://tidyselect.r-lib.org/reference/starts_with.html'>starts_with</a></span><span class='o'>(</span><span class='s'>"d"</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span>new_data <span class='o'>=</span> <span class='kc'>NULL</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 32 × 9</span></span></span>
<span><span class='c'>#&gt;      cyl    hp    wt  qsec    vs    am  gear  carb   mpg</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>     6   110  2.62  16.5     0     1     4     4  21  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>     6   110  2.88  17.0     0     1     4     4  21  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>     4    93  2.32  18.6     1     1     4     1  22.8</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>     6   110  3.22  19.4     1     0     3     1  21.4</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>     8   175  3.44  17.0     0     0     3     2  18.7</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>     6   105  3.46  20.2     1     0     3     1  18.1</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>     8   245  3.57  15.8     0     0     3     4  14.3</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>     4    62  3.19  20       1     0     4     2  24.4</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>     4    95  3.15  22.9     1     0     4     2  22.8</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>     6   123  3.44  18.3     1     0     4     4  19.2</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 22 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>For selecting variables there are two cases. The first is as a tool to select which variables to use in our model. We recommend that you use 






<a href="https://dplyr.tidyverse.org/reference/select.html" target="_blank" rel="noopener"><code>select()</code></a>
 to do that before passing the data into the 






<a href="https://recipes.tidymodels.org/reference/recipe.html" target="_blank" rel="noopener"><code>recipe()</code></a>
. This is especially helpful since 


  
  
  





  
  

<a href="https://opensource.posit.co/blog/2024-07-08_recipes-1-1-0/#column-type-checking">recipes are tighter with respect to their input types</a>
, so only passing the data you need to use is helpful.</p>
<p>If you need to do the selection after another step takes effect, you can still do so by using 






<a href="https://recipes.tidymodels.org/reference/step_rm.html" target="_blank" rel="noopener"><code>step_rm()</code></a>
 in the following manner.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">step_rm</span><span class="p">(</span><span class="n">recipe</span><span class="p">,</span> <span class="nf">all_predictors</span><span class="p">(),</span> <span class="o">-</span><span class="nf">all_of</span><span class="p">(</span><span class="o">&lt;</span><span class="n">variables</span> <span class="n">that</span> <span class="n">you</span> <span class="n">want</span> <span class="n">to</span> <span class="n">keep</span><span class="o">&gt;</span><span class="p">))</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<h2 id="step_dummy-contrasts-argument"><code>step_dummy()</code> contrasts argument
</h2>
<p>Contrasts such as 






<a href="https://rdrr.io/r/stats/contrast.html" target="_blank" rel="noopener"><code>contr.treatment()</code></a>
 and 






<a href="https://rdrr.io/r/stats/contrast.html" target="_blank" rel="noopener"><code>contr.poly()</code></a>
 are used in 






<a href="https://recipes.tidymodels.org/reference/step_dummy.html" target="_blank" rel="noopener"><code>step_dummy()</code></a>
 to determine how the step should translate categorical values into one or more numeric columns. Traditionally, the contrasts were set using 






<a href="https://rdrr.io/r/base/options.html" target="_blank" rel="noopener"><code>options()</code></a>
 like so:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/base/options.html'>options</a></span><span class='o'>(</span>contrasts <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span>unordered <span class='o'>=</span> <span class='s'>"contr.poly"</span>, ordered <span class='o'>=</span> <span class='s'>"contr.poly"</span><span class='o'>)</span><span class='o'>)</span></span></code></pre>
</div>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='o'>~</span><span class='nv'>species</span> <span class='o'>+</span> <span class='nv'>island</span>, <span class='nv'>penguins</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_dummy.html'>step_dummy</a></span><span class='o'>(</span><span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_nominal_predictors</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span>new_data <span class='o'>=</span> <span class='nv'>penguins</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 344 × 4</span></span></span>
<span><span class='c'>#&gt;    species_Chinstrap species_Gentoo island_Dream island_Torgersen</span></span>
<span><span class='c'>#&gt;                <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>          <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>        <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>            <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 334 more rows</span></span></span>
<span></span></code></pre>
</div>
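<p>To see where these numbers come from, it can help to look at the contrast matrices themselves. The base R sketch below prints the polynomial and treatment contrasts for a three-level factor; the object names are only for illustration. Each row corresponds to a factor level, so the first level (here <code>Adelie</code>) gets the first row of <code>contr.poly(3)</code>, roughly <code>-0.707</code> and <code>0.408</code>, and the third level (here <code>Torgersen</code>) gets the third row, roughly <code>0.707</code> and <code>0.408</code>.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'># Rows correspond to factor levels in order, columns to the generated terms
poly_3 &lt;- contr.poly(3)
round(poly_3, 3)

# Treatment contrasts for comparison: plain 0/1 indicators,
# with the first level acting as the reference
treatment_3 &lt;- contr.treatment(3)
treatment_3
</code></pre>
</div>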
<p>The issue with this approach is that it pulls from 






<a href="https://rdrr.io/r/base/options.html" target="_blank" rel="noopener"><code>options()</code></a>
 each time it needs the contrasts instead of storing the information in the recipe. This means that if you put this recipe in production, you will need to set the option in the production environment to match that of the training environment.</p>
<p>To fix this issue we have given 






<a href="https://recipes.tidymodels.org/reference/step_dummy.html" target="_blank" rel="noopener"><code>step_dummy()</code></a>
 an argument <code>contrasts</code> that works in much the same way. You simply specify the contrast you want and it will be stored in the object for easy deployment.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='o'>~</span><span class='nv'>species</span> <span class='o'>+</span> <span class='nv'>island</span>, <span class='nv'>penguins</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_dummy.html'>step_dummy</a></span><span class='o'>(</span></span>
<span>    <span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_nominal_predictors</a></span><span class='o'>(</span><span class='o'>)</span>, contrasts <span class='o'>=</span> <span class='s'>"contr.poly"</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span>new_data <span class='o'>=</span> <span class='nv'>penguins</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 344 × 4</span></span></span>
<span><span class='c'>#&gt;    species_Chinstrap species_Gentoo island_Dream island_Torgersen</span></span>
<span><span class='c'>#&gt;                <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>          <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>        <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>            <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>            -<span style='color: #BB0000;'>0.707</span>          0.408        0.707            0.408</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 334 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>If you are using a contrast from an external package, such as 
<a href="https://hardhat.tidymodels.org/reference/contr_one_hot.html" target="_blank" rel="noopener"><code>hardhat::contr_one_hot()</code></a>
, you will need to attach the package in the environment you are working in with 
<a href="https://github.com/tidymodels/hardhat" target="_blank" rel="noopener"><code>library(hardhat)</code></a>
 and set <code>contrasts = &quot;contr_one_hot&quot;</code>. You will also need to call 
<a href="https://github.com/tidymodels/hardhat" target="_blank" rel="noopener"><code>library(hardhat)</code></a>
 in any production environment where you use this recipe.</p>
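<p>A minimal sketch of that setup is shown below. It assumes recipes 1.3.0 or later and uses a hypothetical toy data frame, <code>toy</code>, rather than real data.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(recipes)
library(hardhat) # attached so that "contr_one_hot" can be found by name

# Hypothetical toy data with a single three-level factor
toy &lt;- data.frame(color = factor(c("red", "green", "blue", "green")))

one_hot &lt;- recipe(~color, data = toy) |&gt;
  step_dummy(color, contrasts = "contr_one_hot") |&gt;
  prep() |&gt;
  bake(new_data = toy)

one_hot
</code></pre>
</div>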
<h2 id="tidyselect-can-be-used-everywhere">tidyselect can be used everywhere
</h2>
<p>Several steps such as 






<a href="https://recipes.tidymodels.org/reference/step_pls.html" target="_blank" rel="noopener"><code>step_pls()</code></a>
 and 






<a href="https://recipes.tidymodels.org/reference/step_impute_bag.html" target="_blank" rel="noopener"><code>step_impute_bag()</code></a>
 require the selection of more than just the affected columns. 






<a href="https://recipes.tidymodels.org/reference/step_pls.html" target="_blank" rel="noopener"><code>step_pls()</code></a>
 needs you to select an <code>outcome</code> variable and 






<a href="https://recipes.tidymodels.org/reference/step_impute_bag.html" target="_blank" rel="noopener"><code>step_impute_bag()</code></a>
 needs you to select which variables to impute with, <code>impute_with</code>, if you don&rsquo;t want to use all predictors. Previously these needed to be strings or use special selectors like 






<a href="https://recipes.tidymodels.org/reference/step_impute_bag.html" target="_blank" rel="noopener"><code>imp_vars()</code></a>
. You don&rsquo;t have to do that anymore. You can now use tidyselect in these arguments too.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mpg</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_pls.html'>step_pls</a></span><span class='o'>(</span><span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_predictors</a></span><span class='o'>(</span><span class='o'>)</span>, outcome <span class='o'>=</span> <span class='nv'>mpg</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span>new_data <span class='o'>=</span> <span class='nv'>mtcars</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 32 × 3</span></span></span>
<span><span class='c'>#&gt;      mpg   PLS1   PLS2</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>  21    0.693  0.895</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>  21    0.650  0.654</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>  22.8  2.78   0.378</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>  21.4  0.210 -<span style='color: #BB0000;'>0.368</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>  18.7 -<span style='color: #BB0000;'>1.95</span>   0.845</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>  18.1  0.137 -<span style='color: #BB0000;'>0.624</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>  14.3 -<span style='color: #BB0000;'>2.77</span>   0.364</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>  24.4  1.81  -<span style='color: #BB0000;'>1.30</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>  22.8  2.12  -<span style='color: #BB0000;'>1.95</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>  19.2  0.531 -<span style='color: #BB0000;'>1.51</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 22 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>Arguments that allow for multiple selections now also work with recipes selectors like 






<a href="https://recipes.tidymodels.org/reference/has_role.html" target="_blank" rel="noopener"><code>all_numeric_predictors()</code></a>
 and 






<a href="https://recipes.tidymodels.org/reference/has_role.html" target="_blank" rel="noopener"><code>has_role()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mpg</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_impute_bag.html'>step_impute_bag</a></span><span class='o'>(</span><span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_predictors</a></span><span class='o'>(</span><span class='o'>)</span>, impute_with <span class='o'>=</span> <span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>has_role</a></span><span class='o'>(</span><span class='s'>"predictor"</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span>new_data <span class='o'>=</span> <span class='nv'>mtcars</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 32 × 11</span></span></span>
<span><span class='c'>#&gt;      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   mpg</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>     6  160    110  3.9   2.62  16.5     0     1     4     4  21  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>     6  160    110  3.9   2.88  17.0     0     1     4     4  21  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>     4  108     93  3.85  2.32  18.6     1     1     4     1  22.8</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>     6  258    110  3.08  3.22  19.4     1     0     3     1  21.4</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>     8  360    175  3.15  3.44  17.0     0     0     3     2  18.7</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>     6  225    105  2.76  3.46  20.2     1     0     3     1  18.1</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>     8  360    245  3.21  3.57  15.8     0     0     3     4  14.3</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>     4  147.    62  3.69  3.19  20       1     0     4     2  24.4</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>     4  141.    95  3.92  3.15  22.9     1     0     4     2  22.8</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>     6  168.   123  3.92  3.44  18.3     1     0     4     4  19.2</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 22 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>These changes are backwards compatible, meaning that the old ways still work with minimal warnings.</p>
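<p>Bare column names are another option for these arguments. The sketch below assumes recipes 1.3.0 or later and uses a hypothetical copy of <code>mtcars</code>, <code>cars_na</code>, with a few weights set to missing, so that there is something to impute.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(recipes)

# Hypothetical example data: mtcars with a few weights set to missing
cars_na &lt;- mtcars
cars_na$wt[c(2, 5, 9)] &lt;- NA

rec &lt;- recipe(mpg ~ ., data = cars_na) |&gt;
  # Impute `wt` from two explicitly chosen predictors given as bare names
  step_impute_bag(wt, impute_with = c(disp, hp), seed_val = 123) |&gt;
  prep()

imputed &lt;- bake(rec, new_data = cars_na)

# Check whether any missing weights remain after imputation
anyNA(imputed$wt)
</code></pre>
</div>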
<h2 id="step_impute_bag-now-takes-up-less-memory"><code>step_impute_bag()</code> now takes up less memory
</h2>
<p>We have another benefit for users of 






<a href="https://recipes.tidymodels.org/reference/step_impute_bag.html" target="_blank" rel="noopener"><code>step_impute_bag()</code></a>
. For each variable it imputes, it fits a bagged tree model, which is then used to predict the missing values. These models had a larger memory footprint than needed. This has been remedied, so there should now be a noticeable decrease in size for recipes with 






<a href="https://recipes.tidymodels.org/reference/step_impute_bag.html" target="_blank" rel="noopener"><code>step_impute_bag()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>rec</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>Sale_Price</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>ames</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_impute_bag.html'>step_impute_bag</a></span><span class='o'>(</span><span class='nf'><a href='https://tidyselect.r-lib.org/reference/starts_with.html'>starts_with</a></span><span class='o'>(</span><span class='s'>"Lot_"</span><span class='o'>)</span>, impute_with <span class='o'>=</span> <span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_numeric_predictors</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'>lobstr</span><span class='nf'>::</span><span class='nf'><a href='https://lobstr.r-lib.org/reference/obj_size.html'>obj_size</a></span><span class='o'>(</span><span class='nv'>rec</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; 20.23 MB</span></span>
<span></span></code></pre>
</div>
<p>This recipe previously took up over <code>75 MB</code> and now takes up about <code>20 MB</code>.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>Many thanks to all the people who contributed to recipes since the last release!</p>
<p>






<a href="https://github.com/chillerb" target="_blank" rel="noopener">@chillerb</a>
, 






<a href="https://github.com/dshemetov" target="_blank" rel="noopener">@dshemetov</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/kevbaer" target="_blank" rel="noopener">@kevbaer</a>
, 






<a href="https://github.com/nhward" target="_blank" rel="noopener">@nhward</a>
, 






<a href="https://github.com/regisely" target="_blank" rel="noopener">@regisely</a>
, and 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-04-28_recipes-1-3-0/thumbnail-wd.jpg" length="698394" type="image/jpeg" />
    </item>
    <item>
      <title>rsample 1.3.0</title>
      <link>https://opensource.posit.co/blog/2025-04-03_rsample-1-3-0/</link>
      <pubDate>Thu, 03 Apr 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-04-03_rsample-1-3-0/</guid>
      <dc:creator>Hannah Frick</dc:creator><description><![CDATA[
<p>We&rsquo;re thrilled to announce the release of 






<a href="https://rsample.tidymodels.org/" target="_blank" rel="noopener">rsample</a>
 1.3.0. rsample makes it easy to create resamples for assessing model performance. It is part of the tidymodels framework, a collection of R packages for modeling and machine learning using tidyverse principles.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"rsample"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post will walk you through the more flexible grouping for calculating bootstrap confidence intervals and highlight the contributions made by participants of the tidyverse developer day.</p>
<p>You can see a full list of changes in the 


  
  
  





<a href="https://rsample.tidymodels.org/news/index.html#rsample-130" target="_blank" rel="noopener">release notes</a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://rsample.tidymodels.org'>rsample</a></span><span class='o'>)</span></span></code></pre>
</div>
<h2 id="flexible-grouping-for-bootstrap-intervals">Flexible grouping for bootstrap intervals
</h2>
<p>Resampling allows you to get an understanding of the variability of an estimate, e.g., a summary statistic of your data. If you want to lean on statistical theory and get confidence intervals for your estimate, you can reach for the bootstrap resampling scheme: calculating your summary statistic on the bootstrap samples enables you to calculate confidence intervals around your point estimate.</p>
<p>rsample contains a family of <code>int_*()</code> functions to calculate bootstrap confidence intervals of different flavors: percentile intervals, &ldquo;BCa&rdquo; intervals, and bootstrap-t intervals. If you want to dive into the technical details, Chapter 11 of 






<a href="https://hastie.su.domains/CASI/" target="_blank" rel="noopener">CASI</a>
 is a good place to start.</p>
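<p>As a rough sketch of the basic workflow without any grouping, the example below computes a percentile interval for the mean of <code>mpg</code> in <code>mtcars</code>. The helper <code>mean_mpg()</code> and the column name <code>stats</code> are hypothetical choices for illustration. The <code>int_*()</code> functions expect each resample&rsquo;s statistics as a tibble with <code>term</code> and <code>estimate</code> columns.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(rsample)

# Hypothetical helper: the statistic for one bootstrap sample, returned as a
# tibble with `term` and `estimate` columns
mean_mpg &lt;- function(split) {
  tibble::tibble(term = "mean_mpg", estimate = mean(analysis(split)$mpg))
}

set.seed(123)
boots &lt;- bootstraps(mtcars, times = 1000) |&gt;
  dplyr::mutate(stats = purrr::map(splits, mean_mpg))

# 95% percentile interval for the mean of mpg
ci &lt;- int_pctl(boots, stats, alpha = 0.05)
ci
</code></pre>
</div>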
<p>You can calculate the confidence intervals based on a grouping in your data. However, so far, rsample would only let you provide a single grouping variable. With this release, we are extending this functionality to allow a more flexible grouping.</p>
<p>The motivating application for us was to be able to calculate confidence intervals around multiple model performance metrics, including dynamic metrics for time-to-event models which depend on an evaluation time point. So in this case, the metric is one grouping variable and the evaluation time another. But let&rsquo;s pull back complexity for an example of how the new rsample functionality works!</p>
<p>We have a dataset with delivery times for orders containing one or more items. We&rsquo;ll do some data wrangling with it, so we are also loading dplyr.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://dplyr.tidyverse.org'>dplyr</a></span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; Attaching package: 'dplyr'</span></span>
<span></span><span><span class='c'>#&gt; The following objects are masked from 'package:stats':</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt;     filter, lag</span></span>
<span></span><span><span class='c'>#&gt; The following objects are masked from 'package:base':</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt;     intersect, setdiff, setequal, union</span></span>
<span></span><span><span class='nf'><a href='https://rdrr.io/r/utils/data.html'>data</a></span><span class='o'>(</span><span class='nv'>deliveries</span>, package <span class='o'>=</span> <span class='s'>"modeldata"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>deliveries</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 10,012 × 31</span></span></span>
<span><span class='c'>#&gt;    time_to_delivery  hour day   distance item_01 item_02 item_03 item_04 item_05</span></span>
<span><span class='c'>#&gt;               <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>             16.1  11.9 Thu       3.15       0       0       2       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>             22.9  19.2 Tue       3.69       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>             30.3  18.4 Fri       2.06       0       0       0       0       1</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>             33.4  15.8 Thu       5.97       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>             27.2  19.6 Fri       2.52       0       0       0       1       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>             19.6  13.0 Sat       3.35       1       0       0       1       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>             22.1  15.5 Sun       2.46       0       0       1       1       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>             26.6  17.0 Thu       2.21       0       0       1       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>             30.8  16.7 Fri       2.62       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>             17.4  11.9 Sun       2.75       0       2       1       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 10,002 more rows</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 22 more variables: item_06 &lt;int&gt;, item_07 &lt;int&gt;, item_08 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_09 &lt;int&gt;, item_10 &lt;int&gt;, item_11 &lt;int&gt;, item_12 &lt;int&gt;, item_13 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_14 &lt;int&gt;, item_15 &lt;int&gt;, item_16 &lt;int&gt;, item_17 &lt;int&gt;, item_18 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_19 &lt;int&gt;, item_20 &lt;int&gt;, item_21 &lt;int&gt;, item_22 &lt;int&gt;, item_23 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_24 &lt;int&gt;, item_25 &lt;int&gt;, item_26 &lt;int&gt;, item_27 &lt;int&gt;</span></span></span>
<span></span></code></pre>
</div>
<p>Instead of fitting a whole model here, we are calculating a straightforward summary statistic for how much delivery time increases if an item is included in the order. So the item is one grouping factor. As a second one, we are using whether the order was delivered on a weekday or a weekend. Let&rsquo;s start by making that weekend indicator and reshaping the data to make it easier to calculate our summary statistic.</p>
<p>Note that the name for the weekend indicator column, <code>.weekend</code>, starts with a dot. That is important as it is the convention to signal to rsample that this is an additional grouping variable.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>item_data</span> <span class='o'>&lt;-</span> <span class='nv'>deliveries</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://dplyr.tidyverse.org/reference/mutate.html'>mutate</a></span><span class='o'>(</span>.weekend <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/ifelse.html'>ifelse</a></span><span class='o'>(</span><span class='nv'>day</span> <span class='o'><a href='https://rdrr.io/r/base/match.html'>%in%</a></span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"Sat"</span>, <span class='s'>"Sun"</span><span class='o'>)</span>, <span class='s'>"weekend"</span>, <span class='s'>"weekday"</span><span class='o'>)</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://dplyr.tidyverse.org/reference/select.html'>select</a></span><span class='o'>(</span><span class='nv'>time_to_delivery</span>, <span class='nv'>.weekend</span>, <span class='nf'><a href='https://tidyselect.r-lib.org/reference/starts_with.html'>starts_with</a></span><span class='o'>(</span><span class='s'>"item"</span><span class='o'>)</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'>tidyr</span><span class='nf'>::</span><span class='nf'><a href='https://tidyr.tidyverse.org/reference/pivot_longer.html'>pivot_longer</a></span><span class='o'>(</span><span class='nf'><a href='https://tidyselect.r-lib.org/reference/starts_with.html'>starts_with</a></span><span class='o'>(</span><span class='s'>"item"</span><span class='o'>)</span>, names_to <span class='o'>=</span> <span class='s'>"item"</span>, values_to <span class='o'>=</span> <span class='s'>"value"</span><span class='o'>)</span> </span></code></pre>
</div>
<p>Next, we are making a small function that calculates the ratio of average delivery times with and without the item included in the order, as an estimate of how much a specific item in an order increases the delivery time.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>relative_increase</span> <span class='o'>&lt;-</span> <span class='kr'>function</span><span class='o'>(</span><span class='nv'>data</span><span class='o'>)</span> <span class='o'>&#123;</span></span>
<span>  <span class='nv'>data</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>    <span class='nf'><a href='https://dplyr.tidyverse.org/reference/mutate.html'>mutate</a></span><span class='o'>(</span>includes_item <span class='o'>=</span> <span class='nv'>value</span> <span class='o'>&gt;</span> <span class='m'>0</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>    <span class='nf'><a href='https://dplyr.tidyverse.org/reference/summarise.html'>summarize</a></span><span class='o'>(</span></span>
<span>      has <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/mean.html'>mean</a></span><span class='o'>(</span><span class='nv'>time_to_delivery</span><span class='o'>[</span><span class='nv'>includes_item</span><span class='o'>]</span><span class='o'>)</span>,</span>
<span>      has_not <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/mean.html'>mean</a></span><span class='o'>(</span><span class='nv'>time_to_delivery</span><span class='o'>[</span><span class='o'>!</span><span class='nv'>includes_item</span><span class='o'>]</span><span class='o'>)</span>,</span>
<span>      .by <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='nv'>item</span>, <span class='nv'>.weekend</span><span class='o'>)</span></span>
<span>    <span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>    <span class='nf'><a href='https://dplyr.tidyverse.org/reference/mutate.html'>mutate</a></span><span class='o'>(</span>estimate <span class='o'>=</span> <span class='nv'>has</span> <span class='o'>/</span> <span class='nv'>has_not</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>    <span class='nf'><a href='https://dplyr.tidyverse.org/reference/select.html'>select</a></span><span class='o'>(</span>term <span class='o'>=</span> <span class='nv'>item</span>, <span class='nv'>.weekend</span>, <span class='nv'>estimate</span><span class='o'>)</span></span>
<span><span class='o'>&#125;</span></span></code></pre>
</div>
<p>We can calculate that on our entire dataset.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'>relative_increase</span><span class='o'>(</span><span class='nv'>item_data</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 54 × 3</span></span></span>
<span><span class='c'>#&gt;    term    .weekend estimate</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>       <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> item_01 weekday      1.07</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> item_02 weekday      1.02</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> item_03 weekday      1.02</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> item_04 weekday      1.00</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> item_05 weekday      1.00</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> item_06 weekday      1.01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> item_07 weekday      1.03</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> item_08 weekday      1.01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> item_09 weekday      1.01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> item_10 weekday      1.06</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 44 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>This is fine, but what we really want here is to get confidence intervals around these estimates!</p>
<p>So let&rsquo;s make bootstrap samples and calculate our statistic on those.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/base/Random.html'>set.seed</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>)</span></span>
<span><span class='nv'>item_bootstrap</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rsample.tidymodels.org/reference/bootstraps.html'>bootstraps</a></span><span class='o'>(</span><span class='nv'>item_data</span>, times <span class='o'>=</span> <span class='m'>1000</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>item_stats</span> <span class='o'>&lt;-</span></span>
<span>  <span class='nv'>item_bootstrap</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://dplyr.tidyverse.org/reference/mutate.html'>mutate</a></span><span class='o'>(</span>stats <span class='o'>=</span> <span class='nf'>purrr</span><span class='nf'>::</span><span class='nf'><a href='https://purrr.tidyverse.org/reference/map.html'>map</a></span><span class='o'>(</span><span class='nv'>splits</span>, <span class='o'>~</span> <span class='nf'><a href='https://rsample.tidymodels.org/reference/as.data.frame.rsplit.html'>analysis</a></span><span class='o'>(</span><span class='nv'>.x</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span> <span class='nf'>relative_increase</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span><span class='o'>)</span></span></code></pre>
</div>
<p>Now we have everything we need to calculate the confidence intervals, stashed into the tibbles in the <code>stats</code> column: an <code>estimate</code>, a <code>term</code> (the primary grouping variable), and our additional grouping variable <code>.weekend</code>, starting with a dot. What&rsquo;s left to do is call one of the <code>int_*()</code> functions and specify which column contains the statistics. Here, we&rsquo;ll calculate percentile intervals with 






<a href="https://rsample.tidymodels.org/reference/int_pctl.html" target="_blank" rel="noopener"><code>int_pctl()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>item_ci</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rsample.tidymodels.org/reference/int_pctl.html'>int_pctl</a></span><span class='o'>(</span><span class='nv'>item_stats</span>, statistics <span class='o'>=</span> <span class='nv'>stats</span>, alpha <span class='o'>=</span> <span class='m'>0.1</span><span class='o'>)</span></span>
<span><span class='nv'>item_ci</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 54 × 7</span></span></span>
<span><span class='c'>#&gt;    term    .weekend .lower .estimate .upper .alpha .method   </span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>     <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>     <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>  <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> item_01 weekday   1.05      1.07    1.09    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> item_01 weekend   1.04      1.07    1.10    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> item_02 weekday   1.00      1.02    1.03    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> item_02 weekend   0.996     1.01    1.03    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> item_03 weekday   1.01      1.02    1.04    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> item_03 weekend   0.970     0.990   1.01    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> item_04 weekday   0.989     1.00    1.02    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> item_04 weekend   0.998     1.02    1.03    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> item_05 weekday   0.987     1.00    1.02    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> item_05 weekend   0.982     1.00    1.03    0.1 percentile</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 44 more rows</span></span></span>
<span></span></code></pre>
</div>
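<p>For intuition, a percentile interval takes the empirical <code>alpha / 2</code> and <code>1 - alpha / 2</code> quantiles of the bootstrap estimates. The sketch below does this by hand in base R for a single statistic on simulated data. The vector <code>times</code> is hypothetical and is not the deliveries data, and the code illustrates the idea rather than how <code>int_pctl()</code> is implemented.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>set.seed(1)

# Hypothetical delivery times in minutes (simulated, for illustration only)
times &lt;- rexp(200, rate = 1 / 30)

# Recompute the statistic (here: the mean) on 1000 bootstrap resamples
boot_means &lt;- replicate(1000, mean(sample(times, replace = TRUE)))

# A 90% percentile interval uses the 5% and 95% quantiles of those estimates
alpha &lt;- 0.1
ci &lt;- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
ci
</code></pre>
</div>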
<h2 id="tidyverse-developer-day">Tidyverse developer day
</h2>
<p>At the tidyverse developer day after posit::conf, rsample got a lot of love in the form of contributions from various community members. People improved documentation and examples, moved deprecations along, tightened checks to support good practice, and upgraded errors and warnings, both in style and content. None of these changes are flashy new features, but all of them are essential to rsample working well!</p>
<p>So for example, leave-one-out (LOO) cross-validation is not a great choice of resampling scheme in most situations. From 


  
  
  





<a href="https://www.tmwr.org/resampling#leave-one-out-cross-validation" target="_blank" rel="noopener">Tidy modeling with R</a>
:</p>
<blockquote>
<p>For anything but pathologically small samples, LOO is computationally excessive, and it may not have good statistical properties.</p>
</blockquote>
<p>It was possible, however, to create implicit LOO samples by using 






<a href="https://rsample.tidymodels.org/reference/vfold_cv.html" target="_blank" rel="noopener"><code>vfold_cv()</code></a>
 with the number of folds set to the number of rows in the data. With a dev day contribution, this now errors:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rsample.tidymodels.org/reference/vfold_cv.html'>vfold_cv</a></span><span class='o'>(</span><span class='nv'>mtcars</span>, v <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/nrow.html'>nrow</a></span><span class='o'>(</span><span class='nv'>mtcars</span><span class='o'>)</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00; font-weight: bold;'>Error</span><span style='font-weight: bold;'> in `vfold_cv()`:</span></span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00;'>!</span> Leave-one-out cross-validation is not supported by this function.</span></span>
<span><span class='c'>#&gt; <span style='color: #BB0000;'>✖</span> You set `v` to `nrow(data)`, which would result in a leave-one-out</span></span>
<span><span class='c'>#&gt;   cross-validation.</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>ℹ</span> Use `loo_cv()` in this case.</span></span>
<span></span></code></pre>
</div>
<p>This is to make users pause and consider whether this is a good choice for their dataset. If you require LOO, you can still use 






<a href="https://rsample.tidymodels.org/reference/loo_cv.html" target="_blank" rel="noopener"><code>loo_cv()</code></a>
.</p>
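<p>As a minimal sketch of requesting LOO explicitly: <code>loo_cv()</code> creates one resample per row, so for <code>mtcars</code> the number of resamples should match the number of rows.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(rsample)

# One resample per row of the data: each row is held out exactly once
loo_folds &lt;- loo_cv(mtcars)

# Compare the number of resamples with the number of rows
nrow(loo_folds) == nrow(mtcars)
</code></pre>
</div>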
<p>Error messages in general have been a focus of ours across various tidymodels packages, and rsample is no exception. We opened a bunch of issues to tackle all of rsample, and all of them got closed! Some of these changes are purely internal, upgrading manual formatting to let the cli package do the work. While the error message in most cases doesn&rsquo;t <em>look</em> different, the formatting is a great deal more consistent.</p>
<p>For some error messages, the additional functionality in cli makes it easy to improve readability. This error message used to be one block of text; now it comes as three bullet points.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rsample.tidymodels.org/reference/permutations.html'>permutations</a></span><span class='o'>(</span><span class='nv'>mtcars</span>, <span class='nf'><a href='https://tidyselect.r-lib.org/reference/everything.html'>everything</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00; font-weight: bold;'>Error</span><span style='font-weight: bold;'> in `permutations()`:</span></span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00;'>!</span> You have selected all columns to permute.</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>ℹ</span> This effectively reorders the rows in the original data without changing the</span></span>
<span><span class='c'>#&gt;   data structure.</span></span>
<span><span class='c'>#&gt; → Please select fewer columns to permute.</span></span>
<span></span></code></pre>
</div>
<p>Changes like these are super helpful to users and developers alike. A big thank you to all the contributors!</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>Many thanks to all the people who contributed to rsample since the last release!</p>
<p>






<a href="https://github.com/agmurray" target="_blank" rel="noopener">@agmurray</a>
, 






<a href="https://github.com/brshallo" target="_blank" rel="noopener">@brshallo</a>
, 






<a href="https://github.com/ccani007" target="_blank" rel="noopener">@ccani007</a>
, 






<a href="https://github.com/dicook" target="_blank" rel="noopener">@dicook</a>
, 






<a href="https://github.com/Dpananos" target="_blank" rel="noopener">@Dpananos</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/gaborcsardi" target="_blank" rel="noopener">@gaborcsardi</a>
, 






<a href="https://github.com/gregor-fausto" target="_blank" rel="noopener">@gregor-fausto</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/JamesHWade" target="_blank" rel="noopener">@JamesHWade</a>
, 






<a href="https://github.com/jttoivon" target="_blank" rel="noopener">@jttoivon</a>
, 






<a href="https://github.com/krz" target="_blank" rel="noopener">@krz</a>
, 






<a href="https://github.com/laurabrianna" target="_blank" rel="noopener">@laurabrianna</a>
, 






<a href="https://github.com/malcolmbarrett" target="_blank" rel="noopener">@malcolmbarrett</a>
, 






<a href="https://github.com/MatthieuStigler" target="_blank" rel="noopener">@MatthieuStigler</a>
, 






<a href="https://github.com/msberends" target="_blank" rel="noopener">@msberends</a>
, 






<a href="https://github.com/nmercadeb" target="_blank" rel="noopener">@nmercadeb</a>
, 






<a href="https://github.com/PriKalra" target="_blank" rel="noopener">@PriKalra</a>
, 






<a href="https://github.com/seb09" target="_blank" rel="noopener">@seb09</a>
, 






<a href="https://github.com/simonpcouch" target="_blank" rel="noopener">@simonpcouch</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, 






<a href="https://github.com/ZWael" target="_blank" rel="noopener">@ZWael</a>
, and 






<a href="https://github.com/zz77zz" target="_blank" rel="noopener">@zz77zz</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-04-03_rsample-1-3-0/thumbnail-wd.jpg" length="509250" type="image/jpeg" />
    </item>
    <item>
      <title>Improved sparsity support in tidymodels</title>
      <link>https://opensource.posit.co/blog/2025-03-19_tidymodels-sparsity/</link>
      <pubDate>Wed, 19 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-03-19_tidymodels-sparsity/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[<p>Photo by <a href="https://unsplash.com/@oxygenvisuals?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Oliver Olah</a> on <a href="https://unsplash.com/photos/green-tree-in-the-middle-of-grass-field-KD8nzFznQQ0?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a></p>
<p>We&rsquo;re stoked to announce tidymodels now fully supports sparse data from end to end. We have been working on this for 






<a href="https://github.com/tidymodels/recipes/pull/515" target="_blank" rel="noopener">over 5 years</a>
. This is an extension of the work we have done 






  
  

<a href="https://opensource.posit.co/blog/2020-11-25_tidymodels-sparse-support/">previously</a>
 with blueprints, which would carry the data sparsely some of the way.</p>
<p>You will need 


  
  
  





<a href="https://recipes.tidymodels.org/news/index.html#recipes-120" target="_blank" rel="noopener">recipes 1.2.0</a>
, 


  
  
  





<a href="https://parsnip.tidymodels.org/news/index.html#parsnip-130" target="_blank" rel="noopener">parsnip 1.3.0</a>
, and 


  
  
  





<a href="https://workflows.tidymodels.org/news/index.html#workflows-120" target="_blank" rel="noopener">workflows 1.2.0</a>
 or later for this to work.</p>
<h2 id="what-are-sparse-data">What are sparse data?
</h2>
<p>The term <strong>sparse data</strong> refers to a data set containing many zeroes. Sparse data appears in all kinds of fields and can be produced by a number of preprocessing methods. We care about sparse data because of how computers store numbers. A 32-bit integer value takes 4 bytes to store. An array of ten 32-bit integers takes 40 bytes, and so on. This happens because each value is written down, including every zero.</p>
<p>A sparse representation instead stores the locations and values of the non-zero entries. Suppose we have the following vector with 20 entries:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>It could be represented sparsely using three components: <code>positions = c(3, 5, 8)</code>, <code>values = c(1, 3, 7)</code>, and <code>length = 20</code>. Now, we have seven values to represent a vector of 20 elements. Since some modeling tasks contain even sparser data, this type of representation starts to show real benefits in terms of execution time and memory consumption.</p>
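<p>To make this concrete, here is a small base R sketch that derives the positions and values of the non-zero entries for the 20-element vector above and then rebuilds the dense vector from them. The object names are illustrative only and are not part of any tidymodels API.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'># The dense vector: 20 entries, only three of them non-zero
x &lt;- c(0, 0, 1, 0, 3, 0, 0, 7, rep(0, 12))

# Sparse description: where the non-zero entries are, what they are,
# and how long the full vector is
positions &lt;- which(x != 0)
values &lt;- x[positions]
len &lt;- length(x)

# Rebuild the dense vector from the sparse description
dense &lt;- numeric(len)
dense[positions] &lt;- values

identical(dense, x)
</code></pre>
</div>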
<p>The tidymodels set of packages has undergone several internal changes to allow it to represent data sparsely internally when it would be beneficial. These changes allow you to fit models that contain sparse data faster and more memory efficiently than before. Moreover, it allows you to fit models previously not possible due to them not fitting in memory.</p>
<h2 id="sparse-matrix-support">Sparse matrix support
</h2>
<p>The first benefit of these changes is that <code>recipe()</code>, <code>prep()</code>, <code>bake()</code>, <code>fit()</code>, and 






<a href="https://rdrr.io/r/stats/predict.html" target="_blank" rel="noopener"><code>predict()</code></a>
 now accept sparse matrices created using the Matrix package.</p>
<p>The <code>permeability_qsar</code> data set from the modeldata package contains quite a lot of zeroes in the predictors, so we will use it as a demonstration. We start by coercing it into a sparse matrix.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://tidymodels.tidymodels.org'>tidymodels</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://Matrix.R-forge.R-project.org'>Matrix</a></span><span class='o'>)</span></span>
<span><span class='nv'>permeability_sparse</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/methods/as.html'>as</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/matrix.html'>as.matrix</a></span><span class='o'>(</span><span class='nv'>permeability_qsar</span><span class='o'>)</span>, <span class='s'>"sparseMatrix"</span><span class='o'>)</span></span></code></pre>
</div>
<p>We can now use this sparse matrix in our code the same way as a dense matrix or data frame:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>rec_spec</span> <span class='o'>&lt;-</span> <span class='nf'>recipe</span><span class='o'>(</span><span class='nv'>permeability</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>permeability_sparse</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_zv</span><span class='o'>(</span><span class='nf'>all_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>mod_spec</span> <span class='o'>&lt;-</span> <span class='nf'>boost_tree</span><span class='o'>(</span><span class='s'>"regression"</span>, <span class='s'>"xgboost"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>wf_spec</span> <span class='o'>&lt;-</span> <span class='nf'>workflow</span><span class='o'>(</span><span class='nv'>rec_spec</span>, <span class='nv'>mod_spec</span><span class='o'>)</span></span></code></pre>
</div>
<p>Model training has the usual syntax:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>wf_fit</span> <span class='o'>&lt;-</span> <span class='nf'>fit</span><span class='o'>(</span><span class='nv'>wf_spec</span>, <span class='nv'>permeability_sparse</span><span class='o'>)</span></span></code></pre>
</div>
<p>as does prediction:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>wf_fit</span>, <span class='nv'>permeability_sparse</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 165 × 1</span></span></span>
<span><span class='c'>#&gt;     .pred</span></span>
<span><span class='c'>#&gt;     <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> 10.5  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>  1.50 </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> 13.1  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>  1.10 </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>  1.25 </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>  0.738</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> 29.3  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>  2.44 </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> 36.3  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>  4.31 </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 155 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>Note that only some models/engines work well with sparse data. These are all listed here 






<a href="https://www.tidymodels.org/find/sparse/" target="_blank" rel="noopener">https://www.tidymodels.org/find/sparse/</a>
. If the model doesn&rsquo;t support sparse data, the data will be coerced into the default non-sparse representation and used as usual.</p>
<p>With a few exceptions, it should work like any other data set. However, this approach has two main limitations. The first is that we are limited to regression tasks since the outcome has to be numeric to be part of the sparse matrix.</p>
<p>The second limitation is that it only works with non-formula methods for parsnip and workflows. This means that you can use a recipe with <code>add_recipe()</code> or select variables directly with <code>add_variables()</code> when using a workflow. And you need to use <code>fit_xy()</code> instead of <code>fit()</code> when using a parsnip object by itself.</p>
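<p>As a sketch of the non-formula interface for a parsnip object on its own, the sparse matrix can be split into predictors and an outcome and passed to <code>fit_xy()</code>. This assumes the xgboost package is installed and reuses the <code>permeability_qsar</code> data from above; the object names <code>x_sparse</code>, <code>y</code>, and <code>mod_fit</code> are illustrative.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(tidymodels)
library(Matrix)

permeability_sparse &lt;- as(as.matrix(permeability_qsar), "sparseMatrix")

# Split into predictors (kept sparse) and a plain numeric outcome vector
is_outcome &lt;- colnames(permeability_sparse) == "permeability"
x_sparse &lt;- permeability_sparse[, !is_outcome]
y &lt;- permeability_sparse[, "permeability"]

mod_spec &lt;- boost_tree("regression", "xgboost")

# fit_xy() takes predictors and outcome separately, so no formula is needed
mod_fit &lt;- fit_xy(mod_spec, x = x_sparse, y = y)

predict(mod_fit, x_sparse)
</code></pre>
</div>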
<p>If this is of interest, we also have a 






<a href="https://www.tidymodels.org/" target="_blank" rel="noopener">https://www.tidymodels.org/</a>
 post about 






<a href="https://www.tidymodels.org/learn/work/sparse-matrix/" target="_blank" rel="noopener">using sparse matrices in tidymodels</a>
.</p>
<h2 id="sparse-data-from-recipes-steps">Sparse data from recipes steps
</h2>
<p>Where this sparsity support really starts to shine is when the recipe we use generates sparse data. The relevant steps come in two flavors: sparsity-creating steps and sparsity-preserving steps. Both are listed here: 






<a href="https://www.tidymodels.org/find/sparse/" target="_blank" rel="noopener">https://www.tidymodels.org/find/sparse/</a>
.</p>
<p>Some steps like <code>step_dummy()</code>, <code>step_indicate_na()</code>, and 






<a href="https://textrecipes.tidymodels.org/reference/step_tf.html" target="_blank" rel="noopener"><code>textrecipes::step_tf()</code></a>
 will almost always produce a lot of zeroes. We take advantage of that by generating the output sparsely when it is beneficial. If these steps end up producing sparse vectors, we want to make sure the sparsity is preserved. A couple of handfuls of steps, such as <code>step_impute_mean()</code> and <code>step_scale()</code>, have been updated to be able to work efficiently with sparse vectors. Both types of steps are detailed in the above-linked list of compatible methods.</p>
<p>What this means in practice is that if you use a model/engine that supports sparse data and have a recipe that produces enough sparse data, then the steps will switch to produce sparse data by using a new sparse data format to store the data (when appropriate) as the recipe is being processed. Then, if the model can accept sparse objects, we convert the data from our new sparse format to a standard sparse matrix object. This increases performance when possible while preserving performance otherwise.</p>
<p>Below is a simple recipe using the <code>ames</code> data set. <code>step_dummy()</code> is applied to all the categorical predictors, leading to a significant amount of zeroes.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>rec_spec</span> <span class='o'>&lt;-</span> <span class='nf'>recipe</span><span class='o'>(</span><span class='nv'>Sale_Price</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>ames</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_zv</span><span class='o'>(</span><span class='nf'>all_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_normalize</span><span class='o'>(</span><span class='nf'>all_numeric_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_dummy</span><span class='o'>(</span><span class='nf'>all_nominal_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>mod_spec</span> <span class='o'>&lt;-</span> <span class='nf'>boost_tree</span><span class='o'>(</span><span class='s'>"regression"</span>, <span class='s'>"xgboost"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>wf_spec</span> <span class='o'>&lt;-</span> <span class='nf'>workflow</span><span class='o'>(</span><span class='nv'>rec_spec</span>, <span class='nv'>mod_spec</span><span class='o'>)</span></span></code></pre>
</div>
<p>When we fit it now, it takes around 125ms and allocates 37.2MB. Before these changes, it took around 335ms and allocated 67.5MB.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>wf_fit</span> <span class='o'>&lt;-</span> <span class='nf'>fit</span><span class='o'>(</span><span class='nv'>wf_spec</span>, <span class='nv'>ames</span><span class='o'>)</span></span></code></pre>
</div>
<p>We see similar speedups when we predict: around 20ms and 25.2MB now, compared to around 60ms and 55.6MB before.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>wf_fit</span>, <span class='nv'>ames</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 2,930 × 1</span></span></span>
<span><span class='c'>#&gt;      .pred</span></span>
<span><span class='c'>#&gt;      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> <span style='text-decoration: underline;'>208</span>649.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> <span style='text-decoration: underline;'>115</span>339.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> <span style='text-decoration: underline;'>148</span>634.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> <span style='text-decoration: underline;'>239</span>770.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> <span style='text-decoration: underline;'>190</span>082.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> <span style='text-decoration: underline;'>184</span>604.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> <span style='text-decoration: underline;'>208</span>572.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> <span style='text-decoration: underline;'>177</span>403 </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> <span style='text-decoration: underline;'>261</span>000.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> <span style='text-decoration: underline;'>198</span>604.</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 2,920 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>These improvements are tightly related to memory allocation, which depends on the sparsity of the data set produced by the recipe. This is why it is hard to say how much benefit you will see. We have seen orders-of-magnitude improvements, both in terms of time and memory allocation. We have also been able to fit models where previously the data was too big to fit in memory.</p>
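<p>To get a feel for why sparsity drives memory use, here is a minimal sketch that compares dense and sparse encodings of the same dummy variables. It uses the Matrix package and a simulated factor purely for illustration; it does not use the new tidymodels sparse format, and the variable names are made up:</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r">library(Matrix)

set.seed(1)
# Simulated categorical predictor with 26 levels (illustrative only)
x &lt;- factor(sample(letters, 1e4, replace = TRUE))

# One indicator column per level: each row has a single 1 and 25 zeroes
dense &lt;- model.matrix(~ x - 1)
sparse &lt;- sparse.model.matrix(~ x - 1)

# The sparse encoding stores only the non-zero entries
object.size(dense)
object.size(sparse)</code></pre></div>
<p>The more zeroes a recipe produces, the larger the gap between the two encodings tends to be.</p>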
<p>Please see the post on tidymodels.org, which goes into more detail about when you are likely to benefit from this and how to change your recipes and workflows to take full advantage of this new feature.</p>
<p>There is also a 






<a href="https://www.tidymodels.org/" target="_blank" rel="noopener">https://www.tidymodels.org/</a>
 post going into a bit more detail about how to 






<a href="https://www.tidymodels.org/learn/work/sparse-recipe/" target="_blank" rel="noopener">use recipes to produce sparse data</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-03-19_tidymodels-sparsity/thumbnail-wd.jpg" length="368137" type="image/jpeg" />
    </item>
    <item>
      <title>Q1 2025 tidymodels digest</title>
      <link>https://opensource.posit.co/blog/2025-02-27_tidymodels-2025-q1/</link>
      <pubDate>Thu, 27 Feb 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-02-27_tidymodels-2025-q1/</guid>
      <dc:creator>Max Kuhn</dc:creator><description><![CDATA[
<p>The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.</p>
<p>Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused.</p>
<p>We&rsquo;ve sent a steady stream of tidymodels packages to CRAN recently. We usually release them in batches since many of our packages are tightly coupled with one another. Internally, this process is referred to as the &ldquo;cascade&rdquo; of CRAN submissions.</p>
<p>The post will update you on which packages have changed and the major improvements you should know about.</p>
<p>Here&rsquo;s a list of the packages and their News sections:</p>
<ul>
<li>






<a href="https://baguette.tidymodels.org/news/index.html" target="_blank" rel="noopener">baguette</a>
</li>
<li>






<a href="https://brulee.tidymodels.org/news/index.html" target="_blank" rel="noopener">brulee</a>
</li>
<li>






<a href="https://censored.tidymodels.org/news/index.html" target="_blank" rel="noopener">censored</a>
</li>
<li>






<a href="https://dials.tidymodels.org/news/index.html" target="_blank" rel="noopener">dials</a>
</li>
<li>






<a href="https://hardhat.tidymodels.org/news/index.html" target="_blank" rel="noopener">hardhat</a>
</li>
<li>






<a href="https://parsnip.tidymodels.org/news/index.html" target="_blank" rel="noopener">parsnip</a>
</li>
<li>






<a href="https://recipes.tidymodels.org/news/index.html" target="_blank" rel="noopener">recipes</a>
</li>
<li>






<a href="https://tidymodels.tidymodels.org/news/index.html" target="_blank" rel="noopener">tidymodels</a>
</li>
<li>






<a href="https://tune.tidymodels.org/news/index.html" target="_blank" rel="noopener">tune</a>
</li>
<li>






<a href="https://workflows.tidymodels.org/news/index.html" target="_blank" rel="noopener">workflows</a>
</li>
</ul>
<p>Let&rsquo;s look at a few specific updates.</p>
<h2 id="improvements-in-errors-and-warnings">Improvements in errors and warnings
</h2>
<p>A group effort was made to improve our error and warning messages across many packages. This started with an internal &ldquo;upkeep week&rdquo; (which ended up being 3-4 weeks) and concluded at the 






  
  

<a href="https://opensource.posit.co/blog/2024-04-09_tdd-2024/">Tidy Dev Day in Seattle</a>
 after posit::conf(2024).</p>
<p>The goal was to use new tools in the cli and rlang packages to make messages more informative than they used to be. For example, using:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">tidy</span><span class="p">(</span><span class="n">pca_extract_trained</span><span class="p">,</span> <span class="n">number</span> <span class="o">=</span> <span class="m">3</span><span class="p">,</span> <span class="n">type</span> <span class="o">=</span> <span class="s">&#34;variances&#34;</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>used to result in the error message:</p>
<div class="code-block"><pre tabindex="0"><code>Error in `match.arg()`:
! &#39;arg&#39; should be one of &#34;coef&#34;, &#34;variance&#34;</code></pre></div>
<p>The new system references the function that you called and not the underlying base R function that actually errored. It also suggests a solution:</p>
<div class="code-block"><pre tabindex="0"><code>Error in `tidy()`:
! `type` must be one of &#34;coef&#34; or &#34;variance&#34;, not &#34;variances&#34;.
i Did you mean &#34;variance&#34;?</code></pre></div>
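<p>As a minimal sketch of how a message like this can arise (an assumption for illustration, not necessarily how <code>tidy()</code> is implemented), <code>rlang::arg_match()</code> validates a string argument against its allowed values and reports the calling function in the error. The function name <code>tidy_type()</code> is hypothetical:</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r">library(rlang)

# Hypothetical helper; the allowed values come from the argument&#39;s default
tidy_type &lt;- function(type = c(&#34;coef&#34;, &#34;variance&#34;)) {
  # arg_match() errors, naming the calling function, if `type` is not allowed
  type &lt;- arg_match(type)
  type
}

tidy_type(&#34;variance&#34;)        # a valid value is returned unchanged
try(tidy_type(&#34;variances&#34;))  # an invalid value raises an informative error</code></pre></div>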
<p>The rlang package created a set of 






<a href="https://usethis.r-lib.org/reference/use_standalone.html" target="_blank" rel="noopener">standalone files</a>
 that contain high-quality type checkers and related functions. This also improves the information that users get from an error. For example, using an inappropriate formula value in <code>fit(linear_reg(), &quot;boop&quot;, mtcars)</code>, the old message was:</p>
<div class="code-block"><pre tabindex="0"><code>Error in `fit()`:
! The `formula` argument must be a formula, but it is a &lt;character&gt;.</code></pre></div>
<p>and now you see:</p>
<div class="code-block"><pre tabindex="0"><code>Error in `fit()`:
! `formula` must be a formula, not the string &#34;boop&#34;.</code></pre></div>
<p>This was <em>a lot</em> of work and we still aren’t finished. Two events helped us get as far as we did.</p>
<p>First, Simon Couch made the 






<a href="https://simonpcouch.github.io/chores/" target="_blank" rel="noopener">chores</a>
 package (its previous name was &ldquo;pal&rdquo;), which enabled us to use AI tools to solve small-scope problems, such as converting old rlang error code to use the new 






<a href="https://rlang.r-lib.org/reference/topic-condition-formatting.html" target="_blank" rel="noopener">cli syntax</a>
. I can’t overstate how much of a speed-up this was for us.</p>
<p>Second, at developer day, many external folks pitched in to make pull requests from a list of issues:</p>
<div class="figure" style="text-align: center">
<img src="https://opensource.posit.co/blog/2025-02-27_tidymodels-2025-q1/IMG_4743.jpeg" alt="Organizing Tidy Dev Day issues."  />
<p class="caption">Organizing Tidy Dev Day issues.</p>
</div>
<p>I love these sessions for many reasons, but mostly because we meet users and contributors to our packages in person and work with them on specific tasks.</p>
<p>There is a lot more to do here; we have a lot of secondary packages that would benefit from these improvements too.</p>
<h2 id="quantile-regression-in-parsnip">Quantile regression in parsnip
</h2>
<p>One big update in parsnip was a new modeling mode of <code>&quot;quantile regression&quot;</code>. Daniel McDonald and Ryan Tibshirani provided much of the impetus for this work based on their 






<a href="https://delphi.cmu.edu/" target="_blank" rel="noopener">disease modeling framework</a>
.</p>
<p>You can generate quantile predictions by first creating a model specification, which includes the quantiles that you want to predict:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">library</span><span class="p">(</span><span class="n">tidymodels</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nf">tidymodels_prefer</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">ames</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="n">modeldata</span><span class="o">::</span><span class="n">ames</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">mutate</span><span class="p">(</span><span class="n">Sale_Price</span> <span class="o">=</span> <span class="nf">log10</span><span class="p">(</span><span class="n">Sale_Price</span><span class="p">))</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">Sale_Price</span><span class="p">,</span> <span class="n">Latitude</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">quant_spec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">linear_reg</span><span class="p">()</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">set_engine</span><span class="p">(</span><span class="s">&#34;quantreg&#34;</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">set_mode</span><span class="p">(</span><span class="s">&#34;quantile regression&#34;</span><span class="p">,</span> <span class="n">quantile_levels</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span> <span class="m">0.5</span><span class="p">,</span> <span class="m">0.9</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">quant_spec</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## Linear Regression Model Specification (quantile regression)
## 
## Computational engine: quantreg</code></pre></div>
<div class="code-block"><pre tabindex="0"><code>## Quantile levels: 0.1, 0.5, and 0.9.</code></pre></div>
<p>We&rsquo;ll add some spline terms via a recipe and fit the model:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">spline_rec</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">recipe</span><span class="p">(</span><span class="n">Sale_Price</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">ames</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">step_spline_natural</span><span class="p">(</span><span class="n">Latitude</span><span class="p">,</span> <span class="n">deg_free</span> <span class="o">=</span> <span class="m">10</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">quant_fit</span> <span class="o">&lt;-</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">workflow</span><span class="p">(</span><span class="n">spline_rec</span><span class="p">,</span> <span class="n">quant_spec</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">fit</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">ames</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">quant_fit</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## ══ Workflow [trained] ═════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ───────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_spline_natural()
## 
## ── Model ──────────────────────────────────────────────────────────────
## Call:
## quantreg::rq(formula = ..y ~ ., tau = quantile_levels, data = data)
## 
## Coefficients:
##               tau= 0.1    tau= 0.5    tau= 0.9
## (Intercept) 4.71981123  5.07728741  5.25221335
## Latitude_01 1.22409173  0.70928577  0.79000849
## Latitude_02 0.19561816  0.04937750  0.02832633
## Latitude_03 0.16616065  0.02045910  0.14730573
## Latitude_04 0.30583648  0.08489487  0.15595080
## Latitude_05 0.21663212  0.02016258 -0.01110625
## Latitude_06 0.33541228  0.12005254  0.03006777
## Latitude_07 0.47732205  0.09146728  0.17394021
## Latitude_08 0.24028784  0.30450058  0.26144584
## Latitude_09 0.05840312 -0.14733781 -0.11911843
## Latitude_10 1.52800673  0.95994216  1.21750501
## 
## Degrees of freedom: 2930 total; 2919 residual</code></pre></div>
<p>For prediction, tidymodels always returns a data frame with as many rows as the input data set (here: <code>ames</code>). The result for quantile predictions is a special vctrs class:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">quant_pred</span> <span class="o">&lt;-</span> <span class="nf">predict</span><span class="p">(</span><span class="n">quant_fit</span><span class="p">,</span> <span class="n">ames</span><span class="p">)</span> 
</span></span><span class="line"><span class="cl"><span class="n">quant_pred</span> <span class="o">|&gt;</span> <span class="nf">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 4 × 1
##   .pred_quantile
##        &lt;qtls(3)&gt;
## 1         [5.33]
## 2         [5.33]
## 3         [5.33]
## 4         [5.31]</code></pre></div>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">class</span><span class="p">(</span><span class="n">quant_pred</span><span class="o">$</span><span class="n">.pred_quantile</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## [1] &#34;quantile_pred&#34; &#34;vctrs_vctr&#34;    &#34;list&#34;</code></pre></div>
<p>where the output <code>[5.31]</code> shows the middle quantile.</p>
<p>We can expand the set of quantile predictions so that there are three rows for each source row in <code>ames</code>. There’s also an integer column called <code>.row</code> so that we can merge the data with the source data:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">quant_pred</span><span class="o">$</span><span class="n">.pred_quantile[1]</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## &lt;quantiles[1]&gt;
## [1] [5.33]
## # Quantile levels: 0.1 0.5 0.9</code></pre></div>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">as_tibble</span><span class="p">(</span><span class="n">quant_pred</span><span class="o">$</span><span class="n">.pred_quantile[1]</span><span class="p">)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="code-block"><pre tabindex="0"><code>## # A tibble: 3 × 3
##   .pred_quantile .quantile_levels  .row
##            &lt;dbl&gt;            &lt;dbl&gt; &lt;int&gt;
## 1           5.08              0.1     1
## 2           5.33              0.5     1
## 3           5.52              0.9     1</code></pre></div>
<p>Here are the predicted quantile values:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="lnt">8
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">quant_pred</span><span class="o">$</span><span class="n">.pred_quantile</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">as_tibble</span><span class="p">()</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">full_join</span><span class="p">(</span><span class="n">ames</span> <span class="o">|&gt;</span> <span class="nf">add_rowindex</span><span class="p">(),</span> <span class="n">by</span> <span class="o">=</span> <span class="s">&#34;.row&#34;</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">arrange</span><span class="p">(</span><span class="n">Latitude</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">ggplot</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">Latitude</span><span class="p">))</span> <span class="o">+</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">geom_point</span><span class="p">(</span><span class="n">data</span> <span class="o">=</span> <span class="n">ames</span><span class="p">,</span> <span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">Sale_Price</span><span class="p">),</span> <span class="n">alpha</span> <span class="o">=</span> <span class="m">1</span> <span class="o">/</span> <span class="m">5</span><span class="p">)</span> <span class="o">+</span>
</span></span><span class="line"><span class="cl">  <span class="nf">geom_line</span><span class="p">(</span><span class="nf">aes</span><span class="p">(</span><span class="n">y</span> <span class="o">=</span> <span class="n">.pred_quantile</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="nf">format</span><span class="p">(</span><span class="n">.quantile_levels</span><span class="p">)),</span> 
</span></span><span class="line"><span class="cl">            <span class="n">show.legend</span> <span class="o">=</span> <span class="kc">FALSE</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="m">1.5</span><span class="p">)</span> </span></span></code></pre></td></tr></table>
</div>
</div></div>
<div class="figure" style="text-align: center">
<img src="https://opensource.posit.co/blog/2025-02-27_tidymodels-2025-q1/figure/quant-plot-1.svg" alt="10%, 50%, and 90% quantile predictions." width="80%" />
<p class="caption">10%, 50%, and 90% quantile predictions.</p>
</div>
<p>For now, the new mode does not have many engines. We need to implement some performance statistics in the yardstick package before integrating these models into the whole tidymodels ecosystem.</p>
<p>In other news, we’ve added some additional neural network models based on some improvements in the brulee package. Namely, two-layer feed-forward networks for tabular data (using torch) can now be tuned.</p>
<p>One other improvement has been simmering for a long time: the ability to exploit sparse data structures better. We’ve improved our <code>fit()</code> interfaces for the few model engines that can use sparsely encoded data. There is much more to come on this in a few months, especially around recipes, so stay tuned.</p>
<p>Finally, we’ve created a set of 






<a href="https://parsnip.tidymodels.org/articles/checklists.html" target="_blank" rel="noopener">checklists</a>
 that can be used when creating new models or engines. These are very helpful, even for us, since there are a lot of minutiae to remember.</p>
<h2 id="parallelism-in-tune">Parallelism in tune
</h2>
<p>This was a small maintenance release mostly related to parallel processing. Up to now, tune facilitated parallelism using the 






<a href="https://cran.r-project.org/package=foreach" target="_blank" rel="noopener">foreach</a>
 package. That package is mature but not actively developed, so we have been slowly moving toward using the 






<a href="https://www.futureverse.org/packages-overview.html" target="_blank" rel="noopener">future</a>
 package(s).</p>
<p>The 


  
  
  





  
  

<a href="https://opensource.posit.co/blog/2024-04-18_tune-1-2-0/#modernized-support-for-parallel-processing">first step in this journey</a>
 was to keep using foreach internally (while leaning toward future) and to encourage users to stop invoking the foreach package directly and, instead, load and use the future package.</p>
<p>We’re now moving folks into the second stage. tune will now raise a warning when:</p>
<ul>
<li>A parallel backend has been registered with foreach, and</li>
<li>No 






<a href="https://future.futureverse.org/reference/plan.html" target="_blank" rel="noopener"><code>plan()</code></a>
 has been specified with future.</li>
</ul>
<p>This will allow users to transition their existing code to use only future and allow us to update existing documentation and training materials.</p>
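<p>In practice, the transition usually amounts to registering a future plan instead of a foreach backend. A minimal sketch, assuming a local machine and an arbitrary choice of two workers:</p>
<div class="code-block"><pre tabindex="0"><code class="language-r" data-lang="r">library(future)

# Use background R sessions as workers; the worker count is an arbitrary choice
plan(multisession, workers = 2)

# ... call tune functions such as tune_grid() here ...

# Return to sequential processing when finished
plan(sequential)</code></pre></div>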
<p>We anticipate that the third stage, <strong>removing foreach entirely</strong>, will occur sometime before posit::conf(2025) in September.</p>
<h2 id="things-to-look-forward-to">Things to look forward to
</h2>
<p>We are working hard on a few major initiatives that we plan on showing off at 






<a href="https://posit.co/conference/" target="_blank" rel="noopener">posit::conf(2025)</a>
.</p>
<p>First is integrated support for sparse <strong>data</strong>. The emphasis is on &ldquo;data&rdquo; because users can use a data frame of sparse vectors <em>or</em> the usual sparse matrix format. This is a big deal because it does not force you to convert non-numeric data into a numeric matrix format. Again, we’ll discuss this more in the future, but you should be able to use sparse data frames in parsnip, recipes, tune, etc.</p>
<p>The second initiative is the longstanding goal of adding <strong>postprocessing</strong> to tidymodels. Just as you can add a preprocessor to a model workflow, you will be able to add a set of postprocessing adjustments to the predictions your model generates. See our 






  
  

<a href="https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/">previous post</a>
 for a sneak peek.</p>
<p>Finally, this year&rsquo;s 






  
  

<a href="https://opensource.posit.co/blog/2025-01-08_tidymodels-2025-internship/">summer internship</a>
 focuses on supervised feature selection methods. We’ll also have releases (and probably another package) for these tools.</p>
<p>These should come to fruition (and CRAN) before or around August 2025.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>We want to sincerely thank everyone who contributed to these packages since their previous versions:</p>
<p>






<a href="https://github.com/AlbertoImg" target="_blank" rel="noopener">@AlbertoImg</a>
, 






<a href="https://github.com/asb2111" target="_blank" rel="noopener">@asb2111</a>
, 






<a href="https://github.com/balraadjsings" target="_blank" rel="noopener">@balraadjsings</a>
, 






<a href="https://github.com/bcjaeger" target="_blank" rel="noopener">@bcjaeger</a>
, 






<a href="https://github.com/beansrowning" target="_blank" rel="noopener">@beansrowning</a>
, 






<a href="https://github.com/BrennanAntone" target="_blank" rel="noopener">@BrennanAntone</a>
, 






<a href="https://github.com/cheryldietrich" target="_blank" rel="noopener">@cheryldietrich</a>
, 






<a href="https://github.com/chillerb" target="_blank" rel="noopener">@chillerb</a>
, 






<a href="https://github.com/conarr5" target="_blank" rel="noopener">@conarr5</a>
, 






<a href="https://github.com/corybrunson" target="_blank" rel="noopener">@corybrunson</a>
, 






<a href="https://github.com/dajmcdon" target="_blank" rel="noopener">@dajmcdon</a>
, 






<a href="https://github.com/davidrsch" target="_blank" rel="noopener">@davidrsch</a>
, 






<a href="https://github.com/Edgar-Zamora" target="_blank" rel="noopener">@Edgar-Zamora</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/gaborcsardi" target="_blank" rel="noopener">@gaborcsardi</a>
, 






<a href="https://github.com/gimholte" target="_blank" rel="noopener">@gimholte</a>
, 






<a href="https://github.com/grantmcdermott" target="_blank" rel="noopener">@grantmcdermott</a>
, 






<a href="https://github.com/grouptheory" target="_blank" rel="noopener">@grouptheory</a>
, 






<a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>
, 






<a href="https://github.com/ilaria-kode" target="_blank" rel="noopener">@ilaria-kode</a>
, 






<a href="https://github.com/JamesHWade" target="_blank" rel="noopener">@JamesHWade</a>
, 






<a href="https://github.com/jesusherranz" target="_blank" rel="noopener">@jesusherranz</a>
, 






<a href="https://github.com/jkylearmstrong" target="_blank" rel="noopener">@jkylearmstrong</a>
, 






<a href="https://github.com/joranE" target="_blank" rel="noopener">@joranE</a>
, 






<a href="https://github.com/joscani" target="_blank" rel="noopener">@joscani</a>
, 






<a href="https://github.com/Joscelinrocha" target="_blank" rel="noopener">@Joscelinrocha</a>
, 






<a href="https://github.com/josho88" target="_blank" rel="noopener">@josho88</a>
, 






<a href="https://github.com/joshuagi" target="_blank" rel="noopener">@joshuagi</a>
, 






<a href="https://github.com/JosiahParry" target="_blank" rel="noopener">@JosiahParry</a>
, 






<a href="https://github.com/jrosell" target="_blank" rel="noopener">@jrosell</a>
, 






<a href="https://github.com/jrwinget" target="_blank" rel="noopener">@jrwinget</a>
, 






<a href="https://github.com/KarlKoe" target="_blank" rel="noopener">@KarlKoe</a>
, 






<a href="https://github.com/kscott-1" target="_blank" rel="noopener">@kscott-1</a>
, 






<a href="https://github.com/lilykoff" target="_blank" rel="noopener">@lilykoff</a>
, 






<a href="https://github.com/lionel-" target="_blank" rel="noopener">@lionel-</a>
, 






<a href="https://github.com/LouisMPenrod" target="_blank" rel="noopener">@LouisMPenrod</a>
, 






<a href="https://github.com/luisDVA" target="_blank" rel="noopener">@luisDVA</a>
, 






<a href="https://github.com/marcelglueck" target="_blank" rel="noopener">@marcelglueck</a>
, 






<a href="https://github.com/marcozanotti" target="_blank" rel="noopener">@marcozanotti</a>
, 






<a href="https://github.com/martaalcalde" target="_blank" rel="noopener">@martaalcalde</a>
, 






<a href="https://github.com/mattwarkentin" target="_blank" rel="noopener">@mattwarkentin</a>
, 






<a href="https://github.com/mihem" target="_blank" rel="noopener">@mihem</a>
, 






<a href="https://github.com/mitchellmanware" target="_blank" rel="noopener">@mitchellmanware</a>
, 






<a href="https://github.com/naokiohno" target="_blank" rel="noopener">@naokiohno</a>
, 






<a href="https://github.com/nhward" target="_blank" rel="noopener">@nhward</a>
, 






<a href="https://github.com/npelikan" target="_blank" rel="noopener">@npelikan</a>
, 






<a href="https://github.com/obgeneralao" target="_blank" rel="noopener">@obgeneralao</a>
, 






<a href="https://github.com/owenjonesuob" target="_blank" rel="noopener">@owenjonesuob</a>
, 






<a href="https://github.com/pbhogale" target="_blank" rel="noopener">@pbhogale</a>
, 






<a href="https://github.com/Peter4801" target="_blank" rel="noopener">@Peter4801</a>
, 






<a href="https://github.com/pgg1309" target="_blank" rel="noopener">@pgg1309</a>
, 






<a href="https://github.com/reisner" target="_blank" rel="noopener">@reisner</a>
, 






<a href="https://github.com/rfsaldanha" target="_blank" rel="noopener">@rfsaldanha</a>
, 






<a href="https://github.com/rkb965" target="_blank" rel="noopener">@rkb965</a>
, 






<a href="https://github.com/RobLBaker" target="_blank" rel="noopener">@RobLBaker</a>
, 






<a href="https://github.com/RodDalBen" target="_blank" rel="noopener">@RodDalBen</a>
, 






<a href="https://github.com/SantiagoD999" target="_blank" rel="noopener">@SantiagoD999</a>
, 






<a href="https://github.com/shum461" target="_blank" rel="noopener">@shum461</a>
, 






<a href="https://github.com/simonpcouch" target="_blank" rel="noopener">@simonpcouch</a>
, 






<a href="https://github.com/szimmer" target="_blank" rel="noopener">@szimmer</a>
, 






<a href="https://github.com/talegari" target="_blank" rel="noopener">@talegari</a>
, 






<a href="https://github.com/therealjpetereit" target="_blank" rel="noopener">@therealjpetereit</a>
, 






<a href="https://github.com/topepo" target="_blank" rel="noopener">@topepo</a>
, 






<a href="https://github.com/walkerjameschris" target="_blank" rel="noopener">@walkerjameschris</a>
, and  






<a href="https://github.com/ZWael" target="_blank" rel="noopener">@ZWael</a>
</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-02-27_tidymodels-2025-q1/thumbnail-wd.jpg" length="92976" type="image/jpeg" />
    </item>
    <item>
      <title>orbital 0.3.0</title>
      <link>https://opensource.posit.co/blog/2025-01-13_orbital-0-3-0/</link>
      <pubDate>Mon, 13 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2025-01-13_orbital-0-3-0/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[
<p>We&rsquo;re thrilled to announce the release of 






<a href="https://orbital.tidymodels.org/" target="_blank" rel="noopener">orbital</a>
 0.3.0. orbital lets you predict in databases using tidymodels workflows.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"orbital"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post will cover the highlights, which are classification support and the new augment method.</p>
<p>You can see a full list of changes in the 


  
  
  





<a href="https://orbital.tidymodels.org/news/index.html#orbital-030" target="_blank" rel="noopener">release notes</a>
.</p>
<h2 id="classification-support">Classification support
</h2>
<p>The biggest improvement in this version is that 






<a href="https://orbital.tidymodels.org/reference/orbital.html" target="_blank" rel="noopener"><code>orbital()</code></a>
 now works for supported classification models. See the 
<a href="https://orbital.tidymodels.org/articles/supported-models.html#supported-models" target="_blank" rel="noopener">vignette</a>
 for a list of all supported models.</p>
<p>Let&rsquo;s start by fitting a classification model on the <code>penguins</code> data set, using {xgboost} as the engine.</p>
<div class="highlight">
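<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'>tidymodels</span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'>orbital</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>rec_spec</span> <span class='o'>&lt;-</span> <span class='nf'>recipe</span><span class='o'>(</span><span class='nv'>species</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>penguins</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>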
<span>  <span class='nf'>step_unknown</span><span class='o'>(</span><span class='nf'>all_nominal_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_dummy</span><span class='o'>(</span><span class='nf'>all_nominal_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_impute_mean</span><span class='o'>(</span><span class='nf'>all_numeric_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>step_zv</span><span class='o'>(</span><span class='nf'>all_predictors</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>bt_spec</span> <span class='o'>&lt;-</span> <span class='nf'>boost_tree</span><span class='o'>(</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>set_mode</span><span class='o'>(</span><span class='s'>"classification"</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'>set_engine</span><span class='o'>(</span><span class='s'>"xgboost"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>wf_spec</span> <span class='o'>&lt;-</span> <span class='nf'>workflow</span><span class='o'>(</span><span class='nv'>rec_spec</span>, <span class='nv'>bt_spec</span><span class='o'>)</span></span>
<span><span class='nv'>wf_fit</span> <span class='o'>&lt;-</span> <span class='nf'>fit</span><span class='o'>(</span><span class='nv'>wf_spec</span>, data <span class='o'>=</span> <span class='nv'>penguins</span><span class='o'>)</span></span></code></pre>
</div>
<p>With this fitted workflow object, we can call 






<a href="https://orbital.tidymodels.org/reference/orbital.html" target="_blank" rel="noopener"><code>orbital()</code></a>
 on it to create an orbital object.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>orbital_obj</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://orbital.tidymodels.org/reference/orbital.html'>orbital</a></span><span class='o'>(</span><span class='nv'>wf_fit</span><span class='o'>)</span></span>
<span><span class='nv'>orbital_obj</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>orbital Object</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────</span></span></span>
<span><span class='c'>#&gt; • island = dplyr::if_else(is.na(island), "unknown", island)</span></span>
<span><span class='c'>#&gt; • sex = dplyr::if_else(is.na(sex), "unknown", sex)</span></span>
<span><span class='c'>#&gt; • island_Dream = as.numeric(island == "Dream")</span></span>
<span><span class='c'>#&gt; • island_Torgersen = as.numeric(island == "Torgersen")</span></span>
<span><span class='c'>#&gt; • sex_male = as.numeric(sex == "male")</span></span>
<span><span class='c'>#&gt; • sex_unknown = as.numeric(sex == "unknown")</span></span>
<span><span class='c'>#&gt; • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, bill_l ...</span></span>
<span><span class='c'>#&gt; • bill_depth_mm = dplyr::if_else(is.na(bill_depth_mm), 17.15117, bill_dep ...</span></span>
<span><span class='c'>#&gt; • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201, flipp ...</span></span>
<span><span class='c'>#&gt; • body_mass_g = dplyr::if_else(is.na(body_mass_g), 4202, body_mass_g)</span></span>
<span><span class='c'>#&gt; • island_Dream = dplyr::if_else(is.na(island_Dream), 0.3604651, island_Dr ...</span></span>
<span><span class='c'>#&gt; • island_Torgersen = dplyr::if_else(is.na(island_Torgersen), 0.1511628, i ...</span></span>
<span><span class='c'>#&gt; • sex_male = dplyr::if_else(is.na(sex_male), 0.4883721, sex_male)</span></span>
<span><span class='c'>#&gt; • sex_unknown = dplyr::if_else(is.na(sex_unknown), 0.03197674, sex_unknow ...</span></span>
<span><span class='c'>#&gt; • Adelie = 0 + dplyr::case_when((bill_depth_mm &lt; 15.1 | is.na(bill_depth_ ...</span></span>
<span><span class='c'>#&gt; • Chinstrap = 0 + dplyr::case_when((island_Dream &lt; 0.5 | is.na(island_Dre ...</span></span>
<span><span class='c'>#&gt; • Gentoo = 0 + dplyr::case_when((bill_depth_mm &lt; 15.95 | is.na(bill_depth ...</span></span>
<span><span class='c'>#&gt; • .pred_class = dplyr::case_when(Adelie &gt; Chinstrap &amp; Adelie &gt; Gentoo ~ " ...</span></span>
<span><span class='c'>#&gt; ────────────────────────────────────────────────────────────────────────────────</span></span>
<span><span class='c'>#&gt; 18 equations in total.</span></span>
<span></span></code></pre>
</div>
<p>This object contains all the information needed to produce predictions, which we can generate with 






<a href="https://rdrr.io/r/stats/predict.html" target="_blank" rel="noopener"><code>predict()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>orbital_obj</span>, <span class='nv'>penguins</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 344 × 1</span></span></span>
<span><span class='c'>#&gt;    .pred_class</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 334 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>The main thing to note here is that the orbital package produces character vectors instead of factors. This is done as a unifying approach since many databases don&rsquo;t have factor types.</p>
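<p>If downstream code expects a factor, the character column can be converted after the fact. The following is a minimal base R sketch, assuming the class levels are known from the training outcome; the <code>pred_class</code> and <code>species_levels</code> vectors are made up for illustration and are not produced by orbital.</p>

```r
# Character predictions, shaped like the .pred_class column shown above
# (values are made up for illustration)
pred_class <- c("Adelie", "Gentoo", "Adelie", "Chinstrap")

# Assumption: the class levels are known from the training outcome
species_levels <- c("Adelie", "Chinstrap", "Gentoo")

# Convert to a factor with a fixed, explicit set of levels
pred_factor <- factor(pred_class, levels = species_levels)

levels(pred_factor)
table(pred_factor)
```

<p>Setting <code>levels</code> explicitly keeps classes that happen to be absent from a batch of predictions, which a bare <code>factor()</code> call would drop.</p>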
<p>Speaking of databases, you can 






<a href="https://rdrr.io/r/stats/predict.html" target="_blank" rel="noopener"><code>predict()</code></a>
 on an orbital object using database tables. Below, we create an ephemeral in-memory SQLite database with RSQLite.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://dbi.r-dbi.org'>DBI</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://rsqlite.r-dbi.org'>RSQLite</a></span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>con_sqlite</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://dbi.r-dbi.org/reference/dbConnect.html'>dbConnect</a></span><span class='o'>(</span><span class='nf'><a href='https://rsqlite.r-dbi.org/reference/SQLite.html'>SQLite</a></span><span class='o'>(</span><span class='o'>)</span>, dbname <span class='o'>=</span> <span class='s'>":memory:"</span><span class='o'>)</span></span>
<span><span class='nv'>penguins_sqlite</span> <span class='o'>&lt;-</span> <span class='nf'>copy_to</span><span class='o'>(</span><span class='nv'>con_sqlite</span>, <span class='nv'>penguins</span>, name <span class='o'>=</span> <span class='s'>"penguins_table"</span><span class='o'>)</span></span></code></pre>
</div>
<p>We can then predict with it as usual. All the calculations are sent to the database for execution.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>orbital_obj</span>, <span class='nv'>penguins_sqlite</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># Source:   SQL [?? x 1]</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># Database: sqlite 3.47.1 []</span></span></span>
<span><span class='c'>#&gt;    .pred_class</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie     </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ more rows</span></span></span>
<span></span></code></pre>
</div>
<p>This works the same with 






<a href="https://orbital.tidymodels.org/articles/databases.html" target="_blank" rel="noopener">many types of databases</a>
.</p>
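<p>One way to picture why this carries over to different backends is that each equation in the printed orbital object is a plain column expression, evaluated in order. The base R sketch below is only an illustration of that idea, not orbital's actual implementation: the toy <code>new_data</code> and the <code>equations</code> vector are assumptions, and <code>ifelse()</code> stands in for <code>dplyr::if_else()</code>.</p>

```r
# Toy data reusing two of the penguins column names (values are made up)
new_data <- data.frame(
  island = c("Dream", NA, "Torgersen"),
  body_mass_g = c(3750, NA, 4200),
  stringsAsFactors = FALSE
)

# Four equations copied from the printed orbital object, with base R
# ifelse() standing in for dplyr::if_else()
equations <- c(
  island = 'ifelse(is.na(island), "unknown", island)',
  island_Dream = 'as.numeric(island == "Dream")',
  island_Torgersen = 'as.numeric(island == "Torgersen")',
  body_mass_g = 'ifelse(is.na(body_mass_g), 4202, body_mass_g)'
)

# Evaluate the equations one at a time, so later ones see earlier results
for (name in names(equations)) {
  new_data[[name]] <- eval(parse(text = equations[[name]]), envir = new_data)
}

new_data
```

<p>Because nothing here goes beyond comparisons, arithmetic, and conditionals, the same steps can in principle be expressed in SQL.</p>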
<p>Classification differs from regression in part because it comes with multiple prediction types. The example above showed the default, which is hard classification. You can set the type of prediction you want with the <code>type</code> argument to <code>orbital()</code>. For classification models, the possible options are <code>&quot;class&quot;</code> and <code>&quot;prob&quot;</code>.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>orbital_obj_prob</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://orbital.tidymodels.org/reference/orbital.html'>orbital</a></span><span class='o'>(</span><span class='nv'>wf_fit</span>, type <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"class"</span>, <span class='s'>"prob"</span><span class='o'>)</span><span class='o'>)</span></span>
<span><span class='nv'>orbital_obj_prob</span></span>
<span><span class='c'>#&gt; </span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>orbital Object</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────</span></span></span>
<span><span class='c'>#&gt; • island = dplyr::if_else(is.na(island), "unknown", island)</span></span>
<span><span class='c'>#&gt; • sex = dplyr::if_else(is.na(sex), "unknown", sex)</span></span>
<span><span class='c'>#&gt; • island_Dream = as.numeric(island == "Dream")</span></span>
<span><span class='c'>#&gt; • island_Torgersen = as.numeric(island == "Torgersen")</span></span>
<span><span class='c'>#&gt; • sex_male = as.numeric(sex == "male")</span></span>
<span><span class='c'>#&gt; • sex_unknown = as.numeric(sex == "unknown")</span></span>
<span><span class='c'>#&gt; • bill_length_mm = dplyr::if_else(is.na(bill_length_mm), 43.92193, bill_l ...</span></span>
<span><span class='c'>#&gt; • bill_depth_mm = dplyr::if_else(is.na(bill_depth_mm), 17.15117, bill_dep ...</span></span>
<span><span class='c'>#&gt; • flipper_length_mm = dplyr::if_else(is.na(flipper_length_mm), 201, flipp ...</span></span>
<span><span class='c'>#&gt; • body_mass_g = dplyr::if_else(is.na(body_mass_g), 4202, body_mass_g)</span></span>
<span><span class='c'>#&gt; • island_Dream = dplyr::if_else(is.na(island_Dream), 0.3604651, island_Dr ...</span></span>
<span><span class='c'>#&gt; • island_Torgersen = dplyr::if_else(is.na(island_Torgersen), 0.1511628, i ...</span></span>
<span><span class='c'>#&gt; • sex_male = dplyr::if_else(is.na(sex_male), 0.4883721, sex_male)</span></span>
<span><span class='c'>#&gt; • sex_unknown = dplyr::if_else(is.na(sex_unknown), 0.03197674, sex_unknow ...</span></span>
<span><span class='c'>#&gt; • Adelie = 0 + dplyr::case_when((bill_depth_mm &lt; 15.1 | is.na(bill_depth_ ...</span></span>
<span><span class='c'>#&gt; • Chinstrap = 0 + dplyr::case_when((island_Dream &lt; 0.5 | is.na(island_Dre ...</span></span>
<span><span class='c'>#&gt; • Gentoo = 0 + dplyr::case_when((bill_depth_mm &lt; 15.95 | is.na(bill_depth ...</span></span>
<span><span class='c'>#&gt; • .pred_class = dplyr::case_when(Adelie &gt; Chinstrap &amp; Adelie &gt; Gentoo ~ " ...</span></span>
<span><span class='c'>#&gt; • norm = exp(Adelie) + exp(Chinstrap) + exp(Gentoo)</span></span>
<span><span class='c'>#&gt; • .pred_Adelie = exp(Adelie) / norm</span></span>
<span><span class='c'>#&gt; • .pred_Chinstrap = exp(Chinstrap) / norm</span></span>
<span><span class='c'>#&gt; • .pred_Gentoo = exp(Gentoo) / norm</span></span>
<span><span class='c'>#&gt; ────────────────────────────────────────────────────────────────────────────────</span></span>
<span><span class='c'>#&gt; 22 equations in total.</span></span>
<span></span></code></pre>
</div>
<p>Notice how we can select both <code>&quot;class&quot;</code> and <code>&quot;prob&quot;</code>. The predictions now include both hard and soft class predictions.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>orbital_obj_prob</span>, <span class='nv'>penguins</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 344 × 4</span></span></span>
<span><span class='c'>#&gt;    .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>           <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>        <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> Adelie             0.709         0.024<span style='text-decoration: underline;'>5</span>       0.267  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie             0.979         0.005<span style='text-decoration: underline;'>49</span>      0.015<span style='text-decoration: underline;'>8</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie             0.980         0.005<span style='text-decoration: underline;'>59</span>      0.014<span style='text-decoration: underline;'>8</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 334 more rows</span></span></span>
<span></span></code></pre>
</div>
<p>That works equally well in databases.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>orbital_obj_prob</span>, <span class='nv'>penguins_sqlite</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># Source:   SQL [?? x 4]</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># Database: sqlite 3.47.1 []</span></span></span>
<span><span class='c'>#&gt;    .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>           <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>        <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> Adelie             0.709         0.024<span style='text-decoration: underline;'>5</span>       0.267  </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie             0.989         0.005<span style='text-decoration: underline;'>54</span>      0.005<span style='text-decoration: underline;'>60</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie             0.979         0.005<span style='text-decoration: underline;'>49</span>      0.015<span style='text-decoration: underline;'>8</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie             0.980         0.005<span style='text-decoration: underline;'>59</span>      0.014<span style='text-decoration: underline;'>8</span> </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ more rows</span></span></span>
<span></span></code></pre>
</div>
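<p>The <code>norm</code> and <code>.pred_*</code> equations in the printed orbital object amount to a softmax over the per-class scores. The base R sketch below repeats that arithmetic on made-up scores; the <code>scores</code> data frame is hypothetical and does not come from the fitted model.</p>

```r
# Hypothetical raw per-class scores for three rows (made-up numbers)
scores <- data.frame(
  Adelie = c(2.5, 0.4, -1.0),
  Chinstrap = c(-1.2, 0.1, 2.2),
  Gentoo = c(-1.1, 1.5, -0.8)
)

# Same arithmetic as the printed norm and .pred_* equations
norm <- exp(scores$Adelie) + exp(scores$Chinstrap) + exp(scores$Gentoo)
probs <- data.frame(
  .pred_Adelie = exp(scores$Adelie) / norm,
  .pred_Chinstrap = exp(scores$Chinstrap) / norm,
  .pred_Gentoo = exp(scores$Gentoo) / norm
)

# Each row of probabilities sums to 1
rowSums(probs)

# Hard class prediction: the class with the largest score in each row
pred_class <- names(scores)[max.col(as.matrix(scores), ties.method = "first")]
pred_class
```

<p>Since <code>exp()</code> is increasing, the class with the largest score is also the class with the largest probability, so the hard and soft predictions agree.</p>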
<h2 id="new-augment-method">New augment method
</h2>
<p>The users of tidymodels have found the 






<a href="https://generics.r-lib.org/reference/augment.html" target="_blank" rel="noopener"><code>augment()</code></a>
 function to be a handy tool. This function performs predictions and returns them alongside the original data set.</p>
<p>This release adds 






<a href="https://generics.r-lib.org/reference/augment.html" target="_blank" rel="noopener"><code>augment()</code></a>
 support for orbital objects.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://generics.r-lib.org/reference/augment.html'>augment</a></span><span class='o'>(</span><span class='nv'>orbital_obj</span>, <span class='nv'>penguins</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 344 × 8</span></span></span>
<span><span class='c'>#&gt;    .pred_class species island    bill_length_mm bill_depth_mm flipper_length_mm</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>       <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>         <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>             <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie      Adelie  Torgersen           39.1          18.7               181</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie      Adelie  Torgersen           39.5          17.4               186</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie      Adelie  Torgersen           40.3          18                 195</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> Adelie      Adelie  Torgersen           <span style='color: #BB0000;'>NA</span>            <span style='color: #BB0000;'>NA</span>                  <span style='color: #BB0000;'>NA</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie      Adelie  Torgersen           36.7          19.3               193</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie      Adelie  Torgersen           39.3          20.6               190</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie      Adelie  Torgersen           38.9          17.8               181</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie      Adelie  Torgersen           39.2          19.6               195</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie      Adelie  Torgersen           34.1          18.1               193</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie      Adelie  Torgersen           42            20.2               190</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 334 more rows</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 2 more variables: body_mass_g &lt;int&gt;, sex &lt;fct&gt;</span></span></span>
<span></span></code></pre>
</div>
<p>The function works for most databases, but for technical reasons doesn&rsquo;t work with all of them. It has been confirmed not to work with Spark databases or Arrow tables.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://generics.r-lib.org/reference/augment.html'>augment</a></span><span class='o'>(</span><span class='nv'>orbital_obj</span>, <span class='nv'>penguins_sqlite</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># Source:   SQL [?? x 8]</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># Database: sqlite 3.47.1 []</span></span></span>
<span><span class='c'>#&gt;    .pred_class species island    bill_length_mm bill_depth_mm flipper_length_mm</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>       <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>              <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>         <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>             <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span> Adelie      Adelie  Torgersen           39.1          18.7               181</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span> Adelie      Adelie  Torgersen           39.5          17.4               186</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span> Adelie      Adelie  Torgersen           40.3          18                 195</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span> Adelie      Adelie  Torgersen           <span style='color: #BB0000;'>NA</span>            <span style='color: #BB0000;'>NA</span>                  <span style='color: #BB0000;'>NA</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span> Adelie      Adelie  Torgersen           36.7          19.3               193</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span> Adelie      Adelie  Torgersen           39.3          20.6               190</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span> Adelie      Adelie  Torgersen           38.9          17.8               181</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span> Adelie      Adelie  Torgersen           39.2          19.6               195</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span> Adelie      Adelie  Torgersen           34.1          18.1               193</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span> Adelie      Adelie  Torgersen           42            20.2               190</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ more rows</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 2 more variables: body_mass_g &lt;int&gt;, sex &lt;chr&gt;</span></span></span>
<span></span></code></pre>
</div>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>A big thank you to all the people who have contributed to orbital since the release of v0.3.0:</p>
<p>






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/joscani" target="_blank" rel="noopener">@joscani</a>
, 






<a href="https://github.com/jrosell" target="_blank" rel="noopener">@jrosell</a>
, 






<a href="https://github.com/npelikan" target="_blank" rel="noopener">@npelikan</a>
, and 






<a href="https://github.com/szimmer" target="_blank" rel="noopener">@szimmer</a>
.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2025-01-13_orbital-0-3-0/thumbnail-wd.jpg" length="60423" type="image/jpeg" />
    </item>
    <item>
      <title>Introducing mall for R...and Python</title>
      <link>https://opensource.posit.co/blog/2024-10-30_edgarmallintro/</link>
      <pubDate>Wed, 30 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2024-10-30_edgarmallintro/</guid>
      <dc:creator>Edgar Ruiz</dc:creator><description><![CDATA[<h2 id="the-beginning">The beginning
</h2>
<p>A few months ago, while working on the Databricks with R workshop, I came
across some of their custom SQL functions. These particular functions are
prefixed with &ldquo;ai_&rdquo;, and they run NLP with a simple SQL call:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="o">&gt;</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">ai_analyze_sentiment</span><span class="p">(</span><span class="s1">&#39;I am happy&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">positive</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="o">&gt;</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">ai_analyze_sentiment</span><span class="p">(</span><span class="s1">&#39;I am sad&#39;</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="n">negative</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>This was a revelation to me. It showcased a new way to use
LLMs in our daily work as analysts. To date, I had primarily employed LLMs
for code completion and development tasks. However, this new approach
focuses on using LLMs directly against our data instead.</p>
<p>My first reaction was to try and access the custom functions via R. With







<a href="https://github.com/tidyverse/dbplyr" target="_blank" rel="noopener"><code>dbplyr</code></a>
 we can access SQL functions
in R, and it was great to see them work:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">orders</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">mutate</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">sentiment</span> <span class="o">=</span> <span class="nf">ai_analyze_sentiment</span><span class="p">(</span><span class="n">o_comment</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; # Source:   SQL [6 x 2]</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;   o_comment                   sentiment</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;   &lt;chr&gt;                        &lt;chr&gt;    </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 1 &#34;, pending theodolites …    neutral  </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 2 &#34;uriously special foxes …   neutral  </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 3 &#34;sleep. courts after the …  neutral  </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 4 &#34;ess foxes may sleep …      neutral  </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 5 &#34;ts wake blithely unusual … mixed    </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 6 &#34;hins sleep. fluffily …     neutral</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>One downside of this integration is that even though accessible through R, we
require a live connection to Databricks in order to utilize an LLM in this
manner, thereby limiting the number of people who can benefit from it.</p>
<p>According to their documentation, Databricks is leveraging the Llama 3.1 70B
model. While this is a highly effective Large Language Model, its enormous size
poses a significant challenge for most users&rsquo; machines, making it impractical
to run on standard hardware.</p>
<h2 id="reaching-viability">Reaching viability
</h2>
<p>LLM development has been accelerating at a rapid pace. Initially, only online
Large Language Models (LLMs) were viable for daily use. This sparked concerns among
companies hesitant to share their data externally. Moreover, the cost of using
LLMs online can be substantial, as per-token charges can add up quickly.</p>
<p>The ideal solution would be to integrate an LLM into our own systems, requiring
three essential components:</p>
<ol>
<li>A model that can fit comfortably in memory</li>
<li>A model that achieves sufficient accuracy for NLP tasks</li>
<li>An intuitive interface between the model and the user&rsquo;s laptop</li>
</ol>
<p>Until recently, having all three of these elements was nearly impossible.
Models capable of fitting in-memory were either inaccurate or excessively slow.
However, recent advancements, such as 






<a href="https://www.llama.com/" target="_blank" rel="noopener">Llama from Meta</a>

and cross-platform interaction engines like 






<a href="https://ollama.com/" target="_blank" rel="noopener">Ollama</a>
, have
made it feasible to deploy these models, offering a promising solution for
companies looking to integrate LLMs into their workflows.</p>
<h2 id="the-project">The project
</h2>
<p>This project started as an exploration, driven by my interest in leveraging a
&ldquo;general-purpose&rdquo; LLM to produce results comparable to those from Databricks AI
functions. The primary challenge was determining how much setup and preparation
would be required for such a model to deliver reliable and consistent results.</p>
<p>Without access to a design document or open-source code, I relied solely on the
LLM&rsquo;s output as a testing ground. This presented several obstacles, including
the numerous options available for fine-tuning the model. Even within prompt
engineering, the possibilities are vast. To ensure the model was not too
specialized or focused on a specific subject or outcome, I needed to strike a
delicate balance between accuracy and generality.</p>
<p>Fortunately, after conducting extensive testing, I discovered that a simple
&ldquo;one-shot&rdquo; prompt yielded the best results. By &ldquo;best,&rdquo; I mean that the answers
were both accurate for a given row and consistent across multiple rows.
Consistency was crucial, as it meant providing answers that were one of the
specified options (positive, negative, or neutral), without any additional
explanations.</p>
<p>The following is an example of a prompt that worked reliably against
Llama 3.2:</p>
<pre><code>&gt;&gt;&gt; You are a helpful sentiment engine. Return only one of the 
... following answers: positive, negative, neutral. No capitalization. 
... No explanations. The answer is based on the following text: 
... I am happy
positive
</code></pre>
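<p>As a rough illustration of the consistency requirement described above, the sketch below assembles the same one-shot prompt for a single row and accepts a reply only if it is one of the specified options. The names <code>build_prompt</code> and <code>parse_answer</code> are hypothetical and are not part of mall, and the lenient cleanup of the reply (trimming whitespace, lowercasing, dropping a trailing period) is an assumption made for this example.</p>

```python
# Hypothetical helpers for illustration only; they are not part of mall.
OPTIONS = ("positive", "negative", "neutral")


def build_prompt(text, options=OPTIONS):
    """Assemble the one-shot sentiment prompt for a single row of text."""
    return (
        "You are a helpful sentiment engine. Return only one of the "
        "following answers: " + ", ".join(options) + ". No capitalization. "
        "No explanations. The answer is based on the following text: "
        + text
    )


def parse_answer(reply, options=OPTIONS):
    """Return the reply if it is one of the allowed options, otherwise None."""
    # Assumption: tolerate stray whitespace, capitalization and a final period.
    cleaned = reply.strip().lower().rstrip(".")
    return cleaned if cleaned in options else None


print(build_prompt("I am happy"))
print(parse_answer(" Positive.\n"))
print(parse_answer("The text is positive"))
```

<p>A reply that contains an explanation, such as the last one, is rejected rather than guessed at, which is one way to keep a column of results limited to the specified options.</p>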
<p>As a side note, my attempts to submit multiple rows at once proved unsuccessful.
In fact, I spent a significant amount of time exploring different approaches,
such as submitting 10 or 2 rows at a time, formatted as JSON or
CSV. The results were often inconsistent, and it didn&rsquo;t seem to accelerate
the process enough to be worth the effort.</p>
<p>Once I became comfortable with the approach, the next step was wrapping the
functionality within an R package.</p>
<h2 id="the-approach">The approach
</h2>
<p>One of my goals was to make the mall package as &ldquo;ergonomic&rdquo; as possible. In
other words, I wanted to ensure that using the package in R and Python
integrates seamlessly with how data analysts use their preferred language on a
daily basis.</p>
<p>For R, this was relatively straightforward. I simply needed to verify that the
functions worked well with pipes (<code>%&gt;%</code> and <code>|&gt;</code>) and could be easily
incorporated into packages like those in the <code>tidyverse</code>:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">reviews</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">llm_sentiment</span><span class="p">(</span><span class="n">review</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">filter</span><span class="p">(</span><span class="n">.sentiment</span> <span class="o">==</span> <span class="s">&#34;positive&#34;</span><span class="p">)</span> <span class="o">|&gt;</span> 
</span></span><span class="line"><span class="cl">  <span class="nf">select</span><span class="p">(</span><span class="n">review</span><span class="p">)</span> 
</span></span><span class="line"><span class="cl"><span class="c1">#&gt;                                                               review</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; 1 This has been the best TV I&#39;ve ever used. Great screen, and sound.</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Python, however, is not a native language for me, which meant that I had to adapt my
thinking about data manipulation. Specifically, I learned that in Python,
objects (like pandas DataFrames) &ldquo;contain&rdquo; transformation functions by design.</p>
<p>This insight led me to investigate whether the pandas API allows for extensions,
and fortunately, it does! After exploring the possibilities, I decided to start
with Polars, which allowed me to extend its API by creating a new namespace.
This simple addition enabled users to easily access the necessary functions:</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
</span></span><span class="line"><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mall</span>
</span></span><span class="line"><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;I am happy&#34;</span><span class="p">,</span> <span class="s2">&#34;I am sad&#34;</span><span class="p">]))</span>
</span></span><span class="line"><span class="cl"><span class="o">&gt;&gt;&gt;</span> <span class="n">df</span><span class="o">.</span><span class="n">llm</span><span class="o">.</span><span class="n">sentiment</span><span class="p">(</span><span class="s2">&#34;x&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">shape</span><span class="p">:</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="err">┌────────────┬───────────┐</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="n">x</span>          <span class="err">┆</span> <span class="n">sentiment</span> <span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="o">---</span>        <span class="err">┆</span> <span class="o">---</span>       <span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="nb">str</span>        <span class="err">┆</span> <span class="nb">str</span>       <span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">╞════════════╪═══════════╡</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="n">I</span> <span class="n">am</span> <span class="n">happy</span> <span class="err">┆</span> <span class="n">positive</span>  <span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="n">I</span> <span class="n">am</span> <span class="n">sad</span>   <span class="err">┆</span> <span class="n">negative</span>  <span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">└────────────┴───────────┘</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>By keeping all the new functions within the llm namespace, it becomes very easy
for users to find and utilize the ones they need:</p>
<p><div class="not-prose"><figure>
    <img class="h-auto max-w-full rounded-lg"
      src="https://opensource.posit.co/blog/2024-10-30_edgarmallintro/images/llm-namespace.png"
      alt="" 
      loading="lazy"
    >
  </figure></div>
</p>
<h2 id="whats-next">What&rsquo;s next
</h2>
<p>I think it will be easier to know what is to come for <code>mall</code> once the community
uses it and provides feedback. I anticipate that adding more LLM back ends will
be the main request. The other likely enhancement is that, as new or updated
models become available, the prompts may need to be updated for that given
model. I experienced this going from Llama 3.1 to Llama 3.2: one of the prompts
needed a tweak. The package is structured in such a way that future
tweaks like that will be additions to the package, and not replacements of the
prompts, so as to retain backwards compatibility.</p>
<p>This is the first time I have written an article about the history and structure of a
project. This particular effort was so unique, because of the R + Python and the
LLM aspects of it, that I figured it was worth sharing.</p>
<p>If you wish to learn more about <code>mall</code>, feel free to visit its official site:







<a href="https://mlverse.github.io/mall/" target="_blank" rel="noopener">https://mlverse.github.io/mall/</a>
</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2024-10-30_edgarmallintro/thumbnail.png" length="225127" type="image/png" />
    </item>
    <item>
      <title>Postprocessing is coming to tidymodels</title>
      <link>https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/</link>
      <pubDate>Tue, 08 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/</guid>
      <dc:creator>Simon Couch</dc:creator>
      <dc:creator>Hannah Frick</dc:creator>
      <dc:creator>Max Kuhn</dc:creator><description><![CDATA[<p>We&rsquo;re bristling with elation to share a set of upcoming features for postprocessing with tidymodels. Postprocessors refine the predictions output by machine learning models to improve predictive performance or better satisfy distributional limitations. The developmental versions of many tidymodels core packages include changes to support postprocessors, and we&rsquo;re ready to share our work and hear the community&rsquo;s thoughts on our progress so far.</p>
<p>Postprocessing support with tidymodels hasn&rsquo;t yet made it to CRAN, but you can install the needed versions of tidymodels packages with the following code.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'>pak</span><span class='nf'>::</span><span class='nf'><a href='https://pak.r-lib.org/reference/pak.html'>pak</a></span><span class='o'>(</span></span>
<span>  <span class='nf'><a href='https://rdrr.io/r/base/paste.html'>paste0</a></span><span class='o'>(</span></span>
<span>    <span class='s'>"tidymodels/"</span>,</span>
<span>    <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"tune"</span>, <span class='s'>"workflows"</span>, <span class='s'>"rsample"</span>, <span class='s'>"tailor"</span><span class='o'>)</span></span>
<span>  <span class='o'>)</span></span>
<span><span class='o'>)</span></span></code></pre>
</div>
<p>Now, we load packages with those developmental versions installed.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://tidymodels.tidymodels.org'>tidymodels</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://github.com/tidymodels/probably'>probably</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://github.com/tidymodels/tailor'>tailor</a></span><span class='o'>)</span></span></code></pre>
</div>
<p>Existing tidymodels users might have spotted something funky already; who is this tailor character?</p>
<h2 id="meet-tailor">Meet tailor👋
</h2>
<p>The tailor package introduces tailor objects, which compose iterative adjustments to model predictions. tailor is to postprocessing as recipes is to preprocessing; applying your mental model of recipes to tailor should get you a good bit of the way there.</p>
<div style="width: 140%; max-width: 140%; overflow-x: auto;">
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Applied to...</th>
          <th>Initialize with...</th>
          <th>Composes...</th>
          <th>Train with...</th>
          <th>Predict with...</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>recipes</td>
          <td>Training data</td>
          <td><code>recipe()</code></td>
          <td><code>step_*()</code>s</td>
          <td><code>prep()</code></td>
          <td><code>bake()</code></td>
      </tr>
      <tr>
          <td>tailor</td>
          <td>Model predictions</td>
          <td>






<a href="https://tailor.tidymodels.org/reference/tailor.html" target="_blank" rel="noopener"><code>tailor()</code></a>
</td>
          <td><code>adjust_*()</code>ments</td>
          <td>






<a href="https://generics.r-lib.org/reference/fit.html" target="_blank" rel="noopener"><code>fit()</code></a>
</td>
          <td>






<a href="https://rdrr.io/r/stats/predict.html" target="_blank" rel="noopener"><code>predict()</code></a>
</td>
      </tr>
  </tbody>
</table>
</div>
<p>First, users can initialize a tailor object with 






<a href="https://tailor.tidymodels.org/reference/tailor.html" target="_blank" rel="noopener"><code>tailor()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://tailor.tidymodels.org/reference/tailor.html'>tailor</a></span><span class='o'>(</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>tailor</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; A postprocessor with 0 adjustments.</span></span>
<span></span></code></pre>
</div>
<p>Tailors compose &ldquo;adjustments,&rdquo; analogous to steps from the recipes package.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://tailor.tidymodels.org/reference/tailor.html'>tailor</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://tailor.tidymodels.org/reference/adjust_probability_threshold.html'>adjust_probability_threshold</a></span><span class='o'>(</span>threshold <span class='o'>=</span> <span class='m'>.7</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>tailor</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; A binary postprocessor with 1 adjustment:</span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>•</span> Adjust probability threshold to 0.7.</span></span>
<span></span></code></pre>
</div>
<p>As an example, we&rsquo;ll apply this tailor to the <code>two_class_example</code> data made available after loading tidymodels.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/head.html'>head</a></span><span class='o'>(</span><span class='nv'>two_class_example</span><span class='o'>)</span></span>
<span><span class='c'>#&gt;    truth      Class1       Class2 predicted</span></span>
<span><span class='c'>#&gt; 1 Class2 0.003589243 0.9964107574    Class2</span></span>
<span><span class='c'>#&gt; 2 Class1 0.678621054 0.3213789460    Class1</span></span>
<span><span class='c'>#&gt; 3 Class2 0.110893522 0.8891064779    Class2</span></span>
<span><span class='c'>#&gt; 4 Class1 0.735161703 0.2648382969    Class1</span></span>
<span><span class='c'>#&gt; 5 Class2 0.016239960 0.9837600397    Class2</span></span>
<span><span class='c'>#&gt; 6 Class1 0.999275071 0.0007249286    Class1</span></span>
<span></span></code></pre>
</div>
<p>This data gives the true value of an outcome variable <code>truth</code> as well as predicted probabilities (<code>Class1</code> and <code>Class2</code>). The hard class predictions, in <code>predicted</code>, are <code>&quot;Class1&quot;</code> if the probability assigned to <code>&quot;Class1&quot;</code> is above .5, and <code>&quot;Class2&quot;</code> otherwise.</p>
<p>The model predicts <code>&quot;Class1&quot;</code> more often than it does <code>&quot;Class2&quot;</code>.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>two_class_example</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span> <span class='nf'>count</span><span class='o'>(</span><span class='nv'>predicted</span><span class='o'>)</span></span>
<span><span class='c'>#&gt;   predicted   n</span></span>
<span><span class='c'>#&gt; 1    Class1 277</span></span>
<span><span class='c'>#&gt; 2    Class2 223</span></span>
<span></span></code></pre>
</div>
<p>If we wanted the model to predict <code>&quot;Class2&quot;</code> more often, we could increase the probability threshold assigned to <code>&quot;Class1&quot;</code> above which the hard class prediction will be <code>&quot;Class1&quot;</code>. In the tailor package, this adjustment is implemented in 






<a href="https://tailor.tidymodels.org/reference/adjust_probability_threshold.html" target="_blank" rel="noopener"><code>adjust_probability_threshold()</code></a>
, which can be situated in a tailor object.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>tlr</span> <span class='o'>&lt;-</span></span>
<span>  <span class='nf'><a href='https://tailor.tidymodels.org/reference/tailor.html'>tailor</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://tailor.tidymodels.org/reference/adjust_probability_threshold.html'>adjust_probability_threshold</a></span><span class='o'>(</span>threshold <span class='o'>=</span> <span class='m'>.7</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>tlr</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>tailor</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; A binary postprocessor with 1 adjustment:</span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>•</span> Adjust probability threshold to 0.7.</span></span>
<span></span></code></pre>
</div>
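<p>The thresholding rule itself is simple enough to sketch outside of tailor. The Python snippet below is only an illustration of the idea, not how tailor is implemented: it applies the default threshold of .5 and the stricter threshold of .7 to the <code>Class1</code> probabilities from the first six rows of <code>two_class_example</code> shown earlier. The function name <code>hard_class</code> is hypothetical, and treating a probability exactly equal to the threshold as <code>"Class1"</code> is an assumption of the sketch.</p>

```python
# Class1 probabilities from the first six rows of two_class_example.
class1_prob = [
    0.003589243,
    0.678621054,
    0.110893522,
    0.735161703,
    0.016239960,
    0.999275071,
]


def hard_class(prob_class1, threshold=0.5):
    """Predict Class1 when its probability reaches the threshold, else Class2."""
    # Assumption: a probability exactly at the threshold counts as Class1.
    return "Class1" if prob_class1 >= threshold else "Class2"


default_preds = [hard_class(p) for p in class1_prob]
stricter_preds = [hard_class(p, threshold=0.7) for p in class1_prob]

print(default_preds)
print(stricter_preds)
```

<p>With the threshold raised to .7, the second row (probability of about .68 for <code>Class1</code>) flips from <code>"Class1"</code> to <code>"Class2"</code>, while the other five predictions are unchanged.</p>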
<p>tailors must be fitted before they can predict on new data. For adjustments like 






<a href="https://tailor.tidymodels.org/reference/adjust_probability_threshold.html" target="_blank" rel="noopener"><code>adjust_probability_threshold()</code></a>
, there&rsquo;s no training that actually happens at the 






<a href="https://generics.r-lib.org/reference/fit.html" target="_blank" rel="noopener"><code>fit()</code></a>
 step besides recording the name and type of relevant variables. For other adjustments, like numeric calibration with 






<a href="https://tailor.tidymodels.org/reference/adjust_numeric_calibration.html" target="_blank" rel="noopener"><code>adjust_numeric_calibration()</code></a>
, parameters are actually estimated at the 






<a href="https://generics.r-lib.org/reference/fit.html" target="_blank" rel="noopener"><code>fit()</code></a>
 stage and separate data should be used to train the postprocessor and evaluate its performance. More on this in 


  
  
  





<a href="#tailors-in-context">Tailors in context</a>
.</p>
<p>In this case, though, we can 






<a href="https://generics.r-lib.org/reference/fit.html" target="_blank" rel="noopener"><code>fit()</code></a>
 on the whole dataset. The resulting object is still a tailor, but is now flagged as trained.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>tlr_trained</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://generics.r-lib.org/reference/fit.html'>fit</a></span><span class='o'>(</span></span>
<span>  <span class='nv'>tlr</span>,</span>
<span>  <span class='nv'>two_class_example</span>,</span>
<span>  outcome <span class='o'>=</span> <span class='nv'>truth</span>,</span>
<span>  estimate <span class='o'>=</span> <span class='nv'>predicted</span>,</span>
<span>  probabilities <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='nv'>Class1</span>, <span class='nv'>Class2</span><span class='o'>)</span></span>
<span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>tlr_trained</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>tailor</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; A binary postprocessor with 1 adjustment:</span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>•</span> Adjust probability threshold to 0.7. [trained]</span></span>
<span></span></code></pre>
</div>
<p>When used with a model 






<a href="https://workflows.tidymodels.org" target="_blank" rel="noopener">workflow</a>
 via 






<a href="https://workflows.tidymodels.org/dev/reference/add_tailor.html" target="_blank" rel="noopener"><code>add_tailor()</code></a>
, the arguments to 






<a href="https://generics.r-lib.org/reference/fit.html" target="_blank" rel="noopener"><code>fit()</code></a>
 a tailor will be set automatically. Generally, as with recipes, we recommend adding tailors to model workflows for training and prediction rather than using them standalone; workflows are easier to use and help prevent data leakage. That said, tailors are fully functional by themselves, too.</p>
<p>Now, when passed new data, the trained tailor will determine the predicted class based on whether the probability assigned to the level <code>&quot;Class1&quot;</code> is above <code>.7</code>, resulting in more predictions of <code>&quot;Class2&quot;</code> than before.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/stats/predict.html'>predict</a></span><span class='o'>(</span><span class='nv'>tlr_trained</span>, <span class='nv'>two_class_example</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span> <span class='nf'>count</span><span class='o'>(</span><span class='nv'>predicted</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 2 × 2</span></span></span>
<span><span class='c'>#&gt;   predicted     n</span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>     <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span> Class1      236</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span> Class2      264</span></span>
<span></span></code></pre>
</div>
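<p>As a rough illustration of what the threshold adjustment does, here is a minimal base R sketch. The probabilities are made up, the <code>classify()</code> helper is hypothetical, and the 0.5 cutoff is included only for comparison. Treating a probability strictly above the threshold as <code>&quot;Class1&quot;</code> is an assumption based on the wording above; tailor&rsquo;s own handling of a probability exactly equal to the threshold may differ.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'># Made-up probabilities for the first level, "Class1" (illustrative only)
prob_class1 &lt;- c(0.95, 0.72, 0.65, 0.40)

# Hypothetical helper: assign hard classes at a given threshold
# (strictly-above comparison is an assumption)
classify &lt;- function(prob, threshold) {
  factor(
    ifelse(prob &gt; threshold, "Class1", "Class2"),
    levels = c("Class1", "Class2")
  )
}

pred_default &lt;- classify(prob_class1, threshold = 0.5)
pred_strict  &lt;- classify(prob_class1, threshold = 0.7)

# Raising the threshold moves the borderline case (0.65) to "Class2"
table(pred_default)
table(pred_strict)
</code></pre>
</div>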
<p>Changing the probability threshold is one of many possible adjustments available in tailor.</p>
<ul>
<li>For probabilities: 






<a href="https://tailor.tidymodels.org/reference/adjust_probability_calibration.html" target="_blank" rel="noopener">calibration</a>
</li>
<li>For transformation of probabilities to hard class predictions: 






<a href="https://tailor.tidymodels.org/reference/adjust_probability_threshold.html" target="_blank" rel="noopener">thresholds</a>
, 






<a href="https://tailor.tidymodels.org/reference/adjust_equivocal_zone.html" target="_blank" rel="noopener">equivocal zones</a>
</li>
<li>For numeric outcomes: 






<a href="https://tailor.tidymodels.org/reference/adjust_numeric_calibration.html" target="_blank" rel="noopener">calibration</a>
, 






<a href="https://tailor.tidymodels.org/reference/adjust_numeric_range.html" target="_blank" rel="noopener">range</a>
</li>
</ul>
<p>Support for tailors is now plumbed through workflows (via 






<a href="https://workflows.tidymodels.org/dev/reference/add_tailor.html" target="_blank" rel="noopener"><code>add_tailor()</code></a>
) and tune, and rsample includes a set of infrastructural changes to prevent data leakage behind the scenes. We haven&rsquo;t yet implemented support for tuning parameters in tailors, but we plan to do so before this functionality heads to CRAN.</p>
<h2 id="tailors-in-context">Tailors in context
</h2>
<p>As an example, let&rsquo;s model a study of food delivery times in minutes (i.e., the time from the initial order to receiving the food) for a single restaurant. The <code>deliveries</code> data is available upon loading the tidymodels meta-package.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/data.html'>data</a></span><span class='o'>(</span><span class='nv'>deliveries</span><span class='o'>)</span></span>
<span></span>
<span><span class='c'># split into training and testing sets</span></span>
<span><span class='nf'><a href='https://rdrr.io/r/base/Random.html'>set.seed</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>)</span></span>
<span><span class='nv'>delivery_split</span> <span class='o'>&lt;-</span> <span class='nf'>initial_split</span><span class='o'>(</span><span class='nv'>deliveries</span><span class='o'>)</span></span>
<span><span class='nv'>delivery_train</span> <span class='o'>&lt;-</span> <span class='nf'>training</span><span class='o'>(</span><span class='nv'>delivery_split</span><span class='o'>)</span></span>
<span><span class='nv'>delivery_test</span>  <span class='o'>&lt;-</span> <span class='nf'>testing</span><span class='o'>(</span><span class='nv'>delivery_split</span><span class='o'>)</span></span>
<span></span>
<span><span class='c'># resample the training set using 10-fold cross-validation</span></span>
<span><span class='nf'><a href='https://rdrr.io/r/base/Random.html'>set.seed</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>)</span></span>
<span><span class='nv'>delivery_folds</span> <span class='o'>&lt;-</span> <span class='nf'>vfold_cv</span><span class='o'>(</span><span class='nv'>delivery_train</span><span class='o'>)</span></span>
<span></span>
<span><span class='c'># print out the training set</span></span>
<span><span class='nv'>delivery_train</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 7,509 × 31</span></span></span>
<span><span class='c'>#&gt;    time_to_delivery  hour day   distance item_01 item_02 item_03 item_04 item_05</span></span>
<span><span class='c'>#&gt;               <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;fct&gt;</span>    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>             21.2  16.1 Tue       3.02       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>             17.9  12.4 Sun       3.37       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>             22.4  14.2 Fri       2.59       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>             30.9  19.1 Sat       2.77       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>             30.1  16.5 Fri       2.05       0       0       0       1       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>             35.3  14.7 Sat       4.57       0       0       2       1       1</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>             13.1  11.5 Sat       2.09       0       0       0       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>             18.3  13.4 Tue       2.35       0       2       1       0       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>             25.2  20.5 Sat       2.43       0       0       0       1       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>             30.7  16.7 Fri       2.24       0       0       0       1       0</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 7,499 more rows</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 22 more variables: item_06 &lt;int&gt;, item_07 &lt;int&gt;, item_08 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_09 &lt;int&gt;, item_10 &lt;int&gt;, item_11 &lt;int&gt;, item_12 &lt;int&gt;, item_13 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_14 &lt;int&gt;, item_15 &lt;int&gt;, item_16 &lt;int&gt;, item_17 &lt;int&gt;, item_18 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_19 &lt;int&gt;, item_20 &lt;int&gt;, item_21 &lt;int&gt;, item_22 &lt;int&gt;, item_23 &lt;int&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   item_24 &lt;int&gt;, item_25 &lt;int&gt;, item_26 &lt;int&gt;, item_27 &lt;int&gt;</span></span></span>
<span></span></code></pre>
</div>
<p>Let&rsquo;s deliberately define a regression model that has poor predicted values: a boosted tree with only three ensemble members.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>delivery_wflow</span> <span class='o'>&lt;-</span></span>
<span>  <span class='nf'>workflow</span><span class='o'>(</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'>add_formula</span><span class='o'>(</span><span class='nv'>time_to_delivery</span> <span class='o'>~</span> <span class='nv'>.</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'>add_model</span><span class='o'>(</span><span class='nf'>boost_tree</span><span class='o'>(</span>mode <span class='o'>=</span> <span class='s'>"regression"</span>, trees <span class='o'>=</span> <span class='m'>3</span><span class='o'>)</span><span class='o'>)</span></span></code></pre>
</div>
<p>Evaluating against resamples:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/base/Random.html'>set.seed</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>)</span></span>
<span><span class='nv'>delivery_res</span> <span class='o'>&lt;-</span> </span>
<span>  <span class='nf'>fit_resamples</span><span class='o'>(</span></span>
<span>    <span class='nv'>delivery_wflow</span>, </span>
<span>    <span class='nv'>delivery_folds</span>, </span>
<span>    control <span class='o'>=</span> <span class='nf'>control_resamples</span><span class='o'>(</span>save_pred <span class='o'>=</span> <span class='kc'>TRUE</span><span class='o'>)</span></span>
<span>  <span class='o'>)</span></span></code></pre>
</div>
<p>The $R^2$ looks quite strong!</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://tune.tidymodels.org/reference/collect_predictions.html'>collect_metrics</a></span><span class='o'>(</span><span class='nv'>delivery_res</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 2 × 6</span></span></span>
<span><span class='c'>#&gt;   .metric .estimator  mean     n std_err .config             </span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>               </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span> rmse    standard   9.52     10 0.053<span style='text-decoration: underline;'>3</span>  Preprocessor1_Model1</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span> rsq     standard   0.853    10 0.003<span style='text-decoration: underline;'>57</span> Preprocessor1_Model1</span></span>
<span></span></code></pre>
</div>
<p>Let&rsquo;s take a closer look at the predictions, though. How well are they calibrated? We can use the 






<a href="https://probably.tidymodels.org/reference/cal_plot_regression.html" target="_blank" rel="noopener"><code>cal_plot_regression()</code></a>
 helper from the probably package to put together a quick diagnostic plot.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://tune.tidymodels.org/reference/collect_predictions.html'>collect_predictions</a></span><span class='o'>(</span><span class='nv'>delivery_res</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://probably.tidymodels.org/reference/cal_plot_regression.html'>cal_plot_regression</a></span><span class='o'>(</span>truth <span class='o'>=</span> <span class='nv'>time_to_delivery</span>, estimate <span class='o'>=</span> <span class='nv'>.pred</span><span class='o'>)</span></span>
</code></pre>
<img src="https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/figs/predictions-bad-boost-1.png" width="700px" style="display: block; margin: auto;" />
</div>
<p>Ooof.</p>
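<p>A strong $R^2$ and poor calibration can coexist because the <code>rsq</code> metric reported here is a squared correlation, which does not penalize predictions that are systematically too small. The sketch below shows this in base R with simulated values (not the <code>deliveries</code> data); the shrinkage factor and noise level are arbitrary assumptions.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>set.seed(1)

# Simulated outcomes and predictions that track the truth closely
# but are systematically too small (illustrative values only)
truth &lt;- runif(500, min = 10, max = 60)
pred  &lt;- 0.4 * truth + 3 + rnorm(500, sd = 2)

# Squared correlation ignores the shrinkage entirely...
rsq_sim &lt;- cor(truth, pred)^2

# ...while RMSE, measured on the outcome scale, does not
rmse_sim &lt;- sqrt(mean((truth - pred)^2))

rsq_sim
rmse_sim
</code></pre>
</div>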
<p>In comes tailor! Numeric calibration can help address the correlated errors here. We can add a tailor to our existing workflow to &ldquo;bump up&rdquo; predictions towards their true value.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>delivery_wflow_improved</span> <span class='o'>&lt;-</span></span>
<span>  <span class='nv'>delivery_wflow</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'>add_tailor</span><span class='o'>(</span><span class='nf'><a href='https://tailor.tidymodels.org/reference/tailor.html'>tailor</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span> <span class='nf'><a href='https://tailor.tidymodels.org/reference/adjust_numeric_calibration.html'>adjust_numeric_calibration</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span></code></pre>
</div>
<p>The resampling code looks the same from here.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/base/Random.html'>set.seed</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>)</span></span>
<span><span class='nv'>delivery_res_improved</span> <span class='o'>&lt;-</span> </span>
<span>  <span class='nf'>fit_resamples</span><span class='o'>(</span></span>
<span>    <span class='nv'>delivery_wflow_improved</span>, </span>
<span>    <span class='nv'>delivery_folds</span>, </span>
<span>    control <span class='o'>=</span> <span class='nf'>control_resamples</span><span class='o'>(</span>save_pred <span class='o'>=</span> <span class='kc'>TRUE</span><span class='o'>)</span></span>
<span>  <span class='o'>)</span></span></code></pre>
</div>
<p>Checking out the same plot reveals a much better fit!</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://tune.tidymodels.org/reference/collect_predictions.html'>collect_predictions</a></span><span class='o'>(</span><span class='nv'>delivery_res_improved</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://probably.tidymodels.org/reference/cal_plot_regression.html'>cal_plot_regression</a></span><span class='o'>(</span>truth <span class='o'>=</span> <span class='nv'>time_to_delivery</span>, estimate <span class='o'>=</span> <span class='nv'>.pred</span><span class='o'>)</span></span>
</code></pre>
<img src="https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/figs/predictios-better-boost-1.png" width="700px" style="display: block; margin: auto;" />
</div>
<p>There&rsquo;s actually some tricky data leakage prevention happening under the hood here. When you add tailors to a workflow and fit them with tune, this is all taken care of for you. If you&rsquo;re interested in using tailors outside of that context, check out 


  
  
  





<a href="https://workflows.tidymodels.org/dev/reference/add_tailor.html#data-usage" target="_blank" rel="noopener">this documentation section</a>
 in <code>add_tailor()</code>.</p>
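<p>The core idea behind that data usage can be sketched in base R. This is a simplified illustration under stated assumptions: the data are simulated, the 50/50 split is arbitrary, and a plain linear regression via <code>lm()</code> stands in for the calibrator; it is not the estimator that tailor uses. The point is only that the calibration is estimated on one set of predictions and evaluated on another.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>set.seed(1)

# Simulated outcomes and systematically shrunken model predictions
# (illustrative values, not the deliveries data)
n &lt;- 200
truth &lt;- runif(n, min = 10, max = 60)
pred  &lt;- 0.5 * truth + 5 + rnorm(n, sd = 2)
sim   &lt;- data.frame(truth = truth, pred = pred)

# Keep the rows used to estimate the calibration separate from the
# rows used to evaluate it (an arbitrary 50/50 split)
calibration_set &lt;- sim[1:100, ]
evaluation_set  &lt;- sim[101:200, ]

# Stand-in calibrator: regress the truth on the raw predictions
calibrator &lt;- lm(truth ~ pred, data = calibration_set)

# Apply the fitted calibrator to predictions it has not seen
evaluation_set$pred_calibrated &lt;- predict(calibrator, newdata = evaluation_set)

rmse &lt;- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rmse_raw        &lt;- rmse(evaluation_set$truth, evaluation_set$pred)
rmse_calibrated &lt;- rmse(evaluation_set$truth, evaluation_set$pred_calibrated)

c(raw = rmse_raw, calibrated = rmse_calibrated)
</code></pre>
</div>
<p>Evaluating the calibrator on the same rows it was estimated from would tend to make it look better than it is, which is the kind of leakage the workflow integration is meant to guard against.</p>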
<h2 id="whats-to-come">What&rsquo;s to come
</h2>
<p>We&rsquo;re excited about how this work is shaping up and would love to hear y&rsquo;all&rsquo;s thoughts on what we&rsquo;ve brought together so far. Please do comment on our social media posts about this blog entry or leave issues on the 






<a href="https://github.com/tidymodels/tailor" target="_blank" rel="noopener">tailor GitHub repository</a>
 and let us know what you think!</p>
<p>Before these changes head out to CRAN, we&rsquo;ll also be implementing tuning functionality for postprocessors. You&rsquo;ll be able to tag arguments like <code>adjust_probability_threshold(threshold)</code> or <code>adjust_probability_calibration(method)</code> with <code>tune()</code> to optimize across several values. Besides that, post-processing with tidymodels should &ldquo;just work&rdquo; on the development versions of our packages&mdash;let us know if you come across anything wonky.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>Postprocessing support has been a longstanding feature request across many of our repositories; we&rsquo;re grateful for the community discussions there for shaping this work. Additionally, we thank Ryan Tibshirani and Daniel McDonald for fruitful discussions on how we might scope these features.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2024-10-08_postprocessing-preview/thumbnail-wd.jpg" length="386938" type="image/jpeg" />
    </item>
    <item>
      <title>recipes 1.1.0</title>
      <link>https://opensource.posit.co/blog/2024-07-08_recipes-1-1-0/</link>
      <pubDate>Mon, 08 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2024-07-08_recipes-1-1-0/</guid>
      <dc:creator>Emil Hvitfeldt</dc:creator><description><![CDATA[
<p>We&rsquo;re thrilled to announce the release of 






<a href="https://recipes.tidymodels.org/" target="_blank" rel="noopener">recipes</a>
 1.1.0. recipes lets you create a pipeable sequence of feature engineering steps.</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"recipes"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post will go over some of the bigger changes in this release: improvements in column type checking, support for more data types as input to recipes, use of long formulas, and better errors for misspelled argument names.</p>
<p>You can see a full list of changes in the 






<a href="https://github.com/tidymodels/recipes/releases/tag/v1.1.0" target="_blank" rel="noopener">release notes</a>
.</p>
<h2 id="column-type-checking">Column type checking
</h2>
<p>A 






<a href="https://github.com/tidymodels/recipes/issues/793" target="_blank" rel="noopener">longtime issue</a>
 in recipes came from the fact that recipes didn&rsquo;t keep a 






<a href="https://vctrs.r-lib.org/articles/type-size.html" target="_blank" rel="noopener">prototype</a>
 (ptype) of the data it was specified with. This would cause unexpected things to happen or uninformative error messages to appear if different data was used to 






<a href="https://recipes.tidymodels.org/reference/prep.html" target="_blank" rel="noopener"><code>prep()</code></a>
 than was used to create the 






<a href="https://recipes.tidymodels.org/reference/recipe.html" target="_blank" rel="noopener"><code>recipe()</code></a>
.</p>
<p>Every recipe you create starts with a call to 






<a href="https://recipes.tidymodels.org/reference/recipe.html" target="_blank" rel="noopener"><code>recipe()</code></a>
. In the example below, we create a recipe where <code>x2</code> starts as a character vector, but the recipe is prepped with data where <code>x2</code> is a numeric vector. Previously, this didn&rsquo;t produce any warnings or errors and silently did something unintended.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">data_template</span> <span class="o">&lt;-</span> <span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">outcome</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">10</span><span class="p">),</span> 
</span></span><span class="line"><span class="cl">  <span class="n">x1</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">10</span><span class="p">),</span> 
</span></span><span class="line"><span class="cl">  <span class="n">x2</span> <span class="o">=</span> <span class="nf">sample</span><span class="p">(</span><span class="kc">letters</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">rec</span> <span class="o">&lt;-</span> <span class="nf">recipe</span><span class="p">(</span><span class="n">outcome</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data_template</span><span class="p">)</span> <span class="o">%&gt;%</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_bin2factor</span><span class="p">(</span><span class="nf">all_numeric_predictors</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">data_training</span> <span class="o">&lt;-</span> <span class="nf">tibble</span><span class="p">(</span><span class="n">outcome</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">1000</span><span class="p">),</span> <span class="n">x1</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">1000</span><span class="p">),</span> <span class="n">x2</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">1000</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">prep</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">training</span> <span class="o">=</span> <span class="n">data_training</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── Recipe ──────────────────────────────────────────────────────────────────────</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── Inputs</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Number of variables by role</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; outcome:   1</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; predictor: 2</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── Training information</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Training data contained 1000 data points and no incomplete rows.</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; </span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ── Operations</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; • Dummy variable to factor conversion for: x1 | Trained</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Now, we get an error detailing how the data is different.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>data_template</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span>outcome <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/stats/Normal.html'>rnorm</a></span><span class='o'>(</span><span class='m'>10</span><span class='o'>)</span>, x1 <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/stats/Normal.html'>rnorm</a></span><span class='o'>(</span><span class='m'>10</span><span class='o'>)</span>, x2 <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/sample.html'>sample</a></span><span class='o'>(</span><span class='nv'>letters</span>, <span class='m'>10</span>, <span class='kc'>TRUE</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>rec</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>outcome</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>data_template</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_bin2factor.html'>step_bin2factor</a></span><span class='o'>(</span><span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_numeric_predictors</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>data_training</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span>outcome <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/stats/Normal.html'>rnorm</a></span><span class='o'>(</span><span class='m'>1000</span><span class='o'>)</span>, x1 <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/stats/Normal.html'>rnorm</a></span><span class='o'>(</span><span class='m'>1000</span><span class='o'>)</span>, x2 <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/stats/Normal.html'>rnorm</a></span><span class='o'>(</span><span class='m'>1000</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='nv'>rec</span>, training <span class='o'>=</span> <span class='nv'>data_training</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00; font-weight: bold;'>Error</span><span style='font-weight: bold;'> in `prep()`:</span></span></span>
<span><span class='c'>#&gt; <span style='color: #BB0000;'>✖</span> The following variable has the wrong class:</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>•</span> `x2` must have class <span style='color: #0000BB;'>&lt;character&gt;</span>, not <span style='color: #0000BB;'>&lt;numeric&gt;</span>.</span></span>
<span></span></code></pre>
</div>
<p>Note that recipes created before version 1.1.0 don&rsquo;t contain any ptype information, and will not undergo checking. Rerunning the code to create the recipe will add ptype information to the recipe.</p>
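<p>To make the idea of a ptype concrete, here is a small base R sketch. A zero-row slice of a data frame keeps the column names and types but none of the values, which is enough to compare against later data. The <code>check_ptype()</code> helper below is hypothetical and only illustrates the kind of comparison involved; it is not how recipes implements the check.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'># Hypothetical helper (not part of recipes): return the names of columns
# whose class differs between a stored prototype and new data
check_ptype &lt;- function(ptype, new_data) {
  first_class &lt;- function(col) class(col)[1]
  shared   &lt;- intersect(names(ptype), names(new_data))
  expected &lt;- vapply(ptype[shared], first_class, character(1))
  actual   &lt;- vapply(new_data[shared], first_class, character(1))
  shared[expected != actual]
}

data_template &lt;- data.frame(
  outcome = rnorm(10),
  x1 = rnorm(10),
  x2 = sample(letters, 10, replace = TRUE)
)

# A zero-row slice keeps column names and types, but no values
ptype &lt;- data_template[0, , drop = FALSE]

data_training &lt;- data.frame(
  outcome = rnorm(1000),
  x1 = rnorm(1000),
  x2 = rnorm(1000)
)

# x2 was character in the template but is numeric in the training data
check_ptype(ptype, data_training)
</code></pre>
</div>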
<h2 id="input-checking-in-recipe">Input checking in <code>recipe()</code>
</h2>
<p>We have relaxed the requirements of data frames, while making feedback more helpful when something goes wrong.</p>
<p>The data was previously passed through 






<a href="https://rdrr.io/r/stats/model.frame.html" target="_blank" rel="noopener"><code>model.frame()</code></a>
 inside the recipe, which restricted what could be handled. Previously prohibited input included data frames with list-columns or 






<a href="https://r-spatial.github.io/sf/" target="_blank" rel="noopener">sf</a>
 data frames. Both are now supported, as long as the input is a <code>data.frame</code> object.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>data_listcolumn</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span></span>
<span>  y <span class='o'>=</span> <span class='m'>1</span><span class='o'>:</span><span class='m'>4</span>,</span>
<span>  x <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/list.html'>list</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>:</span><span class='m'>3</span>, <span class='m'>4</span><span class='o'>:</span><span class='m'>6</span>, <span class='m'>3</span><span class='o'>:</span><span class='m'>1</span>, <span class='m'>1</span><span class='o'>:</span><span class='m'>10</span><span class='o'>)</span></span>
<span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>y</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>data_listcolumn</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>Recipe</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; ── Inputs</span></span>
<span></span><span><span class='c'>#&gt; Number of variables by role</span></span>
<span></span><span><span class='c'>#&gt; outcome:   1</span></span>
<span><span class='c'>#&gt; predictor: 1</span></span>
<span></span></code></pre>
</div>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://r-spatial.github.io/sf/'>sf</a></span><span class='o'>)</span></span>
<span><span class='c'>#&gt; Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE</span></span>
<span></span><span><span class='nv'>pathshp</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/system.file.html'>system.file</a></span><span class='o'>(</span><span class='s'>"shape/nc.shp"</span>, package <span class='o'>=</span> <span class='s'>"sf"</span><span class='o'>)</span></span>
<span><span class='nv'>data_sf</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://r-spatial.github.io/sf/reference/st_read.html'>st_read</a></span><span class='o'>(</span><span class='nv'>pathshp</span>, quiet <span class='o'>=</span> <span class='kc'>TRUE</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>AREA</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>data_sf</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>Recipe</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; ── Inputs</span></span>
<span></span><span><span class='c'>#&gt; Number of variables by role</span></span>
<span></span><span><span class='c'>#&gt; outcome:    1</span></span>
<span><span class='c'>#&gt; predictor: 14</span></span>
<span></span></code></pre>
</div>
<p>We are excited to see what people can do with these new options.</p>
<p>Another way to tell a recipe which variables should be included and what roles they should have is to use 
<a href="https://recipes.tidymodels.org/reference/roles.html" target="_blank" rel="noopener"><code>add_role()</code></a>
 and 
<a href="https://recipes.tidymodels.org/reference/roles.html" target="_blank" rel="noopener"><code>update_role()</code></a>
. Previously, if you were not careful, you could end up with the same variable labeled as both an outcome and a predictor. This now results in an error.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='c'># previously ran without an error</span></span>
<span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/roles.html'>update_role</a></span><span class='o'>(</span><span class='nf'><a href='https://tidyselect.r-lib.org/reference/everything.html'>everything</a></span><span class='o'>(</span><span class='o'>)</span>, new_role <span class='o'>=</span> <span class='s'>"predictor"</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/roles.html'>add_role</a></span><span class='o'>(</span><span class='s'>"mpg"</span>, new_role <span class='o'>=</span> <span class='s'>"outcome"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00; font-weight: bold;'>Error</span><span style='font-weight: bold;'> in `add_role()`:</span></span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00;'>!</span> `mpg` cannot get <span style='color: #0000BB;'>"outcome"</span> role as it already has role <span style='color: #0000BB;'>"predictor"</span>.</span></span>
<span></span></code></pre>
</div>
<p>This error can be avoided by using 






<a href="https://recipes.tidymodels.org/reference/roles.html" target="_blank" rel="noopener"><code>update_role()</code></a>
 instead of 






<a href="https://recipes.tidymodels.org/reference/roles.html" target="_blank" rel="noopener"><code>add_role()</code></a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/roles.html'>update_role</a></span><span class='o'>(</span><span class='nf'><a href='https://tidyselect.r-lib.org/reference/everything.html'>everything</a></span><span class='o'>(</span><span class='o'>)</span>, new_role <span class='o'>=</span> <span class='s'>"predictor"</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/roles.html'>update_role</a></span><span class='o'>(</span><span class='s'>"mpg"</span>, new_role <span class='o'>=</span> <span class='s'>"outcome"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>Recipe</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; ── Inputs</span></span>
<span></span><span><span class='c'>#&gt; Number of variables by role</span></span>
<span></span><span><span class='c'>#&gt; outcome:    1</span></span>
<span><span class='c'>#&gt; predictor: 10</span></span>
<span></span></code></pre>
</div>
<h2 id="long-formulas-in-recipe">Long formulas in <code>recipe()</code>
</h2>
<p>Related to the changes we saw above, we now fully support very long formulas without hitting a <code>C stack usage</code> error.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>data_wide</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/matrix.html'>matrix</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>:</span><span class='m'>10000</span>, ncol <span class='o'>=</span> <span class='m'>10000</span><span class='o'>)</span></span>
<span><span class='nv'>data_wide</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/as.data.frame.html'>as.data.frame</a></span><span class='o'>(</span><span class='nv'>data_wide</span><span class='o'>)</span></span>
<span><span class='nf'><a href='https://rdrr.io/r/base/names.html'>names</a></span><span class='o'>(</span><span class='nv'>data_wide</span><span class='o'>)</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/paste.html'>paste0</a></span><span class='o'>(</span><span class='s'>"x"</span>, <span class='m'>1</span><span class='o'>:</span><span class='m'>10000</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>long_formula</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/stats/formula.html'>as.formula</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/paste.html'>paste</a></span><span class='o'>(</span><span class='s'>"~ "</span>, <span class='nf'><a href='https://rdrr.io/r/base/paste.html'>paste</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/names.html'>names</a></span><span class='o'>(</span><span class='nv'>data_wide</span><span class='o'>)</span>, collapse <span class='o'>=</span> <span class='s'>" + "</span><span class='o'>)</span><span class='o'>)</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>long_formula</span>, <span class='nv'>data_wide</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #00BBBB;'>──</span> <span style='font-weight: bold;'>Recipe</span> <span style='color: #00BBBB;'>──────────────────────────────────────────────────────────────────────</span></span></span>
<span></span><span><span class='c'>#&gt; </span></span>
<span></span><span><span class='c'>#&gt; ── Inputs</span></span>
<span></span><span><span class='c'>#&gt; Number of variables by role</span></span>
<span></span><span><span class='c'>#&gt; predictor: 10000</span></span>
<span></span></code></pre>
</div>
<h2 id="better-error-for-misspelled-argument-names">Better error for misspelled argument names
</h2>
<p>If you have used recipes long enough, you have very likely run into the following error.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="nf">recipe</span><span class="p">(</span><span class="n">mpg</span> <span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">mtcars</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_pca</span><span class="p">(</span><span class="nf">all_numeric_predictors</span><span class="p">(),</span> <span class="n">number</span> <span class="o">=</span> <span class="m">4</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">prep</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Error in `step_pca()`:</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Caused by error in `prep()`:</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; ! Can&#39;t rename variables in this context.</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>The first time you saw it, it didn&rsquo;t make much sense. Hopefully, you figured out that 






<a href="https://recipes.tidymodels.org/reference/step_pca.html" target="_blank" rel="noopener"><code>step_pca()</code></a>
 doesn&rsquo;t have a <code>number</code> argument, and instead uses <code>num_comp</code> to determine the number of principal components to return. This confusion will be a thing of the past as we now include this improved error message.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='nv'>mpg</span> <span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>mtcars</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_pca.html'>step_pca</a></span><span class='o'>(</span><span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_numeric_predictors</a></span><span class='o'>(</span><span class='o'>)</span>, number <span class='o'>=</span> <span class='m'>4</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00; font-weight: bold;'>Error</span><span style='font-weight: bold;'> in `step_pca()`:</span></span></span>
<span><span class='c'>#&gt; <span style='font-weight: bold;'>Caused by error in `prep()` at recipes/R/recipe.R:479:9:</span></span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00;'>!</span> The following argument was specified but do not exist: `number`.</span></span>
<span></span></code></pre>
</div>
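<p>For comparison, here is a minimal sketch of the corrected call, assuming all you want is the four-component PCA from the example above. It swaps <code>number</code> for <code>num_comp</code> and changes nothing else; the name <code>rec_pca</code> is introduced only for this illustration.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(recipes)

# Same recipe as above, with `number` replaced by `num_comp`
rec_pca &lt;- recipe(mpg ~ ., data = mtcars) |&gt;
  step_pca(all_numeric_predictors(), num_comp = 4) |&gt;
  prep()

# Returns the processed training data: the outcome plus the retained components
bake(rec_pca, new_data = NULL)
</code></pre>
</div>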
<h2 id="quality-of-life-increases-in-step_dummy">Quality-of-life improvements in <code>step_dummy()</code>
</h2>
<p>I would imagine that one of the most used steps is 






<a href="https://recipes.tidymodels.org/reference/step_dummy.html" target="_blank" rel="noopener"><code>step_dummy()</code></a>
. We have improved the errors and warnings it spits out when things go sideways.</p>
<p>If you apply 






<a href="https://recipes.tidymodels.org/reference/step_dummy.html" target="_blank" rel="noopener"><code>step_dummy()</code></a>
 to a variable that contains a lot of levels, it will produce a lot of columns, and the resulting object may not fit in memory. Previously, this could lead to the following error.</p>
<div class="code-block"><div class="highlight"><div class="chroma">
<table class="lntable"><tr>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-r" data-lang="r"><span class="line"><span class="cl"><span class="n">data_id</span> <span class="o">&lt;-</span> <span class="nf">tibble</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">  <span class="n">id</span> <span class="o">=</span> <span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">100000</span><span class="p">),</span> 
</span></span><span class="line"><span class="cl">  <span class="n">x1</span> <span class="o">=</span> <span class="nf">rnorm</span><span class="p">(</span><span class="m">100000</span><span class="p">),</span> 
</span></span><span class="line"><span class="cl">  <span class="n">x2</span> <span class="o">=</span> <span class="nf">sample</span><span class="p">(</span><span class="kc">letters</span><span class="p">,</span> <span class="m">100000</span><span class="p">,</span> <span class="kc">TRUE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nf">recipe</span><span class="p">(</span><span class="o">~</span> <span class="n">.,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">data_id</span><span class="p">)</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">step_dummy</span><span class="p">(</span><span class="nf">all_nominal_predictors</span><span class="p">())</span> <span class="o">|&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nf">prep</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="c1">#&gt; Error: vector memory exhausted (limit reached?)</span></span></span></code></pre></td></tr></table>
</div>
</div></div>
<p>Instead, you now get a more helpful error message.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>data_id</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span></span>
<span>  id <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/character.html'>as.character</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>:</span><span class='m'>100000</span><span class='o'>)</span>, </span>
<span>  x1 <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/stats/Normal.html'>rnorm</a></span><span class='o'>(</span><span class='m'>100000</span><span class='o'>)</span>, </span>
<span>  x2 <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/sample.html'>sample</a></span><span class='o'>(</span><span class='nv'>letters</span>, <span class='m'>100000</span>, <span class='kc'>TRUE</span><span class='o'>)</span></span>
<span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='o'>~</span> <span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>data_id</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_dummy.html'>step_dummy</a></span><span class='o'>(</span><span class='nf'><a href='https://recipes.tidymodels.org/reference/has_role.html'>all_nominal_predictors</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'>|&gt;</span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00; font-weight: bold;'>Error</span><span style='font-weight: bold;'> in `step_dummy()`:</span></span></span>
<span><span class='c'>#&gt; <span style='font-weight: bold;'>Caused by error:</span></span></span>
<span><span class='c'>#&gt; <span style='color: #BBBB00;'>!</span> `id` contains too many levels (100000), which would result in a</span></span>
<span><span class='c'>#&gt;   data.frame too large to fit in memory.</span></span>
<span></span></code></pre>
</div>
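<p>One way to avoid this situation is to give the column a non-predictor role, so that <code>all_nominal_predictors()</code> no longer selects it. This is only a sketch, and it assumes that <code>id</code> is an identifier rather than a real predictor. The role name <code>"id"</code>, the object names, and the smaller data set are illustrative choices.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(recipes)
library(tibble)

# Smaller, deterministic stand-in for the data above
data_id_small &lt;- tibble(
  id = as.character(1:1000),
  x1 = rnorm(1000),
  x2 = rep(letters, length.out = 1000)
)

rec_id &lt;- recipe(~ ., data = data_id_small) |&gt;
  # "id" is an arbitrary role name; the point is that it is not "predictor"
  update_role(id, new_role = "id") |&gt;
  step_dummy(all_nominal_predictors()) |&gt;
  prep()

# Only x2 is expanded into indicator columns; id is not expanded
names(bake(rec_id, new_data = NULL))
</code></pre>
</div>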
<p>Likewise, you will get helpful warnings if 
<a href="https://recipes.tidymodels.org/reference/step_dummy.html" target="_blank" rel="noopener"><code>step_dummy()</code></a>
 encounters <code>NA</code> or unseen values.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>data_train</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span>x <span class='o'>=</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"a"</span>, <span class='s'>"b"</span><span class='o'>)</span><span class='o'>)</span></span>
<span><span class='nv'>data_unseen</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span>x <span class='o'>=</span> <span class='s'>"c"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>rec_spec</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://recipes.tidymodels.org/reference/recipe.html'>recipe</a></span><span class='o'>(</span><span class='o'>~</span><span class='nv'>.</span>, data <span class='o'>=</span> <span class='nv'>data_train</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/step_dummy.html'>step_dummy</a></span><span class='o'>(</span><span class='nv'>x</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/prep.html'>prep</a></span><span class='o'>(</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>rec_spec</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span><span class='nv'>data_unseen</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; Warning: <span style='color: #BBBB00;'>!</span> There are new levels in `x`: <span style='color: #0000BB;'>"c"</span>.</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>ℹ</span> Consider using step_novel() (`?recipes::step_novel()`) before `step_dummy()`</span></span>
<span><span class='c'>#&gt;   to handle unseen values.</span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 1 × 1</span></span></span>
<span><span class='c'>#&gt;     x_b</span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span>    <span style='color: #BB0000;'>NA</span></span></span>
<span></span></code></pre>
</div>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>data_na</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://tibble.tidyverse.org/reference/tibble.html'>tibble</a></span><span class='o'>(</span>x <span class='o'>=</span> <span class='kc'>NA</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>rec_spec</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://recipes.tidymodels.org/reference/bake.html'>bake</a></span><span class='o'>(</span><span class='nv'>data_na</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; Warning: <span style='color: #BBBB00;'>!</span> There are new levels in `x`: <span style='color: #0000BB;'>NA</span>.</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>ℹ</span> Consider using step_unknown() (`?recipes::step_unknown()`) before</span></span>
<span><span class='c'>#&gt;   `step_dummy()` to handle missing values.</span></span>
<span></span><span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 1 × 1</span></span></span>
<span><span class='c'>#&gt;     x_b</span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span>    <span style='color: #BB0000;'>NA</span></span></span>
<span></span></code></pre>
</div>
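<p>Following the hint in the first warning, a minimal sketch of handling unseen levels places <code>step_novel()</code> before <code>step_dummy()</code>. It reuses the training and unseen data from above; the name <code>rec_novel</code> is introduced only for this illustration. The second warning suggests the analogous approach with <code>step_unknown()</code> for missing values.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(recipes)
library(tibble)

data_train &lt;- tibble(x = c("a", "b"))
data_unseen &lt;- tibble(x = "c")

rec_novel &lt;- recipe(~ ., data = data_train) |&gt;
  # Reserve an extra factor level for values not seen during training
  step_novel(x) |&gt;
  step_dummy(x) |&gt;
  prep()

# The unseen value "c" is mapped to the reserved level instead of producing NA
bake(rec_novel, data_unseen)
</code></pre>
</div>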
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>A big thank you to all the people who have contributed to recipes since the release of v1.0.10:</p>
<p>






<a href="https://github.com/brynhum" target="_blank" rel="noopener">@brynhum</a>
, 






<a href="https://github.com/DemetriPananos" target="_blank" rel="noopener">@DemetriPananos</a>
, 






<a href="https://github.com/diegoperoni" target="_blank" rel="noopener">@diegoperoni</a>
, 






<a href="https://github.com/EmilHvitfeldt" target="_blank" rel="noopener">@EmilHvitfeldt</a>
, 






<a href="https://github.com/JiahuaQu" target="_blank" rel="noopener">@JiahuaQu</a>
, 






<a href="https://github.com/joranE" target="_blank" rel="noopener">@joranE</a>
, 






<a href="https://github.com/nhward" target="_blank" rel="noopener">@nhward</a>
, 






<a href="https://github.com/olivroy" target="_blank" rel="noopener">@olivroy</a>
, and 






<a href="https://github.com/simonpcouch" target="_blank" rel="noopener">@simonpcouch</a>
.</p>
<h2 id="chocolate-chocolate-chip-cookies">Chocolate Chocolate Chip Cookies
</h2>
<p>preheat oven to 350°F</p>
<ul>
<li>1/3c butter</li>
<li>1/2 + 1/3c sugar</li>
</ul>
<p>mix until fluffy</p>
<ul>
<li>1 tsp vanilla</li>
<li>1 egg</li>
</ul>
<p>mix until combined</p>
<ul>
<li>1/2c cocoa</li>
<li>1/2 tsp baking soda</li>
<li>1c flour</li>
</ul>
<p>mix until combined</p>
<ul>
<li>3/4c chocolate chips</li>
</ul>
<p>bake for about 8 mins, depending on size! they will crack on top, but still be soft.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2024-07-08_recipes-1-1-0/thumbnail-wd.jpg" length="477764" type="image/jpeg" />
    </item>
    <item>
      <title>bonsai 0.3.0</title>
      <link>https://opensource.posit.co/blog/2024-06-25_bonsai-0-3-0/</link>
      <pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate>
      <guid>https://opensource.posit.co/blog/2024-06-25_bonsai-0-3-0/</guid>
      <dc:creator>Simon Couch</dc:creator><description><![CDATA[<p>We&rsquo;re brimming with glee to announce the release of 






<a href="https://bonsai.tidymodels.org" target="_blank" rel="noopener">bonsai</a>
 0.3.0. bonsai is a parsnip extension package for tree-based models, and includes support for random forest and gradient-boosted tree frameworks like partykit and LightGBM. This most recent release of the package introduces support for the <code>&quot;aorsf&quot;</code> engine, which implements accelerated oblique random forests (Jaeger et al. 2022, Jaeger et al. 2024).</p>
<p>You can install it from CRAN with:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"bonsai"</span><span class='o'>)</span></span></code></pre>
</div>
<p>This blog post will demonstrate a modeling workflow where the benefits of using oblique random forests shine through.</p>
<p>You can see a full list of changes in the 


  
  
  





<a href="https://bonsai.tidymodels.org/news/index.html#bonsai-030" target="_blank" rel="noopener">release notes</a>
.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://tidymodels.tidymodels.org'>tidymodels</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://bonsai.tidymodels.org/'>bonsai</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://plsmod.tidymodels.org'>plsmod</a></span><span class='o'>)</span></span>
<span><span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://github.com/tidymodels/corrr'>corrr</a></span><span class='o'>)</span></span></code></pre>
</div>
<h2 id="the-meats-data">The <code>meats</code> data
</h2>
<p>The modeldata package, loaded automatically with the tidymodels meta-package, includes several example datasets to demonstrate modeling problems. We&rsquo;ll make use of a dataset called <code>meats</code> in this post. Each row is a measurement of a sample of finely chopped meat.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>meats</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 215 × 103</span></span></span>
<span><span class='c'>#&gt;    x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010 x_011 x_012 x_013</span></span>
<span><span class='c'>#&gt;    <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 1</span>  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.63  2.63  2.63  2.63  2.64</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 2</span>  2.83  2.84  2.84  2.85  2.85  2.86  2.86  2.87  2.87  2.88  2.88  2.89  2.90</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 3</span>  2.58  2.58  2.59  2.59  2.59  2.59  2.59  2.60  2.60  2.60  2.60  2.61  2.61</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 4</span>  2.82  2.82  2.83  2.83  2.83  2.83  2.83  2.84  2.84  2.84  2.84  2.85  2.85</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 5</span>  2.79  2.79  2.79  2.79  2.80  2.80  2.80  2.80  2.81  2.81  2.81  2.82  2.82</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 6</span>  3.01  3.02  3.02  3.03  3.03  3.04  3.04  3.05  3.06  3.06  3.07  3.08  3.09</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 7</span>  2.99  2.99  3.00  3.01  3.01  3.02  3.02  3.03  3.04  3.04  3.05  3.06  3.07</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 8</span>  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.54  2.54  2.54  2.54  2.54</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'> 9</span>  3.27  3.28  3.29  3.29  3.30  3.31  3.31  3.32  3.33  3.33  3.34  3.35  3.36</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>10</span>  3.40  3.41  3.41  3.42  3.43  3.43  3.44  3.45  3.46  3.47  3.48  3.48  3.49</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 205 more rows</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># ℹ 90 more variables: x_014 &lt;dbl&gt;, x_015 &lt;dbl&gt;, x_016 &lt;dbl&gt;, x_017 &lt;dbl&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   x_018 &lt;dbl&gt;, x_019 &lt;dbl&gt;, x_020 &lt;dbl&gt;, x_021 &lt;dbl&gt;, x_022 &lt;dbl&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   x_023 &lt;dbl&gt;, x_024 &lt;dbl&gt;, x_025 &lt;dbl&gt;, x_026 &lt;dbl&gt;, x_027 &lt;dbl&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   x_028 &lt;dbl&gt;, x_029 &lt;dbl&gt;, x_030 &lt;dbl&gt;, x_031 &lt;dbl&gt;, x_032 &lt;dbl&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   x_033 &lt;dbl&gt;, x_034 &lt;dbl&gt;, x_035 &lt;dbl&gt;, x_036 &lt;dbl&gt;, x_037 &lt;dbl&gt;,</span></span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>#   x_038 &lt;dbl&gt;, x_039 &lt;dbl&gt;, x_040 &lt;dbl&gt;, x_041 &lt;dbl&gt;, x_042 &lt;dbl&gt;, …</span></span></span>
<span></span></code></pre>
</div>
<p>From that dataset&rsquo;s documentation:</p>
<blockquote>
<p>These data are recorded on a Tecator Infratec Food and Feed Analyzer&hellip; For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.</p>
</blockquote>
<p>We&rsquo;ll try to predict the protein content, as a percentage, using the absorbance measurements.</p>
<p>Before we take a further look, let&rsquo;s split up our data. I&rsquo;ll first drop the two other possible outcome variables and then, after splitting into training and testing sets, resample the training data using 5-fold cross-validation with 2 repeats.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>meats</span> <span class='o'>&lt;-</span> <span class='nv'>meats</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span> <span class='nf'>select</span><span class='o'>(</span><span class='o'>-</span><span class='nv'>water</span>, <span class='o'>-</span><span class='nv'>fat</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'><a href='https://rdrr.io/r/base/Random.html'>set.seed</a></span><span class='o'>(</span><span class='m'>1</span><span class='o'>)</span></span>
<span><span class='nv'>meats_split</span> <span class='o'>&lt;-</span> <span class='nf'>initial_split</span><span class='o'>(</span><span class='nv'>meats</span><span class='o'>)</span></span>
<span><span class='nv'>meats_train</span> <span class='o'>&lt;-</span> <span class='nf'>training</span><span class='o'>(</span><span class='nv'>meats_split</span><span class='o'>)</span></span>
<span><span class='nv'>meats_test</span> <span class='o'>&lt;-</span> <span class='nf'>testing</span><span class='o'>(</span><span class='nv'>meats_split</span><span class='o'>)</span></span>
<span><span class='nv'>meats_folds</span> <span class='o'>&lt;-</span> <span class='nf'>vfold_cv</span><span class='o'>(</span><span class='nv'>meats_train</span>, v <span class='o'>=</span> <span class='m'>5</span>, repeats <span class='o'>=</span> <span class='m'>2</span><span class='o'>)</span></span></code></pre>
</div>
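<p>As a quick sanity check on what that resampling scheme produces, here is a minimal sketch on a stand-in data frame. The <code>toy</code> object and its columns are hypothetical, chosen only to match the 215 rows of <code>meats</code>. It shows that 5 folds with 2 repeats yields 10 resamples, which is why the <code>n</code> column in the tuning results reads 10.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>library(rsample)

set.seed(1)

# hypothetical stand-in with the same number of rows as `meats`
toy &lt;- data.frame(id = seq_len(215), y = rnorm(215))

# same steps as above, using the default training proportion
toy_split &lt;- initial_split(toy)
toy_train &lt;- training(toy_split)
toy_test &lt;- testing(toy_split)
toy_folds &lt;- vfold_cv(toy_train, v = 5, repeats = 2)

# every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test)

# one row per resample: v * repeats
nrow(toy_folds)
</code></pre>
</div>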
<p>The tricky parts of this modeling problem are that:</p>
<ol>
<li>There are few observations to work with (215 total).</li>
<li>The 100 absorbance measurements are <em>highly</em> correlated with one another.</li>
</ol>
<p>Visualizing that correlation:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>meats_train</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://corrr.tidymodels.org/reference/correlate.html'>correlate</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://ggplot2.tidyverse.org/reference/autoplot.html'>autoplot</a></span><span class='o'>(</span><span class='o'>)</span> <span class='o'>+</span></span>
<span>  <span class='nf'>theme</span><span class='o'>(</span>axis.text.x <span class='o'>=</span> <span class='nf'>element_blank</span><span class='o'>(</span><span class='o'>)</span>, axis.text.y <span class='o'>=</span> <span class='nf'>element_blank</span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; Correlation computed with</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>•</span> Method: 'pearson'</span></span>
<span><span class='c'>#&gt; <span style='color: #00BBBB;'>•</span> Missing treated using: 'pairwise.complete.obs'</span></span>
<span></span></code></pre>
<img src="https://opensource.posit.co/blog/2024-06-25_bonsai-0-3-0/figs/correlate-1.png" width="700px" style="display: block; margin: auto;" />
</div>
<p>Almost all of the pairwise correlations between predictors are near 1. The exception is the last variable, whose correlations with every other variable are weaker. That last variable? It&rsquo;s the outcome.</p>
<h2 id="baseline-models">Baseline models
</h2>
<p>There are several existing model implementations in tidymodels that are resilient to highly correlated predictors. The first one I&rsquo;d probably reach for is an elastic net: an interpolation of the LASSO and Ridge regularized linear regression models. Evaluating that modeling approach against resamples:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='c'># define a regularized linear model</span></span>
<span><span class='nv'>spec_lr</span> <span class='o'>&lt;-</span> </span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/linear_reg.html'>linear_reg</a></span><span class='o'>(</span>penalty <span class='o'>=</span> <span class='nf'><a href='https://hardhat.tidymodels.org/reference/tune.html'>tune</a></span><span class='o'>(</span><span class='o'>)</span>, mixture <span class='o'>=</span> <span class='nf'><a href='https://hardhat.tidymodels.org/reference/tune.html'>tune</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/set_engine.html'>set_engine</a></span><span class='o'>(</span><span class='s'>"glmnet"</span><span class='o'>)</span></span>
<span></span>
<span><span class='c'># try out different penalization approaches</span></span>
<span><span class='nv'>res_lr</span> <span class='o'>&lt;-</span> <span class='nf'>tune_grid</span><span class='o'>(</span><span class='nv'>spec_lr</span>, <span class='nv'>protein</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>meats_folds</span><span class='o'>)</span></span>
<span></span>
<span><span class='nf'>show_best</span><span class='o'>(</span><span class='nv'>res_lr</span>, metric <span class='o'>=</span> <span class='s'>"rmse"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 5 × 8</span></span></span>
<span><span class='c'>#&gt;         penalty mixture .metric .estimator  mean     n std_err .config          </span></span>
<span><span class='c'>#&gt;           <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>            </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span> 0.000<span style='text-decoration: underline;'>032</span>4       0.668 rmse    standard    1.24    10  0.051<span style='text-decoration: underline;'>6</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span> 0.000<span style='text-decoration: underline;'>000</span>005<span style='text-decoration: underline;'>24</span>   0.440 rmse    standard    1.25    10  0.054<span style='text-decoration: underline;'>8</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span> 0.000<span style='text-decoration: underline;'>000</span>461     0.839 rmse    standard    1.26    10  0.053<span style='text-decoration: underline;'>8</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>4</span> 0.000<span style='text-decoration: underline;'>005</span>50      0.965 rmse    standard    1.26    10  0.054<span style='text-decoration: underline;'>0</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>5</span> 0.000<span style='text-decoration: underline;'>000</span>048<span style='text-decoration: underline;'>9</span>    0.281 rmse    standard    1.26    10  0.053<span style='text-decoration: underline;'>4</span> Preprocessor1_Mo…</span></span>
<span></span><span><span class='nf'>show_best</span><span class='o'>(</span><span class='nv'>res_lr</span>, metric <span class='o'>=</span> <span class='s'>"rsq"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 5 × 8</span></span></span>
<span><span class='c'>#&gt;         penalty mixture .metric .estimator  mean     n std_err .config          </span></span>
<span><span class='c'>#&gt;           <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>            </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span> 0.000<span style='text-decoration: underline;'>032</span>4       0.668 rsq     standard   0.849    10  0.012<span style='text-decoration: underline;'>6</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span> 0.000<span style='text-decoration: underline;'>000</span>005<span style='text-decoration: underline;'>24</span>   0.440 rsq     standard   0.848    10  0.012<span style='text-decoration: underline;'>8</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span> 0.000<span style='text-decoration: underline;'>000</span>461     0.839 rsq     standard   0.846    10  0.011<span style='text-decoration: underline;'>4</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>4</span> 0.000<span style='text-decoration: underline;'>005</span>50      0.965 rsq     standard   0.846    10  0.011<span style='text-decoration: underline;'>1</span> Preprocessor1_Mo…</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>5</span> 0.000<span style='text-decoration: underline;'>000</span>048<span style='text-decoration: underline;'>9</span>    0.281 rsq     standard   0.846    10  0.012<span style='text-decoration: underline;'>6</span> Preprocessor1_Mo…</span></span>
<span></span></code></pre>
</div>
<p>That best RMSE value of 1.24 gives us a baseline to work with, and the best R-squared of 0.85 seems like a good start.</p>
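<p>For readers less familiar with the two tuned parameters, here is a rough sketch of the penalty term they control, assuming the glmnet parameterization in which <code>penalty</code> is the overall amount of regularization and <code>mixture</code> is the proportion of LASSO penalty. The helper <code>elastic_net_penalty()</code> is a hypothetical name used for illustration, not a function from glmnet or parsnip.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'># hypothetical helper: the elastic net penalty for a coefficient vector
elastic_net_penalty &lt;- function(beta, penalty, mixture) {
  ridge &lt;- sum(beta^2) / 2   # Ridge (squared L2) component
  lasso &lt;- sum(abs(beta))    # LASSO (L1) component
  penalty * ((1 - mixture) * ridge + mixture * lasso)
}

beta &lt;- c(1, -2, 0.5)

# mixture = 1 is a pure LASSO penalty
elastic_net_penalty(beta, penalty = 0.1, mixture = 1)

# mixture = 0 is a pure Ridge penalty
elastic_net_penalty(beta, penalty = 0.1, mixture = 0)

# anything in between interpolates the two
elastic_net_penalty(beta, penalty = 0.1, mixture = 0.668)
</code></pre>
</div>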
<p>Many tree-based model implementations in tidymodels handle correlated predictors well. Just to be apples-to-apples with <code>&quot;aorsf&quot;</code>, let&rsquo;s use a different random forest engine to get a better sense for baseline performance:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>spec_rf</span> <span class='o'>&lt;-</span> </span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/rand_forest.html'>rand_forest</a></span><span class='o'>(</span>mtry <span class='o'>=</span> <span class='nf'><a href='https://hardhat.tidymodels.org/reference/tune.html'>tune</a></span><span class='o'>(</span><span class='o'>)</span>, min_n <span class='o'>=</span> <span class='nf'><a href='https://hardhat.tidymodels.org/reference/tune.html'>tune</a></span><span class='o'>(</span><span class='o'>)</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='c'># this is the default engine, but for consistency's sake:</span></span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/set_engine.html'>set_engine</a></span><span class='o'>(</span><span class='s'>"ranger"</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/set_args.html'>set_mode</a></span><span class='o'>(</span><span class='s'>"regression"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>res_rf</span> <span class='o'>&lt;-</span> <span class='nf'>tune_grid</span><span class='o'>(</span><span class='nv'>spec_rf</span>, <span class='nv'>protein</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>meats_folds</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #0000BB;'>i</span> <span style='color: #000000;'>Creating pre-processing data to finalize unknown parameter: mtry</span></span></span>
<span></span><span></span>
<span><span class='nf'>show_best</span><span class='o'>(</span><span class='nv'>res_rf</span>, metric <span class='o'>=</span> <span class='s'>"rmse"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 5 × 8</span></span></span>
<span><span class='c'>#&gt;    mtry min_n .metric .estimator  mean     n std_err .config              </span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>                </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span>    96     4 rmse    standard    2.37    10  0.090<span style='text-decoration: underline;'>5</span> Preprocessor1_Model08</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span>    41     6 rmse    standard    2.39    10  0.088<span style='text-decoration: underline;'>3</span> Preprocessor1_Model01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span>    88    10 rmse    standard    2.43    10  0.081<span style='text-decoration: underline;'>6</span> Preprocessor1_Model06</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>4</span>    79    17 rmse    standard    2.51    10  0.074<span style='text-decoration: underline;'>0</span> Preprocessor1_Model07</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>5</span>    27    18 rmse    standard    2.52    10  0.077<span style='text-decoration: underline;'>8</span> Preprocessor1_Model04</span></span>
<span></span><span><span class='nf'>show_best</span><span class='o'>(</span><span class='nv'>res_rf</span>, metric <span class='o'>=</span> <span class='s'>"rsq"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 5 × 8</span></span></span>
<span><span class='c'>#&gt;    mtry min_n .metric .estimator  mean     n std_err .config              </span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>                </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span>    96     4 rsq     standard   0.424    10  0.038<span style='text-decoration: underline;'>5</span> Preprocessor1_Model08</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span>    41     6 rsq     standard   0.409    10  0.039<span style='text-decoration: underline;'>4</span> Preprocessor1_Model01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span>    88    10 rsq     standard   0.387    10  0.036<span style='text-decoration: underline;'>5</span> Preprocessor1_Model06</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>4</span>    79    17 rsq     standard   0.353    10  0.040<span style='text-decoration: underline;'>4</span> Preprocessor1_Model07</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>5</span>    27    18 rsq     standard   0.346    10  0.039<span style='text-decoration: underline;'>7</span> Preprocessor1_Model04</span></span>
<span></span></code></pre>
</div>
<p>Not so hot. Just to show I&rsquo;m not making a straw man here, I&rsquo;ll evaluate a few more alternative modeling approaches behind the curtain and print out their best performance metrics:</p>
<ul>
<li><strong>Gradient boosted tree with LightGBM</strong>. Best RMSE: 2.34. Best R-squared: 0.43.</li>
<li><strong>Partial least squares regression</strong>. Best RMSE: 1.39. Best R-squared: 0.81.</li>
<li><strong>Support vector machine</strong>. Best RMSE: 2.28. Best R-squared: 0.46.</li>
</ul>
<p>This is a tricky one.</p>
<h2 id="introducing-accelerated-oblique-random-forests">Introducing accelerated oblique random forests
</h2>
<p>The 0.3.0 release of bonsai introduces support for accelerated oblique random forests via the <code>&quot;aorsf&quot;</code> engine for classification and regression in tidymodels. (Tidy survival modelers might note that <a href="https://opensource.posit.co/blog/2023-04-19_censored-0-2-0/">we already support <code>&quot;aorsf&quot;</code> for censored regression</a> via the <a href="https://censored.tidymodels.org" target="_blank" rel="noopener">censored</a> parsnip extension package!)</p>
<p>Unlike trees in conventional random forests, which create splits using thresholds based on individual predictors (e.g. <code>x_001 &gt; 3</code>), oblique random forests use linear combinations of predictors to create splits (e.g. <code>0.5 * x_001 + 2 * x_002 &gt; 7.5</code>) and have been shown to improve predictive performance relative to conventional random forests for a variety of applications (Menze et al. 2011). &ldquo;Oblique&rdquo; references the appearance of decision boundaries when a set of splits is plotted; I&rsquo;ve grabbed a visual from the <a href="https://github.com/ropensci/aorsf?tab=readme-ov-file#what-does-oblique-mean" target="_blank" rel="noopener">aorsf README</a> that demonstrates:</p>
<div class="highlight">
<img src="https://opensource.posit.co/blog/2024-06-25_bonsai-0-3-0/figures/oblique.png" alt="Two plots of decision boundaries for a classification problem. One uses single-variable splitting and the other oblique splitting. Both trees partition the predictor space defined by predictors X1 and X2, but the oblique splits do a better job of separating the two classes thanks to an 'oblique' boundary formed by considering both X1 and X2 at the same time." width="700px" style="display: block; margin: auto;" />
</div>
<p>In the above, we&rsquo;d like to separate the purple dots from the orange squares. A tree in a traditional random forest, represented on the left, can only generate splits based on one of X1 or X2 at a time. A tree in an oblique random forest, represented on the right, can consider both X1 and X2 in creating decision boundaries, often resulting in stronger predictive performance.</p>
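<p>To make that concrete, here is a small base R sketch on simulated data. It is not how aorsf searches for splits. It only compares the best single threshold on one predictor with the best single threshold on a linear combination. The class rule, the coefficients <code>(1, 1)</code>, and the helper <code>best_split_accuracy()</code> are all assumptions made up for the illustration.</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'>set.seed(1)
n &lt;- 200
x1 &lt;- runif(n, -1, 1)
x2 &lt;- runif(n, -1, 1)

# hypothetical class rule: the true boundary is the diagonal x1 + x2 = 0
y &lt;- x1 + x2 &gt; 0

# accuracy of the best single threshold on a numeric score
best_split_accuracy &lt;- function(score, y) {
  thresholds &lt;- sort(unique(score))
  acc &lt;- vapply(
    thresholds,
    function(t) {
      pred &lt;- score &gt; t
      # either side of the split may be labelled TRUE
      max(mean(pred == y), mean(pred != y))
    },
    numeric(1)
  )
  max(acc)
}

# a conventional split can only threshold x1 or x2 on its own
acc_axis &lt;- max(best_split_accuracy(x1, y), best_split_accuracy(x2, y))

# an oblique split thresholds a linear combination of both
acc_oblique &lt;- best_split_accuracy(1 * x1 + 1 * x2, y)

c(axis_aligned = acc_axis, oblique = acc_oblique)
</code></pre>
</div>
<p>With a diagonal boundary like this one, a single oblique split separates the classes, while a conventional tree would need many axis-aligned splits to approximate the same boundary.</p>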
<p>Where does the &ldquo;accelerated&rdquo; come from? Generally, finding optimal oblique splits is more computationally intensive than finding single-predictor splits. The aorsf package uses something called &ldquo;Newton-Raphson scoring&rdquo; (the same algorithm used under the hood in the survival package) to identify splits based on linear combinations of predictor variables. This approach speeds up that process greatly, resulting in fit times that are comparable to implementations of traditional random forests in R and hundreds of times faster than existing oblique random forest implementations (Jaeger et al. 2024).</p>
<p>The code to tune this model with the <code>&quot;aorsf&quot;</code> engine is the same as for <code>&quot;ranger&quot;</code>, except we switch out the <code>engine</code> argument to <a href="https://parsnip.tidymodels.org/reference/set_engine.html" target="_blank" rel="noopener"><code>set_engine()</code></a>:</p>
<div class="highlight">
<pre class='chroma'><code class='language-r' data-lang='r'><span><span class='nv'>spec_aorsf</span> <span class='o'>&lt;-</span> </span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/rand_forest.html'>rand_forest</a></span><span class='o'>(</span></span>
<span>    mtry <span class='o'>=</span> <span class='nf'><a href='https://hardhat.tidymodels.org/reference/tune.html'>tune</a></span><span class='o'>(</span><span class='o'>)</span>,</span>
<span>    min_n <span class='o'>=</span> <span class='nf'><a href='https://hardhat.tidymodels.org/reference/tune.html'>tune</a></span><span class='o'>(</span><span class='o'>)</span></span>
<span>  <span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/set_engine.html'>set_engine</a></span><span class='o'>(</span><span class='s'>"aorsf"</span><span class='o'>)</span> <span class='o'><a href='https://magrittr.tidyverse.org/reference/pipe.html'>%&gt;%</a></span></span>
<span>  <span class='nf'><a href='https://parsnip.tidymodels.org/reference/set_args.html'>set_mode</a></span><span class='o'>(</span><span class='s'>"regression"</span><span class='o'>)</span></span>
<span></span>
<span><span class='nv'>res_aorsf</span> <span class='o'>&lt;-</span> <span class='nf'>tune_grid</span><span class='o'>(</span><span class='nv'>spec_aorsf</span>, <span class='nv'>protein</span> <span class='o'>~</span> <span class='nv'>.</span>, <span class='nv'>meats_folds</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #0000BB;'>i</span> <span style='color: #000000;'>Creating pre-processing data to finalize unknown parameter: mtry</span></span></span>
<span></span><span></span>
<span><span class='nf'>show_best</span><span class='o'>(</span><span class='nv'>res_aorsf</span>, metric <span class='o'>=</span> <span class='s'>"rmse"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 5 × 8</span></span></span>
<span><span class='c'>#&gt;    mtry min_n .metric .estimator  mean     n std_err .config              </span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>                </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span>    87    11 rmse    standard   0.786    10  0.037<span style='text-decoration: underline;'>0</span> Preprocessor1_Model02</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span>    98     8 rmse    standard   0.789    10  0.036<span style='text-decoration: underline;'>3</span> Preprocessor1_Model10</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span>    48     5 rmse    standard   0.793    10  0.036<span style='text-decoration: underline;'>3</span> Preprocessor1_Model01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>4</span>    16    17 rmse    standard   0.803    10  0.032<span style='text-decoration: underline;'>5</span> Preprocessor1_Model09</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>5</span>    31    18 rmse    standard   0.813    10  0.035<span style='text-decoration: underline;'>9</span> Preprocessor1_Model05</span></span>
<span></span><span><span class='nf'>show_best</span><span class='o'>(</span><span class='nv'>res_aorsf</span>, metric <span class='o'>=</span> <span class='s'>"rsq"</span><span class='o'>)</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'># A tibble: 5 × 8</span></span></span>
<span><span class='c'>#&gt;    mtry min_n .metric .estimator  mean     n std_err .config              </span></span>
<span><span class='c'>#&gt;   <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>      <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;int&gt;</span>   <span style='color: #555555; font-style: italic;'>&lt;dbl&gt;</span> <span style='color: #555555; font-style: italic;'>&lt;chr&gt;</span>                </span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>1</span>    48     5 rsq     standard   0.946    10 0.004<span style='text-decoration: underline;'>46</span> Preprocessor1_Model01</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>2</span>    98     8 rsq     standard   0.945    10 0.004<span style='text-decoration: underline;'>82</span> Preprocessor1_Model10</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>3</span>    87    11 rsq     standard   0.945    10 0.004<span style='text-decoration: underline;'>84</span> Preprocessor1_Model02</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>4</span>    16    17 rsq     standard   0.941    10 0.003<span style='text-decoration: underline;'>70</span> Preprocessor1_Model09</span></span>
<span><span class='c'>#&gt; <span style='color: #555555;'>5</span>    31    18 rsq     standard   0.940    10 0.005<span style='text-decoration: underline;'>47</span> Preprocessor1_Model05</span></span>
<span></span></code></pre>
</div>
<p>Holy smokes. The best RMSE from aorsf is 0.79, well below the previous best of 1.24 from the elastic net, and the best R-squared is 0.95, up from the previous best (also from the elastic net) of 0.85.</p>
<p>Especially if your modeling problems involve few samples of many, highly correlated predictors, give the <code>&quot;aorsf&quot;</code> modeling engine a whirl in your workflows and let us know what you think!</p>
<h2 id="references">References
</h2>
<p>Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, Jaime L. Speiser, Matthew W. Segar, Ambarish Pandey, and Nicholas M. Pajewski. 2024. &ldquo;Accelerated and Interpretable Oblique Random Survival Forests.&rdquo; <em>Journal of Computational and Graphical Statistics</em> 33.1: 192&ndash;207.</p>
<p>Byron C. Jaeger, Sawyer Welden, Kristin Lenoir, and Nicholas M. Pajewski. 2022. &ldquo;aorsf: An R package for Supervised Learning Using the Oblique Random Survival Forest.&rdquo; <em>The Journal of Open Source Software</em>.</p>
<p>Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe, and Fred A. Hamprecht. 2011. &ldquo;On Oblique Random Forests.&rdquo; <em>Joint European Conference on Machine Learning and Knowledge Discovery in Databases</em>: 453&ndash;469. Springer.</p>
<h2 id="acknowledgements">Acknowledgements
</h2>
<p>Thank you to <a href="https://github.com/bcjaeger" target="_blank" rel="noopener">@bcjaeger</a>, the aorsf author, for doing most of the work to implement aorsf support in bonsai. Thank you to <a href="https://github.com/hfrick" target="_blank" rel="noopener">@hfrick</a>, <a href="https://github.com/joranE" target="_blank" rel="noopener">@joranE</a>, <a href="https://github.com/jrosell" target="_blank" rel="noopener">@jrosell</a>, <a href="https://github.com/nipnipj" target="_blank" rel="noopener">@nipnipj</a>, <a href="https://github.com/p-schaefer" target="_blank" rel="noopener">@p-schaefer</a>, <a href="https://github.com/seb-mueller" target="_blank" rel="noopener">@seb-mueller</a>, and <a href="https://github.com/tcovert" target="_blank" rel="noopener">@tcovert</a> for their contributions on the bonsai repository since version 0.2.1.</p>
]]></description>
      <enclosure url="https://opensource.posit.co/blog/2024-06-25_bonsai-0-3-0/thumbnail-wd.jpg" length="389634" type="image/jpeg" />
    </item>
  </channel>
</rss>
