Statistics links, August 2022

No further introduction necessary: Here are some statistics links ->

(Expect these links posts to be irregular in future. I am clearing my tabs.)

Why do tree-based models still outperform deep learning on tabular data? Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. arXiv, July 18, 2022.

Comment: The most interesting thing in the paper was its title. I didn’t know tree-based models were supposed to outperform deep learning, but I guess they do. I don’t know much about this field, but it sort of makes intuitive sense: Transformers, and before them MLP and CNN architectures, have been very impressive at problems in computer vision and natural language. Those problems used to be difficult because the best computer algorithms were not that good at the kind of stuff our mammalian brains do 24/7. But “vision” and “natural language” are a different kind of difficult than fitting ML models on arbitrary tabular data.

And apparently XGBoost is still good for something. (Learning it wasn’t in vain and it is still relevant?)

McDermott, Grant. “Efficient Simulations in R.” Grant R. McDermott, June 24, 2021. See also follow-up.

Comment: The most useful quote to me:

However, regressions are run on matrices. Which is to say that when you run a regression in R — and most other languages for that matter — behind the scenes your input data frame is first converted to an equivalent matrix before any computation gets done. Matrices have several features that make them “faster” to compute on than data frames. For example, every element must be of the same type (say, numeric). But let’s just agree that converting a data frame to a matrix requires at least some computational effort. Consider then what happens when we feed our function a pre-created design matrix, instead of asking it to convert a bunch of data frame columns on the fly.
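
To make the quoted point concrete for myself, here is a minimal sketch in Python with NumPy rather than R. It is my own illustration, not McDermott's code: the dict-of-lists "data frame" and the names `to_design_matrix`, `fit_ols`, and `simulate` are all hypothetical. The idea is that the design matrix is built once, outside the simulation loop, so each replication only pays for the regression itself.

```python
import numpy as np

def to_design_matrix(frame, columns):
    """Stack the named columns of a dict-of-lists 'data frame' into an
    n x (k + 1) float matrix with a leading intercept column."""
    n = len(frame[columns[0]])
    cols = [np.ones(n)] + [np.asarray(frame[c], dtype=float) for c in columns]
    return np.column_stack(cols)

def fit_ols(X, y):
    """Least-squares coefficients for the regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def simulate(frame, columns, true_beta, n_sims, seed=0):
    """Run n_sims regressions on simulated outcomes, reusing one design matrix."""
    rng = np.random.default_rng(seed)
    X = to_design_matrix(frame, columns)  # converted once, not once per replication
    estimates = np.empty((n_sims, X.shape[1]))
    for s in range(n_sims):
        # Only the outcome changes between replications; X stays fixed.
        y = X @ true_beta + rng.normal(size=X.shape[0])
        estimates[s] = fit_ols(X, y)
    return estimates

# Made-up covariates, for illustration only.
data_rng = np.random.default_rng(42)
frame = {
    "x1": data_rng.normal(size=200).tolist(),
    "x2": data_rng.normal(size=200).tolist(),
}
true_beta = np.array([1.0, 2.0, -0.5])
estimates = simulate(frame, ["x1", "x2"], true_beta, n_sims=500)
mean_estimate = estimates.mean(axis=0)
```

Calling `to_design_matrix` inside the loop instead would give the same estimates but repeat the conversion 500 times, which is the overhead the quote is talking about.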


Torous, William, Florian Gunsilius, and Philippe Rigollet. “An Optimal Transport Approach to Causal Inference.” arXiv, August 12, 2021.

Comment: I do not understand optimal transport, but it seems quite cool. (This tutorial paper by Peyré and Cuturi has been resting on my “to-read” shelf since 2020.) Earlier this year I learned a lot of causal inference techniques common in econometrics, such as difference-in-differences (DiD). Apparently DiD can be generalized as “CiC” (changes-in-changes), but according to Torous and friends, CiC works poorly and their optimal transport approach works better. (I can’t really say, but the graphs look nice.)
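
For reference, the plain two-group, two-period DiD estimate is just arithmetic on four group means: the change in the treated group minus the change in the control group. Below is a minimal sketch with made-up numbers. The function name `did_estimate` is hypothetical, and this is only the textbook baseline (which leans on the parallel-trends assumption), not the CiC or optimal transport estimator from the paper.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences: the treated group's change
    in mean outcome minus the control group's change in mean outcome."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Made-up outcomes, for illustration only.
treated_pre = [10.0, 12.0, 11.0]   # mean 11
treated_post = [15.0, 17.0, 16.0]  # mean 16
control_pre = [9.0, 10.0, 11.0]    # mean 10
control_post = [11.0, 12.0, 13.0]  # mean 12

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
# Treated change is 5, control change is 2, so the estimate is 3.
```

As I understand it, CiC compares whole outcome distributions across the same four cells rather than only their means, which is where the generalization comes in.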