Blogs

Posts

  • Building A LLM From Scratch with Wolfram Language

    Inspired by Raschka’s book on building a large language model from scratch, we implemented our own version using the Wolfram Language (WL). We opted for WL over Python due to its higher level of abstraction, rich set of built-in functions, and powerful computational capabilities, which streamline the development process.

  • Book Summary and Review: A Thousand Brains

    A Thousand Brains: A New Theory of Intelligence

  • Effect of Attention on Text Classification Performance

    In this post, we continue from our previous study on text classification using a recurrent neural network (RNN) model. In that study, we explored how different Byte-Pair Encoding (BPE) settings and RNN architectures impact classification performance, and found that the BPE configuration had a significant influence. This follow-up study focuses on understanding the effect of incorporating an attention mechanism into the RNN model.

  • Word Embedding and Text Classification Performance

    This report investigates using a recurrent neural network (RNN) model for classifying sentences extracted from Chinese Wikipedia articles. We evaluate the classification performance across various Byte-Pair Encoding (BPE) settings and RNN architectures, finding that BPE settings significantly influence classification outcomes.

  • Power Law Distribution: Word Frequency

    In a previous note, we explored one mechanism that leads to power law distributions: the probability of a random walk returning to its starting point for the first time. In this note, we examine another mechanism, one that generates a power law distribution of word frequencies within a text.

  • Power Law Distribution: First Return Time of a Random Walk

    Power law distributions are prevalent in various fields. This note derives the probability of a random walk returning to its starting point for the first time (the first return time), which can be approximated by a power law distribution over sufficiently long time scales. Simulation results closely align with both the exact and approximate probabilities.
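    As a rough illustration, the sketch below assumes the simple symmetric walk on the integers (steps of ±1 with equal probability), which may differ from the setting used in the post. For that walk, the standard first-return probability at step $2n$ is $\binom{2n}{n} / \big((2n-1)\,2^{2n}\big)$, which approaches $1/(2\sqrt{\pi}\,n^{3/2})$ for large $n$. The function names are illustrative.

```python
from math import comb, pi, sqrt


def first_return_exact(n: int) -> float:
    """P(first return to the origin happens at step 2n), simple symmetric walk."""
    return comb(2 * n, n) / ((2 * n - 1) * 4**n)


def first_return_power_law(n: int) -> float:
    """Large-n approximation: a power law with exponent -3/2."""
    return 1.0 / (2.0 * sqrt(pi) * n**1.5)


for n in (1, 10, 100, 1000):
    exact = first_return_exact(n)
    approx = first_return_power_law(n)
    print(f"n={n:5d}  exact={exact:.3e}  approx={approx:.3e}  ratio={exact / approx:.4f}")
```

    The ratio column should drift toward 1 as $n$ grows, which is the sense in which the power law is a long-time approximation.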

  • Multiplicative Processes and the Log-normal Distribution

    This note derives the log-normal distribution from random multiplicative processes and confirms that it applies in investment simulations. We also examine how dispersion and inequality among investments grow over time and quantify inequality using the Gini index.
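    A minimal sketch of the idea, under assumptions that are not from the post: each investment starts at 1 and is multiplied at every step by an independent growth factor drawn uniformly from [0.9, 1.1]. Because the log of the final value is a sum of i.i.d. terms, it is approximately normal, so the value itself is approximately log-normal.

```python
import math
import random
import statistics


def simulate_wealth(n_investors=2000, n_steps=100, low=0.9, high=1.1, seed=0):
    """Multiply an initial wealth of 1 by i.i.d. uniform growth factors."""
    rng = random.Random(seed)
    wealth = []
    for _ in range(n_investors):
        w = 1.0
        for _ in range(n_steps):
            w *= rng.uniform(low, high)
        wealth.append(w)
    return wealth


wealth = simulate_wealth()
log_wealth = [math.log(w) for w in wealth]

# log(wealth) is a sum of i.i.d. terms, so it is approximately normal,
# which makes wealth itself approximately log-normal (right-skewed).
print("mean of log-wealth :", statistics.mean(log_wealth))
print("stdev of log-wealth:", statistics.stdev(log_wealth))
print("mean wealth  :", statistics.mean(wealth))
print("median wealth:", statistics.median(wealth))
```

    The mean exceeding the median is the right skew one expects from a log-normal distribution.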

  • Central Limit Theorem and Cauchy Distribution

    This note demonstrates the Central Limit Theorem (CLT) using the Fourier transform of the probability density function (PDF) and emphasizes the requirement that the mean and variance of the random variable must exist. It then contrasts this with the Cauchy distribution as a counter-example, where neither the mean nor the variance is defined, illustrating why the CLT does not hold in this case.
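    The contrast can also be seen numerically. The sketch below is a simulation, not the Fourier-transform argument of the post: it measures the interquartile range (IQR) of the sample mean over many repeats. For normal draws the IQR shrinks roughly like $1/\sqrt{n}$, while for standard Cauchy draws the sample mean is again standard Cauchy, so the IQR stays near 2. The helper name and sample sizes are illustrative.

```python
import math
import random
import statistics


def iqr_of_sample_means(draw, n, n_repeats=2000):
    """Interquartile range of the sample mean of n draws, over many repeats."""
    means = [statistics.fmean(draw() for _ in range(n)) for _ in range(n_repeats)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


rng = random.Random(0)


def normal():
    return rng.gauss(0.0, 1.0)


def cauchy():
    # Standard Cauchy via the inverse CDF: tan(pi * (u - 1/2)), u ~ Uniform(0, 1).
    return math.tan(math.pi * (rng.random() - 0.5))


for n in (1, 10, 100):
    print(n, iqr_of_sample_means(normal, n), iqr_of_sample_means(cauchy, n))
```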

  • Biorthogonal Basis and Reproducing Kernels

    This note first explores the concept of a biorthogonal basis in a finite-dimensional vector space. It then applies a similar methodology to derive the reproducing kernel basis in function spaces, enabling the approximation of functions using their pointwise values and the associated dual basis.

  • Variational Autoencoder for CelebA Image Analysis

    In this note, we implement a variational autoencoder (VAE) using the neural network framework in Wolfram Language and train it on the CelebFaces Attributes (CelebA) dataset. New images can be generated by sampling from the learned latent space. We explore how the VAE captures and manipulates image features, particularly the concept of attractiveness.

  • Comparison of Proportion Tests

    This note compares several statistical methods for detecting differences in failure rates between two groups. We explore Fisher’s exact test, the Chi-squared test, and a Bayesian Monte Carlo approach, focusing on their conceptual simplicity, visual interpretability, and insights into uncertainty.

  • Unit Root Test in the AR(1) Time Series with Monte Carlo Method

    Unit root testing is crucial in determining the stationarity of time series data, especially in autoregressive processes like AR(1). In this note, we explore the effectiveness of the Monte Carlo method in unit root testing for AR(1) processes compared to traditional methods like the Augmented Dickey-Fuller (ADF) test.

  • Unveiling Multidimensional Insights: Radviz Projection and Feature Importance in Regression

    Radviz projection simplifies the representation of multidimensional data onto a 2D plane. In this note, we delve into the computation of Radviz projections and demonstrate their application in uncovering important features in multivariate regression analysis.

  • Numerical Investigation of the Lorenz System

    We solve the ordinary differential equations of the Lorenz system to generate time series for future prediction with various models, including XGBoost and deep neural networks. Furthermore, we numerically compute the Lyapunov exponents of the Lorenz system to gain insights into its chaotic behavior.

  • Causal Inference By Regression

    We use a simple example to illustrate that causal inference by regression is unreliable in realistic cases where measurement noise is present.

  • Simple Derivation and Intuitive Understanding of Independence Test Using HSIC

    In this article, we will derive the HSIC formula in a clear and straightforward manner. We will also explore how to estimate the statistical significance using bootstrap sampling and gain an intuitive understanding of why mapping data into a feature space is crucial for independence testing.

  • Understanding Kernel Principal Component Analysis (Kernel PCA)

    Kernel Principal Component Analysis (Kernel PCA) is a powerful technique used in machine learning for dimensionality reduction. It allows us to perform principal component analysis on data that has been nonlinearly mapped to a higher-dimensional feature space. This article will provide a step-by-step derivation of the Kernel PCA formula, followed by an illustrative example to showcase its practical application. We will also compare our results with explicit mapping in feature space and the Kernel PCA implementation in Scikit-Learn.

  • Encoding Rotated Images with Autoencoder

    In this study, we explore the application of PyTorch-based autoencoders, featuring convolutional layers, to encode images from the Fashion-MNIST dataset. Our autoencoder effectively encodes and decodes original images. However, a notable challenge arises when we feed rotated images into the model. These rotated images are often decoded incorrectly and classified into different categories. We address this issue by training the model on a combination of original and randomly rotated images, enabling it to decode rotated input correctly.

  • Regression Uncertainty Estimation with Conformal Prediction

    In this note, we estimate regression prediction intervals using various conformal prediction methods. The regression model employed is a Gaussian Process regressor. We compare the confidence intervals generated by conformal prediction and Gaussian Process regression. Without conformal prediction, it is crucial to accurately estimate the variances of the observation noise and the predicted mean, which requires optimizing the kernel parameters and the noise variance to maximize the marginal likelihood. With conformal prediction, this is no longer required.

  • Positive Definiteness of Kernels

    This post summarizes the proof demonstrating the positive definiteness of the multivariate squared exponential kernel (radial basis function) and exponential kernel. The proofs primarily rely on sources such as (Wendland 2004) and stackexchange.com. Additionally, Python numpy commands are included for numerically testing the positive definiteness of a matrix.
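    The post itself uses NumPy commands for the numerical check. As a dependency-free sketch of the same idea (not the post's code), the snippet below tests a symmetric matrix by attempting a Cholesky factorization, which succeeds only when every pivot is positive. The 1-D points, the length scale, and the tolerance are illustrative choices.

```python
import math


def rbf_gram(points, length_scale=1.0):
    """Gram matrix of the squared exponential kernel on 1-D points."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * length_scale**2)) for b in points]
            for a in points]


def is_positive_definite(matrix, tol=1e-12):
    """Numerical test for a symmetric matrix: try a Cholesky factorization
    and report failure as soon as a pivot is not (numerically) positive."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= tol:
                    return False
                lower[i][i] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


print(is_positive_definite(rbf_gram([0.0, 0.5, 1.0, 2.0])))  # kernel matrix
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))        # indefinite matrix
```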

  • Singularity in Covariance Matrix in Gaussian Process Regression

    This post discusses the issue of singularity in the covariance matrix when performing Gaussian Process regression, particularly when dealing with a large number of training data points, as shown in a previous post. We explore two approaches to handle this numerical problem: adjusting the kernel parameters and introducing jitter to the diagonal of the covariance matrix. Additionally, we evaluate the use of low-rank matrix approximations for the covariance matrix.

  • Gaussian Process Regression

    This article discusses Gaussian Process regression, a non-parametric approach for modeling the relationship between input variables and their corresponding outputs. It presents the conditional probability distribution of a multivariate Gaussian and the covariance matrix computation using a kernel function. The article also covers the optimization of kernel hyperparameters to maximize the likelihood function. An implementation of Gaussian Process regression in Python is provided. The article includes examples of noiseless and noisy observation cases and demonstrates the prediction of values with mean and confidence intervals.

  • Conditional Distribution of Multivariate Gaussian Variables: A Simple Derivation

    We present a straightforward derivation for calculating the conditional probability distribution of multivariate Gaussian variables.

  • Multivariate Gaussian Distribution As Linear Transformation of Independent Normally Distributed Random Variables

    This note explores the relationship between multivariate Gaussian variables and linear transformations of independent, normally distributed random variables. The main results include the derivation of the probability density function (PDF) of the multivariate Gaussian distribution and the observation that infinitely many linear transformations can turn independent, normally distributed random variables into multivariate Gaussian variables. The report also demonstrates specific methods of constructing these transformations using decomposition techniques such as singular value decomposition and Cholesky decomposition.

  • Transformations Corresponding to Kernels

    Mercer’s Theorem is a fundamental result in kernel theory. It states that if we have a positive semi-definite kernel that is symmetric, we can find a mapping function $\phi$ that maps the input vector $\mathbf{x}$ to a higher dimensional space such that the dot product of the transformed vectors equals the kernel function $K(\mathbf{x},\mathbf{y})$. This is what is referred to as the kernel trick in support vector machines (SVM).
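    A small concrete case may help. The example below is chosen here for illustration and is not taken from the post: for the homogeneous degree-2 polynomial kernel $K(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y})^2$ on 2-D inputs, an explicit map is $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the dot product of the mapped vectors reproduces the kernel value.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2 in 2-D."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map into R^3 with phi(x) . phi(y) = K(x, y)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
# Both values agree up to floating-point rounding.
print(poly2_kernel(x, y), dot(phi(x), phi(y)))
```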

  • Using Regression to Check Variable Dependence in Three Types of Directed Acyclic Graphs

    This article explores how regression can be used to determine the dependence between variables in three types of directed acyclic graphs (DAGs): pipe, confounder, and collider. The theoretical analysis of these graphs can be found in the linked blog post.

  • Variable Dependence in Three Types of Directed Acyclic Graphs

    This note examines the dependence between variables in three types of directed acyclic graphs (DAGs): pipe, confounder, and collider.

  • Maximum Likelihood Estimation is An Approximation to Minimization of KL Distance

    This is a brief derivation that maximum likelihood estimation approximates the minimization of the Kullback-Leibler (KL) distance.
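    A short sketch of the argument, with notation assumed here rather than taken from the post: $p$ is the true data density, $q_\theta$ is the model, and $x_1,\dots,x_N$ are i.i.d. samples from $p$.

```latex
\begin{aligned}
D_{\mathrm{KL}}(p \,\|\, q_\theta)
  &= \mathbb{E}_{p}[\log p(x)] - \mathbb{E}_{p}[\log q_\theta(x)], \\
\mathbb{E}_{p}[\log q_\theta(x)]
  &\approx \frac{1}{N} \sum_{i=1}^{N} \log q_\theta(x_i), \\
\arg\min_\theta D_{\mathrm{KL}}(p \,\|\, q_\theta)
  &\approx \arg\max_\theta \sum_{i=1}^{N} \log q_\theta(x_i).
\end{aligned}
```

    The first term does not depend on $\theta$, and the only approximation is replacing the expectation under $p$ with the sample average.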

  • Effect of Noise in Data On Regression: Linear Model vs. Neural Network

    We have observed that the performance of the linear model for regression is equivalent to or better than more complex nonlinear models like the neural network in cases where the data is noisy. In this note, we compare a linear model and a feed-forward neural network for regression with various amounts of noise in the data.

  • Data Smoothing with P-splines: An Implementation with scikit-learn and PyMC

    This note uses P-splines (Penalized Splines) for data smoothing. Reducing the differences between the coefficients of adjacent spline bases makes the fit smoother. The smoothness control is implemented in two ways: 1) the differences between the coefficients enter as a regularization term in the least-squares minimization in scikit-learn; and 2) the coefficients are modeled as a Gaussian random walk in PyMC, a probabilistic programming library.

  • Bias in Potential Outcomes in Causal Inference

    This note summarizes my understanding of the bias in potential outcomes while reading the book Causal Inference: The Mixtape (https://mixtape.scunning.com).

  • Derivation of Linear Regression Coefficients and Their Variation with Minimal Matrix Algebra

    This is a simple calculation of linear regression coefficients and their variances using covariance and variance with minimal need for matrix algebra. This method can prove the regression anatomy theorem in a straightforward way.
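    For the single-regressor case, the covariance form of the coefficient can be sketched as follows. This covers only the slope and intercept, not the variances or the regression anatomy theorem discussed in the post, and the function names and data are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def simple_regression(x, y):
    """Slope = Cov(x, y) / Var(x); intercept = mean(y) - slope * mean(x)."""
    mx, my = mean(x), mean(y)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    slope = cov_xy / var_x  # the 1/(n-1) factors cancel in the ratio
    return slope, my - slope * mx


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * a + 1.0 for a in x]  # noiseless line with slope 2, intercept 1
print(simple_regression(x, y))
```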

  • XGBoost with GPUs and Multi-core CPU

    We run XGBoost on a multi-core CPU and GPUs. On the CPU, the speed peaks at 16 cores and does not improve with more cores. A single GPU gives a 29% speed-up over the CPU with 16 cores. Interestingly, we observe no further speed-up when going from one GPU to two.

  • Hamiltonian Monte Carlo vs. Metropolis

    This note compares the Metropolis and Hamiltonian Monte Carlo algorithms, using autocorrelation and effective sample size as metrics. A unimodal target distribution is used in this note; a multimodal target distribution will be discussed in a future note.

  • Estimation of Variability From Observed Data: A Bayesian Perspective

    The uncertainty of estimates from data is a direct result of the posterior distribution of the model for the data generation. This note uses the Bayesian approach to discuss two cases, one of which explains the bootstrap method.

  • Dirichlet Process in Mixture Model

    We use the Dirichlet process to generate the weights in the mixture model to determine the optimal number of components automatically.

  • Multivariate Orthogonal Linear Regression Using PyMC

    This note describes a multivariate orthogonal linear regression method using the PyMC probabilistic programming package. The formulation is based on an intuitive geometrical interpretation.

  • Flag of Ukraine with Matplotlib

    In support of Ukraine and the Ukrainian people, I made a Ukrainian flag with matplotlib.

  • Logic of Science: Review of Bernoulli's Fallacy

    The book is well researched and very lucid. The arguments for the Bayesian approach are convincing. It is a useful book to read before Jaynes’ Probability Theory.

  • Numerical Solution to Monty Hall Problem using PyMC

    We numerically solved the Monty Hall Problem with PyMC3, a probabilistic programming package in Python. The PyMC code is adapted from Austin Rochford’s Introduction to Probabilistic Programming with PyMC.

  • Student's t Mixture Model with PyMC

    In this note, we compare the Gaussian mixture model and Student’s-t mixture model for some two-dimensional data with an unbalanced proportion of clusters, as shown in Figure 1. The result demonstrates that the Student’s-t mixture model performs much better.

  • A Simple Non-Bayesian Solution to Monty Hall Problem

    This short note describes a simple non-Bayesian solution to the Monty Hall problem. A charming Bayesian analysis can be found in the book Bernoulli’s Fallacy.
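    One non-Bayesian way to check the answer, not necessarily the argument used in the post, is to enumerate the nine equally likely combinations of car position and first pick. Since the host always opens a goat door other than the pick, switching wins exactly when the first pick was wrong.

```python
from fractions import Fraction
from itertools import product

doors = range(3)
stay_wins = switch_wins = 0
for car, pick in product(doors, doors):  # 9 equally likely (car, pick) pairs
    # The host opens a goat door that is not the pick, so switching
    # wins exactly when the first pick was wrong.
    if pick == car:
        stay_wins += 1
    else:
        switch_wins += 1

p_stay = Fraction(stay_wins, 9)
p_switch = Fraction(switch_wins, 9)
print(p_stay, p_switch)
```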

  • Solution to An Example Problem in Bernoulli's Fallacy

    In this note, we solve an example problem in the book Bernoulli’s Fallacy using three approaches: (1) maximum likelihood, (2) Bayes’ theorem, and (3) MCMC simulation with the PyMC3 package in Python.

  • First Blog

    Blogging on GitHub Pages is quick to get started with. Jekyll, however, seems to take some effort to learn well.
