Posts
Effect of Attention on Text Classification Performance
In this post, we continue from our previous study on text classification using a recurrent neural network (RNN) model. In that study, we explored how different Byte-Pair Encoding (BPE) settings and RNN architectures impact classification performance and found that the BPE configuration has a significant influence. This follow-up study focuses on the effect of incorporating an attention mechanism into the RNN model.
Word Embedding and Text Classification Performance
This report investigates using a recurrent neural network (RNN) model for classifying sentences extracted from Chinese Wikipedia articles. We evaluate the classification performance across various Byte-Pair Encoding (BPE) settings and RNN architectures, finding that BPE settings significantly influence classification outcomes.
Power Law Distribution: Word Frequency
In a previous note, we explored one mechanism that leads to power law distributions: the probability of a random walk returning to its starting point for the first time. In this note, we examine another mechanism, one that generates a power-law distribution of word frequencies within a text.
Power Law Distribution: First Return Time of a Random Walk
Power law distributions are prevalent in various fields. This note derives the probability of a random walk returning to its starting point for the first time (the first return time), which can be approximated by a power law distribution over sufficiently long time scales. Simulation results closely align with both the exact and approximate probabilities.
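As a pointer to the main result (a standard fact about the one-dimensional simple random walk, stated here in my own notation): the walk first returns to the origin at step $2n$ with probability

$$
f_{2n} = \frac{1}{2n-1}\binom{2n}{n}\,2^{-2n} \;\approx\; \frac{1}{2\sqrt{\pi}\,n^{3/2}}, \qquad n \gg 1,
$$

which is the power-law tail (exponent $-3/2$) that the simulations approximate.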
Multiplicative Processes and the Log-normal Distribution
This note derives the log-normal distribution from random multiplicative processes and confirms the result with investment simulations. We also examine how dispersion and inequality among investments grow over time and quantify the inequality using the Gini index.
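The core of the argument, paraphrased for reference (notation mine): a quantity that grows by random multiplicative factors has a logarithm that is a sum of independent terms, so the Central Limit Theorem applies to the log:

$$
X_n = X_0 \prod_{i=1}^{n} r_i \quad\Longrightarrow\quad \ln X_n = \ln X_0 + \sum_{i=1}^{n} \ln r_i,
$$

making $\ln X_n$ approximately normal and hence $X_n$ approximately log-normal.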
Central Limit Theorem and Cauchy Distribution
This note demonstrates the Central Limit Theorem (CLT) using the Fourier transform of the probability density function (PDF) and emphasizes the requirement that the mean and variance of the random variable must exist. It then contrasts this with the Cauchy distribution as a counter-example, where neither the mean nor the variance is defined, illustrating why the CLT does not hold in this case.
Biorthogonal Basis and Reproducing Kernels
This note initially explores the concept of a biorthogonal basis in a finite vector space. It subsequently applies a similar methodology to derive the reproducing kernel basis in function spaces, enabling the approximation of functions using their pointwise values and the associated dual basis.
Variational Autoencoder for CelebA Image Analysis
In this note, we implement a variational autoencoder (VAE) using the neural network framework in Wolfram Language and train it on the CelebFaces Attributes (CelebA) dataset. New images can be generated by sampling from the learned latent space. We explore how the VAE captures and manipulates image features, particularly the concept of attractiveness.
Comparison of Proportion Tests
This note compares several statistical methods for detecting differences in failure rates between two groups. We explore Fisher’s exact test, the Chi-squared test, and a Bayesian Monte Carlo approach, focusing on their conceptual simplicity, visual interpretability, and insights into uncertainty.
Unit Root Test in the AR(1) Time Series with Monte Carlo Method
Unit root testing is crucial in determining the stationarity of time series data, especially in autoregressive processes like AR(1). In this note, we explore the effectiveness of the Monte Carlo method in unit root testing for AR(1) processes compared to traditional methods like the Augmented Dickey-Fuller (ADF) test.
Unveiling Multidimensional Insights: Radviz Projection and Feature Importance in Regression
Radviz projection simplifies the representation of multidimensional data onto a 2D plane. In this note, we delve into the computation of Radviz projections and demonstrate their application in uncovering important features in multivariate regression analysis.
Numerical Investigation of the Lorenz System
We solve the ordinary differential equations of the Lorenz system to generate time series for future prediction with various models, including XGBoost and deep neural networks. Furthermore, we numerically compute the Lyapunov exponents of the Lorenz system to gain insights into its chaotic behavior.
Causal Inference By Regression
We use a simple example to illustrate that causal inference by regression is unreliable in realistic cases where measurement noise is present.
Simple Derivation and Intuitive Understanding of Independence Test Using HSIC
In this article, we derive the Hilbert-Schmidt Independence Criterion (HSIC) formula in a clear and straightforward manner. We also explore how to estimate statistical significance using bootstrap sampling and develop an intuitive understanding of why mapping data into a feature space is crucial for independence testing.
Understanding Kernel Principal Component Analysis (Kernel PCA)
Kernel Principal Component Analysis (Kernel PCA) is a powerful technique used in machine learning for dimensionality reduction. It allows us to perform principal component analysis on data that has been nonlinearly mapped to a higher-dimensional feature space. This article will provide a step-by-step derivation of the Kernel PCA formula, followed by an illustrative example to showcase its practical application. We will also compare our results with explicit mapping in feature space and the Kernel PCA implementation in Scikit-Learn.
Encoding Rotated Images with Autoencoder
In this study, we explore the application of PyTorch-based autoencoders, featuring convolutional layers, to encode images from the Fashion-MNIST dataset. Our autoencoder effectively encodes and decodes original images. However, a notable challenge arises when we feed rotated images into the model. These rotated images are often decoded incorrectly and classified into different categories. We address this issue by training the model on a combination of original and randomly rotated images, enabling it to decode rotated input correctly.
Regression Uncertainty Estimation with Conformal Prediction
In this note, we estimate regression prediction intervals using various conformal prediction methods. The regression model employed is a Gaussian Process regressor. We compare the intervals produced by conformal prediction with those from Gaussian Process regression. Without conformal prediction, it is crucial to estimate the variances of the observation noise and the predicted mean accurately, which requires optimizing the kernel parameters and the noise variance to maximize the marginal likelihood. With conformal prediction, this requirement can be dropped.
Positive Definiteness of Kernels
This post summarizes the proof of the positive definiteness of the multivariate squared exponential kernel (radial basis function) and the exponential kernel. The proofs primarily rely on sources such as (Wendland 2004) and stackexchange.com. Additionally, Python numpy commands are included for numerically testing the positive definiteness of a matrix.
Singularity in Covariance Matrix in Gaussian Process Regression
This post discusses the issue of singularity in the covariance matrix when performing Gaussian Process regression, particularly when dealing with a large number of training data points, as shown in a previous post. We explore two approaches to handle this numerical problem: adjusting the kernel parameters and introducing jitter to the diagonal of the covariance matrix. Additionally, we evaluate the use of low-rank matrix approximations for the covariance matrix.
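As a minimal illustration of the jitter approach (a sketch with illustrative values, not code from the post):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) kernel matrix."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(0, 1, 200)           # many closely spaced points -> nearly singular K
K = rbf_kernel(x, x)
jitter = 1e-6                        # small constant added to the diagonal
K_stable = K + jitter * np.eye(len(x))
L = np.linalg.cholesky(K_stable)     # succeeds; factorizing plain K may fail
```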
Gaussian Process Regression
This article discusses Gaussian Process regression, a non-parametric approach for modeling the relationship between input variables and their corresponding outputs. It presents the conditional probability distribution of a multivariate Gaussian and the covariance matrix computation using a kernel function. The article also covers the optimization of kernel hyperparameters to maximize the likelihood function. An implementation of Gaussian Process regression in Python is provided. The article includes examples of noiseless and noisy observation cases and demonstrates the prediction of values with mean and confidence intervals.
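A minimal numpy sketch of the noisy-observation case, assuming fixed kernel hyperparameters and a known noise level (all values illustrative, not taken from the article):

```python
import numpy as np

def rbf(a, b, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel."""
    d = a[:, None] - b[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 12)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = np.linspace(-4, 4, 200)

sigma_n = 0.1                                   # observation-noise standard deviation
K = rbf(x_train, x_train) + sigma_n**2 * np.eye(x_train.size)
K_s = rbf(x_train, x_test)
K_ss = rbf(x_test, x_test)

mean = K_s.T @ np.linalg.solve(K, y_train)      # predictive mean
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # predictive covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))   # mean +/- 1.96*std gives a ~95% interval
```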
Conditional Distribution of Multivariate Gaussian Variables: A Simple Derivation
We present a straightforward derivation for calculating the conditional probability distribution of multivariate Gaussian variables.
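The result being derived, stated for reference in the usual partitioned form (notation mine): if

$$
\begin{pmatrix}\mathbf{x}_1\\ \mathbf{x}_2\end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix}\boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2\end{pmatrix},
\begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}
\right),
$$

then

$$
\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}\!\left(
\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).
$$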
Multivariate Gaussian Distribution As Linear Transformation of Independent Normally Distributed Random Variables
This note explores the relationship between multivariate Gaussian variables and linear transformations of independent, normally distributed random variables. The main results include the derivation of the probability density function (PDF) of the multivariate Gaussian distribution and the observation that infinitely many linear transformations can turn independent, normally distributed random variables into multivariate Gaussian variables. The note also demonstrates specific ways of constructing such transformations using decomposition techniques such as singular value decomposition and Cholesky decomposition.
Transformations Corresponding to Kernels
Mercer’s Theorem is a fundamental result in kernel theory. It states that for a symmetric, positive semi-definite kernel, there exists a mapping $\phi$ that sends the input vector $\mathbf{x}$ to a higher-dimensional space such that the inner product of the transformed vectors equals the kernel value, $\phi(\mathbf{x})\cdot\phi(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$. This is what is referred to as the kernel trick in support vector machines (SVMs).
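A small numeric check of the idea (a toy example of my own, not from the post): the degree-2 polynomial kernel on 2-D inputs equals the dot product of an explicit 6-dimensional feature map.

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x.y + 1)^2."""
    return (x @ y + 1.0) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces the kernel above."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))   # both equal (x.y + 1)^2 = 4
```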
Using Regression to Check Variable Dependence in Three Types of Directed Acyclic Graphs
This article explores how regression can be used to determine the dependence between variables in three types of directed acyclic graphs (DAGs): pipe, confounder, and collider. The theoretical analysis of these graphs can be found in the linked blog post.
Variable Dependence in Three Types of Directed Acyclic Graphs
This note examines the dependence between variables in three types of directed acyclic graphs (DAGs): pipe, confounder, and collider.
Maximum Likelihood Estimation is An Approximation to Minimization of KL Distance
This is a brief derivation showing that maximum likelihood estimation approximates the minimization of the Kullback-Leibler (KL) distance.
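The one-line version of the argument (a standard result, written here in my own notation):

$$
\frac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i)
\;\xrightarrow[n\to\infty]{}\;
\mathbb{E}_{x\sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]
= -\,D_{\mathrm{KL}}\!\left(p_{\text{data}}\,\|\,p_\theta\right) + \text{const},
$$

so maximizing the average log-likelihood approximately minimizes the KL distance from the data distribution to the model.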
Effect of Noise in Data On Regression: Linear Model vs. Neural Network
We have observed that, when the data are noisy, a linear regression model performs as well as or better than more complex nonlinear models such as neural networks. In this note, we compare a linear model and a feed-forward neural network for regression with various amounts of noise in the data.
Data Smoothing with P-splines: An Implementation with scikit-learn and PyMC
This note uses P-splines (penalized splines) for data smoothing. Penalizing the differences between the coefficients of adjacent spline bases makes the fit smoother. The smoothness control is implemented in two ways: 1) adding the coefficient differences as a regularization term to the least-squares minimization in scikit-learn; and 2) modeling the coefficients as a Gaussian random walk in PyMC, a probabilistic programming library.
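A minimal sketch of the first approach, assuming scikit-learn's `SplineTransformer` for the B-spline basis and a second-order difference penalty (parameter values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)

B = SplineTransformer(n_knots=20, degree=3).fit_transform(x[:, None])  # B-spline basis matrix

k = B.shape[1]
D = np.diff(np.eye(k), n=2, axis=0)      # second-order differences of the coefficients

lam = 10.0                               # smoothing strength (illustrative)
coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
y_smooth = B @ coef                      # penalized least-squares fit
```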
Bias in Potential Outcomes in Causal Inference
This note summarizes my understanding of the bias in potential outcomes while reading the book Causal Inference: The Mixtape (https://mixtape.scunning.com).
Derivation of Linear Regression Coefficients and Their Variation with Minimal Matrix Algebra
This is a simple calculation of linear regression coefficients and their variances using covariances and variances, with minimal need for matrix algebra. This method can prove the regression anatomy theorem in a straightforward way.
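For a single regressor, the identities involved (standard results, my notation) are

$$
\hat{\beta} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)},
\qquad
\operatorname{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
$$

where $\sigma^2$ is the variance of the error term.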
XGBoost with GPUs and Multi-core CPU
We run XGBoost on a multi-core CPU and on GPUs. On the CPU, throughput peaks at 16 cores and does not improve with more. A single GPU is about 29% faster than the CPU with 16 cores. Interestingly, we do not observe a further speed-up when going from one GPU to two.
Hamiltonian Monte Carlo vs. Metropolis
This note compares the Metropolis and Hamiltonian Monte Carlo algorithms, using autocorrelation and effective sample size as metrics. A unimodal target distribution is used here; multimodal target distributions will be discussed in a future note.
Estimation of Variability From Observed Data: A Bayesian Perspective
The uncertainty of estimates derived from data follows directly from the posterior distribution of the data-generating model. This note uses the Bayesian approach to discuss two cases, one of which explains the bootstrap method.
Dirichlet Process in Mixture Model
We use the Dirichlet process to generate the weights in the mixture model to determine the optimal number of components automatically.
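A common way to generate such weights is the stick-breaking construction; a minimal numpy sketch, truncated at a fixed number of components (my simplification, not code from the note):

```python
import numpy as np

def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking draw of mixture weights from a DP(alpha) prior."""
    betas = rng.beta(1.0, alpha, size=n_components)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining             # sums to (almost) 1 for a large truncation

rng = np.random.default_rng(0)
w = stick_breaking_weights(alpha=2.0, n_components=30, rng=rng)
print(w[:5], w.sum())                    # a few dominant weights, total close to 1
```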
Multivariate Orthogonal Linear Regression Using PyMC
This note describes a multivariate orthogonal linear regression method using the PyMC probabilistic programming package. The formulation is based on an intuitive geometrical interpretation.
Flag of Ukraine with Matplotlib
In support of Ukraine and the Ukrainian people, I made a Ukrainian flag with matplotlib.
Logic of Science: Review of Bernoulli's Fallacy
The book is well researched and very lucid, and its arguments for the Bayesian approach are convincing. It is a useful book to read before Jaynes’ Probability Theory.
Numerical Solution to Monty Hall Problem using PyMC
We numerically solved the Monty Hall Problem with PyMC3, a probabilistic programming package in Python. The PyMC code is adapted from Austin Rochford’s Introduction to Probabilistic Programming with PyMC.
Student's t Mixture Model with PyMC
In this note, we compare the Gaussian mixture model and the Student’s t mixture model on two-dimensional data with an unbalanced proportion of clusters, shown in Figure 1. The results demonstrate that the Student’s t mixture model performs much better.
A Simple Non-Bayesian Solution to Monty Hall Problem
This short note describes a simple non-Bayesian solution to the Monty Hall problem. A charming Bayesian analysis can be found in the book Bernoulli’s Fallacy.
Solution to An Example Problem in Bernoulli's Fallacy
In this note, we solve an example problem in the book Bernoulli’s Fallacy using three approaches: (1) maximum likelihood, (2) Bayes’ theorem, and (3) MCMC simulation with PyMC3 package in Python.
First Blog
Blogging on GitHub Pages is quick to get started with. Jekyll, however, seems to require some effort to learn well.