Mathematics, Post 1: Supervised Machine Learning

saksena44088
Oct 31, 2025
Updated: Nov 7, 2025

Supervised machine learning learns from examples with answers. For example: you give a model many input-output pairs, it makes a guess for each input, checks how far off it was from the known answer, and then tweaks itself to do better next time. Do this over and over, and the model discovers a rule that maps inputs to outputs. When the outputs are numbers, it’s called regression; when they’re categories, it’s classification. Well-run supervised projects also report confidence, not just a single guess, so you can judge how much to trust a prediction. The sections that follow move from this idea to the working details – how data is represented, how performance is scored, how models are improved, and why different families of methods behave the way they do.

The Learning Pipeline

A supervised workflow always follows the same process. You begin with data, pairs of inputs and targets \((x_i, y_i)\). You select a model \(f_\theta\) with adjustable parameters \(\theta\) and use it to produce predictions \(\hat y_i = f_\theta(x_i)\). You quantify error with a loss function, which measures how “wrong” the model is by comparing each prediction to its true value and then averaging over the training set, for instance \(L(\theta)=\tfrac{1}{n}\sum_{i=1}^n \ell(\hat y_i, y_i)\).

Training is the process of reducing this single value. The standard update moves parameters in the direction that most quickly decreases loss, \(\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)\), with a step size \(\alpha\). Holding out test data that the model never sees during training is crucial for determining whether accuracy reflects real learning or overfitting (memorization of the training set). With this loop, ML replaces guesswork with repeatable improvement.
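The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a one-feature linear model \(f_\theta(x) = wx + b\), a squared-error loss, and a tiny made-up dataset; the names `predict`, `loss`, `grad`, and `fit` are illustrative, not taken from any library.

```python
# Hypothetical toy data generated from y = 2x + 1, split into train and test.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
test = [(4.0, 9.0), (5.0, 11.0)]

def predict(theta, x):
    w, b = theta
    return w * x + b

def loss(theta, data):
    # L(theta) = (1/n) * sum of squared errors over the given set
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    # Partial derivatives of the mean squared error with respect to w and b
    n = len(data)
    gw = sum(2 * (predict(theta, x) - y) * x for x, y in data) / n
    gb = sum(2 * (predict(theta, x) - y) for x, y in data) / n
    return gw, gb

def fit(data, alpha=0.05, steps=2000):
    theta = (0.0, 0.0)
    for _ in range(steps):
        gw, gb = grad(theta, data)
        # theta <- theta - alpha * gradient
        theta = (theta[0] - alpha * gw, theta[1] - alpha * gb)
    return theta

theta = fit(train)               # parameters are fitted on the training set only
train_loss = loss(theta, train)
test_loss = loss(theta, test)    # the held-out set is used only for this check
```

Comparing `train_loss` with `test_loss` is the simplest form of the held-out check: a model that had merely memorized the training pairs would show a much larger loss on the unseen pairs.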

Vectors and Linear Maps

Vectors represent examples as ordered lists of features, and linear maps express how a model weights and combines those features. The simplest prediction rule takes a weighted sum: \(\hat y = w^\top x\). Stacking \(n\) examples as rows of a matrix \(X\) and using a single weight vector \(w\) yields all predictions at once, \(\hat y = Xw\). This algebra captures a geometric picture. The dot product \(w^\top x\) measures alignment; the model’s decision is a projection of the input onto a direction that the weights deem important. Changing feature scales changes the geometry and can distort training, so in practice features are often normalized so that no single coordinate dominates the updates. The aim is simple dynamics: when distances and directions are sensible, optimization makes steady, predictable progress.
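As a small illustration of \(\hat y = Xw\) and of feature normalization, the sketch below uses plain Python lists. The data and the helper names `matvec` and `standardize` are made up for this example, and the standardization assumes no column is constant (a zero standard deviation would divide by zero).

```python
import math

def matvec(X, w):
    # One prediction per row of X: y_hat_i = w . x_i
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]

def standardize(X):
    # Rescale each column to zero mean and unit standard deviation.
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

# Two features on very different scales (hypothetical values).
X = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
w = [0.5, 0.001]

y_hat = matvec(X, w)      # all predictions at once
X_std = standardize(X)    # both columns now live on comparable ranges
```

After standardization both columns have the same spread, so a gradient step no longer moves one weight far more than the other simply because of the units its feature was measured in.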

Loss Functions

Losses translate goals into targets for calculus. For regression, the mean squared error penalizes large mistakes more than small ones, \(\mathrm{MSE}=\tfrac{1}{n}\sum (y-\hat y)^2\). This choice yields smooth gradients and an interpretable measure of fit. For binary classification you pass a linear score through a sigmoid, \(\sigma(z)=\tfrac{1}{1+e^{-z}}\), to obtain a probability, then compare that probability to the truth with cross-entropy, \(\ell= -[y\log \hat p + (1-y)\log(1-\hat p)]\).

The effect is direct: the loss rewards calibrated confidence and punishes being confidently wrong. For many classes you convert a vector of scores to probabilities with softmax, \(\,p_k=\tfrac{e^{z_k}}{\sum_j e^{z_j}}\,\), and again use cross-entropy, which cleanly couples probabilistic interpretation with optimization. In each case, the loss is the bridge from intent to computation.
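The formulas above translate almost directly into code. The sketch below is one possible rendering using only the standard library; the function names are illustrative, and it assumes probabilities strictly between 0 and 1 (a logarithm of zero would raise an error).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, p):
    # y is the true label (0 or 1), p the predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def softmax(z):
    m = max(z)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(true_class, probs):
    # With a one-hot target, only the true class's probability contributes
    return -math.log(probs[true_class])

# Same true label (1), three levels of confidence.
confident_right = binary_cross_entropy(1, 0.9)
unsure = binary_cross_entropy(1, 0.5)
confident_wrong = binary_cross_entropy(1, 0.1)

probs = softmax([2.0, 1.0, 0.1])
multiclass_loss = cross_entropy(0, probs)
```

The three binary losses increase in that order, which is the "punishes being confidently wrong" behavior in concrete form.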

Gradients and Optimization

Optimization is a blind walk to the valley floor guided by local slope. The gradient \(\nabla_\theta L\) tells you in which direction the loss increases most; stepping against it reduces error fastest locally. Full-batch gradient descent uses all examples to compute each step and gives a stable direction at the cost of time. Mini-batch methods estimate the gradient from small random subsets, trading noise for speed and allowing rapid, hardware-friendly updates. Momentum keeps a moving average of recent gradients so that updates carry through flat regions and are less rattled by noise. Adaptive methods such as Adam rescale coordinates using estimates of gradient variability so that parameters with consistently large slopes do not dominate the step while quiet but reliable directions still move. All of these methods rely on the chain rule to propagate the influence of small parameter changes through the model’s computations to the final loss.
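To make the mini-batch and momentum ideas concrete, here is a minimal sketch on a made-up, noise-free regression problem. The step size, momentum coefficient, batch size, and epoch count are assumptions chosen for this toy case, not recommended defaults, and the velocity update shown is one common formulation among several.

```python
import random

# Hypothetical noise-free data from y = 3x - 2 on a grid over [-1, 1].
xs = [-1 + 2 * i / 19 for i in range(20)]
data = [(x, 3 * x - 2) for x in xs]

def batch_grad(theta, batch):
    # Gradient of the mean squared error over one mini-batch.
    w, b = theta
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def sgd_momentum(data, alpha=0.02, beta=0.9, batch_size=5, epochs=300, seed=0):
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    theta = [0.0, 0.0]
    velocity = [0.0, 0.0]
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)          # new random mini-batches every epoch
        for start in range(0, len(rows), batch_size):
            g = batch_grad(theta, rows[start:start + batch_size])
            for j in range(2):
                # Exponentially decaying accumulation of recent gradients
                velocity[j] = beta * velocity[j] + g[j]
                theta[j] -= alpha * velocity[j]
    return theta

w, b = sgd_momentum(data)
```

Each step sees only five examples, so individual gradients are noisy estimates of the full-batch direction; the velocity term smooths them out across steps.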

Linear Regression

Linear regression offers an exact algebraic solution and a scalable iterative one. With mean squared error and a design matrix \(X\), setting the gradient to zero yields the normal equations \(X^\top(Xw-y)=0\), whose solution is \(w^\star=(X^\top X)^{-1}X^\top y\) when the matrix \(X^\top X\) is invertible and well-conditioned. This route is fast and precise for modest feature counts and well-behaved data. When the matrix is ill-conditioned or when data volumes are large, you instead take gradient steps, \(w \leftarrow w-\alpha X^\top(Xw-y)\), where the constant factor \(2/n\) from the mean squared error has been absorbed into \(\alpha\). Viewed geometrically, both approaches are balancing a cloud of points with a weighted plane; the closed form leaps to the minimizer in one calculation if the geometry is cooperative, while gradient descent walks there by consistent corrections, which is often more reliable at scale. Feature scaling links these views: when columns of \(X\) live on similar ranges, both the direct solution and the iterative steps behave numerically better.
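The two routes can be compared side by side on a tiny problem. The sketch below is written in plain Python under stated assumptions: the data are made up, the first column of \(X\) is a constant 1 acting as an intercept, and `solve` is a small illustrative Gaussian-elimination helper rather than a library routine. It solves the normal equations as a linear system instead of forming the inverse explicitly, which is generally the numerically safer choice.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system A w = b.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    w = [0.0] * n
    for k in reversed(range(n)):
        w[k] = (M[k][n] - sum(M[k][c] * w[c] for c in range(k + 1, n))) / M[k][k]
    return w

# Hypothetical data; the first column is a bias feature fixed at 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
Xt = [list(col) for col in zip(*X)]

# Closed form: solve (X^T X) w = X^T y
XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in Xt] for ci in Xt]
Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]
w_closed = solve(XtX, Xty)

# Iterative form: w <- w - alpha * X^T (X w - y)
w_gd = [0.0, 0.0]
alpha = 0.05
for _ in range(5000):
    residual = [sum(wj * xj for wj, xj in zip(w_gd, row)) - yi for row, yi in zip(X, y)]
    grad = [sum(a * r for a, r in zip(col, residual)) for col in Xt]
    w_gd = [wj - alpha * gj for wj, gj in zip(w_gd, grad)]
```

On this well-conditioned example both routes agree to many decimal places; the step size `alpha` was picked by hand for this data and would need retuning if the feature scales changed.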

Logistic and Softmax Regression

Classification adapts the same linear core to probabilistic outputs. In the binary case, a model learns a linear boundary \(w^\top x+b=0\) and converts signed distance from that boundary into a probability using the sigmoid. Training with cross-entropy has a compact gradient, \(\nabla_w L=\tfrac{1}{n}X^\top(\hat p-y)\), which has a simple interpretation: raise the score on examples of the positive class and lower it on examples of the negative class, in proportion to the current error \(\hat p-y\). In the multiclass case, you compute one score per class, normalize with softmax, and again obtain a tidy gradient, \(\nabla_W L=\tfrac{1}{n}X^\top(P-Y)\), where \(P\) collects current probabilities and \(Y\) holds one-hot targets. These structures make logistic and softmax regression dependable baselines. They learn clean linear boundaries, produce probabilities that are often reasonably well calibrated, and provide gradients that are numerically stable and easy to implement.
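The binary gradient \(\tfrac{1}{n}X^\top(\hat p-y)\) is short enough to implement directly. The following is a minimal sketch in plain Python, assuming a made-up one-feature dataset with overlapping classes and a bias handled as a constant first column; the step size and iteration count are hand-picked for this toy case.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data with overlapping classes; first column is a bias feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

def probabilities(w):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) for row in X]

def loss(w):
    # Average binary cross-entropy over the dataset
    p = probabilities(w)
    return -sum(t * math.log(q) + (1 - t) * math.log(1 - q) for t, q in zip(y, p)) / len(y)

def gradient(w):
    # (1/n) X^T (p_hat - y)
    p = probabilities(w)
    n = len(y)
    return [sum(row[j] * (q - t) for row, q, t in zip(X, p, y)) / n for j in range(len(w))]

w = [0.0, 0.0]
initial_loss = loss(w)            # equals log(2): every example starts at p = 0.5
for _ in range(500):
    g = gradient(w)
    w = [wj - 0.5 * gj for wj, gj in zip(w, g)]
final_loss = loss(w)
```

Because the classes overlap, the weights settle at finite values instead of growing without bound, which is what happens on perfectly separable data when no regularization is used.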

Decision Trees and Ensembles

Trees abandon weighted sums and instead carve the feature space with a sequence of questions that split examples into increasingly uniform groups. Purity is measured by functions of the class proportions in a node, such as Gini impurity \(\,1-\sum_k p_k^2\,\) or entropy \(\,-\sum_k p_k\log p_k\,\). At each node the split that produces the largest drop in impurity is chosen, and the process repeats within each branch until stopping criteria are met. The resulting model is a patchwork of simple rules that is straightforward to read but sensitive to data quirks; small changes can redirect splits and yield different trees. Ensembles stabilize and improve this idea. Random forests train many trees on bootstrap samples of the data, with a random subset of features considered at each split, then average their outputs, reducing variance without increasing bias dramatically. Boosting fits trees sequentially, each new tree modeling the residual errors of the current ensemble, which can be viewed as taking small steps along the negative gradient of the chosen loss in function space. Even though trees are not defined through calculus, they still fit into the same optimization frame and require the same care in validation.
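The split search for a single numeric feature can be sketched as follows. This is an illustrative version, assuming thresholds are tried midway between consecutive distinct values and that child impurities are weighted by child size; `best_split` is a made-up helper, not a library function.

```python
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def best_split(values, labels, impurity=gini):
    # Try thresholds midway between consecutive distinct sorted values and
    # keep the one with the largest drop in size-weighted impurity.
    n = len(labels)
    parent = impurity(labels)
    best = (None, 0.0)  # (threshold, impurity drop)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        child = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if parent - child > best[1]:
            best = (t, parent - child)
    return best

# Hypothetical feature values and class labels.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, drop = best_split(values, labels)
```

A full tree would apply `best_split` across every feature, pick the best overall, and then recurse into the left and right groups until a stopping rule such as a minimum node size is reached.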

Neural Networks

Neural networks return to linear maps but interleave them with nonlinear activations so that the overall function is more than a single weighted sum. A standard layer computes an affine transformation \(z=Wx+b\) followed by an activation \(a=\phi(z)\) such as ReLU or \(\tanh\). Stacking layers composes these simple blocks, allowing the network to transform raw data into representations where a final decision is easy. The gradients that train these parameters are obtained by backpropagation, which is the chain rule organized for layered structures. Starting at the loss, you compute an error signal at the output and pass it backward, multiplying by local derivatives to obtain the error for the previous layer, and so on, until every weight and bias has an associated correction. For a final sigmoid with binary cross-entropy, the output error simplifies to “prediction minus truth,” \(\delta=\hat y-y\). For a softmax with multiclass cross-entropy, the same simplification holds at the logits, \(\Delta=P-Y\). These identities make the backward pass efficient and numerically stable. In practice, successful training depends on sane initializations, normalized inputs to keep scales comparable across features, activation choices that preserve gradient flow, mini-batches that balance noise and hardware throughput, and learning-rate schedules that are bold early and careful late.
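Backpropagation for a very small network fits in a few lines, and a finite-difference check is a common way to confirm the gradients. The sketch below assumes one hidden layer of three \(\tanh\) units, a sigmoid output, binary cross-entropy, and a single made-up training example; the sizes, seed, and names are illustrative only.

```python
import math
import random

def forward(params, x):
    W1, b1, w2, b2 = params
    z1 = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]                 # hidden activations
    z2 = sum(wi * ai for wi, ai in zip(w2, a1)) + b2
    y_hat = 1.0 / (1.0 + math.exp(-z2))             # sigmoid output
    return a1, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(params, x, y):
    W1, b1, w2, b2 = params
    a1, y_hat = forward(params, x)
    delta2 = y_hat - y                              # output error: prediction minus truth
    grad_w2 = [delta2 * a for a in a1]
    grad_b2 = delta2
    # Pass the error back through w2 and the tanh derivative (1 - a^2).
    delta1 = [delta2 * wi * (1 - a * a) for wi, a in zip(w2, a1)]
    grad_W1 = [[d * xj for xj in x] for d in delta1]
    grad_b1 = delta1
    return grad_W1, grad_b1, grad_w2, grad_b2

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
w2 = [rng.uniform(-1, 1) for _ in range(3)]
b2 = 0.0
params = (W1, b1, w2, b2)
x, y = [0.5, -1.0], 1

grad_W1, grad_b1, grad_w2, grad_b2 = backward(params, x, y)

# Finite-difference check on one weight, W1[0][0] (central difference).
eps = 1e-6
W1[0][0] += eps
loss_plus = loss(params, x, y)
W1[0][0] -= 2 * eps
loss_minus = loss(params, x, y)
W1[0][0] += eps                                     # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
```

If the backward pass is correct, `numeric` and `grad_W1[0][0]` agree to several decimal places; a mismatch usually points to a missing local derivative somewhere in the chain.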

Evaluation and Data Discipline

A model’s promise is only as credible as its evaluation. You maintain three separate sets of data with distinct roles: training to fit parameters, validation to tune hyperparameters and choose stopping points, and test to produce a final, unbiased estimate once decisions are fixed. When data is scarce, \(k\)-fold cross-validation rotates the validation role through the data and averages performance, but a final untouched test remains essential. Metrics must reflect the task. In regression, root mean squared error \(\mathrm{RMSE}=\sqrt{\tfrac{1}{n}\sum (y-\hat y)^2}\) reports typical mistakes in natural units. In classification, accuracy can hide failures on rare classes, so precision, recall, and probabilistic metrics such as average cross-entropy and ROC-AUC complement it. The most common source of false confidence is leakage – features that smuggle target information, preprocessing that learns from the entire dataset, or time-series splits that allow the future to inform the past. The remedy is procedural: build pipelines that fit transforms on the training set only, apply them unchanged to validation and test, and respect temporal order when the data evolve over time.
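Several of these habits are easy to express in code. The sketch below is illustrative only: the helper names and data are made up, the interleaved folds in `k_fold_indices` are an assumption that suits unordered data but not time series, and returning 0.0 for an undefined precision or recall is a convention chosen here for simplicity.

```python
import math

def fit_scaler(train_values):
    # Learn mean and standard deviation from the training split only.
    n = len(train_values)
    mean = sum(train_values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
    return mean, std

def transform(values, scaler):
    # Apply previously learned statistics unchanged to any split.
    mean, std = scaler
    return [(v - mean) / std for v in values]

def k_fold_indices(n, k):
    # Yield (train_idx, val_idx) pairs; each index is used for validation exactly once.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A classifier that always predicts the majority class looks accurate
# but never finds the rare positive class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)

# Scaling statistics come from the training values; the test value is only transformed.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train_vals, test_vals = values[:4], values[4:]
scaler = fit_scaler(train_vals)
test_scaled = transform(test_vals, scaler)
```

Here accuracy is 0.9 while recall on the positive class is 0.0, which is the kind of failure accuracy alone would hide. Fitting the scaler on all five values instead would let the extreme test value shape the training features, a small instance of the leakage described above.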

Synthesis

The underlying constructs of ML become understandable once these mathematical pieces are visible. Linear models compute weighted sums and reduce a smooth, convex loss with gradients like \(X^\top(Xw-y)\) or \(X^\top(\hat p-y)\). Logistic and softmax regression interpret those sums as probabilities and adjust parameters with direct, interpretable signals that push mass toward the true classes. Trees segment the space with sequential questions and gain stability from ensembles that average or refine them stage by stage. Neural networks compose simple linear-nonlinear blocks and rely on backpropagation to distribute responsibility for the loss across millions of parameters. In all cases the same structure repeats: represent data in a space where simple operations capture useful regularities, define a single scalar objective that encodes what “better” means, and update parameters with small, reliable steps that move that objective down. The mathematics is compact, the workflow is systematic, and the intuition aligns with the algebra.
