Overview

Logistic regression is frequently used in classification and predictive analytics1. Put simply, the model estimates the probability (distribution) of certain events.

Input and Output

Input is nothing but the features, just as in the case of linear regression. Denote the input as $\boldsymbol{x}$.

The output can (usually) be a vector. We use $\boldsymbol{p}$ to indicate that it is related to a probability distribution.

  • For $K$-class classification, the model needs to generate $K$ class probabilities, $\{p_k\}_{k=1}^{K}$. The probabilities should satisfy the normalization condition, that is, $\sum_{k=1}^K p_k = 1$.
  • For binary classification, the model will output two probabilities $p$ and $1-p$ (sometimes written as $q$).

In either case, the true number of degrees of freedom of the output is $K-1$.

How to model the probability distribution?

Logistic regression belongs to a general class of models called generalized linear models (GLMs), where the output is a function of a linear combination of the input values, that is $$ p_k = f(\boldsymbol{x}^\top \boldsymbol{w}_k). $$ In logistic regression, the output probability for class $k$ is modeled as $$ \textcolor{DodgerBlue}{p_k \propto \exp(\boldsymbol{x}^\top \boldsymbol{w}_k)} = \frac1{C} \cdot \exp(\boldsymbol{x}^\top \boldsymbol{w}_k) $$ where $C$ is a normalization constant.
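
To make this modeling assumption concrete, here is a minimal numerical sketch; the feature vector `x` and the per-class weights `W` below are made-up values for illustration, not taken from the text.

```python
import numpy as np

# Made-up 3-feature input and weight vectors for K = 3 classes.
x = np.array([0.5, -1.2, 2.0])
W = np.array([[0.1, 0.4, -0.3],   # w_1
              [0.7, -0.2, 0.5],   # w_2
              [-0.6, 0.3, 0.1]])  # w_3

scores = W @ x                    # x^T w_k for each class k
unnormalized = np.exp(scores)     # p_k proportional to exp(x^T w_k)
C = unnormalized.sum()            # normalization constant
p = unnormalized / C              # class probabilities

print(p, p.sum())                 # the probabilities sum to 1
```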

Why $\exp$?

  • One possible intuition is that $\exp$ is a strictly positive function, which gets us one step closer to the goal of defining a probability.

Binary classification ($K=2$)

$$ p_1 = \frac1C \exp(\boldsymbol{x}^\top \boldsymbol{w}_1), \quad p_2 = \frac1C \exp(\boldsymbol{x}^\top \boldsymbol{w}_2) $$ Therefore $$ C = \exp(\boldsymbol{x}^\top \boldsymbol{w}_1) + \exp(\boldsymbol{x}^\top \boldsymbol{w}_2). $$ Obviously, we can rewrite the two probabilities as \begin{align} p_1 &= \frac{1}{1+\exp(\boldsymbol{x}^\top (\boldsymbol{w}_2 - \boldsymbol{w}_1))} = \frac1{1+\exp(-\boldsymbol{x}^\top\boldsymbol{w})}, \\ p_2 &= \frac{1}{\exp(\boldsymbol{x}^\top (\boldsymbol{w}_1 - \boldsymbol{w}_2))+1} = \frac1{1+\exp(\boldsymbol{x}^\top\boldsymbol{w})}. \notag \end{align}

Notice that the effective parameter we want to learn is actually $\boldsymbol{w} = \boldsymbol{w}_1-\boldsymbol{w}_2$. Equation $\text{(1)}$ is the well-known sigmoid2 function. $$ p_1 = \sigma(\boldsymbol{x}^\top \boldsymbol{w}). $$
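
As a quick sanity check on the algebra above, the sketch below (with made-up `x`, `w1`, `w2`) verifies that the normalized two-class probability equals the sigmoid of $\boldsymbol{x}^\top(\boldsymbol{w}_1 - \boldsymbol{w}_2)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up input and two per-class weight vectors.
x  = np.array([1.0, -0.5, 2.0])
w1 = np.array([0.3, 0.8, -0.1])
w2 = np.array([-0.2, 0.1, 0.4])

# Two-class normalized probability...
e1, e2 = np.exp(x @ w1), np.exp(x @ w2)
p1_normalized = e1 / (e1 + e2)

# ...equals the sigmoid of the difference w = w1 - w2.
p1_sigmoid = sigmoid(x @ (w1 - w2))

print(np.isclose(p1_normalized, p1_sigmoid))  # True
```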

Multiclass classification ($K > 2$)

The normalization constant $C$ takes the form $C = \sum_{j=1}^K \exp(\boldsymbol{x}^\top \boldsymbol{w}_j)$. Therefore, for each class, the probability is \begin{equation} p_k = \frac{\exp(\boldsymbol{x}^\top \boldsymbol{w}_k)}{\sum_{j=1}^K \exp(\boldsymbol{x}^\top \boldsymbol{w}_j)} \end{equation} The above equation $\text{(2)}$ is the well-known softmax3 function.
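
A minimal softmax sketch is given below. The class scores are made up, and the max-subtraction trick is a standard numerical-stability detail not discussed in the text.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) keeps exp() from overflowing; the result is unchanged
    # because the constant cancels in the ratio.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Made-up class scores x^T w_k for K = 4 classes.
scores = np.array([2.0, -1.0, 0.5, 3.0])
p = softmax(scores)
print(p, p.sum())  # probabilities summing to 1
```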

Model Objective

To find the optimal parameters $\{\boldsymbol{w}_k\}$, one needs to propose an objective function. This time, the model is actually generating a distribution as its output, so a popular choice of "distance" between two distributions is the cross entropy4, whose discrete form reads $$ H(p, q) = \textcolor{red}{\boldsymbol{-}} \sum_{x \in \mathcal X} p(x) \log q(x) $$ according to Wikipedia. Also, minimizing the cross entropy is equivalent to maximizing the likelihood in the context of a classification problem.
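
Here is a small sketch of the discrete cross entropy, with made-up distributions `p` and `q` and a small `eps` added only to guard against $\log 0$.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x); eps guards against log(0).
    return -np.sum(p * np.log(q + eps))

# Made-up "true" distribution p and model distribution q over 3 outcomes.
p = np.array([1.0, 0.0, 0.0])      # e.g. a one-hot label
q = np.array([0.7, 0.2, 0.1])      # model output
print(cross_entropy(p, q))          # = -log(0.7), about 0.357
```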

For $K=2$

Let the input vector be $\boldsymbol{x}$, the output probability be $\boldsymbol{p} = (p, 1-p)$, and the (one-hot encoded) ground-truth label be $\boldsymbol{y} = (1, 0)$ or $(0, 1)$. The cross entropy loss in binary classification for a single sample is $$ \begin{aligned} H(\boldsymbol{y},\boldsymbol{p}) &= \begin{cases} -(1 \cdot \log p + 0 \cdot \log(1-p)) = -\log p & \text{when~}\boldsymbol{y} = (1,0) \\ -(0 \cdot \log p + 1 \cdot \log(1-p)) = -\log (1-p) & \text{when~}\boldsymbol{y} = (0,1) \end{cases}\\ &= -\boldsymbol{y}^\top \log \boldsymbol{p} \end{aligned} $$ This loss is called the $\textbf{B}$inary $\textbf{C}$ross $\textbf{E}$ntropy (BCE) loss.

Side Note

Usually, we see the loss function of binary cross entropy written as $$ H(y, p) = -(y\log p + (1-y) \log(1-p)). $$ This is because we can map $$ \begin{aligned} \boldsymbol{y} &= (1,0) \mapsto y = 1,\\ \boldsymbol{y} &= (0,1) \mapsto y = 0.\\ \end{aligned} $$

On the training set, the loss function is $$ \mathcal L = \sum_{i=1}^M H(\boldsymbol{y}^{(i)}, \boldsymbol{p}^{(i)}) = - \sum_{i=1}^M {\boldsymbol{y}^{(i)}}^\top \cdot \log (\boldsymbol{p}^{(i)}) $$
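
The sketch below, with made-up predictions and labels for $M = 3$ samples, checks that the usual scalar BCE form and the vector form $-\boldsymbol{y}^\top \log \boldsymbol{p}$ give the same total loss.

```python
import numpy as np

def bce_scalar(y, p, eps=1e-12):
    # Scalar form: y in {0, 1}, p is the predicted probability of class 1.
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Made-up predictions and labels for M = 3 samples.
p = np.array([0.9, 0.2, 0.6])   # predicted P(class 1)
y = np.array([1,   0,   1  ])   # ground-truth labels

loss_scalar = bce_scalar(y, p).sum()

# Equivalent one-hot / vector form: -y^T log(p), summed over samples.
P = np.stack([p, 1 - p], axis=1)          # rows: (p, 1 - p)
Y = np.stack([y, 1 - y], axis=1)          # rows: one-hot labels
loss_vector = -(Y * np.log(P)).sum()

print(np.isclose(loss_scalar, loss_vector))  # True
```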

For $K > 2$

The above form is directly generalizable: with a one-hot label $\boldsymbol{y}$ and the softmax output $\boldsymbol{p}$ from Equation $\text{(2)}$, the per-sample loss is still $H(\boldsymbol{y}, \boldsymbol{p}) = -\boldsymbol{y}^\top \log \boldsymbol{p}$, summed over the training set.
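
For instance, here is a minimal sketch of the multiclass per-sample loss, assuming made-up class scores and a one-hot label for class 2.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Made-up scores x^T w_k for K = 3 classes and a one-hot label for class 2.
scores = np.array([1.5, 0.2, -0.7])
y = np.array([0, 1, 0])

p = softmax(scores)
loss = -np.sum(y * np.log(p))   # = -log(p_2), same -y^T log(p) form as in the binary case
print(loss)
```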

Solving the optimization problem

Once the objective function is fixed, we can leverage gradient-descent-type methods or Newton-type methods to perform the optimization. One piece of good news is that the gradient of this loss function is elegant: $$ \nabla_{\boldsymbol{w}_k} H(\boldsymbol{y}, \boldsymbol{p}) = (p_k - y_k)\boldsymbol{x} $$ See Equation (5.47) in https://web.stanford.edu/~jurafsky/slp3/5.pdf for details. (As a matter of fact, the binary and multiclass cases share this same gradient form.)
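
As an illustration, here is a minimal gradient-descent sketch under assumptions not in the text: synthetic data generated from a made-up `w_true`, a fixed learning rate, and a fixed number of iterations. It uses exactly the $(p_k - y_k)\boldsymbol{x}$ gradient above, averaged over the training set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary-classification data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # 200 samples, 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = (sigmoid(X @ w_true) > rng.uniform(size=200)).astype(float)

# Plain gradient descent on the cross-entropy loss.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)   # (p_i - y_i) x_i averaged over the training set
    w -= lr * grad

print(w)   # roughly aligned with w_true
```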

Model Evaluation

Other than looking at the loss function, there is a popular choice: the confusion matrix. See Wikipedia5. Here are some popular metrics of interest (a small sketch computing them from the confusion-matrix counts follows the list):

  • accuracy is defined by $$ \text{ACC} = \frac{\text{TP} + \text{TN}}{\text{P}+\text{N}}. $$
  • recall (true positive rate, sensitivity) is defined as $$ \text{TPR} = \frac{\text{TP}}{\text{P}}. $$ High recall means the model misses few actual positives.
  • precision (positive predictive value) is defined as $$ \text{PPV} = \frac{\text{TP}}{\text{TP}+\text{FP}}. $$ High precision means the model rarely labels negatives as positive.
  • F1 score is defined as the harmonic mean of precision and recall, $$ \text{F1} = \frac{2 \cdot \text{PPV} \cdot \text{TPR}}{\text{PPV} + \text{TPR}}. $$
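
Here is the promised sketch computing these metrics from the four confusion-matrix counts; the counts passed in at the end are made up.

```python
def classification_metrics(tp, fp, tn, fn):
    # All four metrics are computed from the confusion-matrix counts.
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    recall    = tp / (tp + fn)            # TPR = TP / P
    precision = tp / (tp + fp)            # PPV
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Made-up counts for illustration.
print(classification_metrics(tp=80, fp=10, tn=95, fn=15))
```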

Discussions

Why cross entropy?

  • Convex objective function.
  • Simple gradient computation.
  • Theoretically grounded: equivalent to maximum likelihood estimation.

Why not KL Divergence?

They are the same: $$ \min_{\boldsymbol{w}} \sum_{i=1}^M H(\boldsymbol{y}^{(i)}, \boldsymbol{p}^{(i)}) \Leftrightarrow \min_{\boldsymbol{w}} \sum_{i=1}^M D_\text{KL}(\boldsymbol{y}^{(i)} ~||~ \boldsymbol{p}^{(i)}) $$ In fact $$ D_\text{KL} (P ~||~ Q) = H(P, Q) - H(P) $$ and the entropy $H(P)$ is constant in the optimization problem.
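
A quick numerical check of the identity $D_\text{KL}(P \,||\, Q) = H(P, Q) - H(P)$, with made-up distributions and a small `eps` guard:

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(q + eps))

def kl_divergence(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Made-up distributions over 3 outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.6, 0.2, 0.2])

# D_KL(p || q) = H(p, q) - H(p)
print(np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p)))  # True
```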

What will happen if we use a loss like $\|\boldsymbol{y}-\boldsymbol{p}\|^2$?

One can show6 that the optimization becomes harder because the objective function is no longer convex in the parameters.
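
As a rough illustration of the non-convexity claim, the sketch below picks a single made-up sample ($x = 1$, $y = 0$) and checks the midpoint condition $L\big(\tfrac{a+b}{2}\big) \le \tfrac{L(a) + L(b)}{2}$, which every convex function must satisfy; it fails here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Squared loss for a single sample with x = 1, y = 0: L(w) = (0 - sigmoid(w))^2.
L = lambda w: (0.0 - sigmoid(w)) ** 2

# For a convex function, L((a + b) / 2) <= (L(a) + L(b)) / 2 must hold.
a, b = 1.0, 5.0
print(L((a + b) / 2), (L(a) + L(b)) / 2)   # about 0.907 > 0.761, so convexity fails
```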

Why not using Linear Regression for Binary Classification?

  1. Linear regression predicts continuous values that are not valid probabilities (they can fall outside $[0, 1]$).
  2. Linear regression is sensitive to outliers.

Logistic function VS Sigmoid function

The logistic function7 takes the form $$ p(x) = \frac1{1+\exp(-k(x-x_0))} $$ while the sigmoid function used in logistic regression takes the form $$ p(\boldsymbol{x}) = \frac1{1+\exp(-\boldsymbol{x}^\top \boldsymbol{w})} $$ They are closely related: the sigmoid is the standard logistic function (with $k = 1$, $x_0 = 0$) applied to the scalar $\boldsymbol{x}^\top \boldsymbol{w}$. But they are not exactly the same!
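
A quick numerical check of this connection (with made-up `x_vec` and `w`): the standard logistic function with $k = 1$, $x_0 = 0$ evaluated at $z = \boldsymbol{x}^\top \boldsymbol{w}$ matches the sigmoid used above.

```python
import numpy as np

def logistic(x, k=1.0, x0=0.0):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def sigmoid_of_linear(x_vec, w):
    return 1.0 / (1.0 + np.exp(-(x_vec @ w)))

# Made-up input and weights.
x_vec = np.array([0.5, -1.0, 2.0])
w     = np.array([1.2, 0.3, -0.4])

# The "sigmoid" of logistic regression is the standard logistic function
# (k = 1, x0 = 0) applied to the scalar z = x^T w.
print(np.isclose(logistic(x_vec @ w), sigmoid_of_linear(x_vec, w)))  # True
```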

References