Regression¶
Linear Regression¶
Introduction¶
- Goal:
- Given a training dataset of pairs \((\mathbf{x}, y)\), where \(\mathbf{x}\) is a vector and \(y\) is a scalar.
- Learn a function (a.k.a. a curve or line) \(y' = h(\mathbf{x})\) that best fits the training data.
Examples¶
\(k\)-Nearest Neighbors Regression¶
Decision Tree Regression¶
Linear Functions, Residuals, and Mean Squared Error¶
Regression is the task of predicting real-valued outputs.
Common Misunderstanding
Linear functions \(\neq\) Linear decision boundaries.
Optimization for ML¶
In unconstrained optimization, we try to minimize (or maximize) a function with no constraints on its inputs.
Given a function $$ J(\boldsymbol{\theta}), J: \mathbb{R}^M \rightarrow \mathbb{R} $$
Our goal is to find $$ \hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^M} J(\boldsymbol{\theta}) $$
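To make the notation concrete, here is a small sketch (not from the notes) that finds \(\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})\) for a made-up quadratic objective using SciPy's general-purpose unconstrained minimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective J: R^2 -> R with known minimizer theta* = (3, -1) (made up for illustration).
def J(theta):
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

theta0 = np.zeros(2)              # arbitrary starting point
result = minimize(J, theta0)      # unconstrained minimization (BFGS by default)
print(result.x)                   # approximately [ 3. -1.]
```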
Linear Regression as Function Approximation
- Assume \(\mathcal{D}\) is generated as:
  \[ \mathbf{x}^{(i)} \sim p^*(\cdot) \\ y^{(i)} = h^*(\mathbf{x}^{(i)}) \]
- Choose hypothesis space, \(\mathcal{H}\): all linear functions in \(M\)-dimensional space
  \[ \mathcal{H} = \left\{ h_{\boldsymbol{\theta}}: h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}, \ \boldsymbol{\theta} \in \mathbb{R}^M \right\} \]
- Choose an objective function: mean squared error (MSE), where \(e^{(i)} = y^{(i)} - h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})\) is the residual for example \(i\)
  \[ \begin{aligned} J(\boldsymbol{\theta}) &= \frac{1}{N} \sum_{i=1}^N \left( e^{(i)} \right)^2 \\ &= \frac{1}{N} \sum_{i=1}^N \left( y^{(i)} - h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) \right)^2 \\ &= \frac{1}{N} \sum_{i=1}^N \left( y^{(i)} - \boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} \right)^2 \end{aligned} \]
- Solve the unconstrained optimization problem via your favorite method:
  - Gradient Descent
  - Closed form (sketched below)
  - etc.
- Test time: given a new \(\mathbf{x}\), make prediction \(\hat{y}\)
  \[ \hat{y} = h_{\hat{\boldsymbol{\theta}}}(\mathbf{x}) = \hat{\boldsymbol{\theta}}^{\top} \mathbf{x} \]
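As one concrete instance of the "Closed form" option, here is a minimal NumPy sketch of ordinary least squares via the normal equations; the synthetic data and variable names are illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))                        # design matrix, one row per example
theta_true = np.array([2.0, -1.0, 0.5])            # made-up generating parameters
y = X @ theta_true + 0.1 * rng.normal(size=N)      # noisy linear targets

# Closed-form minimizer of the MSE: theta_hat = (X^T X)^{-1} X^T y,
# computed here with a linear solve rather than an explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_pred = X @ theta_hat                             # test time: y_hat = theta_hat^T x per row
```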
Optimization Method #0: Random Guess¶
Notation Trick: Folding in the Intercept Term
We can fold in the intercept term by adding a column of 1's to the input matrix \(\mathbf{X}\).
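A minimal sketch of the trick, assuming a small made-up input matrix:

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [1.0, 5.0]])                         # N x M inputs (made up)
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1's: N x (M + 1)
# With theta = [b, w_1, ..., w_M], each row of X_aug @ theta computes b + w^T x,
# so the intercept b is learned like any other weight.
```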
Random Guess
1. Pick a random \(\boldsymbol{\theta}\)
2. Evaluate \(J(\boldsymbol{\theta})\)
3. Repeat steps 1 and 2 many times
4. Return the \(\boldsymbol{\theta}\) that gives the smallest \(J(\boldsymbol{\theta})\) (a sketch follows below)
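A minimal sketch of Method #0, assuming an MSE objective and standard-normal random guesses (both choices are illustrative):

```python
import numpy as np

def random_guess(X, y, num_guesses=1000, seed=0):
    """Return the randomly sampled theta with the smallest MSE on (X, y)."""
    rng = np.random.default_rng(seed)
    best_theta, best_J = None, np.inf
    for _ in range(num_guesses):
        theta = rng.normal(size=X.shape[1])    # step 1: pick a random theta
        J = np.mean((y - X @ theta) ** 2)      # step 2: evaluate J(theta)
        if J < best_J:                         # keep the best theta seen so far
            best_theta, best_J = theta, J
    return best_theta
```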
Optimization Method #1: Gradient Descent¶
Gradient Descent
1. Choose an initial point \(\boldsymbol{\theta}\)
2. Repeat \(t = 1, 2, \dots\)
   a. Compute the gradient \(\mathbf{g} = \nabla J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \nabla J^{(i)}(\boldsymbol{\theta})\)
   b. Select a step size \(\delta_t \in \mathbb{R}\)
   c. Update the parameters \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \delta_t \mathbf{g}\)
3. Return the best \(\boldsymbol{\theta}\) when some stopping criterion is reached (a sketch follows below)
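A minimal sketch of these steps for the MSE objective, assuming a fixed step size and a fixed iteration budget in place of a more sophisticated stopping criterion:

```python
import numpy as np

def gradient_descent(X, y, step_size=0.01, num_steps=1000):
    """Minimize J(theta) = mean((y - X theta)^2) by batch gradient descent."""
    N, M = X.shape
    theta = np.zeros(M)                           # initial point
    for _ in range(num_steps):
        g = (2.0 / N) * (X.T @ (X @ theta - y))   # gradient of the MSE at theta
        theta = theta - step_size * g             # update step
    return theta
```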
Support Vector Regression¶
The support vector regression objective is typically written as
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\xi}^*} \ \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N \left( \xi_i + \xi_i^* \right) \]
subject to
\[ y^{(i)} - \mathbf{w}^{\top} \mathbf{x}^{(i)} - b \le \varepsilon + \xi_i, \qquad \mathbf{w}^{\top} \mathbf{x}^{(i)} + b - y^{(i)} \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \]
where \(C\) trades off the model complexity (i.e., the flatness of the model) and the data misfit.
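As a hedged usage sketch (the data, kernel, and hyperparameter values are illustrative), scikit-learn's SVR exposes this trade-off directly through its C parameter:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Larger C -> fit the training data more closely; smaller C -> flatter (simpler) model.
model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y)
y_pred = model.predict(X)
```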
Stochastic Gradient Descent¶
- Instead of computing the gradient over the entire dataset, we compute the gradient over a single example.
- Approximate the true gradient by the gradient of one randomly chosen example.
- In practice, SGD often reduces MSE much more rapidly than GD for the same amount of computation, since each update touches only a single example (a sketch follows below).
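A minimal sketch of SGD for the MSE objective, assuming a fixed step size and uniform sampling of examples (both are illustrative choices):

```python
import numpy as np

def sgd(X, y, step_size=0.01, num_steps=10000, seed=0):
    """SGD for MSE: each update uses the gradient of one randomly chosen example."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(num_steps):
        i = rng.integers(N)                         # pick one example uniformly at random
        g = 2.0 * (X[i] @ theta - y[i]) * X[i]      # gradient of (y_i - theta^T x_i)^2
        theta = theta - step_size * g
    return theta
```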
Logistic Regression¶
Capturing Uncertainty¶
Linear Logistic Regression¶
- Idea: Predict \(\hat{y} = +1\) if \(p(y = +1 \mid \mathbf{x}) > 0.5\), and \(-1\) otherwise (see the sketch below).
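A minimal sketch of this decision rule, assuming the standard logistic model \(p(y = +1 \mid \mathbf{x}) = \sigma(\boldsymbol{\theta}^{\top} \mathbf{x})\) with made-up parameter values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0, 2.0])     # illustrative parameters
x = np.array([1.0, 0.3, 0.8])          # one input vector

p = sigmoid(theta @ x)                 # p(y = +1 | x)
y_hat = +1 if p > 0.5 else -1          # predict +1 when the probability exceeds 0.5
```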
Likelihood Function¶
Given \(N\) independent, identically distributed (i.i.d.) samples \(\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}\) from a discrete random variable \(X\) with probability mass function (pmf) \(p(x \mid \theta)\), the likelihood function is $$ L(\theta) = p(x^{(1)}, x^{(2)}, \dots, x^{(N)} \mid \theta) = \prod_{i=1}^N p(x^{(i)} \mid \theta) $$
Given \(N\) i.i.d. samples \(\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})\}\) from a pair of random variables, where \(X\) is discrete with pmf \(p(x \mid \theta)\) and \(Y\) is discrete with pmf \(p(y \mid x, \theta)\), the likelihood function is $$ L(\theta) = p(y^{(1)}, x^{(1)}, y^{(2)}, x^{(2)}, \dots, y^{(N)}, x^{(N)} \mid \theta) = \prod_{i=1}^N p(y^{(i)}, x^{(i)} \mid \theta) = \prod_{i=1}^N p(y^{(i)} \mid x^{(i)}, \theta)\, p(x^{(i)} \mid \theta) $$
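As a numerical illustration of the first definition (not from the notes), the sketch below evaluates the likelihood and log-likelihood of made-up i.i.d. Bernoulli samples at a candidate \(\theta\):

```python
import numpy as np

samples = np.array([1, 0, 1, 1, 0, 1])     # toy i.i.d. draws of a Bernoulli random variable
theta = 0.6                                # candidate parameter, p(x = 1 | theta) = theta

pmf = np.where(samples == 1, theta, 1.0 - theta)   # p(x^(i) | theta) for each sample
likelihood = np.prod(pmf)                          # L(theta) = prod_i p(x^(i) | theta)
log_likelihood = np.sum(np.log(pmf))               # in practice we maximize the log of L
```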
Logistic Regression¶
- Data: Inputs are continuous vectors of length M. Outputs are discrete.
- Model: Logistic function applied to dot product of parameters with input vector.
- Learning: Finds the parameters that minimize some objective function.
- Prediction: Output is the most probable class (a minimal sketch of this recipe follows below).
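A minimal training sketch tying the four bullets together, assuming labels encoded as 0/1 and plain batch gradient descent on the average log-loss (the step size and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, step_size=0.1, num_steps=1000):
    """Learning: minimize the average log-loss by gradient descent (labels y in {0, 1})."""
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(num_steps):
        p = sigmoid(X @ theta)          # model: p(y = 1 | x) = sigma(theta^T x)
        g = X.T @ (p - y) / N           # gradient of the average log-loss
        theta = theta - step_size * g
    return theta

def predict(theta, X):
    """Prediction: output the most probable class."""
    return (sigmoid(X @ theta) > 0.5).astype(int)
```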
Mini-Batch SGD¶
- Approximate the true gradient by the average gradient of \(K\) randomly chosen examples (see the sketch below).
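A minimal sketch, assuming an MSE objective, a fixed step size, and batches drawn without replacement (all illustrative choices):

```python
import numpy as np

def minibatch_sgd(X, y, K=32, step_size=0.01, num_steps=1000, seed=0):
    """Mini-batch SGD for MSE: average the gradient over K randomly chosen examples."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(num_steps):
        idx = rng.choice(N, size=K, replace=False)     # K random examples (requires K <= N)
        Xb, yb = X[idx], y[idx]
        g = (2.0 / K) * (Xb.T @ (Xb @ theta - yb))     # average gradient over the batch
        theta = theta - step_size * g
    return theta
```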
Bayes Optimal Classifier¶
- Definition: The classifier that minimizes the probability of misclassification.
- Bayes decision rule: \(h^*(\mathbf{x}) = \arg\max_{y} p(y \mid \mathbf{x})\), i.e., predict the most probable class under the true conditional distribution (a toy sketch follows below).
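A toy illustration (the conditional probabilities are made up): given the true \(p(y \mid \mathbf{x})\) for some input, the Bayes optimal classifier simply picks the most probable class, and its error is the leftover probability mass.

```python
import numpy as np

p_y_given_x = np.array([0.2, 0.5, 0.3])   # made-up true p(y | x) over 3 classes for one x

y_star = int(np.argmax(p_y_given_x))      # Bayes decision rule: argmax_y p(y | x)
error = 1.0 - p_y_given_x[y_star]         # minimal achievable probability of misclassification
```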
Comparison Between Different Losses¶
| Aspect | 0/1 Loss | Asymmetric Loss | Log-Loss (Cross-Entropy) |
| --- | --- | --- | --- |
| Purpose | Measures classification accuracy | Adjusts decision boundary penalties | Optimizes logistic regression |
| Mathematically differentiable? | No | No | Yes |
| Used for | Evaluation | Adjusting the decision boundary | Optimization |
| Used to train logistic regression? | No | No | Yes |
| How to incorporate into logistic regression | Adjust the decision threshold \(\alpha\) | Use a weighted log-loss | Directly used as the training objective |
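A small sketch contrasting the first and last columns (the labels and predicted probabilities are made up): the 0/1 loss only counts mistakes after thresholding, while the log-loss penalizes confident wrong probabilities and is differentiable.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])              # labels in {0, 1}
p_hat = np.array([0.9, 0.4, 0.2, 0.7])       # predicted p(y = 1 | x)
y_hat = (p_hat > 0.5).astype(int)            # threshold at alpha = 0.5

zero_one_loss = np.mean(y_hat != y_true)     # evaluation: fraction misclassified
log_loss = -np.mean(y_true * np.log(p_hat)   # optimization: cross-entropy
                    + (1 - y_true) * np.log(1 - p_hat))
```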