Linear Regression: Fitting a Line to Data
Linear regression finds the best-fit line through data points by minimising the sum of squared residuals (least squares method).
Slope and Intercept Formulas
m = [nΣ(xy) - ΣxΣy] / [nΣ(x²) - (Σx)²]
b = (Σy - mΣx) / n
Line: ŷ = mx + b
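The sum-based formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the function name `fit_line` is an illustrative choice, not part of any library.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # The four sums that appear in the slope and intercept formulas.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - m * sum_x) / n
    return m, b


# Points that lie exactly on y = 2x + 1, so the fit should recover that line.
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # 2.0 1.0
```

The guard on `denom` covers the case where every x is the same, in which the line is vertical and no slope exists.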
Worked Example
Data: (1,2),(2,4),(3,5),(4,4),(5,6)
n=5, Σx=15, Σy=21, Σxy=71, Σx²=55
m = (5×71 - 15×21) / (5×55 - 225)
= (355-315) / (275-225) = 40/50 = 0.8
b = (21 - 0.8×15)/5 = (21-12)/5 = 1.8
Line: ŷ = 0.8x + 1.8
R² (Coefficient of Determination)
R² = 1 - SS_res / SS_tot
SS_res = Σ(yᵢ - ŷᵢ)² (sum of squared residuals)
SS_tot = Σ(yᵢ - ȳ)² (total variation)
R² = 0: line explains nothing
R² = 1: perfect fit
R² = 0.8: line explains 80% of variance
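The two boundary cases can be checked numerically. This is a small sketch of the R² definition above; `r_squared` is a hypothetical helper name, and the predictions are passed in so it works with any fitted line.

```python
def r_squared(ys, y_hats):
    """R² = 1 - SS_res / SS_tot for observed ys and predicted y_hats."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all y values are identical; R² is undefined")
    return 1 - ss_res / ss_tot


ys = [2, 4, 5, 4, 6]
mean_y = sum(ys) / len(ys)

# Predictions equal to the observations: no residual variation remains.
print(r_squared(ys, ys))                      # 1.0
# Always predicting the mean of y: the model explains nothing.
print(r_squared(ys, [mean_y] * len(ys)))      # 0.0
```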
Linear Regression Quick-Reference Table
| R² value | Interpretation | Typical context |
|---|---|---|
| 0.00–0.19 | Very weak fit | Social sciences, noisy data |
| 0.20–0.49 | Weak to moderate fit | Economic forecasting |
| 0.50–0.74 | Moderate fit | Many business models |
| 0.75–0.89 | Good fit | Controlled experiments |
| 0.90–0.99 | Strong fit | Physical sciences, engineering |
| 1.00 | Perfect fit (suspect data) | Overfitting or duplicate data |
How Linear Regression Works
Simple linear regression fits the line ŷ = b₀ + b₁x that minimises the sum of squared residuals (ordinary least squares, OLS). This is the same line as ŷ = mx + b, with b₁ = m and b₀ = b. The slope is b₁ = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / Σ[(xᵢ − x̄)²] and the intercept is b₀ = ȳ − b₁x̄. R² (coefficient of determination) measures the proportion of the variance in y explained by x: R² = 1 − SS_res/SS_tot.
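The deviation form of the slope given here and the sum form under Slope and Intercept Formulas are the same quantity. A short derivation, using x̄ = Σx/n and ȳ = Σy/n, makes the link explicit:

```latex
\begin{aligned}
\sum_i (x_i-\bar{x})(y_i-\bar{y})
  &= \sum_i x_i y_i - n\,\bar{x}\,\bar{y}
   = \frac{n\sum xy - \sum x \sum y}{n},\\[4pt]
\sum_i (x_i-\bar{x})^2
  &= \sum_i x_i^2 - n\,\bar{x}^2
   = \frac{n\sum x^2 - \left(\sum x\right)^2}{n},\\[4pt]
b_1 &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = m,\\[4pt]
b_0 &= \bar{y} - b_1\bar{x} = \frac{\sum y - m\sum x}{n} = b.
\end{aligned}
```

The factor 1/n cancels between the numerator and the denominator, which is why the sum form carries the extra n terms.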
Linear regression assumptions: linearity (correct functional form), independence of errors, homoscedasticity (constant variance), and normality of residuals for inference. Violations — such as heteroscedasticity (variance increasing with x) or autocorrelation (time-series data) — require corrections like weighted least squares or robust standard errors.
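As an illustration of one correction named above, here is a minimal weighted least squares sketch for a single predictor. It assumes the caller supplies positive weights, commonly taken proportional to 1/variance of each observation; the function name `fit_line_weighted` is hypothetical.

```python
def fit_line_weighted(xs, ys, ws):
    """Return (b0, b1) minimising sum(w * (y - b0 - b1*x)**2)."""
    if not (len(xs) == len(ys) == len(ws)) or len(xs) < 2:
        raise ValueError("need at least two points with matching weights")
    if any(w <= 0 for w in ws):
        raise ValueError("weights must be positive")

    total_w = sum(ws)
    # Weighted means replace the ordinary means used in OLS.
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w

    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Points exactly on y = 1 + 2x: any positive weights recover the same line.
b0, b1 = fit_line_weighted([1, 2, 3, 4], [3, 5, 7, 9], [1.0, 0.5, 0.25, 0.125])
print(round(b0, 6), round(b1, 6))
```

With all weights equal, this reduces to the ordinary least squares formulas. Choosing the weights well requires a model of how the variance changes with x, which this sketch does not attempt.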
Common Mistakes
- Confusing correlation with causation: A high R² shows that x predicts y in your sample, not that x causes y. Ice cream sales and drowning rates are correlated with each other because both rise with temperature; neither causes the other.
- Extrapolating beyond the data range: A regression line fit to data over x ∈ [10, 50] may predict poorly outside that range. The relationship may be non-linear beyond your observed region.
- Ignoring outliers and influential points: A single outlier with high leverage can dramatically change the slope. Always plot your data and check residuals before trusting regression coefficients.
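The effect of a single high-leverage point is easy to demonstrate. The sketch below uses made-up data (five points on y = x plus one invented outlier far to the right); `ols_slope` is an illustrative helper.

```python
def ols_slope(xs, ys):
    """Least-squares slope for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Five points lying exactly on y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
slope_clean = ols_slope(xs, ys)

# Add one point far from the others in x (high leverage) and off the line.
slope_with_outlier = ols_slope(xs + [20], ys + [0])

print(slope_clean)                    # 1.0
print(round(slope_with_outlier, 2))   # about -0.13: the sign flips
```

One point out of six is enough to reverse the apparent direction of the relationship, which is why plotting the data first matters.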
Frequently Asked Questions
What is multiple regression?
Multiple regression fits ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ, allowing prediction from several independent variables simultaneously. When comparing models with different numbers of predictors, use adjusted R², which penalises predictors that add little explanatory power, rather than R². Multicollinearity (predictors that are highly correlated with each other) inflates standard errors and makes individual coefficients unstable.
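A compact way to see how the coefficients are obtained is to solve the normal equations (XᵀX)b = Xᵀy directly. The sketch below does this in plain Python with Gaussian elimination; all function names are illustrative, and the adjusted R² formula used is the standard 1 − (1 − R²)(n − 1)/(n − k − 1). Production code would normally rely on a numerical library instead.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bk] for predictors given as rows of (x1, ..., xk)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 is the intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def adjusted_r_squared(r2, n, k):
    """Adjusted R² for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
coeffs = fit_multiple(rows, ys)
print([round(c, 6) for c in coeffs])  # close to [1, 2, 3]
```

The "singular system" error is what perfect multicollinearity looks like in practice: if one predictor is an exact linear combination of the others, XᵀX cannot be inverted.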
How do I test whether the slope is statistically significant?
Calculate the t-statistic t = b₁ / SE(b₁), with df = n − 2. If |t| exceeds the two-sided critical value at α = 0.05 (about 2.0 for n > 30, approaching 1.96 as n grows, and larger for small samples), the slope is statistically significant: data like these would be unlikely if the population slope were zero. The p-value for the slope is the probability of observing a t-statistic at least this extreme if the true slope were zero.
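The test can be sketched for the five data points from the Worked Example. The standard error used here is the usual simple-regression formula SE(b₁) = √[(SS_res/(n − 2)) / Σ(xᵢ − x̄)²], which the text above does not spell out. The critical value 3.182 (two-sided, α = 0.05, df = 3) is taken from a standard t-table and hard-coded as an assumption; `slope_t_statistic` is a hypothetical helper.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (t, df) for testing whether the regression slope is zero."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points so that df = n - 2 > 0")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # Residual variance estimate uses n - 2 degrees of freedom.
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    if ss_res == 0:
        raise ValueError("perfect fit: the standard error is zero")
    se_b1 = math.sqrt(ss_res / (n - 2) / sxx)
    return b1 / se_b1, n - 2


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
t, df = slope_t_statistic(xs, ys)
print(round(t, 3), df)  # 2.828 3

T_CRITICAL_DF3 = 3.182  # assumed table value: two-sided, alpha = 0.05, df = 3
print(abs(t) > T_CRITICAL_DF3)  # False: not significant at the 5% level
```

With only five points the critical value is far above 1.96, so a slope that looks substantial can still fail the test.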
When should I use correlation instead of regression?
Use correlation (Pearson r) to measure the strength and direction of a linear association when you do not need predictions. Use regression when you want to predict y from x or estimate the magnitude of the x–y relationship (the slope, in interpretable units). For simple regression, Pearson r = ±√R², and the sign matches the sign of the slope b₁.
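The identity r = ±√R² can be verified numerically. This sketch computes Pearson r and the R² of a simple least-squares fit independently, using the five data points from the Worked Example; both function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when x or y is constant")
    return sxy / math.sqrt(sxx * syy)


def simple_r_squared(xs, ys):
    """R² of the least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
r = pearson_r(xs, ys)
r2 = simple_r_squared(xs, ys)
print(round(r, 3), round(r * r, 3), round(r2, 3))  # 0.853 0.727 0.727

# Flipping the sign of y flips the slope, and r follows it.
print(round(pearson_r(xs, [-y for y in ys]), 3))   # -0.853
```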