Basic Statistics · Advanced · 10 min read

Simple Linear Regression: Predicting Values with a Line

Linear regression uses a line to predict values. Learn the equation ŷ = a + bx and what the slope and R² mean, with practical examples.

Renato Freitas

Updated on May 5, 2026

What is linear regression?

Simple linear regression is a statistical technique for modeling the relationship between a dependent variable (Y, what we want to predict) and an independent variable (X, what we use to predict). The idea is to find the line that best fits the data by minimizing prediction errors.

The classic example is predicting the price of a house (Y) based on its area in square meters (X). Intuitively, we expect larger houses to cost more — regression quantifies that relationship precisely and allows predictions for new houses.

The regression line equation

The regression line is described by the equation ŷ = a + bx, where ŷ is the predicted value of Y for a given X, b is the slope coefficient and a is the intercept (where the line crosses the Y-axis when X = 0).

The slope b indicates how much Y changes on average for each unit increase in X. If a house price regression gives b = 3,500, this means each additional square meter is associated with an average increase of $3,500 in price.

The intercept a is the predicted value of Y when X = 0. In many practical contexts, this value has no direct interpretation — a house with 0 m² does not exist. But mathematically it is necessary to complete the line equation.

  • ŷ: predicted value (not the actual value)
  • b (slope): change in Y per unit of X
  • a (intercept): value of Y when X = 0
  • Residual: difference between the actual and predicted value (y − ŷ)
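The pieces above can be sketched in a few lines of Python. The coefficients a = 50,000 and b = 3,500 here are illustrative, not the result of a real fit:

```python
# Hypothetical fitted line for house prices (illustrative coefficients)
a, b = 50_000, 3_500  # intercept and slope ($ per m²)

def predict(x):
    """Return the predicted price ŷ = a + b·x for a house of x square meters."""
    return a + b * x

y_hat = predict(100)        # predicted price for a 100 m² house → 400,000
residual = 410_000 - y_hat  # actual minus predicted (y − ŷ) → 10,000
```

A positive residual means the line underpredicted that house; a negative one means it overpredicted.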

The coefficient of determination R²

R² (the coefficient of determination) measures the proportion of the variation in Y that is explained by the regression model. It ranges from 0 to 1 (or 0% to 100%). An R² of 0.85 means that 85% of the variation in house prices is explained by the variation in area, while the remaining 15% is due to other factors not included in the model.

A high R² does not by itself mean the model is good: what counts as "high" depends on the field. In social sciences, an R² of 0.4 may already be considered robust; in experimental physics, R² above 0.99 is expected. R² also does not indicate whether the relationship is causal — only how well the line fits the data.

Note that in simple regression, R² is exactly the square of Pearson's correlation coefficient r. If the correlation between area and price is r = 0.92, then R² = 0.92² ≈ 0.85 — the same 85% mentioned above.

Assumptions and limitations

Linear regression assumes that the relationship between X and Y is approximately linear (check with a scatter plot), that residuals are approximately normal with constant variance (homoscedasticity), and that observations are independent of each other.

The ordinary least squares method — which minimizes the sum of squared residuals — is the most widely used to estimate a and b. It is sensitive to outliers: a single point far from the general trend can pull the line significantly. In practical analyses, always check for influential points.
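The closed-form least-squares estimates for one predictor are b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and a = ȳ − b·x̄. A minimal sketch, checked against perfectly linear toy data:

```python
def ols_fit(x, y):
    """Ordinary least squares for one predictor:
    b = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)²,  a = ȳ − b·x̄."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Toy data generated from y = 10 + 2x, so OLS should recover a = 10, b = 2
a, b = ols_fit([1, 2, 3, 4], [12, 14, 16, 18])
```

Because the estimates minimize *squared* residuals, a single extreme point contributes its error squared — which is exactly why outliers can pull the line so strongly.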

Extrapolation — using the model to predict Y with X values far beyond the range of the data — is risky. The line may behave in a completely different way outside the observed limits. The model for temperature and ice cream sales, for example, cannot reliably predict sales at −30°C.

Frequently asked questions

What is the difference between linear regression and correlation?

Correlation measures the strength and direction of the relationship between two variables — it is a single symmetric measure (r of X with Y equals r of Y with X). Regression creates a predictive model with a defined direction: X predicts Y. Regression allows quantitative predictions; correlation only describes the association.

How do I know if my data meet the regression assumptions?

Analyze the residuals. Plot residuals (y − ŷ) against fitted values: a random pattern around zero indicates the assumptions are satisfied. A funnel pattern suggests heteroscedasticity; a curve suggests non-linearity. Q-Q plots of residuals check for normality.
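A minimal residual check in Python (toy data, nearly linear; in practice you would plot `residuals` against `fitted` rather than just compute them):

```python
# Fit by closed-form least squares, then inspect the residuals
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xm, ym = sum(x) / n, sum(y) / n
b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sum((xi - xm) ** 2 for xi in x)
a = ym - b * xm

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
# OLS residuals always sum to (numerically) zero; what matters for
# diagnostics is whether they show a funnel or curve against `fitted`.
```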

Can I use regression with categorical variables as X?

Yes, using dummy variables (0 or 1). For example, to include 'has a garage' (yes/no) in a price prediction, encode yes = 1 and no = 0. This is called regression with dummy variables and is widely used in econometrics and social sciences.
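The encoding itself is a one-liner. A sketch with hypothetical (area, garage) rows:

```python
# Hypothetical rows: (area_m2, has_garage) — encode yes/no as 1/0
houses = [(80, "yes"), (95, "no"), (120, "yes")]
encoded = [(area, 1 if garage == "yes" else 0) for area, garage in houses]
# The dummy column can now enter the regression like any numeric predictor
```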

What is multiple regression and when should I use it?

Multiple regression uses two or more independent variables to predict Y. Instead of ŷ = a + bx, we have ŷ = a + b₁x₁ + b₂x₂ + ... Use it when multiple factors simultaneously influence Y — such as predicting house prices considering area, number of bedrooms and neighborhood together.
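With more than one predictor, the usual approach is a design matrix with a column of ones for the intercept, solved by least squares. A sketch with NumPy, using noiseless toy data generated from ŷ = 5 + 2·x₁ + 3·x₂ so the known coefficients should be recovered:

```python
import numpy as np

# Toy data from y = 5 + 2·x1 + 3·x2 (no noise)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 5 + 2 * x1 + 3 * x2

# Design matrix: [1, x1, x2] per row; the leading 1 carries the intercept a
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)  # → [a, b1, b2]
```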

Are predictions from regression exact?

No — they are mean estimates. The regression line predicts the average value of Y for a given X. Individual values vary around that prediction (the residuals). Prediction intervals (wider than confidence intervals for the mean) capture where new individual values will likely fall.
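A rough sketch of a prediction interval, using the textbook half-width s·√(1 + 1/n + (x₀ − x̄)²/Sxx) but with a normal quantile in place of Student's t — a large-sample simplification, not the exact small-sample interval:

```python
import math
import statistics

def prediction_interval(x, y, x0, conf=0.95):
    """Approximate prediction interval for a new observation at x0.
    Uses a normal quantile instead of Student's t (large-sample sketch)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    # Residual standard error: s = sqrt(Σ(y − ŷ)² / (n − 2))
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    z = statistics.NormalDist().inv_cdf(0.5 + conf / 2)
    half = z * s * math.sqrt(1 + 1 / n + (x0 - xm) ** 2 / sxx)
    y0 = a + b * x0
    return y0 - half, y0 + half

lo, hi = prediction_interval([1, 2, 3, 4, 5, 6],
                             [3.1, 5.0, 7.2, 8.9, 11.1, 13.0], x0=3.5)
```

Note the `1 +` term inside the square root: it is what makes prediction intervals wider than confidence intervals for the mean, since a new individual observation carries its own residual noise.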
