Supervised machine learning is a type of machine learning where the model is trained on a labeled dataset. This means that for each input data point, there is a corresponding output label. The goal of supervised learning is to learn a mapping from inputs to outputs, so that the model can make predictions on new, unseen data.
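Concretely, “labeled” means every input comes paired with the answer we want the model to learn. A minimal sketch in Python, reusing two rows from the dataset introduced below:

```python
# Labeled training data: each input (features) is paired with a known output (label).
# These two examples are rows from the house-price dataset shown below.
labeled_examples = [
    # ([size_sqft, bedrooms, location, age], price_in_dollars)
    ([1500, 3, 1, 10], 300_000),
    ([2500, 4, 2, 5], 500_000),
]
# A supervised model learns a mapping f such that f(features) ≈ label,
# and can then be applied to houses it has never seen.
```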
Scenario: Predicting House Prices
Let’s consider a real-world scenario: predicting house prices based on features like the size of the house, number of bedrooms, location, and age of the house. This is a common problem in the real estate industry, where you want to estimate the price of a house based on historical data.
Sample Data
Here’s a simple dataset with features like house size, number of bedrooms, location (encoded as a number), age, and the price of the house:
| House Size (sq ft) | Bedrooms | Location (encoded) | Age (years) | Price ($) |
|---|---|---|---|---|
| 1500 | 3 | 1 | 10 | 300,000 |
| 2500 | 4 | 2 | 5 | 500,000 |
| 1800 | 3 | 2 | 20 | 280,000 |
| 2200 | 4 | 3 | 15 | 450,000 |
| 1300 | 2 | 1 | 25 | 200,000 |
Approach
We’ll use a supervised learning algorithm to predict the house price from the other features. Linear Regression is a natural choice here: it is simple yet effective for this type of problem, and it models the price as a weighted sum of the features plus an intercept, as sketched below.
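To make that concrete, here is a minimal conceptual sketch of what a linear regression model computes. The weights and intercept below are placeholders for illustration, not fitted values:

```python
# Conceptual sketch of a linear regression prediction.
# The weights and intercept here are placeholders, NOT fitted values.
def predicted_price(size_sqft, bedrooms, location, age,
                    weights=(100.0, 5_000.0, 2_000.0, -1_000.0),
                    intercept=50_000.0):
    w_size, w_bed, w_loc, w_age = weights
    return (intercept
            + w_size * size_sqft
            + w_bed * bedrooms
            + w_loc * location
            + w_age * age)

# "Training" means finding the weights and intercept that minimize the error
# between predicted and actual prices on the labeled examples.
```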
Python Code Example
```python
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# Sample data
data = {
'House Size (sq ft)': [1500, 2500, 1800, 2200, 1300],
'Bedrooms': [3, 4, 3, 4, 2],
'Location (encoded)': [1, 2, 2, 3, 1],
'Age (years)': [10, 5, 20, 15, 25],
'Price ($)': [300000, 500000, 280000, 450000, 200000]
}
# Convert data into a DataFrame
df = pd.DataFrame(data)
# Define features (X) and target (y)
X = df[['House Size (sq ft)', 'Bedrooms', 'Location (encoded)', 'Age (years)']]
y = df['Price ($)']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
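# Note: with only five rows, a 20% test split leaves just one example in the
# test set, so the evaluation below is illustrative rather than meaningful.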
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Evaluate the model using Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Display the model's coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
Explanation
- Data Preparation: We first create a dataset containing features like house size, number of bedrooms, location, age, and the corresponding price.
- Feature and Target Separation: The features (`X`) are the input variables we use to predict the output, and the target (`y`) is the variable we want to predict (in this case, the house price).
- Train-Test Split: We split the dataset into a training set (80%) and a testing set (20%). The training set is used to train the model, and the testing set is used to evaluate its performance.
- Model Training: We use the `LinearRegression` model from `sklearn` and fit it on the training data.
- Model Prediction and Evaluation: After training, we make predictions on the test set and evaluate the model’s performance using the Mean Absolute Error (MAE), which measures the average absolute difference between the predicted and actual prices.
- Model Coefficients: Finally, we display the coefficients and intercept of the model, which help us understand how each feature contributes to the prediction (see the sketch after this list for pairing each coefficient with its feature name).
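The raw `coef_` array is easier to interpret when each value is paired with its feature name. A minimal sketch, assuming the `model` and `X` objects from the script above:

```python
# Pair each fitted coefficient with the feature it belongs to
# (assumes `model` and `X` from the training script above).
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:,.2f}")
print(f"Intercept: {model.intercept_:,.2f}")
```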
Real-World Application
In a real-world project, a similar approach can be used in various domains:
- Real Estate: Predicting house prices based on features like location, size, and amenities.
- Finance: Predicting stock prices based on historical data and economic indicators.
- Healthcare: Predicting patient outcomes based on medical history and treatment data.
- Retail: Forecasting sales based on historical data, promotions, and seasonal trends.
Supervised learning models like linear regression are powerful tools for making predictions when you have labeled data, and they can be adapted to various applications across different industries.