Introduction
The Health Insurance Cross Sell Prediction project is an excellent opportunity for data science learners, machine learning students, data analysts, and AI enthusiasts to gain hands-on experience. In this project, you will learn how to predict whether a customer is likely to purchase vehicle insurance based on their personal and policy-related information.
This project is particularly useful for insurance companies aiming to identify potential leads and optimize marketing strategies. We’ll be using Python and popular ML libraries to accomplish this task.
Project Objective
The primary goal of this project is to build a machine learning model that predicts whether an existing customer is interested in buying vehicle insurance.
Why This Matters:
- Higher Conversion Rates
- Cost-effective Marketing
- Enhanced Customer Targeting
Dataset Link: Kaggle Health Insurance Cross Sell Prediction
Understanding the Dataset
The dataset consists of several key columns:
- id: Unique identifier
- Gender: Male/Female
- Age: Customer’s age
- Driving_License: 0 = No, 1 = Yes
- Region_Code: Categorical variable
- Previously_Insured: 0 = No, 1 = Yes
- Vehicle_Age: < 1 Year / 1-2 Year / > 2 Years
- Vehicle_Damage: Yes/No
- Annual_Premium: Premium paid by customer
- Policy_Sales_Channel: Categorical variable
- Vintage: Number of days the customer has been associated with the company
- Response: Target variable (1 = Interested, 0 = Not Interested)
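Before any preprocessing, it is worth confirming that the columns and category labels in your copy of the file match this description. A quick inspection sketch (assuming the Kaggle train.csv is in the working directory; the actual load for the project happens in the next section):

import pandas as pd

preview = pd.read_csv('train.csv')
preview.info()                           # column dtypes and non-null counts
print(preview['Vehicle_Age'].unique())   # expect: '< 1 Year', '1-2 Year', '> 2 Years'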
Data Preprocessing
Code Section: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Load Data
data = pd.read_csv('train.csv')
data.head()
Handle Missing Values
# Count missing values in each column
print(data.isnull().sum())
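This dataset typically reports no missing values, but if your copy did, a simple imputation step could be slotted in here. A hypothetical sketch, only needed when the counts above are non-zero:

# Hypothetical: fill numeric gaps with the median, categorical gaps with the mode
data['Annual_Premium'] = data['Annual_Premium'].fillna(data['Annual_Premium'].median())
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])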
Encode Categorical Variables
le = LabelEncoder()
data['Gender'] = le.fit_transform(data['Gender'])                # Female = 0, Male = 1
# Note: LabelEncoder assigns codes alphabetically, so '1-2 Year' = 0,
# '< 1 Year' = 1, '> 2 Years' = 2; an explicit ordinal mapping is an alternative
data['Vehicle_Age'] = le.fit_transform(data['Vehicle_Age'])
data['Vehicle_Damage'] = le.fit_transform(data['Vehicle_Damage'])  # No = 0, Yes = 1
Feature Scaling
scaler = StandardScaler()
# Fit one scaler on both columns so the same object can be reused at prediction time.
# Strictly, the scaler should be fit on the training split only to avoid leakage;
# we fit on the full dataset here for simplicity.
data[['Annual_Premium', 'Vintage']] = scaler.fit_transform(data[['Annual_Premium', 'Vintage']])
Exploratory Data Analysis (EDA)
Visualize Target Variable
sns.countplot(x='Response', data=data)
plt.title('Target Variable Distribution')
plt.show()
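The plot makes the class imbalance obvious: only a small minority of customers (roughly 12% in the Kaggle training set) are interested. Quantifying it takes one line:

# Share of each class in the target; expect a heavy skew toward 0
print(data['Response'].value_counts(normalize=True))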
Correlation Heatmap
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()
Feature Engineering
Feature Selection
We will remove the id column since it does not contribute to prediction.
X = data.drop(['id', 'Response'], axis=1)
y = data['Response']
Model Selection
We’ll compare the performance of multiple models:
- Logistic Regression
- Random Forest
- XGBoost
Model Training
Split the Data
# stratify=y keeps the same class ratio in both splits, which matters for this imbalanced target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Logistic Regression
from sklearn.linear_model import LogisticRegression

# max_iter raised from the default 100 so the solver converges on this dataset
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
XGBoost
from xgboost import XGBClassifier

# use_label_encoder is deprecated (and removed in xgboost >= 2.0); drop it on newer versions
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
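Because the target is so imbalanced, it can also be worth trying XGBoost's scale_pos_weight parameter, which up-weights the positive class during training. A sketch (xgb_weighted is a new variable, not part of the original pipeline):

# Optional: compensate for class imbalance with the negative-to-positive ratio
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_weighted = XGBClassifier(eval_metric='logloss', scale_pos_weight=ratio)
xgb_weighted.fit(X_train, y_train)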
Model Evaluation
Evaluation Metrics Function
def evaluate_model(y_test, y_pred):
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Classification Report:\n', classification_report(y_test, y_pred))
Evaluate All Models
print("\nLogistic Regression:") evaluate_model(y_test, y_pred_lr) print("\nRandom Forest:") evaluate_model(y_test, y_pred_rf) print("\nXGBoost:") evaluate_model(y_test, y_pred_xgb)
Hyperparameter Tuning
You can improve model performance using GridSearchCV.
Example for Random Forest
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=3,
                           n_jobs=-1,
                           verbose=2)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
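GridSearchCV refits the best configuration on the full training split and exposes it as best_estimator_, so the tuned model can be evaluated directly with the same helper as before:

# Evaluate the refit, tuned Random Forest on the held-out test set
best_rf = grid_search.best_estimator_
evaluate_model(y_test, best_rf.predict(X_test))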
Save the Model
After selecting your best-performing model (e.g., XGBoost), you can save it for deployment.
Save Model Using Joblib
import joblib

# Save the model
joblib.dump(xgb, 'xgb_cross_sell_model.pkl')
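Because Annual_Premium and Vintage were standardized before training, the deployed app must transform raw inputs the same way. Saving the fitted scaler alongside the model makes that possible; the filename here is our own choice:

# Persist the fitted scaler so the app can transform raw inputs identically
joblib.dump(scaler, 'premium_vintage_scaler.pkl')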
Deploy the Model Using Streamlit
Now, let’s deploy the model with Streamlit.
Install Streamlit
pip install streamlit
Create Streamlit App
Create a new Python file, app.py:
import streamlit as st
import joblib
import numpy as np

# Load the trained model and the scaler saved alongside it
model = joblib.load('xgb_cross_sell_model.pkl')
scaler = joblib.load('premium_vintage_scaler.pkl')

st.title('Health Insurance Cross-Sell Prediction')

# Input fields (the order must match the training columns:
# Gender, Age, Driving_License, Region_Code, Previously_Insured,
# Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, Vintage)
gender = st.radio('Gender (0 = Female, 1 = Male)', [0, 1])
age = st.slider('Age', 18, 100, 30)
driving_license = st.radio('Driving License', [0, 1])
region_code = st.number_input('Region Code', min_value=0.0, max_value=50.0, value=28.0)
previously_insured = st.radio('Previously Insured', [0, 1])
vehicle_age = st.selectbox('Vehicle Age (0 = 1-2 Year, 1 = < 1 Year, 2 = > 2 Years)', [0, 1, 2])
vehicle_damage = st.radio('Vehicle Damage', [0, 1])
annual_premium = st.number_input('Annual Premium', min_value=0.0, max_value=100000.0, value=30000.0)
policy_sales_channel = st.number_input('Policy Sales Channel', min_value=1.0, max_value=200.0, value=26.0)
vintage = st.slider('Vintage', 0, 300, 150)

# Predict button
if st.button('Predict'):
    # Scale Annual_Premium and Vintage exactly as during training
    premium_scaled, vintage_scaled = scaler.transform(np.array([[annual_premium, vintage]]))[0]
    input_data = np.array([[gender, age, driving_license, region_code,
                            previously_insured, vehicle_age, vehicle_damage,
                            premium_scaled, policy_sales_channel, vintage_scaled]])
    prediction = model.predict(input_data)
    if prediction[0] == 1:
        st.success('The customer is likely to buy vehicle insurance.')
    else:
        st.info('The customer is not likely to buy vehicle insurance.')
Run the Streamlit App
streamlit run app.py
This will open a web interface where you can input values and see the model’s prediction.
Conclusion
In this Health Insurance Cross Sell Prediction project, we successfully:
- Loaded and preprocessed the dataset.
- Performed exploratory data analysis.
- Engineered features and selected models.
- Trained and evaluated Logistic Regression, Random Forest, and XGBoost models.
- Improved model performance using hyperparameter tuning.
Key Takeaways
- Data Preprocessing is crucial for high model performance.
- Random Forest and XGBoost often outperform simpler models.
- Hyperparameter tuning can significantly enhance results.
If you found this tutorial helpful, consider exploring more advanced topics like model interpretability with SHAP or an alternative deployment with Flask. Stay tuned for more ML project tutorials!
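As a pointer toward the SHAP direction mentioned above, here is a minimal sketch, assuming the shap package is installed (pip install shap); TreeExplainer works with tree models such as the XGBoost classifier trained here:

import shap

# Explain the XGBoost model's predictions on the test set
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)

# The summary plot ranks features by their overall impact on the prediction
shap.summary_plot(shap_values, X_test)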
Additional Machine Learning Project Ideas
Here are some additional machine learning projects you can explore to further develop your skills:
- Personal Injury Case Outcome Prediction: Predict the outcome of personal injury legal cases based on historical data.
- Exploratory Data Analysis on the Titanic Dataset: Perform comprehensive EDA on the Titanic dataset and build predictive models.
- Big Mart Sales Prediction Project (2025): Forecast sales for different Big Mart outlets using regression techniques.
- Potato Leaf Disease Detection Project: Detect and classify diseases in potato leaves using image classification with deep learning.
- Hand Gesture Recognition Project Using Deep Learning: Recognize hand gestures from video or image data using CNNs and RNNs.
- Car Accident Attorney – Case Viability Prediction: Estimate whether auto accident cases will be profitable for attorneys.
- Credit Card Fraud Detection: Use anomaly detection methods to find fraudulent credit card transactions.
- Modeling Insurance Claim Severity: Predict the severity of insurance claims based on customer and incident data.