Introduction
The Health Insurance Cross Sell Prediction project is an excellent opportunity for data science learners, machine learning students, data analysts, and AI enthusiasts to gain hands-on experience. In this project, you will learn how to predict whether a customer is likely to purchase vehicle insurance based on their personal and policy-related information.
This project is particularly useful for insurance companies aiming to identify potential leads and optimize marketing strategies. We’ll be using Python and popular ML libraries to accomplish this task.
Project Objective
The primary goal of this project is to build a machine learning model that predicts whether an existing customer is interested in buying vehicle insurance.
Why This Matters:
- Higher Conversion Rates
- Cost-effective Marketing
- Enhanced Customer Targeting
Dataset Link: Kaggle Health Insurance Cross Sell Prediction
Understanding the Dataset
The dataset consists of several key columns:
- id: Unique identifier
- Gender: Male/Female
- Age: Customer’s age
- Driving_License: 0 = No, 1 = Yes
- Region_Code: Categorical variable
- Previously_Insured: 0 = No, 1 = Yes
- Vehicle_Age: < 1 Year / 1-2 Year / > 2 Years
- Vehicle_Damage: Yes/No
- Annual_Premium: Premium paid by customer
- Policy_Sales_Channel: Categorical variable
- Vintage: Number of days the customer has been associated with the company
- Response: Target variable (1 = Interested, 0 = Not Interested)
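Before any preprocessing, it is worth confirming that the columns and category labels in your copy of the file match this description. A quick inspection sketch (assuming the Kaggle train.csv is in the working directory; the actual load for the project happens in the next section):

import pandas as pd

preview = pd.read_csv('train.csv')
preview.info()                           # column dtypes and non-null counts
print(preview['Vehicle_Age'].unique())   # expect: '< 1 Year', '1-2 Year', '> 2 Years'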
Data Preprocessing
Code Section: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Load Data
data = pd.read_csv('train.csv')
data.head()
Handle Missing Values
# Count missing values in each column
print(data.isnull().sum())
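This dataset typically reports no missing values, but if your copy did, a simple imputation step could be slotted in here. A hypothetical sketch, only needed when the counts above are non-zero:

# Hypothetical: fill numeric gaps with the median, categorical gaps with the mode
data['Annual_Premium'] = data['Annual_Premium'].fillna(data['Annual_Premium'].median())
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])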
Encode Categorical Variables
le = LabelEncoder()
data['Gender'] = le.fit_transform(data['Gender'])                # Female = 0, Male = 1
# Note: LabelEncoder assigns codes alphabetically, so '1-2 Year' = 0,
# '< 1 Year' = 1, '> 2 Years' = 2; an explicit ordinal mapping is an alternative
data['Vehicle_Age'] = le.fit_transform(data['Vehicle_Age'])
data['Vehicle_Damage'] = le.fit_transform(data['Vehicle_Damage'])  # No = 0, Yes = 1
Feature Scaling
scaler = StandardScaler()
# Fit one scaler on both columns so the same object can be reused at prediction time.
# Strictly, the scaler should be fit on the training split only to avoid leakage;
# we fit on the full dataset here for simplicity.
data[['Annual_Premium', 'Vintage']] = scaler.fit_transform(data[['Annual_Premium', 'Vintage']])
Exploratory Data Analysis (EDA)
Visualize Target Variable
sns.countplot(x='Response', data=data)
plt.title('Target Variable Distribution')
plt.show()
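The plot makes the class imbalance obvious: only a small minority of customers (roughly 12% in the Kaggle training set) are interested. Quantifying it takes one line:

# Share of each class in the target; expect a heavy skew toward 0
print(data['Response'].value_counts(normalize=True))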
Correlation Heatmap
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()
Feature Engineering
Feature Selection
We will remove the id column since it does not contribute to prediction.
X = data.drop(['id', 'Response'], axis=1)
y = data['Response']
Model Selection
We’ll compare the performance of multiple models:
- Logistic Regression
- Random Forest
- XGBoost
Model Training
Split the Data
# stratify=y keeps the same class ratio in both splits, which matters for this imbalanced target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Logistic Regression
from sklearn.linear_model import LogisticRegression

# max_iter raised from the default 100 so the solver converges on this dataset
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
XGBoost
from xgboost import XGBClassifier

# use_label_encoder is deprecated (and removed in xgboost >= 2.0); drop it on newer versions
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
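Because the target is so imbalanced, it can also be worth trying XGBoost's scale_pos_weight parameter, which up-weights the positive class during training. A sketch (xgb_weighted is a new variable, not part of the original pipeline):

# Optional: compensate for class imbalance with the negative-to-positive ratio
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_weighted = XGBClassifier(eval_metric='logloss', scale_pos_weight=ratio)
xgb_weighted.fit(X_train, y_train)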
Model Evaluation
Evaluation Metrics Function
def evaluate_model(y_test, y_pred):
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Classification Report:\n', classification_report(y_test, y_pred))
Evaluate All Models
print("\nLogistic Regression:") evaluate_model(y_test, y_pred_lr) print("\nRandom Forest:") evaluate_model(y_test, y_pred_rf) print("\nXGBoost:") evaluate_model(y_test, y_pred_xgb)
Hyperparameter Tuning
You can improve model performance using GridSearchCV.
Example for Random Forest
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=3,
                           n_jobs=-1,
                           verbose=2)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
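GridSearchCV refits the best configuration on the full training split and exposes it as best_estimator_, so the tuned model can be evaluated directly with the same helper as before:

# Evaluate the refit, tuned Random Forest on the held-out test set
best_rf = grid_search.best_estimator_
evaluate_model(y_test, best_rf.predict(X_test))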
Save the Model
After selecting your best-performing model (e.g., XGBoost), you can save it for deployment.
Save Model Using Joblib
import joblib

# Save the model
joblib.dump(xgb, 'xgb_cross_sell_model.pkl')
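Because Annual_Premium and Vintage were standardized before training, the deployed app must transform raw inputs the same way. Saving the fitted scaler alongside the model makes that possible; the filename here is our own choice:

# Persist the fitted scaler so the app can transform raw inputs identically
joblib.dump(scaler, 'premium_vintage_scaler.pkl')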
Deploy the Model Using Streamlit
Now, let’s deploy the model with Streamlit.
Install Streamlit
pip install streamlit
Create Streamlit App
Create a new Python file, app.py:
import streamlit as st
import joblib
import numpy as np

# Load the trained model and the scaler saved alongside it
model = joblib.load('xgb_cross_sell_model.pkl')
scaler = joblib.load('premium_vintage_scaler.pkl')

st.title('Health Insurance Cross-Sell Prediction')

# Input fields (the order must match the training columns:
# Gender, Age, Driving_License, Region_Code, Previously_Insured,
# Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, Vintage)
gender = st.radio('Gender (0 = Female, 1 = Male)', [0, 1])
age = st.slider('Age', 18, 100, 30)
driving_license = st.radio('Driving License', [0, 1])
region_code = st.number_input('Region Code', min_value=0.0, max_value=50.0, value=28.0)
previously_insured = st.radio('Previously Insured', [0, 1])
vehicle_age = st.selectbox('Vehicle Age (0 = 1-2 Year, 1 = < 1 Year, 2 = > 2 Years)', [0, 1, 2])
vehicle_damage = st.radio('Vehicle Damage', [0, 1])
annual_premium = st.number_input('Annual Premium', min_value=0.0, max_value=100000.0, value=30000.0)
policy_sales_channel = st.number_input('Policy Sales Channel', min_value=1.0, max_value=200.0, value=26.0)
vintage = st.slider('Vintage', 0, 300, 150)

# Predict button
if st.button('Predict'):
    # Scale Annual_Premium and Vintage exactly as during training
    premium_scaled, vintage_scaled = scaler.transform(np.array([[annual_premium, vintage]]))[0]
    input_data = np.array([[gender, age, driving_license, region_code,
                            previously_insured, vehicle_age, vehicle_damage,
                            premium_scaled, policy_sales_channel, vintage_scaled]])
    prediction = model.predict(input_data)
    if prediction[0] == 1:
        st.success('The customer is likely to buy vehicle insurance.')
    else:
        st.info('The customer is not likely to buy vehicle insurance.')
Run the Streamlit App
streamlit run app.py
This will open a web interface where you can input values and see the model’s prediction.
Conclusion
In this Health Insurance Cross Sell Prediction project, we successfully:
- Loaded and preprocessed the dataset.
- Performed exploratory data analysis.
- Engineered features and selected models.
- Trained and evaluated Logistic Regression, Random Forest, and XGBoost models.
- Improved model performance using hyperparameter tuning.
Key Takeaways
- Data Preprocessing is crucial for high model performance.
- Random Forest and XGBoost often outperform simpler models.
- Hyperparameter tuning can significantly enhance results.
If you found this tutorial helpful, consider exploring more advanced topics like model interpretability with SHAP or an alternative deployment with Flask. Stay tuned for more ML project tutorials!
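As a pointer toward the SHAP direction mentioned above, here is a minimal sketch, assuming the shap package is installed (pip install shap); TreeExplainer works with tree models such as the XGBoost classifier trained here:

import shap

# Explain the XGBoost model's predictions on the test set
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)

# The summary plot ranks features by their overall impact on the prediction
shap.summary_plot(shap_values, X_test)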
Additional Machine Learning Project Ideas
Here are some additional machine learning projects you can explore to further develop your skills:
- Personal Injury Case Outcome Prediction: Predict the outcome of personal injury legal cases based on historical data.
- Exploratory Data Analysis on the Titanic Dataset: Perform comprehensive EDA on the Titanic dataset and build predictive models.
- Big Mart Sales Prediction Project (2025): Forecast sales for different Big Mart outlets using regression techniques.
- Potato Leaf Disease Detection Project: Detect and classify diseases in potato leaves using image classification with deep learning.
- Hand Gesture Recognition Project Using Deep Learning: Recognize hand gestures from video or image data using CNNs and RNNs.
- Car Accident Attorney – Case Viability Prediction: Estimate whether auto accident cases will be profitable for attorneys.
- Credit Card Fraud Detection: Use anomaly detection methods to find fraudulent credit card transactions.
- Modeling Insurance Claim Severity: Predict the severity of insurance claims based on customer and incident data.