Predicting Customer Lifetime Value (CLV) using PySpark & Survival Analysis
In this project, I transitioned a local Pandas-based Cox Proportional Hazards model into a distributed PySpark architecture to calculate Customer Lifetime Value (CLV) at scale.
1. The Business Problem
Retaining telecommunications customers is critical. We needed a way to translate statistical survival probabilities into actual financial metrics (Net Present Value).
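The core translation works like this: the survival model gives the probability that a customer is still active in month t, and multiplying that by the monthly margin and discounting back to the present yields an expected NPV. A minimal sketch of that arithmetic, with an illustrative survival curve, margin, and discount rate (none of these values come from the project's data):

```python
# Sketch: convert survival probabilities into a discounted CLV (NPV).
# survival[t] = P(customer still active in month t), e.g. from a Cox model.
def clv_npv(survival, monthly_margin, annual_discount_rate):
    # Convert the annual discount rate to an equivalent monthly rate.
    monthly_rate = (1 + annual_discount_rate) ** (1 / 12) - 1
    return sum(
        p * monthly_margin / (1 + monthly_rate) ** (t + 1)
        for t, p in enumerate(survival)
    )

# Illustrative numbers only: 3 months of survival probabilities,
# $40/month margin, 10% annual discount rate.
npv = clv_npv([0.95, 0.90, 0.85], monthly_margin=40.0, annual_discount_rate=0.10)
```

The same idea scales to any horizon: a longer survival curve simply adds more discounted terms to the sum.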
2. Technical Implementation (PySpark)
To scale beyond a single machine, the Pandas DataFrame logic was rewritten using PySpark's Window functions and native column expressions. Here is the core of the cumulative-NPV logic:
# PySpark Window Function for Cumulative NPV
from pyspark.sql import Window, functions as F

# Running total of monthly NPV, ordered by contract month
w = Window.orderBy("contract_month").rowsBetween(Window.unboundedPreceding, Window.currentRow)
final_cohort_df = (clv_cohort_df
    .withColumn("cumulative_npv", F.round(F.sum("npv").over(w), 2))
    .withColumn("contract_month", F.col("contract_month") + 1))  # shift to 1-indexed months
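For readers less familiar with window functions, the transformation above is just a running sum over the month-ordered rows, followed by a shift to 1-indexed months. A plain-Python illustration of the same arithmetic (the (month, npv) tuples stand in for the DataFrame columns and are hypothetical):

```python
# Plain-Python equivalent of the cumulative-NPV window above, applied to
# a list of (contract_month, npv) rows already sorted by month.
def cumulative_npv(rows):
    out, running = [], 0.0
    for month, npv in rows:
        running += npv
        # month + 1 mirrors the 1-indexed reporting shift in the PySpark code
        out.append((month + 1, round(running, 2)))
    return out

cumulative_npv([(0, 12.5), (1, 11.0), (2, 9.75)])
# → [(1, 12.5), (2, 23.5), (3, 33.25)]
```

In Spark, the same running total is computed in parallel across the cluster; the unbounded-preceding frame is what makes each row see the sum of all earlier months.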
3. Visualizing the Payback Period
The analysis produced cumulative NPV at the 12-, 24-, and 36-month horizons, giving the marketing team a defensible upper bound on Customer Acquisition Cost (CAC).
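The payback period falls directly out of the cumulative NPV curve: it is the first contract month whose running NPV covers the acquisition cost. A short sketch with illustrative numbers (not the project's actual results):

```python
def payback_month(cumulative_npv, cac):
    """Return the 1-indexed month where cumulative NPV first covers CAC, or None."""
    for month, value in enumerate(cumulative_npv, start=1):
        if value >= cac:
            return month
    return None  # CAC is never recovered within the horizon

payback_month([40.0, 75.0, 105.0, 130.0], cac=100.0)  # → 3
```

Capping CAC below the cumulative NPV at the chosen horizon guarantees the acquisition spend pays back within that window.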