---
title: "Cohort Analysis"
description: "Track user behavior over time by grouping users into cohorts based on shared characteristics or acquisition date."
platforms:
  - claude
  - chatgpt
  - gemini
difficulty: intermediate
variables:
  - name: "cohort_type"
    default: "retention"
    description: "Type of cohort analysis"
---

You are a cohort analysis expert. Help me understand user behavior patterns over time.

## What is Cohort Analysis

### Definition
```
COHORT: A group of users who share a common characteristic
during a defined time period.

COHORT ANALYSIS: Tracking these groups over time to understand
behavior patterns, retention, and lifecycle trends.
```

### Types of Cohorts
```
ACQUISITION COHORTS
- Grouped by signup date (most common)
- Example: "January 2024 signups"

BEHAVIORAL COHORTS
- Grouped by action taken
- Example: "Users who completed onboarding"

SEGMENT COHORTS
- Grouped by characteristic
- Example: "Enterprise customers" vs "SMB"
```
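As a rough illustration, the three cohort types can all be expressed as columns on a user table. This is a minimal sketch with made-up data; the column names (`signup_date`, `completed_onboarding`, `plan`) are assumptions, not part of any particular schema.

```python
import pandas as pd

# Hypothetical user table; column names and values are illustrative only.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]
    ),
    "completed_onboarding": [True, False, True, True],
    "plan": ["Enterprise", "SMB", "SMB", "Enterprise"],
})

# Acquisition cohort: calendar month of signup
users["acquisition_cohort"] = users["signup_date"].dt.to_period("M")

# Behavioral cohort: whether the user took a given action
users["behavioral_cohort"] = users["completed_onboarding"].map(
    {True: "Completed onboarding", False: "Did not complete"}
)

# Segment cohort: an existing characteristic, used as-is
users["segment_cohort"] = users["plan"]

# Number of users in each acquisition cohort
cohort_sizes = users.groupby("acquisition_cohort")["user_id"].nunique()
```

The same user can belong to one cohort of each type, so the columns can be combined (for example, retention by signup month within each plan).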

## Retention Cohort Analysis

### The Retention Matrix
```
COHORT    │ Month 0 │ Month 1 │ Month 2 │ Month 3 │
──────────┼─────────┼─────────┼─────────┼─────────┤
Jan 2024  │  100%   │  45%    │  35%    │  30%    │
Feb 2024  │  100%   │  48%    │  38%    │    -    │
Mar 2024  │  100%   │  52%    │    -    │    -    │

Reading: Of users who signed up in Jan 2024,
30% were still active 3 months later.
```
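The percentages in a retention matrix come from dividing each cell's active-user count by the cohort's Month 0 count. The sketch below uses hypothetical counts chosen so that the result matches the matrix above; the real cohort sizes are not given in this document.

```python
import numpy as np
import pandas as pd

# Hypothetical active-user counts (rows: cohorts, columns: months since signup).
# NaN marks periods that have not happened yet for that cohort.
counts = pd.DataFrame(
    {
        0: [200, 250, 300],
        1: [90, 120, 156],
        2: [70, 95, np.nan],
        3: [60, np.nan, np.nan],
    },
    index=["Jan 2024", "Feb 2024", "Mar 2024"],
)

# Divide every row by its own Month 0 count to get retention rates
retention = counts.div(counts[0], axis=0)

# Share of the Jan 2024 cohort still active in Month 3
jan_month3 = retention.loc["Jan 2024", 3]
```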

### Python Implementation
```python
import pandas as pd
import numpy as np

def create_cohort_analysis(df, user_col, date_col, event_date_col):
    """
    Create retention cohort analysis

    Parameters:
    - df: DataFrame with user activity
    - user_col: Column with user ID
    - date_col: Column with signup/cohort date
    - event_date_col: Column with activity date
    """

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Cohort date: earliest value of date_col for each user
    df['cohort_date'] = df.groupby(user_col)[date_col].transform('min')

    # Extract cohort month
    df['cohort_month'] = df['cohort_date'].dt.to_period('M')

    # Calculate period number (months since signup)
    df['period_number'] = (
        (df[event_date_col].dt.to_period('M') - df['cohort_month'])
        .apply(lambda x: x.n if pd.notna(x) else None)
    )

    # Count unique users per cohort per period
    cohort_data = (
        df.groupby(['cohort_month', 'period_number'])[user_col]
        .nunique()
        .reset_index(name='users')
    )

    # Pivot to matrix
    cohort_matrix = cohort_data.pivot(
        index='cohort_month',
        columns='period_number',
        values='users'
    )

    # Calculate retention rates
    cohort_sizes = cohort_matrix[0]
    retention_matrix = cohort_matrix.divide(cohort_sizes, axis=0)

    return retention_matrix, cohort_matrix

# Usage
retention, raw_counts = create_cohort_analysis(
    df, 'user_id', 'signup_date', 'activity_date'
)
```

### Visualization
```python
import seaborn as sns
import matplotlib.pyplot as plt

def plot_cohort_heatmap(retention_matrix, title='Retention by Cohort'):
    plt.figure(figsize=(12, 8))

    sns.heatmap(
        retention_matrix,
        annot=True,
        fmt='.0%',
        cmap='RdYlGn',
        vmin=0,
        vmax=1,
        linewidths=0.5
    )

    plt.title(title)
    plt.xlabel('Periods Since Signup')
    plt.ylabel('Cohort')
    plt.tight_layout()
    plt.show()

plot_cohort_heatmap(retention)
```

## SQL Cohort Queries

### Basic Retention Cohort
```sql
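-- Dialect note: DATEDIFF('month', start, end) is not standard SQL. It works in
-- engines such as Snowflake; others (e.g. PostgreSQL) need a different expression.
WITH cohorts AS (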
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM users
    GROUP BY user_id
),

activities AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', activity_date) AS activity_month
    FROM user_activity
),

cohort_activities AS (
    SELECT
        c.cohort_month,
        a.activity_month,
        DATEDIFF('month', c.cohort_month, a.activity_month) AS period_number,
        COUNT(DISTINCT c.user_id) AS users
    FROM cohorts c
    LEFT JOIN activities a ON c.user_id = a.user_id
    GROUP BY 1, 2, 3
)

SELECT
    cohort_month,
    period_number,
    users,
    FIRST_VALUE(users) OVER (PARTITION BY cohort_month ORDER BY period_number) AS cohort_size,
    users * 100.0 / FIRST_VALUE(users) OVER (PARTITION BY cohort_month ORDER BY period_number) AS retention_rate
FROM cohort_activities
WHERE period_number >= 0
ORDER BY cohort_month, period_number;
```

### Revenue Cohort
```sql
WITH cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(order_date)) AS cohort_month  -- first purchase = earliest order
    FROM orders
    GROUP BY user_id
),

revenue_by_period AS (
    SELECT
        c.cohort_month,
        DATEDIFF('month', c.cohort_month, DATE_TRUNC('month', o.order_date)) AS period_number,
        SUM(o.amount) AS revenue
    FROM cohorts c
    JOIN orders o ON c.user_id = o.user_id
    GROUP BY 1, 2
)

SELECT
    cohort_month,
    period_number,
    revenue,
    SUM(revenue) OVER (PARTITION BY cohort_month ORDER BY period_number) AS cumulative_revenue
FROM revenue_by_period
ORDER BY cohort_month, period_number;
```

## Key Metrics

### Retention Metrics
```
DAY 1/7/30 RETENTION
- % of users active X days after signup
- Key early indicator of product value

ROLLING RETENTION
- % active on day X or after
- Less volatile than point-in-time

CHURN RATE
- % of users lost in period
- Churn = 1 - Retention

SURVIVAL RATE
- % remaining from original cohort
- Cumulative retention
```
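The retention metrics above can be sketched in pandas as follows. This assumes a hypothetical activity log with a precomputed `days_since_signup` column and treats all users in the log as one cohort; both are illustrative assumptions.

```python
import pandas as pd

# Hypothetical activity log: one row per user per active day.
activity = pd.DataFrame({
    "user_id":           [1, 1, 1, 2, 2, 3, 4, 4],
    "days_since_signup": [0, 1, 7, 0, 9, 0, 0, 1],
})
cohort_size = activity["user_id"].nunique()


def day_n_retention(activity, n, cohort_size):
    """Share of the cohort active exactly on day n (point-in-time)."""
    on_day_n = activity["days_since_signup"] == n
    return activity.loc[on_day_n, "user_id"].nunique() / cohort_size


def rolling_retention(activity, n, cohort_size):
    """Share of the cohort active on day n or any later day."""
    on_or_after = activity["days_since_signup"] >= n
    return activity.loc[on_or_after, "user_id"].nunique() / cohort_size


day1 = day_n_retention(activity, 1, cohort_size)
day7 = day_n_retention(activity, 7, cohort_size)
rolling7 = rolling_retention(activity, 7, cohort_size)
churn7 = 1 - day7  # Churn = 1 - Retention
```

In this toy data, user 2 is inactive on day 7 but returns on day 9, so rolling retention counts them while point-in-time retention does not.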

### Revenue Metrics
```
COHORT LTV (Lifetime Value)
- Total revenue from cohort over time
- LTV = Σ Revenue per period

ARPU (Avg Revenue Per User)
- Revenue / Active Users
- Track by cohort and period

CUMULATIVE ARPU
- Running total ARPU by period
- Shows LTV trajectory
```
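A minimal sketch of the revenue metrics for a single cohort, using made-up numbers. The `ltv_per_acquired_user` column divides by the original cohort size rather than by active users; that per-user normalization is an assumption added here for comparison, not a definition from the list above.

```python
import pandas as pd

# Hypothetical revenue for one cohort of 100 acquired users
cohort_size = 100
revenue = pd.DataFrame({
    "period_number": [0, 1, 2],
    "revenue":       [1000.0, 600.0, 450.0],
    "active_users":  [100, 50, 30],
})

# ARPU: revenue per active user in each period
revenue["arpu"] = revenue["revenue"] / revenue["active_users"]

# Cumulative ARPU: running total of ARPU by period
revenue["cumulative_arpu"] = revenue["arpu"].cumsum()

# Cohort LTV to date: running total of cohort revenue
revenue["cumulative_revenue"] = revenue["revenue"].cumsum()

# Assumption: LTV per acquired user, normalized by the original cohort size
revenue["ltv_per_acquired_user"] = revenue["cumulative_revenue"] / cohort_size
```

Cumulative ARPU grows faster than LTV per acquired user here because its denominator shrinks as users churn, so it is worth stating which one a report uses.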

## Analysis Patterns

### Identifying Trends
```
IMPROVING RETENTION (values rise down a column: newer cohorts retain better):
- Product improvements working
- Better onboarding
- Higher-quality traffic

DECLINING RETENTION (values fall down a column: newer cohorts retain worse):
- Market saturation
- Product issues
- Lower-quality acquisition

SEASONAL PATTERNS (diagonal stripes: the same calendar month across cohorts):
- Holiday effects
- Business cycles
- External events
```
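One way to check whether newer cohorts retain better or worse is to hold the period fixed and compare it across cohorts. The sketch below reuses the example matrix from earlier in this document; `period_trend` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Retention matrix from the example above (rows: cohorts, columns: period)
retention = pd.DataFrame(
    {
        0: [1.0, 1.0, 1.0],
        1: [0.45, 0.48, 0.52],
        2: [0.35, 0.38, np.nan],
        3: [0.30, np.nan, np.nan],
    },
    index=pd.PeriodIndex(
        ["2024-01", "2024-02", "2024-03"], freq="M", name="cohort_month"
    ),
)


def period_trend(retention_matrix, period):
    """Cohort-over-cohort change in retention at a fixed period."""
    series = retention_matrix[period].dropna()
    return series.diff().dropna()


month1_trend = period_trend(retention, 1)

# True when every cohort beat the previous one at Month 1
improving = bool((month1_trend > 0).all())
```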

### Cohort Comparison
```python
def compare_cohorts(retention_matrix, cohort1, cohort2):
    """Compare two cohorts"""
    c1 = retention_matrix.loc[cohort1]
    c2 = retention_matrix.loc[cohort2]

    comparison = pd.DataFrame({
        'Cohort 1': c1,
        'Cohort 2': c2,
        'Difference': c2 - c1,
        'Percent Change': (c2 - c1) / c1 * 100
    })

    return comparison

# Compare Jan vs Feb cohorts
comparison = compare_cohorts(retention, '2024-01', '2024-02')
```

## Behavioral Cohorts

### Creating Behavioral Segments
```python
def create_behavioral_cohort(df, user_col, behavior_col):
    """
    Create cohorts based on behavior

    Example behaviors:
    - Completed onboarding
    - Used feature X
    - Referred a friend
    """

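    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Identify users who showed the behavior at least once
    users_with_behavior = df.loc[df[behavior_col].eq(True), user_col].unique()

    # Label every row by whether its user is in that set
    # (vectorized isin avoids a slow per-row membership check)
    df['cohort'] = df[user_col].isin(users_with_behavior).map(
        {True: 'With Behavior', False: 'Without'}
    )

    return df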

# Example: Users who completed onboarding vs didn't
df = create_behavioral_cohort(df, 'user_id', 'completed_onboarding')
```

### Comparing Behavioral Cohorts
```python
def behavioral_retention_comparison(df, cohort_col, user_col, period_col):
    """Compare retention between behavioral cohorts"""

    retention_by_cohort = df.groupby([cohort_col, period_col]).agg({
        user_col: 'nunique'
    }).reset_index()

    # Calculate retention rate for each cohort
    pivoted = retention_by_cohort.pivot(
        index=period_col,
        columns=cohort_col,
        values=user_col
    )

    # Normalize to period 0
    retention = pivoted / pivoted.iloc[0]

    return retention
```

## Insights Framework

### Questions to Answer
```
1. Are newer cohorts retaining better than older ones?
2. What's the typical "drop-off cliff" period?
3. At what point does retention stabilize?
4. Do certain acquisition channels produce better cohorts?
5. Does a specific behavior correlate with better retention?
6. What's the LTV trajectory for each cohort?
```
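Questions 2 and 3 can be approximated from an average retention curve. This is a sketch under stated assumptions: the curve values are hypothetical, and the 2-point `tolerance` used to call retention "stable" is an arbitrary choice to tune for your product.

```python
import pandas as pd

# Hypothetical average retention by period (index = periods since signup)
curve = pd.Series([1.00, 0.45, 0.35, 0.30, 0.29, 0.28], index=range(6))

# Period-over-period loss; positive values mean users were lost
drops = -curve.diff().dropna()

# Drop-off cliff: the period with the largest single loss
cliff_period = drops.idxmax()


def stabilization_period(curve, tolerance=0.02):
    """First period whose drop, and every later drop, is within tolerance."""
    drops = -curve.diff().dropna()
    for period in drops.index:
        if (drops.loc[period:] <= tolerance).all():
            return period
    return None  # the curve never settles within the tolerance


stable_from = stabilization_period(curve)
```

With these numbers the largest loss happens going into period 1, and losses stay within the tolerance from period 4 onward.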

### Actionable Insights
```
IF Month 1 retention is low:
→ Focus on activation and early engagement
→ Improve onboarding experience

IF Retention drops sharply at Month 3:
→ Investigate what happens at that point
→ Consider engagement campaigns before drop

IF Behavioral cohorts differ significantly:
→ Drive users toward high-retention behaviors
→ Incorporate into onboarding
```
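If "differ significantly" is read in the statistical sense, one simple check is a two-proportion z-test on retained users in each behavioral cohort. This is a sketch using only the standard library; the counts are hypothetical, and the pooled z-test is one reasonable choice among several, not a method prescribed by this document.

```python
import math


def two_proportion_z_test(retained_a, total_a, retained_b, total_b):
    """Two-sided z-test for a difference between two retention rates."""
    p_a = retained_a / total_a
    p_b = retained_b / total_b
    # Pooled rate under the null hypothesis that both cohorts retain equally
    pooled = (retained_a + retained_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical Month 1 counts: 120 of 200 "With Behavior" users retained
# versus 90 of 200 "Without"
z, p_value = two_proportion_z_test(120, 200, 90, 200)
```

A small p-value suggests the gap is unlikely to be noise, which supports prioritizing that behavior in onboarding experiments.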

## Checklist

### Setting Up Cohort Analysis
```
□ Define cohort criteria (acquisition date, behavior, etc.)
□ Define success metric (retention, revenue, engagement)
□ Choose time granularity (daily, weekly, monthly)
□ Ensure sufficient data history
□ Set up tracking for cohort identification
```

### Analyzing Results
```
□ Compare the same period across cohorts (improving/declining)
□ Identify drop-off cliff points
□ Compare cohorts to find what works
□ Segment by acquisition channel
□ Test behavioral hypotheses
□ Calculate LTV projections
```

Describe your user data, and I'll help set up cohort analysis.
