Too Messy to Melt

Downloadable attendance churn dataset and walkthrough of its creation - from messy and manual Excel to a tidy and reproducible Python pipeline.
Football
Analytics
Data Engineering
Author

Diggy

Published

January 15, 2026

Introduction

In pandas (and tidy-data workflows), melt() is the standard tool for reshaping a wide table into a long one.
In this project, the original spreadsheet (see below) wasn’t just wide — it was inconsistent across season formats, team naming conventions, and placeholder values (e.g., COVID). A naive pd.melt() would have produced incorrect joins and silent errors.
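For readers unfamiliar with the function, here is melt() doing its job on a well-behaved wide table: a minimal sketch with invented numbers, not data from this project.

import pandas as pd

wide = pd.DataFrame({
    'Team': ['Burnley'],
    'Att_Y0': [30000],
    'Att_Y1': [28000],
})
long = wide.melt(id_vars='Team', var_name='Year', value_name='Attendance')
# Two tidy rows: one per team-year combination.

The spreadsheet below breaks the assumptions this one-liner relies on: consistent column names, one value type per column, and no placeholder strings.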

Unusably messy attendance spreadsheet

This page documents the reproducible reshaping pipeline used to build the final open dataset.

The final output of this process is the perfect_tidy_data.csv file. Read on to learn more about its structure, contents, and composition, or simply Download perfect_tidy_data.csv. The CSV becomes an analysis-ready Python DataFrame after some further enrichment; see the final code block for details.
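If you only want to work with the finished file, a load along these lines (local path assumed; column names from the data dictionary below) preserves the nullable integer columns:

import pandas as pd

int_cols = ['Attendance', 'Position', 'Tier', 'Games_Played_History', 'Wins_History',
            'Goals_For_History', 'Goals_Against_History', 'Points_History']
df = pd.read_csv('perfect_tidy_data.csv', dtype={c: 'Int64' for c in int_cols})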

Dataset Scope

All Premier League relegations with usable attendance data; this means COVID seasons are excluded. Each Premier League relegation event is represented across a consistent four-season window and includes league performance as well as attendance figures. Data is on a per-season basis, not per-match.

- Year -1: Season before relegation
- Year 0: Relegation season (Premier League)
- Year +1: First post-relegation season
- Year +2: Second post-relegation season

A temporary Year +3 is generated programmatically only to determine the Year +2 season outcome (Promoted / Survived / Relegated). After outcome extraction, the Year +3 rows are dropped.

Granularity is one row per team × season × relegation event. So one relegation event (e.g., Burnley 2023-24) is represented by four rows (Year -1 through Year +2).
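Concretely, the Burnley 2023-24 event contributes four rows shaped like this (seasons follow from the window definition; other columns omitted):

Relegation_Event_ID   Season    Year_vs_Relegation
Burnley 2023-24       2022-23   -1
Burnley 2023-24       2023-24    0
Burnley 2023-24       2024-25    1
Burnley 2023-24       2025-26    2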

Data sources

Attendance data

Attendance values were manually compiled into an Excel workbook using:

  • European-Football-Statistics.co.uk (season-level attendances)

Because COVID-affected seasons are not comparable to normal seasons, I sometimes entered the string value COVID into attendance cells during manual compilation. The pipeline converts these placeholders to missing values (pd.NA) before numeric conversion.

Download Relegation Attendance Churn.xlsx

League position and performance

League tier, final position, and raw season performance totals come from:

  • The Fjelstul English Football Database (standings dataset)
    Fjelstul, Joshua C. “The Fjelstul English Football Database v1.1.0.” May 26, 2024. https://www.github.com/jfjelstul/englishfootball

Licensing note: The Fjelstul database is distributed under CC BY-SA 4.0. This derived dataset is made available under the same CC BY-SA 4.0 license.

  • Patch file
    The Fjelstul dataset does not include the 2024/25 season, so I have supplemented it with a patch file (standings24_25.csv) to include the most recent completed season.

Download standings24_25.csv

Data dictionary

Identifiers and structure

Column               Type    Description
Relegation_Event_ID  string  Unique ID for a relegation event ("{Team} {Season}")
Team                 string  Standardized club name (matches standings naming)
Season               string  Season in YYYY-YY format (e.g., 2002-03)
Year_vs_Relegation   int     Relative season index: -1, 0, 1, 2

Attendance

Column      Type            Description
Attendance  int (nullable)  Average home league attendance; missing where unusable (e.g., COVID)

League status and outcome

Column            Type            Description
Tier              int (nullable)  League tier (1 = top flight, 2 = Championship, etc.)
Position          int (nullable)  Final league position
Year_End_Outcome  string          Promoted, Survived, or Relegated

Raw season performance totals

Column                 Type            Description
Games_Played_History   int (nullable)  League matches played
Wins_History           int (nullable)  League wins
Goals_For_History      int (nullable)  League goals scored
Goals_Against_History  int (nullable)  League goals conceded
Points_History         int (nullable)  League points total

Inputs and outputs

File                              Purpose
Relegation Attendance Churn.xlsx  Wide-format attendance sheet (input)
standings24_25.csv                Patch file for the 2024/25 season (input)
perfect_tidy_data.csv             Final tidy dataset (output)

Dataset design

Primary identifier: Relegation_Event_ID

A relegation event is uniquely identified as:

Relegation_Event_ID = "{Team} {RelegationSeason}"

Example: Burnley 2023-24

This identifier is repeated across the four rows representing Year -1 through Year +2.
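A quick integrity check, assuming the final CSV has been loaded into df as sketched earlier, is that every event ID should appear exactly four times:

counts = df['Relegation_Event_ID'].value_counts()
assert (counts == 4).all(), "each event should span Year -1 through Year +2"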

Reshaping pipeline

Step 0: Prerequisites

# Install the fuzzy-matching dependency first if needed:
#   pip install thefuzz
import pandas as pd
import numpy as np
from thefuzz import process

Step 1 — Load the wide spreadsheet and clean attendance columns

Why: The attendance sheet is wide-format, and attendance values include commas and placeholders (e.g., COVID).

What happens:
- Strip thousands separators (e.g., 23,721 → 23721)
- Convert COVID and nan placeholders to missing values (pd.NA)
- Coerce to numeric

# Load the manually compiled wide spreadsheet (path assumed; adjust to your copy)
df_wide = pd.read_excel('Relegation Attendance Churn.xlsx')

attendance_cols = [c for c in df_wide.columns if "Att" in c]  # "Att" so "Year Before Att" is cleaned too
for col in attendance_cols:
    df_wide[col] = df_wide[col].astype(str).str.replace(',', '').replace({'COVID': pd.NA, 'nan': pd.NA})
    df_wide[col] = pd.to_numeric(df_wide[col], errors='coerce')
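The effect on a synthetic example (values invented for illustration):

s = pd.Series(['23,721', 'COVID', 'nan', '18,004'])
cleaned = pd.to_numeric(
    s.str.replace(',', '').replace({'COVID': pd.NA, 'nan': pd.NA}),
    errors='coerce'
)
# cleaned -> 23721.0, NaN, NaN, 18004.0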

Step 2 — Load standings data (for auditing + enrichment)

Why: Team naming mismatches and season key mismatches can silently break merges. The standings dataset provides a stable reference list of teams and seasons.

What happens:
- Load standings.csv
- Optionally append standings24_25.csv
- Rename and select relevant columns
- Coerce numeric columns

try:
    # Load 'season' as string to be safe
    df_history = pd.read_csv('../../assets/files/datasets/standings.csv', dtype={'season': str})

    # --- Load patch file and combine ---
    try:
        df_patch = pd.read_csv('../../assets/files/datasets/standings24_25.csv', dtype={'season': str})
        # Append patch data to history data
        df_history = pd.concat([df_history, df_patch], ignore_index=True)
        print("Step 2a: Loaded and applied 'standings24_25.csv'.")
    except FileNotFoundError:
        print("Step 2a: 'standings24_25.csv' not found. Skipping patch.")
    # --- End new patch logic ---

    print("Step 2b: Loaded historical data for auditing.")

    # Create the cleaned df_positions here for later use
    df_positions = df_history.rename(columns={
        'team_name': 'Team',
        'season': 'Season',
        'position': 'Position_History',
        'played': 'Games_Played_History',
        'wins': 'Wins_History',
        'goals_for': 'Goals_For_History',
        'goals_against': 'Goals_Against_History',
        'points': 'Points_History',
        'tier': 'Tier_History'
    })[['Team', 'Season', 'Position_History', 'Games_Played_History', 'Wins_History',
        'Goals_For_History', 'Goals_Against_History', 'Points_History', 'Tier_History']]

    # Convert new history columns to numeric
    numeric_cols = ['Position_History', 'Games_Played_History', 'Wins_History',
                    'Goals_For_History', 'Goals_Against_History', 'Points_History', 'Tier_History']
    for col in numeric_cols:
        df_positions[col] = pd.to_numeric(df_positions[col], errors='coerce')

Step 3 — Audit team names (fuzzy matching)

Why: The manual spreadsheet uses informal names (Man City, Spurs, etc.). The merge requires standardized names.

What happens:
- Identify team names in the attendance sheet that don’t match standings
- Propose high-confidence fuzzy matches (score > 85)
- Warn on low-confidence values

    # (continues the try block opened in Step 2)
    teams_in_wide_df = set(df_wide['Relegated Team'].astype(str).unique())
    teams_in_history_df = set(df_positions['Team'].astype(str).unique())

    unmatched_teams = teams_in_wide_df.difference(teams_in_history_df)

    # Create an empty map to build automatically
    auto_team_name_map = {}

    if unmatched_teams:
        print("--- AUDIT: TEAM NAME MISMATCHES FOUND ---")
        print("The following teams from your file do not match the history file.")
        print("Attempting to build an automatic mapping...")

        # Loop through each unique unmatched team and find the best match
        for team in sorted(unmatched_teams):
            # process.extractOne returns a tuple: (best_match, score)
            suggestion = process.extractOne(team, teams_in_history_df)

            # Auto-map only high-confidence matches
            if suggestion[1] > 85:
                print(f"  - Auto-mapping: \"{team}\" -> \"{suggestion[0]}\" (Score: {suggestion[1]})")
                auto_team_name_map[team] = suggestion[0]
            else:
                # Don't map low-confidence or junk data
                if team not in ['nan', 'TEAM', 'Relegated Team']:
                    print(
                        f"  - WARNING: \"{team}\" has no confident match. (Best: \"{suggestion[0]}\" at {suggestion[1]}%)")
                    print(f"    -> Please add a fix for \"{team}\" to 'manual_overrides_map' in Step 4.")

        print("---------------------------------------------")
    else:
        print("--- AUDIT: TEAM NAMES OK ---")
        print("All team names in your file match the history file.")

except FileNotFoundError:
    print("Error: '../../assets/files/datasets/standings.csv' not found. Skipping audit and enrichment.")
    # Create empty dataframes if file not found
    df_positions = pd.DataFrame(columns=['Team', 'Season', 'Position_History', 'Games_Played_History',
                                         'Wins_History', 'Goals_For_History', 'Goals_Against_History',
                                         'Points_History', 'Tier_History'])
    auto_team_name_map = {}

Step 4 — Standardize team names (auto-map + manual overrides)

Why: Fuzzy matching is helpful but not perfect; known edge cases are fixed with manual overrides.

What happens:
- Merge the auto-map and manual overrides
- Replace names in the wide spreadsheet
- Filter out remaining junk/unmapped rows

manual_overrides_map = {
    "Man Utd": "Manchester United",
    "Man City": "Manchester City",
    "Spurs": "Tottenham Hotspur",
    "QPR": "Queens Park Rangers",
    "Wolves": "Wolverhampton Wanderers",
    "Sheff Utd": "Sheffield United",
    "Sheff Wed": "Sheffield Wednesday",
    "Nott'm Forest": "Nottingham Forest",
    # Caution: Notts County is a distinct club from Nottingham Forest; this
    # mapping assumes the sheet used "Notts County" as shorthand for Forest.
    "Notts County": "Nottingham Forest"
    # e.g., if audit warns about "Bradford", add:
    # "Bradford": "Bradford City",
}
# Combine the auto-generated map with the manual overrides;
# manual overrides overwrite any conflicting keys from the auto map.
final_team_map = {**auto_team_name_map, **manual_overrides_map}

df_wide['Relegated Team'] = df_wide['Relegated Team'].replace(final_team_map)
print("Step 4: Standardized 'Relegated Team' names using auto-mapping and manual overrides.")
# --- 4.5. FILTERING STEP ---
# Now that names are mapped, we can filter out junk rows that didn't get mapped
teams_in_history_df = set(df_positions['Team'].astype(str).unique())
df_wide = df_wide[df_wide['Relegated Team'].isin(teams_in_history_df)]
print("Step 4.5: Filtered out junk rows (e.g., 'nan', 'TEAM').")

Step 5 — Create a stable event key (Relegation_Event_ID)

Why: Each relegation event needs a stable identifier across Year -1..+2.

df_wide['Relegation_Event_ID'] = df_wide['Relegated Team'] + ' ' + df_wide['Season']
print("Step 5: Created 'Relegation_Event_ID'.")

Step 6 — Convert wide → tidy (safe slices + concat)

Why: The attendance sheet is wide and not safely meltable without careful column mapping.

What happens:
- Create four normalized slices (Year -1..+2)
- Concatenate into one long table with a consistent schema

## Helper function to create each slice
def make_slice(df, team_col, season_col, att_col, year_vs):
    # --- Check if att_col exists, otherwise use pd.NA ---
    # This makes the function robust for the missing Y3 attendance
    if att_col not in df.columns:
        # Create a temporary column of NAs if the attendance col doesn't exist
        df = df.assign(Attendance_tmp=pd.NA) 
    else:
        # If the col *does* exist, rename it
        df = df.rename(columns={att_col: "Attendance_tmp"})
        
    slice_df = df.rename(columns={
        team_col: "Team",
        season_col: "Season_tmp"
    }).assign(Year_vs_Relegation=year_vs)

    # Select and rename final columns
    slice_df = slice_df[['Relegation_Event_ID', 'Team', 'Season_tmp', 'Attendance_tmp', 'Year_vs_Relegation']].copy()
    slice_df.columns = ['Relegation_Event_ID', 'Team', 'Season', 'Attendance', 'Year_vs_Relegation']

    return slice_df

# Create slices (Only up to Year 2)
slice_minus1 = make_slice(df_wide, "Relegated Team", "Year Before", "Year Before Att", -1)
slice_0      = make_slice(df_wide, "Relegated Team", "Season", "Attendance", 0)
slice_1      = make_slice(df_wide, "Relegated Team", "Year After", "Attendance year after", 1)
slice_2      = make_slice(df_wide, "Relegated Team", "2 years after", "Attendance 2 years after", 2)

# Concatenate slices safely
df_tidy = pd.concat([slice_minus1, slice_0, slice_1, slice_2], ignore_index=True)

print("Step 6: Data has been 'tidied' successfully!")
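To see the mechanics on a single synthetic row (all values invented), make_slice pulls one year's columns into the tidy schema:

demo = pd.DataFrame({
    'Relegation_Event_ID': ['Example FC 2000-01'],
    'Relegated Team': ['Example FC'],
    'Year Before': ['1999-00'],
    'Year Before Att': [21000],
})
print(make_slice(demo, 'Relegated Team', 'Year Before', 'Year Before Att', -1))
# One tidy row: Example FC, 1999-00, attendance 21000, Year_vs_Relegation -1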

Step 7 — Standardize season keys (YYYY-YY)

Why: Season formats vary (e.g., 1991-1992 vs 1992-93) and must align for joins.

# --- HELPER FUNCTION 1 ---
def format_season(season_str):
    """Converts '1991-1992' to '1991-92' and leaves '1992-93' as is."""
    if pd.isna(season_str):
        return pd.NA
    parts = str(season_str).split('-')
    if len(parts) == 2:
        if len(parts[1]) == 4:  # Format is '1991-1992'
            return f"{parts[0]}-{parts[1][-2:]}"
        else:  # Format is '1992-93'
            return season_str
    return season_str  # Return as-is if not in expected format

# --- HELPER FUNCTION 2 ---
def increment_season(season_str):
    """Converts a season string like '1995-96' to '1996-97'."""
    if pd.isna(season_str):
        return pd.NA
    try:
        start_year = int(season_str.split('-')[0])
        next_start_year = start_year + 1
        
        # Handle the '1999-00' case
        if next_start_year == 1999:
            return "1999-00"
        
        next_end_year_short = str(next_start_year + 1)[-2:] # e.g., 97
        return f"{next_start_year}-{next_end_year_short}"
    except Exception as e:
        print(f"Error incrementing season '{season_str}': {e}")
        return pd.NA

# Apply the formatting
df_tidy['Season'] = df_tidy['Season'].apply(format_season)
print("Step 7: Standardized 'Season' key to 'YYYY-YY' format (e.g., '1992-93').")
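A few sanity checks on the two helpers (expected outputs follow directly from the docstrings above):

assert format_season('1991-1992') == '1991-92'
assert format_season('1992-93') == '1992-93'
assert increment_season('1995-96') == '1996-97'
assert increment_season('1998-99') == '1999-00'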

Step 7.5 — Temporary Year +3 rows (outcome extraction only)

Why: To compute the outcome of Year +2, we need to observe the next season’s tier transition.

print("Step 7.5: Generating temporary Year 3 rows for outcome calculation...")

# 1. Find all Year 2 rows
df_y2_rows = df_tidy[df_tidy['Year_vs_Relegation'] == 2].copy()

# 2. Transform them into Year 3 rows
df_y2_rows['Year_vs_Relegation'] = 3
df_y2_rows['Attendance'] = pd.NA # Attendance data is not needed
df_y2_rows['Season'] = df_y2_rows['Season'].apply(increment_season)

# 3. Concatenate these helper rows back onto the main tidy dataframe
df_tidy = pd.concat([df_tidy, df_y2_rows], ignore_index=True)
print("Step 7.5: Temporary Year 3 rows created and added.")

Step 8 — Enrich tidy data with standings (raw totals)

Why: Add tier, position, and season totals for later analysis while keeping this dataset “raw”.

if not df_positions.empty:
    # --- Upgrade the history file's 'Season' column ---
    season_numeric = pd.to_numeric(df_positions['Season'], errors='coerce').dropna()
    season_end_year = (season_numeric + 1).astype(int).astype(str).str.zfill(4).str[-2:]
    df_positions.loc[season_numeric.index, 'Season'] = season_numeric.astype(int).astype(str) + '-' + season_end_year
    print("Step 8a: Upgraded history file 'Season' key.")

    # Merge only the raw totals; per-game metrics are derived later
    # in the analysis prep script.
    df_positions_merge = df_positions[[
        'Team', 'Season', 'Position_History', 'Tier_History',
        'Games_Played_History', 'Wins_History', 'Goals_For_History', 
        'Goals_Against_History', 'Points_History'
    ]]

    # --- Merge the data ---
    print("Step 8b: Merging tidy data with position data...")
    df_final = pd.merge(
        df_tidy,
        df_positions_merge,  # Use the merge-ready dataframe with raw columns
        on=['Team', 'Season'],
        how='left'
    )
    df_final['Position'] = df_final['Position_History']
    df_final['Tier'] = df_final['Tier_History']
    print("Step 8: Data enriched. 'df_final' created and 'Tier' column added.")

else:
    print("Step 8: Skipping merge as 'standings.csv' was not loaded.")
    df_final = df_tidy.copy() 
    # Add empty columns
    for col in ['Position', 'Tier', 'Games_Played_History', 'Wins_History', 'Goals_For_History', 'Goals_Against_History', 'Points_History']:
        df_final[col] = pd.NA

Step 8.5 — Compute Year_End_Outcome from tier transitions

Why: Outcomes (Promoted/Survived/Relegated) are derived from how tier changes year-to-year within each event.

print("Step 8.5: Calculating 'Year_End_Outcome' based on precise timeline...")

# 1. Calculate the Tier Change (Tier_Next_Year - Tier_Current_Year)
df_final['Tier_Next_Year'] = df_final.groupby('Relegation_Event_ID')['Tier'].shift(-1)
df_final['Tier_Change'] = df_final['Tier_Next_Year'] - df_final['Tier']

# 2. Map the Tier Change to the Outcome Label
def map_tier_change_to_outcome(row):
    """Maps tier values and tier changes to outcomes per the timeline definition above."""
    
    year_vs_relegation = row['Year_vs_Relegation'] 
    current_tier = row['Tier']
    tier_change = row['Tier_Change']

    # --- 1. Handle Year -1 (Based on CURRENT Tier) ---
    if year_vs_relegation == -1:
        if current_tier == 1:
            return 'Survived'
        if current_tier == 2:
            return 'Promoted'
        return pd.NA

    # --- 2. Handle Year 0 (The Relegation Event) ---
    if year_vs_relegation == 0:
        return 'Relegated'

    # --- 3. Handle Year +1 and +2 (Based on TIER CHANGE) ---
    if year_vs_relegation in [1, 2]:
        if pd.isna(tier_change):
            return pd.NA 
        
        if tier_change == -1:
            return 'Promoted'
        if tier_change == 0:
            return 'Survived'
        if tier_change == 1:
            return 'Relegated'
        
        return pd.NA

    return pd.NA # Catches Y3 row

# Apply the function
df_final['Year_End_Outcome'] = df_final.apply(map_tier_change_to_outcome, axis=1)
print("Step 8.5: 'Year_End_Outcome' column successfully created.")
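As a worked example of the mapping (tiers invented for illustration), an event whose tiers run 2, 1, 2, 2 across Year -1 to Year +2 resolves as:

# Year -1: tier 2                     -> 'Promoted'  (came up into the Premier League)
# Year  0: the relegation season      -> 'Relegated' (by definition)
# Year +1: tier 2 -> tier 2 next year -> 'Survived'  (tier change of 0)
# Year +2: needs next season's tier   -> this is why Step 7.5 added the temporary Year +3 rows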

Step 9 — Finalize schema and export

What happens:
- Cast numeric fields to nullable integers
- Drop helper Year +3 rows
- Select final column set and sort
- Write perfect_tidy_data.csv

print("Step 9: Polishing final data...")

# Final schema, including all raw _History columns
final_columns = [
    'Relegation_Event_ID', 'Team', 'Season', 'Tier',
    'Position', 'Attendance', 'Year_vs_Relegation', 'Year_End_Outcome',
    'Games_Played_History', 'Wins_History', 'Goals_For_History', 
    'Goals_Against_History', 'Points_History'
]

# Ensure final columns exist before slicing
for col in final_columns:
    if col not in df_final.columns:
        df_final[col] = pd.NA

# Cast all numeric columns to nullable Int64
for col in ['Attendance', 'Position', 'Tier', 'Games_Played_History', 'Wins_History', 'Goals_For_History', 'Goals_Against_History', 'Points_History']:
    df_final[col] = pd.to_numeric(df_final[col], errors='coerce').astype('Int64')

# --- DROP THE HELPER ROW ---
df_final = df_final[df_final['Year_vs_Relegation'] != 3].copy()

# Keep only the final columns and sort
df_final = df_final[final_columns].sort_values(by=['Relegation_Event_ID', 'Year_vs_Relegation'])

# Save the final, perfect data to a new CSV in the 'datasets' folder
output_path = "../../assets/files/datasets/perfect_tidy_data.csv"
df_final.to_csv(output_path, index=False)
print(f"Step 9: Success! Final table saved to '{output_path}'")

From tidy data to analysis-ready tables

While that completes the reshaping pipeline, further steps are needed to prepare the data for analysis and visualization.

print("--- PREPARING DATA FOR ANALYSIS ---")
# --- ANALYSIS PREP ---
# This block creates the 'df_plot' DataFrame for all visualizations
# 1. Get the baseline attendance (Year 0) for each event
df_y0_baseline = df_final[
    df_final['Year_vs_Relegation'] == 0
][['Relegation_Event_ID', 'Attendance']].rename(columns={'Attendance': 'Attendance_Y0_Baseline'})

# 2. Merge this baseline back onto the main df
df_plot = pd.merge(df_final, df_y0_baseline, on='Relegation_Event_ID', how='left')

# 3. Calculate the Pct Change vs. Year 0 for all years
df_plot['Pct_Change_vs_Y0'] = (
    (df_plot['Attendance'] - df_plot['Attendance_Y0_Baseline']) / 
    df_plot['Attendance_Y0_Baseline']
)

# 4. Get the *key outcome* from Year 1 (the first post-relegation season)
df_y1_outcome = df_final[
    df_final['Year_vs_Relegation'] == 1
][['Relegation_Event_ID', 'Year_End_Outcome']].rename(columns={'Year_End_Outcome': 'Outcome_Group'})

# 5. Merge this outcome group onto the whole plot dataset
df_plot = pd.merge(df_plot, df_y1_outcome, on='Relegation_Event_ID', how='left')


# --- 5.5. IMPUTE 2024-25 OUTCOMES ---
# The Y+1 outcome for the 2023-24 cohort is NA. We impute it manually.
print("Imputing outcomes for 2023-24 cohort...")

# Outcomes from the actual 2024-25 season results
impute_map = {
    'Burnley 2023-24': 'Promoted',
    'Sheffield United 2023-24': 'Survived',  # lost the playoff final, so still in the Championship
    'Luton Town 2023-24': 'Relegated'
}

# Apply the imputation
# This finds rows where Outcome_Group is NA and maps the Event ID to our new value
df_plot['Outcome_Group'] = df_plot['Outcome_Group'].fillna(
    df_plot['Relegation_Event_ID'].map(impute_map)
)
print("Imputation complete.")

# 6. Convert Pct_Change to a percentage number (e.g., 0.05 -> 5)
df_plot['Pct_Change_vs_Y0_Display'] = df_plot['Pct_Change_vs_Y0'] * 100

# 7. Calculate Per-Game metrics on the fly
df_plot['Goals_For_Per_Game'] = df_plot['Goals_For_History'] / df_plot['Games_Played_History']
df_plot['Points_Per_Game'] = df_plot['Points_History'] / df_plot['Games_Played_History']

# 8. Define the FINAL anomaly list
# We are KEEPING Wimbledon, as it's a valid case study.
anomaly_events = [
    'Middlesbrough 1992-93', # Rebuild
    'Sunderland 1996-97',    # New stadium
    'Leicester City 2001-02', # New stadium
    'Bolton Wanderers 1995-96'  # New stadium (in Y+2)
]

# 9. Create the FINAL filtered DataFrame for plotting
# .copy() so the label assignment below doesn't trigger a SettingWithCopyWarning
df_plot_filtered = df_plot[~df_plot['Relegation_Event_ID'].isin(anomaly_events)].copy()
print(f"Removed {len(df_plot) - len(df_plot_filtered)} total data points for structural anomalies.")

# 10. Create the cleaner plot labels
label_map = {
    'Promoted': 'Promoted (Back in PL)',
    'Survived': 'Survived (Still in Champ)',
    'Relegated': 'Relegated (Down to L1)'
}
df_plot['Plot_Group_Label'] = df_plot['Outcome_Group'].map(label_map)
df_plot_filtered['Plot_Group_Label'] = df_plot_filtered['Outcome_Group'].map(label_map)

print("--- Analysis data is ready. ---")

Analysis prep script (full, copy-paste ready)

## From tidy data to analysis-ready tables
print("--- PREPARING DATA FOR ANALYSIS ---")

# --- ANALYSIS PREP ---
# This block creates the 'df_plot' DataFrame for all visualizations
# 1. Get the baseline attendance (Year 0) for each event
df_y0_baseline = df_final[
    df_final['Year_vs_Relegation'] == 0
][['Relegation_Event_ID', 'Attendance']].rename(columns={'Attendance': 'Attendance_Y0_Baseline'})

# 2. Merge this baseline back onto the main df
df_plot = pd.merge(df_final, df_y0_baseline, on='Relegation_Event_ID', how='left')

# 3. Calculate the Pct Change vs. Year 0 for all years
df_plot['Pct_Change_vs_Y0'] = (
    (df_plot['Attendance'] - df_plot['Attendance_Y0_Baseline']) / 
    df_plot['Attendance_Y0_Baseline']
)

# 4. Get the *key outcome* from Year 1 (the first post-relegation season)
df_y1_outcome = df_final[
    df_final['Year_vs_Relegation'] == 1
][['Relegation_Event_ID', 'Year_End_Outcome']].rename(columns={'Year_End_Outcome': 'Outcome_Group'})

# 5. Merge this outcome group onto the whole plot dataset
df_plot = pd.merge(df_plot, df_y1_outcome, on='Relegation_Event_ID', how='left')


# --- 5.5. IMPUTE 2024-25 OUTCOMES ---
# The Y+1 outcome for the 2023-24 cohort is NA. We impute it manually.
print("Imputing outcomes for 2023-24 cohort...")

# Outcomes from the actual 2024-25 season results
impute_map = {
    'Burnley 2023-24': 'Promoted',
    'Sheffield United 2023-24': 'Survived',  # lost the playoff final, so still in the Championship
    'Luton Town 2023-24': 'Relegated'
}

# Apply the imputation
# This finds rows where Outcome_Group is NA and maps the Event ID to our new value
df_plot['Outcome_Group'] = df_plot['Outcome_Group'].fillna(
    df_plot['Relegation_Event_ID'].map(impute_map)
)
print("Imputation complete.")

# 6. Convert Pct_Change to a percentage number (e.g., 0.05 -> 5)
df_plot['Pct_Change_vs_Y0_Display'] = df_plot['Pct_Change_vs_Y0'] * 100

# 7. Calculate Per-Game metrics on the fly
df_plot['Goals_For_Per_Game'] = df_plot['Goals_For_History'] / df_plot['Games_Played_History']
df_plot['Points_Per_Game'] = df_plot['Points_History'] / df_plot['Games_Played_History']

# 8. Define the FINAL anomaly list
# We are KEEPING Wimbledon, as it's a valid case study.
anomaly_events = [
    'Middlesbrough 1992-93', # Rebuild
    'Sunderland 1996-97',    # New stadium
    'Leicester City 2001-02', # New stadium
    'Bolton Wanderers 1995-96'  # New stadium (in Y+2)
]

# 9. Create the FINAL filtered DataFrame for plotting
# .copy() so the label assignment below doesn't trigger a SettingWithCopyWarning
df_plot_filtered = df_plot[~df_plot['Relegation_Event_ID'].isin(anomaly_events)].copy()
print(f"Removed {len(df_plot) - len(df_plot_filtered)} total data points for structural anomalies.")

# 10. Create the cleaner plot labels
label_map = {
    'Promoted': 'Promoted (Back in PL)',
    'Survived': 'Survived (Still in Champ)',
    'Relegated': 'Relegated (Down to L1)'
}
df_plot['Plot_Group_Label'] = df_plot['Outcome_Group'].map(label_map)
df_plot_filtered['Plot_Group_Label'] = df_plot_filtered['Outcome_Group'].map(label_map)

print("--- Analysis data is ready. ---")