Project Overview
Ranked 2nd out of 72 participants in a machine learning course competition on tabular data classification. The task was to classify buildings into 5 categories using a combination of geographical data and building metadata.
Because predictions had to draw on both geographical coordinates and building characteristics, the competition was largely a feature engineering and model selection challenge.
Technical Approach
Feature Engineering
Feature engineering was the key to success in this competition. We created over 400 features by leveraging domain knowledge about urban planning and geographical data. The main feature categories included:
- Temporal Features: Extracted day, month, year from dates and computed time gaps between events
- Categorical Encoding: Converted urban types and geographical categories into binary attributes
- Geometric Features: Calculated length, width, and their ratios to distinguish between different kinds of structures (e.g., roads vs. buildings)
- Spatial Neighbors: Because the dataset contained blocks of buildings taken from satellite views, added features from each building's k nearest neighbors within a fixed radius
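The four feature families above can be sketched in a few dependency-free Python functions. This is a minimal illustration, not the competition code: the function names, the output field names (gap_days, aspect_ratio, neighbor_mean), and the default values of k and radius are all assumptions, and the neighbor search is brute force for readability.

```python
import math
from datetime import date


def temporal_features(first, second):
    """Date parts of the first event plus the gap in days to the second."""
    return {
        "day": first.day,
        "month": first.month,
        "year": first.year,
        "gap_days": (second - first).days,
    }


def one_hot(value, categories, prefix):
    """One binary attribute per known category (hypothetical category names)."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def geometric_features(length, width):
    """Side lengths and their ratio; elongated shapes get a high ratio."""
    long_side, short_side = max(length, width), min(length, width)
    ratio = long_side / short_side if short_side > 0 else 0.0
    return {"length": length, "width": width, "aspect_ratio": ratio}


def neighbor_features(coords, values, k=3, radius=1.0):
    """For each point, summarise up to k nearest neighbours within radius.

    coords is a list of (x, y) pairs; values holds one numeric attribute
    per point. Both k and radius are assumed defaults, not tuned values.
    """
    out = []
    for i, (xi, yi) in enumerate(coords):
        # Distances to every other point, nearest first (O(n^2) overall).
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        near = [j for d, j in dists[:k] if d <= radius]
        mean_val = sum(values[j] for j in near) / len(near) if near else 0.0
        out.append({"n_neighbors": len(near), "neighbor_mean": mean_val})
    return out


# Assemble one feature row from made-up inputs.
row = {}
row.update(temporal_features(date(2019, 3, 1), date(2019, 6, 9)))
row.update(one_hot("urban", ["urban", "rural"], "type"))
row.update(geometric_features(2.0, 8.0))
```

On a dataset of realistic size, a spatial index such as a k-d tree would replace the quadratic neighbor loop.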
Model Selection & Ensemble
We tested multiple models and achieved the best results with an ensemble of three gradient boosting algorithms:
- XGBoost: Primary model with extensive hyperparameter tuning
- LightGBM: Optimized for speed and memory efficiency
- CatBoost: Specialized in handling categorical features
Hyperparameters were tuned manually for each model, and the ensemble was validated using 10-fold cross-validation.
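The validation loop can be sketched as follows. How the three models' predictions were combined is not stated here, so this sketch assumes simple probability averaging (soft voting). All function names are illustrative, and prior_model is a hypothetical stand-in that only predicts class frequencies; in practice each callable would wrap a fitted XGBoost, LightGBM, or CatBoost classifier and return its class probabilities.

```python
import random


def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def soft_vote(prob_sets, weights=None):
    """Weighted average of per-model class-probability rows."""
    weights = weights or [1.0] * len(prob_sets)
    total = sum(weights)
    n_rows, n_classes = len(prob_sets[0]), len(prob_sets[0][0])
    return [
        [sum(w * p[r][c] for w, p in zip(weights, prob_sets)) / total
         for c in range(n_classes)]
        for r in range(n_rows)
    ]


def accuracy(probs, labels):
    """Fraction of rows whose highest-probability class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def cross_validate_ensemble(X, y, models, n_splits=10, seed=0):
    """Mean validation accuracy of the averaged ensemble across folds.

    models is a list of callables (X_train, y_train, X_val) -> probability rows.
    """
    scores = []
    for train_idx, val_idx in kfold_indices(len(X), n_splits, seed):
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        X_val = [X[i] for i in val_idx]
        y_val = [y[i] for i in val_idx]
        prob_sets = [model(X_tr, y_tr, X_val) for model in models]
        scores.append(accuracy(soft_vote(prob_sets), y_val))
    return sum(scores) / len(scores)


def prior_model(X_train, y_train, X_val, n_classes=5):
    """Placeholder model: predicts the training class frequencies for every row."""
    probs = [y_train.count(c) / len(y_train) for c in range(n_classes)]
    return [probs[:] for _ in X_val]


# Toy data: 20 rows, labels 0 and 1 only; three copies of the placeholder model.
X_demo = [[float(i)] for i in range(20)]
y_demo = [0] * 14 + [1] * 6
cv_score = cross_validate_ensemble(X_demo, y_demo, [prior_model] * 3)
```

Scoring the averaged probabilities inside each fold, rather than scoring the models separately, means the reported estimate reflects the ensemble that would actually be submitted.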
Results & Lessons Learned
This was my first experience with classical machine learning, which provided valuable insights into the differences between traditional ML and deep learning approaches.
Key Takeaways:
- Feature Engineering Dominance: Unlike deep learning, where architecture matters most, traditional ML performance depends heavily on feature engineering
- Ensemble Benefits: The combined models consistently outperformed each individual model
- Validation Strategy: Proper cross-validation helped detect overfitting and provided reliable performance estimates
- Domain Knowledge: Understanding spatial relationships and urban planning concepts was crucial for creating meaningful features