Rocket Launch Success Prediction Using Logistic Regression

Introduction

Stasy Hsieh
4 min read · Sep 16, 2024

In this project, we aim to predict the success of rocket launches based on features such as launch year, rocket status, and other variables. The dataset used for this analysis is from the Kaggle Space Missions dataset, and we applied Logistic Regression to model the binary classification of mission success.

Problem Statement

The primary goal of this project is to build a model that accurately predicts whether a rocket launch will be successful (MissionSuccess), given certain features. Our analysis focuses on optimizing a Logistic Regression model to classify successful and failed rocket launches, using techniques such as hyperparameter tuning, threshold adjustments, and misclassification analysis.

Steps Involved in the Project:

1. Data Cleaning and Preparation

2. Model Training using Logistic Regression

3. Hyperparameter Tuning using Grid Search

4. Threshold Adjustment and Misclassification Analysis

5. Model Evaluation and Conclusion

1. Data Cleaning and Preparation

Data Preprocessing

1. Scaling the Features: We used StandardScaler to standardize the features (zero mean, unit variance), which improves the convergence and performance of the Logistic Regression model.

2. Feature Selection: We selected key features such as LaunchYear and RocketStatus to be used in the model.
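The preprocessing steps above can be sketched as follows. The small DataFrame here is synthetic, and the column names (LaunchYear, RocketStatusCode, MissionSuccess) are assumptions based on the features described in this post, not the real Kaggle file:

```python
# Sketch of the preprocessing step on a tiny synthetic frame standing in
# for the Kaggle Space Missions data.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'LaunchYear': [1962, 1975, 1990, 2005, 2020],
    'RocketStatusCode': [0, 0, 1, 1, 1],   # 0 = Retired, 1 = Active (assumed encoding)
    'MissionSuccess': [0, 1, 1, 1, 1],
})

# Feature selection: keep the two predictors, separate the target
X = df[['LaunchYear', 'RocketStatusCode']]
y = df['MissionSuccess']

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

StandardScaler is fit on the training split only in the full pipeline, so the test set is transformed with the training statistics.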

2. Model Training using Logistic Regression

Correlation Matrix

We computed the correlation matrix to understand the relationships between different features and the target (MissionSuccess).

# Compute the correlation matrix on numeric columns and visualize it
import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

Correlation Matrix Heatmap:

From the heatmap, we can observe that LaunchYear has a weak correlation with MissionSuccess, and RocketStatusCode shows little correlation with the target.

3. Hyperparameter Tuning using Grid Search

Logistic Regression Model Training

We used Logistic Regression for the initial model and applied GridSearchCV to find the optimal hyperparameters. The best parameters were:

• C = 0.01
• penalty = 'l1'
• solver = 'liblinear'
• max_iter = 200
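A minimal sketch of that grid search, run on synthetic data in place of the real launch features; the grid mirrors the hyperparameters reported above, and the variable names (`best_model`, `grid`) are illustrative:

```python
# Grid search over Logistic Regression hyperparameters on synthetic data
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy linear label

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],   # liblinear supports both l1 and l2 penalties
    'max_iter': [200],
}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X, y)
best_model = grid.best_estimator_
print(grid.best_params_)
```

Note that the l1 penalty requires a solver that supports it (liblinear or saga), which is why the solver is pinned in the grid.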

Feature Importance:

After fitting the model, we plotted the feature importance to identify which features had the most impact on predicting mission success.

# Plot Logistic Regression coefficients as a proxy for feature importance
import pandas as pd
import matplotlib.pyplot as plt

coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': best_model.coef_[0]})
coefficients.sort_values(by='Coefficient', ascending=False).plot(kind='barh', x='Feature', y='Coefficient')
plt.show()

Feature Importance Plot:

From this plot, it’s clear that LaunchYear had the highest importance in predicting the mission’s success, while RocketStatusCode had relatively less influence.

4. Threshold Adjustment and Misclassification Analysis

Mission Success Rate Over Years

We analyzed how the success rate of rocket launches evolved over time, using the LaunchYear feature.

# Plot the mean mission success rate by launch year
import matplotlib.pyplot as plt

success_by_year = df.groupby('LaunchYear')['MissionSuccess'].mean()
success_by_year.plot(kind='line', title='Mission Success Rate Over Years')
plt.ylabel('Success Rate')
plt.show()

Mission Success Rate Over Years:

The mission success rate increased significantly over the decades, indicating improved reliability in rocket launches as technology advanced.

Threshold Adjustment

We initially used the default threshold of 0.5 to classify mission success. To improve precision, we adjusted the threshold to 0.6 and 0.7. This allowed us to reduce the number of false positives while balancing recall.

# Adjust the decision threshold and re-classify the test set
threshold = 0.6  # experiment with 0.5, 0.6, 0.7
y_pred_adjusted = (best_model.predict_proba(X_test_scaled)[:, 1] >= threshold).astype(int)

Confusion Matrix After Threshold Adjustment:

After adjusting the threshold, we analyzed the confusion matrix to evaluate the model’s performance.

# Plot the confusion matrix for the adjusted threshold
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_adjusted)
plt.show()

Confusion Matrix:

This matrix shows that the model captured all successful missions (recall = 1.0) but struggled with false positives, predicting a few failed missions as successful.
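The precision–recall trade-off described above can be quantified directly from the predicted probabilities. The labels and scores below are synthetic and only illustrate the mechanic, not the post's actual numbers:

```python
# Precision and recall at different decision thresholds, on synthetic scores
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_test = np.array([1, 1, 1, 0, 0, 1, 0, 1])
proba  = np.array([0.9, 0.8, 0.7, 0.65, 0.4, 0.95, 0.55, 0.85])

for threshold in (0.5, 0.6, 0.7):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred)
    r = recall_score(y_test, y_pred)
    print(f'threshold={threshold}: precision={p:.2f}, recall={r:.2f}')
```

In this toy example, raising the threshold removes false positives without dropping any true positives, which is exactly the behavior sought when tuning the threshold above 0.5.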

5. Model Evaluation and Conclusion

Model Evaluation: ROC Curve

Finally, we plotted the ROC curve to evaluate the trade-offs between true positive rate and false positive rate at different threshold levels.

# Plot the ROC curve for the tuned model
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test_scaled)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

ROC Curve:

The ROC curve visualizes the trade-off between true positive rate and false positive rate across all thresholds. The model performs reasonably well, achieving a high true positive rate at a moderate false positive rate.
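The entire ROC curve can be summarized in a single number, the area under the curve (AUC), where 0.5 is chance and 1.0 is perfect ranking. A sketch on synthetic scores (the post does not report its actual AUC):

```python
# Summarize a ROC curve with the AUC score, on synthetic labels and scores
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.2, 0.4, 0.65, 0.8, 0.7, 0.9, 0.6, 0.95])

auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(f'AUC = {auc:.2f}')
```

AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which makes it threshold-independent and a useful companion to the threshold-specific confusion matrices above.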

Rocket Status Analysis

We also examined the impact of RocketStatus on mission success:

# Plot mission success counts by rocket status
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='RocketStatus', hue='MissionSuccess', data=df)
plt.show()

Rocket Status Analysis:

From this plot, it is evident that both Active and Retired rockets had high mission success rates, though there are more Retired rockets in the dataset.

Conclusion

• Model Performance: Logistic Regression achieved a high recall score (1.0), meaning it successfully identified all successful launches. However, it had some issues with false positives, which we addressed through threshold adjustment.
• Next Steps: Given the performance of the Logistic Regression model, we may want to try more complex models such as Random Forest or Gradient Boosting to see if we can further improve precision and overall accuracy without the need for heavy threshold tuning.
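The next step suggested above could be sketched as a side-by-side comparison on the same split. The data here is synthetic with a deliberately non-linear boundary, chosen so the difference between a linear model and a tree ensemble is visible; it stands in for, and does not reflect, the real launch data:

```python
# Compare Logistic Regression against a Random Forest on one split
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)   # XOR-like, non-linear boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in [('LogReg', LogisticRegression()),
                    ('RandomForest', RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f'{name}: accuracy={acc:.2f}')
```

On a boundary like this the forest clearly wins; whether the same holds on the launch data depends on how non-linear the true relationship between the features and MissionSuccess is.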

References:

  1. Space Missions Dataset from Kaggle.
  2. Scikit-learn documentation for Logistic Regression.
