ChatGPT has captured the public’s attention with its myriad use cases, ranging from idea generation and marketing content creation to code debugging and data analysis.
However, amid the generative AI gold rush, several questions remain, especially regarding the risks and rewards of the technology. Alongside the real-world value of ChatGPT that has left so many people in awe, the more provocative implications of AI advancements are yet to be uncovered.
For example, a large and looming question remains:
What will happen if ChatGPT puts power in the hands of those who are neither prepared for nor interested in using this technology responsibly?
After all, although ChatGPT wasn’t specifically designed to make programming accessible to the general public, it’s known to be able to generate code, and even malware when prompted by malicious actors on the dark web.
But how about Machine Learning? Can ChatGPT be leveraged to build a viable ML model and potentially serve as an ML engineer? Is it capable of contributing significantly to the future of Machine Learning?
Our goal was to seek some answers, and here’s the approach we took.
Using ChatGPT to build a classification model
After giving it some thought, we decided to run an experiment to investigate how well ChatGPT can handle the task of building a classification model for predicting loan default.
The dataset used for this purpose was the “German Credit” dataset, a well-known public dataset for loan default classification problems.
We started with a simple, general instruction, asking ChatGPT for code that trains a model on that dataset. We then gradually asked it to update the code to improve performance or to solve specific issues.
Note: Each block of code shows only the parts that are new or updated relative to the previously generated responses.
Prompt 1
Can you give me a snippet of python code that trains a classification model on the publicly available German Credit loan default dataset?
ChatGPT
Sure! Here’s a snippet of Python code that trains a classification model on the German Credit loan default dataset using scikit-learn:
|
|
Result
The code ran without any problems and this was the output:
|
|
The code provided by ChatGPT did a good job of training a simple DecisionTreeClassifier on the specified dataset. It also knew where to find and download the data and what its features were. Great!
However, it’s worth mentioning that, even though encoding the categorical features with a LabelEncoder inside a for loop worked, the proper approach would have been to use an OrdinalEncoder and encode all features at once.
Without sharing the results of the obtained model, we asked ChatGPT to improve its performance, hoping to get a more sophisticated training approach. Here’s how it went.
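To illustrate the point about encoders, here is a minimal sketch of encoding every categorical column in one call with OrdinalEncoder. It is not the code ChatGPT produced; the toy DataFrame, its column names, and the category codes are assumptions made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for a few columns of the credit data (names and values are illustrative)
df = pd.DataFrame({
    "checking_status": ["A11", "A12", "A14", "A11"],
    "purpose": ["A43", "A40", "A43", "A42"],
    "duration": [6, 48, 12, 42],
})
categorical_cols = ["checking_status", "purpose"]

# A single encoder handles all categorical columns at once,
# so no per-column loop with LabelEncoder is needed
encoder = OrdinalEncoder()
df[categorical_cols] = encoder.fit_transform(df[categorical_cols])

print(df)
print(encoder.categories_)  # the learned categories, one array per column
```

LabelEncoder is intended for target labels, while OrdinalEncoder is designed for feature columns and keeps the learned categories for each column in one fitted object, which makes it easier to reuse on new data.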
Prompt 2
Can you improve the performance of the model?
ChatGPT
Yes, there are several ways to potentially improve the performance of the model. Here are a few suggestions:
- Feature engineering: Consider creating new features from existing ones or transforming existing features to better capture the relationships between them and the target variable. For example, you could create a new feature that combines the duration and installment_rate features to capture the total amount paid back per installment.
- Hyperparameter tuning: Try different hyperparameter values for the decision tree classifier and see if you can improve its performance. You can use scikit-learn’s GridSearchCV or RandomizedSearchCV to automate this process.
- Ensemble methods: Instead of using a single decision tree classifier, you could try using ensemble methods such as random forests or gradient boosting, which can often improve performance.
- Data preprocessing: Consider scaling the numerical features, handling missing values, or removing outliers to improve the quality of the data and the performance of the model.
Here’s an updated code snippet that includes some of these suggestions:
|
|
- Imbalanced classes: If the dataset has imbalanced classes, you can try using techniques such as oversampling or undersampling to balance the classes and improve the performance of the model.
- More advanced models: You could also try using more advanced models such as neural networks or support vector machines (SVMs) to see if they can improve the performance of the model.
These are just a few suggestions for improving the performance of the model. Depending on the specific problem and dataset, there may be other techniques that are more effective.
Result
As you can see above, ChatGPT offered many suggestions for improving model performance, such as hyperparameter tuning, ensemble methods, data preprocessing, addressing class imbalance, and using more advanced models.
This shows it has knowledge of the general approach to training a classifier.
Compared to the previously provided code, this time it added a StandardScaler for the numerical features, replaced the DecisionTreeClassifier with a RandomForestClassifier, and added a grid search to find better hyperparameters for the classifier.
This was the output when running the code:
|
|
As you can see, this newly trained classifier performs better than the first one in terms of accuracy.
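For readers who want to try the same combination, the following is a rough sketch of scaling, a random forest, and a grid search. It is not ChatGPT’s output: the synthetic data from make_classification, the small parameter grid, and the 3-fold cross-validation are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumed 70/30 class split)
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Small illustrative grid; a real search would usually cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train_scaled, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test_scaled, y_test))
```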
After confirming that ChatGPT can provide code that does the necessary data preprocessing and runs a grid search for better hyperparameters, we asked it to change the code so that the data preprocessing and the inference are done inside a pipeline.
Prompt 3
Without doing a grid search, can you give me a snippet of code that does all the necessary preprocessing steps for this specific dataset and trains a model for classification? Also, can you do the preprocessing and the inference inside of a pipeline?
ChatGPT
Sure! Here’s an example code snippet that preprocesses the data and trains a random forest classifier using a pipeline:
|
|
In this code, the preprocessor transformer applies one-hot encoding to the categorical features and scaling to the numerical features using the ColumnTransformer class.
The clf pipeline applies the preprocessor transformer and trains a random forest classifier. The fit() method is used to train the classifier on the training set, and the predict() method is used to make predictions on the testing set.
The accuracy and confusion matrix are then calculated using the accuracy_score() and confusion_matrix() functions from scikit-learn.
Result
The updated code moved the preprocessing of the numerical and categorical features into a ColumnTransformer, which was then combined with the classifier inside a Pipeline to successfully complete the task.
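The structure described here can be sketched roughly as follows. This is an illustration under stated assumptions rather than the generated code: the toy DataFrame, its column names, and the random labels 1 and 2 are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with two numerical and one categorical feature (illustrative only)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "duration": rng.integers(6, 60, size=n),
    "credit_amount": rng.integers(500, 10000, size=n),
    "purpose": rng.choice(["car", "furniture", "education"], size=n),
})
y = rng.choice([1, 2], size=n, p=[0.7, 0.3])

numeric_cols = ["duration", "credit_amount"]
categorical_cols = ["purpose"]

# Scale numerical columns and one-hot encode categorical columns
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# The classifier is the final step, so preprocessing and inference share one object
clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred[:10])
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only and are applied automatically at prediction time.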
Our next step was to compute the classification report for the resulting model, in order to see the actual performance on each class.
|
|
Comparing the F1-scores of the two predicted classes showed lower performance on the positive class, so we asked the chatbot to improve performance for that class to see what approach it would take.
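As a small illustration of comparing per-class scores, the sketch below computes a classification report and the per-class F1-scores. The labels and predictions are made-up toy values, not results from our experiment.

```python
from sklearn.metrics import classification_report, f1_score

# Toy ground truth and predictions (illustrative values only)
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2, 1, 1]

# Full report with precision, recall and F1-score for each class
print(classification_report(y_true, y_pred, labels=[1, 2],
                            target_names=["negative (1)", "positive (2)"]))

# average=None returns one F1-score per class, in the order given by labels
per_class_f1 = f1_score(y_true, y_pred, labels=[1, 2], average=None)
print(per_class_f1)
```

In this toy case the minority class ends up with a much lower F1-score than the majority class, which is the kind of gap that accuracy alone can hide.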
Prompt 4
The performance for the positive class is lower than the performance for the negative class. How can the code be modified to mitigate that?
ChatGPT
One way to address the issue of imbalanced class distribution and improve the performance of the positive class is to use class weighting in the random forest classifier. This can be done by setting the class_weight parameter of the RandomForestClassifier to 'balanced'.
Here’s the modified code snippet:
|
|
In this code, the RandomForestClassifier is initialized with the class_weight parameter set to 'balanced', which assigns weights inversely proportional to the class frequencies. This means that the positive class will have a higher weight than the negative class, which should help to improve the performance of the positive class.
The classification_report() function from scikit-learn is used to print the precision, recall, and F1-score for both classes.
Result
For this task, the only change to the previously provided code was setting the class_weight parameter of the RandomForestClassifier to “balanced”, which gives a higher weight to the positive class when the classes in the training set are imbalanced.
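To make the effect of the “balanced” setting concrete, the sketch below computes the weights scikit-learn would derive for a 700/300 class split. The split and the labels 1 and 2 are assumptions used for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed label distribution: 700 negative (1) and 300 positive (2) samples
y = np.array([1] * 700 + [2] * 300)
classes = np.unique(y)

# 'balanced' uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
for label, weight in zip(classes, weights):
    print(f"class {label}: weight {weight:.3f}")
```

With this split the minority class receives a weight of about 1.67 and the majority class about 0.71, so errors on the minority class count more than twice as much during training.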
Even though class weighting is a good strategy when dealing with an imbalanced dataset like the one we used, the performance of the resulting model for the positive class actually decreased, judging by the F1-score.
|
|
Hence, we tried asking for other ways of increasing performance without any oversampling techniques.
Prompt 5
The performance difference between the two classes is still high. What else can be done to solve that other than oversampling?
ChatGPT
If class weighting does not improve the performance of the positive class enough, there are several other techniques that can be used to address class imbalance without oversampling. Here are a few options:
- Using different performance metrics: Accuracy can be a misleading metric when dealing with imbalanced classes, as it tends to favor the majority class. Instead, metrics such as precision, recall, and F1-score should be used to evaluate the model’s performance on each class separately. This can provide a better understanding of how well the model is performing for each class and can help to identify where the model needs improvement.
- Threshold adjustment: The threshold used for classification can be adjusted to better balance the precision and recall of the positive class. By default, scikit-learn’s predict() method uses a threshold of 0.5, meaning that any predicted probability above 0.5 is classified as positive. However, this threshold can be adjusted to a higher or lower value, depending on the desired trade-off between precision and recall.
- Cost-sensitive learning: This approach involves assigning different costs to different types of errors. For example, misclassifying a positive instance as negative may be more costly than the opposite. By adjusting the costs associated with each type of error, the model can be trained to minimize the total cost of misclassification.
- Ensembling: Ensembling techniques such as bagging and boosting can be used to improve the performance of the model on the positive class. By combining multiple models, each trained on a different subset of the data or using a different algorithm, the model can capture a wider range of patterns in the data and achieve better performance.
Here’s an example of how to adjust the classification threshold:
|
|
In this code, the predict_proba() method is used to get the predicted probabilities of the positive class. The threshold for classification is then set to 0.4, meaning that any predicted probability above 0.4 is classified as positive. This threshold can be adjusted to achieve the desired trade-off between precision and recall.
Result
The code was modified so that after training, the predicted probabilities given by the model for the test set are used together with a threshold of 0.4 to obtain new predictions for the data. The response also contained a correct explanation of how the value of the threshold can provide a trade-off between precision and recall.
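The threshold idea can be sketched as follows. This is not the code ChatGPT returned: the synthetic data, the 0/1 labels, and the forest settings are assumptions for the sake of a self-contained example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with labels 0 (negative) and 1 (positive)
X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class here
proba_pos = clf.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.5, 0.4):
    # Anything above the threshold is classified as positive
    y_pred = (proba_pos > threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only add positive predictions, so recall for the positive class stays the same or goes up, usually at the cost of some precision.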
Although ChatGPT’s answer was correct, we wanted the threshold adjustment to happen inside the prediction pipeline. This is where ChatGPT began having problems providing code that would run without errors and achieve the desired outcome.
Prompt 6
Can you modify the code so that the threshold adjustment is inside of the pipeline?
ChatGPT
Sure! Here’s an example of how to adjust the classification threshold inside a pipeline:
In this code, the classification threshold is set to 0.4 inside the pipeline using a custom transformer. Here’s the code for the transformer:
|
|
This transformer takes the classifier (clf) as input and returns the predicted classes based on the specified threshold. The ThresholdAdjuster transformer can then be added to the pipeline like this:
|
|
Result
This time around, an instance of a custom class was added to the pipeline after the RandomForestClassifier, but this code could not run without errors: every intermediate pipeline step must implement both fit and transform, and the RandomForestClassifier has no transform method, so no other component can follow it in the pipeline.
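The failure is easy to reproduce in isolation. The sketch below places a classifier in the middle of a pipeline and catches the resulting TypeError; the synthetic data and the StandardScaler used as the trailing step are assumptions, standing in for the custom class from the generated code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=42)

error_message = None
try:
    # The classifier is an intermediate step here, which a Pipeline rejects
    # because RandomForestClassifier has no transform method
    bad_pipeline = Pipeline([
        ("classifier", RandomForestClassifier(random_state=42)),
        ("after_classifier", StandardScaler()),
    ])
    bad_pipeline.fit(X, y)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```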
However, it’s worth mentioning that although the provided code had an error, it’s still pretty impressive that it knew how to properly define a custom class by extending scikit-learn classes and overriding some of their methods in the attempt to achieve the desired result.
After several attempts to obtain a different output from ChatGPT by regenerating the response, the error persisted, so we explained what the problem was in the hope of getting a working solution.
Prompt 7
The ThresholdAdjuster can’t be added in the pipeline after the RandomForestClassifier because the RandomForestClassifier does not have the methods fit and transform.
ChatGPT
You’re right, I apologize for the mistake. Here’s a modified version of the code where the ThresholdAdjuster is placed after the RandomForestClassifier in the pipeline:
|
|
In this modified code, the ThresholdAdjuster transformer is now defined inside the pipeline as a class, and it takes the threshold parameter as input. The predict method of the ThresholdAdjuster transformer is then used to adjust the threshold of the predicted classes. Finally, the pipeline is fitted and the threshold-adjusted predictions are made on the testing set.
Result
After acknowledging the mistake, the response went on to say that the ThresholdAdjuster would still be added to the pipeline after the RandomForestClassifier.
This time, the only difference between the initial code and the current one was renaming the transform method to predict, but the provided code still produced the same error as before.
Instead of just explaining why the code fails, we tried adding the returned error to see if the bot would get a better understanding of the problem and how to fix it.
Did it do better this time around?
Prompt 8
When I tried running the code I got the following error: TypeError: “All intermediate steps should be transformers and implement fit and transform or be the string ‘passthrough’ ‘RandomForestClassifier(class_weight=‘balanced’, random_state=42)’ (type <class ‘sklearn.ensemble._forest.RandomForestClassifier’>) doesn’t”. How can the code be modified to fix this?
ChatGPT
This error message is occurring because RandomForestClassifier does not have a transform method, which is required by scikit-learn pipelines.
To fix this, we can wrap the RandomForestClassifier inside a FunctionTransformer which converts the predict method of the RandomForestClassifier to a transformer that implements the transform method. Here’s the updated code:
|
|
In this updated code, we use FunctionTransformer to transform the predict method of RandomForestClassifier into a transformer that can be used in the pipeline. We also move the ThresholdAdjuster transformer after the classifier transformer in the pipeline.
Result
This time around, the generated code produced a different error: “NotFittedError: This RandomForestClassifier instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this estimator.”
This happened because the RandomForestClassifier in the pipeline had been replaced with a FunctionTransformer that uses the predict method of a RandomForestClassifier instance that had not been fitted yet.
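A minimal reproduction of this second failure, assuming synthetic data and a bare FunctionTransformer rather than the full generated pipeline, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import FunctionTransformer

X, y = make_classification(n_samples=100, random_state=42)

# The forest is created but never fitted
rf = RandomForestClassifier(random_state=42)

# Wrapping rf.predict only stores a reference to the method;
# fitting the transformer does not fit the forest behind it
wrapped = FunctionTransformer(rf.predict)

not_fitted = False
try:
    wrapped.fit_transform(X, y)
except NotFittedError:
    not_fitted = True

print("NotFittedError raised:", not_fitted)
```

The wrapper makes the step look like a transformer to the pipeline, but nothing in the pipeline ever calls fit on the underlying forest, so the first call to predict fails.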
Once again, we gave the chatbot the returned error to see if it could find a working solution.
Prompt 9
When I run the code now I get a different error: “NotFittedError: This RandomForestClassifier instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this estimator.". How can the code be modified to fix this?
ChatGPT
This error can occur if the ThresholdAdjuster transformer is called before the RandomForestClassifier has been fitted.
To fix this, we can define a custom estimator that wraps the RandomForestClassifier and adds a predict_proba method to it. This will allow us to fit the entire pipeline, including the ThresholdAdjuster, before making predictions.
Here’s the updated code:
|
|
Result
This time, the solution was to first create a custom class, CustomRandomForestClassifier, that overrides the predict_proba method to return the predicted probabilities for the positive class only, and then to modify the fit method of the ThresholdAdjuster to instantiate and fit an instance of the custom random forest classifier.
It also modified the predict method of the ThresholdAdjuster to obtain the prediction probabilities by calling predict_proba on its CustomRandomForestClassifier instance (which already returns the probabilities for the positive class only) and then to select the positive-class probabilities once more, which raised the following error: “IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed”.
This error could have been avoided either by using an instance of RandomForestClassifier instead of the custom one, or by not selecting the positive-class probabilities a second time (replace
|
|
with
|
|
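The IndexError itself comes down to indexing a one-dimensional array as if it were two-dimensional. A small NumPy sketch with made-up probabilities shows the same failure:

```python
import numpy as np

# Shape (n_samples, 2), like the output of a standard predict_proba call
proba = np.array([[0.8, 0.2], [0.3, 0.7], [0.55, 0.45]])

# First selection: probabilities of the second class, now a 1-D array
proba_pos = proba[:, 1]
print(proba_pos.shape)

message = None
try:
    # Second selection on the 1-D array fails
    proba_pos[:, 1]
except IndexError as exc:
    message = str(exc)

print(message)
```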
Another, more subtle error was the assumption that the labels of the negative and positive classes are [0, 1]. The provided code (with the mentioned changes) would produce the correct result only if the labels of the negative and positive classes are the same as the indices of those classes (which was not the case for this dataset, as the values of the labels were “1” and “2”).
The way to solve this was either to encode the labels from the start so that they match the indices, or to change the line of code in the predict method of ThresholdAdjuster from
|
|
to
|
|
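Putting the fixes together, one way to get threshold adjustment inside a pipeline is to wrap the classifier so that the wrapper itself is the final step. The sketch below is our own illustration, not ChatGPT’s code: the class name ThresholdClassifier, its pos_label parameter, and the synthetic data relabeled to 1 and 2 are all assumptions, and it handles the binary case only.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that thresholds a binary classifier's probabilities."""

    def __init__(self, estimator, threshold=0.5, pos_label=2):
        self.estimator = estimator
        self.threshold = threshold
        self.pos_label = pos_label

    def fit(self, X, y):
        # Fit a fresh copy of the wrapped classifier as part of the pipeline fit
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def predict(self, X):
        # Look up the positive-class column through classes_ instead of
        # assuming that the labels are [0, 1]
        pos_idx = int(np.flatnonzero(self.classes_ == self.pos_label)[0])
        neg_label = self.classes_[1 - pos_idx]
        proba_pos = self.predict_proba(X)[:, pos_idx]
        return np.where(proba_pos > self.threshold, self.pos_label, neg_label)


# Synthetic data relabeled to 1 (negative) and 2 (positive)
X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=42)
y = y + 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                random_state=42)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", ThresholdClassifier(forest, threshold=0.4)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(np.unique(y_pred, return_counts=True))
```

Because the wrapper is the last step, the pipeline never needs a transform method from the classifier, the forest is fitted during pipeline.fit, and the predictions come back in the dataset’s own labels.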
Conclusion
Overall, ChatGPT is very good at providing basic code for training models. It has knowledge of publicly available datasets and code libraries, and it’s very good at describing different approaches and techniques used in the process of training a machine learning model in detail.
By being able to create code templates, ML engineers can accelerate their iteration cycle significantly, as opposed to having to write every piece of code from scratch.
However, when faced with the task of providing a more complex implementation, ChatGPT struggles to give a correct solution or code that runs without errors. Another limitation lies in its inability to autonomously build upon its own prior ideas and deductions without guidance.
ChatGPT provides effective general approaches to address the problem at hand, but it lacks the ability to determine the correct specific steps to take from the outset, without user testing and feedback on its responses.
On top of this, there were some instances when it continued to produce the same error, even when the issue with the provided code was explained.
Obtaining significant insights is a challenging process for humans, as it involves building upon prior experiences and hard-won knowledge. However, ChatGPT is yet to develop this ability and depends on a “prompt engineer” for direction.
Our takeaway?
Despite being a valuable starting point for Machine Learning concepts and strategies, GPT-3.5 currently lacks the cognitive depth required for self-sufficient ML engineering.
Needless to say, one should always err on the side of caution when building ML models and remain committed to the responsible use of AI. Generative AI is evolving very fast, with GPT-4 already considered much more capable of complex reasoning than its predecessor.
Keeping this in mind, we do believe that ChatGPT has a place in the future of Machine Learning and, as we move towards more sophisticated generative AI models, it is our responsibility to ensure accountability in a rapidly evolving technological landscape.
At Lumenova AI, we have made it our mission to empower companies to make Responsible AI a part of their DNA. In an age where opaque decision-making is no longer enough, we are committed to delivering value through our state-of-the-art AI Trust Platform that enables businesses to make AI ethical, fair, and transparent.