How can we get a better model?

In the previous post, we built a single-layer neural network that reached an impressive test accuracy of 98.2%. That said, we relied mostly on default settings: we trained with a cross-entropy loss for 50 epochs, we used the 'Adam' optimizer, and we didn't set a batch size. If we got such impressive results out of the box, can we do better? And if so, how?

Enter hyperparameter tuning. Hyperparameter tuning is the process of searching through combinations of neural net hyperparameters to find the ones that perform best. There are a number of strategies for doing this efficiently; here, we use scikit-learn's GridSearchCV class, which tries every combination in a grid and scores each one with k-fold cross-validation (the cells below use cv=10, or cv=5 in one case).
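To make the idea concrete, here is a minimal, standard-library-only sketch of what a grid search with k-fold cross-validation does. It is not scikit-learn's implementation: `k_fold_indices`, `grid_search`, and `toy_score` are hypothetical names, the folds are plain contiguous slices (scikit-learn uses stratified folds for classifiers), and `toy_score` is a made-up stand-in for "train a model and return its validation accuracy".

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return (best_mean_score, best_params) over every combination in the grid."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Score this candidate once per fold, then average the fold scores.
        fold_scores = [score_fn(params, train, test)
                       for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_score, best_params = mean_score, params
    return best_score, best_params


def toy_score(params, train_idx, test_idx):
    """Hypothetical scorer: pretends epochs=20 and batch_size=10 are ideal."""
    return 1.0 - abs(params["epochs"] - 20) / 100 - abs(params["batch_size"] - 10) / 100


toy_grid = {"epochs": [10, 20, 30], "batch_size": [10, 20, 30]}
best_score, best_params = grid_search(toy_grid, toy_score, n_samples=100, k=5)
print(best_score, best_params)
```

The cost is the thing to notice: every candidate is fit once per fold, so a grid with 9 combinations and 5 folds already means 45 fits.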

This post was interesting because it begins to get at the limits of ChatGPT4, at least insofar as my prompting is concerned. Read on to see where ChatGPT fails in this notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})

%matplotlib inline

# Render inline figures as PNG, including high-resolution (retina) output.
%config InlineBackend.figure_formats = {'png', 'retina'}

# named rc_params so that it does not shadow the `rc` function imported above
rc_params = {'lines.linewidth': 2,
             'axes.labelsize': 18,
             'axes.titlesize': 18,
             'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc_params)
sns.set_style("dark")

mpl.rcParams['xtick.labelsize'] = 16 
mpl.rcParams['ytick.labelsize'] = 16 
mpl.rcParams['legend.fontsize'] = 14

Data

Reload the data we used previously:

In [2]:
# chatgpt suggested:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
data = pd.read_csv(url, header=None)

# The first column is an ID (we can ignore this), the second column is the label (M = malignant, B = benign), 
# and the rest are features
labels = data.iloc[:, 1]
features = data.iloc[:, 2:]

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)
features_train, features_test, labels_train, labels_test =\
    train_test_split(features_scaled, labels_encoded,
                     test_size=0.2, random_state=42)

I gave ChatGPT the minimal code from the previous post and asked it to help me write a hyperparameter tuning script. This is what it came up with:

In [3]:
# ChatGPT suggested the following import, but it is deprecated:
#from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
# the wrapper now lives in the scikeras package:
from scikeras.wrappers import KerasClassifier

# from here, all chatgpt except choice of hyperparams, which is mine:
from sklearn.model_selection import GridSearchCV

def create_model(optimizer='adam', loss='binary_crossentropy'):
    m = Sequential()
    m.add(Dense(1, activation='sigmoid', input_shape=(features_train.shape[1],)))
    m.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    return m

# hyperparameters to tune over
param_grid = {
    'optimizer': ['SGD', 'RMSprop', 'Adam'],
    'epochs': [10, 20, 30],
    'batch_size': [10, 20, 30],
}

# do the grid search
model = KerasClassifier(model=create_model, verbose=0)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=10)
grid_result = grid.fit(features_train, labels_train)

print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
WARNING:tensorflow:5 out of the last 15 calls to <function Model.make_predict_function.<locals>.predict_function at 0x17caab310> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:5 out of the last 13 calls to <function Model.make_predict_function.<locals>.predict_function at 0x10f3c9700> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
Best: 0.9757487922705315 using {'batch_size': 20, 'epochs': 30, 'optimizer': 'SGD'}

Next, I asked it to fit the best model to the data, given the hyperparameter tuning results:

In [4]:
# Extract the best parameters
best_params = grid_result.best_params_

# Train the model with the best parameters
model = create_model(best_params['optimizer'])
history = model.fit(features_train, labels_train,
                    epochs=best_params['epochs'],
                    batch_size=best_params['batch_size'],
                    verbose=0)

import matplotlib.pyplot as plt

# Plot training accuracy
plt.figure(figsize=(12, 6))
plt.plot(history.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()

# Plot training loss
plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()
In [5]:
loss, accuracy = model.evaluate(features_test, labels_test, verbose=0)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")
Test loss: 0.10274569690227509
Test accuracy: 0.9824561476707458
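It helps to translate that accuracy back into a number of samples. The short calculation below is a sketch that assumes the WDBC dataset has 569 rows (this is not printed anywhere in the notebook) and that `train_test_split` rounds the test set size up, which gives 114 test samples for `test_size=0.2`.

```python
import math

n_samples = 569        # assumed size of the WDBC dataset
test_fraction = 0.2    # test_size used in train_test_split above
n_test = math.ceil(n_samples * test_fraction)

reported_accuracy = 0.9824561476707458   # value printed by model.evaluate
n_correct = round(reported_accuracy * n_test)
n_wrong = n_test - n_correct

# Smallest possible change in accuracy: one test sample flipping.
one_sample = 1 / n_test

print(f"{n_test} test samples, {n_correct} correct, {n_wrong} wrong")
print(f"one sample is worth {one_sample:.2%} accuracy")
```

Under these assumptions, the model misclassifies only a couple of test samples, and a single sample moves the accuracy by a little under one percentage point. Small differences in test accuracy on this dataset should therefore be read with caution.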

Wow! We went from 94% accuracy to 98.2% accuracy, just by tuning the model a tiny bit! That is seriously cool--and ChatGPT did most of the work!

One of the things I was wondering about was whether we could also tune the choice of loss function, and whether a multi-layer NN might outperform our very simple single-layer network. To study this, I modified the param_grid dictionary and rewrote the create_model function (with Chat's help) to have a variable number of ReLU layers. Then, I asked Chat to hyperparameter optimize this new function.

In [10]:
def create_model(optimizer='adam', loss='binary_crossentropy', num_layers=1):
    model = Sequential()

    # add relu layers
    for _ in range(num_layers):
        model.add(Dense(10, activation='relu'))

    # final layer for classification:
    model.add(Dense(1, activation='sigmoid'))
    # compile
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
    return model

param_grid = {
    'optimizer': ['SGD'],
    'loss': ['binary_crossentropy', 'hinge'],
    'epochs': [50, 100, 150],
    'batch_size': [10, 20],
    'num_layers': [1, 2]
}

model = KerasClassifier(model=create_model, verbose=0)

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_result = grid.fit(features_train, labels_train)

print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 20
     11 param_grid = {
     12     'optimizer': ['SGD'],
     13     'loss': ['binary_crossentropy', 'hinge'],
   (...)
     16     'num_layers': [1, 2]
     17 }
     19 grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
---> 20 grid_result = grid.fit(features_train, labels_train)
     22 print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_search.py:874, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    868     results = self._format_results(
    869         all_candidate_params, n_splits, all_out, all_more_results
    870     )
    872     return results
--> 874 self._run_search(evaluate_candidates)
    876 # multimetric is determined here because in the case of a callable
    877 # self.scoring the return type is only known after calling
    878 first_test_score = all_out[0]["test_scores"]

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_search.py:1388, in GridSearchCV._run_search(self, evaluate_candidates)
   1386 def _run_search(self, evaluate_candidates):
   1387     """Search all candidates in param_grid"""
-> 1388     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_search.py:821, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    813 if self.verbose > 0:
    814     print(
    815         "Fitting {0} folds for each of {1} candidates,"
    816         " totalling {2} fits".format(
    817             n_splits, n_candidates, n_candidates * n_splits
    818         )
    819     )
--> 821 out = parallel(
    822     delayed(_fit_and_score)(
    823         clone(base_estimator),
    824         X,
    825         y,
    826         train=train,
    827         test=test,
    828         parameters=parameters,
    829         split_progress=(split_idx, n_splits),
    830         candidate_progress=(cand_idx, n_candidates),
    831         **fit_and_score_kwargs,
    832     )
    833     for (cand_idx, parameters), (split_idx, (train, test)) in product(
    834         enumerate(candidate_params), enumerate(cv.split(X, y, groups))
    835     )
    836 )
    838 if len(out) < 1:
    839     raise ValueError(
    840         "No fits were performed. "
    841         "Was the CV iterator empty? "
    842         "Were there no candidates?"
    843     )

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/utils/parallel.py:63, in Parallel.__call__(self, iterable)
     58 config = get_config()
     59 iterable_with_config = (
     60     (_with_config(delayed_func, config), args, kwargs)
     61     for delayed_func, args, kwargs in iterable
     62 )
---> 63 return super().__call__(iterable_with_config)

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/parallel.py:1085, in Parallel.__call__(self, iterable)
   1076 try:
   1077     # Only set self._iterating to True if at least a batch
   1078     # was dispatched. In particular this covers the edge
   (...)
   1082     # was very quick and its callback already dispatched all the
   1083     # remaining jobs.
   1084     self._iterating = False
-> 1085     if self.dispatch_one_batch(iterator):
   1086         self._iterating = self._original_iterator is not None
   1088     while self.dispatch_one_batch(iterator):

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/parallel.py:901, in Parallel.dispatch_one_batch(self, iterator)
    899     return False
    900 else:
--> 901     self._dispatch(tasks)
    902     return True

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/parallel.py:819, in Parallel._dispatch(self, batch)
    817 with self._lock:
    818     job_idx = len(self._jobs)
--> 819     job = self._backend.apply_async(batch, callback=cb)
    820     # A job can complete so quickly than its callback is
    821     # called before we get here, causing self._jobs to
    822     # grow. To ensure correct results ordering, .insert is
    823     # used (rather than .append) in the following line
    824     self._jobs.insert(job_idx, job)

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
    206 def apply_async(self, func, callback=None):
    207     """Schedule a func to be run"""
--> 208     result = ImmediateResult(func)
    209     if callback:
    210         callback(result)

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/_parallel_backends.py:597, in ImmediateResult.__init__(self, batch)
    594 def __init__(self, batch):
    595     # Don't delay the application, to avoid keeping the input
    596     # arguments in memory
--> 597     self.results = batch()

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/parallel.py:288, in BatchedCalls.__call__(self)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/joblib/parallel.py:288, in <listcomp>(.0)
    284 def __call__(self):
    285     # Set the default nested backend to self._backend but do not set the
    286     # change the default number of processes to -1
    287     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 288         return [func(*args, **kwargs)
    289                 for func, args, kwargs in self.items]

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/utils/parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
    121     config = {}
    122 with config_context(**config):
--> 123     return self.function(*args, **kwargs)

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:674, in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
    671     for k, v in parameters.items():
    672         cloned_parameters[k] = clone(v, safe=False)
--> 674     estimator = estimator.set_params(**cloned_parameters)
    676 start_time = time.time()
    678 X_train, y_train = _safe_split(estimator, X, y, train)

File ~/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py:1168, in BaseWrapper.set_params(self, **params)
   1164             super().set_params(**{param: value})
   1165         except ValueError:
   1166             # Give a SciKeras specific user message to aid
   1167             # in moving from the Keras wrappers
-> 1168             raise ValueError(
   1169                 f"Invalid parameter {param} for estimator {self.__name__}."
   1170                 "\nThis issue can likely be resolved by setting this parameter"
   1171                 f" in the {self.__name__} constructor:"
   1172                 f"\n`{self.__name__}({param}={value})`"
   1173                 "\nCheck the list of available parameters with"
   1174                 " `estimator.get_params().keys()`"
   1175             ) from None
   1176 return self

ValueError: Invalid parameter num_layers for estimator KerasClassifier.
This issue can likely be resolved by setting this parameter in the KerasClassifier constructor:
`KerasClassifier(num_layers=1)`
Check the list of available parameters with `estimator.get_params().keys()`

And it failed! So I gave the error message to ChatGPT, and it was completely unable to fix the problem. To me, that is surprising: the error message itself suggests a fix that in fact works. Simply pasting the error message and asking ChatGPT to fix the code (prompt: Please fix the code that is giving this error message) led to increasingly worse solutions.
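For reference, the fix that the error message points at might look like the sketch below. This is not code that was run in this notebook. It passes `num_layers=1` to the `KerasClassifier` constructor, exactly as the message suggests, so that `num_layers` becomes a parameter the grid search is allowed to set. The names `build_model` and `make_search` are hypothetical. One further assumption, based on my reading of the SciKeras documentation and untested here: if the build function returns an uncompiled model, SciKeras compiles it with the wrapper's own `loss` and `optimizer`, which is what lets the grid vary them. The heavy imports sit inside the functions so the grid itself can be inspected without TensorFlow installed.

```python
import math

# Every key here must be a parameter that KerasClassifier knows about.
# num_layers qualifies only because it is passed to the constructor below.
param_grid = {
    "optimizer": ["SGD"],
    "loss": ["binary_crossentropy", "hinge"],
    "epochs": [50, 100, 150],
    "batch_size": [10, 20],
    "num_layers": [1, 2],
}
n_candidates = math.prod(len(values) for values in param_grid.values())


def build_model(num_layers=1):
    """Return an UNCOMPILED network with `num_layers` hidden ReLU layers."""
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    for _ in range(num_layers):
        model.add(Dense(10, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    # Assumption: SciKeras compiles this model using the wrapper's loss/optimizer.
    return model


def make_search(cv=5):
    """Build the grid search; nothing is fit until .fit() is called."""
    from scikeras.wrappers import KerasClassifier
    from sklearn.model_selection import GridSearchCV

    clf = KerasClassifier(model=build_model, num_layers=1,
                          metrics=["accuracy"], verbose=0)
    return GridSearchCV(estimator=clf, param_grid=param_grid, cv=cv)


# Usage (requires TensorFlow and SciKeras):
# grid_result = make_search().fit(features_train, labels_train)
```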

Eventually, I remembered that zero-shot prompts that ask these LLMs to reason about their logic and enumerate steps frequently perform better. So, I prompted Chat to "Please reason about the error message by breaking it into pieces. Then suggest a solution based on this analysis of the error message", and it output the following (very hacky) code:

In [10]:
from sklearn.model_selection import GridSearchCV

def create_model_func(num_layers=1):
    """A wrapper around `create_model`, which specifies how many layers `create_model` should have"""
    
    def create_model(optimizer='adam', loss='binary_crossentropy'):
        model = Sequential()

        # here chat left out the input shape. Only the first layer needs it:
        # Keras infers the input size of every later layer from the layer before it.

        # add layers. the architecture for this network goes from
        # M features --> 10 units with relu activation --> 10 .... --> 1 sigmoid node
        for i in range(num_layers):
            if i == 0:
                model.add(Dense(10, activation='relu', input_shape=(features_train.shape[1],)))
            else:
                model.add(Dense(10, activation='relu'))
        # the output layer always reads from the last 10-unit hidden layer
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
        return model

    #return a function with the number of layers pre-specified
    return create_model

param_grid = {
    'optimizer': ['SGD', 'Adam'],
    'loss': ['binary_crossentropy', 'hinge'],
    'epochs': [20, 40, 60],
    'batch_size': [10, 20, 30],
}

models = []
for num_layers in [1, 2]:
    model = KerasClassifier(model=create_model_func(num_layers), verbose=0)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=10)
    grid_result = grid.fit(features_train, labels_train)
    print(f"Best for {num_layers} layers: {grid_result.best_score_} using {grid_result.best_params_}")
    models.append(grid_result)
/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning: 
180 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
180 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 1494, in fit
    super().fit(X=X, y=y, sample_weight=sample_weight, **kwargs)
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 762, in fit
    self._fit(
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 929, in _fit
    self._check_model_compatibility(y)
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 571, in _check_model_compatibility
    raise ValueError(
ValueError: loss=hinge but model compiled with binary_crossentropy. Data may not match loss function!

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite: [0.97130435 0.96917874        nan        nan 0.98024155 0.98246377
        nan        nan 0.97357488 0.96917874        nan        nan
 0.9626087  0.97806763        nan        nan 0.97140097 0.97144928
        nan        nan 0.9757971  0.97362319        nan        nan
 0.96483092 0.94942029        nan        nan 0.97362319 0.97135266
        nan        nan 0.97362319 0.9736715         nan        nan]
  warnings.warn(
Best for 1 layers: 0.9824637681159419 using {'batch_size': 10, 'epochs': 40, 'loss': 'binary_crossentropy', 'optimizer': 'Adam'}
/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning: 
180 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
180 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 1494, in fit
    super().fit(X=X, y=y, sample_weight=sample_weight, **kwargs)
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 762, in fit
    self._fit(
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 929, in _fit
    self._check_model_compatibility(y)
  File "/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/scikeras/wrappers.py", line 571, in _check_model_compatibility
    raise ValueError(
ValueError: loss=hinge but model compiled with binary_crossentropy. Data may not match loss function!

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/Users/davidangeles/opt/anaconda3/envs/gene_expression_env/lib/python3.9/site-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite: [0.97801932 0.9736715         nan        nan 0.97352657 0.97797101
        nan        nan 0.97806763 0.97149758        nan        nan
 0.97144928 0.96913043        nan        nan 0.97570048 0.96483092
        nan        nan 0.96700483 0.97352657        nan        nan
 0.9736715  0.96048309        nan        nan 0.9647343  0.97801932
        nan        nan 0.97574879 0.96690821        nan        nan]
  warnings.warn(
Best for 2 layers: 0.9780676328502416 using {'batch_size': 10, 'epochs': 60, 'loss': 'binary_crossentropy', 'optimizer': 'SGD'}

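According to this search, it would be best to use a single layer! The margin is small, though: the best mean cross-validated accuracy is about 0.982 for one layer versus 0.978 for two.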
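The warnings above deserve a second look: each search reports that 180 of its 360 fits failed. The sketch below rebuilds the grid with the standard library only and counts which candidates those are. It assumes scikit-learn enumerates the grid with keys in sorted order and the last key varying fastest, which matches the position of the `nan` entries in the printed score arrays.

```python
from itertools import product

param_grid = {
    "optimizer": ["SGD", "Adam"],
    "loss": ["binary_crossentropy", "hinge"],
    "epochs": [20, 40, 60],
    "batch_size": [10, 20, 30],
}
cv_folds = 10

# Enumerate candidates: sorted keys, last key changes fastest.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

total_fits = len(candidates) * cv_folds
hinge_fits = sum(c["loss"] == "hinge" for c in candidates) * cv_folds

# Which positions in the score array should be nan if every hinge fit fails?
nan_pattern = ["nan" if c["loss"] == "hinge" else "ok" for c in candidates]

print(total_fits, hinge_fits)
print(nan_pattern[:8])
```

The counts line up with the warning: 36 candidates times 10 folds is 360 fits, and the 18 hinge candidates account for exactly the 180 failures. Combined with the error text ("loss=hinge but model compiled with binary_crossentropy"), this suggests the `loss` value never reached `create_model`, so the search only ever compared binary cross-entropy models and did not actually tune the loss function.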