import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc
rc('text', usetex=True)
rc('text.latex', preamble=r'\usepackage{cmbright}')
rc('font', **{'family': 'sans-serif', 'sans-serif': ['Helvetica']})
%matplotlib inline
# This enables SVG graphics inline.
%config InlineBackend.figure_formats = {'png', 'retina'}
# Use a distinct name so we don't shadow the rc() function imported above.
rc_params = {'lines.linewidth': 2,
             'axes.labelsize': 18,
             'axes.titlesize': 18,
             'axes.facecolor': '#DFDFE5'}
sns.set_context('notebook', rc=rc_params)
sns.set_style("dark")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['legend.fontsize'] = 14
Throughout this blog post, I have tried to be explicit about what ChatGPT wrote and what I wrote. I think it's likely going to be important for us to cite ChatGPT (science, people!), but also to train ourselves to notice common errors or discrepancies between the way a person would solve a problem and the way a foundation model would solve it.
# chatgpt suggested:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
data = pd.read_csv(url, header=None)
# The first column is an ID (we can ignore this), the second column is the label (M = malignant, B = benign),
# and the rest are features
labels = data.iloc[:, 1]
features = data.iloc[:, 2:]
# chat suggested:
## Preview the data
#print(features.head())
#print(labels.head())
# I preferred:
print(data[1].unique())
data.head(1)
['M' 'B']
|    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|----|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |

1 rows × 32 columns
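A quick sanity check of my own (not part of the ChatGPT transcript) is to confirm the ID/label/feature layout and the class balance numerically before doing anything else. A minimal sketch, assuming the data, features, and labels variables defined above:

# Sanity check (my addition): 1 ID column + 1 label column + 30 numeric features,
# and a count of each diagnosis class.
print(data.shape)             # (n_samples, 32)
print(features.shape)         # (n_samples, 30)
print(labels.value_counts())  # counts of 'B' (benign) and 'M' (malignant)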
# chat suggested:
# sns.countplot(labels), which did not work.
# upon prompting, chat suggested the following
# line, which DID work:
sns.countplot(x=labels)
<Axes: xlabel='1', ylabel='count'>
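A side note from me (not ChatGPT): the x-axis label above reads '1' because the Series is still named by its column index. A minimal tweak, assuming the same seaborn call as above, is to give the Series a readable name first:

# My tweak, not ChatGPT's: rename the Series so the axis label is informative.
sns.countplot(x=labels.rename('diagnosis'))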
# chat wrote everything in the next few cells:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)
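One detail worth spelling out (my words, not ChatGPT's): LabelEncoder assigns integers to the sorted class labels, so benign ('B') becomes 0 and malignant ('M') becomes 1. A quick check of the mapping:

# Not in the ChatGPT transcript: inspect the label mapping LabelEncoder produced.
print(le.classes_)                                        # ['B' 'M']
print(dict(zip(le.classes_, le.transform(le.classes_))))  # {'B': 0, 'M': 1}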
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features_scaled, labels_encoded,
                     test_size=0.2, random_state=42)
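An option I would consider here (my addition, not something ChatGPT proposed): train_test_split accepts a stratify argument that preserves the benign/malignant proportions in both the train and test sets, which is helpful for moderately imbalanced data like this. A drop-in variant of the call above:

# Stratified variant (a sketch, not from the transcript): keep class proportions
# roughly equal across the train and test splits.
features_train, features_test, labels_train, labels_test = \
    train_test_split(features_scaled, labels_encoded,
                     test_size=0.2, random_state=42,
                     stratify=labels_encoded)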
For this next section, ChatGPT suggested a few lines of code with little to no explanation of its choices. Upon prompting it to explain its choices, it managed to do so quite well. The following code and text are entirely ChatGPT's output:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
2023-05-21 02:44:44.645900: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
These lines import the required modules. TensorFlow is a powerful library for numerical computation, particularly well-suited for large-scale ML. Sequential specifies a linear stack of layers that you can easily create a neural network model with, and Dense from tensorflow.keras.layers is the regular fully-connected neural network layer.
model = Sequential()
This line creates an instance of a Sequential model. This will be our neural network model that we will add layers to. It's called "sequential" because it allows you to build a model layer by layer, in a sequential manner. Each layer has weights that correspond to the layer that follows it.
model.add(
    Dense(1, activation='sigmoid',
          input_shape=(features_train.shape[1],))
)
Above, we are adding a Dense layer to our model. This layer is fully connected, meaning every neuron in the previous layer is connected to every neuron in this layer.
The arguments passed to the Dense function dictate how the layer will be constructed:

- The first argument, 1, specifies the number of neurons in the layer. DAVID'S WORDS: we chose a single neuron since we are building a simple, one-layer network which outputs a single classification value (M or B).
- The activation argument is used to specify the activation function for the layer. Here we use a sigmoid function because it outputs values between 0 and 1, which is perfect for binary classification problems. Other common choices are relu, tanh and softmax (useful for multi-class classification problems).
- The input_shape argument specifies the shape of the input to this layer, which in this case is the number of features in our data.

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
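An aside in my own words: this single sigmoid neuron computes sigmoid(w·x + b) over the 30 scaled features, so the model is essentially logistic regression with 31 trainable parameters (30 weights plus 1 bias). A sketch of an equivalent, more compact way to define the same architecture, assuming the Keras API imported above:

# Equivalent compact definition (my sketch, not from the ChatGPT transcript):
# pass the layer list directly to Sequential instead of calling model.add().
model_alt = Sequential([
    Dense(1, activation='sigmoid',
          input_shape=(features_train.shape[1],))
])
model_alt.summary()   # should report 31 trainable parameters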
The compile function configures the model for training. It receives three important arguments:
The loss function. This is the objective that the model will try to minimize. It's also called the cost function. For binary classification problems, 'binary_crossentropy' is a common choice. It calculates the cross-entropy loss between true labels and predicted labels, which is a good measure for classification tasks.
The optimizer. This is the algorithm that the model uses to minimize the loss function. There are several optimizers available, but 'adam' is a good default choice. Adam stands for Adaptive Moment Estimation, and it's an algorithm for gradient descent optimization. It's computationally efficient and requires little memory.
The list of metrics. These are the measures of quality that the model will track during training and testing. Since we are working on a binary classification problem, 'accuracy' (the proportion of correct predictions) is a reasonable metric to use.
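To make the loss concrete (my addition, not ChatGPT's text): binary cross-entropy is the mean of -[y*log(p) + (1-y)*log(1-p)] over the samples, where y is the true label and p is the predicted probability. A tiny NumPy sketch with made-up numbers:

# Hand-computed binary cross-entropy on hypothetical values (illustration only).
y_true = np.array([1, 0, 1, 1])           # hypothetical true labels
y_prob = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical predicted probabilities
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce)   # confident correct predictions keep this small; confident wrong ones inflate it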
history = model.fit(features_train, labels_train,
                    epochs=50,
                    validation_split=0.2)
Epoch 1/50
12/12 [==============================] - 0s 12ms/step - loss: 0.4670 - accuracy: 0.8242 - val_loss: 0.4572 - val_accuracy: 0.8132
Epoch 2/50
12/12 [==============================] - 0s 3ms/step - loss: 0.4250 - accuracy: 0.8599 - val_loss: 0.4166 - val_accuracy: 0.8571
...
Epoch 49/50
12/12 [==============================] - 0s 3ms/step - loss: 0.1133 - accuracy: 0.9698 - val_loss: 0.1341 - val_accuracy: 0.9670
Epoch 50/50
12/12 [==============================] - 0s 3ms/step - loss: 0.1121 - accuracy: 0.9698 - val_loss: 0.1331 - val_accuracy: 0.9670
David's words: This next section is interesting: ChatGPT cut off at the end, and I had to prompt it to finish. Here is the output:
The fit function is where the training process takes place. It receives the training data and labels, the number of epochs, and the validation split.
features_train and labels_train are the input data and the corresponding labels for training.

epochs is the number of times the training process will pass through the entire dataset. Here, we set it to 50.

`validation_split

I apologize for the abrupt cutoff earlier. Let's continue the explanation.

The validation_split=0.2 argument in the fit function is used to set aside a fraction of the training data for validation. The value 0.2 means 20% of the training data will be used as validation data. The model will then not train on this data and instead use it to evaluate the loss and any model metrics at the end of each epoch. This helps monitor the model's performance and prevent overfitting on the training data.

plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
<matplotlib.legend.Legend at 0x11de69190>
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
<matplotlib.legend.Legend at 0x1736f1a90>
loss, accuracy = model.evaluate(features_test, labels_test)
print(f'Test accuracy: {accuracy}')
4/4 [==============================] - 0s 2ms/step - loss: 0.0936 - accuracy: 0.9825 Test accuracy: 0.9824561476707458
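A natural follow-up in my own words (this is my sketch, not part of the ChatGPT exchange): accuracy alone hides whether the mistakes are false negatives or false positives, which matters a great deal for a cancer classifier. Assuming the trained model and label encoder from above:

from sklearn.metrics import confusion_matrix, classification_report

# Threshold the sigmoid outputs at 0.5 to turn probabilities into class predictions.
probs = model.predict(features_test)
preds = (probs > 0.5).astype(int).ravel()
print(confusion_matrix(labels_test, preds))   # rows are true classes, columns are predictions
print(classification_report(labels_test, preds, target_names=list(le.classes_)))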