Local Development
Setup
For our training example on a local machine, we will use the nanoGPT (https://github.com/karpathy/nanoGPT) model by Andrej Karpathy to explain how it works at a basic level. For an in-depth understanding of how this all works, you can follow this step-by-step video tutorial by none other than Andrej Karpathy here, or you can read an article by The New York Times (or access a PDF of it here).
Installing Python
First, we have to make sure Python is installed. While the latest Python version is 3.12, this guide uses Python 3.10, because PyTorch, an important machine learning package, typically adds support for a new Python release only some time after it comes out. A slightly older version such as Python 3.9 or Python 3.8 should also be fine, but anything older than that will not work.
Windows
For most machines, download the Windows installer (64-bit) for Python 3.10.11, the last 3.10 release that ships a Windows installer, [here]
macOS
On macOS, download Python 3.10.11, the last 3.10 release that ships a macOS installer, [here]
Linux
As always, Linux gives you several options. You can either install Python from your package manager (apt on Debian or Ubuntu, for example) like so
sudo apt install python3.10
or build directly from source [here]
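Whichever platform you are on, you can confirm which interpreter you ended up with by asking Python itself. The snippet below is only a sketch: the helper name is_supported is made up for this guide, and the 3.8 to 3.10 range simply mirrors the version advice above rather than any official compatibility table.

```python
import sys

# Version range taken from this guide (3.8 through 3.10); adjust as needed.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 10)


def is_supported(version=None):
    """Return True if (major, minor) falls inside the range used in this guide."""
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    print("Within the range used in this guide:", is_supported())
```

If the second line prints False, check whether another Python installation comes first on your PATH.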
Copying Code
Now that we have Python set up, clone or download this GitHub repository from Andrej Karpathy to get all the necessary code.
Setting up a virtual environment
After you have Python installed and the repository cloned, the next step is to create a virtual environment. We use virtual environments to make sure there are no conflicts between package versions. Open the directory you downloaded in your IDE of choice. If this is the only Python version you have installed, you can just run
python3 -m venv venv
or if this doesn't work for some reason also try
python -m venv venv
If you have multiple Python versions installed, you need to call the executable of the specific version you want. In my case, it looks something like this:
C:/Python310/python.exe -m venv venv
This will create a virtual python environment in the current directory called venv.
To start the virtual environment, type activate in the terminal. If you are in the same directory as the venv folder, you should be placed in the virtual environment. If this does not work, make sure your terminal is in the directory that contains the venv folder. You can also try
Windows
.\venv\Scripts\activate
macOS / Linux
source venv/bin/activate
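If you are unsure whether activation worked, you can ask the interpreter directly. This is a minimal sketch using only the standard library; the function name in_virtual_environment is made up for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtual_environment())
```

When the environment is active, the interpreter path printed on the first line should point inside the venv folder.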
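To exit the virtual environment, simply type deactivate in the terminal. On Windows you can also call the script directly:
Windows
.\venv\Scripts\deactivate
macOS / Linux
deactivate
On macOS and Linux, deactivate is a shell function defined by the activate script, so there is no venv/bin/deactivate file to source.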
Now that we have our environment set up, we can install the packages required for this demonstration. The exact command depends on whether you have a dedicated GPU that supports CUDA. Go to this website and scroll down until you see the Install PyTorch section. Select your options, then copy the generated command and run it inside your virtual environment. This will install the PyTorch package and all of its required dependencies.
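Once the install command has finished, it is worth checking that PyTorch can be imported and which device it will use. The sketch below is one way to do that; describe_torch is a hypothetical helper name, and the device selection mirrors the 'cuda' if available, else 'cpu' pattern used later in this guide.

```python
import importlib.util


def describe_torch():
    """Return a one-line summary of the PyTorch install, if there is one."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the check also works without PyTorch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"PyTorch {torch.__version__}, training device: {device}"


if __name__ == "__main__":
    print(describe_torch())
```

If the device is reported as cpu on a machine with a CUDA-capable GPU, the most likely cause is that the CPU-only build was selected on the install page.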
Training
To finally start training, open the bigram.py file and, for our first run, lower the max_iters parameter from the given 3,000 down to 2,000 for faster results, then run the file. After training for a bit, the model prints its generated text. You can experiment with the parameters to get better results, and you can watch the loss go down as the iteration number increases.
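To judge the loss values printed during training, it helps to know where an untrained model should start. A model that gives every token the same probability has a cross-entropy loss of ln(vocab_size). The sketch below assumes a character vocabulary of 65 symbols, which is an assumption about the training text rather than something stated in this guide; substitute the vocabulary size of your own data.

```python
import math


def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a model that guesses every token with equal probability."""
    return -math.log(1.0 / vocab_size)


if __name__ == "__main__":
    vocab_size = 65  # assumption: number of distinct characters in the training text
    print(f"baseline loss: {uniform_baseline_loss(vocab_size):.4f}")
```

With 65 symbols the baseline is about 4.17, so a loss well below that means the model has learned something about the text.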
Saving Model
For bigger models that you want to train over multiple sessions, or that cannot be trained in a single session, it is important to save your work so that you can later load the model and continue training.
PyTorch streamlines the process of saving a model by using checkpoints.
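import torch
# iter, m, optimizer and loss come from the training loop in bigram.py
checkpoint = {
    'iter': iter,
    'model_state_dict': m.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),  # store the loss as a plain number
    # ... other parameters you want to keep
}
checkpoint_path = 'checkpoint.pt'  # placeholder; use the path of your checkpoint
torch.save(checkpoint, checkpoint_path)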
Loading the model is also straightforward
import torch
# BigramLanguageModel and decode are defined in bigram.py
# Basic setup
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the checkpoint onto the current device and create an empty model
loaded_checkpoint = torch.load(checkpoint_path, map_location=device)
model = BigramLanguageModel()
m = model.to(device)
# Load the model parameters from the checkpoint
m.load_state_dict(loaded_checkpoint['model_state_dict'])
# Create a basic optimizer and load its state from the checkpoint
optimizer = torch.optim.AdamW(m.parameters())
optimizer.load_state_dict(loaded_checkpoint['optimizer_state_dict'])
# Get values
iter = loaded_checkpoint['iter']
loss = loaded_checkpoint['loss']
# Test to see if the model was loaded correctly
print(iter)
print(loss)
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
As you can see, saving and loading a model lets you maintain progress and train bigger models over time.
Incorporating this into the nanoGPT training script we are using would look something like this:
Before
...
for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    # sample a batch of data
    xb, yb = get_batch('train')
    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
After
...
checkpoint_path = 'checkpoint.pt'  # placeholder; use the path of your checkpoint
for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    # sample a batch of data
    xb, yb = get_batch('train')
    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # save a checkpoint every 1000 iterations or if we are at the last iteration;
    # this runs after the training step so that loss is already defined
    if iter % 1000 == 0 or iter == max_iters - 1:
        checkpoint = {
            'iter': iter,
            'model_state_dict': m.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss.item()
        }
        torch.save(checkpoint, checkpoint_path)
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
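The loop above always starts from iteration 0, even when a checkpoint already exists. One way to actually pick up where a previous run stopped is to load the checkpoint first and start the loop at the saved iteration. The sketch below shows the idea on a toy problem so that it runs on its own: the nn.Linear model, the random batches, the train function and the save_every parameter are all stand-ins invented for illustration, not part of bigram.py.

```python
import os
import tempfile

import torch
import torch.nn as nn


def train(model, optimizer, checkpoint_path, max_iters, save_every=10):
    """Train a toy model, resuming from checkpoint_path if it already exists."""
    start_iter = 0
    if os.path.exists(checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location="cpu")
        model.load_state_dict(checkpoint["model_state_dict"])
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
        start_iter = checkpoint["iter"] + 1  # continue after the last saved step

    for step in range(start_iter, max_iters):
        # Stand-in for get_batch(): random inputs with a fixed target rule.
        xb = torch.randn(8, 4)
        yb = xb.sum(dim=1, keepdim=True)
        loss = nn.functional.mse_loss(model(xb), yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        # Save periodically and at the final step, after the update.
        if step % save_every == 0 or step == max_iters - 1:
            torch.save(
                {
                    "iter": step,
                    "model_state_dict": model.state_dict(),
                    "optimizer_state_dict": optimizer.state_dict(),
                    "loss": loss.item(),
                },
                checkpoint_path,
            )
    return start_iter


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
    model = nn.Linear(4, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
    print("first run started at", train(model, optimizer, path, max_iters=20))
    print("second run started at", train(model, optimizer, path, max_iters=40))
```

The same pattern should carry over to bigram.py by replacing range(max_iters) with a range that starts at the loaded iteration, though the details depend on how you structure the script.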