A Week of Learning: Neural Nets, GenAI, and a Few Missteps
Over the last 7 days, I’ve learnt a lot about Generative AI - and Neural Networks in general - from Andrej Karpathy and Grant Sanderson’s videos.
Specifically, I studied:
- Intro to Large Language Models by Andrej
- Software is changing (again), Andrej's talk at Y Combinator's AI Startup School
- Deep Dive into LLMs like ChatGPT by Andrej
- The entire Neural Network series by 3Blue1Brown
If you’re a product manager or a business person and can watch only one of these, watch the second one.
Most interesting learnings
- LLMs are not a new type of ML model called “transformers”. They’re still good old neural networks. A transformer is just a neural network architecture built from three pieces: an attention block (from the famous 2017 Google paper, “Attention Is All You Need”), an MLP or feed-forward block (the classic neural network), and layer normalisation.
- A huge part of the transformer architecture’s success comes from how well the attention block can be parallelised. Deep learning was already known to scale well, and attention blocks made insane scale possible - at least while the scaling laws held.
- System prompts, and prompts in general, actually matter. It’s not a good idea to dismiss apps and products as “just a wrapper around ChatGPT”. The way you ask a question shapes the answer you get (see: the reversal curse). ChatGPT’s “study mode” launched just last week; it’s essentially just a system prompt on top of ChatGPT, yet it creates meaningful value.
- Evals, or evaluations, are important because of emergent behaviours: we cannot confidently predict how a new training run or a new dataset will impact model performance. Hence it’s important to have a set of ‘automated QA steps’ to run the model through whenever making a major update.
- The biggest gap in current models is that they have no equivalent of ‘sleep’: something that lets them consolidate learnings outside the context window, so that in-context knowledge can be used over the longer term. Right now, if such knowledge is important, it’s added as a patch to the system prompt. A good example is counting the r’s in raspberry - Claude fixed this by hardcoding a line in their system prompt to handle this particular case.
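To make the first point above concrete, here is a minimal single-head sketch in NumPy of how those three pieces - attention, the MLP block, and layer normalisation - compose into one transformer block. The shapes and random weights are purely illustrative; real models use multi-head attention, learned biases, and GELU activations, so treat this as a wiring diagram, not an implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token's features to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head causal self-attention: each token attends
    # only to itself and earlier tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -1e9  # block attention to future tokens
    return softmax(scores) @ v

def mlp(x, W1, W2):
    # The feed-forward block (the "classic neural network"):
    # expand, apply a nonlinearity, project back down.
    return np.maximum(x @ W1, 0) @ W2

def transformer_block(x, params):
    # Attention and MLP, each wrapped in layer norm with a
    # residual connection - the three pieces composed.
    x = x + attention(layer_norm(x), params["Wq"], params["Wk"], params["Wv"])
    x = x + mlp(layer_norm(x), params["W1"], params["W2"])
    return x

# Tiny random example: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
d = 8
params = {name: rng.normal(size=(d, d)) for name in ("Wq", "Wk", "Wv")}
params["W1"] = rng.normal(size=(d, 4 * d))
params["W2"] = rng.normal(size=(4 * d, d))
x = rng.normal(size=(4, d))
y = transformer_block(x, params)
print(y.shape)  # (4, 8)
```

A real model simply stacks dozens of these blocks; the shape going in equals the shape coming out, which is what makes the stacking (and the parallelisation) so clean.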
I’m not getting into the technicalities of LLMs and how they actually work because that’s covered very well in the videos.
Challenges (and learnings) from the gap between theory and practice
Andrej described an LLM as a pair of files: one holding the model parameters, which are just a list of numbers, and one holding some ‘trivial’ code to interpret those parameters.
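A caricature of that two-file picture, assuming a made-up toy format (the filename and the linear “model” are my own invention, not anything Andrej shows): one file of raw numbers on disk, and one short script that reads them and treats them as a model.

```python
import numpy as np

# "File 1": the parameters - just numbers saved to disk.
weights = np.arange(6, dtype=np.float64).reshape(2, 3)
np.save("params.npy", weights)

# "File 2": the trivial code that interprets those numbers as a model.
def run_model(param_path, x):
    W = np.load(param_path)  # reload the raw numbers
    return x @ W             # interpret them as a (toy) model

out = run_model("params.npy", np.ones(2))
print(out)  # [3. 5. 7.]
```

The real artefacts are the same in spirit, just vastly bigger: billions of parameters and a forward pass instead of a single matrix multiply.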
Since I had Llama and Gemma models offline via Ollama, I immediately decided to mess around with the param file to see how model performance would degrade. This is where the trouble began.
| Expectation | Reality |
| --- | --- |
| I have Ollama and Llama/Gemma available offline, so I should have the two files Andrej describes. | Turns out Ollama builds its own abstractions of these models to run inference very efficiently. I needed access to raw model weights, which aren’t available in a model downloaded through Ollama. |
| I’ll just find the weights somewhere; there are so many open-source models. | It’s rather difficult to get to the ‘bare machine’ view Andrej described. Even once I got my hands on a model (which I did, after a lot of back and forth on Hugging Face), running inference without Ollama becomes really expensive. |
| I’ll work with the smallest famous model available, GPT-2, and perturb the weights to analyse degradation closely. It might first lose its grasp of nouns, then of grammar… at any rate, it’d be interesting to observe. | Detailed as a long list below. |
What happened when I tried to perturb the weights file
- Model parameters are not a ‘list’ in the sense of a single flat array. They’re a bunch of matrices/tensors with a set structure. I later learnt about the types of matrices present; for GPT-3, there are 8 types.
- GPT-2 is a base model (versus an instruct model): it’s designed purely for next-token prediction, with no assistant-like behaviour. So my idea of running an evaluation using ‘subjective and objective questions from various fields of human endeavour’ would not directly work. On the bright side, I learnt a technique to get assistant-like answers out of a base model.
- When I finally got everything working and set up my eval, I learnt that… GPT-2 is terrible by today’s standards. The baseline performance is so poor that it leaves hardly any room to study degradation.
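The perturbation experiment itself is simple once you have raw tensors in hand. Here is a sketch of the idea, with stand-in tensor names and sizes (Gaussian noise scaled relative to each tensor’s own spread, so “perturb by 10%” means roughly the same thing for every matrix):

```python
import numpy as np

def perturb(params, fraction, rng):
    """Add Gaussian noise to every tensor, scaled to a fraction of its std."""
    noisy = {}
    for name, tensor in params.items():
        noise = rng.normal(size=tensor.shape) * tensor.std() * fraction
        noisy[name] = tensor + noise
    return noisy

rng = np.random.default_rng(42)
# Stand-ins for real model tensors (a real GPT-2 checkpoint has
# attention, MLP, embedding, and layer-norm tensors, among others).
params = {
    "attn.w": rng.normal(size=(16, 16)),
    "mlp.w": rng.normal(size=(16, 64)),
}
for frac in (0.01, 0.1, 1.0):
    noisy = perturb(params, frac, rng)
    drift = np.mean([np.abs(noisy[k] - params[k]).mean() for k in params])
    print(f"fraction={frac}: mean drift {drift:.4f}")
```

The interesting experiment is then to run the same eval at each noise level and watch what breaks first - which, as noted above, needs a model whose baseline is good enough for the degradation to be visible.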
All of this was when I’d only watched the first video I’ve mentioned above.
Since then, I’ve come to a place where I can laugh at some of my beliefs(/knowledge) from a week ago. They’re embarrassing enough that I’m hesitant to admit to them in public. I’ll still go ahead, in case my foolishness inspires somebody to embrace and overcome theirs.
P.S. I followed up this learning with an attempt at the MNIST digit recognizer competition on Kaggle. It’s about practical learnings from training a neural net, and about using ChatGPT to get a block-level understanding of CNNs insanely quickly.