As an individual with limited access to compute, I have been wondering for a while whether I could build a decent large language model. While the big corporations are focused on building bigger and bigger models, I am going small!
Introducing MicroLlama300M, my LLM that beat BERT
I made MicroLlama300M with the following restrictions:
- My overall budget is $500.
- Must pretrain an LLM from scratch with a fully open-source dataset and model.
- Not allowed to finetune a model or use another LLM such as GPT-4 to generate any training data.
Github: https://github.com/keeeeenw/MicroLlama
This project is heavily based on TinyLlama, an awesome open-source project aimed at pretraining a 1.1B Llama model on 1T tokens.
This project is a work in progress. So far I have spent $280 on compute (4 x Nvidia RTX 4090 on Vast.ai) and $3 on AWS S3 storage over 4 days of training the 300M Llama model on 50B tokens.
I modified TinyLlama to support the following features (I will release my forked version of the source code after some clean up):
- Pretrain a smaller, 300M-parameter model on Slimpajama
- Removed Starcoderdata so that my model can focus on Slimpajama. This also means my model probably cannot do coding without fine-tuning.
- Added the ability to process and tokenize Slimpajama while downloading the data (see the sketch after this list). The original setup only works with pre-downloaded data. This turned out to be a good time-saver because downloading 800GB+ of data over a non-commercial internet connection is very slow, and processing all of the Slimpajama data also takes time.
- Various helper scripts and Python utilities, such as code for uploading the pretrained checkpoint to the Hugging Face Hub (see the upload sketch after this list).
- Bug fixes.
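Conceptually, the streaming preprocessing works like the sketch below. This is only an illustrative outline, not the actual code from my fork: it assumes the Hugging Face datasets library in streaming mode, the cerebras/SlimPajama-627B dataset, and the same TinyLlama tokenizer used for pretraining; the real pipeline additionally writes the token ids out in the packed binary format that the TinyLlama training code expects.

# Illustrative sketch only: tokenize SlimPajama while it streams in, instead of
# downloading the full 800GB+ dump before preprocessing starts.
# Assumes: pip install datasets transformers zstandard
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")

# streaming=True yields examples as they arrive over the network, so the slow
# download and the tokenization overlap instead of happening back to back.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, example in enumerate(stream):
    token_ids = tokenizer(example["text"])["input_ids"]
    # ... here the real pipeline would append token_ids to a packed binary shard ...
    if i >= 3:  # stop early so the sketch terminates quickly
        break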
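The checkpoint upload helper is conceptually just a thin wrapper around the Hugging Face Hub client. A rough sketch, assuming the huggingface_hub library, a logged-in CLI session, and a placeholder local checkpoint path (this is not the exact script in my fork):

# Rough sketch of pushing a converted checkpoint to the Hugging Face Hub.
# Assumes `huggingface-cli login` has been run; the local folder path is a placeholder.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("keeeeenw/MicroLlama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="out/microllama-300m-hf",  # placeholder: directory with config.json + model weights
    repo_id="keeeeenw/MicroLlama",
    repo_type="model",
)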
Evaluation results
I performed the experiments using the standard lm-evaluation-harness setup. Following the same setup as TinyLlama, I used acc_norm for all datasets except winogrande and boolq, which use acc as the metric.
- keeeeenw/MicroLlama is the evaluation result for my 300M Llama model trained on 50B tokens.
- google-bert/bert-large-uncased is the baseline because it is one of the most popular small language models and has a similar parameter count (336M).
- PY007/TinyLlama-1.1B-Chat-v0.1: as a sanity check, I evaluated one of the TinyLlama models to validate my lm-evaluation-harness setup. The numbers match the ones reported by TinyLlama exactly.
- TinyLlama-1.1B-intermediate-step-1431k-3T is the evaluation result for the best model created and reported by TinyLlama.
To reproduce my numbers, please install lm-evaluation-harness and run the following command:
lm_eval \
--model hf \
--model_args pretrained=keeeeenw/MicroLlama,dtype="float",tokenizer=TinyLlama/TinyLlama-1.1B-step-50K-105b \
--tasks hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa \
--device cuda:0 \
--batch_size 64
Observations
- Because keeeeenw/MicroLlama is much smaller than TinyLlama, our model does not achieve the same impressive results, but the numbers are closer than I expected.
- Our model outperforms google-bert/bert-large-uncased, which is actually slightly larger. The only dataset on which google-bert/bert-large-uncased beats our model is ARC_c (arc_challenge). I will provide more analysis in a future study.
Based on the evaluation above, our model should be a good starting point for fine-tuning tasks that are typically performed with the BERT family of models. Some of these tasks include:
- Sentence Transformers: https://huggingface.co/sentence-transformers (a rough wiring sketch follows this list)
- BERTScore: https://huggingface.co/spaces/evaluate-metric/bertscore
- A lightweight chatbot after some finetuning.
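For the Sentence Transformers direction, a first wiring could look like the sketch below. This is an untested assumption on my part rather than anything shipped with MicroLlama: it uses the generic models.Transformer + models.Pooling wrappers from the sentence-transformers library, borrows the TinyLlama tokenizer, and the embeddings would only become useful after contrastive fine-tuning on sentence pairs.

# Hypothetical sketch: wrap MicroLlama as a Sentence Transformers encoder.
# Untested; mean pooling over a decoder-only Llama is a common but unofficial recipe.
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer(
    "keeeeenw/MicroLlama",
    tokenizer_name_or_path="TinyLlama/TinyLlama-1.1B-step-50K-105b",  # same tokenizer as pretraining
    max_seq_length=256,
)
# Llama tokenizers usually lack a pad token; reuse EOS so batched encoding works.
word_embedding_model.tokenizer.pad_token = word_embedding_model.tokenizer.eos_token

pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Meaningful sentence embeddings still require contrastive fine-tuning,
# e.g. with sentence_transformers.losses.MultipleNegativesRankingLoss.
embeddings = model.encode(["How do I keep chickens?", "Tips for raising hens"])
print(embeddings.shape)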
Want to try it out?
1. Install dependencies
pip install transformers
pip install torch
2. Run code!
import torch
import transformers
from transformers import AutoTokenizer, LlamaForCausalLM

def generate_text(prompt, model, tokenizer):
    text_generator = transformers.pipeline(
        "text-generation",
        model=model,
        torch_dtype=torch.float16,
        device_map="auto",
        tokenizer=tokenizer
    )

    formatted_prompt = f"Question: {prompt} Answer:"

    sequences = text_generator(
        formatted_prompt,
        do_sample=True,
        top_k=5,
        top_p=0.9,
        num_return_sequences=1,
        repetition_penalty=1.5,
        max_new_tokens=128,
    )

    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

# use the same tokenizer as TinyLlama
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")

# load model from huggingface
model = LlamaForCausalLM.from_pretrained("keeeeenw/MicroLlama")

# question from https://www.reddit.com/r/LocalLLaMA/comments/13zz8y5/what_questions_do_you_ask_llms_to_check_their/
generate_text("Please provide me instructions on how to steal an egg from my chicken.", model, tokenizer)