2025.01 – 2025.02

TinyLLM

hoonably/TinyLLM

Models

| Model Name | Affiliation | Model Size | Release Date | 🔗 Link |
| --- | --- | --- | --- | --- |
| Bloom | BigScience | 560M | 2022.11 | Bloom |
| Bloomz | BigScience | 560M | 2022.11 | Bloomz |
| Cerebras-GPT | Cerebras | 590M | 2023.03 | Cerebras-GPT |
| Cerebras-GPT | Cerebras | 256M | 2023.03 | Cerebras-GPT |
| Cerebras-GPT | Cerebras | 111M | 2023.03 | Cerebras-GPT |
| Danube3 | H2O | 500M | 2024.07 | Danube3 |
| Flan-T5 | Google | Base | 2023.01 | Flan-T5 |
| LaMini-GPT | MBZUAI | 774M | 2023.04 | LaMini-GPT |
| LaMini-GPT | MBZUAI | 124M | 2023.04 | LaMini-GPT |
| LiteLlama | ahxt | 460M | N/A | LiteLlama |
| OPT | Meta | 350M | 2022.05 | OPT |
| OPT | Meta | 125M | 2022.05 | OPT |
| Pythia | EleutherAI | 410M | 2023.03 | Pythia |
| Pythia | EleutherAI | 160M | 2023.03 | Pythia |
| PhoneLM | mllmTeam | 0.5B | 2024.11 | PhoneLM |
| Qwen1.5 | Alibaba | 0.5B | 2024.02 | Qwen1.5 |
| Qwen2.5 | Alibaba | 0.5B | 2024.09 | Qwen2.5 |
| SmolLM | Hugging Face | 360M | 2024.07 | SmolLM |
| SmolLM | Hugging Face | 135M | 2024.07 | SmolLM |
| TinyLlama | TinyLlama | 1.1B | 2023.12 | TinyLlama |

Evaluation Datasets

| Dataset Name | Explanation | 🔗 Link |
| --- | --- | --- |
| ARC | Science question dataset for QA. ARC-e: ARC-easy | ai2_arc |
| OBQA | QA dataset modeled after open-book exams, designed to test multi-step reasoning, commonsense knowledge, and deep text comprehension | openbookqa |
| BoolQ | QA dataset for yes/no questions | boolq |
| PIQA | QA dataset for physical commonsense reasoning | piqa |
| SIQA | QA dataset designed to evaluate social commonsense reasoning about people's actions and their social implications | social_i_qa |
| WinoGrande | Fill-in-the-blank problems | winogrande |
| HellaSwag | Commonsense natural language reasoning | hellaswag |

Environment

Jetson Orin Nano (8GB RAM), Python 3.10.2

Evaluation Result

nan: Failed inference (memory issue)

1. Model Size (MB)

| Model | Parameters | ARC-e | BoolQ | OBQA | PIQA | SIQA | WinoGrande | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cerebras-GPT-111M | 111M | 423.624 | 423.624 | 423.624 | 423.624 | 423.624 | 423.624 | 423.624 |
| Cerebras-GPT-256M | 256M | 976.475 | 976.475 | 976.475 | 976.475 | 976.475 | 976.475 | 976.475 |
| Cerebras-GPT-590M | 590M | 2251.86 | 2251.86 | 2251.86 | 2251.86 | 2251.86 | 2251.86 | 2251.86 |
| LaMini-GPT-124M | 124M | 474.703 | 474.703 | 474.703 | 474.703 | 474.703 | 474.703 | 474.703 |
| LaMini-GPT-774M | 774M | 2952.7 | nan | 2952.7 | nan | nan | 2952.7 | 2952.7 |
| LiteLlama-460M-1T | 460M | 1761.19 | 1761.19 | 1761.19 | 1761.19 | 1761.19 | 1761.19 | 1761.19 |
| Qwen1.5-0.5B | 500M | 1769.97 | 1769.97 | 1769.97 | nan | 1769.97 | 1769.97 | 1769.97 |
| Qwen2.5-0.5B | 500M | 1884.59 | 1884.59 | 1884.59 | 1884.59 | 1884.59 | 1884.59 | 1884.59 |
| SmolLM-135M | 135M | 513.134 | 513.134 | 513.134 | 513.134 | 513.134 | 513.134 | 513.134 |
| SmolLM-360M | 360M | 1380.24 | 1380.24 | 1380.24 | 1380.24 | 1380.24 | 1380.24 | 1380.24 |
| bloom-560m | 560M | 2133.23 | 2133.23 | 2133.23 | nan | 2133.23 | 2133.23 | 2133.23 |
| bloomz-560m | 560M | 2133.23 | 2133.23 | 2133.23 | 2133.23 | 2133.23 | 2133.23 | 2133.23 |
| opt-125m | 125M | 477.75 | 477.75 | 477.75 | 477.75 | 477.75 | 477.75 | 477.75 |
| opt-350m | 350M | 1263.41 | 1263.41 | 1263.41 | 1263.41 | 1263.41 | 1263.41 | 1263.41 |
| pythia-160m | 160M | 619.213 | 619.213 | 619.213 | 619.213 | 619.213 | 619.213 | 619.213 |
| pythia-410m | 410M | 1546.23 | 1546.23 | 1546.23 | 1546.23 | 1546.23 | 1546.23 | 1546.23 |
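
The size is the same in every column because it depends only on the model's weights, not on the benchmark. As a rough sanity check, a size can be converted back into an approximate parameter count. This sketch makes two assumptions the post does not state: that the weights were stored as 32-bit floats (4 bytes per parameter), and that "MB" here means MiB (2^20 bytes). The function name is illustrative.

```python
BYTES_PER_PARAM = 4      # assumption: fp32 weights
BYTES_PER_MIB = 2 ** 20  # assumption: the table's "MB" is really MiB


def approx_params_millions(size_mib: float) -> float:
    """Approximate parameter count (in millions) implied by a weight size in MiB."""
    return size_mib * BYTES_PER_MIB / BYTES_PER_PARAM / 1e6


# Sizes copied from the Model Size table.
sizes = [
    ("Cerebras-GPT-111M", 423.624),
    ("SmolLM-135M", 513.134),
    ("Qwen2.5-0.5B", 1884.59),
]
for name, size in sizes:
    print(f"{name}: ~{approx_params_millions(size):.0f}M parameters")
```

Under those assumptions the three examples come out at roughly 111M, 135M, and 494M parameters, close to the nominal values in the Parameters column.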

2. Accuracy (%)

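| Model | Parameters | ARC-e | BoolQ | OBQA | PIQA | SIQA | WinoGrande | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cerebras-GPT-111M | 111M | 26.4912 | 38 | 25 | 49.2 | 33.7 | 49.5659 | 36.9929 |
| Cerebras-GPT-256M | 256M | 26.4912 | 38.1 | 25 | 49.2 | 33.7 | 49.5659 | 37.0095 |
| Cerebras-GPT-590M | 590M | 26.4912 | 37.9 | 25 | 49.2 | 33.7 | 49.5659 | 36.9762 |
| LaMini-GPT-124M | 124M | 24.9123 | 62.4 | 24 | 50.8 | 33.1 | 50.4341 | 40.9411 |
| LaMini-GPT-774M | 774M | 32.6316 | nan | 31.2 | nan | nan | 51.0655 | 38.299 |
| LiteLlama-460M-1T | 460M | 25.614 | 38.1 | 25.2 | 49.3 | 34 | 49.5659 | 36.9633 |
| Qwen1.5-0.5B | 500M | 54.7368 | 59.7 | 42 | nan | 42.3 | 50.8287 | 49.9131 |
| Qwen2.5-0.5B | 500M | 62.807 | 64.6 | 44.8 | 59.5 | 52.3 | 50.9077 | 55.8191 |
| SmolLM-135M | 135M | 24.2105 | 60.4 | 25.8 | 49.9 | 32.8 | 50.4341 | 40.5908 |
| SmolLM-360M | 360M | 21.7544 | 39.7 | 23.6 | 52.5 | 34.3 | 49.5659 | 36.9034 |
| bloom-560m | 560M | 26.3158 | 38.3 | 25.8 | nan | 33.7 | 49.5659 | 34.7363 |
| bloomz-560m | 560M | 24.2105 | 62.5 | 21.8 | 50.6 | 32.9 | 50.3552 | 40.3943 |
| opt-125m | 125M | 26.4912 | 43.4 | 25.4 | 49.4 | 33.7 | 49.487 | 37.9797 |
| opt-350m | 350M | 26.3158 | 38.4 | 24.8 | 49.9 | 32.6 | 49.5659 | 36.9303 |
| pythia-160m | 160M | 26.3158 | 38 | 25.2 | 48.6 | 33 | 49.2502 | 36.7277 |
| pythia-410m | 410M | 26.3158 | 37.8 | 25 | 49.2 | 33.7 | 49.6448 | 36.9434 |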

Model average accuracy
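
The Avg. column appears to skip failed runs rather than count them as zero: for LaMini-GPT-774M, averaging only the three datasets that completed reproduces the reported 38.299. A minimal sketch of that calculation follows; the helper name is illustrative and not taken from the original scripts.

```python
import math


def mean_ignoring_nan(values):
    """Average the entries that are not NaN; return NaN if none are valid."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


nan = math.nan
# LaMini-GPT-774M accuracy: ARC-e, BoolQ, OBQA, PIQA, SIQA, WinoGrande
lamini_774m = [32.6316, nan, 31.2, nan, nan, 51.0655]
print(round(mean_ignoring_nan(lamini_774m), 3))  # 38.299
```

This means averages for models with failed runs cover fewer datasets, so they are not directly comparable with the averages of models that completed all six.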

3. Inference Time (ms)

| Model | Parameters | ARC-e | BoolQ | OBQA | PIQA | SIQA | WinoGrande | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cerebras-GPT-111M | 111M | 47.8482 | 75.0534 | 42.6519 | 67.5944 | 46.5603 | 49.3566 | 54.8441 |
| Cerebras-GPT-256M | 256M | 118.458 | 197.908 | 104.827 | 118.09 | 125.461 | 76.8072 | 123.592 |
| Cerebras-GPT-590M | 590M | 251.496 | 407.803 | 227.772 | 279.49 | 252.317 | 195.598 | 269.079 |
| LaMini-GPT-124M | 124M | 55.5167 | 91.169 | 50.3607 | 70.0167 | 53.6331 | 51.48 | 62.0294 |
| LaMini-GPT-774M | 774M | 331.246 | nan | 288.771 | nan | nan | 241.842 | 287.286 |
| LiteLlama-460M-1T | 460M | 173.447 | 278.426 | 156.089 | 181.297 | 173.079 | 124.438 | 181.129 |
| Qwen1.5-0.5B | 500M | 175.618 | 305.815 | 155.574 | nan | 174.686 | 146.179 | 191.574 |
| Qwen2.5-0.5B | 500M | 197.737 | 330.037 | 173.806 | 213.579 | 197.794 | 143.201 | 209.359 |
| SmolLM-135M | 135M | 125.591 | 143.362 | 124.124 | 138.117 | 125.266 | 125.496 | 130.326 |
| SmolLM-360M | 360M | 161.715 | 274.149 | 151.589 | 176.19 | 158.66 | 143.47 | 177.629 |
| bloom-560m | 560M | 206.418 | 357.453 | 178.741 | nan | 213.107 | 149.083 | 220.96 |
| bloomz-560m | 560M | 206.628 | 357.817 | 178.633 | 257.519 | 213.568 | 148.324 | 227.081 |
| opt-125m | 125M | 56.6352 | 86.7192 | 51.8376 | 63.6035 | 55.3677 | 46.6623 | 60.1376 |
| opt-350m | 350M | 144.791 | 231.819 | 129.27 | 148.27 | 142.364 | 100.038 | 149.425 |
| pythia-160m | 160M | 57.3411 | 89.2453 | 53.2252 | 63.1686 | 55.5747 | 50.21 | 61.4608 |
| pythia-410m | 410M | 153.6 | 247.236 | 135.242 | 153.365 | 150.89 | 103.307 | 157.273 |

Average inference time
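
The post does not show how latency was measured. One common approach, sketched here under that caveat, is to time each call with a wall-clock timer after a few untimed warm-up runs and report the mean in milliseconds. `fake_inference` is a hypothetical stand-in for a real model call, and `mean_latency_ms` is an illustrative helper; on a GPU the device would also need to be synchronized before each clock read.

```python
import statistics
import time


def mean_latency_ms(fn, inputs, warmup=2):
    """Return the mean wall-clock time of fn(x) over inputs, in milliseconds."""
    for x in inputs[:warmup]:  # warm-up calls are executed but not timed
        fn(x)
    timings = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def fake_inference(prompt):
    # Hypothetical placeholder: a real run would call the model here.
    time.sleep(0.005)
    return len(prompt)


latency = mean_latency_ms(fake_inference, ["a", "bb", "ccc"])
print(f"{latency:.1f} ms per sample")
```

Warm-up runs matter on a device like this because the first calls often include one-time costs such as kernel loading, which would otherwise inflate the average.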

4. Peak GPU Memory Usage (GB)

| Model | Parameters | ARC-e | BoolQ | OBQA | PIQA | SIQA | WinoGrande | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cerebras-GPT-111M | 111M | 0.518592 | 0.607112 | 0.511304 | 0.604109 | 0.509212 | 0.48632 | 0.539441 |
| Cerebras-GPT-256M | 256M | 1.09269 | 1.21178 | 1.07552 | 1.20522 | 1.07341 | 1.04991 | 1.11809 |
| Cerebras-GPT-590M | 590M | 2.38433 | 2.55191 | 2.37 | 2.54645 | 2.36661 | 2.33764 | 2.42616 |
| LaMini-GPT-124M | 124M | 0.518301 | 0.582409 | 0.511014 | 0.580349 | 0.508922 | 0.499037 | 0.533339 |
| LaMini-GPT-774M | 774M | 3.03155 | nan | 3.02419 | nan | nan | 3.01209 | 3.02261 |
| LiteLlama-460M-1T | 460M | 1.76578 | 1.83748 | 1.75761 | 1.83512 | 1.75526 | 1.74412 | 1.78256 |
| Qwen1.5-0.5B | 500M | 1.93748 | 2.19273 | 1.90575 | nan | 1.89908 | 1.85993 | 1.959 |
| Qwen2.5-0.5B | 500M | 1.95148 | 2.15225 | 1.92665 | 2.14751 | 1.92128 | 1.89048 | 1.99828 |
| SmolLM-135M | 135M | 0.555858 | 0.633439 | 0.546275 | 0.631614 | 0.546517 | 0.534195 | 0.57465 |
| SmolLM-360M | 360M | 1.4055 | 1.49492 | 1.39445 | 1.49281 | 1.39473 | 1.38053 | 1.42716 |
| bloom-560m | 560M | 2.30397 | 2.71227 | 2.25459 | nan | 2.24332 | 2.17851 | 2.33853 |
| bloomz-560m | 560M | 2.30397 | 2.71227 | 2.25459 | 2.71267 | 2.24332 | 2.17851 | 2.40089 |
| opt-125m | 125M | 0.521742 | 0.614046 | 0.511907 | 0.611887 | 0.50906 | 0.495602 | 0.544041 |
| opt-350m | 350M | 1.30801 | 1.43236 | 1.29386 | 1.42981 | 1.28977 | 1.27041 | 1.33737 |
| pythia-160m | 160M | 0.727875 | 0.838728 | 0.719082 | 0.834872 | 0.717061 | 0.689547 | 0.754528 |
| pythia-410m | 410M | 1.7104 | 1.8971 | 1.68983 | 1.8946 | 1.6853 | 1.65406 | 1.75521 |

GPU memory usage

Thoughts

This was the very first project I worked on after joining a research lab as an intern, and honestly, it felt like a proper first project. On my first day, I was handed a Jetson Nano and told to set it up. My first thought was: what is this? It's so slow and frustrating! That was basically my first real encounter with Ubuntu. It felt like dealing with an old computer, but surprisingly it supported CUDA (though not the latest versions).

At the time, I had no idea what I was even doing. Dataset? HuggingFace? What are those? I only started to get it after seeing others measuring accuracy: oh, so this is on-device AI! It's running entirely on this GPU without any server. That's when it clicked. When I saw models getting only 30% accuracy, I was like, is this even working? Feels like random guessing. Turns out it was because no fine-tuning had been done yet.

It was also fascinating to learn that there are so many tiny LLMs out there, and the goal was to compare them in terms of latency and accuracy to see which ones perform best.

I also got introduced to WandB and started learning how to write automation scripts. It was actually my first time automating anything. Since the measurements were slow and involved lots of repetition, I finally understood why automation matters. Ever since then, I've preferred automating workflows in all my projects.

Looking back, I had no idea what I was doing and just followed along at first, but I think my professor intentionally gave me this as a starting point: explore whether LLMs could run on small edge devices and what kind of efficiency they could reach. Along the way, I ended up learning so many valuable things: Docker, WandB, automation, conda environments, data visualization, and more.

It might not seem like a big deal later, but for someone who started out knowing nothing, this was a really important project. It gave me a solid foundation and made me feel like I was finally part of a research group.