Quick index of official model cards / system cards for major frontier (closed) and open(-weights) model families. Each entry includes the standard BibTeX citation following our BibTeX guide. Models are grouped by provider, and different model generations have separate entries. When citing a foundation model in a paper (e.g., as a baseline or backbone), use the canonical citation from this page — see the paper writing guide for more details.
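As a worked example of the intended workflow (the file name `references.bib` below is illustrative, and `natbib` is just one of several citation packages you could use), copy an entry from this page into your bibliography file and cite it by its key:

```latex
% Preamble (natbib shown; biblatex works analogously)
\usepackage{natbib}

% In the body: cite the canonical entry for each model you use
We use GPT-4 \citep{openai2023gpt4} as the backbone and compare
against Llama 2 \citep{touvron2023llama2} as an open-weights baseline.

% At the end of the document
\bibliographystyle{plainnat}
\bibliography{references}  % entries copied from this page live here
```

Entries with an `eprint`/`archivePrefix` pair render as arXiv citations; web-only entries (model cards and blog posts) instead carry a `url` and an access-date `note`.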
Foundation Models
OpenAI
GPT-3 — Language Models are Few-Shot Learners
Scaled autoregressive LM to 175B parameters, demonstrating strong few-shot performance across NLP tasks. Seminal because it revealed emergent in-context learning from scale, shifting the NLP paradigm from fine-tuning to prompting
@inproceedings{brown2020language,
title = {{Language Models are Few-Shot Learners}},
author = {Brown, T. and Mann, B. and Ryder, N. and Subbiah, M. and Kaplan, J. and Dhariwal, P. and others},
booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
year = {2020}
}
InstructGPT — Training Language Models to Follow Instructions with Human Feedback
Introduced RLHF to align GPT-3 with human intent, forming the basis for ChatGPT. Seminal because it established the instruction-following alignment paradigm adopted by virtually every subsequent chat model
@inproceedings{ouyang2022training,
title = {{Training Language Models to Follow Instructions with Human Feedback}},
author = {Ouyang, L. and Wu, J. and Jiang, X. and Almeida, D. and Wainwright, C. and Mishkin, P. and others},
booktitle = {{Advances in Neural Information Processing Systems (NeurIPS)}},
year = {2022}
}
GPT-4 — GPT-4 Technical Report
Multimodal large language model achieving human-level performance on professional and academic benchmarks. Seminal because it defined the frontier multimodal LLM benchmark and catalyzed mainstream enterprise and developer adoption of LLMs
@misc{openai2023gpt4,
title = {{GPT-4 Technical Report}},
author = {OpenAI},
year = {2023},
eprint = {2303.08774},
archivePrefix = {arXiv}
}
GPT-4o — GPT-4o System Card
Omnimodal model natively processing text, audio, image, and video; also covers GPT-4o mini
@misc{openai2024gpt4o,
title = {{GPT-4o System Card}},
author = {OpenAI},
year = {2024},
eprint = {2410.21276},
archivePrefix = {arXiv}
}
o1 — OpenAI o1 System Card
Large reasoning model (LRM) trained with large-scale reinforcement learning to perform an extended chain of thought before responding
@misc{openai2024o1,
title = {{OpenAI o1 System Card}},
author = {OpenAI},
year = {2024},
eprint = {2412.16720},
archivePrefix = {arXiv}
}
o3-mini — OpenAI o3-mini System Card
Smaller reasoning model with adjustable reasoning effort; no arXiv paper
@misc{openai2025o3mini,
title = {{OpenAI o3-mini System Card}},
author = {OpenAI},
year = {2025},
url = {https://openai.com/index/o3-mini-system-card/},
note = {Accessed February 9, 2026}
}
GPT-4.5 — GPT-4.5 System Card
OpenAI's largest pre-trained model, focused on broad knowledge and reduced hallucinations; a research preview emphasizing the scale of unsupervised learning
@misc{openai2025gpt45,
title = {{GPT-4.5 System Card}},
author = {OpenAI},
year = {2025},
url = {https://openai.com/index/gpt-4-5-system-card/},
note = {Accessed February 9, 2026}
}
o3 — OpenAI o3 and o4-mini System Card
OpenAI's most powerful reasoning model at release, with full tool use (browsing, code, images); first system card published under the Preparedness Framework v2
@misc{openai2025o3,
title = {{OpenAI o3 and o4-mini System Card}},
author = {OpenAI},
year = {2025},
url = {https://openai.com/index/o3-o4-mini-system-card/},
note = {Accessed February 9, 2026}
}
o4-mini — OpenAI o3 and o4-mini System Card
Cost-efficient reasoning model excelling at math and coding; achieves 99.5% on AIME 2025 with tool use
@misc{openai2025o4mini,
title = {{OpenAI o3 and o4-mini System Card}},
author = {OpenAI},
year = {2025},
url = {https://openai.com/index/o3-o4-mini-system-card/},
note = {Accessed February 9, 2026}
}
GPT-5 — GPT-5 System Card
Unified system with a real-time router dispatching across fast (main) and deep-reasoning (thinking) sub-models, replacing GPT-4o and o3; significant reduction in hallucinations and sycophancy
@misc{openai2025gpt5,
title = {{GPT-5 System Card}},
author = {OpenAI},
year = {2025},
eprint = {2601.03267},
archivePrefix = {arXiv}
}
GPT-5.2 — GPT-5.2 System Card
OpenAI's most capable model for professional knowledge work; achieves 100% on AIME 2025 competition math
@misc{openai2025gpt52,
title = {{GPT-5.2 System Card}},
author = {OpenAI},
year = {2025},
url = {https://openai.com/index/introducing-gpt-5-2/},
note = {Accessed February 9, 2026}
}
Meta
LLaMA — LLaMA: Open and Efficient Foundation Language Models
Open-weights LLM family (7B--65B) competitive with much larger proprietary models. Seminal because it kicked off the open-weights movement, enabling the entire ecosystem of community fine-tuning and open LLM research
@misc{touvron2023llama,
title = {{LLaMA: Open and Efficient Foundation Language Models}},
author = {Touvron, H. and Lavril, T. and Izacard, G. and Martinet, X. and Lachaux, M. and Lacroix, T. and others},
year = {2023},
eprint = {2302.13971},
archivePrefix = {arXiv}
}
Llama 2 — Llama 2: Open Foundation and Fine-Tuned Chat Models
Open-weights 7B--70B models with RLHF chat variants, widely adopted for fine-tuning
@misc{touvron2023llama2,
title = {{Llama 2: Open Foundation and Fine-Tuned Chat Models}},
author = {Touvron, H. and Martin, L. and Stone, K. and Albert, P. and Almahairi, A. and Babaei, Y. and others},
year = {2023},
eprint = {2307.09288},
archivePrefix = {arXiv}
}
Llama 3.1 — The Llama 3 Herd of Models
Canonical paper for the Llama 3 family (8B, 70B, 405B); covers pre-training, post-training, multimodal, and safety
@misc{grattafiori2024llama,
title = {{The Llama 3 Herd of Models}},
author = {Grattafiori, A. and Dubey, A. and Jauhri, A. and Pandey, A. and Kadian, A. and Al-Dahle, A. and others},
year = {2024},
eprint = {2407.21783},
archivePrefix = {arXiv}
}
Llama 3.2 — Llama 3.2 Model Card
Lightweight text models (1B, 3B) and multimodal vision-language models (11B, 90B); no standalone paper, cite the Herd paper or model card
@misc{meta2024llama32,
title = {{Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models}},
author = {{AI@Meta}},
year = {2024},
url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/},
note = {Accessed February 9, 2026}
}
Llama 3.3 — Llama 3.3 Model Card
70B instruction-tuned model matching Llama 3.1 405B quality at lower cost; no standalone paper
@misc{meta2024llama33,
title = {{Llama 3.3 Model Card}},
author = {{AI@Meta}},
year = {2024},
url = {https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md},
note = {Accessed February 9, 2026}
}
Llama 4 — Llama 4 Model Card
Natively multimodal MoE models (Scout 17Bx16E, Maverick 17Bx128E); no technical paper published
@misc{meta2025llama4,
title = {{Llama 4 Model Card}},
author = {{AI@Meta}},
year = {2025},
url = {https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md},
note = {Accessed February 9, 2026}
}
Anthropic
Claude 3 — The Claude 3 Model Family: Opus, Sonnet, Haiku
Three-tier multimodal model family achieving state-of-the-art on GPQA, MMLU, and MMMU
@misc{anthropic2024claude3,
title = {{The Claude 3 Model Family: Opus, Sonnet, Haiku}},
author = {Anthropic},
year = {2024},
url = {https://www.anthropic.com/news/claude-3-family},
note = {Accessed February 9, 2026}
}
Claude 3.5 — The Claude 3.5 Model Family Addendum
Updated Claude 3 model card with Claude 3.5 Sonnet and Haiku evaluations
@misc{anthropic2024claude35,
title = {{The Claude 3.5 Model Family Addendum}},
author = {Anthropic},
year = {2024},
url = {https://www.anthropic.com/news/claude-3-5-sonnet},
note = {Accessed February 9, 2026}
}
Claude 3.7 Sonnet — Claude 3.7 Sonnet System Card
First hybrid reasoning model from Anthropic with configurable extended thinking (up to 128K tokens); visible chain-of-thought and dual-mode operation
@misc{anthropic2025claude37,
title = {{Claude 3.7 Sonnet System Card}},
author = {Anthropic},
year = {2025},
url = {https://www.anthropic.com/claude-3-7-sonnet-system-card},
note = {Accessed February 9, 2026}
}
Claude Opus 4 — Claude 4 System Card
Anthropic's most powerful model at release, capable of autonomous multi-hour workflows; deployed under the AI Safety Level 3 Standard
@misc{anthropic2025opus4,
title = {{Claude 4 System Card}},
author = {Anthropic},
year = {2025},
url = {https://www.anthropic.com/claude-4-system-card},
note = {Accessed February 9, 2026}
}
Claude Sonnet 4 — Claude 4 System Card
General-purpose successor to Claude 3.7 Sonnet with improved coding and hybrid thinking; deployed under the AI Safety Level 2 Standard
@misc{anthropic2025sonnet4,
title = {{Claude 4 System Card}},
author = {Anthropic},
year = {2025},
url = {https://www.anthropic.com/claude-4-system-card},
note = {Accessed February 9, 2026}
}
Claude Opus 4.5 — Claude Opus 4.5 System Card
State-of-the-art for coding, agents, and computer use; strong at real-world software engineering, deep research, and agentic workflows
@misc{anthropic2025opus45,
title = {{Claude Opus 4.5 System Card}},
author = {Anthropic},
year = {2025},
url = {https://www.anthropic.com/claude-opus-4-5-system-card},
note = {Accessed February 9, 2026}
}
Google / DeepMind
PaLM 2 — PaLM 2 Technical Report
Compute-optimal multilingual model powering Bard/Gemini with improved reasoning and coding
@misc{anil2023palm,
title = {{PaLM 2 Technical Report}},
author = {Anil, R. and Dai, A. and Firat, O. and Johnson, M. and Lepikhin, D. and Passos, A. and others},
year = {2023},
eprint = {2305.10403},
archivePrefix = {arXiv}
}
Gemini 1.0 — Gemini: A Family of Highly Capable Multimodal Models
First natively multimodal frontier model family (Ultra, Pro, Nano). Seminal because it pioneered training multimodality from scratch rather than bolting vision onto a text model, setting the direction for the field
@misc{geminiteam2023gemini,
title = {{Gemini: A Family of Highly Capable Multimodal Models}},
author = {{Gemini Team} and Anil, R. and Borgeaud, S. and Alayrac, J. and Yu, J. and Soricut, R. and others},
year = {2023},
eprint = {2312.11805},
archivePrefix = {arXiv}
}
Gemini 1.5 — Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context
Long-context MoE model supporting up to 10M tokens with near-perfect recall
@misc{geminiteam2024gemini15,
title = {{Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context}},
author = {{Gemini Team} and Reid, M. and Savinov, N. and Teplyashin, D. and Lepikhin, D. and Lillicrap, T. and others},
year = {2024},
eprint = {2403.05530},
archivePrefix = {arXiv}
}
Gemini 2.0 — Gemini 2.0 Blog
Agentic multimodal model with native tool use and multimodal output; no standalone technical report
@misc{google2024gemini2,
title = {{Gemini 2.0: Our New AI Model for the Agentic Era}},
author = {{Google DeepMind}},
year = {2024},
url = {https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/},
note = {Accessed February 9, 2026}
}
Gemma — Gemma: Open Models Based on Gemini Research and Technology
Open-weights 2B/7B models derived from Gemini research
@misc{gemmateam2024gemma,
title = {{Gemma: Open Models Based on Gemini Research and Technology}},
author = {{Gemma Team} and Mesnard, T. and Hardin, C. and Dadashi, R. and Bhupatiraju, S. and Pathak, S. and others},
year = {2024},
eprint = {2403.08295},
archivePrefix = {arXiv}
}
Gemma 2 — Gemma 2: Improving Open Language Models at a Practical Size
Knowledge-distilled 2B/9B/27B models with improved efficiency
@misc{gemmateam2024gemma2,
title = {{Gemma 2: Improving Open Language Models at a Practical Size}},
author = {{Gemma Team} and Riviere, M. and Pathak, S. and Sessa, P. and Hardin, C. and Bhupatiraju, S. and others},
year = {2024},
eprint = {2408.00118},
archivePrefix = {arXiv}
}
Gemma 3 — Gemma 3 Technical Report
Multimodal 1B--27B models with 128K context, hybrid attention, and vision understanding
@misc{gemmateam2025gemma3,
title = {{Gemma 3 Technical Report}},
author = {{Gemma Team} and Kamath, A. and Ferret, J. and Pathak, S. and Vieillard, N. and Ramé, A. and others},
year = {2025},
eprint = {2503.19786},
archivePrefix = {arXiv}
}
Gemini 2.5 Pro — Gemini 2.5 Technical Report
State-of-the-art thinking model with sparse MoE architecture, 1M token context, and native multimodal support; excels at coding, reasoning, and complex multi-source problems
@misc{geminiteam2025gemini25,
title = {{Gemini 2.5 Technical Report}},
author = {{Gemini Team} and others},
year = {2025},
eprint = {2507.06261},
archivePrefix = {arXiv}
}
Gemini 2.5 Flash — Gemini 2.5 Flash Model Card
Hybrid reasoning model with controllable thinking budget for cost-efficient deployment; part of the 2.5 family
@misc{geminiteam2025gemini25flash,
title = {{Gemini 2.5 Flash Model Card}},
author = {{Google DeepMind}},
year = {2025},
url = {https://blog.google/products/gemini/gemini-2-5-model-family-expands/},
note = {Accessed February 9, 2026}
}
DeepSeek
DeepSeek-V2 — DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
236B MoE model (21B active) with Multi-head Latent Attention for efficient inference
@misc{deepseek2024v2,
title = {{DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}},
author = {{DeepSeek-AI} and Liu, A. and Feng, B. and Wang, B. and Wang, B. and Liu, B. and others},
year = {2024},
eprint = {2405.04434},
archivePrefix = {arXiv}
}
DeepSeek-Coder-V2 — DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Code-specialized MoE model competitive with GPT-4 Turbo on coding benchmarks
@misc{zhu2024deepseek,
title = {{DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence}},
author = {Zhu, Q. and Guo, D. and Shao, Z. and Yang, D. and Wang, P. and Xu, R. and others},
year = {2024},
eprint = {2406.11931},
archivePrefix = {arXiv}
}
DeepSeek-V3 — DeepSeek-V3 Technical Report
671B MoE model (37B active) trained on 14.8T tokens for only 2.8M H800 GPU hours, rivaling frontier closed models
@misc{deepseek2024v3,
title = {{DeepSeek-V3 Technical Report}},
author = {{DeepSeek-AI} and Liu, A. and Feng, B. and Xue, B. and Wang, B. and Wu, B. and others},
year = {2024},
eprint = {2412.19437},
archivePrefix = {arXiv}
}
DeepSeek-R1 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Large reasoning model trained with pure RL, matching OpenAI o1 on math and code benchmarks. Seminal because it showed reinforcement learning alone can elicit chain-of-thought reasoning, opening the open-source reasoning-model paradigm
@misc{deepseek2025r1,
title = {{DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning}},
author = {{DeepSeek-AI} and Guo, D. and Yang, D. and Zhang, H. and Song, J. and Zhang, R. and others},
year = {2025},
eprint = {2501.12948},
archivePrefix = {arXiv}
}
Mistral AI
Mistral 7B — Mistral 7B
Compact 7B model outperforming Llama 2 13B on all benchmarks using sliding window attention and GQA. Seminal because it proved small, well-engineered models could rival much larger ones and launched the European open-weights ecosystem
@misc{jiang2023mistral,
title = {{Mistral 7B}},
author = {Jiang, A. and Sablayrolles, A. and Mensch, A. and Bamford, C. and Chaplot, D. and de las Casas, D. and others},
year = {2023},
eprint = {2310.06825},
archivePrefix = {arXiv}
}
Mixtral 8x7B — Mixtral of Experts
Sparse MoE with 8 experts (12.9B active of 46.7B total), matching or beating GPT-3.5 and Llama 2 70B. Seminal because it brought sparse Mixture-of-Experts into the open-weights mainstream, making MoE the go-to efficiency architecture
@misc{jiang2024mixtral,
title = {{Mixtral of Experts}},
author = {Jiang, A. and Sablayrolles, A. and Roux, A. and Mensch, A. and Savary, B. and Bamford, C. and others},
year = {2024},
eprint = {2401.04088},
archivePrefix = {arXiv}
}
Mixtral 8x22B — Mixtral 8x22B Model Card
Scaled MoE model (39B active of 141B total); no standalone paper, cite Mixtral paper or model card
@misc{mistral2024mixtral8x22b,
title = {{Cheaper, Better, Faster, Stronger -- Mixtral 8x22B}},
author = {{Mistral AI}},
year = {2024},
url = {https://mistral.ai/news/mixtral-8x22b/},
note = {Accessed February 9, 2026}
}
Mistral Large 2 — Mistral Large 2 Blog
123B dense model with 128K context, strong on code and multilingual tasks; no technical paper
@misc{mistral2024large2,
title = {{Mistral Large 2}},
author = {{Mistral AI}},
year = {2024},
url = {https://mistral.ai/news/mistral-large-2407/},
note = {Accessed February 9, 2026}
}
Mistral Small 3 — Mistral Small 3 Blog
Efficient 24B model balancing latency and quality for edge/on-device use; no technical paper
@misc{mistral2025small3,
title = {{Mistral Small 3}},
author = {{Mistral AI}},
year = {2025},
url = {https://mistral.ai/news/mistral-small-3/},
note = {Accessed February 9, 2026}
}
Alibaba / Qwen
Qwen — Qwen Technical Report
First generation of Qwen models (1.8B--72B) with strong multilingual and tool-use capabilities
@misc{bai2023qwen,
title = {{Qwen Technical Report}},
author = {Bai, J. and Bai, S. and Chu, Y. and Cui, Z. and Dang, K. and Deng, X. and others},
year = {2023},
eprint = {2309.16609},
archivePrefix = {arXiv}
}
Qwen2 — Qwen2 Technical Report
Second generation (0.5B--72B) with GQA and expanded multilingual support
@misc{yang2024qwen2,
title = {{Qwen2 Technical Report}},
author = {Yang, A. and Yang, B. and Hui, B. and Zheng, B. and Yu, B. and Zhou, C. and others},
year = {2024},
eprint = {2407.10671},
archivePrefix = {arXiv}
}
Qwen2.5 — Qwen2.5 Technical Report
Flagship open-weights family (0.5B--72B + MoE), trained on 18T tokens, with the 72B model competitive with Llama 3 405B
@misc{qwen2024qwen25,
title = {{Qwen2.5 Technical Report}},
author = {{Qwen Team} and Yang, A. and Yang, B. and Hui, B. and Zheng, B. and Yu, B. and others},
year = {2024},
eprint = {2412.15115},
archivePrefix = {arXiv}
}
Qwen2.5-Coder — Qwen2.5-Coder Technical Report
Code-specialized series trained on 5.5T code tokens, matching GPT-4o on coding tasks
@misc{hui2024qwen25coder,
title = {{Qwen2.5-Coder Technical Report}},
author = {Hui, B. and Yang, J. and Cui, Z. and Yang, J. and Liu, D. and Zhang, L. and others},
year = {2024},
eprint = {2409.12186},
archivePrefix = {arXiv}
}
QwQ — QwQ: Reflect Deeply on the Boundaries of the Unknown
32B reasoning model derived from Qwen2.5 with extended chain-of-thought; no standalone paper
@misc{qwen2024qwq,
title = {{QwQ: Reflect Deeply on the Boundaries of the Unknown}},
author = {{Qwen Team}},
year = {2024},
url = {https://qwenlm.github.io/blog/qwq-32b-preview/},
note = {Accessed February 9, 2026}
}
Microsoft
Phi-1 — Textbooks Are All You Need
1.3B code model trained on synthetic "textbook-quality" data, achieving strong coding performance. Seminal because it proved data quality can substitute for scale, pioneering the synthetic-data paradigm that influenced nearly every small-model effort since
@misc{gunasekar2023textbooks,
title = {{Textbooks Are All You Need}},
author = {Gunasekar, S. and Zhang, Y. and Aneja, J. and Mendes, C. and Del Giorno, A. and Gopi, S. and others},
year = {2023},
eprint = {2306.11644},
archivePrefix = {arXiv}
}
Phi-1.5 — Textbooks Are All You Need II: phi-1.5 Technical Report
1.3B model extending synthetic data approach to commonsense reasoning
@misc{li2023textbooks,
title = {{Textbooks Are All You Need II: phi-1.5 Technical Report}},
author = {Li, Y. and Bubeck, S. and Eldan, R. and Del Giorno, A. and Gunasekar, S. and Lee, Y. and others},
year = {2023},
eprint = {2309.05463},
archivePrefix = {arXiv}
}
Phi-3 — Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
3.8B model rivaling Mixtral 8x7B, trained on heavily filtered web data and synthetic data
@misc{abdin2024phi3,
title = {{Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone}},
author = {Abdin, M. and Jacobs, S. and Awan, A. and Aneja, J. and Awadallah, A. and Awadalla, H. and others},
year = {2024},
eprint = {2404.14219},
archivePrefix = {arXiv}
}
Phi-4 — Phi-4 Technical Report
14B model surpassing its GPT-4 teacher on STEM via strategic synthetic data throughout training
@misc{abdin2024phi4,
title = {{Phi-4 Technical Report}},
author = {Abdin, M. and Aneja, J. and Behl, H. and Bubeck, S. and Eldan, R. and Gunasekar, S. and others},
year = {2024},
eprint = {2412.08905},
archivePrefix = {arXiv}
}
Cohere
Command R — Command R Model Card
RAG-optimized 35B model with 128K context and strong tool-use; no technical paper
@misc{cohere2024commandr,
title = {{Command R: Retrieval-Augmented Generation at Scale}},
author = {Cohere},
year = {2024},
url = {https://cohere.com/blog/command-r},
note = {Accessed February 9, 2026}
}
Aya 23 — Aya 23: Open Weight Releases to Further Multilingual Progress
Open-weights 8B/35B multilingual models covering 23 languages
@misc{aryabumi2024aya,
title = {{Aya 23: Open Weight Releases to Further Multilingual Progress}},
author = {Aryabumi, V. and Dang, J. and Talupuru, D. and Dash, S. and Cairuz, D. and Lin, H. and others},
year = {2024},
eprint = {2405.15032},
archivePrefix = {arXiv}
}
Aya Expanse — Aya Expanse: Connecting Our World
Successor to Aya 23 covering the same 23 languages with improved multilingual performance
@misc{dang2024aya,
title = {{Aya Expanse: Connecting Our World}},
author = {Dang, J. and Aryabumi, V. and Talupuru, D. and Dash, S. and Cairuz, D. and Lin, H. and others},
year = {2024},
eprint = {2412.04261},
archivePrefix = {arXiv}
}
Command A — Command A Model Card
111B parameter model with 256K context for complex agentic tasks; supports 23 languages and replaces Command R+
@misc{cohere2025commanda,
title = {{Command A}},
author = {Cohere},
year = {2025},
url = {https://cohere.com/blog/command-a},
note = {Accessed February 9, 2026}
}
AI21 Labs
Jamba — Jamba: A Hybrid Transformer-Mamba Language Model
Novel hybrid architecture combining Transformer and Mamba (SSM) layers with MoE. Seminal because it was the first production hybrid Transformer-SSM model, opening a new architectural design axis beyond pure Transformers
@misc{lieber2024jamba,
title = {{Jamba: A Hybrid Transformer-Mamba Language Model}},
author = {Lieber, O. and Lenz, B. and Bata, H. and Cohen, G. and Osin, J. and Dalmedigos, I. and others},
year = {2024},
eprint = {2403.19887},
archivePrefix = {arXiv}
}
Jamba 1.5 — Jamba 1.5: Hybrid Transformer-Mamba Models at Scale
Scaled hybrid SSM-Transformer models (Mini 12B active, Large 94B active) with 256K context
@misc{team2024jamba,
title = {{Jamba 1.5: Hybrid Transformer-Mamba Models at Scale}},
author = {{Jamba Team} and Bata, H. and Cohen, G. and Daoulas, I. and Dalmedigos, I. and Gera, A. and others},
year = {2024},
eprint = {2408.12570},
archivePrefix = {arXiv}
}
xAI
Grok-1 — Grok-1 Model Card
314B MoE model open-sourced under Apache 2.0; no technical paper published
@misc{xai2024grok1,
title = {{Grok-1}},
author = {{xAI}},
year = {2024},
url = {https://x.ai/blog/grok/model-card},
note = {Accessed February 9, 2026}
}
Grok-2 — Grok-2 Blog
Frontier-class model with strong reasoning and vision capabilities; no technical paper
@misc{xai2024grok2,
title = {{Grok-2}},
author = {{xAI}},
year = {2024},
url = {https://x.ai/blog/grok-2},
note = {Accessed February 9, 2026}
}
Grok-3 — Grok-3 Blog
Trained on 200K H100 GPUs (10x Grok-2 compute) with reasoning modes (Think, Big Brain, DeepSearch); outperforms GPT-4o and Gemini 2 Pro on AIME and GPQA
@misc{xai2025grok3,
title = {{Grok-3}},
author = {{xAI}},
year = {2025},
url = {https://x.ai/news/grok-3},
note = {Accessed February 9, 2026}
}
Grok-4 — Grok-4 Model Card
Advanced reasoning model with native tool use and real-time search; 128K context with deep domain knowledge across finance, healthcare, law, and science
@misc{xai2025grok4,
title = {{Grok-4 Model Card}},
author = {{xAI}},
year = {2025},
url = {https://x.ai/news/grok-4},
note = {Accessed February 9, 2026}
}
01.AI
Yi — Yi: Open Foundation Models by 01.AI
Bilingual (English/Chinese) 6B/34B models trained on 3T tokens with strong reasoning
@misc{young2024yi,
title = {{Yi: Open Foundation Models by 01.AI}},
author = {Young, A. and Chen, B. and Li, C. and Huang, C. and Zhang, G. and Zhang, G. and others},
year = {2024},
eprint = {2403.04652},
archivePrefix = {arXiv}
}
Technology Innovation Institute (TII)
Falcon — The Falcon Series of Open Language Models
Open-weights 7B/40B/180B models trained on curated web data (RefinedWeb)
@misc{almazrouei2023falcon,
title = {{The Falcon Series of Open Language Models}},
author = {Almazrouei, E. and Alobeidli, H. and Alshamsi, A. and Cappelli, A. and Cojocaru, R. and others},
year = {2023},
eprint = {2311.16867},
archivePrefix = {arXiv}
}
Falcon 2 — Falcon 2: An 11 Billion Parameter Large Language Model
11B model with vision variant, competitive with much larger open models
@misc{malartic2024falcon2,
title = {{Falcon 2: An 11 Billion Parameter Large Language Model}},
author = {Malartic, Q. and Chowdhury, N. and Cojocaru, R. and others},
year = {2024},
eprint = {2407.14885},
archivePrefix = {arXiv}
}
Stability AI
Stable LM 2 — Stable LM 2 1.6B Technical Report
Efficient 1.6B model competitive with larger models on downstream tasks
@misc{bellagente2024stable,
title = {{Stable LM 2 1.6B Technical Report}},
author = {Bellagente, M. and Tow, J. and Mahan, D. and Phang, J. and others},
year = {2024},
eprint = {2402.17834},
archivePrefix = {arXiv}
}
NVIDIA
Nemotron-4 — Nemotron-4 340B Technical Report
340B model with a synthetic data generation pipeline for alignment, designed primarily for generating training data for smaller models
@misc{adler2024nemotron,
title = {{Nemotron-4 340B Technical Report}},
author = {Adler, B. and Agarwal, N. and Aithal, A. and Anh, D. and Bhatt, P. and Choi, J. and others},
year = {2024},
eprint = {2406.11704},
archivePrefix = {arXiv}
}
Databricks
DBRX — DBRX Blog
Fine-grained 132B MoE model (36B active) outperforming Llama 2 70B and Mixtral; no arXiv paper
@misc{databricks2024dbrx,
title = {{Introducing DBRX: A New State-of-the-Art Open LLM}},
author = {Databricks},
year = {2024},
url = {https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm},
note = {Accessed February 9, 2026}
}
Amazon
Nova Pro — Amazon Nova Pro Model Card
Highly capable multimodal model for text, image, and video with strong accuracy-speed-cost balance; available through Amazon Bedrock
@misc{amazon2024novapro,
title = {{Amazon Nova: A New Generation of Foundation Models}},
author = {Amazon},
year = {2024},
url = {https://aws.amazon.com/ai/generative-ai/nova/},
note = {Accessed February 9, 2026}
}
Nova Lite — Amazon Nova Lite Model Card
Low-cost multimodal model optimized for fast processing of image, video, and text inputs
@misc{amazon2024novalite,
title = {{Amazon Nova: A New Generation of Foundation Models}},
author = {Amazon},
year = {2024},
url = {https://aws.amazon.com/ai/generative-ai/nova/},
note = {Accessed February 9, 2026}
}