TL;DR — Shows that 175B-parameter language models can perform few-shot learning across a wide range of NLP tasks without gradient updates.
Abstract
GPT-3 demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
Cite this paper
BibTeX
@inproceedings{brown2020language,
  title = {Language Models are Few-Shot Learners},
  author = {Tom Brown and Benjamin Mann and Nick Ryder and others},
  year = {2020},
  booktitle = {NeurIPS 2020},
  doi = {10.48550/arxiv.2005.14165}
}
Related papers
1
Attention Is All You Need
Vaswani et al. (2017), 47,291 citations
2
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al. (2019), 31,847 citations
3
Scaling Laws for Neural Language Models
Kaplan et al. (2020), 4,203 citations
From Bluesky
Grace Lindqvist
@grace.ai.bsky.social
The GPT-3 paper is now 5 years old and the field has moved so fast that it reads almost like history. Wild how much ground has been covered.