Parameter counts and compute (forward pass) for one layer of the network, per token. The first row indicates the number of FLOPs when following the implementation in Section A.1.
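
For readers who want to sanity-check per-layer, per-token numbers of this kind, the sketch below applies the common counting convention that a bias-free dense multiply with weight shape (d_in, d_out) costs about 2 * d_in * d_out FLOPs per token, worked through for a generic attention-plus-feed-forward block. The dimensions (d_model = 512, d_ff = 2048, seq_len = 1024) and the block structure are illustrative assumptions, not the Section A.1 implementation or the actual Table 2 entries.

```python
# Hedged sketch: parameter count and forward-pass FLOPs per token for one
# generic attention + feed-forward block. A dense layer with weight shape
# (d_in, d_out) is counted as 2 * d_in * d_out FLOPs per token (one multiply
# and one add per weight). All sizes are placeholders, not Table 2 values.

def dense(d_in, d_out):
    """Return (params, flops_per_token) for a bias-free dense layer."""
    params = d_in * d_out
    return params, 2 * params


def transformer_block(d_model=512, d_ff=2048, seq_len=1024):
    params = flops = 0
    # Q, K, V, and output projections of self-attention: four d_model x d_model matrices.
    for _ in range(4):
        p, f = dense(d_model, d_model)
        params, flops = params + p, flops + f
    # Two-layer feed-forward network.
    for d_in, d_out in ((d_model, d_ff), (d_ff, d_model)):
        p, f = dense(d_in, d_out)
        params, flops = params + p, flops + f
    # Attention scores and the weighted sum over values add a sequence-length-
    # dependent term (about 4 * seq_len * d_model FLOPs per token) but no parameters.
    flops += 4 * seq_len * d_model
    return params, flops


print(transformer_block())  # (3145728, 8388608) for the placeholder sizes above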

Source publication
Preprint
Full-text available
Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically...
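
To make the two quantities in the abstract concrete, here is a small, purely illustrative sketch of a power-law loss curve in model size and of how a quadratic per-sequence cost compares with an O(n log n) alternative (e.g. FFT-based convolution); the exponent ALPHA, the scale N_C, and the width D_MODEL are arbitrary placeholders, not values reported in the paper.

```python
import math

ALPHA, N_C = 0.08, 1e14   # placeholder power-law exponent and scale, not fitted values
D_MODEL = 512             # placeholder model width

def power_law_loss(n_params):
    """Generic power law in model size: L(N) = (N_c / N) ** alpha."""
    return (N_C / n_params) ** ALPHA

def quadratic_cost(seq_len, d=D_MODEL):
    """Self-attention-style cost per sequence: grows as seq_len ** 2."""
    return 4 * d * seq_len ** 2

def nlogn_cost(seq_len, d=D_MODEL):
    """An O(n log n) sequence-mixing cost, e.g. FFT-based convolution."""
    return 4 * d * seq_len * math.log2(seq_len)

for n in (1e6, 1e8, 1e10):
    print(f"N={n:.0e}  L(N)={power_law_loss(n):.2f}")
for n in (1024, 4096, 16384):
    print(f"n={n:5d}  quadratic={quadratic_cost(n):.2e}  n*log(n)={nlogn_cost(n):.2e}")
```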

Context in source publication

Context 1
... also list the number of floating point operations per-token for the (parallel) LMU model in Table 2. ...
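
The excerpt defers the actual figures to Table 2; as a minimal sketch of how such a per-token breakdown can be tallied (assuming only the 2 * d_in * d_out rule for dense multiplies), one can sum over the weight shapes of each component. The component names and shapes below are hypothetical, not the parallel LMU configuration from Section A.1.

```python
def tally(components):
    """Print a Table-2-style breakdown from {name: (d_in, d_out)} weight shapes."""
    total_params = total_flops = 0
    for name, (d_in, d_out) in components.items():
        params = d_in * d_out        # bias-free dense weight matrix
        flops = 2 * params           # one multiply and one add per weight, per token
        total_params += params
        total_flops += flops
        print(f"{name:>12}  params={params:>9,}  FLOPs/token={flops:>9,}")
    print(f"{'total':>12}  params={total_params:>9,}  FLOPs/token={total_flops:>9,}")

# Placeholder components and sizes; in the real model some matrices (e.g. the
# LMU's fixed state-update matrices) may not be learned, which would lower the
# parameter column but not the FLOP column.
tally({
    "proj_in": (512, 512),
    "mixer": (512, 512),
    "proj_out": (512, 512),
})
```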
