Scaling up
Dataset: MATH500 (following Let’s Verify Step by Step)

Pass@1024 of DeepSeekMath-7B models from DART-Math

pass@k accuracy of different DeepSeekMath (DSMath) models and temperatures on MATH500 (Lightman et al., 2024), a subset of the MATH test set. With enough trials, the models are able to sample answer-correct responses to most (>99%) of the queries.
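As a reference for how pass@k is typically computed, here is a minimal sketch of the standard unbiased estimator (from the Codex paper, Chen et al., 2021); whether this exact estimator was used here is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n samples were drawn per query and c of them are
    answer-correct."""
    if n - c < k:
        # Every size-k subset must contain a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 1024 samples drawn for a query, 3 of them correct:
print(round(pass_at_k(1024, 3, 1), 4))   # pass@1 ≈ 3/1024 ≈ 0.0029
print(pass_at_k(1024, 3, 1024))          # pass@1024 = 1.0
```

Averaging this quantity over all 500 queries gives the curves above.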

Pass@128/256 of Mistral-7B & DeepSeekMath-7B

[figure: Pass@128/256 of Mistral-7B & DeepSeekMath-7B on MATH500]
adapter explanation: “adapter” refers to how the base text-completion model is adapted to MATH500 as a QA task
  • Instruct: SFT, then test 0-shot; we use the SFT models from the HF Hub listed below.
  • RL: RL fine-tuning, then test 0-shot; we use the RL models from the HF Hub listed below.
ensemble_type explanation:
  • avg: mean of the pass ratios (#correct samples / #all samples for the same problem) over all test samples (similar to Pass@1). Each query has a float pass ratio in [0, 1].
  • maj: pass rate by majority voting. A query counts as passed if the majority-voted answer among the sampled answers is correct.
  • any: common Pass@k. A query counts as passed if any of the sampled answers is correct.

Cost of large-scale sampling

🔑
Sampling 64 samples per prompt on MATH500 with DeepSeekMath-7B-RL using vLLM on an A800-PCIe (80GB) takes ~1.5 hr (≈170 ms/sample)
Dataset: MATH500
Framework: vLLM
Device: A800-PCIe(80GB) * 1
#Shots: 1 for base models, 0 for instruction-tuned models
Parameters:
| Model | Adapter | #Samples per prompt | Total time | Time per prompt (s) | Time per sample (ms) |
| --- | --- | --- | --- | --- | --- |
|  | ICL | 64 | 1:10:15 | 8.43 | 131.7 |
|  | SFT | 64 | 1:10:24 | 8.45 | 132.0 |
|  | SFT | 128 | 2:25:23 | 17.45 | 136.3 |
|  | RL | 64 | 1:30:31 | 10.86 | 169.7 |
|  | ICL | 64 | 1:17:36 | 9.31 | 145.5 |
|  | SFT | 64 | 32:45 | 3.93 | 61.4 |
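The per-sample figures follow directly from the wall-clock times, since MATH500 has 500 prompts. A small sketch of the arithmetic (helper name is ours), reproducing the ≈170 ms/sample figure for the RL run:

```python
def per_sample_ms(total_time: str, n_prompts: int, n_samples: int) -> float:
    """Convert a wall-clock time like '1:30:31' (h:mm:ss) into
    milliseconds per generated sample."""
    parts = [int(p) for p in total_time.split(":")]
    while len(parts) < 3:          # also accept 'mm:ss'
        parts.insert(0, 0)
    h, m, s = parts
    seconds = h * 3600 + m * 60 + s
    return seconds * 1000 / (n_prompts * n_samples)

# RL run: 1:30:31 over 500 prompts x 64 samples
print(round(per_sample_ms("1:30:31", 500, 64), 1))  # ≈ 169.7 ms/sample
```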

Pass@64 of DeepSeekMath-7B from DeepSeekMath

[figure: Pass@64 of DeepSeekMath-7B, from the DeepSeekMath paper]