Scaling up
Dataset: MATH500 (following Let’s Verify Step by Step)
Contents:
- Pass@1024 of DeepSeekMath-7B models from DART-Math
- Pass@128/256 of Mistral-7B & DeepSeekMath-7B
- Cost of large-scale sampling
- Pass@64 of DeepSeekMath-7B from DeepSeekMath
Pass@1024 of DeepSeekMath-7B models from DART-Math

Pass@k accuracy of different DeepSeekMath (DSMath) models and temperatures on MATH500 (Lightman et al., 2024), a subset of the MATH test set. With enough trials, the models are actually able to sample answer-correct responses to most (>99%) of the queries.

Pass@128/256 of Mistral-7B & DeepSeekMath-7B

adapter
Explanation: the "adapter" turns the base text-completion model into a QA model for MATH500.
- Few-Shot: few-shot ICL; here we use 8 shots following MAmmoTH.
- Instruct: SFT, tested 0-shot; here we use the SFT models from the HF Hub listed below.
- RL: RL, tested 0-shot; here we use the RL models from the HF Hub listed below.
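As a sketch of what the Few-Shot adapter does, the snippet below assembles a few-shot ICL prompt for a base completion model. The demonstration shots here are made-up toy examples, not the actual 8 MAmmoTH shots:

```python
# Illustrative few-shot ICL prompt assembly for a base text-completion model.
# The shots below are invented placeholders; the real setup uses 8 shots
# following MAmmoTH.

def build_fewshot_prompt(shots, question):
    """Concatenate (problem, solution) demonstration pairs, then the test problem."""
    parts = []
    for q, a in shots:
        parts.append(f"Problem: {q}\nSolution: {a}\n")
    # End with an open "Solution:" so the model completes the answer.
    parts.append(f"Problem: {question}\nSolution:")
    return "\n".join(parts)

shots = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5."),
    ("What is 10 / 2?", "10 / 2 = 5. The answer is 5."),
]
prompt = build_fewshot_prompt(shots, "What is 7 * 6?")
print(prompt.endswith("Solution:"))  # True -- the model continues from here
```

The base model then generates a completion of the final "Solution:", which is parsed for the answer.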
ensemble_type
Explanation:
- avg: mean of pass ratios (#correct samples / #all samples for the same problem) over all test samples (similar to Pass@1). Each query has a float pass ratio in [0, 1].
- maj: pass rate by majority voting. A query counts as passed if the majority-voted answer among the sampled answers is correct.
- any: the common Pass@k. A query counts as passed if any of the sampled answers is correct.
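The three ensemble metrics above can be sketched as follows (a minimal reimplementation over lists of sampled answers, not the actual evaluation code):

```python
from collections import Counter

def avg_pass_ratio(answers_per_query, gold):
    """avg: mean pass ratio (#correct / #samples) over queries -- similar to Pass@1."""
    ratios = [
        sum(a == g for a in answers) / len(answers)
        for answers, g in zip(answers_per_query, gold)
    ]
    return sum(ratios) / len(ratios)

def maj_pass_rate(answers_per_query, gold):
    """maj: a query passes if the majority-voted sampled answer is correct."""
    passes = [
        Counter(answers).most_common(1)[0][0] == g
        for answers, g in zip(answers_per_query, gold)
    ]
    return sum(passes) / len(passes)

def any_pass_rate(answers_per_query, gold):
    """any: common Pass@k -- a query passes if any sampled answer is correct."""
    passes = [g in answers for answers, g in zip(answers_per_query, gold)]
    return sum(passes) / len(passes)

# Two queries, 4 samples each; gold answers are "5" and "9".
samples = [["5", "5", "7", "5"], ["9", "8", "8", "8"]]
gold = ["5", "9"]
print(avg_pass_ratio(samples, gold))  # (0.75 + 0.25) / 2 = 0.5
print(maj_pass_rate(samples, gold))   # maj votes: "5" (correct), "8" (wrong) -> 0.5
print(any_pass_rate(samples, gold))   # both queries contain a correct sample -> 1.0
```

Note the ordering avg <= maj is not guaranteed in general, but any is always an upper bound on both, since a single correct sample suffices.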
Cost of large-scale sampling
Sampling 64 samples per prompt on MATH500 with DeepSeekMath-7B-RL using vLLM on one A800-PCIe (80 GB) takes ~1.5 h (~170 ms per sample).
Dataset: MATH500
Framework: vLLM
Device: A800-PCIe(80GB) * 1
#Shots: 1 for base models, 0 for instruction-tuned models
Parameters:
| Model | Adapter | #Samples per prompt | Total time | Time per prompt (s) | Time per sample (ms) |
| --- | --- | --- | --- | --- | --- |
| DeepSeekMath-7B-RL | RL | 64 | ~1.5 h | ~10.8 | ~170 |
Pass@64 of DeepSeekMath-7B from DeepSeekMath
