Scaling up
Dataset: MATH500 (following Let’s Verify Step by Step)
Pass@1024 of DeepSeekMath-7B models from DART-Math
Pass@128/256 of Mistral-7B & DeepSeekMath-7B
adapter
explanation: how the base text-completion model is adapted to answer MATH500 as a QA task (a prompt-construction sketch follows this list).
- Few-Shot: few-shot ICL; here we use 8 shots following MAmmoTH.
- Instruct: SFT, then test 0-shot; here we use the SFT models from the HF Hub listed below.
- RL: RL training, then test 0-shot; here we use the RL models from the HF Hub listed below.
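A minimal sketch of how the adapters differ at the prompt level, assuming a hypothetical `build_prompt` helper; the concrete 8 MAmmoTH shots and each model's chat template are elided:

```python
from transformers import PreTrainedTokenizerBase

# Placeholder for the 8 MAmmoTH-style exemplars (the actual shots are elided here).
FEW_SHOT_PREFIX = "Problem: <example problem>\nSolution: <worked solution>\n\n" * 8

def build_prompt(question: str, adapter: str, tokenizer: PreTrainedTokenizerBase) -> str:
    """Hypothetical helper: format a MATH500 question for a given adapter."""
    if adapter == "Few-Shot":
        # Base text-completion model: prepend the shots, then the question.
        return FEW_SHOT_PREFIX + f"Problem: {question}\nSolution:"
    # Instruct / RL models: 0-shot, rendered with the model's own chat template.
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```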
ensemble_type
explanation (a scoring sketch follows this list):
- avg: mean value of the pass ratios (#correct samples / #all samples for the same problem) over all test samples (similar to Pass@1). Each query has a float pass ratio in [0, 1].
- maj: pass rate by majority voting. Each query is counted as passed if the majority-voted answer among the sampled answers is correct.
- any: the common Pass@k. Each query is counted as passed if any of the sampled answers is correct.
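A minimal scoring sketch of the three ensemble types, assuming each query carries a list of sampled answers and one gold answer; real math grading needs semantic equivalence checking, reduced here to plain string comparison:

```python
from collections import Counter

def avg_pass(answers: list[list[str]], gold: list[str]) -> float:
    """avg: mean pass ratio (#correct / #samples) over all queries (similar to Pass@1)."""
    ratios = [sum(a == g for a in ans) / len(ans) for ans, g in zip(answers, gold)]
    return sum(ratios) / len(ratios)

def maj_pass(answers: list[list[str]], gold: list[str]) -> float:
    """maj: a query passes if the majority-voted answer is correct."""
    voted = [Counter(ans).most_common(1)[0][0] for ans in answers]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)

def any_pass(answers: list[list[str]], gold: list[str]) -> float:
    """any: common Pass@k; a query passes if any sampled answer is correct."""
    return sum(any(a == g for a in ans) for ans, g in zip(answers, gold)) / len(gold)
```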
Cost of large-scale sampling
Sampling 64 samples per prompt on MATH500 with DeepSeekMath-7B-RL using vLLM on one A800-PCIe (80GB) takes ~1.5 h: 500 prompts × 64 samples = 32,000 samples, i.e. 5,400 s / 32,000 ≈ 170 ms per sample.
Dataset: MATH500
Framework: vLLM
Device: A800-PCIe(80GB) * 1
#Shots: 1 for base models, 0 for instruction-tuned models
Parameters:
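A minimal sketch of such a run with vLLM, assuming the HF model id `deepseek-ai/deepseek-math-7b-rl`; temperature and max_tokens are placeholders, since the actual "Parameters:" entry above is unspecified:

```python
from vllm import LLM, SamplingParams

# Assumed HF model id; swap in the model under test.
llm = LLM(model="deepseek-ai/deepseek-math-7b-rl")

# n=64 draws 64 samples per prompt; temperature/max_tokens are placeholders,
# not the (unspecified) parameters used in the timing above.
sampling_params = SamplingParams(n=64, temperature=0.7, max_tokens=2048)

# Placeholder prompt; in practice, the 500 MATH500 problems (0-shot for RL models).
prompts = ["Compute the remainder when 2^10 is divided by 7."]

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    # Each RequestOutput carries the 64 completions for its prompt.
    print(len(out.outputs), out.outputs[0].text[:80])
```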
| Model | Adapter | #Samples per prompt | Total time | Time per prompt (s) | Time per sample (ms) |
| --- | --- | --- | --- | --- | --- |
| DeepSeekMath-7B-RL | RL | 64 | ~1.5 h | ~10.8 | ~170 |
Pass@64 of DeepSeekMath-7B from DeepSeekMath