Scaling up
Dataset: MATH500 (following Let’s Verify Step by Step)
Contents:
- Pass@1024 of DeepSeekMath-7B models from DART-Math
- Pass@128/256 of Mistral-7B & DeepSeekMath-7B
- Cost of large-scale sampling
- Pass@64 of DeepSeekMath-7B from DeepSeekMath
Pass@1024 of DeepSeekMath-7B models from DART-Math

Pass@k accuracy of different DeepSeekMath (DSMath) models and temperatures on MATH500 (Lightman et al., 2024), a subset of the MATH test set. With enough trials, the models are actually able to sample answer-correct responses to most (>99%) of the queries.

Pass@128/256 of Mistral-7B & DeepSeekMath-7B

adapter
Explanation: the "adapter" turns the base text-completion model into a QA model for MATH500.
- Few-Shot: few-shot ICL; here we use 8 shots following MAmmoTH.
- Instruct: SFT, tested 0-shot; here we use the SFT models from the HF Hub listed below.
- RL: RL, tested 0-shot; here we use the RL models from the HF Hub listed below.
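As a sketch of what the Few-Shot adapter does, the snippet below assembles a few-shot ICL prompt for a base completion model. The demonstration shots here are made-up toy examples, not the actual 8 MAmmoTH shots:

```python
# Illustrative few-shot ICL prompt assembly for a base text-completion model.
# The shots below are invented placeholders; the real setup uses 8 shots
# following MAmmoTH.

def build_fewshot_prompt(shots, question):
    """Concatenate (problem, solution) demonstration pairs, then the test problem."""
    parts = []
    for q, a in shots:
        parts.append(f"Problem: {q}\nSolution: {a}\n")
    # End with an open "Solution:" so the model completes the answer.
    parts.append(f"Problem: {question}\nSolution:")
    return "\n".join(parts)

shots = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5."),
    ("What is 10 / 2?", "10 / 2 = 5. The answer is 5."),
]
prompt = build_fewshot_prompt(shots, "What is 7 * 6?")
print(prompt.endswith("Solution:"))  # True -- the model continues from here
```

The base model then generates a completion of the final "Solution:", which is parsed for the answer.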
ensemble_type
Explanation:
- avg: mean of pass ratios (#correct samples / #all samples for the same problem) over all test samples (similar to Pass@1). Each query has a float pass ratio in [0, 1].
- maj: pass rate by majority voting. A query counts as passed if the majority-voted answer among the sampled answers is correct.
- any: the common Pass@k. A query counts as passed if any of the sampled answers is correct.
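The three ensemble metrics above can be sketched as follows (a minimal reimplementation over lists of sampled answers, not the actual evaluation code):

```python
from collections import Counter

def avg_pass_ratio(answers_per_query, gold):
    """avg: mean pass ratio (#correct / #samples) over queries -- similar to Pass@1."""
    ratios = [
        sum(a == g for a in answers) / len(answers)
        for answers, g in zip(answers_per_query, gold)
    ]
    return sum(ratios) / len(ratios)

def maj_pass_rate(answers_per_query, gold):
    """maj: a query passes if the majority-voted sampled answer is correct."""
    passes = [
        Counter(answers).most_common(1)[0][0] == g
        for answers, g in zip(answers_per_query, gold)
    ]
    return sum(passes) / len(passes)

def any_pass_rate(answers_per_query, gold):
    """any: common Pass@k -- a query passes if any sampled answer is correct."""
    passes = [g in answers for answers, g in zip(answers_per_query, gold)]
    return sum(passes) / len(passes)

# Two queries, 4 samples each; gold answers are "5" and "9".
samples = [["5", "5", "7", "5"], ["9", "8", "8", "8"]]
gold = ["5", "9"]
print(avg_pass_ratio(samples, gold))  # (0.75 + 0.25) / 2 = 0.5
print(maj_pass_rate(samples, gold))   # maj votes: "5" (correct), "8" (wrong) -> 0.5
print(any_pass_rate(samples, gold))   # both queries contain a correct sample -> 1.0
```

Note the ordering avg <= maj is not guaranteed in general, but any is always an upper bound on both, since a single correct sample suffices.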
Cost of large-scale sampling
Sampling 64 samples per prompt on MATH500 with DeepSeekMath-7B-RL using vLLM on one A800-PCIe (80 GB) takes ~1.5 h (~170 ms per sample).
Dataset: MATH500
Framework: vLLM
Device: A800-PCIe(80GB) * 1
#Shots: 1 for base models, 0 for instruction-tuned models
Parameters:
| Model | Adapter | #Samples per prompt | Total time | Time per prompt (s) | Time per sample (ms) |
| --- | --- | --- | --- | --- | --- |
| DeepSeekMath-7B-RL | RL | 64 | ~1.5 h | ~10.8 | ~170 |
Pass@64 of DeepSeekMath-7B from DeepSeekMath
