[perftest] core-to-core-latency
Usage
```
core-to-core-latency

USAGE:
    core-to-core-latency [OPTIONS] [ARGS]

ARGS:
    <NUM_ITERATIONS>    The number of iterations per sample [default: 1000]
    <NUM_SAMPLES>       The number of samples [default: 300]

OPTIONS:
    -b, --bench <BENCH>    Select which benchmark to run, in a comma delimited list, e.g., '1,3'
                            1: CAS latency on a single shared cache line.
                            2: Single-writer single-reader latency on two shared cache lines.
                            3: One writer and one reader on many cache line, using the clock.
                            [default: 1]
    -c, --cores <CORES>    Specify the cores by id that should be used, comma delimited. By default
                           all cores are used
        --csv              Outputs the mean latencies in CSV format on stdout
    -h, --help             Print help information
```
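- `-b`: which benchmark(s) to run:
  - `1` (CAS, compare-and-swap): two threads repeatedly perform an atomic compare-and-swap on a single shared cache line to measure its latency (the code is walked through below).
  - `2`: single-writer single-reader latency on two shared cache lines.
  - `3`: one writer and one reader on many cache lines, using the clock.
  - Default: `1` (CAS).
- `-c`: the cores (by id) to test; by default all cores are measured.
- `--csv`: print the mean latencies to stdout in CSV format.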
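Bench 2 is not covered in the code details. As a rough illustration only, here is a minimal std-only Rust sketch of one way a single-writer/single-reader ping-pong over two cache lines could be structured: the writer only stores to `req` and the reader only stores to `ack`, so each line has exactly one writer. This layout is my assumption, not the tool's implementation. `swsr_two_lines` and `Line` are hypothetical names, the 64-byte cache-line size is assumed, there is no core pinning, and `std::time::Instant` stands in for the tool's clock.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Barrier;
use std::thread;
use std::time::Instant;

// Assumed 64-byte cache line: each counter gets its own line,
// so the writer's and the reader's stores never share a line.
#[repr(align(64))]
struct Line(AtomicU64);

/// Hypothetical sketch of a single-writer single-reader ping-pong.
/// The writer publishes a sequence number on `req`; the reader waits
/// for it and echoes it back on `ack`.
/// Returns one estimated one-way latency (ns) per sample.
fn swsr_two_lines(num_round_trips: u64, num_samples: u64) -> Vec<f64> {
    let req = Line(AtomicU64::new(0));
    let ack = Line(AtomicU64::new(0));
    let barrier = Barrier::new(2);
    let total = num_round_trips * num_samples;

    thread::scope(|s| {
        // Reader: only reads `req`, only writes `ack`.
        s.spawn(|| {
            barrier.wait();
            for i in 1..=total {
                while req.0.load(Ordering::Acquire) != i {
                    std::hint::spin_loop();
                }
                ack.0.store(i, Ordering::Release);
            }
        });

        // Writer: only writes `req`, only reads `ack`, and times each sample.
        let writer = s.spawn(|| {
            let mut results = Vec::with_capacity(num_samples as usize);
            barrier.wait();
            let mut seq = 0u64;
            for _ in 0..num_samples {
                let start = Instant::now();
                for _ in 0..num_round_trips {
                    seq += 1;
                    req.0.store(seq, Ordering::Release);
                    while ack.0.load(Ordering::Acquire) != seq {
                        std::hint::spin_loop();
                    }
                }
                let ns = start.elapsed().as_nanos() as f64;
                // One round trip = two one-way hand-offs, hence / 2.
                results.push(ns / num_round_trips as f64 / 2.0);
            }
            results
        });
        writer.join().unwrap()
    })
}
```

Unlike the CAS benchmark, no thread in this sketch does a read-modify-write on a contended line; the idea is that each line is only ever written by one core and read by the other.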
Code details
(Since I am not familiar with Rust, I will only roughly show the relevant code for the three benchmarks.)
CAS
```rust
// Excerpt from the tool's CAS benchmark. `Bench`, `Count`, `PING`/`PONG` and the
// `barrier`/`flag` fields come from the surrounding crate; `CoreId` and `Clock`
// come from its dependencies.
impl super::Bench for Bench {
    // The two threads modify the same cacheline.
    // This is useful to benchmark spinlock performance.
    //
    // NOTE: both threads modify the same cache line:
    // - thread 1 (pong): CAS the flag PING -> PONG
    // - thread 2 (ping): CAS the flag PONG -> PING, time the
    //   PING -> PONG -> PING round trips, and record the result
    fn run(
        &self,
        (ping_core, pong_core): (CoreId, CoreId),
        clock: &Clock,
        num_round_trips: Count,
        num_samples: Count,
    ) -> Vec<f64> {
        let state = self;
        crossbeam_utils::thread::scope(|s| {
            // Spawn the pong thread, pinned to pong_core.
            let pong = s.spawn(move |_| {
                core_affinity::set_for_current(pong_core);
                // Wait until the ping thread also reaches the barrier.
                state.barrier.wait();
                // Perform num_round_trips * num_samples hand-offs in total:
                // - num_samples: how many samples (groups) are taken
                // - num_round_trips: PING -> PONG -> PING round trips per sample
                for _ in 0..(num_round_trips * num_samples) {
                    while state.flag.compare_exchange(PING, PONG,
                        Ordering::Relaxed, Ordering::Relaxed).is_err() {}
                }
            });
            // Spawn the ping thread, pinned to ping_core.
            let ping = s.spawn(move |_| {
                core_affinity::set_for_current(ping_core);
                let mut results = Vec::with_capacity(num_samples as usize);
                // Wait until the pong thread also reaches the barrier.
                state.barrier.wait();
                // Collect num_samples samples.
                for _ in 0..num_samples {
                    let start = clock.raw();
                    // Each sample times num_round_trips round trips.
                    for _ in 0..num_round_trips {
                        while state.flag.compare_exchange(PONG, PING,
                            Ordering::Relaxed, Ordering::Relaxed).is_err() {}
                    }
                    // Elapsed time for this sample.
                    let end = clock.raw();
                    let duration = clock.delta(start, end).as_nanos();
                    // A PING -> PONG -> PING round trip moves the cache line between
                    // the two cores twice, so divide by 2 to get the one-way latency.
                    results.push(duration as f64 / num_round_trips as f64 / 2.0);
                }
                results
            });
            pong.join().unwrap();
            ping.join().unwrap()
        }).unwrap()
    }
}
```
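To experiment with the same ping-pong outside the crate, here is a minimal sketch that keeps the structure of the excerpt above (a PING/PONG flag, Relaxed compare-and-swap spins, and the division by 2) but uses only the standard library. It substitutes `std::thread::scope` for `crossbeam_utils::thread::scope` and `std::time::Instant` for the tool's `Clock`, and it does no core pinning, so the OS decides where the two threads run. `cas_ping_pong` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Barrier;
use std::thread;
use std::time::Instant;

const PING: bool = false;
const PONG: bool = true;

/// Std-only sketch of the CAS ping-pong (no core pinning).
/// Returns one estimated one-way latency (ns) per sample.
fn cas_ping_pong(num_round_trips: u64, num_samples: u64) -> Vec<f64> {
    // Both threads CAS this single flag, so they contend on one cache line.
    let flag = AtomicBool::new(PING);
    let barrier = Barrier::new(2);

    thread::scope(|s| {
        // Pong: wait until the flag is PING, then flip it to PONG.
        s.spawn(|| {
            barrier.wait();
            for _ in 0..num_round_trips * num_samples {
                while flag
                    .compare_exchange(PING, PONG, Ordering::Relaxed, Ordering::Relaxed)
                    .is_err()
                {}
            }
        });

        // Ping: wait until the flag is PONG, flip it back to PING, time each sample.
        let ping = s.spawn(|| {
            barrier.wait();
            let mut results = Vec::with_capacity(num_samples as usize);
            for _ in 0..num_samples {
                let start = Instant::now();
                for _ in 0..num_round_trips {
                    while flag
                        .compare_exchange(PONG, PING, Ordering::Relaxed, Ordering::Relaxed)
                        .is_err()
                    {}
                }
                let ns = start.elapsed().as_nanos() as f64;
                // One round trip = two cache-line hand-offs, hence / 2.
                results.push(ns / num_round_trips as f64 / 2.0);
            }
            results
        });
        ping.join().unwrap()
    })
}
```

For example, if one sample of 1000 round trips took 100,000 ns, the recorded value would be 100,000 / 1000 / 2 = 50 ns. Without pinning, the results mix whatever cores the scheduler picks, so they are only a rough approximation of the tool's per-core-pair numbers.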
This post is licensed under CC BY 4.0 by the author.