How to Make Good Use of the CPU Shared Cache

1.
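I found this article while reading a LinkedIn post about function pointers.

It is a summary translation of Software Techniques for Shared-Cache Multi-Core Systems, and it pairs well with my earlier post, IPC and False Sharing. So I present it here by mixing that translation with the original text.

Developing code optimized for the CPU is a way to get good performance. It just costs a lot of time and money!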

2.
The original starts by introducing the advantages of the shared cache architecture in multi-core CPUs.

(1)Efficient usage of the last-level cache
If one core idles, the other core takes all the shared cache
Reduces resource underutilization

(2)Flexibility for programmers
Allows more data-sharing opportunities for threads running on separate cores that are sharing cache
One core can pre-/post-process data for the other core
Alternative communication mechanisms between cores

(3)Reduce cache-coherency complexity
Reduced false sharing because of the shared cache
Less workload to maintain coherency, compared to the private cache architecture

(4)Reduce data-storage redundancy
The same data only needs to be stored once

(5)Reduce front-side bus traffic
Effective data sharing between cores allows data requests to be resolved at the shared-cache level instead of going all the way to the system memory

Now for the ways to take advantage of the shared cache.

The first is processor affinity. I have written about this several times myself; see Low Latency and CPU Affinity. What follows is the translation by Koosal introduced above.

This is about which thread to assign to which core. Threads that share no data and simply do their own work (contention-only threads) should be placed on cores that do not share a cache, while threads that share data (data-sharing threads) should be placed on cores that share a cache. If the threads are very closely tied to each other, placing them on the same core is much faster, because fewer cache misses occur when they share data.

Linux* example:

    /* Declared in <sched.h>; glibc requires _GNU_SOURCE to be defined first */

    /* Get the CPU affinity for a task */
    extern int sched_getaffinity (
        pid_t pid, size_t cpusetsize, cpu_set_t *cpuset);

    /* Set the CPU affinity for a task */
    extern int sched_setaffinity (
        pid_t pid, size_t cpusetsize,
        const cpu_set_t *cpuset);
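
As a minimal sketch of how these two calls might be used together, the C code below pins the calling thread to a single CPU. The helper names `first_allowed_cpu` and `pin_current_thread` are made up for illustration and do not come from the original article. The sketch assumes Linux with glibc, where `_GNU_SOURCE` has to be defined before `<sched.h>` is included.

```c
#define _GNU_SOURCE /* glibc exposes cpu_set_t and the CPU_* macros only with this */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: lowest-numbered CPU the calling thread may run on,
 * or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    }
    return -1;
}

/* Hypothetical helper: restrict the calling thread to exactly one CPU.
 * Returns 0 on success and -1 on failure (errno is set by the kernel). */
int pin_current_thread(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE); /* precondition on the CPU index */
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = calling thread */
}
```

To put two data-sharing threads on cores that share a cache, each thread would call `pin_current_thread` with a suitable CPU number. Which CPU numbers actually share a cache depends on the machine, so that mapping has to be looked up separately.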

The second is the cache blocking technique (data tiling).

Here is Koosal's summary.

Suppose there is a data set larger than the cache, and a loop that uses it. If operation A processes the whole data set once and operation B then goes over it again from the beginning, cache misses keep occurring, because by the time B starts, the beginning of the data set has already been pushed out of the cache. To reduce the misses, load only as much data as fits in the cache, finish both operations A and B on it, then load the next portion and run A and B again. This is one form of the cache blocking technique.

Example without cache blocking

    #define NUM_OP (400) /* Repeat 400 times for better measurement */
    #define TOTAL_SIZE_B (2097152) /* 2 MB data size */

    for (i = 0; i < NUM_OP; i++)
    {
        /* Writer */
        for (number = 0; number < TOTAL_SIZE_B; number++)
        {
            /* Process one byte of data */
        }

        /* Reader */
        for (number = 0; number < TOTAL_SIZE_B; number++)
        {
            /* Read and send out one byte of data */
        } /* End of one round of processing over all data */

    } /* End of test loop */

Example with cache blocking

    #define NUM_OP (400) /* Repeat 400 times for better measurement */
    #define TOTAL_SIZE_B (2097152) /* 2 MB data size */
    #define CACHE_BLK_SZ_B (X) /* X = 4, 8, 16, 32, 64 KB */
    #define NUM_BLKS (TOTAL_SIZE_B/CACHE_BLK_SZ_B)

    for (i = 0; i < NUM_OP; i++)
    {
        for (number = 0; number < NUM_BLKS; number++)
        {
            /* Writer */
            for (j = 0; j < CACHE_BLK_SZ_B; j++)
            {
                /* Process one byte of data */
            }

            /* Reader */
            for (j = 0; j < CACHE_BLK_SZ_B; j++)
            {
                /* Read and send one byte of data */
            }
        } /* End of one round of processing over all data */
    } /* End of test loop */
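
The two loop structures can also be written as compilable C. The sketch below is only an illustration: the `writer` and `reader` stages (increment each byte, then sum the bytes) are stand-ins I made up for the article's "process" and "read and send" steps, and the function names are hypothetical. Both versions compute the same result. Only the order in which memory is touched differs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the "Writer" stage: modify every byte in place. */
static void writer(uint8_t *p, size_t n)
{
    for (size_t k = 0; k < n; k++)
        p[k] = (uint8_t)(p[k] + 1);
}

/* Stand-in for the "Reader" stage: consume every byte (here, sum them). */
static uint64_t reader(const uint8_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += p[k];
    return sum;
}

/* Without blocking: the writer sweeps the whole buffer before the reader
 * starts, so a buffer larger than the cache is fetched twice. */
uint64_t process_unblocked(uint8_t *data, size_t total)
{
    writer(data, total);
    return reader(data, total);
}

/* With blocking: writer and reader both finish one block before moving on,
 * so the reader finds the block still in cache if block_size fits there. */
uint64_t process_blocked(uint8_t *data, size_t total, size_t block_size)
{
    uint64_t sum = 0;
    assert(block_size > 0);
    for (size_t start = 0; start < total; start += block_size) {
        size_t len = total - start < block_size ? total - start : block_size;
        writer(data + start, len);
        sum += reader(data + start, len);
    }
    return sum;
}
```

A suitable `block_size` depends on the cache being targeted. The original suggests trying values from 4 KB to 64 KB and measuring.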

The third is the hold approach.

Again, Koosal's summary.

With a shared L2 cache, continually updating shared data in the L2 cache is considered wasteful, so the hold approach updates it only when necessary. One way to implement this is for each thread to keep a private copy of the data that tracks its modifications until the shared copy is updated. A practical application is the OpenMP* reduction clause (http://blog.empas.com/i5on9i/25398368).
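
A minimal sketch of the hold approach in C with OpenMP, written by hand so that the private copy is visible. The function name `sum_hold` is hypothetical, and the code assumes a compiler with OpenMP enabled (for example `-fopenmp` with GCC); without it the pragmas are ignored and the loop simply runs serially. Each thread accumulates into its own `local` variable and touches the shared `total` exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while holding per-thread partial results privately.
 * The shared variable is updated once per thread, not once per element. */
long sum_hold(const int *data, size_t n)
{
    long total = 0; /* shared copy */
    long count = (long)n;
    assert(data != NULL || n == 0);

    #pragma omp parallel
    {
        long local = 0; /* private copy held by each thread */

        #pragma omp for nowait
        for (long i = 0; i < count; i++)
            local += data[i];

        #pragma omp atomic
        total += local; /* single update of the shared copy */
    }
    return total;
}
```

Writing `#pragma omp parallel for reduction(+:total)` over the loop asks OpenMP to generate essentially this pattern for you.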

The fourth is the delayed approach.

Suppose a program has a routine in which Thread 1 uses data that Thread 0 has already used. Thread 0 runs on core 0 and Thread 1 runs on core 1.
Then:
1. Thread 0 fetches the data from memory into its L1 cache and modifies it.
2. When Thread 0 finishes, it signals Thread 1 that the data may now be touched.
3. Thread 1 looks in the L2 cache, but the data has not yet moved from the L1 cache to the L2 cache.
4. Core 1 therefore takes a cache miss, after which the data is evicted to the L2 cache.
5. Thread 1 fetches the data from the L2 cache.

The delayed approach postpones sending the signal until the data has been evicted to the L2 cache. By the time Thread 1 receives the signal, the data is already in the L2 cache, so the access is a cache hit instead of a cache miss.
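
The text does not say how the signal is delayed. One plausible way, assumed here purely for illustration, is to let the producer work on the next block before announcing the previous one, so the earlier block has had a chance to leave the L1 cache. The sketch below uses C11 atomics; `BLOCK_BYTES`, `blocks_ready` and all function names are hypothetical. Whether the block has really been evicted by then depends on cache sizes and the replacement policy, which software cannot guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_BYTES = 4096 }; /* hypothetical block size */

/* Number of leading blocks the consumer may read; written by the producer. */
static atomic_size_t blocks_ready;

/* Producer (Thread 0): write block b, then signal only blocks [0, b).
 * The signal for block b itself is delayed until block b + 1 is written. */
void produce_block_delayed(uint8_t *data, size_t b)
{
    assert(data != NULL);
    for (size_t k = 0; k < BLOCK_BYTES; k++)
        data[b * BLOCK_BYTES + k] = (uint8_t)(b + 1);
    atomic_store_explicit(&blocks_ready, b, memory_order_release);
}

/* Producer: no more work left to hide the delay behind, so signal the rest. */
void publish_all(size_t num_blocks)
{
    atomic_store_explicit(&blocks_ready, num_blocks, memory_order_release);
}

/* Consumer (Thread 1): sum block b if it has been signalled, else return -1. */
long consume_block(const uint8_t *data, size_t b)
{
    long sum = 0;
    if (b >= atomic_load_explicit(&blocks_ready, memory_order_acquire))
        return -1;
    for (size_t k = 0; k < BLOCK_BYTES; k++)
        sum += data[b * BLOCK_BYTES + k];
    return sum;
}
```

In a real program the producer and consumer would run on different cores and the consumer would wait or do other work when `consume_block` returns -1.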

The fifth is false sharing. Please also refer to my earlier post, IPC and False Sharing.

This arises from the cache-coherency protocol. A cache usually fetches data in blocks, called cache lines. If a block happens to contain both data used by thread 0 and data used by thread 1, the L1 caches on both sides hold copies of the same cache line. When one side modifies its data, the entire cache line is treated as modified, and the other side's copy is marked invalid in order to maintain cache coherency. Because of this, data that is not actually shared but sits in the same block keeps causing cache misses. This is false sharing.
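
A common remedy, shown here as a hedged sketch, is to align each thread's data to its own cache line. This remedy is not taken from the text above. The 64-byte line size is an assumption (typical on x86 but not universal), and the struct and function names are invented for the example. It uses C11 `alignas`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64 /* assumed line size; check the target CPU */

/* Two counters packed side by side: very likely in the same cache line,
 * so two threads updating a and b would keep invalidating each other. */
struct packed_counters {
    long a; /* written by thread 0 */
    long b; /* written by thread 1 */
};

/* One counter per cache line: alignment also pads the struct to a full line. */
struct padded_counter {
    alignas(CACHE_LINE_BYTES) long value;
};

struct padded_counters {
    struct padded_counter a; /* written by thread 0 */
    struct padded_counter b; /* written by thread 1 */
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_BYTES,
              "each padded counter should occupy exactly one line");

/* Returns nonzero if both addresses fall in the same assumed cache line. */
int same_cache_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE_BYTES == (uintptr_t)q / CACHE_LINE_BYTES;
}
```

The cost is memory: each padded counter takes a whole line, so this is worth doing only for data that different threads write frequently.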

The original closes with a section on cache-friendly design and performance tuning.

3 Comments

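Steven Kim, May 1, 2013 at 10:37 PM

I reworked some examples I used when lecturing at Intel and posted them; they cover similar material.

http://sunyzero.egloos.com/4227785

When I ran Intel's VTune, though, most cache miss problems turned out to be very easy to find. Give VTune a try. It is truly impressive. ^^

1. smallake (Post author), May 2, 2013 at 6:15 AM
  
  Thank you, that is a really helpful comment. Come by Yeouido sometime when you have time. Perhaps you could teach me how to use VTune.. and does it work on Linux too?
  
  Also, I have been very interested in Linux training lately. Would a program for securities IT be possible? Intermediate to advanced. (^^)
  
  1. Steven Kim, May 2, 2013 at 11:54 AM
    
    Sure, I will come by when I have time. ^^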
