A Nanosecond Trading Framework Implemented in Java

1.
Java dominated the world for quite a while. It now seems to have ceded its place to languages like JavaScript; this feels like the heyday of JS. Still, with Java 8 followed by Java 9, Java keeps evolving, and I expect it to remain a dominant language in enterprise environments. There have been continual attempts to pursue high performance and low latency using Java, a language prized for building enterprise business systems.

Java performance tuning tips or everything you want to know about Java performance in 15 minutes

Among the attempts to build high-performance trading systems in Java, the best-known project is OpenHFT. I introduced it in an earlier post, "What if You Implement HFT in Java?". Its author now mainly focuses on development and writing around low-level Java.

Vanilla #Java Understanding how Core Java really works can help you write simpler, faster applications.

A new project has appeared that aims at trading-system development more directly than OpenHFT: SubmicroTrading. The post Coding for Ultra Low Latency makes the goals of the SubmicroTrading project clear, and it neatly lays out what to think about technically when tackling low latency.

Minimise synchronisation
The synchronized keyword used to be really slow and was avoided, with more complex lock classes used in preference. But with the advent of under-the-cover lock spinning this is no longer the case. That said, even if the lock is uncontended you still have the overhead of a read and write memory barrier. So use synchronized where it's absolutely needed, i.e. where you have real concurrency.
Key here is application design where you want components to be single threaded and achieve throughput via concurrent instances which are independent and require no synchronisation.
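As a minimal sketch of that design principle (the class and method names here are hypothetical, not from SMT): each worker owns its state and runs on its own thread, so throughput comes from independent instances rather than from locks; the only synchronisation is the final join.

```java
// Sketch: throughput via independent single-threaded workers, no shared mutable state.
public class ShardedWorkers {
    static final class Worker implements Runnable {
        private final int[] events;   // this worker's private slice of the event stream
        private long processed;       // plain field: only ever touched by one thread

        Worker(int[] events) { this.events = events; }

        @Override public void run() {
            for (int e : events) processed += e;   // stand-in for real event handling
        }

        long processed() { return processed; }
    }

    // Partition events across workers, run each on its own thread, then combine.
    static long process(int[] a, int[] b) {
        Worker w1 = new Worker(a), w2 = new Worker(b);
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        t1.start(); t2.start();
        try {
            // join() provides the happens-before edge needed to read the results safely
            t1.join(); t2.join();
        } catch (InterruptedException ie) {
            throw new IllegalStateException(ie);
        }
        return w1.processed() + w2.processed();
    }

    public static void main(String[] args) {
        System.out.println(process(new int[]{1, 2, 3}, new int[]{4, 5, 6})); // prints 21
    }
}
```

No synchronized block or volatile field appears in the hot path; correctness comes purely from each worker being single-threaded over its own data.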
Minimise use of volatile variables
Understand how your building blocks work eg AtomicInteger, ConcurrentHashMap.
Only use concurrent techniques for the code that needs to be concurrent.
Minimise use of CAS operations
An efficient atomic operation bypassing the O/S and implemented by a CPU instruction. However, making it atomic and consistent incurs a memory barrier, hitting cache effectiveness. So use it where needed, and not where not!
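A minimal sketch of a CAS retry loop using the JDK's AtomicLong (the counter class itself is a made-up illustration): compareAndSet either succeeds atomically or fails, in which case we re-read and retry, so no lock is ever taken.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: lock-free add via compare-and-swap. On x86 this compiles down to a
// LOCK-prefixed instruction, bypassing O/S locking but still acting as a fence.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Add delta atomically without a lock; the loop retries only under contention.
    long add(long delta) {
        long prev, next;
        do {
            prev = value.get();
            next = prev + delta;
        } while (!value.compareAndSet(prev, next)); // CAS: publish or re-read and retry
        return next;
    }

    public static void main(String[] args) {
        CasCounter c = new CasCounter();
        c.add(5);
        System.out.println(c.add(7)); // prints 12
    }
}
```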
Avoid copying objects unnecessarily
I see this A LOT, and the overhead can soon mount up.
The same holds true for memcpy'ing buffer to buffer between API layers (especially in socket code).
Avoid statics
Statics can be a pain for unit tests, but the real issue comes from the required concurrency of shared state across instances running in separate threads.
Avoid maps
I have worked on several C++ and Java systems where, instead of a real object model, they used abstract concepts with object values stored in maps. Not only do these systems run slowly, but they lack compile-time safety and are simply a pain. Use maps where they are needed, e.g. a map of books or a map of orders. SMT has a goal of at most one map lookup for each event.
Presize collections
Understand the cost of growing collections: e.g. a HashMap has to create a new array of double the size and then rehash its elements, an expensive operation when the map is growing into the hundreds of thousands. Make the initial size configurable.
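A small illustration of presizing, assuming HashMap's default load factor of 0.75 (the helper name is invented): by requesting expected / 0.75 buckets up front, the map never crosses its resize threshold while being filled, so no rehash occurs.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: presize a HashMap so it never rehashes while growing to its expected size.
public class PresizedMap {
    // HashMap resizes once size exceeds capacity * loadFactor (default 0.75),
    // so ask for enough buckets to stay under that threshold from the start.
    static Map<Long, String> ordersById(int expected) {
        int capacity = (int) (expected / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<Long, String> orders = ordersById(200_000); // expected size made configurable
        for (long id = 0; id < 200_000; id++) orders.put(id, "order-" + id);
        System.out.println(orders.size()); // prints 200000
    }
}
```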
Reuse heuristics
At end of day, write out the size of all collections. Next time the process is bounced, resize to the previously stored max.
Generate other metrics like number of orders created, hit percentage, max tick rate per second … figures that can be used to understand performance and give context to unexpected latency.
Use Object Orientation
Avoiding object orientation due to fear of the cost of vtable lookups seems wrong to me. I can understand it on a micro scale, but on a macro, end-to-end scale, what's the impact? In Java all methods are virtual, but the JIT compiler knows what classes are currently loaded and can not only avoid a vtable lookup but can also inline the code. The benefit of object orientation is huge. Component reuse and extensibility make it easy to extend and create new strategies without swathes of cut-and-paste code.
Use final keyword everywhere
This helps the JIT compiler optimise. If in future a method or class needs extending, you can always remove the final keyword.
Small Methods
Keep methods small and easy to understand. Very big methods will never be compiled; big complex methods may be compiled, but the compiler may end up recompiling and recompiling the method to try and optimise. David Straker wrote "KISS" on the board and I never forgot it! If the code is easy to understand, that's GOOD.
Avoid Auto Boxing
Stick to primitives and use long over Long, thus avoiding any auto-boxing overhead (turn the auto-boxing warning on).
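A quick illustration of the difference (the class and method names are illustrative): both loops compute the same sum, but the boxed version unboxes, adds, and allocates a fresh Long on every iteration once values leave the small autobox cache, while the primitive version allocates nothing.

```java
// Sketch: the same loop with primitive long versus boxed Long.
public class BoxingDemo {
    static long sumPrimitive(int n) {
        long sum = 0;                       // stays unboxed, no allocation
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    static long sumBoxed(int n) {
        Long sum = 0L;                      // each += unboxes, adds, and boxes a new Long
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        // Identical result, very different allocation behaviour.
        System.out.println(sumPrimitive(1_000) == sumBoxed(1_000)); // prints true
    }
}
```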
Avoid Immutables
Immutable objects are fine for long-lived objects, but can cause GC for anything else, e.g. a trading system handling market data would GC every second if each tick created an immutable POJO.
Avoid String
String is immutable and is a big no-no for ultra-low-latency systems. In SMT I have a ZString immutable "string-like" interface, with ViewString and ReusableString as concrete implementations.
Avoid Char
Use byte and byte[] and avoid translation between byte and char on every I/O operation.
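A sketch of working directly on bytes (this parser is a made-up example, not SMT code; it assumes plain ASCII digits with no sign handling or validation): the integer is decoded straight out of the socket buffer, skipping the byte-to-char decode that new String(bytes) would perform.

```java
import java.nio.charset.StandardCharsets;

// Sketch: parse an ASCII integer field directly from a byte[] wire buffer.
public class AsciiInt {
    // Decode digits in buf[off .. off+len) without creating a String or char[].
    static long parseLong(byte[] buf, int off, int len) {
        long v = 0;
        for (int i = off; i < off + len; i++) {
            v = v * 10 + (buf[i] - '0');   // assumes '0'..'9' only, no validation
        }
        return v;
    }

    public static void main(String[] args) {
        byte[] wire = "12345".getBytes(StandardCharsets.US_ASCII); // as read from a socket
        System.out.println(parseLong(wire, 0, wire.length)); // prints 12345
    }
}
```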
Avoid temp objects
Objects take time to construct and initialise. Consider using instance variables for reuse instead (if instance is not used concurrently).
Facilitate object reuse by API
Where possible, pass into a method the object that needs to be populated. This allows invoking code to avoid object creation and reuse instances where appropriate
String str = order.toString();   // the API forces construction of a temporary String

Versus

_str.reset();                    // a reusable "working" instance variable
order.toString( _str );          // buffer passed into the method, so no temp objects required
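The pattern above can be sketched with JDK types: StringBuilder stands in for SMT's ReusableString here, and this Order class is a made-up example, not SMT's. The caller owns the buffer, resets it, and passes it in, so the render path allocates nothing.

```java
// Sketch: pass-in-the-buffer API versus the allocating toString().
public class Order {
    private final long id;
    private final long qty;

    Order(long id, long qty) { this.id = id; this.qty = qty; }

    // Allocating API: forces a temporary String on every call.
    @Override public String toString() {
        return "Order[id=" + id + ",qty=" + qty + "]";
    }

    // Reuse-friendly API: caller supplies the buffer, nothing is allocated here.
    void toString(StringBuilder dst) {
        dst.append("Order[id=").append(id).append(",qty=").append(qty).append(']');
    }

    public static void main(String[] args) {
        Order order = new Order(42, 100);
        StringBuilder buf = new StringBuilder(64); // a reusable "working" instance var
        buf.setLength(0);                          // reset instead of reallocating
        order.toString(buf);
        System.out.println(buf); // prints Order[id=42,qty=100]
    }
}
```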
Don’t make everything reusable
Just where otherwise the objects would cause GC
Object reuse comes with a risk of corruption; a key goal of Java was to avoid those nasty bugs.
Unfortunately, for ultra low latency it's not optional: you have to reuse objects (remember there are places in the Java class library that already use pools and reuse).
Avoid finalize
Objects which hold resources such as files and sockets should all attempt to shutdown cleanly and not rely on finalisers. Add explicit open and close methods and add shutdown handlers to cleanly close if possible.
Avoid threadlocal
Every ThreadLocal call involves a map lookup for the current thread, so only use it where really needed.
24 * 7
Design your systems to run 24 * 7. This was common in the '80s and '90s, less so now in finance.

 

2.
The result of all of the above is a framework combining the features below, now released as open source via SubMicroTrading on GitHub.

SubMicroTrading is a new open source component based trading framework. It has been designed from the ground up with core principles focused on minimal latency and maximum throughput. SubMicroTrading contains various components including:

FIX engine
Order Management System (OMS)
Market data handlers
Book management
Exchange trading adapters
Basic exchange simulator
Highly concurrent exchange agnostic strategy container
SubMicroTrading Open Source Ultra Low Latency Trading Framework

The post above compares single-threaded and multi-threaded designs. To show why a multi-threaded design achieves lower latency than a single-threaded one, it walks through the case below; follow the flow from T1 to T12. The developer calls this "hidden latency". The feature set he has implemented is impressive.

Application design is the single biggest factor in achieving ultra low latency. It cannot be achieved by attempting to tune a poorly designed system; profiling at the nanosecond level just doesn't work. The core principles are to minimise expensive operations such as object creation, memcpy, map lookups, synchronisation, try/catch handlers and nested looping. Throughput during exchange busy periods requires concurrency. Concurrency requires discrete core thread affinity and spin locks, along with thread multiplexing and careful mapping of threads to cores based on the target hardware.

Hidden latency can also be caused by external factors, like slow consumption on the exchange side leaving packets waiting, enqueued in the TCP send buffer. Figure 1 depicts the backlog of UDP messages in a single-threaded system, as may be experienced during spikes where market data is generated faster than the single-threaded process can consume it. The order generated off the back of T1 is in the Exchange Session Writer, about to be encoded. Meanwhile ticks T2 to T12 are awaiting processing in the operating system buffers.

(Figure 1: submicro-figure-2, the single-threaded design with hidden latency)

(Figure 2: submicro-figure-4, the multi-threaded design that reduces the hidden latency)

3.
FIX still gets short shrift in Korea, but not overseas. When you want to trade with a large number of exchanges, applying non-standard conventions is costly, which is why a standard like FIX is so important. Below is a FIX parser released by a UK developer, written in C.

Fast FIX (Financial Information Exchange) protocol parser [FFP]

FullFIX

3 Comments

  1. 정원석

"To show why a multi-threaded design achieves lower latency than a single-threaded one, it walks through the case below; follow the flow from T1 to T12. The developer calls this hidden latency."

The attached figure seems to show that with a single thread the TCP send buffer lags, so the system has to be implemented multi-threaded; the explanation looks slightly off. The point is that UDP market data (market data is generally all UDP) is generated faster than T1 can be processed, so a single-threaded, one-pass processing model is unsuitable. And "hidden latency" here seems to mean the relative latency coming from the performance gap between multi-threaded and single-threaded designs, on top of the latency you can logically add up from the sequence of instructions.

    1. 정원석

Also, in the list early in the post, the advice to avoid threads means minimising synchronisation between threads, not emphasising the extra cost of multi-threading itself. I got here while searching about maps; when you actually use a map during live trading, the overhead turns out to be surprisingly large, up to several milliseconds at times. Those were presumably the moments memory was being allocated. Thanks to that, I'm switching to a full hash. Thank you.

    2. smallake (Post author)

The first figure shows the single-threaded design and its hidden latency, and the second shows how multi-threading (concurrency) reduces that hidden latency. What 정원석 wrote is correct; my wording was just imprecise.

