A few months ago I got pulled into reading about how HFT desks actually work, and the part that stuck with me was not the finance part. It was the engineering constraint. You have a model, usually a polynomial not a neural net, and you have to evaluate it in tens of nanoseconds, on hardware that cannot branch or cache-miss, fed by a radio link that is faster than fiber because of the refractive index of glass.
This is mostly me writing that down so I do not forget it. The shape of it: multivariate polynomials, FPGA fabric, and a microwave dish between Carteret and Aurora. The math is not hard. The part that is hard is believing the latency budget, and understanding what is legal and what is not.
the stakes, in nanoseconds
NASDAQ matches in Carteret, New Jersey. CME matches in Aurora, Illinois. If you are in the building, your order book is a few meters of fiber away. If you are not, you are somewhere across the country, and by the time you see a price, someone inside the building has already seen it and acted on it.
The difference is not milliseconds. It is hundreds of nanoseconds, sometimes tens. For scale: at a 10 Gbps interface one byte takes about 0.8 ns to serialize onto the wire. A cache miss to main memory on a fast CPU is around 100 ns. A branch mispredict costs you 15 to 20 cycles. In a software path you hit all of these. On an FPGA with a fixed pipeline you hit none of them.
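The scale numbers above are simple unit conversions, and it helps me to see them worked. This is a minimal sketch; the 4 GHz CPU clock is my own assumption for illustration, not a figure from any spec.

```python
def serialization_ns(num_bytes: int, link_gbps: float) -> float:
    """Time to clock num_bytes onto a link, in nanoseconds.

    One gigabit per second is one bit per nanosecond, so the
    division below already yields nanoseconds.
    """
    return (num_bytes * 8) / link_gbps


def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a cycle count to nanoseconds at a given clock rate."""
    return cycles / clock_ghz


# One byte on a 10 Gbps interface.
one_byte_ns = serialization_ns(1, 10.0)

# A 20-cycle branch mispredict at an ASSUMED 4 GHz clock.
mispredict_ns = cycles_to_ns(20, 4.0)

print(f"1 byte at 10 Gbps: {one_byte_ns:.1f} ns")
print(f"20-cycle mispredict at 4 GHz: {mispredict_ns:.1f} ns")
```

Under that clock assumption, a single mispredict costs about as much wire time as six bytes of the feed.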
where the polynomials live
People imagine HFT models as neural networks. Some are. But neural nets are awkward in fabric: the multiply-accumulate pattern is fine, the indirection and dynamic shapes fight you. The models that actually win on latency are cheap and closed-form, and a lot of them are polynomials.
A few places this shows up for real:
- Implied volatility surfaces. Given a grid of option prices across strikes K and maturities T, you fit a smooth surface σ(K,T) = Σ cᵢⱼ · fᵢ(K) · gⱼ(T). Gatheral's SVI and its variants are parametric. A polynomial basis (Laguerre or plain monomial) is what you reach for when you want to evaluate it in single-digit nanoseconds. You fit the coefficients offline. Online, as the underlying moves, you re-evaluate.
- Cross-venue no-arbitrage conditions. The boundary where a set of quotes becomes arbitrageable is the zero set of a system of polynomial inequalities in the prices. Offline, you characterize the regions. Online, you evaluate which region you are in.
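- Curve fitting for fixed income. Cubic splines and cubic B-splines are piecewise polynomials. Nelson-Siegel is not strictly a polynomial (its basis functions contain exponentials), but it is the same kind of small closed-form expression. Recomputing a spline discount curve as quotes move is polynomial evaluation.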
- Greeks for risk. Many sensitivities are derived polynomials of the pricing polynomial. If price is a polynomial in S, delta is a polynomial in S for free.
The common shape: a multivariate polynomial p(x₁,…,xₙ) = Σ cᵢ · x₁^a · x₂^b … xₙ^z that you need to evaluate, very fast, with inputs that change every time the book moves.
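To pin down what "a multivariate polynomial" means as data, here is a minimal Python sketch that stores one as a map from exponent tuples to coefficients, evaluates it term by term, and differentiates it. The names `Poly`, `evaluate`, and `partial` and the toy price polynomial are mine, made up for illustration. The evaluation here is the naive one, not the restructured form you would put in hardware.

```python
from typing import Dict, Tuple

# A polynomial is {exponent tuple: coefficient}.
# Example: 2 + 3*x*y^2 is {(0, 0): 2.0, (1, 2): 3.0}.
Poly = Dict[Tuple[int, ...], float]


def evaluate(poly: Poly, point: Tuple[float, ...]) -> float:
    """Naive term-by-term evaluation at the given point."""
    total = 0.0
    for exponents, coeff in poly.items():
        term = coeff
        for x, e in zip(point, exponents):
            term *= x ** e
        total += term
    return total


def partial(poly: Poly, var: int) -> Poly:
    """Partial derivative with respect to variable index `var`.

    The result is again a polynomial, which is the sense in which
    a sensitivity comes "for free" once you have the coefficients.
    """
    out: Poly = {}
    for exponents, coeff in poly.items():
        e = exponents[var]
        if e == 0:
            continue  # term does not depend on this variable
        lowered = exponents[:var] + (e - 1,) + exponents[var + 1:]
        out[lowered] = out.get(lowered, 0.0) + coeff * e
    return out


# HYPOTHETICAL toy "price" in (S, t): 1 + 2*S + 0.5*S^2*t
price: Poly = {(0, 0): 1.0, (1, 0): 2.0, (2, 1): 0.5}
delta = partial(price, 0)  # d(price)/dS = 2 + S*t

print(evaluate(price, (2.0, 3.0)))  # 11.0
print(evaluate(delta, (2.0, 3.0)))  # 8.0
```

The derivative coefficients can be computed offline alongside the price coefficients, so online both are just evaluations.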
solving is not the same as evaluating
A lot of pop-finance writing gets this wrong, so it is worth being precise about. Solving a multivariate polynomial system, finding the common zeros, inverting the map, is expensive. Gröbner bases (Buchberger, F4, F5), resultants, homotopy continuation: beautiful and slow. NP-hard in the worst case and frequently dreadful in practice. None of this is happening in the hot path of a trading system.
What happens in the hot path is evaluation. Given fixed coefficients (computed offline, refreshed on a slower clock), plug in live inputs and get a number out. Evaluation is cheap and parallelizes well. The art is restructuring the polynomial so that evaluation is a fixed, pipelined computation with no data-dependent branching.
That restructuring has a name.
Horner, and why it loves silicon
Horner's method (which is apparently much older than Horner, but his name stuck) rewrites a univariate polynomial
p(x) = a₀ + a₁x + a₂x² + a₃x³ + a₄x⁴
as the nested form
p(x) = a₀ + x·(a₁ + x·(a₂ + x·(a₃ + x·a₄)))
Same value, but a degree-n polynomial is now n multiplies and n adds, with a dependency chain of length n. For multivariates you nest one variable at a time. The dependency chain is the catch. Each stage needs the previous result. But each stage is a single multiply-accumulate, and that is the operation silicon is best at.
Each stage multiplies by x and adds the next coefficient. One MAC per cycle, one register between stages, deterministic latency. A Xilinx UltraScale+ DSP slice does this MAC in one cycle at hundreds of MHz.
On a modern FPGA, say a Xilinx UltraScale+, the DSP48E2 slice is basically a 27x18-bit multiplier feeding an accumulator. It can do one MAC per clock cycle once pipelined. A degree-8 Horner chain is eight DSP slices in a row, eight pipeline registers, and a fixed eight-cycle latency from input to result. No cache. No predictor. No operating system. The latency is a number you can write down and trust.
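Here is how I picture the chain, as a minimal Python sketch. `horner` is the plain software loop. `pipelined_horner` is a toy cycle-by-cycle model I made up to show the timing: one MAC stage per degree, one register after each stage, a new input accepted every cycle. It assumes degree at least 1 and one cycle per stage, and it does not model the internal registers of a real DSP slice.

```python
from typing import List, Optional, Sequence, Tuple


def horner(coeffs: Sequence[float], x: float) -> float:
    """Evaluate a0 + a1*x + ... + an*x^n; coeffs is [a0, a1, ..., an]."""
    acc = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        acc = acc * x + a  # one multiply-accumulate per stage
    return acc


def pipelined_horner(
    coeffs: Sequence[float], inputs: Sequence[float]
) -> List[Tuple[int, float]]:
    """Toy pipeline model. Returns (cycle, result) pairs, cycles from 1."""
    degree = len(coeffs) - 1
    stage_add = list(reversed(coeffs[:-1]))  # a_{n-1}, ..., a_0
    # Each register holds (x, running value) or None if empty.
    regs: List[Optional[Tuple[float, float]]] = [None] * degree
    results: List[Tuple[int, float]] = []
    # Extra empty cycles at the end let the pipeline drain.
    stream: List[Optional[float]] = list(inputs) + [None] * degree
    for cycle, x_in in enumerate(stream, start=1):
        feed = None if x_in is None else (x_in, coeffs[-1])
        # Stage s reads what stage s-1 latched on the previous cycle.
        sources = [feed] + regs[:-1]
        regs = [
            None if src is None else (src[0], src[1] * src[0] + add)
            for src, add in zip(sources, stage_add)
        ]
        if regs[-1] is not None:
            results.append((cycle, regs[-1][1]))
    return results


coeffs = [1.0, 2.0, 3.0]  # 1 + 2x + 3x^2
print(horner(coeffs, 2.0))                          # 17.0
print(pipelined_horner(coeffs, [2.0, 0.0, -1.0]))
# [(2, 17.0), (3, 1.0), (4, 2.0)]
```

The first result appears on cycle 2 for a degree-2 polynomial, and after that one result comes out per cycle. The dependency chain sets the latency, not the throughput.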
Two more things make this fit:
- Fixed-point, not floating. Floating point on an FPGA is possible but burns resources and adds latency for normalization. Fixed-point with enough headroom is deterministic, cheap, and good enough for the precision a vol surface actually needs. I am told this is less true than it used to be, that modern FPGAs have hardened floating-point blocks, but the people I talked to still default to fixed-point for this use case.
- No data-dependent branches. Horner is a straight pipeline. The same cycles run every time. That is what makes the latency predictable, which matters more than making it low. A path that is usually 40 ns but occasionally 400 ns is useless here. A path that is always 48 ns is gold.
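A minimal sketch of what fixed-point Horner looks like, in Python integers. The Q-format with 16 fractional bits is my assumption for illustration, not something taken from a real design. Python integers never overflow, so this does not model the finite operand widths or saturation you would have to budget for in fabric.

```python
from typing import Sequence

FRAC_BITS = 16            # ASSUMED format: 16 fractional bits
SCALE = 1 << FRAC_BITS


def to_fixed(value: float) -> int:
    """Quantize a float to the nearest fixed-point integer."""
    return int(round(value * SCALE))


def to_float(fixed: int) -> float:
    return fixed / SCALE


def fixed_mac(acc: int, x: int, coeff: int) -> int:
    """acc * x + coeff in fixed point.

    The raw product carries 2 * FRAC_BITS fractional bits, so it is
    shifted back down. The shift floors (rounds toward minus infinity).
    """
    return ((acc * x) >> FRAC_BITS) + coeff


def horner_fixed(coeffs_fixed: Sequence[int], x_fixed: int) -> int:
    """Same straight-line loop as float Horner, no branches on data."""
    acc = coeffs_fixed[-1]
    for c in reversed(coeffs_fixed[:-1]):
        acc = fixed_mac(acc, x_fixed, c)
    return acc


coeffs = [to_fixed(c) for c in (1.0, 2.0, 3.0)]  # 1 + 2x + 3x^2
print(to_float(horner_fixed(coeffs, to_fixed(0.5))))  # 2.75
```

Inputs that are exactly representable come out exact. Anything else picks up quantization error on the order of the last fractional bit per stage, which is where "enough headroom" has to be decided.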
the pipeline in the fabric
Stitched together, the whole hot path lives on one chip:
- A kernel-bypass NIC (or the FPGA itself acting as the NIC) pulls raw ITCH bytes off the wire in hardware.
- A protocol decoder in fabric parses the multicast feed and emits order-book deltas.
- The order book is an FPGA block, usually a BRAM-backed sorted array of price levels.
- The book's top-of-book and a few depth levels feed the model: the Horner chain in the DSP slices, evaluating your polynomial in a handful of cycles.
- The model output crosses a threshold (also just a comparator) and emits an OUCH order, which goes straight back onto the wire.
At no point does a CPU see any of this. At no point does a cache miss happen. The entire loop from "the book moved" to "I have a new order on the wire" can be a few tens of nanoseconds. I have not built one of these. I have read the specs and talked to people who have, and the numbers I am quoting are second-hand. Take them as approximate.
the physical edge
Light in fiber travels at roughly 0.67c. The refractive index of silica is about 1.5, and v = c/n. A radio wave through air travels at 0.97c or better. So for the ~1,100 km between Carteret and Aurora, a fiber path takes roughly 5.5 ms one-way, and a microwave path along the same route, if you can build it tower by tower with line of sight, takes roughly 3.8 ms.
That 1.7 ms is not a margin. It is larger than the entire in-fabric pipeline by about four orders of magnitude. The reason HFT desks lease mountaintops and build dedicated millimeter-wave links across the Midwest is that no amount of silicon cleverness recovers a 1.7 ms physics deficit. Spread Networks spent roughly $300M burying a straighter fiber route through Pennsylvania in 2010 and still lost to the microwave people.
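The fiber and microwave figures above are just distance divided by speed. A minimal sketch, using the same round numbers as the text: 1,100 km, 0.67c for fiber, 0.97c for microwave. It assumes both paths have the same length, which real routes do not.

```python
C_KM_PER_MS = 299.792458  # speed of light in vacuum, km per millisecond


def one_way_ms(distance_km: float, fraction_of_c: float) -> float:
    """One-way propagation delay in milliseconds."""
    return distance_km / (fraction_of_c * C_KM_PER_MS)


DISTANCE_KM = 1100.0  # approximate Carteret to Aurora, as in the text

fiber_ms = one_way_ms(DISTANCE_KM, 0.67)
microwave_ms = one_way_ms(DISTANCE_KM, 0.97)
edge_ms = fiber_ms - microwave_ms

print(f"fiber:     {fiber_ms:.1f} ms")      # 5.5 ms
print(f"microwave: {microwave_ms:.1f} ms")  # 3.8 ms
print(f"edge:      {edge_ms:.1f} ms")       # 1.7 ms
```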
The FPGA is the second derivative. The dish is the first. Together they form a moat that is hard to copy: you cannot buy a faster law of physics, and you cannot get a county zoning board to approve your tower any faster.
the part nobody likes to say out loud
Two terms that get used interchangeably and should not be:
- Latency arbitrage is when a price on one venue moves and you act on a slower venue before that venue's quote updates. You are not ahead of anyone's order. You are ahead of a price. It is controversial, it is arguably a negative-sum tax on slower participants, but it is legal.
- Front-running is when you learn of a specific pending order, usually a client's, and trade ahead of it. That is illegal, and it should be. The hardware is identical. The information is different.
The FPGA-and-dish stack I described is the machinery of latency arbitrage. It is not, by itself, front-running. Conflating the two is either sloppy or dishonest. If you want to argue that latency arbitrage should be regulated, argue that, on its own terms, not by relabeling it.
My interest in this is not "how do I build a desk." It is that the design problem is unusually pure: a hard real-time budget, a closed-form mathematical core, and a physical layer you can reason about from first principles. I keep a folder of papers on this and I do not know if I will ever do anything with them. It is just interesting.
where else this shows up
The trading part is not the point. Whenever you have a computation that can be expressed as a fixed arithmetic DAG and must run with a hard, predictable latency, the CPU is the wrong answer and the FPGA is the right one. Radar DSP. Packet inspection. Any control loop where "usually fast" is not fast enough.
The restructuring, every time, is the one Horner teaches: rearrange the math to match the hardware, not the other way around. The polynomial does not change. Your appreciation of how it evaluates does.
references
- Horner, W. G. (1819). "A new method of solving numerical equations of all orders." Wikipedia summary (the original is hard to find online).
- Gatheral, J. and Jacquier, A. (2012). "Arbitrage-free SVI volatility surfaces." arXiv:1204.0646 (the published version is later, this is the preprint people cite).
- Xilinx. "UltraScale Architecture DSP Slice User Guide" (UG579). AMD/Xilinx docs. The DSP48E2 section is what you want.
- NASDAQ. "ITCH Protocol Specification." nasdaqtrader.com. The raw format you parse in fabric.
- Lewis, M. (2014). Flash Boys. The pop-science version of this story. Not technically precise but the Spread Networks narrative is real.
- Various. The microwave network buildout is poorly documented in public. WSJ had a piece in 2015 that is the best lay overview I have found.