Tick Data Quackery

Back in September 2024 the CQF Institute ran the Machine Learning Quant Finance Conference 2024. Edith Mandel’s talk, “Doing More with Tick Data: A Machine Learning Approach to Intraday Signal Development”, was one of the highlights for me.

Within the first minute of her talk, a note in the slides said that about one year of bond futures data runs to roughly five billion (5bn) records, and that Python, along with their chosen analytics libraries, cannot be used.

Python’s adoption over the years is something that has surprised me, but then again, it should not be a surprise at all. Python is easy to learn, there are plenty of developers in the field, and it has a well-supported set of libraries and many experts out there to answer questions. The big downside with Python is that its performance is usually subpar once functions are processing large volumes of data. While some use binary formats to speed up loading, it can still take a long time to process the raw data and gain clarity from it.

While we can throw bigger machines or more memory at this problem, performance at this level is still an issue engineers battle against. I know there are plenty of other languages that can do the job, especially compiled ones such as C++, Java and Go. They help, but skills and adoption levels are different from Python; it is a tradeoff. Edith Mandel’s comments did not surprise me, but they did get me wondering what the alternatives could be. There is one solution that I have been using a lot but do not hear much about in quant circles: DuckDB.

DuckDB is an open source online analytical processing (OLAP) engine: a columnar database that focuses on reads rather than writes and is designed for large data loads. Its SQL dialect closely follows that of Postgres, which makes queries easy to write, and it is extremely fast at querying data. DuckDB also has various import functions, which means that CSV, JSON and other data formats are easy to load in as a table.

Once the data is in a DuckDB table, it is trivial to export it to various data formats, including the compressed, big-data-friendly Parquet format. As you would expect with any form of database, you have all the standard functions to use. It also supports window functions and can easily summarise the mean, quartiles, standard deviation and so on. This means you can perform complex queries on the core data and get a basic summary of the results very quickly.
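As a rough illustration of that load, summarise and export flow, here is a minimal Python sketch using the duckdb package. The column names (ts, symbol, price, qty) and the helper names are my own assumptions for illustration, not something taken from the talk or from any particular dataset.

```python
# Sketch only: load a tick CSV into DuckDB, summarise it, export to Parquet.
# Assumed for illustration: columns ts, symbol, price, qty and all helper names.


def quote(value: str) -> str:
    """Return value as a single-quoted SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


def load_sql(csv_path: str) -> str:
    # read_csv_auto infers the column names and types from the file itself.
    return (
        "CREATE OR REPLACE TABLE ticks AS "
        f"SELECT * FROM read_csv_auto({quote(csv_path)})"
    )


# Per-symbol count, mean, sample standard deviation and quartiles.
SUMMARY_SQL = """
SELECT symbol,
       count(*)                   AS n_ticks,
       avg(price)                 AS mean_price,
       stddev_samp(price)         AS std_price,
       quantile_cont(price, 0.25) AS q1,
       quantile_cont(price, 0.75) AS q3
FROM ticks
GROUP BY symbol
ORDER BY symbol
"""


def export_sql(parquet_path: str) -> str:
    # COPY ... TO writes the whole table out as a Parquet file.
    return f"COPY ticks TO {quote(parquet_path)} (FORMAT PARQUET)"


def summarise(csv_path: str, parquet_path: str, database: str = ":memory:"):
    """Load the CSV, return the summary rows, and write a Parquet copy."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect(database)
    try:
        con.execute(load_sql(csv_path))
        rows = con.execute(SUMMARY_SQL).fetchall()
        con.execute(export_sql(parquet_path))
        return rows
    finally:
        con.close()
```

Calling summarise("ticks.csv", "ticks.parquet") would return one tuple per symbol. The file names here are placeholders.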

While other big data systems exist, such as Impala, Druid and Apache Pinot (another OLAP database), they are often large-scale distributed systems that take skill and time to set up properly. The pleasant thing I’ve found with DuckDB is that I can do most things on a single laptop (a MacBook Air M3 with 8GB of RAM) without any strain on the machine.

Adding your DuckDB queries to an application is very straightforward. There are libraries in several languages, including Python, which means you can shift all the analytical processing to DuckDB and then use your language of choice to render or output the results. Extensions also exist, so Excel support is available directly from DuckDB. There are also cloud offerings.
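To make that split between analysis and rendering concrete, here is a hedged Python sketch: DuckDB builds one-minute bars and a 20-bar moving average with a window function, and Python only formats the rows that come back. It assumes an existing DuckDB database file containing a ticks table with columns ts (a TIMESTAMP), symbol, price and qty. Those names, and the function names, are hypothetical.

```python
# Sketch only: DuckDB does the aggregation, Python just renders the result.
# Assumed for illustration: a `ticks` table with ts (TIMESTAMP), symbol,
# price and qty columns, plus every function name below.
from datetime import datetime

BARS_SQL = """
WITH bars AS (
    SELECT symbol,
           time_bucket(INTERVAL '1 minute', ts) AS bar_start,
           arg_max(price, ts)                   AS close_px,  -- price at the latest ts
           sum(qty)                             AS volume
    FROM ticks
    GROUP BY symbol, bar_start
)
SELECT symbol, bar_start, close_px, volume,
       avg(close_px) OVER (
           PARTITION BY symbol ORDER BY bar_start
           ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
       ) AS sma_20
FROM bars
ORDER BY symbol, bar_start
"""


def fetch_bars(database_path: str) -> list:
    """Run the bar query against an existing DuckDB file, opened read-only."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect(database_path, read_only=True)
    try:
        return con.execute(BARS_SQL).fetchall()
    finally:
        con.close()


def format_bar(row: tuple) -> str:
    """Render one result row as a single line of text."""
    symbol, bar_start, close_px, volume, sma_20 = row
    return (
        f"{symbol} {bar_start:%Y-%m-%d %H:%M} "
        f"close={close_px:.2f} sma20={sma_20:.2f} vol={volume}"
    )
```

A caller would loop over fetch_bars("ticks.duckdb") and print format_bar(row) for each row. The database file name is a placeholder, and the same rows could just as easily feed a chart or a DataFrame.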

If you want to try DuckDB out for yourself but don’t want to use sensitive data, I have created some Go code to generate five billion rows of tick data for you (please make sure you have plenty of disk space available).
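For a much smaller experiment, the idea can be sketched in a few lines of standard-library Python. This is not the Go code from the repository, just a simple random-walk stand-in. The symbol, starting price, tick size and column names are all assumptions made for illustration.

```python
# Sketch only: a small random-walk tick generator using the standard library.
# This is NOT the Go generator from the repository. The symbol, start price,
# tick size and column names are assumptions made for illustration.
import csv
import random
from datetime import datetime, timedelta


def generate_ticks(n, symbol="ZN", start=datetime(2024, 1, 2, 9, 0),
                   start_price=110.0, tick_size=1 / 64, seed=42):
    """Yield n rows of (ts, symbol, price, qty) as a seeded random walk."""
    rng = random.Random(seed)  # fixed seed, so repeated runs are identical
    ts, price = start, start_price
    for _ in range(n):
        # Advance time by 1 to 500 ms and move the price by -1, 0 or +1 tick.
        ts += timedelta(milliseconds=rng.randint(1, 500))
        price += rng.choice((-1, 0, 1)) * tick_size
        yield (ts.isoformat(timespec="milliseconds"), symbol,
               round(price, 6), rng.randint(1, 50))


def write_ticks(path, n, **kwargs):
    """Write n generated ticks to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "symbol", "price", "qty"])
        writer.writerows(generate_ticks(n, **kwargs))
    return n
```

Something like write_ticks("ticks.csv", 1_000_000) gives a file that is comfortable to load on a laptop. A pure-Python loop will be slow well before it reaches billions of rows, so a compiled generator is the better fit at that scale.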

There is never one solution that fits all requirements. What is arriving, though, is a set of more refined tools for analysing vast data collections in the most time-efficient manner, speeding up our delivery and improving the potential for better outcomes.

Note: I have no direct connections, shares, involvement, or commercial agreements with DuckDB or its cloud offering.

References

DuckDB: https://duckdb.org/

Synthetic Data GitHub repo: https://github.com/jasebell/wilmottduckdb

Edith Mandel: “Doing More with Tick Data: A Machine Learning Approach to Intraday Signal Development” – https://www.cqfinstitute.org/content/doing-more-tick-data-machine-learning-approach-intraday-signal-development

Ducks photo by Andrew Wulf on Unsplash