# Fast Threshold Clustering Algorithm (FTCA)

Often it can be surprisingly difficult to improve upon a simple and time-tested recipe. During the summer of 2011, I worked with Corey Rittenhouse to develop algorithms for grouping asset classes. At the time, I had no familiarity with the "clustering" algorithms that are often used in data-mining research. The first algorithm we created came from a desire to simplify the complex problem of grouping assets into very few steps, and to keep it computationally simple. As it turns out, ignorance was bliss. The **Fast Threshold Clustering Algorithm (FTCA)** has several desirable properties that traditional clustering algorithms lack: 1) it produces fairly stable clusters, 2) it is fast and deterministic, and 3) it is easy to understand.

When Michael Kapler and I conducted cluster research for our Cluster Risk Parity portfolio allocation approach using modern clustering methods, one of the biggest issues we both saw was that the resulting clusters changed too frequently, creating excessive turnover. Furthermore, highly correlated datasets such as the Dow 30 produced more clusters than logic would dictate. This results from the fact that most clustering algorithms function like an optimization routine that seeks to maximize inter-cluster dissimilarity and intra-cluster similarity. As a consequence, clusters can change because of very small shifts in the correlation matrix that are more likely to be a function of noise. Threshold clustering, by comparison, uses a logical correlation threshold to proxy "similar" versus "dissimilar." In FTCA, I initially used a correlation threshold of .5 (approximately the level of statistical significance) to separate similar from dissimilar assets. FTCA works similarly to the Minimum Correlation Algorithm in that it uses the average correlation of each asset to all other assets as a means of determining how closely or distantly related an asset is to the chosen universe of assets.
A graphic of how the FTCA creates clusters is presented below:

The pseudocode for FTCA is presented below:

```
While there are assets that have not been assigned to a cluster
    If only one asset remains
        Add a new cluster
        Its only member is the remaining asset
    Else
        Find the asset with the Highest average Correlation (HC)
            to all assets not yet assigned to a cluster
        Find the asset with the Lowest average Correlation (LC)
            to all assets not yet assigned to a cluster
        If the correlation between HC and LC > Threshold
            Add a new cluster made of HC and LC
            Add to that cluster every unassigned asset whose average
                correlation to HC and LC > Threshold
        Else
            Add a cluster made of HC
            Add to that cluster every unassigned asset whose
                correlation to HC > Threshold
            Add a cluster made of LC
            Add to that cluster every unassigned asset whose
                correlation to LC > Threshold
        End if
    End if
End While
```
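The pseudocode above translates almost line for line into code. The sketch below is my own minimal Python/NumPy rendering, not the author's original implementation; the function name `ftca` and the handling of the degenerate case where HC and LC coincide are assumptions on my part.

```python
import numpy as np

def ftca(corr, threshold=0.5):
    """Cluster assets from an n x n correlation matrix using FTCA.

    Returns a list of clusters, each a list of asset indices.
    """
    corr = np.asarray(corr)
    remaining = list(range(corr.shape[0]))
    clusters = []
    while remaining:
        if len(remaining) == 1:
            # only one asset left: it forms its own cluster
            clusters.append([remaining[0]])
            break
        # average correlation of each remaining asset to the other remaining assets
        sub = corr[np.ix_(remaining, remaining)]
        avg = (sub.sum(axis=1) - 1.0) / (len(remaining) - 1)
        hc = remaining[int(np.argmax(avg))]  # highest average correlation
        lc = remaining[int(np.argmin(avg))]  # lowest average correlation
        if hc == lc or corr[hc, lc] > threshold:
            # HC and LC are similar: they seed one cluster, joined by any asset
            # whose average correlation to HC and LC exceeds the threshold
            cluster = [hc] if hc == lc else [hc, lc]
            cluster += [a for a in remaining if a not in cluster
                        and (corr[a, hc] + corr[a, lc]) / 2 > threshold]
            clusters.append(cluster)
            remaining = [a for a in remaining if a not in cluster]
        else:
            # HC and LC are dissimilar: each seeds its own cluster
            hc_cluster = [hc] + [a for a in remaining
                                 if a not in (hc, lc) and corr[a, hc] > threshold]
            clusters.append(hc_cluster)
            rest = [a for a in remaining if a not in hc_cluster and a != lc]
            lc_cluster = [lc] + [a for a in rest if corr[a, lc] > threshold]
            clusters.append(lc_cluster)
            remaining = [a for a in rest if a not in lc_cluster]
    return clusters
```

On a toy matrix with two correlated blocks, the routine recovers the two blocks as separate clusters in a single pass, which is what makes it both fast and deterministic.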

It is interesting to look at the historical clusters produced by FTCA with a threshold of .5 on previously used datasets such as the 8 major liquid ETFs (GLD, TLT, SPY, IWM, QQQ, EFA, EEM, IYR) and the 9 sector SPDRs (S&P 500 sectors: XLP, XLV, XLU, XLK, XLI, XLE, XLB, XLY, XLF). The database was updated through May 2013 and shows the last 12 months of clusters generated using a 252-day lookback for correlation with monthly rebalancing. First, the 8 major liquid ETFs:

Notice that the clusters are very logical and do not change at all within the 12-month period. The clusters essentially represent Gold, Bonds, and Equities (note that a 60-month period shows very little change as well). Now let's take a look at the clusters generated on the famously noisy S&P 500 sectors:

Again, the last 12 months of clustering show very little change in cluster membership. Most of the time, the sectors are intuitively grouped into one cluster, while occasionally the utilities sector shows a lack of correlation to the rest of the group. The choice of threshold changes the number and stability of the clusters, with higher thresholds producing more clusters and greater changes in membership than lower thresholds. As much as I have learned about very sophisticated clustering methods in the last year, I am often drawn back to the simplicity and practicality of FTCA. From a portfolio management standpoint, it makes using clusters far more practical as well, whether for tactical asset allocation or for implementing cluster risk parity.

Hi David,

Thanks again for sharing! This algo seems to produce awesome results compared to "standard" ones. Great contribution!

Best,

Pierre

Hi Pierre,

I was wondering whether you can share the code behind your shiny app at http://glimmer.rstudio.com/heuristica/clusters/

I’d like to test it on different sets of Securities/asset classes.

Best,

Paolo

hi Pierre, good to hear. happy to share. i assume you tried it with ERC or mincorr etc.

best

david

Hi David,

I tried it with a large sample of products, a threshold of 0.9 (not very aggressive), risk parity between clusters and equal weight inside them.
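Pierre's setup (risk parity between clusters, equal weight inside them) can be sketched as below. This is my own illustrative rendering, not Pierre's code: the function name is hypothetical, and proxying each cluster's risk by the average volatility of its members is a simplification of full cluster risk parity.

```python
import numpy as np

def cluster_rp_weights(vols, clusters):
    """Inverse-volatility risk parity across clusters, equal weight within.

    vols: per-asset volatilities; clusters: lists of asset indices.
    Returns per-asset portfolio weights summing to 1.
    """
    vols = np.asarray(vols, dtype=float)
    # proxy each cluster's risk by the average volatility of its members
    cvol = np.array([vols[c].mean() for c in clusters])
    inv = 1.0 / cvol
    cw = inv / inv.sum()  # risk-parity budget between clusters
    weights = np.zeros(len(vols))
    for w, c in zip(cw, clusters):
        weights[c] = w / len(c)  # equal weight inside the cluster
    return weights
```

With two clusters of differing volatility, the lower-volatility cluster receives the larger budget, which is then split evenly among its members.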

Aside from reduced turnover, negative outliers seem less frequent and results more linear than with other clustering algos.

Best,

Pierre

Hi David,

Thank you for yet another simple but extremely powerful algorithm. I managed to write it in AmiBroker, my platform of choice, in about 80 lines. Now I have to test how effective it is in 2 basic portfolio management systems.

You mentioned using clusters for RP and TAA.

To test in a RP system, I will probably try assigning weights inversely proportional to the asset and cluster volatility.

However, I am very interested in TAA and clusters, which you mentioned above. Anything about them together would be greatly appreciated.

TIA.

Stefan

hi stefan, thank you. if you are interested in sharing the code as open source for readers I will be happy to post it or link to it. I will probably discuss how to implement cluster risk parity with FTCA and other applications over the next few blogs. In general for TAA, one easy approach is to just treat the cluster as a single asset for use in either relative momentum, absolute momentum, etc.
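The "treat the cluster as a single asset" idea for TAA can be sketched as follows. This is a minimal illustration of my own, assuming a price matrix with one column per asset; the function name and the choice of equal-weighting members to form the cluster's return are assumptions, not part of the post.

```python
import numpy as np

def cluster_momentum(prices, clusters, lookback=126):
    """Score each cluster by the trailing return of its equal-weighted members.

    prices: 2-D array, rows = days, columns = assets.
    clusters: lists of column indices. Returns one momentum score per cluster.
    """
    prices = np.asarray(prices, dtype=float)
    # trailing return of each asset over the lookback window
    asset_mom = prices[-1] / prices[-lookback - 1] - 1.0
    # a cluster's momentum is the average of its members' trailing returns
    return [float(asset_mom[c].mean()) for c in clusters]
```

A TAA rule could then rank clusters by these scores and allocate to the top-ranked clusters, or drop any cluster whose score is negative (absolute momentum).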

best

david

David, thank you for the reply. I will try the suggested approach for TAA. Using each cluster as an “asset class” for momentum calculations is a great idea, and is indeed what I was looking for, to leverage the resulting clusters. The next step is to find the optimal way of including the cluster momentum as a factor in the final weights calculation.

About the clusters code: I would say that the resulting code is quite tight with AB, but not trivial, as it has to generate different clusters sets at each bar included in the backtest. I appreciate the offer to post it and I’ll consider that.

Hi David, I'm new to programming (just a few years of studying). I live in Russia, so please excuse my English. I implemented your FTCA algorithm as a simple .dll in C# and can share it with anyone who wants it. I have one simple question: why is it called Fast? It doesn't seem like the right name for an algorithm that spends so much time on over 100,000 data points with a simple correlation function. Are there tricks to make this algorithm work on millions of data points? I thought about precomputing a matrix of all correlation values, but 1,000,000^2 floats ~ 100 GB of memory, and adding one element would cost around a million calculations…

artem, sorry for such a late response. this is not a fast algorithm for ultra-large data sets like you are describing, since all matrix values need to be computed in advance. i called it fast because it is deterministic and faster than many existing cluster algorithms (i.e. hierarchical clustering). if you would like to share something, that would be much appreciated and i would be happy to cite you and create a link to any material/blog you may have. please send me an email at dmvaradi@gmail.com

best

david