# Z-Rank Beta: Identifying Outliers With A Simple Calculation

Most conventional quantitative methods, from forecasting to optimization, suffer when the data contain large outliers. There are many remedies for this problem, from bootstrap/re-sampling techniques to winsorization. In both cases the solutions are either computationally intensive or somewhat arbitrary in nature. Sampling-intensive procedures are slow and cumbersome, while many winsorization procedures entail cropping the data to compute trimmed means or replacing observations at the extremes with the 95th/5th percentile values. More extensive winsorization techniques are also computationally intensive.

In general, outliers can hurt both the bias (accuracy of actual versus predicted) and the variance (sensitivity to the sample data) of the predictor. The bias/variance tradeoff is analogous to trying to find the best model without over-fitting the sample data. Outliers change both the nature of the model selected and the responsiveness of the variance of the model to a more normal sample. As a consequence, outliers need to be dealt with to avoid mis-specification. Once the outliers have been identified and their influence reduced, it is possible to construct a better model of the data. In finance, volatility, correlations, covariances and other measures are already being applied to noisy time-series data, so it is even more important to address these issues.

There is a simple calculation I would like to call the “Z-Rank Beta” that looks at the relationship between the normalized data and a cumulative distribution. Essentially this is the slope from regressing the normal probabilities of the input values (each z-score converted to a probability under a normal distribution) on the percentile rankings of those same values. The difference between the two measures is that the z-score is unbounded and need not be symmetric in a given sample, while the percentile ranking is always bounded and symmetric. Thus, to the extent that the z-scores differ substantially from the percentile rankings, the beta will be considerably lower than 1, while a close match will produce a beta near 1. The best way to test the statistic is to run a regression to derive the beta and use the p-value to identify whether the deviation is significant. To compute the Z-Rank Beta, the calculation would be:

1) Use an array of at least 20 values

2) Find the percentile ranking of each value in the array

3) Compute the z-score of each value in the array: (x-average(x1,x2…))/stdeva(x1,x2…)

4) Convert the z-scores to probabilities. In Excel: NORMSDIST(z), applied to each z-score in turn

5) Compute the Z-Rank Beta as:

covariance(percentile rankings,probabilities)/variance(percentile rankings)
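
The five steps above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a reference implementation: the post does not say which percentile-rank convention to use, so the sketch assumes (average rank - 0.5) / n, and the function names `percentile_ranks` and `z_rank_beta` are made up for this example. The z-score uses the sample standard deviation.

```python
from statistics import NormalDist, mean, stdev


def percentile_ranks(values):
    """Percentile rank of each value, in (0, 1).

    Assumed convention (not specified in the post): (average rank - 0.5) / n,
    where tied values share the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to cover the whole group of values tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = (avg_rank - 0.5) / n
        i = j + 1
    return ranks


def z_rank_beta(values):
    """Slope of normal probabilities on percentile ranks (steps 1 to 5)."""
    if len(values) < 20:
        raise ValueError("use an array of at least 20 values")  # step 1
    ranks = percentile_ranks(values)                             # step 2
    mu, sd = mean(values), stdev(values)                         # sample std dev
    std_normal = NormalDist()
    probs = [std_normal.cdf((x - mu) / sd) for x in values]      # steps 3 and 4
    mean_rank, mean_prob = mean(ranks), mean(probs)
    cov = sum((r - mean_rank) * (p - mean_prob) for r, p in zip(ranks, probs))
    var = sum((r - mean_rank) ** 2 for r in ranks)
    return cov / var                                             # step 5
```

Calling `z_rank_beta(returns)` on a list of at least 20 numbers gives the beta. Because covariance and variance are divided by the same count, the ratio does not depend on whether the population or sample versions are used.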

As a general guideline, beta values less than .95 indicate the possible presence of one or more outliers, and values below .9 show a definite mismatch. Running a regression is the better approach because it makes it possible to test the significance of the beta and to identify the specific observations with the largest residuals.
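
To illustrate the residual idea, here is a minimal sketch that fits the same line by ordinary least squares and reports which observation sits furthest from it. The function names are hypothetical, the percentile rank is assumed to be (position - 0.5) / n with no tie handling, and the significance test is left out because a p-value needs a t-distribution that the Python standard library does not provide.

```python
from statistics import NormalDist, mean, stdev


def rank_probability_regression(values):
    """OLS of normal probabilities on percentile ranks.

    Returns (slope, intercept, residuals). The slope is the Z-Rank Beta.
    Assumes distinct values: ties are ranked in input order, not averaged.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for position, idx in enumerate(order):
        ranks[idx] = (position + 0.5) / n  # assumed percentile-rank convention
    mu, sd = mean(values), stdev(values)
    std_normal = NormalDist()
    probs = [std_normal.cdf((x - mu) / sd) for x in values]
    mean_rank, mean_prob = mean(ranks), mean(probs)
    slope = (
        sum((r - mean_rank) * (p - mean_prob) for r, p in zip(ranks, probs))
        / sum((r - mean_rank) ** 2 for r in ranks)
    )
    intercept = mean_prob - slope * mean_rank
    residuals = [p - (intercept + slope * r) for r, p in zip(ranks, probs)]
    return slope, intercept, residuals


def largest_residual_index(values):
    """Index of the observation with the largest absolute residual."""
    _, _, residuals = rank_probability_regression(values)
    return max(range(len(values)), key=lambda i: abs(residuals[i]))
```

The observation returned by `largest_residual_index` is only a candidate outlier. Whether it matters still depends on the beta thresholds above or on a proper significance test.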

any progress on the Minimum Correlation Portfolio whitepaper?

hi gerd, that paper has been mostly compiled. it will be released at some point in the spring of this year. we are just in the process of reconciling results and data.

best

david

David,

It’s great to see you posting on the blog again! I’ll most certainly run some tests on what you’ve outlined, as you’ve piqued my interest with this one. Your post brings up a ‘somewhat’ related question that I’ve recently been facing. I have a friend, a prop trader, who takes seemingly ANY data/indicator and measures its z-score, then uses the Cornish-Fisher expansion to adjust the z-score for its skew and kurtosis. Have you yourself ever tested the aforementioned methodology? And if so, are you willing to share your results/thoughts? From my own testing (albeit very limited and rudimentary; CFA exam prep is occupying the majority of my spare time) I do not see any significant advantage to the method. I can provide an example for you in Excel if needed. Would love to hear what you have to say. Thanks again for sharing your thoughts. Hope all is going well with you.

hi chris, thank you…i will try my best to maintain a more regular pace. i have heard of the cornish-fisher expansion and it is often used in a modified var or sharpe framework. i haven’t tested it yet but am familiar with the math of the calculation. to date i have found little use for skew and kurtosis in testing, and more empirical and less deterministic measures such as omega and actual var have proven to be more useful. this reflects the non-normality of financial data in my opinion. nonetheless i should do some tests and perhaps will present on the blog using the cornish-fisher expansion in the future.

best

david

Great to have you back David! Could you post a spreadsheet example?

hi john, thank you and hope you are well also. i will try to put up a spreadsheet this week. good suggestion.

best

david