Index estimation


Real-data example with R code

We will step through a worked example of index estimation using data from the Land Registry and code in a GitHub repository. The example is ‘minimal’ in the sense that it is stripped to the raw essentials, but it shows some key features of the problem at hand. Running the code will take a minute or two.

Before starting you will need to complete the following - a machine with around 8GB of RAM is sufficient:

Install the (minimal) package with the functions we’ll use, which lives here if you want to look at it first: . It has a handful of package dependencies - these will be duly installed for you from CRAN if you respond affirmatively.


Next, the estimation. Ordinary Least Squares regression gives a noisy estimate. Conditioning the regression with a ‘smoothness prior’ is a type of ‘shrinkage’ that reduces volatility where data is sparse. It also reduces the in-sample fit, \(R^2_{in.sample}\). However, our objective is to maximise the out-of-sample fit, \(R^2_{out.sample}\). This is a classic ‘bias/variance tradeoff’ problem, and the canonical solution is k-fold cross-validation, whose application we illustrate here.
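To make the idea concrete, here is a minimal base-R sketch of shrinkage tuned by k-fold cross-validation. It is not the package's code: the data are simulated repeat-sale pairs, the ‘smoothness prior’ is assumed to be a second-difference penalty (the form of the prior is not stated above), and the names fit_shrunk, lambdas and cv are hypothetical.

```r
set.seed(1)

# Simulated repeat-sales data (illustrative only, not Land Registry records)
n_per  <- 40                                      # number of periods in the index
n_pair <- 300                                     # number of repeat-sale pairs
x_true <- 0.01 + 0.02 * sin(seq_len(n_per) / 6)   # smooth 'true' log return per period

buy  <- sample(0:(n_per - 1), n_pair, replace = TRUE)
sell <- pmin(n_per, buy + sample(1:12, n_pair, replace = TRUE))

# Design matrix: row i has 1s for the periods held between purchase and sale
X <- matrix(0, n_pair, n_per)
for (i in seq_len(n_pair)) X[i, (buy[i] + 1):sell[i]] <- 1
y <- drop(X %*% x_true) + rnorm(n_pair, sd = 0.1) # noisy log price change per pair

# Assumed smoothness prior: penalise second differences of the period returns
D <- diff(diag(n_per), differences = 2)

# Penalised least squares; a small lambda is close to OLS
fit_shrunk <- function(X, y, lambda, D) {
  solve(crossprod(X) + lambda * crossprod(D), crossprod(X, y))
}

# k-fold cross-validation over a grid of shrinkage levels
k       <- 5
fold    <- sample(rep(seq_len(k), length.out = n_pair))
lambdas <- 10^seq(-2, 4, by = 1)

cv <- t(sapply(lambdas, function(lam) {
  sse_in <- 0
  sse_out <- 0
  for (j in seq_len(k)) {
    tr <- fold != j                               # training rows for this fold
    b  <- fit_shrunk(X[tr, , drop = FALSE], y[tr], lam, D)
    sse_in  <- sse_in  + sum((y[tr]  - X[tr, , drop = FALSE]  %*% b)^2)
    sse_out <- sse_out + sum((y[!tr] - X[!tr, , drop = FALSE] %*% b)^2)
  }
  # each row is used for training k - 1 times, so rescale the in-sample total
  c(lambda = lam, sse_in = sse_in / (k - 1), sse_out = sse_out)
}))

best <- cv[which.min(cv[, "sse_out"]), "lambda"]  # shrinkage level with lowest held-out error
print(cv)
```

In-sample error can only rise as lambda grows, whereas held-out error need not, which is why the shrinkage level is chosen on sse_out rather than sse_in.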

Run the code as follows. The postcode area in this example is AL (St Albans), but you can run it on any area, or indeed on an arbitrary grouping of postcodes, using a suitable regular expression. A quick comment on style: the code may look odd (or even horrid) because we seem to be breaking the ‘functional’ form of R, not assigning return values to objects but calling functions for their side-effects. In the case of R data.table this is a good way to work with larger tables (there’s a thread on Stack Overflow from the author) due to the pass-by-reference design of data.table. Each function ends by assigning a single data.table in the global environment - to see the name, read the code.

f1(pcx = 'AL', fnam = "path-to-file/pp-complete.csv") # read records for one postcode area to data.table
f2() # regular postcodes
f3() # clean up and generate identifier
f4() # repeat sales
f5() # date series
f6() # shrinkage prior
f7() # estimate, simple cross-validation
f9() # graphic
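The side-effect style can be seen in miniature below. This is a sketch only, assuming the data.table package is installed: g1, g2 and g1d are hypothetical names invented for illustration and are not functions or objects from the package, and the toy table stands in for the Land Registry file.

```r
library(data.table)

# Hypothetical stand-in for a step like f1(): keep rows whose postcode matches
# a regular expression and store the result as a side-effect
g1 <- function(dt, pcx = "^AL[0-9]") {
  out <- dt[grepl(pcx, postcode)]
  assign("g1d", out, envir = globalenv())  # result lands in the global environment
  invisible(NULL)                          # nothing useful is returned
}

# Hypothetical follow-on step: add a column by reference, without copying the table
g2 <- function() {
  g1d[, area := sub("^([A-Z]+).*", "\\1", postcode)]
  invisible(NULL)
}

toy <- data.table(
  postcode = c("AL1 1AA", "AL10 9ZZ", "SW1A 1AA", "ALX"),
  price    = c(300000, 450000, 900000, 1)
)

g1(toy)     # creates g1d in the global environment
g2()        # modifies g1d in place
print(g1d)
```

Because := updates the table in place, no large intermediate copies are made between steps, which is the motivation given above for this style.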

The graphic (figure 1) shows three solutions, with a slider selecting the level of shrinkage. It is intuitively clear that the low-shrinkage version is over-fitted and contains high-frequency noise, and one suspects that the high-shrinkage version is over-smoothed.

A sum-of-squares metric of goodness of fit (figure 2) confirms that whilst the overfitted ‘low shrinkage’ solution has lower squared error in-sample, it has much higher error out of sample, and the heavily shrunk version performs nearly as well out of sample. An intermediate shrinkage performs better than either extreme, out of sample. This is exactly what cross-validation is designed for. So in crude terms, the wiggles in the difference between the tuned and high-shrinkage solutions contain more signal than noise; the wiggles in the difference between the low-shrinkage and tuned solutions contain more noise than signal.

One benefit of this method is that it generalises to other forms of filtering such as a PCA spatial filter, and its effect is stronger where data is sparse, such as the final dates. It is tempting now to hurtle onwards, but before going any further it makes sense to do seasonal adjustments pre-estimation, and compensate for the heterogeneous size of postcode districts so they can sensibly be treated uniformly. More on this later.

For now, the purpose is served - we have a prototype ‘proof in principle’ of the estimation.

Figure 1: Estimated index for 3 shrinkage levels

Figure 2: Error as a function of shrinkage