Friday, 31 December 2010

OS DEM, why why why ?

With 20 odd years experience in IT, well it wasn’t called IT in the early days, I have plenty of experience in reading data files. One must read from the top to the bottom of the file and generally one processes the first things one reads first and tries to avoid reading and processing any data item more than once. If data is stored in a matrix of rows and columns, the first row read usually becomes row zero and the first column usually column zero.

The OS DEM ASCII data isn’t like that. I am not suggesting it’s wrong or bad, though I was cursing it when I was trying to load it, but the flimsy accompanying document leaves a lot to be desired. That’s quite common though for technical documents, technicians generally make poor communicators, they fail to realise or maybe forget how much prior knowledge they actually have and hence how much the audience might not have. One documents go public the knowledge level of the audience tumbles.

Back to the OS DEM ASCII data; each ASCII file represents a 20x20km square part of the 100x100km grid square denoted by a 2 character code such as SE. Thus there are up to 25 files for each 100km square arranged in five rows of five and they are named to represent their position in the 100km square and hence the data they contain. Using square SE as an example, because I live in that square, file se00.asc is at the bottom left, se20.asc to its right and se02.asc above it. See how the first numerical digit represents the column (eastings) and the second the row (northings). Notice that there are no odd numbers, that’s because the files represent 20 x 20km squares or tiles. The data contained in each one is referenced from the bottom left hand corner of the matrix , the header rows contain the Grid Reference of this point, for example, tile se00 starts at 400000, 400000 and contains data for 20km in the east and north directions sampled every 50 m so we have 401 points in each direction and hence contains data for 400000-420000 in each direction. If you are still thinking at this point you’ll notice that 400000-420000 at 50m intervals will give 401 points and that the adjacent tile to the right/east and above/north starts at 420000 and 420000 respectively giving a one row/column overlap.

Now the data itself seems to be arranged in rows with CR/LF at the end of each row and with 401 data rows, 5 header rows which are not very useful given the meaning file names. Each row contains 401 data values representing the elevation for each point along the x (easterly) axis. So if one printed the document, don’t, it’s huge, one would see the matrix of elevation values for each 50m x 50m square. I wish they’d told me this up front, with such large datasets, manual checking of retrieved values is difficult. So, if you need to calculate the matrix cell for a grid reference make sure you reference it from the bottom left and make sure you placed the first line that you read on the top row and subsequent line beneath it.

The ASCII format is very efficient, only data that cannot be derived is stored, it’s not too difficult in Java to read it into memory in an efficient way either. I read it as a character stream and once I encounter the delimiters, space, CR/LF I convert the number to an int, none of the decimal values of the data are anything other than zero it seems, not in the sample I looked at anyway and for my application accuracy to the nearest metre is good enough. I read 100km squares on demand, that’s 25 files, even if I only need one point. My application will need thousands of points at a time but hopefully all in the same or a few 100km squares. To speed things up in the future I have then serialized my 100km square object into a single file. Subsequent invocations of my application can then load the entire square in about one eighth of the time, less than 500ms on my computer. The data size on disk is slightly larger but with serialized objects we have the opportunity to compress them into a jar file and use the class loader to locate and load them. This adds only a small amount of time to the load of uncompressed file and presents a neat method of transporting and storing static data. To keep things simple, each tile object maintains a map of data, each element in the map represent the data from one of the ASCII files. Accessing it is simply a question of using the Grid Reference to determine first which 100km square then which 20km subdivision and finally which cell within it. In other words I have an object for a 100km square then each of these has a map which has a 2 dimensional array of int to store the elevation data for each cell.