Tuesday, 14 September 2010

Too many elevations?


Managing elevation data sets with hundreds of thousands of records or more is becoming a standard, but not easy, task, even with the most powerful GIS software. The difficulty lies both in interpolating such large data sets and in their irregular spatial distribution, which can produce DEMs with artifacts that become particularly evident when deriving morphological parameters (slope, aspect and so on).

As a practical example, consider a data set of elevations for a sector of Antarctica, derived from ICESat satellite data measured over the period from 2003 to 2008. It consists of nearly 2,800,000 records, shown in the map below. Data density along satellite tracks is high (one measurement every 172 meters), with several different tracks nearly overlapping each other, while the spacing across tracks is much larger: in the top-left of the map, for instance, it ranges between five and ten kilometers. In many subareas of this zone, the local topography presents periodic height variations (not visible in this map) due to the presence of glacial megadune fields. Megadunes have wavelengths of several kilometers, amplitudes of several meters and a lateral continuity of tens of kilometers. Clearly, DEMs derived from these data are strongly affected by local structures along the tracks.


A possible remedy for this situation is the use of spatial filters that reduce the data density and make it as uniform as possible. A straightforward approach is to apply a grid-based filter to the height values. For each grid cell, the filter calculates a single height value from some function of the records falling in the cell. Algorithms generally calculate the average of the records, whether of height or of other analysed properties, and usually place the resulting value at the cell center. Although results calculated in this way can be considered fundamentally correct, we note three negative aspects: the use of the mean smooths the values; erroneous data affect the average value; and positioning the resulting value at the cell center can change the variographic properties of the data set.
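To make the cell assignment step concrete, here is a minimal C++ sketch of how records can be mapped to grid cells; the grid geometry, struct and function names are illustrative and not taken from the actual program.

// Minimal sketch of grid-cell indexing, assuming a regular grid defined by
// its lower-left corner and a square cell size (illustrative names).
#include <cmath>
#include <cstdio>

struct GridGeometry {
    double x_min, y_min;  // lower-left corner of the grid
    double cell_size;     // cell side length (e.g. 1000 m or 250 m)
    int n_cols, n_rows;   // grid dimensions
};

// Map a point to the (column, row) of the cell containing it; returns false
// when the point falls outside the grid extent.
bool cell_of(const GridGeometry& g, double x, double y, int& col, int& row) {
    col = static_cast<int>(std::floor((x - g.x_min) / g.cell_size));
    row = static_cast<int>(std::floor((y - g.y_min) / g.cell_size));
    return col >= 0 && col < g.n_cols && row >= 0 && row < g.n_rows;
}

int main() {
    GridGeometry g = {0.0, 0.0, 1000.0, 100, 100};
    int col, row;
    if (cell_of(g, 12345.6, 7890.1, col, row))
        std::printf("point falls in cell (%d, %d)\n", col, row);
    return 0;
}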

I present a short C++ command-line program (without GUI) that filters the data in a grid, determines the median value of each cell (when there are three or more values) and preserves the original location of the median record. The program (compiled with Code::Blocks on Windows), source code, parameter file and example elevation file are available here. Session parameters are stored in a text file (example: 'param.txt'), which the program loads to read the parameters. The program processes ICESat data, but it can obviously be applied to any elevation data, provided they share the same basic textual input structure, i.e. space-separated record id (integer), x, y and z, arranged in columns (see the example in the file 'elevations.txt').
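As an illustration of the expected input, the following sketch reads space-separated 'id x y z' records from a column-formatted text file such as 'elevations.txt'; the struct and function names are my own, not those of the program.

// Minimal sketch of reading space-separated "id x y z" records.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct ElevationRecord {
    long id;
    double x, y, z;
};

std::vector<ElevationRecord> read_elevations(const std::string& path) {
    std::vector<ElevationRecord> records;
    std::ifstream in(path.c_str());
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        ElevationRecord r;
        if (iss >> r.id >> r.x >> r.y >> r.z)  // skip malformed lines
            records.push_back(r);
    }
    return records;
}

int main() {
    std::vector<ElevationRecord> data = read_elevations("elevations.txt");
    std::cout << data.size() << " records read\n";
    return 0;
}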

Using the median avoids smoothing the elevations and reduces the impact of inaccurate, extreme values. The median record maintains its original position, which obviously falls within the borders of the cell in question.
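The sketch below shows one way to pick a median record while keeping its original position, under the assumption (as in the program) that at least three records fall in the cell; for an even count it returns the lower-middle record, so that a real measured point, with its own x and y, is always kept. Names are illustrative.

// Minimal sketch of selecting the median record of a cell.
#include <algorithm>
#include <cstdio>
#include <vector>

struct ElevationRecord {
    long id;
    double x, y, z;
};

// Order records by elevation only.
bool by_elevation(const ElevationRecord& a, const ElevationRecord& b) {
    return a.z < b.z;
}

// Returns the record whose elevation is the (lower-)median of the cell,
// preserving its original coordinates.
ElevationRecord median_record(std::vector<ElevationRecord> cell_records) {
    std::sort(cell_records.begin(), cell_records.end(), by_elevation);
    // For n = 5 this is index 2 (the exact median); for n = 4 it is index 1.
    return cell_records[(cell_records.size() - 1) / 2];
}

int main() {
    std::vector<ElevationRecord> cell;
    ElevationRecord a = {1, 10.0, 20.0, 105.2}; cell.push_back(a);
    ElevationRecord b = {2, 11.0, 21.0, 103.9}; cell.push_back(b);
    ElevationRecord c = {3, 12.0, 22.0, 104.5}; cell.push_back(c);
    if (cell.size() >= 3) {  // only cells with three or more values are filtered
        ElevationRecord m = median_record(cell);
        std::printf("median: id=%ld z=%.1f at (%.1f, %.1f)\n", m.id, m.z, m.x, m.y);
    }
    return 0;
}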

The algorithm creates two textual outputs. The first is a grid in the ESRI ASCII format that stores the number of records in each grid cell; open source GIS packages such as SAGA or Quantum GIS can read this format. The second is a file in table format that stores the median elevation values together with the original spatial location of each median record.
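For reference, this is roughly what writing the record-count grid in the ESRI ASCII format looks like; the function name, grid geometry and output file name are illustrative, not the program's own.

// Minimal sketch of writing a per-cell count grid as an ESRI ASCII grid.
#include <fstream>
#include <string>
#include <vector>

void write_esri_ascii(const std::string& path,
                      const std::vector<std::vector<int> >& counts,
                      double x_ll, double y_ll, double cell_size) {
    std::ofstream out(path.c_str());
    const int n_rows = static_cast<int>(counts.size());
    const int n_cols = n_rows > 0 ? static_cast<int>(counts[0].size()) : 0;
    out << "ncols " << n_cols << "\n"
        << "nrows " << n_rows << "\n"
        << "xllcorner " << x_ll << "\n"
        << "yllcorner " << y_ll << "\n"
        << "cellsize " << cell_size << "\n"
        << "NODATA_value -9999\n";
    // ESRI ASCII rows are written from the top (northernmost) row downwards.
    for (int row = n_rows - 1; row >= 0; --row) {
        for (int col = 0; col < n_cols; ++col)
            out << counts[row][col] << (col + 1 < n_cols ? ' ' : '\n');
    }
}

int main() {
    // Tiny 2 x 2 example grid of per-cell record counts.
    std::vector<std::vector<int> > counts(2, std::vector<int>(2, 0));
    counts[0][0] = 4; counts[1][1] = 7;
    write_esri_ascii("counts.asc", counts, 0.0, 0.0, 1000.0);
    return 0;
}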

Running this program on the Antarctic data set above with a 1 km cell size reduces the number of records from 2,783,551 to 70,988, a 97.4% reduction, without substantially changing the structure of the available data, as the map below, representing the filtered data, shows.


As another example, with a 250 meter cell size the number of records drops to 444,532 (about 16% of the original amount). The filter also eliminates many spikes in the frequency histogram (below; blue: original data, orange: filtered data), which probably correspond to artifacts related to the track oversampling of local areas, since there is no natural reason for their existence.



1 comment:

markusN said...

Besides data reduction, interpolation might also be of interest. See here for methods in GRASS GIS, which easily handles such amounts of data: Interpolation in GRASS GIS