Data Cleaning by Interpolation for Optical Satellite Imagery Timeseries


A simple algorithm to correct ephemeral spikes in spectral trajectories

By Levi Keay

Context: breakpoint detection in satellite imagery timeseries.

Breakpoint detection in optical satellite imagery timeseries is a promising area of development, especially with the rise of commercial CubeSat imaging platforms such as Planet. In this process, a series of satellite images is acquired over an area of interest, a spectral index is computed, and the area represented by a single pixel is tracked using this index over time. Looking at the resulting timeseries of each pixel yields meaningful spatio-temporal insights. Because satellite imaging missions generate vast amounts of data, fully automated processing methods are needed to take full advantage of these datasets.

There are multiple processing steps needed before timeseries analysis can be performed, such as calibrating and aligning the images before computing their pixel timeseries. Another key step in the processing of optical images is the elimination of clouds from the imagery. This can be done with various methods, but all are imperfect, and some cloudy pixels inevitably remain in the imagery stack. With the goal of fully automating the breakpoint detection pipeline, it is important to have a backup algorithm that deals with persistent clouds in the timeseries.

Clouds and NDVI

The Normalized Difference Vegetation Index (NDVI) is a popular spectral index used as a metric of plant health. It is commonly used in timeseries analysis approaches, and it is well suited to CubeSat data because it has simple band requirements, using only the Red and Near-Infrared (NIR) bands:

\[NDVI = \frac{NIR - Red}{NIR + Red}\]
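As a quick illustration, here is a minimal NDVI calculation in Python; the red and nir arrays are assumed to hold the co-registered reflectance bands of a scene (names are illustrative):

import numpy as np

def ndvi(red, nir):
    """Compute NDVI from red and near-infrared reflectance arrays."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (nir - red) / (nir + red)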

In NDVI, healthy vegetation appears bright and clouds appear dark, so a pixel area which normally images vegetation but is obstructed by cloud on one observation will appear as a downward spike in its NDVI timeseries. This is simulated in the animations below.

Despiking CubeSat timeseries

In the past, timeseries despiking as described has been done by comparing each value to the mean of its neighbours. This works well for datasets with regularly spaced observations, since the mean of the neighbouring points is then equivalent to interpolating between them. However, high-cadence CubeSat data is irregularly spaced, or sparse, and an extra few lines of code are required to calculate the linearly interpolated value at each data point from its neighbours.
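To make the difference concrete, here is a toy comparison on made-up, irregularly spaced observations (all values are illustrative):

import numpy as np

# made-up day-of-year and NDVI values; index 2 is a cloudy dip during green-up
x = np.array([5.0, 10.0, 40.0, 45.0])
y = np.array([0.20, 0.30, 0.05, 0.70])

i = 2
neighbour_mean = (y[i - 1] + y[i + 1]) / 2                # 0.50, ignores spacing
slope = (y[i + 1] - y[i - 1]) / (x[i + 1] - x[i - 1])
neighbour_interp = y[i - 1] + slope * (x[i] - x[i - 1])   # ~0.64, respects spacing

Because day 40 sits much closer to day 45 than to day 10, the interpolated estimate is pulled toward the later observation, which the plain mean ignores.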

Here is an animation of the algorithm in progress, despiking some artificial data, with the large downward spike representing a cloud in the timeseries:

[Animation: despiking a timeseries to its upper envelope, to account for cloudy observations]

The blue line represents the spectral trajectory, and the interpolated values are shown as red dots. An iterative approach is applied, with the largest spike being removed on each pass.

A spike is defined as a point on the timeseries whose value is less than its interpolated value by more than the set threshold.
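Denoting the observed value at time t_i by y_i and its neighbour-interpolated value by ŷ(t_i), this condition can be written as:

\[\hat{y}(t_i) - y_i > \text{threshold}\]

The difference is signed, so only points falling below their interpolation qualify; this is what pulls the timeseries up to its upper envelope. Each pass then follows these steps: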

1) compute the interpolation value from the left and right neighbours for each point
2) compare each point to its interpolation value
3) check if a spike is present:
    3.a) spike is present:
         replace the spike's timeseries value with the interpolated value
         return to step 1 with updated timeseries
    3.b) no spike is present:
         exit loop

See the bottom of this page for a Python implementation.

Setting an appropriate threshold:

It is important to consider the despiking threshold. This parameter sets the tolerance of the algorithm. On each iteration, the points of the timeseries are compared to the interpolated values of their neighbours, and the point with the largest discrepancy between its timeseries value and its neighbour-interpolation is despiked only if that discrepancy exceeds the threshold.

Set the threshold too high, and no spikes will be removed, as they will all fall within the tolerance.

Set the threshold too low, and you will remove or alter valuable information from the timeseries by over-despiking. An example of this is illustrated in the following animation, where the threshold is set to 0.5. Keep in mind that further smoothing and interpolation will be applied to the timeseries, and that this despiking step is only meant to correct cloud content.

[Animation: using a despike threshold that is too low; smaller features in the timeseries are altered, not just cloudy observations]
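In code, the tolerance is the thresh argument of the despike() function given below; the particular values here are illustrative, not recommendations, and ndvi_series and doys are assumed to be one pixel's NDVI values and acquisition days:

# conservative: only large, cloud-like drops are corrected
vals, days = despike(ndvi_series, doys, thresh=0.2)

# aggressive: smaller dips in the trajectory get flattened too (over-despiking)
vals, days = despike(ndvi_series, doys, thresh=0.02)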

Algorithm in Python:

Here is a Python function that implements this method for a single pixel. To apply it to a large area, I stack the imagery into a 3-dimensional array (Lat, Long, Time) and use np.apply_along_axis() along the time axis; a usage sketch follows the function.

import numpy as np

def despike(pixel, doys, thresh=0.05):
    """
    Iteratively despike a timeseries using linear interpolation and taking the upper envelope.

    Parameters
    ----------
    pixel : array
        The timeseries' y-values
    doys : array
        Day of year, i.e. the timeseries' x-values
    thresh : float
        The despiking threshold. Spikes are defined as points on the timeseries
        which fall below their interpolation by more than this value.

    Returns
    -------
    pixel2 : array
        Updated and despiked timeseries y-values
    doys2 : array
        Updated timeseries x-values (nodata and NaN values removed)
    """
    # drop nodata (zero) and NaN/inf observations:
    valid_inds = np.where((pixel != 0) & np.isfinite(pixel))
    doys2 = doys[valid_inds]
    pixel2 = pixel[valid_inds]
    if pixel2.size < 3:
        return pixel2, doys2  # too few points to interpolate between neighbours
    iterations = 0
    max_dif = np.inf
    while max_dif > thresh and iterations < 1000:  # iteration cap as a safety net
        # create left- and right-shifted versions of the arrays:
        y1 = np.roll(pixel2, -1)   # right-neighbour values
        y2 = np.roll(pixel2, 1)    # left-neighbour values
        x1 = np.roll(doys2, -1)    # right-neighbour days
        x2 = np.roll(doys2, 1)     # left-neighbour days
        # guard the edge entries affected by np.roll's wrap-around
        # (interp[0] and interp[-1] are overwritten below anyway):
        x1[0], y1[0], x2[-1], y2[-1] = doys2[0], pixel2[0], doys2[-1], pixel2[-1]

        # calculate the neighbour interpolation for each point:
        interp = ((y2 - y1) / (x2 - x1)) * (doys2 - x1) + y1

        # the first and last points have only one neighbour, so use the mean
        # of their closest 2 points instead:
        interp[0], interp[-1] = np.mean(pixel2[1:3]), np.mean(pixel2[-3:-1])
        diff = interp - pixel2   # positive where a point falls below its interpolation
        max_dif = np.max(diff)
        # replace the largest spike with its interpolation if it exceeds the threshold:
        if max_dif > thresh:
            bigspike = np.argmax(diff)
            pixel2[bigspike] = interp[bigspike]
        iterations += 1
    return pixel2, doys2
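For applying this per-pixel across a whole stack, here is one possible sketch. The despike_fixed_length wrapper is hypothetical (not part of the method above): it assumes doys is strictly increasing and writes each pixel's surviving values back onto the original time grid, so that every output slice has the same length, as np.apply_along_axis requires.

def despike_fixed_length(pixel, doys, thresh=0.05):
    """Hypothetical wrapper returning a full-length series (NaN where dropped)."""
    out = np.full(pixel.shape, np.nan)
    vals, days = despike(pixel, doys, thresh)
    # map the surviving observations back to their original time slots:
    idx = np.searchsorted(doys, days)
    out[idx] = vals
    return out

# stack has shape (Lat, Long, Time) and doys has shape (Time,):
# despiked = np.apply_along_axis(despike_fixed_length, 2, stack, doys)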