
A simple algorithm to correct ephemeral spikes in spectral trajectories
By Levi Keay
Breakpoint detection in optical satellite image timeseries is a promising area of development, especially with the rise of commercial cubesat imaging platforms such as Planet. In this process, a series of satellite images is acquired over an area of interest, a spectral index is computed, and the area represented by a single pixel is tracked using this index over time. By looking at the resulting timeseries of each pixel, meaningful spatial and temporal insights can be gained. Because of the vast amounts of data generated by satellite imaging missions, it is important to develop fully automated processing methods in order to take full advantage of these datasets.
There are multiple processing steps needed before timeseries analysis can be performed, such as calibrating and aligning the images prior to computing the pixel timeseries. Another key step in the processing of optical images is the elimination of clouds from the imagery. This can be done with various methods, but these methods are imperfect, and some cloudy pixels inevitably remain in the image stack. With the goal of fully automating the breakpoint detection pipeline, it is important to have a backup algorithm to deal with cloud that persists in the timeseries.
The Normalized Difference Vegetation Index (NDVI) is a popular spectral index that serves as a metric of plant health. It is commonly used in timeseries analysis approaches, and it is well suited to cubesat data because it has simple band requirements, using only the Red and Near-Infrared (NIR) bands:
\[NDVI = \frac{NIR - Red}{NIR + Red}\]

Healthy vegetation appears bright and clouds appear dark in NDVI, so a pixel area which normally images vegetation but is obstructed by cloud on one observation will appear as a downwards spike in its NDVI timeseries. This is simulated in the animations below.
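A minimal sketch of the calculation with numpy, assuming two reflectance arrays already read from the imagery (the array and function names here are placeholders):

import numpy as np

def compute_ndvi(nir, red):
    # Cast to float so integer reflectance bands don't truncate the ratio
    nir = nir.astype(float)
    red = red.astype(float)
    # Division warnings can occur where NIR + Red == 0 (e.g. nodata pixels)
    return (nir - red) / (nir + red)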
In the past, timeseries despiking as described here has been done by comparing each value to the mean of its neighbours. This works well for datasets with regularly spaced observations, since the mean of the neighbouring points is then equivalent to interpolating between them. However, high-cadence cubesat data is irregularly spaced, or sparse, so a few extra lines of code are required to calculate the linearly interpolated value of each data point from its neighbours.
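To make the distinction concrete, here is a small sketch with made-up numbers, where a suspected cloud observation sits much closer in time to its right neighbour than to its left one:

import numpy as np

# Irregularly spaced observations: a suspected cloud at day 12,
# with neighbours at days 5 and 14 (values are made up)
doys = np.array([5.0, 12.0, 14.0])
vals = np.array([0.40, 0.10, 0.70])

# Simple neighbour mean ignores the spacing:
mean_est = (vals[0] + vals[2]) / 2                        # 0.55

# Linear interpolation weights the nearer neighbour (day 14) more heavily:
slope = (vals[2] - vals[0]) / (doys[2] - doys[0])
interp_est = vals[0] + slope * (doys[1] - doys[0])        # ~0.63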
Here is an animation of the algorithm in progress, despiking some artificial data, with the large downwards spike meant to represent a cloud in the timeseries:
[Animation: Despiking a timeseries to its upper envelope, to account for cloudy observations]
The blue line represents the spectral trajectory, and the interpolated values are shown as red dots. An iterative approach is applied, with the largest spike being removed at each pass:
A spike is defined as a point on the timeseries whose value is less than its interpolated value by more than the set threshold.
1) Compute the interpolated value from the left and right neighbours for each point.
2) Compare each point to its interpolated value.
3) Check whether a spike is present:
   3.a) A spike is present: replace the spike's timeseries value with the interpolated value, then return to step 1 with the updated timeseries.
   3.b) No spike is present: exit the loop.
See the bottom of this page for a Python implementation.
It is important to consider the despiking threshold. This parameter sets the tolerance of the algorithm. On each iteration, the points of the timeseries are compared to the interpolated values of their neighbours, and the point with the largest discrepancy between its timeseries value and its neighbour interpolation is despiked only if that discrepancy exceeds the threshold.
Set the threshold too high, and no spikes will be removed, as they will all fall within the tolerance.
Set the threshold too low, and you will remove or alter valuable information from the timeseries by over-despiking. An example of this is illustrated in the following animation, where the threshold is set to 0.5. Keep in mind that further smoothing and interpolation will be applied to the timeseries later, and that this step is only meant to correct cloud content in the timeseries.
[Animation: Using a despike threshold that is too low. Smaller features in the timeseries are altered, not just cloudy observations.]
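To make this concrete, here is a rough sketch of how the despike() function given at the bottom of the page might be called with two different thresholds on a made-up trajectory containing a single cloud-like dip:

import numpy as np

doys = np.array([121.0, 136.0, 152.0, 160.0, 175.0, 190.0])
ndvi = np.array([0.55, 0.60, 0.12, 0.63, 0.58, 0.61])

# A moderate threshold removes only the large cloud-like spike at day 152:
y_mod, d_mod = despike(ndvi, doys, thresh=0.3)

# A very small threshold also flattens genuine, smaller variations in the series:
y_low, d_low = despike(ndvi, doys, thresh=0.01)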
Here is a Python function that implements this method for a single pixel. To apply it to a large area, I stack the imagery into a 3-dimensional array (lat, lon, time) and use np.apply_along_axis() along the time axis; a rough sketch of that is shown after the function.
import numpy as np

def despike(pixel, doys, thresh=0.05):
    """
    Iteratively despike a timeseries using linear interpolation, taking the upper envelope.

    Parameters
    ----------
    pixel : array
        The timeseries' y-values.
    doys : array
        Day of year, i.e. the timeseries' x-values.
    thresh : float
        The despiking threshold. Spikes are defined as points on the timeseries
        which fall below their neighbour interpolation by more than this value.

    Returns
    -------
    pixel2 : array
        Updated, despiked timeseries y-values.
    doys2 : array
        Updated timeseries x-values (nodata and NaN values removed).
    """
    # drop nodata (zero) and NaN observations:
    valid_inds = np.where((pixel != 0) & np.isfinite(pixel))
    doys2 = doys[valid_inds]
    pixel2 = pixel[valid_inds]
    iterations = 0
    max_dif = 100  # initialise above any possible threshold so the loop runs at least once
    while max_dif > thresh and iterations < 1000:
        # create left- and right-shifted versions of the arrays:
        y1 = np.roll(pixel2, -1)   # right neighbour
        y2 = np.roll(pixel2, 1)    # left neighbour
        x1 = np.roll(doys2, -1)
        x2 = np.roll(doys2, 1)
        # handle edges (the edge interpolations are overwritten below):
        x1[0], y1[0], x2[-1], y2[-1] = doys2[0], pixel2[0], doys2[-1], pixel2[-1]
        # calculate the neighbour interpolation for each point:
        interp = ((y2 - y1) / (x2 - x1)) * (doys2 - x1) + y1
        # set edges of interpolation as the mean of the closest 2 points:
        interp[0], interp[-1] = np.mean(pixel2[1:3]), np.mean(pixel2[-3:-1])
        diff = interp - pixel2  # difference between interpolation and value
        max_dif = np.max(diff)
        # replace the largest spike if it exceeds the threshold:
        if max_dif > thresh:
            bigspike = np.argmax(diff)
            pixel2[bigspike] = interp[bigspike]
        iterations += 1
    return pixel2, doys2
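And here is a rough sketch of the stacked application mentioned above, assuming every pixel shares the same acquisition dates. The wrapper re-samples the despiked series back onto the original dates so that np.apply_along_axis() can assemble a fixed-shape output (the array names and shapes here are made up):

import numpy as np

# Hypothetical stack of NDVI images, shaped (lat, lon, time), and its acquisition dates
stack = np.random.uniform(0.2, 0.8, size=(50, 50, 24))
doys = np.linspace(100, 280, 24)

def despike_pixel(pixel, doys=doys, thresh=0.05):
    # Despike one pixel's trajectory, then re-sample onto the original dates so that
    # every pixel returns an array of the same length (nodata removal can shorten it)
    y, x = despike(pixel, doys, thresh)
    return np.interp(doys, x, y)

# Apply along the time axis (axis 2); output has the same shape as the input stack
despiked = np.apply_along_axis(despike_pixel, 2, stack)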