Clustering of weather data using k-means

Saleena John
Web Mining [IS688, Spring 2021]
May 7, 2021

--

This assignment was done to learn about clustering data using Python. I used the minute_weather data set from Kaggle. Using clustering analysis, I am trying to build a big-picture model of the weather at a local station. The data set has more than a million rows, and I am creating 12 clusters from it.

Importing the Libraries

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
%matplotlib inline

Data Description

minute_weather data as csv
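The load step itself would look something like this (the file name minute_weather.csv is an assumption):

data = pd.read_csv('minute_weather.csv')  # read the raw minute-level measurements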
data.shape

The minute_weather data set contains raw sensor measurements captured at one-minute intervals. The local weather station is in San Diego, California. The sensors capture measurements such as air temperature, air pressure, and relative humidity. The data set covers September 2011 to September 2014, which is sufficient to provide information on a wide range of weather conditions.

Each row in the data set consists of weather measurements captured over a one-minute interval. The variables in the data set are:

  • rowID: It is a unique integer representing each row.
  • hpwren_timestamp: The timestamp of the recording time of the measurement ( yyyy-mm-dd hh:mm:ss)
  • air_pressure: Air pressure measured at the timestamp in hectopascals.
  • air_temp: Air temperature measured at the timestamp in degrees Fahrenheit.
  • avg_wind_direction: The averaged wind direction over the minute before the timestamp, in degrees (0 denotes north, increasing clockwise).
  • avg_wind_speed: The average wind speed over the minute before the timestamp in meters per second.
  • max_wind_direction: The highest wind direction over the minute before the timestamp, in degrees (0 denotes north, increasing clockwise).
  • max_wind_speed: The highest wind speed over the minute before the timestamp in meters per second.
  • min_wind_direction: The smallest wind direction over the minute before the timestamp, in degrees (0 denotes north, increasing clockwise).
  • min_wind_speed: The smallest wind speed over the minute before the timestamp in meters per second.
  • rain_accumulation: The amount of rain accumulated at the timestamp in millimetres.
  • rain_duration: The length of time the rain has fallen measured at the timestamp in seconds.
  • relative_humidity: The relative humidity measured at the timestamp in percentage.

The data is then sampled down by taking every tenth row.

sampled_df = data[(data['rowID'] % 10) == 0]
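Assuming rowID is a consecutive integer starting at 0, this modulo filter keeps every tenth record; an equivalent positional slice would be

sampled_df = data.iloc[::10]  # same result under that assumption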

Then I did some visualizations on this sampled data. I created three functions for this: one for histograms of the column distributions, one for a correlation matrix, and one for a scatter and density plot matrix.

def plotColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    # For display purposes, pick columns that have between 1 and 50 unique values
    df = df[[col for col in df if 1 < nunique[col] < 50]]
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow  # integer number of subplot rows
    plt.figure(num=None, figsize=(6 * nGraphPerRow, 8 * nGraphRow), dpi=80, facecolor='w', edgecolor='b')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if not np.issubdtype(type(columnDf.iloc[0]), np.number):
            # Non-numeric column: bar chart of value counts
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            # Numeric column: histogram
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation=90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)
    plt.show()
Histograms of the column distributions
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName  # name attached to the DataFrame, used in the plot title
    df = df.dropna(axis='columns')  # drop columns with NaN values
    df = df[[col for col in df if df[col].nunique() > 1]]  # keep columns with more than 1 unique value
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum=1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()
Correlation matrix of the sampled data
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include=[np.number])  # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]]  # keep columns with more than 1 unique value
    columnNames = list(df)
    if len(columnNames) > 10:  # reduce the number of columns for the kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    # Annotate the upper-triangle panels with the correlation coefficients
    for i, j in zip(*np.triu_indices_from(ax, k=1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction',
                          ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()
Scatter and density plots of the sampled data
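The three functions can then be called on the sampled data; the argument values here are only illustrative:

sampled_df.dataframeName = 'minute_weather.csv'  # attribute read by plotCorrelationMatrix for its title
plotColumnDistribution(sampled_df, 10, 5)
plotCorrelationMatrix(sampled_df, 8)
plotScatterMatrix(sampled_df, 20, 10)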

The statistics on the sampled data are as follows

sampled_df.describe().transpose()

Then I dropped the rows with empty rain_accumulation and rain_duration values; 46 rows were dropped. But I then decided not to use rain_duration and rain_accumulation for the clustering analysis. Hence my features of interest are the remaining sensor measurements, selected as sketched below.
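A minimal sketch of the selection; the exact feature list is my assumption, based on the columns described above (rowID, the timestamp, the rain columns, and the minimum-wind columns are left out):

# Hypothetical feature list used for clustering
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed',
            'max_wind_direction', 'max_wind_speed', 'relative_humidity']
select_df = sampled_df[features]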

The next step is scaling the selected features using StandardScaler, which standardizes each column to zero mean and unit variance so that no single measurement dominates the Euclidean distances used by k-means.

X = StandardScaler().fit_transform(select_df)
X

Now, I apply k-means clustering with 12 clusters.

kmeans = KMeans(n_clusters=12)
model = kmeans.fit(X)
print("model\n", model)
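As an optional sanity check, the size of each of the 12 clusters can be inspected:

labels = model.labels_  # cluster assignment of each row in X
pd.Series(labels).value_counts().sort_index()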

The centres of the 12 clusters formed are

centers = model.cluster_centers_
centers

Then, using two helper functions, pd_centers and parallel_plot (sketched below), I plotted the dry days, warm days, and cool days; the plots are shown in the following sections.
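These helpers are not shown above; a minimal sketch of what they could look like, given the parallel_coordinates, cycle, and islice imports at the top (the exact bodies are an assumption):

def pd_centers(featuresUsed, centers):
    # Arrange the cluster centers in a DataFrame, one row per cluster,
    # with a 'prediction' column holding the cluster index
    colNames = list(featuresUsed) + ['prediction']
    Z = [np.append(center, index) for index, center in enumerate(centers)]
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

def parallel_plot(data):
    # Parallel-coordinates plot of the selected cluster centers, one colored line per cluster
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    plt.figure(figsize=(15, 8)).gca().set_ylim([-3, 3])
    parallel_coordinates(data, 'prediction', color=my_colors, marker='o')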

The table of cluster centers, one row per cluster, is as follows

P = pd_centers(features, centers)
P

Dry Days

A dry day is defined as a day with low relative humidity. Note that the cluster centers are in standardized units, so the filter below keeps clusters whose center humidity is more than half a standard deviation below the mean.

parallel_plot(P[P['relative_humidity'] < -0.5])

Here we can see that cluster 9 is the driest, while cluster 3 shows comparatively stable dry conditions.

Warm Days

As per weather forecasters, a warm day is one where the temperature is between 77˚F and 95˚F.

parallel_plot(P[P['air_temp'] > 0.5])

Cool Days

A cool day is defined as a day when the temperature is below 68˚F. The filter below keeps clusters whose centers have above-average humidity and below-average temperature.

parallel_plot(P[(P['relative_humidity'] > 0.5) & (P['air_temp'] < 0.5)])
