Apparent temperature prediction for weather history dataset

Avarjana Panditha
7 min read · Sep 8, 2022


The dataset includes an hourly/daily weather summary for the Szeged, Hungary area, between 2006 and 2016.

Source: https://www.kaggle.com/budincsevity/szeged-weather

Data available in the hourly response:

  • time: Date and Time of the data collected
  • summary: Summary as a sentence
  • precipType: Rain or Snow (Categorical)
  • temperature: Temperature in C (Numerical)
  • apparentTemperature: Apparent temperature in C (Numerical)
  • humidity: Humidity value (Numerical)
  • windSpeed: Wind speed in km/h (Numerical)
  • windBearing: Wind bearing in degrees (Numerical)
  • visibility: Visibility range in km (Numerical)
  • loudCover: Numerical value
  • pressure: Pressure in millibars (Numerical)
  • daily summary: Daily summary description (Categorical)

We are going to predict the Apparent Temperature from the given set of features. Let’s move to the dataset analysis.

For the complete source code (Python Notebook): Source Code

What’s coming up

  • Drop nulls and duplicates
  • Preprocessing numerical columns
  • Preprocessing categorical columns
  • Split the dataset for training and testing then standardize
  • Identify correlations in the training data
  • Principal Component Analysis (PCA) to reduce the feature count
  • Fit the model
  • Model evaluation
  • Further thoughts about the relationships and predictions
  • Humidity and Apparent temperature?

Drop nulls and duplicates

First, we need to check whether the dataset contains null or duplicate values. If it does, we need to handle them, either by dropping the affected rows or by imputing values. Only a small fraction of rows contain nulls (all in Precip Type), so we simply drop them.

print(df_original.isnull().sum())
output:
Formatted Date 0
Summary 0
Precip Type 517
Temperature (C) 0
Apparent Temperature (C) 0
Humidity 0
Wind Speed (km/h) 0
Wind Bearing (degrees) 0
Visibility (km) 0
Loud Cover 0
Pressure (millibars) 0
Daily Summary 0
df_original = df_original.dropna(axis=0, how='any')
print(df_original.isnull().sum())
output:
Formatted Date 0
Summary 0
Precip Type 0
Temperature (C) 0
Apparent Temperature (C) 0
Humidity 0
Wind Speed (km/h) 0
Wind Bearing (degrees) 0
Visibility (km) 0
Loud Cover 0
Pressure (millibars) 0
Daily Summary 0

Now we can move on to duplicates and remove them. There were 24 duplicates in the dataset.

df_original = df_original.drop_duplicates()
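If you want to see how many duplicates there are before dropping them, something like the following should work. This is a minimal sketch on a made-up toy frame (the values are not from the dataset); duplicated() and drop_duplicates() are standard pandas calls.

```python
import pandas as pd

# Toy frame standing in for df_original (values are made up for illustration).
df = pd.DataFrame({
    "Temperature (C)": [9.4, 9.4, 8.3, 8.3, 7.1],
    "Humidity": [0.89, 0.89, 0.83, 0.83, 0.95],
})

# duplicated() marks every repeat of an earlier row; sum() counts them.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
df_unique = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(df_unique))
```

Resetting the index afterwards is optional, but it keeps the row labels contiguous for later steps.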

Preprocessing numerical columns

First, we have to identify which columns hold numerical data. We could pick them out by hand from the dataset information above, but one line of code does it for us.

import numpy as np  # needed for np.number
numerical_df = df_original.select_dtypes(include=np.number)

This will provide us with a data frame that contains only the numerical data. When you run info() on the above data frame, it should look like this.

Numerical data columns

Now we can use describe() to check whether there are unwanted columns.

Dataframe describe()

We can see that the Loud Cover column contains only 0 values, so it can be removed. Next, we use box plots, histograms and Q-Q plots to identify outliers and skewness in the data.

Outliers and invalid data

Observe the following,

Humidity
Wind speed
Pressure

Humidity values of 0 and pressure values of 0 are invalid. Wind speeds over 60 km/h can be treated as outliers.
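The cleanup described above could be sketched like this. The sample values are made up, and dropping the offending rows (rather than imputing them) is my assumption for the sketch, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical sample; the column names follow the dataset, the values do not.
df = pd.DataFrame({
    "Humidity": [0.89, 0.0, 0.72, 0.95, 0.60],
    "Wind Speed (km/h)": [14.1, 3.9, 63.8, 11.0, 8.2],
    "Pressure (millibars)": [1015.1, 1016.4, 1009.2, 0.0, 1012.7],
    "Loud Cover": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# A column with a single unique value carries no information for the model.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# Keep only rows with non-zero humidity and pressure,
# and wind speed at or below the 60 km/h outlier threshold.
mask = (
    (df["Humidity"] > 0)
    & (df["Pressure (millibars)"] > 0)
    & (df["Wind Speed (km/h)"] <= 60)
)
df_clean = df[mask].reset_index(drop=True)

print(constant_cols, len(df_clean))
```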

Skewness

From the plots, it can be seen that Wind speed is right-skewed while Humidity is left-skewed. To reduce the skewness, we apply a log transformation to the right-skewed column and an exp transformation to the left-skewed one. After applying the transformations, the plots are as follows.

Wind speed after transformation
Humidity
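To get a feel for what these transformations do, here is a small sketch on synthetic data. The two distributions are stand-ins I picked because they are skewed in the right directions, and using log1p instead of a plain log is an assumption so that zero wind speeds don't break the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: lognormal is right-skewed, beta(5, 2) is left-skewed.
wind_speed = pd.Series(rng.lognormal(mean=2.0, sigma=0.6, size=5000))
humidity = pd.Series(rng.beta(5, 2, size=5000))

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0.
wind_speed_log = np.log1p(wind_speed)

# exp stretches the upper end of the range, which pulls a left tail in.
humidity_exp = np.exp(humidity)

print("wind speed skew:", wind_speed.skew(), "->", wind_speed_log.skew())
print("humidity skew:  ", humidity.skew(), "->", humidity_exp.skew())
```

The skew() values should move toward 0 in both cases; how close they get depends on the data, so it is worth re-checking the plots after transforming.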

Preprocessing categorical columns

The same method is used to identify the categorical columns in the dataset.

cat_df = df_original.select_dtypes(exclude=np.number)

From the categorical columns, I decided to remove the daily summary column and keep the summary column because it gives a more concise description of the conditions at the time of each observation. Following are the categorical columns after.

Categorical columns

Looking at the number of unique values, the Summary column has more than 3 unique values, while the ‘Precip Type’ column has only two. We can drop the date column since we are not interested in any time series patterns. Now we will use One Hot Encoding for the Summary column. This encoding introduces new columns, and we drop the original column.

Unique value counts

After the encoding we have the following shape,

After categorical encoding
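One way to do this encoding is with pd.get_dummies, sketched below on a few made-up rows. Mapping the two-valued Precip Type column to a single 0/1 flag is my assumption here; it is just one reasonable option for a binary column.

```python
import pandas as pd

# Made-up rows; "Summary" and "Precip Type" mirror the dataset's column names.
cat_df = pd.DataFrame({
    "Summary": ["Partly Cloudy", "Clear", "Overcast", "Clear"],
    "Precip Type": ["rain", "rain", "snow", "rain"],
})

# One new indicator column per category; the original column is dropped.
encoded = pd.get_dummies(cat_df, columns=["Summary"], dtype=int)

# A two-valued column can be mapped to a single 0/1 flag instead.
encoded["Precip Type"] = (encoded["Precip Type"] == "snow").astype(int)

print(encoded.columns.tolist())
```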

Now we merge the numerical and categorical data frames to get the dataset ready for further analysis.

Split the dataset for training and testing then standardize

Now that the initial transformations are done, we can split the dataset into training and testing sets. We split before scaling to avoid data leakage: the scaler should not see any statistics from the test set.

x_train, x_test, y_train, y_test = train_test_split(df_features, df_target, test_size = 0.2, random_state = 101)
x_train=x_train.reset_index(drop=True)
x_test=x_test.reset_index(drop=True)
y_train=y_train.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)

Next, standardization is carried out on the numerical features and the target of both the training and testing sets. The scaler is fitted on the training data only and then applied to both sets. One hot encoded values are not scaled here. We use StandardScaler for the numerical columns in the dataset.

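from sklearn.preprocessing import StandardScaler
# Fit on the training features only, then transform both sets
scaler = StandardScaler()
scaler.fit(x_train_std)
x_train_scaled = scaler.transform(x_train_std)
x_test_scaled = scaler.transform(x_test_std)
# Refit the same scaler on the training target, then transform both targets
scaler.fit(y_train_std)
y_train_scaled = scaler.transform(y_train_std)
y_test_scaled = scaler.transform(y_test_std)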
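Here is a minimal sketch of how the numerical columns might be scaled while the one hot encoded columns are left untouched. The frames and the numerical_cols list are hypothetical stand-ins for the real data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test frames with two numerical and one one-hot column.
x_train = pd.DataFrame({
    "Humidity": [0.5, 0.7, 0.9, 0.6],
    "Pressure (millibars)": [1010.0, 1015.0, 1020.0, 1005.0],
    "Summary_Clear": [1, 0, 0, 1],
})
x_test = pd.DataFrame({
    "Humidity": [0.8],
    "Pressure (millibars)": [1012.0],
    "Summary_Clear": [0],
})
numerical_cols = ["Humidity", "Pressure (millibars)"]

# Fit on the training rows only so no test statistics leak into the scaler.
scaler = StandardScaler().fit(x_train[numerical_cols])

x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()
x_train_scaled[numerical_cols] = scaler.transform(x_train[numerical_cols])
x_test_scaled[numerical_cols] = scaler.transform(x_test[numerical_cols])

print(x_train_scaled.round(3))
```

After this, the scaled training columns have mean 0 and unit variance, while the test columns are only approximately standardized because they reuse the training statistics.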

Now we can move forward with the training dataset to identify correlations.

Identify correlations in the training data

I will consider the numerical variables with the target y to evaluate the correlations.

There is a strong correlation between temperature and apparent temperature. I will not drop the temperature value, since the apparent temperature we are going to predict depends on the combined effects of air temperature, relative humidity and wind speed. At the end, I will also evaluate the model without temperature and compare the results.
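A quick way to look at such correlations is DataFrame.corr(). The sketch below uses synthetic data where the target is built mostly from temperature; the formula is made up purely to give a strong correlation and is not the real apparent temperature formula.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic data: the target is mostly temperature plus a small humidity effect.
temperature = rng.normal(12, 9, n)
humidity = rng.uniform(0.3, 1.0, n)
apparent = temperature - 2.0 * (1 - humidity) + rng.normal(0, 0.5, n)

train = pd.DataFrame({
    "Temperature (C)": temperature,
    "Humidity": humidity,
    "Apparent Temperature (C)": apparent,
})

# Pearson correlation of every feature with the target column.
target = "Apparent Temperature (C)"
corr_with_target = train.corr()[target].drop(target)

print(corr_with_target.sort_values(ascending=False))
```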

Principal Component Analysis (PCA) to reduce the feature count

PCA is run with a maximum of 10 components, which together cover approximately 94% of the variance in the training features.

PCA components
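The PCA step could look roughly like the sketch below. The random matrices are stand-ins for the scaled features, so the explained variance here will be nowhere near the 94% reported above; the point is only to show the calls.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic stand-in for the scaled features: 20 columns of random noise.
x_train_scaled = rng.normal(size=(500, 20))
x_test_scaled = rng.normal(size=(100, 20))

# Fit on training data only, then project both sets onto the same components.
pca = PCA(n_components=10)
x_train_pca = pca.fit_transform(x_train_scaled)
x_test_pca = pca.transform(x_test_scaled)

# Cumulative share of the training variance kept by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(x_train_pca.shape, round(float(cumulative[-1]), 3))
```

The last entry of cumulative is the share of variance retained by all 10 components, which is the number to compare against the 94% figure.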

Fit the Model

Now we fit the model,

from sklearn import linear_model
lm = linear_model.LinearRegression()
model = lm.fit(x_train_pca, y_train)

We can make some predictions to get an idea of how the model performs.

predictions = lm.predict(x_test_pca)
y_hat = pd.DataFrame(predictions, columns=["Predicted Apparent Temperature (C)"])

print(y_test.head(10))
print(y_hat.head(10))
OUTPUT
Apparent Temperature (C)
0 -1.173712
1 0.177357
2 0.021824
3 -0.829983
4 1.214249
5 -0.407968
6 1.105375
7 0.547528
8 -0.046093
9 1.112634
Predicted Apparent Temperature (C)
0 -1.090460
1 0.036800
2 -0.218561
3 -0.776356
4 1.184802
5 -0.457118
6 1.060102
7 0.570514
8 -0.191566
9 1.199049

Checking on the coefficients to ensure that the model is not overfitted,

print(lm.coef_)
output:
[[-0.39348664 -0.23035125 -0.01810984  0.25100567  0.54006384  0.04059997 -0.06618654  0.00419057 -1.10881734 -0.19454081]]

Not bad, I would say, but we need to make sure that the model also performs well by the numbers. Let’s move on to that.

Model evaluation

We can plot the predictions against the actual values to see if there are any visible problems in the model.

Prediction vs Actual

It seems like the model is working almost perfectly for the predictions.

Other evaluation metrics,

  • MSE (Mean Squared Error) : 0.0096
  • Root MSE : 0.098
  • Accuracy Score : 99.023%
  • K fold cross validation scores (with 8 folds):
    0.99028117
    0.99020645
    0.9903912
    0.99007462
    0.99017325
    0.99041776
    0.99041285
    0.99028938
  • Cross-validation score : 99.028%
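Metrics like these can be computed with scikit-learn as sketched below. The data is synthetic, and I am assuming that the "accuracy score" above is the R^2 value that LinearRegression.score returns for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(3)

# Synthetic regression problem standing in for the PCA features and target.
X = rng.normal(size=(800, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.3, 800)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

lm = LinearRegression().fit(x_train, y_train)
predictions = lm.predict(x_test)

mse = mean_squared_error(y_test, predictions)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test, predictions)  # same value as lm.score(x_test, y_test)

# 8-fold cross-validation; for a regressor the default score is R^2.
cv_scores = cross_val_score(LinearRegression(), x_train, y_train, cv=8)

print(mse, rmse, r2, cv_scores.mean())
```

When the target is standardized to unit variance, MSE and R^2 are closely linked (R^2 is roughly 1 minus MSE), which is consistent with the numbers reported above.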

Woah! The model is working too perfectly. Let’s talk more about that in the next section.

Further thoughts about the relationships and predictions

We decided to include Temperature, which has a 0.99 correlation with Apparent Temperature, in the training features. What if we don’t include it?

Without Temperature

Other evaluation metrics,

  • MSE (Mean Squared Error) : 0.3707
  • Root MSE : 0.6088
  • Accuracy Score : 62.319%
  • Cross-validation score (8 folds) : 62.654%

There we go. Since we have removed the main relationship between Temperature and Apparent Temperature, we cannot get the same accuracy, but we get a more generalized model.
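The effect of dropping a dominant feature can be reproduced on synthetic data. In the sketch below, the relationships between the variables are made up so that one feature carries most of the signal; the scores will not match the ones above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 2000

# Made-up relationship: the target is dominated by the first feature.
temperature = rng.normal(0, 1, n)
humidity = -0.6 * temperature + rng.normal(0, 0.8, n)
wind = rng.normal(0, 1, n)
target = temperature + 0.1 * humidity - 0.1 * wind + rng.normal(0, 0.1, n)

with_temp = np.column_stack([temperature, humidity, wind])
without_temp = np.column_stack([humidity, wind])

# Mean R^2 over 8 folds, with and without the dominant feature.
score_with = cross_val_score(LinearRegression(), with_temp, target, cv=8).mean()
score_without = cross_val_score(LinearRegression(), without_temp, target, cv=8).mean()

print(round(score_with, 3), round(score_without, 3))
```

The remaining features only recover the part of the signal they share with the dropped one, which is why the score falls so sharply.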

Humidity and Apparent temperature?

The same goes for humidity and apparent temperature. A relationship exists, but we cannot get very high accuracy from such a correlation alone, and the prediction is not accurate when less data is available. If we have at least the three components needed to evaluate the Apparent Temperature (Temperature, Wind Speed, Humidity), then we can build a good prediction model.
