TL;DR Learn how to classify Time Series data from accelerometer sensors using LSTMs in Keras
Can you use Time Series data to recognize user activity from accelerometer data? Your phone/wristband/watch is already doing it. How well can you do it?
We’ll use accelerometer data, collected from multiple users, to build a Bidirectional LSTM model and try to classify the user activity. You can deploy/reuse the trained model on any device that has an accelerometer (which is pretty much every smart device).
This is the plan:
Our data was collected under controlled laboratory conditions. It is provided by the WISDM: WIreless Sensor Data Mining lab.
The data is used in the paper: Activity Recognition using Cell Phone Accelerometers. Take a look at the paper to get a feel for how well some baseline models perform.
Let’s download the data:
!gdown --id 152sWECukjvLerrVG2NUO8gtMFg83RKCF --output WISDM_ar_latest.tar.gz
!tar -xvf WISDM_ar_latest.tar.gz
The raw file is missing column names. Also, the last column has a trailing ";" after each value. Let’s fix that:
import numpy as np
import pandas as pd

column_names = [
    'user_id', 'activity', 'timestamp', 'x_axis', 'y_axis', 'z_axis'
]
df = pd.read_csv(
    'WISDM_ar_v1.1/WISDM_ar_v1.1_raw.txt',
    header=None,
    names=column_names
)
# Strip the trailing ';' from the last column, then convert it to floats
df['z_axis'] = df['z_axis'].replace(regex=True, to_replace=r';', value=r'')
df['z_axis'] = df['z_axis'].astype(np.float64)
# Drop incomplete rows
df.dropna(axis=0, how='any', inplace=True)
df.shape
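To see the cleanup in isolation, here is a minimal sketch that runs the same steps on a few in-memory lines shaped like the raw file. The sample rows are made up for illustration, not copied from the dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample rows in the raw file's layout (values are made up).
raw = io.StringIO(
    "33,Jogging,49105962326000,-0.69,12.68,0.50;\n"
    "33,Jogging,49106062271000,5.01,11.26,0.95;\n"
    "33,Walking,49106112167000,4.90,10.88,\n"
)

column_names = ['user_id', 'activity', 'timestamp', 'x_axis', 'y_axis', 'z_axis']
sample = pd.read_csv(raw, header=None, names=column_names)

# Strip the trailing ';' and convert the column to floats.
sample['z_axis'] = sample['z_axis'].replace(regex=True, to_replace=r';', value=r'')
sample['z_axis'] = sample['z_axis'].astype(np.float64)

# The third row has no z value, so it is dropped here.
sample = sample.dropna(axis=0, how='any')
print(sample.shape)
```

Rows with a missing reading disappear at the `dropna` step, which is why the cleaned frame can be shorter than the raw file.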
The data has the following features:
user_id - unique identifier of the user doing the activity
activity - the category of the current activity
x_axis, y_axis, z_axis - accelerometer data for each axis
What can we learn from the data?
We have six different categories. Let’s look at their distribution:
Walking and jogging are severely overrepresented. You might apply some techniques to balance the dataset.
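One common option is to weight classes inversely to their frequency. The sketch below uses scikit-learn's `compute_class_weight` on made-up labels; the counts are hypothetical and this weighting is not used in the rest of the tutorial.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label counts, only to illustrate the imbalance.
labels = np.array(['Walking'] * 6 + ['Jogging'] * 3 + ['Sitting'] * 1)

classes = np.unique(labels)
# 'balanced' gives each class the weight n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
class_weight = dict(zip(classes, weights))
print(class_weight)
```

Keras accepts weights like these through the `class_weight` argument of `model.fit`, keyed by class index rather than by name.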
We have multiple users. How much data do we have per user?
Most users (except the last 3) have a decent amount of records.
What do different types of activities look like? Let’s take the first 200 records and have a look:
Sitting is, well, pretty relaxed. How about jogging?
This looks much bouncier. Good, the types of activities can be separated/classified by observing the data (at least for that sample of those 2 activities).
We need to figure out a way to turn the data into sequences along with the category for each one.
The first thing we need to do is to split the data into training and test datasets. We’ll use the data from users with id below or equal to 30 for training. The rest will be for testing:
# Copy the slices so that later in-place changes do not touch the original frame
df_train = df[df['user_id'] <= 30].copy()
df_test = df[df['user_id'] > 30].copy()
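Splitting by user rather than by row keeps every recording of a person on one side of the split, so the test set measures how well the model generalizes to people it has never seen. A small check on made-up data:

```python
import pandas as pd

# Hypothetical miniature frame; only the user_id column matters here.
toy = pd.DataFrame({
    'user_id': [1, 1, 30, 31, 36, 36],
    'x_axis': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

toy_train = toy[toy['user_id'] <= 30]
toy_test = toy[toy['user_id'] > 30]

# No user should appear on both sides of the split.
shared_users = set(toy_train['user_id']) & set(toy_test['user_id'])
print(len(toy_train), len(toy_test), shared_users)
```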
Next, we’ll scale the accelerometer data values:
from sklearn.preprocessing import RobustScaler

scale_columns = ['x_axis', 'y_axis', 'z_axis']

# Learn the scaling statistics from the training users only
scaler = RobustScaler()
scaler = scaler.fit(df_train[scale_columns])

# Apply the same statistics to both sets
df_train.loc[:, scale_columns] = scaler.transform(df_train[scale_columns])
df_test.loc[:, scale_columns] = scaler.transform(df_test[scale_columns])
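`RobustScaler` centers each column on its median and divides by its interquartile range, which makes it less sensitive to outliers than mean/variance scaling. A minimal sketch with made-up numbers, where the test value is transformed with statistics learned from the training values only:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up training column with one outlier.
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test = np.array([[5.0]])

toy_scaler = RobustScaler().fit(train)

# Median of the training values and their interquartile range (75th - 25th percentile).
print(toy_scaler.center_, toy_scaler.scale_)

# (5 - median) / IQR, using the training statistics.
print(toy_scaler.transform(test))
```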
Note that we fit the scaler only on the training data. How can we create the sequences? We’ll just modify the create_dataset function a bit:
import numpy as np

def create_dataset(X, y, time_steps=1, step=1):
    Xs, ys = [], []
    for i in range(0, len(X) - time_steps, step):
        v = X.iloc[i:(i + time_steps)].values
        labels = y.iloc[i:(i + time_steps)]
        Xs.append(v)
        # Most frequent label in the window (pandas' mode also handles string labels)
        ys.append(labels.mode().iloc[0])
    return np.array(Xs), np.array(ys).reshape(-1, 1)
We choose the label (category) by using the mode of all categories in the sequence. That is, given a sequence of length time_steps, we classify it as the category that occurs most often.
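To make the labelling rule concrete, here is a small sketch on made-up data. It uses a simplified re-implementation of the windowing (the helper name `window_labels` is hypothetical) together with pandas' `Series.mode`.

```python
import pandas as pd

def window_labels(y, time_steps, step):
    """Return the most frequent label of each sliding window over y."""
    labels = []
    for i in range(0, len(y) - time_steps, step):
        window = y.iloc[i:i + time_steps]
        # mode() returns all most-frequent values in sorted order; take the first.
        labels.append(window.mode().iloc[0])
    return labels

# Made-up activity column: 5 Sitting records followed by 5 Jogging records.
y = pd.Series(['Sitting'] * 5 + ['Jogging'] * 5)

result = window_labels(y, time_steps=4, step=2)
print(result)
```

Windows that straddle a change of activity get the majority label, so a few records near each transition end up under the "wrong" category.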
Here’s how to create the sequences:
TIME_STEPS = 200
STEP = 40

X_train, y_train = create_dataset(
    df_train[['x_axis', 'y_axis', 'z_axis']],
    df_train.activity,
    TIME_STEPS,
    STEP
)
X_test, y_test = create_dataset(
    df_test[['x_axis', 'y_axis', 'z_axis']],
    df_test.activity,
    TIME_STEPS,
    STEP
)
Let’s have a look at the shape of the new sequences:
(22454, 200, 3) (22454, 1)
We have significantly reduced the amount of training and test data. Let’s hope that our model will still learn something useful.
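The reduction follows from the windowing: a new window starts every `step` records, so the number of sequences is roughly the number of records divided by `step`. A small sketch of that arithmetic (the helper `count_windows` is hypothetical and mirrors the `range` used when building the sequences; the record count is made up):

```python
def count_windows(n_records, time_steps, step):
    """Number of windows produced by range(0, n_records - time_steps, step)."""
    return len(range(0, n_records - time_steps, step))

# Made-up record count, combined with the tutorial's window settings.
n_windows = count_windows(n_records=1000, time_steps=200, step=40)
print(n_windows)  # 20
```

A smaller `step` yields more, heavily overlapping sequences at the cost of longer training.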
The last preprocessing step is the encoding of the categories:
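from sklearn.preprocessing import OneHotEncoder

# scikit-learn 1.2+ names this parameter sparse_output; older releases call it sparse
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc = enc.fit(y_train)

y_train = enc.transform(y_train)
y_test = enc.transform(y_test)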
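As an illustration of what the encoder produces, here is a sketch on made-up labels. It relies on the encoder's default sparse output and converts it with `toarray()`, which sidesteps the dense-output parameter whose name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up labels as a column vector of shape (n_samples, 1).
toy_labels = np.array(['Sitting', 'Jogging', 'Walking', 'Jogging']).reshape(-1, 1)

toy_enc = OneHotEncoder(handle_unknown='ignore').fit(toy_labels)

# Categories are sorted alphabetically; each encoded row has a single 1.
print(toy_enc.categories_[0])
encoded = toy_enc.transform(toy_labels).toarray()
print(encoded)

# A label unseen during fit becomes an all-zero row with handle_unknown='ignore'.
unknown = toy_enc.transform(np.array([['Standing']])).toarray()
print(unknown)
```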
Done with the preprocessing! How good is our model going to be at recognizing user activities?
We’ll start with a simple Bidirectional LSTM model. You can try and increase the complexity. Note that the model is relatively slow to train:
from tensorflow import keras

model = keras.Sequential()
model.add(
    keras.layers.Bidirectional(
        keras.layers.LSTM(
            units=128,
            # (time steps, features) of a single sequence
            input_shape=[X_train.shape[1], X_train.shape[2]]
        )
    )
)
model.add(keras.layers.Dropout(rate=0.5))
model.add(keras.layers.Dense(units=128, activation='relu'))
# One output unit per activity class
model.add(keras.layers.Dense(y_train.shape[1], activation='softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['acc']
)
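To get a feel for the size of this model, the parameter count can be worked out by hand. The sketch below assumes 3 input features (the three axes), the six activity classes, and Keras' default LSTM configuration with biases; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

n_features, n_classes, units = 3, 6, 128

bi_lstm = 2 * lstm_params(n_features, units)   # forward + backward LSTM
dense_hidden = dense_params(2 * units, 128)    # Bidirectional concatenates both directions
dense_out = dense_params(128, n_classes)       # Dropout adds no parameters

total = bi_lstm + dense_hidden + dense_out
print(bi_lstm, dense_hidden, dense_out, total)
```

Most of the weights sit in the recurrent layer, which is also where most of the training time goes.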
The actual training process is straightforward (remember to not shuffle):
history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.1,
    shuffle=False
)
How good is our model?
Here’s how the training process went:
You can surely come up with a better model/hyperparameters and improve it. How well can it predict the test data?
~88% accuracy. Not bad for a quick and dirty model. Let’s have a look at the confusion matrix:
y_pred = model.predict(X_test)
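`model.predict` returns one probability per class for every sequence, so both the predictions and the one-hot test labels have to be turned back into class indices before building the matrix. A minimal sketch with made-up probabilities for three classes, using scikit-learn's `confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up one-hot labels and softmax outputs for 4 sequences and 3 classes.
toy_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
toy_pred = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],   # class 2 mistaken for class 1
    [0.1, 0.2, 0.7],
])

# argmax picks the index of the largest value in each row.
true_idx = toy_true.argmax(axis=1)
pred_idx = toy_pred.argmax(axis=1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_idx, pred_idx)
print(cm)
```

With the fitted encoder, `enc.categories_[0]` lists the class names in the same order as the matrix columns.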
Our model is confusing the Upstairs and Downstairs activities. That’s somewhat expected. Additionally, when developing a real-world application, you might merge those two and consider them a single class/category. Recall that there is a significant imbalance in our dataset, too.
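Merging the two stair classes can be done on the activity column before creating the sequences. A sketch on made-up rows; the merged label name `Stairs` is an assumption, not something the dataset defines:

```python
import pandas as pd

# Made-up activity column.
toy = pd.DataFrame({'activity': ['Upstairs', 'Walking', 'Downstairs', 'Sitting']})

# Map both stair activities to one hypothetical label.
toy['activity'] = toy['activity'].replace({'Upstairs': 'Stairs', 'Downstairs': 'Stairs'})
merged = toy['activity'].tolist()
print(merged)
```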
You did it! You’ve built a model that recognizes activity from 200 records of accelerometer data. Your model achieves ~88% accuracy on the test data. Here are the steps you took:
You learned how to build a Bidirectional LSTM model and classify Time Series data. There is even more fun with LSTMs and Time Series coming next :)