Machine Learning 1: Data Preprocessing

Getting Help

Even if you’ve missed the session, you’ll be able to ask questions about whatever, whenever on our Discord. Feel free to drop your question in the #bootcamp channel or just message an instructor directly.

Recording

This session has an accompanying blog post on the main Bionics blog that was written after this recording took place. As a result, there may be small improvements present in the written version of this lesson that are missing from the video recording.

The Big Picture

As even the most sophisticated machine learning algorithms are nothing more than a set of equations, they are inherently fickle and require data to be carefully pre-processed and “shaped” before they can do anything meaningful with it. As data scientists / computer programmers, it’s our job to frame the problems we want to solve in an ML-friendly way.

As discussed in the previous Summer of ML post, we’re looking to classify EMG patterns into certain “gestures”. Ultimately, we need to provide our ML algorithm with a set of “good examples” for it to learn from. In our case, this means generating pairs of raw EMG data and gesture classifications. However, before we go any further, we should take a closer look at how our unprocessed data is structured.

Our Data Set

Our data is split in two ways: first into gestures and then into repeats of the same gesture. Each unique gesture is given its own CSV (Comma-Separated Values) file, and the repeats therein are separated by divider lines.

In the end, we recorded 8 gestures stored in the following files:

data
├── emgData-G0.csv
├── emgData-G1.csv
├── emgData-G2.csv
├── emgData-G3.csv
├── emgData-G4.csv
├── emgData-G5.csv
├── emgData-G6.csv
└── emgData-G7.csv

Each file contains a header, then a series of data blocks separated by a divider containing the word “Repeat” and the number of the repeat:

time,emg0,emg1,emg2,emg3,emg4,emg5,emg6,emg7,quat0,quat1,quat2,quat3,acc0,acc1,acc2,gyro0,gyro2,gyro2
19,68,115,38,30,36,98,147,64,12304,-1895,2349,-10389,395,-417,1822,19,-125,37
36,66,132,39,29,39,94,152,60,12297,-1853,2436,-10385,-943,-988,1932,-20,119,-63
...
2389,79,98,27,32,40,136,187,95,12392,-1777,2419,-10289,-370,-575,1484,71,-132,-34
2412,84,110,29,32,42,135,188,100,12397,-1780,2394,-10289,-115,-495,1133,116,-125,-1
3,Repeat 2
26,92,122,28,37,38,98,169,91,12399,-1816,2312,-10299,346,-542,1004,97,-75,-1
48,94,119,25,36,40,97,171,90,12394,-1766,2406,-10292,-1182,-881,1299,75,35,-76
... and so on

Much like a spreadsheet, the first line of a CSV file stores a list of named columns, and each line that follows it represents a row of data in the table.
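To make that structure a little more concrete, here is a minimal sketch that reads a few lines in the same layout using Python's built-in csv module. The sample text is made up and shortened to three columns (an assumption for illustration, not our real data). Notice that the divider line parses into fewer fields than a normal row.

```python
import csv
import io

# A shortened, made-up sample in the same layout as our files
sample = """time,emg0,emg1
19,68,115
36,66,132
3,Repeat 2
26,92,122
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)   # the first line names the columns
rows = list(reader)     # every following line is one row of the table

# Divider lines have fewer fields than the header does
dividers = [row for row in rows if len(row) < len(header)]

print(header)    # ['time', 'emg0', 'emg1']
print(dividers)  # [['3', 'Repeat 2']]
```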

If you’d like a closer look at our data, feel free to view or download it here.

Loading Everything Into Pandas

The first step of data processing is almost always loading the data! Normally this is a trivial task in Python, but because we’ve split our data into several files, we’ll need to do some extra work.

Let’s start with a couple of imports!

import pandas as pd
import numpy as np
import os
import re

Let’s go through these imports one at a time and talk a little bit about why we need them to load our data.

Pandas is an incredibly popular Python library replicating the “DataFrames” found in the statistical programming language R. DataFrames are very similar to spreadsheets in structure – storing data in a series of (optionally named) columns and rows. Pandas provides us with not only DataFrames, but also several ways of easily creating them (like the read_csv() function we’ll be using later). There is a lot you can learn about Pandas, so it might be worth your time to check out the User Guide at some point.

Numpy is another commonly-used Python library for working with numerical data. Most relevant to us are the computationally-efficient arrays that ML libraries like TensorFlow often expect as inputs. If you’re looking to do any sort of scientific computing in Python, you’ll likely run into Numpy somewhere – the Learn Numpy page could come in handy.

The os module of the Python standard library (available on all Python installs) helps in writing platform-independent code. Different operating systems (Linux, macOS, Windows, etc) can have different ways of accessing files, running programs, or querying system information – the os module provides a single, consistent interface for these actions across all platforms. We’ll be primarily using this module to list and load files, but check out the Module Documentation if you’d like to know more.

Finally, we have the re import. This once again comes from the standard library, but now provides us with “regular expressions”. In a nutshell, regular expressions (sometimes called “regex” for short) are carefully constructed strings encoding some sort of search pattern. While regex strings can be intimidating (the string '^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$' might not look much like a phone number), we’ll only be using some very basic features to extract a gesture number from a filename. Regex101 is a wonderful place to learn and test regular expressions, and the Python Module Documentation can fill you in on the specifics.
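If you'd like to convince yourself that the intimidating string above really does describe phone numbers, here is a quick sketch that tries it on a few made-up inputs (the numbers are purely illustrative):

```python
import re

# The phone-number pattern quoted above, written as a raw string
phone = re.compile(r'^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$')

# Made-up inputs purely for illustration
candidates = ['(555) 123-4567', '555-123-4567', '5551234567', 'not a number']
results = [bool(phone.match(text)) for text in candidates]

print(results)  # [True, True, True, False]
```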

With that out of the way, let’s get to writing some real code! We’ll start by defining a function to load a single gesture from one of our CSV files.

def read_repeats(file):
    return pd.read_csv(file)

Now let’s test our super-simple function by pointing it to our data:

# My data is stored in the `data/final/` directory
>>> read_repeats('data/final/emgData-G0.csv')
       time emg0   emg1  emg2  emg3  ...   acc1    acc2  gyro0  gyro2  gyro2.1
0        19   68  115.0  38.0  30.0  ... -417.0  1822.0   19.0 -125.0     37.0
1        36   66  132.0  39.0  29.0  ... -988.0  1932.0  -20.0  119.0    -63.0
2        56   53  108.0  35.0  29.0  ... -988.0  1932.0  -20.0  119.0    -63.0
3        72   52  104.0  31.0  27.0  ... -502.0  2451.0  -64.0  -77.0     49.0
4        87   63  119.0  27.0  29.0  ... -375.0  1670.0    7.0  -91.0     62.0
...     ...  ...    ...   ...   ...  ...    ...     ...    ...    ...      ...
3833  24413   17   24.0  18.0  20.0  ... -663.0  1771.0   -6.0   -9.0     11.0
3834  24436   17   25.0  19.0  22.0  ... -680.0  1775.0  -22.0    1.0     15.0
3835  24453   16   22.0  20.0  21.0  ... -684.0  1776.0  -33.0    4.0     17.0
3836  24473   17   22.0  19.0  23.0  ... -684.0  1787.0  -43.0    9.0     16.0
3837  24496   15   20.0  20.0  24.0  ... -707.0  1774.0  -53.0   11.0     13.0

[3838 rows x 19 columns]

That’s certainly a good start, but we’ve got a bit of a problem: we’re loading in data we don’t need! In addition to just EMG signals, the Myo armband also records orientation and acceleration information that are currently cluttering our DataFrame.

Luckily for us, Pandas contains the loc[] accessor that we can use to select certain rows and columns from our DataFrame – let’s use this to keep only the EMG data:

def read_repeats(file):
    # `:` implicitly selects everything between the first and last rows
    # 'emg0':'emg7' just selects the range of columns containing the EMG data
    return pd.read_csv(file).loc[:, 'emg0':'emg7']
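One detail worth knowing about loc[]: label slices include both endpoints, unlike ordinary Python slices. The sketch below shows this on a tiny made-up table (the column values are placeholders, and only two EMG columns are included to keep things short).

```python
import pandas as pd

# A tiny made-up table standing in for one of our files
df = pd.DataFrame({'time': [19, 36], 'emg0': [68, 66],
                   'emg1': [115, 132], 'acc0': [395, -943]})

# Label slices with `.loc[]` include BOTH endpoints
emg_only = df.loc[:, 'emg0':'emg1']
print(list(emg_only.columns))  # ['emg0', 'emg1']

# Position slices with `.iloc[]` exclude the end, like normal Python slices
same_columns = df.iloc[:, 1:3]
print(list(same_columns.columns))  # ['emg0', 'emg1']
```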

Let’s test this new data-loading function!

>>> read_repeats('data/final/emgData-G0.csv')
     emg0   emg1  emg2  emg3  emg4  emg5   emg6  emg7
0      68  115.0  38.0  30.0  36.0  98.0  147.0  64.0
1      66  132.0  39.0  29.0  39.0  94.0  152.0  60.0
2      53  108.0  35.0  29.0  40.0  79.0  141.0  64.0
3      52  104.0  31.0  27.0  37.0  70.0  135.0  62.0
4      63  119.0  27.0  29.0  36.0  57.0  120.0  63.0
...   ...    ...   ...   ...   ...   ...    ...   ...
3833   17   24.0  18.0  20.0  25.0  34.0   57.0  50.0
3834   17   25.0  19.0  22.0  26.0  34.0   55.0  51.0
3835   16   22.0  20.0  21.0  22.0  29.0   52.0  45.0
3836   17   22.0  19.0  23.0  25.0  28.0   51.0  46.0
3837   15   20.0  20.0  24.0  25.0  29.0   50.0  41.0

[3838 rows x 8 columns]

That’s much more manageable! If we take a bit of a closer look though, we still have a problem… Let’s look at some of the rows around 121:

>>> read_repeats('data/final/emgData-G0.csv')[117:126]
         emg0   emg1  emg2  emg3  emg4   emg5   emg6   emg7
117        72   98.0  29.0  34.0  41.0  144.0  208.0  106.0
118        70  101.0  30.0  33.0  41.0  139.0  192.0  100.0
119        79   98.0  27.0  32.0  40.0  136.0  187.0   95.0
120        84  110.0  29.0  32.0  42.0  135.0  188.0  100.0
121  Repeat 2    NaN   NaN   NaN   NaN    NaN    NaN    NaN
122        92  122.0  28.0  37.0  38.0   98.0  169.0   91.0
123        94  119.0  25.0  36.0  40.0   97.0  171.0   90.0
124        82  125.0  26.0  33.0  37.0   87.0  154.0   90.0
125        89  108.0  32.0  36.0  42.0   94.0  153.0  104.0

Uh oh, our repeat-dividers seem to be causing us a bit of trouble! We’re going to have to track them down and manually split up our data!

Splitting Up Repeats

Note that this style of divider could easily be considered a design flaw on my part. If I’d instead added a column to every row of the DataFrame that indicated which repeat the data belonged to, I could have easily separated things using the .groupby() and .get_group() methods in Pandas.
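For the curious, here is a rough sketch of what that alternative could have looked like. The repeat column and the values below are hypothetical (our real files have no such column), so treat this as an illustration of .groupby() and .get_group() rather than something we can run on our data.

```python
import pandas as pd

# Made-up readings with a hypothetical `repeat` column on every row
df = pd.DataFrame({
    'repeat': [1, 1, 2, 2, 2],
    'emg0':   [68, 66, 92, 94, 82],
    'emg1':   [115, 132, 122, 119, 125],
})

grouped = df.groupby('repeat')

# `.get_group()` pulls out all of the rows belonging to one repeat
second = grouped.get_group(2)
print(len(second))  # 3

# Iterating over the groups yields one DataFrame per repeat
sizes = {int(repeat): len(rows) for repeat, rows in grouped}
print(sizes)  # {1: 2, 2: 3}
```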

With that being said, let’s make the best of a poor situation and find all of the dividers using boolean indexing. One way to identify these divider lines is to look for NaN values in the emg1:emg7 columns:

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    # `np.isnan(df['emg1'])` marks every row as either a divider or a normal row
    # Divider rows (marked with `True`) are selected by the outer `df[...]`
    return df[np.isnan(df['emg1'])]

Testing things:

>>> read_repeats('data/final/emgData-G0.csv')
           emg0  emg1  emg2  emg3  emg4  emg5  emg6  emg7
121    Repeat 2   NaN   NaN   NaN   NaN   NaN   NaN   NaN
236    Repeat 3   NaN   NaN   NaN   NaN   NaN   NaN   NaN
356    Repeat 4   NaN   NaN   NaN   NaN   NaN   NaN   NaN
...         ...   ...   ...   ...   ...   ...   ...   ...
3512  Repeat 29   NaN   NaN   NaN   NaN   NaN   NaN   NaN
3637  Repeat 30   NaN   NaN   NaN   NaN   NaN   NaN   NaN
3778  Repeat 31   NaN   NaN   NaN   NaN   NaN   NaN   NaN

That’s looking great! We certainly don’t need all of the NaNs and “Repeat” strings though, so let’s return only the indices:

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    # `.index` returns only the row numbers of the dividers – not the whole row
    return df[np.isnan(df['emg1'])].index
>>> read_repeats('data/final/emgData-G0.csv')
Int64Index([ 121,  236,  356,  478,  622,  742,  877, 1019, 1180, 1308, 1415,
            1530, 1668, 1773, 1890, 2006, 2133, 2267, 2389, 2515, 2644, 2762,
            2890, 3010, 3138, 3263, 3390, 3512, 3637, 3778],
           dtype='int64')

Nearly there! Just one last thing to do before we start chunking up our data – we need to add the implicit repeat-boundary at position -1 (since index 0 starts the first repeat).

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    # `.insert(0,-1)` inserts the value -1 at the start of the list (index 0)
    return df[np.isnan(df['emg1'])].index.insert(0,-1)
>>> read_repeats('data/final/emgData-G0.csv')
Int64Index([  -1,  121,  236,  356,  478,  622,  742,  877, 1019, 1180, 1308,
            1415, 1530, 1668, 1773, 1890, 2006, 2133, 2267, 2389, 2515, 2644,
            2762, 2890, 3010, 3138, 3263, 3390, 3512, 3637, 3778],
           dtype='int64')

The last thing to do is pair up these boundaries so that we have a start and stop index for each repeat.

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    bounds = df[np.isnan(df['emg1'])].index.insert(0,-1)
    # `zip()` will take two lists (the bounds list and the bounds list shifted
    # by one) and turn them into a single list of tuples. `list()` is
    # temporarily here to let us view the result
    return list(zip(bounds, bounds[1:]))
>>> read_repeats('data/final/emgData-G0.csv')
[(-1, 121),
 (121, 236),
 (236, 356),
 (356, 478),
 (478, 622),
 (622, 742),
 (742, 877),
 (877, 1019),
 (1019, 1180),
 (1180, 1308),
 (1308, 1415),
 (1415, 1530),
 (1530, 1668),
 (1668, 1773),
 (1773, 1890),
 (1890, 2006),
 (2006, 2133),
 (2133, 2267),
 (2267, 2389),
 (2389, 2515),
 (2515, 2644),
 (2644, 2762),
 (2762, 2890),
 (2890, 3010),
 (3010, 3138),
 (3138, 3263),
 (3263, 3390),
 (3390, 3512),
 (3512, 3637),
 (3637, 3778)]

Here we can see why adding the implicit boundary at -1 was necessary! It’s also worth noting that we’ve not added a boundary at the end of the data. This is an intentional omission so that we don’t collect the final repeat of each file (which, due to our collection methodology, does not actually contain a gesture). Peculiarities like these are why it’s important to know the dataset you are working with!
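The pairing trick, and the way the trailing block gets dropped, is easy to see on a plain Python list. This is only a sketch with stand-in data (letters for rows, 'DIV' for dividers), not our real EMG values:

```python
# Stand-in data: ten "rows" with dividers at positions 3 and 7
rows = ['a', 'b', 'c', 'DIV', 'd', 'e', 'f', 'DIV', 'g', 'h']
bounds = [-1, 3, 7]  # the implicit start boundary plus both dividers

# Pair each boundary with the one that follows it
pairs = list(zip(bounds, bounds[1:]))
print(pairs)  # [(-1, 3), (3, 7)]

# Slice between each pair, skipping the boundary row itself
chunks = [rows[start + 1:end] for start, end in pairs]
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Since there is no boundary after the final divider, the trailing 'g' and 'h' never end up in a chunk.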

Finally, let’s actually split the data on these boundaries (using a list comprehension):

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    bounds = df[np.isnan(df['emg1'])].index.insert(0,-1)
    # `.iloc[]` allows for slicing DataFrame rows by their indices – the +1 is
    # to avoid including the boundary itself and the `end` bound is exclusive
    return [df.iloc[start+1:end] for start,end in zip(bounds, bounds[1:])]
>>> read_repeats('data/final/emgData-G0.csv')
[    emg0   emg1  emg2  emg3  emg4   emg5   emg6   emg7
 0     68  115.0  38.0  30.0  36.0   98.0  147.0   64.0
 1     66  132.0  39.0  29.0  39.0   94.0  152.0   60.0
 2     53  108.0  35.0  29.0  40.0   79.0  141.0   64.0
 3     52  104.0  31.0  27.0  37.0   70.0  135.0   62.0
 4     63  119.0  27.0  29.0  36.0   57.0  120.0   63.0
 ..   ...    ...   ...   ...   ...    ...    ...    ...
 116   72  126.0  33.0  35.0  43.0  147.0  230.0  102.0
 117   72   98.0  29.0  34.0  41.0  144.0  208.0  106.0
 118   70  101.0  30.0  33.0  41.0  139.0  192.0  100.0
 119   79   98.0  27.0  32.0  40.0  136.0  187.0   95.0
 120   84  110.0  29.0  32.0  42.0  135.0  188.0  100.0
 
 [121 rows x 8 columns],

... a lot of stuff here ...

      emg0   emg1  emg2  emg3   emg4   emg5   emg6   emg7
 3638  304  230.0  59.0  71.0  121.0  486.0  601.0  141.0
 3639  337  232.0  66.0  81.0  141.0  513.0  644.0  167.0
 3640  327  258.0  65.0  79.0  137.0  459.0  611.0  171.0
 3641  342  251.0  62.0  79.0  132.0  419.0  589.0  170.0
 3642  393  253.0  73.0  89.0  146.0  441.0  711.0  212.0
 ...   ...    ...   ...   ...    ...    ...    ...    ...
 3773  159  169.0  38.0  35.0   63.0  288.0  336.0   83.0
 3774  174  195.0  40.0  35.0   62.0  292.0  313.0   81.0
 3775  169  202.0  44.0  36.0   60.0  294.0  342.0   79.0
 3776  163  171.0  43.0  36.0   50.0  253.0  325.0   80.0
 3777  177  177.0  41.0  35.0   50.0  282.0  347.0   74.0
 
 [140 rows x 8 columns]]

Nice! Looks like a perfect split! Finally, let’s wrap things up into one big summary DataFrame where each repeat is a single row:

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    bounds = df[np.isnan(df['emg1'])].index.insert(0,-1)
    repeats = [df.iloc[start+1:end] for start,end in zip(bounds, bounds[1:])]
    # This creates a new DataFrame with a single column ('emg') and one row for
    # each item (repeat) in our list
    return pd.DataFrame({'emg': repeats})
>>> read_repeats('data/final/emgData-G0.csv')
                                                  emg
0       emg0   emg1  emg2  emg3  emg4   emg5   emg...
1       emg0   emg1  emg2  emg3  emg4   emg5   emg...
2       emg0   emg1  emg2  emg3  emg4   emg5   emg...
3       emg0   emg1   emg2  emg3   emg4   emg5   e...
4       emg0   emg1  emg2  emg3  emg4   emg5   emg...
5       emg0   emg1  emg2  emg3  emg4   emg5   emg...
6       emg0   emg1  emg2  emg3  emg4   emg5   emg...
7        emg0   emg1  emg2  emg3  emg4   emg5   em...
8        emg0   emg1  emg2  emg3  emg4   emg5   em...
9        emg0   emg1  emg2  emg3  emg4   emg5   em...
10       emg0   emg1   emg2  emg3   emg4   emg5   ...
11       emg0   emg1  emg2  emg3   emg4   emg5   e...
12       emg0   emg1  emg2  emg3   emg4   emg5   e...
13       emg0   emg1  emg2  emg3  emg4   emg5   em...
14       emg0   emg1  emg2  emg3   emg4   emg5   e...
15       emg0   emg1  emg2  emg3  emg4   emg5   em...
16       emg0   emg1   emg2  emg3   emg4   emg5   ...
17       emg0   emg1   emg2  emg3   emg4   emg5   ...
18       emg0   emg1  emg2  emg3  emg4   emg5   em...
19       emg0   emg1  emg2  emg3  emg4   emg5   em...
20       emg0   emg1  emg2  emg3  emg4   emg5   em...
21       emg0   emg1  emg2  emg3  emg4   emg5   em...
22       emg0   emg1  emg2  emg3  emg4   emg5   em...
23       emg0   emg1  emg2  emg3  emg4   emg5   em...
24       emg0   emg1  emg2  emg3  emg4   emg5   em...
25       emg0   emg1  emg2  emg3  emg4   emg5   em...
26       emg0   emg1   emg2  emg3  emg4   emg5   e...
27       emg0   emg1   emg2  emg3  emg4   emg5   e...
28       emg0   emg1   emg2  emg3  emg4   emg5   e...
29       emg0   emg1  emg2  emg3   emg4   emg5   e...

Err… That’s not terribly helpful, is it? Let’s take a bit of a closer look at what’s happening here…

>>> # Let's look at the first row of the 'emg' column using a "new" method
>>> read_repeats('data/final/emgData-G0.csv')['emg'][0]
    emg0   emg1  emg2  emg3  emg4   emg5   emg6   emg7
0     68  115.0  38.0  30.0  36.0   98.0  147.0   64.0
1     66  132.0  39.0  29.0  39.0   94.0  152.0   60.0
2     53  108.0  35.0  29.0  40.0   79.0  141.0   64.0
3     52  104.0  31.0  27.0  37.0   70.0  135.0   62.0
4     63  119.0  27.0  29.0  36.0   57.0  120.0   63.0
..   ...    ...   ...   ...   ...    ...    ...    ...
116   72  126.0  33.0  35.0  43.0  147.0  230.0  102.0
117   72   98.0  29.0  34.0  41.0  144.0  208.0  106.0
118   70  101.0  30.0  33.0  41.0  139.0  192.0  100.0
119   79   98.0  27.0  32.0  40.0  136.0  187.0   95.0
120   84  110.0  29.0  32.0  42.0  135.0  188.0  100.0

Ah, that makes a bit more sense – we’ve been nesting our DataFrames! While you can certainly make something like this work, there is little point in keeping the column names around (they don’t mean anything to our machine-learning algorithms, which only look at the position of the data). Luckily, we can easily convert our DataFrame to a Numpy array – removing the superfluous nesting:

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    bounds = df[np.isnan(df['emg1'])].index.insert(0,-1)
    # The `.to_numpy()` method dumps the data of our DataFrame as a 2D array
    repeats = [df.iloc[start+1:end].to_numpy()
               for start,end in zip(bounds, bounds[1:])]
    return pd.DataFrame({'emg': repeats})
>>> read_repeats('data/final/emgData-G0.csv')
                                                  emg
0   [[68, 115.0, 38.0, 30.0, 36.0, 98.0, 147.0, 64...
1   [[92, 122.0, 28.0, 37.0, 38.0, 98.0, 169.0, 91...
2   [[95, 160.0, 47.0, 46.0, 74.0, 281.0, 421.0, 1...
3   [[166, 175.0, 39.0, 48.0, 69.0, 363.0, 521.0, ...
4   [[149, 207.0, 67.0, 49.0, 67.0, 293.0, 474.0, ...
5   [[86, 126.0, 39.0, 43.0, 56.0, 193.0, 259.0, 1...
6   [[177, 246.0, 92.0, 42.0, 68.0, 279.0, 391.0, ...
7   [[101, 174.0, 37.0, 46.0, 61.0, 233.0, 282.0, ...
8   [[48, 147.0, 30.0, 35.0, 45.0, 113.0, 138.0, 5...
9   [[316, 100.0, 32.0, 33.0, 30.0, 76.0, 102.0, 1...
10  [[261, 102.0, 39.0, 37.0, 46.0, 76.0, 119.0, 2...
11  [[158, 188.0, 36.0, 40.0, 72.0, 285.0, 403.0, ...
12  [[205, 271.0, 75.0, 91.0, 164.0, 643.0, 803.0,...
13  [[147, 104.0, 29.0, 30.0, 27.0, 88.0, 118.0, 8...
14  [[106, 184.0, 41.0, 43.0, 73.0, 290.0, 478.0, ...
15  [[171, 206.0, 33.0, 36.0, 52.0, 204.0, 416.0, ...
16  [[226, 197.0, 55.0, 53.0, 75.0, 296.0, 397.0, ...
17  [[239, 245.0, 97.0, 74.0, 133.0, 620.0, 844.0,...
18  [[227, 176.0, 44.0, 49.0, 75.0, 319.0, 322.0, ...
19  [[270, 167.0, 45.0, 45.0, 57.0, 229.0, 353.0, ...
20  [[145, 166.0, 40.0, 37.0, 53.0, 204.0, 297.0, ...
21  [[54, 99.0, 46.0, 29.0, 40.0, 94.0, 135.0, 47....
22  [[155, 180.0, 42.0, 37.0, 55.0, 201.0, 316.0, ...
23  [[208, 171.0, 44.0, 55.0, 86.0, 371.0, 520.0, ...
24  [[84, 159.0, 35.0, 40.0, 59.0, 236.0, 422.0, 1...
25  [[198, 236.0, 61.0, 49.0, 69.0, 300.0, 475.0, ...
26  [[163, 140.0, 45.0, 43.0, 60.0, 214.0, 317.0, ...
27  [[154, 185.0, 70.0, 44.0, 67.0, 306.0, 471.0, ...
28  [[179, 236.0, 122.0, 47.0, 64.0, 245.0, 449.0,...
29  [[304, 230.0, 59.0, 71.0, 121.0, 486.0, 601.0,...

That’s looking much better! We have 30 repeats of 2D EMG arrays! Oh, and just in case you’d like to see one of those arrays, here you are:

>>> read_repeats('data/final/emgData-G0.csv')['emg'][0]
array([['68', 115.0, 38.0, 30.0, 36.0, 98.0, 147.0, 64.0],
       ['66', 132.0, 39.0, 29.0, 39.0, 94.0, 152.0, 60.0],
       ['53', 108.0, 35.0, 29.0, 40.0, 79.0, 141.0, 64.0],
       ['52', 104.0, 31.0, 27.0, 37.0, 70.0, 135.0, 62.0],
       ['63', 119.0, 27.0, 29.0, 36.0, 57.0, 120.0, 63.0],
        ... a lot of stuff here ...
       ['72', 126.0, 33.0, 35.0, 43.0, 147.0, 230.0, 102.0],
       ['72', 98.0, 29.0, 34.0, 41.0, 144.0, 208.0, 106.0],
       ['70', 101.0, 30.0, 33.0, 41.0, 139.0, 192.0, 100.0],
       ['79', 98.0, 27.0, 32.0, 40.0, 136.0, 187.0, 95.0],
       ['84', 110.0, 29.0, 32.0, 42.0, 135.0, 188.0, 100.0]], dtype=object)

Ruh roh… It’s a good thing we took a bit of a closer look at that! It looks like, while most of our values are “floats” (numbers with a decimal point), our entire first column seems to be made of strings (as evidenced by the quotes)! Our ML algorithm certainly won’t take kindly to that!

If you’re curious why this is, it’s because our original dataset did indeed contain strings in the first column! Remember our good friend “Repeat N”? This is yet another argument in favour of distinguishing between repeats in a different way. Learn from my mistakes!

There are a lot of ways to convert the types here, but the most convenient is probably reusing the .to_numpy() method we’re already calling:

def read_repeats(file):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    bounds = df[np.isnan(df['emg1'])].index.insert(0,-1)
    # The optional `dtype` argument allows a Numpy datatype to be specified
    repeats = [df.iloc[start+1:end].to_numpy(dtype='int32')
               for start,end in zip(bounds, bounds[1:])]
    return pd.DataFrame({'emg': repeats})
>>> read_repeats('data/final/emgData-G0.csv')['emg'][0]
array([[  68,  115,   38,   30,   36,   98,  147,   64],
       [  66,  132,   39,   29,   39,   94,  152,   60],
       [  53,  108,   35,   29,   40,   79,  141,   64],
       [  52,  104,   31,   27,   37,   70,  135,   62],
       [  63,  119,   27,   29,   36,   57,  120,   63],
        ... a lot of stuff here ...
       [  72,  126,   33,   35,   43,  147,  230,  102],
       [  72,   98,   29,   34,   41,  144,  208,  106],
       [  70,  101,   30,   33,   41,  139,  192,  100],
       [  79,   98,   27,   32,   40,  136,  187,   95],
       [  84,  110,   29,   32,   42,  135,  188,  100]], dtype=int32)

That looks much better! As a quick note before we go on though, let’s break down int32. The first part, the int, means we’re converting our values to integers (numbers without a decimal point). Why is that? Weren’t they floats before? Indeed they were, but a closer inspection reveals that all of those “numbers with decimals” just had a .0 at the end! Pandas loaded these columns as floats only because the divider rows left NaN values in them (and NaN is itself a float), but we have no need for fractional values here.
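Here is a small sketch of how Pandas picks a column's type when reading a CSV, using made-up text with one missing value. A column of whole numbers stays an integer column, while a single gap is enough to turn a column into floats.

```python
import io
import pandas as pd

# Made-up CSV text: column `b` is missing a value on the second data line
text = "a,b\n1,2\n3,\n5,6\n"
df = pd.read_csv(io.StringIO(text))

print(df['a'].dtype)  # int64   (every value is a whole number)
print(df['b'].dtype)  # float64 (the gap is stored as NaN, which is a float)

# Once the incomplete row is dropped, converting to integers is safe
clean = df.dropna().to_numpy(dtype='int32')
print(clean.tolist())  # [[1, 2], [5, 6]]
```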

Why bother converting to integers even if we don’t need floats? Well, floats are limited in their precision on computers, so avoiding them wherever they aren’t needed is usually a good idea. The loss of precision can be seen in the following expression, which should evaluate to 0 but doesn’t:

>>> 1 - (1/3) - (1/3) - (1/3)
1.1102230246251565e-16

The 32 in int32 indicates that these are 32-bit integers. How values are represented in computer memory probably isn’t something you’re used to thinking about in Python (as the language normally hides these sorts of things), but Numpy is primarily programmed in a language called C (where you do need to think about how many bits you are using)! Long story short, the 32 means that 32 1s-and-0s are used to represent your integer, giving you 2^32 = 4294967296 possible values.
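Since int32 is a signed type, those 2^32 bit patterns are shared between negative and non-negative numbers. The sketch below works out the resulting range with plain arithmetic and uses the standard library's struct module (a stand-in here, not something our pipeline uses) to show that a value really does occupy 4 bytes.

```python
import struct

bits = 32
print(2 ** bits)  # 4294967296 distinct bit patterns

# Signed integers split those patterns between negatives and non-negatives
smallest = -(2 ** (bits - 1))
largest = 2 ** (bits - 1) - 1
print(smallest, largest)  # -2147483648 2147483647

# `struct` packs a Python int into exactly 4 bytes (32 bits)...
packed = struct.pack('<i', largest)
print(len(packed))  # 4

# ...and refuses values that don't fit
try:
    struct.pack('<i', largest + 1)
    fits = True
except struct.error:
    fits = False
print(fits)  # False
```

Our EMG readings are only a few hundred at most, so they sit comfortably inside this range.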

Tangents aside, let’s think about the last thing we need to add to our read_repeats() function! Currently, we’re just loading one file (and consequently one gesture) at a time, but we’ll be extending things shortly to load all of the gestures at once. To keep track of which gesture each repeat belongs to, let’s add a new column called gesture to the result of read_repeats() – for now, we’ll just pass the gesture number as an argument to the function:

# Our final `read_repeats()` function
def read_repeats(file, gesture):
    df = pd.read_csv(file).loc[:, 'emg0':'emg7']
    bounds = df[np.isnan(df['emg1'])].index.insert(0,-1)
    repeats = [df.iloc[start+1:end].to_numpy(dtype='int32')
               for start,end in zip(bounds, bounds[1:])]
    # We've added a second column containing the gesture name / number
    return pd.DataFrame({'emg': repeats, 'gesture': gesture})
>>> # Note we are calling `read_repeats()` with a second argument now
>>> read_repeats('data/final/emgData-G0.csv', 0)
                                                  emg  gesture
0   [[68, 115, 38, 30, 36, 98, 147, 64], [66, 132,...        0
1   [[92, 122, 28, 37, 38, 98, 169, 91], [94, 119,...        0
2   [[95, 160, 47, 46, 74, 281, 421, 107], [87, 16...        0
3   [[166, 175, 39, 48, 69, 363, 521, 137], [168, ...        0
4   [[149, 207, 67, 49, 67, 293, 474, 99], [166, 1...        0
5   [[86, 126, 39, 43, 56, 193, 259, 126], [72, 12...        0
6   [[177, 246, 92, 42, 68, 279, 391, 115], [172, ...        0
7   [[101, 174, 37, 46, 61, 233, 282, 104], [102, ...        0
8   [[48, 147, 30, 35, 45, 113, 138, 59], [47, 153...        0
9   [[316, 100, 32, 33, 30, 76, 102, 177], [319, 1...        0
10  [[261, 102, 39, 37, 46, 76, 119, 285], [284, 1...        0
11  [[158, 188, 36, 40, 72, 285, 403, 104], [125, ...        0
12  [[205, 271, 75, 91, 164, 643, 803, 187], [208,...        0
13  [[147, 104, 29, 30, 27, 88, 118, 82], [146, 10...        0
14  [[106, 184, 41, 43, 73, 290, 478, 113], [110, ...        0
15  [[171, 206, 33, 36, 52, 204, 416, 101], [170, ...        0
16  [[226, 197, 55, 53, 75, 296, 397, 127], [160, ...        0
17  [[239, 245, 97, 74, 133, 620, 844, 221], [220,...        0
18  [[227, 176, 44, 49, 75, 319, 322, 85], [226, 1...        0
19  [[270, 167, 45, 45, 57, 229, 353, 101], [256, ...        0
20  [[145, 166, 40, 37, 53, 204, 297, 86], [133, 1...        0
21  [[54, 99, 46, 29, 40, 94, 135, 47], [47, 100, ...        0
22  [[155, 180, 42, 37, 55, 201, 316, 103], [195, ...        0
23  [[208, 171, 44, 55, 86, 371, 520, 139], [199, ...        0
24  [[84, 159, 35, 40, 59, 236, 422, 101], [74, 16...        0
25  [[198, 236, 61, 49, 69, 300, 475, 105], [212, ...        0
26  [[163, 140, 45, 43, 60, 214, 317, 108], [138, ...        0
27  [[154, 185, 70, 44, 67, 306, 471, 97], [152, 1...        0
28  [[179, 236, 122, 47, 64, 245, 449, 89], [178, ...        0
29  [[304, 230, 59, 71, 121, 486, 601, 141], [337,...        0

Looking good! 30 repeats of well-formatted EMG data and the gesture they come from! Next up is loading all of the gestures at once!

Loading All Gestures

Here is where the os and re imports from earlier come into play! First up, let’s define a new function called load_emg_from_folder() and list all of the files in the data directory.

def load_emg_from_folder(path):
    # `listdir()` from the `os` module will list all files in a directory
    return os.listdir(path)
>>> load_emg_from_folder('data/final')
['gestures.txt',
 'emgData-G0.csv',
 'emgData-G1.csv',
 'emgData-G2.csv',
 'emgData-G4.csv',
 'emgData-G5.csv',
 'emgData-G6.csv',
 'emgData-G7.csv',
 'emgData-G3.csv']

That’s a good start! It’s listing all of the files in the data/final directory (including the gestures.txt file which gives each gesture an English name so we could keep track of which was which). While gestures.txt is helpful to keep around for us programmers, it’s not really something we need to load into Python, so the first thing to do is filter that out! This is where regex comes into play!

def load_emg_from_folder(path):
    # `match` returns whether something matches the given regex pattern or not
    # The list comprehension then only includes the files that did match
    return [file for file in os.listdir(path)
            if re.match(r'emgData-G(\d).csv', file)]

Hmm, okay, before we test that out, it’s worth explaining what is going on with that whole r'emgData-G(\d).csv' thing.

First of all, strings preceded by r are “raw” strings. In normal Python strings, backslashes (\) have a special meaning – they are used to include some whitespace characters like a newline (\n), tab (\t), etc. This also means that, in normal strings, if you want to include a literal backslash, you need to “escape” that with a second backslash, like this: \\. Since regex makes heavy use of \, it’s preferable to use “raw” strings that don’t assign any special meaning to backslashes. If we wanted to write the simple regex above using a regular Python string, we would need to write 'emgData-G(\\d).csv'.

Just to drive the point home:

>>> print('This is one line\nThis is the next!')

Prints:

This is one line
This is the next!

But

>>> print(r'This is one line\nThis is the next!')

Prints:

This is one line\nThis is the next!

As for the rest of the regex, the emgData-G and .csv bits match like a literal search string, requiring every match to share that general structure, and the \d will match any numerical digit (0-7 in our case). Strictly speaking, an unescaped . matches any single character rather than only a literal period (\. would be the stricter choice), but that makes no difference for the filenames in our folder. The parentheses around \d form something called a “capture group”, which can be used to extract specific parts of a match; here we are interested in capturing the digit, as that tells us which gesture the file contains. More on capture groups in a moment; first we need to see if our regex has worked:

>>> load_emg_from_folder('data/final')
['emgData-G0.csv',
 'emgData-G1.csv',
 'emgData-G2.csv',
 'emgData-G4.csv',
 'emgData-G5.csv',
 'emgData-G6.csv',
 'emgData-G7.csv',
 'emgData-G3.csv']

Looking good! We’ve got all of the same files as before, but gestures.txt has been filtered out! The next step is extracting the gesture number, so let’s take a quick look at re.match() in isolation.

>>> # When there is no match, nothing is returned (a false result)
>>> re.match(r'emgData-G(\d).csv', 'gestures.txt')
>>> # When there is a match, a match object is returned
>>> re.match(r'emgData-G(\d).csv', 'emgData-G0.csv')
<re.Match object; span=(0, 14), match='emgData-G0.csv'>

That’s interesting! When nothing matches, re.match() returns None (which is why the REPL prints nothing), and Python treats None as false and a re.Match object as true. The re.Match object shows us a little bit of information by default, but let’s take a look at something we are more interested in: the capture groups!

>>> # `.groups()` returns strings for all of the captured groups in the match
>>> re.match(r'emgData-G(\d).csv', 'emgData-G0.csv').groups()
('0',)

It looks like Python is returning a tuple of all of the captured groups (you can have as many as you’d like in any given regex) and it contains our digit '0'! We can index tuples just like lists and get the first capture group all on its own:

>>> # Getting just the first group...
>>> re.match(r'emgData-G(\d).csv', 'emgData-G0.csv').groups()[0]
'0'
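As an aside, our pattern is a little more forgiving than it looks. The sketch below compares it against a stricter variant that escapes the dot (the strict pattern and the odd filenames are illustrative additions, not part of our loading code) and shows the difference between re.match() and re.fullmatch().

```python
import re

loose = r'emgData-G(\d).csv'     # the pattern used in this post
strict = r'emgData-G(\d)\.csv'   # a stricter variant, not from the original code

# An unescaped `.` matches ANY character, so this odd name slips through
print(bool(re.match(loose, 'emgData-G0xcsv')))   # True
print(bool(re.match(strict, 'emgData-G0xcsv')))  # False

# `re.match()` only anchors the start, so trailing text is allowed...
print(bool(re.match(strict, 'emgData-G0.csv.bak')))      # True
# ...while `re.fullmatch()` requires the whole string to match
print(bool(re.fullmatch(strict, 'emgData-G0.csv.bak')))  # False
print(re.fullmatch(strict, 'emgData-G3.csv').groups())   # ('3',)
```

None of this matters for the tidy filenames in our data folder, but it is worth keeping in mind for messier directories.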

Let’s take a bit of a leap and try to load all of our gestures in one go (don’t worry, there’ll be a full explanation right after):

def load_emg_from_folder(path):
    # We're now calling `read_repeats` on each file and naming the gesture
    # according to the captured digit from our regex
    return [read_repeats(file, matches.groups()[0])
            for file in os.listdir(path)
            # We've also changed this line to store the `re.Match` object
            # in the `matches` variable
            if (matches := re.match(r'emgData-G(\d).csv', file))]

Okay, there are a couple of changes to talk about there – the biggest being the introduction of the walrus operator (:=). In short, this operator allows us to assign to a variable and return the new value in a single step. For example:

>>> # Assignment outside of an expression is fine
>>> walrus = False
>>> print(walrus)
False
>>> not walrus
True
>>> # Normal assignment inside of an expression is not allowed
>>> print(walrus = True)
TypeError: 'walrus' is an invalid keyword argument for print()
>>> # The walrus operator is the solution!
>>> print(walrus := True)
True
>>> not walrus
False
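
If the comprehension in load_emg_from_folder() still feels a bit dense, it may help to see it unrolled. The sketch below uses a made-up list of `names` standing in for `os.listdir()` and builds the same result twice: once with the walrus operator and once with a plain loop. Note that `:=` needs Python 3.8 or newer.

```python
import re

# Stand-in for the result of `os.listdir()`
names = ['emgData-G0.csv', 'gestures.txt', 'emgData-G3.csv']

# With the walrus operator: match and keep the result in one expression
with_walrus = [(name, m.groups()[0])
               for name in names
               if (m := re.match(r'emgData-G(\d).csv', name))]

# The same logic written out as an ordinary loop
without_walrus = []
for name in names:
    m = re.match(r'emgData-G(\d).csv', name)
    if m:
        without_walrus.append((name, m.groups()[0]))

print(with_walrus == without_walrus)
```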

After we’ve saved the match object (and its capture groups) in the matches variable, we can pass the captured digit to read_repeats()! Let’s see how that code from earlier actually works:

>>> load_emg_from_folder('data/final')
FileNotFoundError: [Errno 2] No such file or directory: 'emgData-G0.csv'

Ruh roh round two… That’s not quite right! It seems that Python can’t find the file that we’re passing to read_repeats()! Fortunately though, this is something to be expected. We’re listing the files in the data/final folder, but read_repeats() is still looking for things in the current directory. It looks like we’ll have to combine our path and file names! Luckily, the os module has just what we need:

def load_emg_from_folder(path):
    # `os.path.join()` sticks together filepaths while avoiding all of the
    # pitfalls inherent in building paths (do you need to join things with
    # a `/`? Maybe you'll need to use a `\` on Windows?)
    return [read_repeats(os.path.join(path, file), matches.groups()[0])
            for file in os.listdir(path)
            if (matches := re.match(r'emgData-G(\d).csv', file))]
>>> load_emg_from_folder('data/final')
[                                                  emg gesture
 0   [[68, 115, 38, 30, 36, 98, 147, 64], [66, 132,...       0
 1   [[92, 122, 28, 37, 38, 98, 169, 91], [94, 119,...       0
 2   [[95, 160, 47, 46, 74, 281, 421, 107], [87, 16...       0
 3   [[166, 175, 39, 48, 69, 363, 521, 137], [168, ...       0
 4   [[149, 207, 67, 49, 67, 293, 474, 99], [166, 1...       0
     ... a lot of stuff here ...
 25  [[83, 116, 25, 32, 36, 79, 103, 69], [68, 98, ...       3
 26  [[27, 68, 20, 20, 22, 32, 55, 37], [37, 73, 22...       3
 27  [[91, 128, 38, 34, 26, 36, 74, 66], [92, 118, ...       3
 28  [[47, 46, 49, 31, 27, 31, 64, 58], [46, 62, 36...       3
 29  [[26, 19, 22, 23, 20, 26, 45, 36], [27, 19, 22...       3]

Excellent! It looks like we have a list of DataFrames, each containing all of the repeats for a single gesture (as you can see from the ‘gesture’ column). Now all that’s left to do is to stick all of these together into a single DataFrame! Luckily, Pandas has our back again:

def load_emg_from_folder(path):
    all_gestures = [read_repeats(os.path.join(path, file), matches.groups()[0])
                    for file in os.listdir(path)
                    if (matches := re.match(r'emgData-G(\d).csv', file))]
    # `concat()` takes a list of DataFrames and squishes all of them into one
    return pd.concat(all_gestures)
>>> load_emg_from_folder('data/final')
                                                  emg gesture
0   [[68, 115, 38, 30, 36, 98, 147, 64], [66, 132,...       0
1   [[92, 122, 28, 37, 38, 98, 169, 91], [94, 119,...       0
2   [[95, 160, 47, 46, 74, 281, 421, 107], [87, 16...       0
3   [[166, 175, 39, 48, 69, 363, 521, 137], [168, ...       0
4   [[149, 207, 67, 49, 67, 293, 474, 99], [166, 1...       0
..                                                ...     ...
25  [[83, 116, 25, 32, 36, 79, 103, 69], [68, 98, ...       3
26  [[27, 68, 20, 20, 22, 32, 55, 37], [37, 73, 22...       3
27  [[91, 128, 38, 34, 26, 36, 74, 66], [92, 118, ...       3
28  [[47, 46, 49, 31, 27, 31, 64, 58], [46, 62, 36...       3
29  [[26, 19, 22, 23, 20, 26, 45, 36], [27, 19, 22...       3

[231 rows x 2 columns]

Overall this is looking really good! We have 231 rows (showing that all of our repeats from all eight gestures were loaded in) and two columns (for ‘emg’ and ‘gesture’ data). There’s just one small problem: our indexes only seem to go as high as 29. Let’s take a closer look at one of the gesture boundaries:

>>> load_emg_from_folder('data/final')[25:35]
                                                  emg gesture
25  [[198, 236, 61, 49, 69, 300, 475, 105], [212, ...       0
26  [[163, 140, 45, 43, 60, 214, 317, 108], [138, ...       0
27  [[154, 185, 70, 44, 67, 306, 471, 97], [152, 1...       0
28  [[179, 236, 122, 47, 64, 245, 449, 89], [178, ...       0
29  [[304, 230, 59, 71, 121, 486, 601, 141], [337,...       0
0   [[21, 23, 19, 24, 25, 29, 68, 61], [21, 23, 18...       1
1   [[59, 108, 43, 45, 71, 204, 314, 125], [56, 11...       1
2   [[51, 108, 42, 41, 72, 322, 387, 106], [53, 12...       1
3   [[52, 90, 47, 37, 65, 297, 408, 116], [53, 86,...       1
4   [[62, 122, 52, 42, 56, 204, 315, 80], [64, 122...       1

Oh dear! When we called pd.concat(), it looks like it kept all of the old indices, so our numbering restarts at the beginning of every gesture! It’s often helpful to keep the original indices, especially if they were strings, but in this case we can renumber things by passing an additional argument to pd.concat() asking it to ignore the original indices:

def load_emg_from_folder(path):
    all_gestures = [read_repeats(os.path.join(path, file), matches.groups()[0])
                    for file in os.listdir(path)
                    if (matches := re.match(r'emgData-G(\d).csv', file))]
    # `ignore_index` throws away the original indices and renumbers the rows
    return pd.concat(all_gestures, ignore_index=True)
>>> load_emg_from_folder('data/final')
                                                   emg gesture
0    [[68, 115, 38, 30, 36, 98, 147, 64], [66, 132,...       0
1    [[92, 122, 28, 37, 38, 98, 169, 91], [94, 119,...       0
2    [[95, 160, 47, 46, 74, 281, 421, 107], [87, 16...       0
3    [[166, 175, 39, 48, 69, 363, 521, 137], [168, ...       0
4    [[149, 207, 67, 49, 67, 293, 474, 99], [166, 1...       0
..                                                 ...     ...
226  [[83, 116, 25, 32, 36, 79, 103, 69], [68, 98, ...       3
227  [[27, 68, 20, 20, 22, 32, 55, 37], [37, 73, 22...       3
228  [[91, 128, 38, 34, 26, 36, 74, 66], [92, 118, ...       3
229  [[47, 46, 49, 31, 27, 31, 64, 58], [46, 62, 36...       3
230  [[26, 19, 22, 23, 20, 26, 45, 36], [27, 19, 22...       3

[231 rows x 2 columns]

Aced it! Now things are numbered correctly and we can move on to some finishing touches!
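
You may have noticed that gesture 3 ended up last in the output above. That’s because os.listdir() returns names in arbitrary order. It’s usually harmless, but if you’d like a predictable order, one option is to sort the matches by gesture number before loading them. The helper name `list_gesture_files()` is hypothetical, and this sketch only lists files (in a throwaway temporary folder) rather than calling read_repeats():

```python
import os
import re
import tempfile

def list_gesture_files(path):
    # Pair each matching file's full path with its captured gesture digit
    found = [(os.path.join(path, file), matches.groups()[0])
             for file in os.listdir(path)
             if (matches := re.match(r'emgData-G(\d).csv', file))]
    # Sort by gesture number so the order no longer depends on the filesystem
    return sorted(found, key=lambda pair: int(pair[1]))

# Try it out on a temporary folder containing a few empty files
with tempfile.TemporaryDirectory() as folder:
    for name in ['emgData-G3.csv', 'gestures.txt', 'emgData-G0.csv']:
        open(os.path.join(folder, name), 'w').close()
    gestures = [gesture for _, gesture in list_gesture_files(folder)]

print(gestures)
```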

Shaping Our EMG Data

Since most machine-learning algorithms rely on inputs being the same size every time (this enables vectorization optimizations), we’ll need to make sure all of our repeats are the same size. We can break this down into two main steps:

  1. Find the median length of all of our repeats
  2. Pad or truncate all repeats so that they are the same (median) size

To do this, we’ll need a couple more imports:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from statistics import median

Dealing with those in reverse order: median does what it says on the tin and finds the median of some numbers. pad_sequences is a special function from our machine-learning library, TensorFlow, that will truncate a sequence if it’s too long, or pad it with some number (zero by default) if it’s too short.
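
To make “pad or truncate” concrete, here’s a rough pure-Python sketch of what happens to a single sequence. It’s only an illustration: the `pad_or_truncate()` helper is made up (it isn’t part of TensorFlow), and it assumes the default behaviour of pad_sequences(), which pads and truncates at the start of a sequence:

```python
def pad_or_truncate(sequence, maxlen, value=0):
    # Too long (or just right): keep only the last `maxlen` items
    if len(sequence) >= maxlen:
        return list(sequence[len(sequence) - maxlen:])
    # Too short: add `value` to the front until it's long enough
    return [value] * (maxlen - len(sequence)) + list(sequence)

short = pad_or_truncate([1, 2, 3], 5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], 5)
print(short)
print(long)
```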

Let’s start by trying to get the shape of our EMG repeats:

# Takes a DataFrame from either `read_repeats()` or `load_emg_from_folder()`
def shape_ml_data(df):
    # Returns the shape (dimensions) of every array in the 'emg' column
    return [x.shape for x in df['emg']]
>>> df = load_emg_from_folder('data/final')
>>> shape_ml_data(df)
[(121, 8),
 (114, 8),
 (119, 8),
 (121, 8),
 (143, 8),
  ...
 (115, 8),
 (119, 8),
 (125, 8),
 (114, 8),
 (115, 8)]

We can see that all of our EMG data is 8 “wide” (which corresponds to the 8 EMG channels of the Myo), but its length certainly varies! We can pick out just the first of these shape values (since the 8 never changes), then find the median as follows:

def shape_ml_data(df):
    # Indexing the shape tuple and calculating the median from the list
    return median([x.shape[0] for x in df['emg']])
>>> shape_ml_data(df)
118

That looks great, we can pad / trim all of our sequences to be 118 long! But there is a hidden danger lurking here (since Python makes no guarantee about the types returned by a function). Let’s see what happens if we have an even number of repeats:

>>> # We currently have an odd number of repeats (231)
>>> df.shape
(231, 2)
>>> # We can drop one by slicing the DataFrame
>>> df[1:].shape
(230, 2)
>>> # Then recalculate the median
>>> shape_ml_data(df[1:])
117.5

Oh dear… Because of the way the median is calculated, even-length datasets involve the averaging of the two “middle” values, potentially resulting in a non-integer result. Normally this is fine, but in our case it makes no sense to pad our EMG sequence to some half-value. We can simply convert back to an integer using the int() function, which discards the fractional part:

def shape_ml_data(df):
    # `int()` ensures that the result is a whole number
    return int(median([x.shape[0] for x in df['emg']]))
>>> shape_ml_data(df[1:])
117
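
As an aside, if you’d rather avoid the float altogether, the statistics module also has `median_low()`, which always returns one of the actual data points. A small sketch with made-up lengths, comparing it to the int() approach:

```python
from statistics import median, median_low

lengths = [114, 117, 118, 121]  # an even number of made-up repeat lengths

plain = median(lengths)       # average of the two middle values
whole = int(median(lengths))  # fractional part discarded
low = median_low(lengths)     # the lower of the two middle values

print(plain, whole, low)
```

Either way, we end up with a whole number to pad to.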

Much better! Now let’s actually pad and trim things using pad_sequences():

def shape_ml_data(df):
    middle = int(median([x.shape[0] for x in df['emg']]))
    # Generate a list of our repeats with each padded to the `maxlen`
    return [pad_sequences(x, maxlen = middle) for x in df['emg']]

Now we’ll look at the shape of our first few repeats:

>>> [x.shape for x in shape_ml_data(df)[:5]]
[(121, 118), (114, 118), (119, 118), (121, 118), (143, 118)]

Oh dear… That’s not quite what we were looking for… It looks like it’s padded the wrong dimension! pad_sequences() treats each row of our array as a separate sequence, so it padded every 8-value row out to 118 instead of changing the number of rows. Somewhat annoyingly, at the time of writing, pad_sequences() does not allow the user to specify an “axis” to pad along, so we’ll need to do something a little tricky: we can transpose the array (swapping the length and width), pad the transposed array, then transpose the result to get back to our original shape. Let’s play around with that idea a little:

>>> # Looking at how transposing (`.T`) affects shape
>>> df['emg'][0].shape
(121, 8)
>>> df['emg'][0].T.shape
(8, 121)
>>> # Transposing, padding, and untransposing:
>>> emg = df['emg'][0]
>>> pad_sequences(emg.T, maxlen = 118).T.shape
(118, 8)

Let’s apply this to our function:

def shape_ml_data(df):
    middle = int(median([x.shape[0] for x in df['emg']]))
    # Applying the transpose trick!
    return [pad_sequences(x.T, maxlen = middle).T for x in df['emg']]

Again looking at the shape of our first few repeats:

>>> [x.shape for x in shape_ml_data(df)[:5]]
[(118, 8), (118, 8), (118, 8), (118, 8), (118, 8)]

That’s looking much better! As a final processing step for the EMG data, we can take this list of 2D Numpy arrays and just turn it into a single 3D Numpy array (which is something we can directly feed into TensorFlow).

def shape_ml_data(df):
    middle = int(median([x.shape[0] for x in df['emg']]))
    # `stack` turns a list of 2D arrays into a single 3D array
    return np.stack([pad_sequences(x.T, maxlen = middle).T for x in df['emg']])
>>> shape_ml_data(df).shape
(231, 118, 8)

And that’s that! 231 repeats of EMG data, each 118 values long with data from 8 channels. The EMG data is all ready to go, but we’ve got one last thing to do with our gesture classifications!
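
If you’d rather not rely on the transpose trick, the same shaping can be sketched in plain NumPy. This is an alternative illustration rather than what this session uses: `pad_or_trim()` is a made-up helper, the repeats are fake, and it mimics the default behaviour of pad_sequences() (zeros added, or rows dropped, at the start):

```python
import numpy as np

def pad_or_trim(repeat, length):
    rows, channels = repeat.shape
    # Too long (or just right): keep only the last `length` rows
    if rows >= length:
        return repeat[rows - length:]
    # Too short: stack rows of zeros on top of the data
    padding = np.zeros((length - rows, channels), dtype=repeat.dtype)
    return np.concatenate([padding, repeat])

# Three fake repeats of different lengths, each with 8 channels
repeats = [np.ones((n, 8), dtype=int) for n in (3, 5, 7)]
shaped = np.stack([pad_or_trim(r, 5) for r in repeats])
print(shaped.shape)
```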

One-Hot Encoding Our Gestures

When tackling a categorization problem, a single, continuous output variable (like our ‘gesture’ column, which ranges from 0-7) can cause some issues. No ML algorithm can ever be 100% certain about its classification, so outputs are often expressed as probabilities on a sliding scale. In this case then, you could interpret a 6.5 as some gesture that looks half like gesture 6 and half like gesture 7.

That works fine in the simplest case, but what if some gesture looks half like 3 and half like 7? Do you then output something in the middle like 5? How do you distinguish this from something that just looks like gesture 5 on its own?

We can resolve this ambiguity by using something called one-hot encoding. This encoding creates a new column for every value that the ‘gesture’ column holds – turning our single ‘gesture’ column into eight. Each of these columns then holds either a 1 or a 0, indicating which gesture the repeat belongs to. This can be thought of as a probability distribution where a 1 in the ‘gesture_4’ column indicates a given repeat unambiguously belongs to gesture 4.
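
Before reaching for a library, it can help to see the idea in a few lines of plain Python. This is only a sketch for intuition: the `one_hot()` helper and the example `uncertain` output are made up, and it assumes the labels are digit strings like those in our ‘gesture’ column:

```python
def one_hot(label, num_classes=8):
    # Start with all zeros, then flip the position matching the label
    row = [0] * num_classes
    row[int(label)] = 1
    return row

encoded = one_hot('4')
print(encoded)

# The position of the 1 recovers the original gesture number
decoded = encoded.index(1)
print(decoded)

# An uncertain output can now say "half 3, half 7" without looking like a 5
uncertain = [0, 0, 0, 0.5, 0, 0, 0, 0.5]
print(sum(uncertain))
```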

Let’s one-hot encode our ‘gesture’ column using a purpose-built function from Pandas:

>>> # First we'll try a slicing with a step, so we can see things clearly
>>> df['gesture'][::30]
0      0
30     1
60     2
90     4
120    5
150    6
180    7
210    3
Name: gesture, dtype: object
>>> # Now let's try one-hot encoding (identical in this case to dummy encoding)
>>> pd.get_dummies(df['gesture'][::30])
     0  1  2  3  4  5  6  7
0    1  0  0  0  0  0  0  0
30   0  1  0  0  0  0  0  0
60   0  0  1  0  0  0  0  0
90   0  0  0  0  1  0  0  0
120  0  0  0  0  0  1  0  0
150  0  0  0  0  0  0  1  0
180  0  0  0  0  0  0  0  1
210  0  0  0  1  0  0  0  0
>>> # Finally, we can encode the whole column and get just the values
>>> pd.get_dummies(df['gesture']).values
array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

Perfect! We now have an unambiguous way to represent which gesture is which (even when the ML algorithm is a little unsure about things). Better yet, this one is a Numpy array just like our processed EMG data – ready to be fed directly into TensorFlow.
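
A quick version note: as far as I’m aware, from pandas 2.0 onwards get_dummies() returns True / False values by default rather than the 0 / 1 `uint8` values shown above. If you’re on a newer pandas and want numbers, passing the `dtype` argument should restore the old look. A minimal sketch with a few made-up labels:

```python
import pandas as pd

# Only three distinct gestures here, so we only get three columns back
gestures = pd.Series(['0', '3', '7'], name='gesture')
labels = pd.get_dummies(gestures, dtype='uint8').values
print(labels)
```

It also shows that get_dummies() only creates columns for the values it actually sees, so all eight gestures need to be present in the data to get eight columns.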

Finally, let’s wrap this into our shape_ml_data function and return both the EMG data and gesture data together in a tuple:

def shape_ml_data(df):
    middle = int(median([x.shape[0] for x in df['emg']]))
    data = np.stack([pad_sequences(x.T, maxlen = middle).T for x in df['emg']])
    labels = pd.get_dummies(df['gesture']).values
    # Returning things as a tuple allows us to destructure them easily later
    return (data, labels)

We can check that our shapes are in order by running the following:

>>> emg, gestures = shape_ml_data(df)
>>> emg.shape
(231, 118, 8)
>>> gestures.shape
(231, 8)

And there we are! All done! I’ll leave you with a couple of final notes then get out of here.

Parting Notes

Sometimes ML requires a lot of data preprocessing, but you can greatly reduce the amount that’s needed by changing the way your data is collected. As I mentioned before, recording all of our data in a single file with two additional columns to track which repeat and gesture each row of data belonged to would have vastly simplified the preprocessing.

With that being said, you’re not always in control of data collection (if you’d just downloaded a dataset from the internet, for example), so these sorts of processing skills are important to have.
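
To make the point about data collection a little more concrete, here’s a sketch of how grouping might look if every row carried its own gesture and repeat number. The single-file layout, the `gesture` and `repeat` column names, and the values are all hypothetical (with only two EMG channels to keep things short); it isn’t the format used in this session:

```python
import csv
import io
from collections import defaultdict

# A tiny, made-up recording in the hypothetical single-file format
recording = io.StringIO(
    'gesture,repeat,emg0,emg1\n'
    '0,0,68,115\n'
    '0,0,66,132\n'
    '0,1,92,122\n'
    '1,0,21,23\n'
)

# Group the EMG rows by their (gesture, repeat) pair
repeats = defaultdict(list)
for row in csv.DictReader(recording):
    key = (row['gesture'], int(row['repeat']))
    repeats[key].append([int(row['emg0']), int(row['emg1'])])

for key, rows in repeats.items():
    print(key, rows)
```

No divider lines to split on and no file names to parse: the labels simply travel with the data.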

That’s all for this session, but I’ll see you in the next session where we’ll be building some deep-learning networks with TensorFlow and exploring the differences between LSTMs and CNNs!