Drug Activity Classification and Prediction
Drug Activity Classification and Prediction on Numeric Sequential Data
This is a second introductory data mining project that explores additional classification, feature extraction, and vectorization techniques on numeric data. First, I import all of the Python libraries used in the project.
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
from math import isnan
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
With the Python libraries imported, I read the data from its .txt source and store it in a pandas DataFrame.
# Read the raw file: one sample per line, formatted as "<label>\t<space-separated numbers>"
with io.open('train_data.txt', mode='r', encoding='utf-8') as file:
    lines = file.read().split('\n')
train_data_frame = pd.DataFrame(lines, columns=['label'], dtype=pd.StringDtype())
# Split each line into its label and its activity sequence
train_data_frame[['label', 'activity']] = train_data_frame['label'].str.split('\t', n=1, expand=True)
# Parse each activity string into a tuple of integers
samples = [tuple(np.fromstring(x, dtype=int, sep=' ')) for x in train_data_frame['activity']]
train_data_frame['activity'] = samples
train_data_frame
| | label | activity |
|---|---|---|
0 | 0 | (96, 183, 367, 379, 387, 1041, 1117, 1176, 132... |
1 | 0 | (31, 37, 137, 301, 394, 418, 514, 581, 671, 72... |
2 | 0 | (169, 394, 435, 603, 866, 1418, 1626, 1744, 17... |
3 | 0 | (72, 181, 231, 275, 310, 355, 369, 379, 400, 5... |
4 | 0 | (37, 379, 453, 503, 547, 611, 684, 716, 794, 8... |
... | ... | ... |
795 | 1 | (39, 120, 345, 400, 412, 558, 729, 1153, 1176,... |
796 | 0 | (43, 51, 280, 356, 378, 543, 557, 640, 666, 70... |
797 | 0 | (63, 232, 360, 405, 433, 447, 474, 751, 1069, ... |
798 | 0 | (83, 159, 290, 462, 505, 509, 531, 547, 737, 9... |
799 | 0 | (91, 432, 433, 509, 559, 578, 1082, 1153, 1220... |
800 rows × 2 columns
Now we can look at the data and explain what it represents. Each data point is a hypothetical amino chain, with the molecules in the chain encoded as numbers, and the label indicates whether the chain shows the presence of a drug. If the label is 0, there is no drug activity; if the label is 1, the amino chain indicates the presence of the drug. This project is about exploring feature extraction techniques that lead to a reliable classifier for this problem.
The data also has an uneven distribution of negative and positive samples: far more amino chains do not indicate drug activity than do. Dealing with this imbalance is another challenging aspect of the problem.
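To quantify the imbalance, the label counts can be checked directly (a quick sketch; it assumes the label column holds the string values '0' and '1' as parsed above).
# Count of samples per class, absolute and as a fraction of the dataset
print(train_data_frame['label'].value_counts())
print(train_data_frame['label'].value_counts(normalize=True))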
The first feature I want to observe is the length of each chain.
lengths = [len(x) for x in train_data_frame['activity']]
train_data_frame['length'] = lengths
train_data_frame
| | label | activity | length |
|---|---|---|---|
0 | 0 | (96, 183, 367, 379, 387, 1041, 1117, 1176, 132... | 732 |
1 | 0 | (31, 37, 137, 301, 394, 418, 514, 581, 671, 72... | 865 |
2 | 0 | (169, 394, 435, 603, 866, 1418, 1626, 1744, 17... | 760 |
3 | 0 | (72, 181, 231, 275, 310, 355, 369, 379, 400, 5... | 1299 |
4 | 0 | (37, 379, 453, 503, 547, 611, 684, 716, 794, 8... | 925 |
... | ... | ... | ... |
795 | 1 | (39, 120, 345, 400, 412, 558, 729, 1153, 1176,... | 770 |
796 | 0 | (43, 51, 280, 356, 378, 543, 557, 640, 666, 70... | 1835 |
797 | 0 | (63, 232, 360, 405, 433, 447, 474, 751, 1069, ... | 710 |
798 | 0 | (83, 159, 290, 462, 505, 509, 531, 547, 737, 9... | 864 |
799 | 0 | (91, 432, 433, 509, 559, 578, 1082, 1153, 1220... | 864 |
800 rows × 3 columns
The first attempts at feature extraction summarize each chain as a whole by calculating values such as its sum, average, minimum, and maximum.
sums = [sum(x) for x in train_data_frame['activity']]
train_data_frame['sum'] = sums
train_data_frame
| | label | activity | length | sum |
|---|---|---|---|---|
0 | 0 | (96, 183, 367, 379, 387, 1041, 1117, 1176, 132... | 732 | 37144141 |
1 | 0 | (31, 37, 137, 301, 394, 418, 514, 581, 671, 72... | 865 | 43919301 |
2 | 0 | (169, 394, 435, 603, 866, 1418, 1626, 1744, 17... | 760 | 38952818 |
3 | 0 | (72, 181, 231, 275, 310, 355, 369, 379, 400, 5... | 1299 | 62965542 |
4 | 0 | (37, 379, 453, 503, 547, 611, 684, 716, 794, 8... | 925 | 44947027 |
... | ... | ... | ... | ... |
795 | 1 | (39, 120, 345, 400, 412, 558, 729, 1153, 1176,... | 770 | 39040873 |
796 | 0 | (43, 51, 280, 356, 378, 543, 557, 640, 666, 70... | 1835 | 91516378 |
797 | 0 | (63, 232, 360, 405, 433, 447, 474, 751, 1069, ... | 710 | 35450063 |
798 | 0 | (83, 159, 290, 462, 505, 509, 531, 547, 737, 9... | 864 | 43486402 |
799 | 0 | (91, 432, 433, 509, 559, 578, 1082, 1153, 1220... | 864 | 44785955 |
800 rows × 4 columns
mins = [min(x) for x in train_data_frame['activity']]
train_data_frame['min'] = mins
maxs = [max(x) for x in train_data_frame['activity']]
train_data_frame['max'] = maxs
averages = [np.average(x) for x in train_data_frame['activity']]
train_data_frame['average'] = averages
train_data_frame
| | label | activity | length | sum | min | max | average |
|---|---|---|---|---|---|---|---|
0 | 0 | (96, 183, 367, 379, 387, 1041, 1117, 1176, 132... | 732 | 37144141 | 96 | 99875 | 50743.362022 |
1 | 0 | (31, 37, 137, 301, 394, 418, 514, 581, 671, 72... | 865 | 43919301 | 31 | 99932 | 50773.758382 |
2 | 0 | (169, 394, 435, 603, 866, 1418, 1626, 1744, 17... | 760 | 38952818 | 169 | 99875 | 51253.707895 |
3 | 0 | (72, 181, 231, 275, 310, 355, 369, 379, 400, 5... | 1299 | 62965542 | 72 | 99956 | 48472.318707 |
4 | 0 | (37, 379, 453, 503, 547, 611, 684, 716, 794, 8... | 925 | 44947027 | 37 | 99990 | 48591.380541 |
... | ... | ... | ... | ... | ... | ... | ... |
795 | 1 | (39, 120, 345, 400, 412, 558, 729, 1153, 1176,... | 770 | 39040873 | 39 | 99907 | 50702.432468 |
796 | 0 | (43, 51, 280, 356, 378, 543, 557, 640, 666, 70... | 1835 | 91516378 | 43 | 99991 | 49872.685559 |
797 | 0 | (63, 232, 360, 405, 433, 447, 474, 751, 1069, ... | 710 | 35450063 | 63 | 99955 | 49929.666197 |
798 | 0 | (83, 159, 290, 462, 505, 509, 531, 547, 737, 9... | 864 | 43486402 | 83 | 99829 | 50331.483796 |
799 | 0 | (91, 432, 433, 509, 559, 578, 1082, 1153, 1220... | 864 | 44785955 | 91 | 99945 | 51835.596065 |
800 rows × 7 columns
Now that some features have been calculated from the amino chains, I try to visualize relationships between the positive and negative samples according to these features.
# Boolean masks for the two classes (the labels were parsed as strings)
inactive_filter = train_data_frame['label'] == '0'
active_filter = train_data_frame['label'] == '1'
# .where() keeps the full index but sets non-matching rows to NaN, which matplotlib skips
inactive_x = [x for x in train_data_frame['min'].where(inactive_filter)]
active_x = [x for x in train_data_frame['min'].where(active_filter)]
inactive_y = [y for y in train_data_frame['length'].where(inactive_filter)]
active_y = [y for y in train_data_frame['length'].where(active_filter)]
plt.plot(inactive_x, inactive_y, 'o', label='inactive')
plt.plot(active_x, active_y, 'o', label='active')
plt.xlabel('Min')
plt.ylabel('Length')
plt.legend()
plt.show()
This plot does not reveal anything insightful about the features other than displaying our data imbalance.
Instead, I take a different approach and try to learn about specific numbers in the sequences. First, I collect the most frequent numbers in active versus inactive samples.
active_samples = [x for x in train_data_frame['activity'].where(active_filter)]
inactive_samples = [x for x in train_data_frame['activity'].where(inactive_filter)]

def term_frequencies(rows):
    # Count how often each number appears across all chains in the given rows
    dic = {}
    for row in rows:
        # .where() leaves NaN placeholders for rows of the other class; skip them
        if isinstance(row, float) and isnan(row):
            continue
        for num in row:
            dic[num] = dic.get(num, 0) + 1
    return dic
active_frequencies = term_frequencies(active_samples)
inactive_frequencies = term_frequencies(inactive_samples)
sorted_active_freqs = sorted(active_frequencies, key=active_frequencies.get, reverse=True)
sorted_inactive_freqs = sorted(inactive_frequencies, key=inactive_frequencies.get, reverse=True)
most_common_active_terms = set()
most_common_inactive_terms = set()
# Keep any term that appears more than 25 times within its class
print('Most frequent numbers in active samples')
for term in sorted_active_freqs:
    if active_frequencies[term] > 25:
        print(term, active_frequencies[term])
        most_common_active_terms.add(term)
print('Most frequent numbers in inactive samples')
for term in sorted_inactive_freqs:
    if inactive_frequencies[term] > 25:
        print(term, inactive_frequencies[term])
        most_common_inactive_terms.add(term)
Most frequent numbers in active samples
412 46
81610 39
2526 38
25762 38
80131 35
92539 35
75364 33
44380 30
45474 29
50184 29
33876 28
28052 28
50522 26
Most frequent numbers in inactive samples
36005 187
9015 157
1176 149
14216 147
75393 142
32199 141
31192 134
54021 130
9197 128
55283 127
...
46635 26
83031 26
23619 26
91547 26
Using this approach, I analyzed the frequent terms in the amino chains of the negative and positive samples. The results look promising, but to double-check, I verify that there is no overlap between the frequent terms of the two classes.
shared_terms = most_common_active_terms.intersection(most_common_inactive_terms)
print(shared_terms)
set()
This is a very promising result: the frequent terms of the two classes do not overlap at all. My next approach is to flag, for each position in a chain, whether it contains one of these most common terms, and to use those flags as features.
# For each chain, mark the positions whose term is one of the most common inactive/active terms
train_data_frame['inactive_flags'] = train_data_frame['activity'].apply(lambda x : [1 if y in most_common_inactive_terms else 0 for y in x])
train_data_frame['active_flags'] = train_data_frame['activity'].apply(lambda x : [1 if y in most_common_active_terms else 0 for y in x])
train_data_frame
| | label | activity | length | sum | min | max | average | inactive_flags | active_flags |
|---|---|---|---|---|---|---|---|---|---|
0 | 0 | (96, 183, 367, 379, 387, 1041, 1117, 1176, 132... | 732 | 37144141 | 96 | 99875 | 50743.362022 | [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1 | 0 | (31, 37, 137, 301, 394, 418, 514, 581, 671, 72... | 865 | 43919301 | 31 | 99932 | 50773.758382 | [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
2 | 0 | (169, 394, 435, 603, 866, 1418, 1626, 1744, 17... | 760 | 38952818 | 169 | 99875 | 51253.707895 | [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
3 | 0 | (72, 181, 231, 275, 310, 355, 369, 379, 400, 5... | 1299 | 62965542 | 72 | 99956 | 48472.318707 | [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
4 | 0 | (37, 379, 453, 503, 547, 611, 684, 716, 794, 8... | 925 | 44947027 | 37 | 99990 | 48591.380541 | [1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 1 | (39, 120, 345, 400, 412, 558, 729, 1153, 1176,... | 770 | 39040873 | 39 | 99907 | 50702.432468 | [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ... | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
796 | 0 | (43, 51, 280, 356, 378, 543, 557, 640, 666, 70... | 1835 | 91516378 | 43 | 99991 | 49872.685559 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
797 | 0 | (63, 232, 360, 405, 433, 447, 474, 751, 1069, ... | 710 | 35450063 | 63 | 99955 | 49929.666197 | [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
798 | 0 | (83, 159, 290, 462, 505, 509, 531, 547, 737, 9... | 864 | 43486402 | 83 | 99829 | 50331.483796 | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
799 | 0 | (91, 432, 433, 509, 559, 578, 1082, 1153, 1220... | 864 | 44785955 | 91 | 99945 | 51835.596065 | [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
800 rows × 9 columns
Alongside the flag features computed above, I vectorize the sequences themselves so they can be processed with StandardScaler and PCA: each activity chain is zero-padded to the length of the longest chain, the padded vectors are standardized, and PCA reduces them to 24 components.
analyzer = PCA(n_components=24)
# Convert each tuple into a list of ints so it can be zero-padded
train_data_frame['activity'] = train_data_frame['activity'].apply(lambda x : [int(i) for i in x])
# Zero-pad every chain to the length of the longest chain so all vectors have the same size
longest_array = max(train_data_frame.activity, key=len)
train_data_frame['activity'] = train_data_frame['activity'].apply(lambda x : np.pad(x, (0, len(longest_array) - len(x))))
# Standardize each position, then project onto 24 principal components
scaler = StandardScaler()
scale_data = scaler.fit_transform(train_data_frame['activity'].tolist())
pca_features = analyzer.fit_transform(scale_data)
pca_features
array([[-10.36465598, 18.79967525, 11.73937013, ..., 0.03207514,
-0.08383894, 0.39913604],
[ -4.67671729, 4.21623949, -0.64990726, ..., 0.41990727,
-2.77600859, 1.1100237 ],
[-10.04249496, 17.86268763, 10.78045247, ..., 0.08256247,
-2.14138044, 0.39456551],
...,
[-10.67743532, 19.67628736, 12.57411854, ..., -0.06186581,
4.21956071, -1.47219242],
[ -3.86758125, 2.36106499, -1.94348142, ..., 0.34873355,
-2.77256988, 1.15395057],
[ -6.23909079, 7.8045524 , 1.87718687, ..., 0.35595152,
-2.40324284, 1.0816283 ]])
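As a sanity check on the dimensionality reduction, the explained variance ratio of the fitted PCA shows how much of the total variance the 24 components retain (a minimal sketch; analyzer is the PCA object fitted above, and the exact number depends on the data).
# Fraction of total variance captured by the 24 retained components
print(analyzer.explained_variance_ratio_.sum())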
Now that the features have been vectorized, the next step is to reassociate the labels with these new vectors.
labels_list = train_data_frame['label'].tolist()
pca_features_and_labels = [(labels_list[x], pca_features[x]) for x in range(len(pca_features))]
print(pca_features_and_labels)
[('0', array([-10.36465598, 18.79967525, 11.73937013, 1.81933651,
7.27358896, -5.80877878, -4.31990938, 2.73344537,
-3.02004001, -1.59814027, -0.27481898, -0.0948921 ,
0.69592384, 0.8420426 , 2.43257244, 3.90940913,
3.10264324, 1.27839418, -1.08279989, 0.94475757,
-0.5470349 , 0.03207514, -0.08383894, 0.39913604])), ('0', array([-4.67671729, 4.21623949, -0.64990726, -1.00412236, -4.85204401,
8.60841715, 7.81671304, -5.06293038, 4.11716955, -1.93997368,
-1.37986821, 2.3723817 , -7.1080325 , 3.53054804, -1.85543256,
-0.19286261, 4.66653681, 2.41240371, -2.18328938, -1.63377641,
-0.5722072 , 0.41990727, -2.77600859, 1.1100237 ])), ('0', array([-10.04249496, 17.86268763, 10.78045247, 1.527854 ,
6.02937338, -3.65470238, -1.8321143 , 0.86734826,
-0.25574267, 0.79712192, 0.4920303 , 0.31215812,
0.7108608 , 7.9150896 , 6.99519093, 2.92097218,
-0.48475597, -1.24301066, 0.98831884, -0.45757902,
2.17370783, 0.08256247, -2.14138044, 0.39456551])), ('0', array([ 12.5628318 , -44.17725869, -42.11343878, -7.17389992,
-26.42842946, -8.94056735, -15.58471053, 8.28463734,
4.44573853, 13.06904922, 2.12784488, 2.67605414,
-9.64159435, -2.76619091, -4.6875101 , 7.96764679,
-5.68832149, -2.20872894, 0.88256378, 4.35534194,
-6.18535239, 0.95892322, -7.68800049, 0.42093145])), ('0', array([ -0.81098155, -5.6262406 , -8.84508712, -2.7074571 ,
-11.89155445, 13.66556523, 8.93018706, -4.65655744,
-0.17734991, -7.70399762, -2.87586543, 1.71759709,
-4.89703584, -1.66939535, 0.18598824, 6.10895544,
0.59341955, -1.86562714, 2.43520771, -2.94574134,
3.93265687, -0.209659 , 1.34401507, -2.43006705])), ('1', array([ -1.58861769, -3.77503755, -7.42700029, -2.44561844,
...
7.21469499, 6.67026091, -4.3179458 , 2.64865874, -3.86834647,
-1.94259869, 3.37725032, -9.10113455, 9.1813779 , 1.39846791,
0.53853341, 5.11972101, 2.34457677, -2.46843791, -1.47527444,
-1.17093341, 0.35595152, -2.40324284, 1.0816283 ]))]
Finally, I put the labels and PCA features into a new pandas DataFrame.
post_proc_data = pd.DataFrame(data=pca_features_and_labels, columns=['label', 'features'])
post_proc_data
| | label | features |
|---|---|---|
0 | 0 | [-10.364655978244212, 18.799675253550777, 11.7... |
1 | 0 | [-4.676717288247908, 4.216239487999718, -0.649... |
2 | 0 | [-10.042494961017002, 17.86268763325594, 10.78... |
3 | 0 | [12.562831802181654, -44.177258687291676, -42.... |
4 | 0 | [-0.8109815485241729, -5.62624059871616, -8.84... |
... | ... | ... |
795 | 1 | [-8.12381395241008, 13.32973675271126, 7.43386... |
796 | 0 | [33.50260621816489, -97.28450642715435, -51.81... |
797 | 0 | [-10.677435318611677, 19.67628735657472, 12.57... |
798 | 0 | [-3.86758125123529, 2.3610649940127746, -1.943... |
799 | 0 | [-6.239090794032579, 7.804552395906523, 1.8771... |
800 rows × 2 columns
To handle the data imbalance, we can either undersample, which limits the number of samples in the majority class, or use SMOTE to create synthetic samples of the minority class. Here is how either approach could be done.
from imblearn.under_sampling import RandomUnderSampler
# Undersampling: randomly drop majority-class samples until the classes are balanced
under_sampler = RandomUnderSampler()
data_resampled, labels_resampled = under_sampler.fit_resample(post_proc_data['features'].tolist(), post_proc_data['label'].tolist())
training_data, testing_data, training_labels, testing_labels = train_test_split(data_resampled, labels_resampled, test_size=0.25, shuffle=True)
print(testing_labels)

from imblearn.over_sampling import SMOTE
# Oversampling: synthesize new minority-class samples with SMOTE
smote_sampler = SMOTE()
data_resampled, labels_resampled = smote_sampler.fit_resample(post_proc_data['features'].tolist(), post_proc_data['label'].tolist())
training_data, testing_data, training_labels, testing_labels = train_test_split(data_resampled, labels_resampled, test_size=0.25, shuffle=True)
print(testing_labels)
With those balancing options in hand, I perform the train/test split. Note that the split below uses the original post-processed data; to train on a balanced set, the resampled data and labels from above would be split instead.
training_data, testing_data, training_labels, testing_labels = train_test_split(post_proc_data['features'].tolist(), post_proc_data['label'].tolist(), test_size=0.25, shuffle=True)
print(testing_labels)
Now that the data is prepared, I try out a variety of classifiers on the data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Decision tree with hand-picked hyperparameters; class_weight='balanced' compensates for the class imbalance
dt_classifier = DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=55, min_samples_split=5, min_samples_leaf=5, max_features=10, class_weight='balanced')
nb_classifier = GaussianNB()
dt_classifier.fit(training_data, training_labels)
predicted_labels = dt_classifier.predict(testing_data)
predicted_accuracy = accuracy_score(testing_labels, predicted_labels)
print(predicted_accuracy)
0.755
The first classifier tested is a decision tree, which achieves an accuracy of 75.5% on the test set.
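Because the classes are imbalanced, accuracy alone can be misleading, so it is worth looking at per-class precision and recall using the classification_report imported at the top (a sketch over the same predictions; the exact numbers depend on the random split).
# Per-class precision, recall, and F1 for the decision tree predictions
print(classification_report(testing_labels, predicted_labels))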
nb_classifier.fit(training_data, training_labels)
predicted_labels = nb_classifier.predict(testing_data)
predicted_accuracy = accuracy_score(testing_labels, predicted_labels)
print(predicted_accuracy)
0.905
The Naive Bayes classifier gives an accuracy of 90.5% on the test set.
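It is also worth checking how many of the minority-class (active) chains a model actually catches; a confusion matrix makes this visible (a sketch using scikit-learn's confusion_matrix, which the code above does not import; the counts depend on the split).
from sklearn.metrics import confusion_matrix
# Rows are true labels ('0' then '1'), columns are predictions;
# the bottom-left entry counts active chains that were missed
print(confusion_matrix(testing_labels, predicted_labels, labels=['0', '1']))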
from sklearn.svm import SVC
svc_classifier = SVC(kernel='linear')
svc_classifier.fit(training_data, training_labels)
predicted_labels = svc_classifier.predict(testing_data)
predicted_accuracy = accuracy_score(testing_labels, predicted_labels)
print(predicted_accuracy)
0.905
The next classifier is a support vector machine with a linear kernel, which also gives 90.5% accuracy.
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(training_data, training_labels)
predicted_labels = rf_classifier.predict(testing_data)
predicted_accuracy = accuracy_score(testing_labels, predicted_labels)
print(predicted_accuracy)
0.9
The random forest classifier is an ensemble of decision trees, and using an ensemble raises the accuracy to 90% compared to the standalone decision tree. While this shows that ensembling is an effective technique for improving decision trees, it still isn't quite as good as Naive Bayes or the SVM. This suggests that the patterns in the data are statistical rather than rule-based, since Naive Bayes and SVM classifiers are better at picking up statistical patterns than decision trees are. Sequential data in particular tends to carry these patterns across the sequence rather than as standalone sets of features.
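To put these accuracy numbers in context given the class imbalance, a majority-class baseline is a useful reference point (a sketch using scikit-learn's DummyClassifier, which is not part of the code above; its score depends on the class ratio of the split being used).
from sklearn.dummy import DummyClassifier
# A trivial baseline that always predicts the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(training_data, training_labels)
print(baseline.score(testing_data, testing_labels))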
To wrap up, this project shows the importance of data exploration, because initial efforts do not always reveal key insights about the data. It also shows the importance of balancing the data and of trying several different classifiers to determine which works best for the problem.