{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# In this notebook we present a solution to the [Personalized Medicine: Redefining Cancer Treatment](https://www.kaggle.com/c/msk-redefining-cancer-treatment) competition on Kaggle\n",
"\n",
"## **The problem:** \n",
"\n",
" #### <ol> We are provided with \"training\" data containing genes, variations, and so-called classes; each row also comes with a text entry, which usually appears to be a scientific paper. The task is to build a classifier that predicts the class of a given gene, variation, and corresponding text. The difficulty is that we are not told what the classes mean, so the classifier has to learn what distinguishes them from the data associated with each class. We are also given \"test\" data consisting of a gene, variation, and text for each row. The test data is quite different from the training data. \n",
" \n",
"\n",
"## **Our Approach:**\n",
"\n",
" #### <ol> We approach the problem using a random forest classifier. For a discussion on how this classifier works please see [classifier summary](https://github.uconn.edu/cow17005/Personalized-Medicine-Redefining-Cancer-Treatment-Challenge/blob/master/classifier.pdf)\n",
" \n",
"## **Results:**\n",
"\n",
" #### <ol> After submitting our solution to Kaggle (which scored the submission on a subset of the test data) we received a log-loss score of 5.085, about one unit worse than the 4.065 we obtained on our own \"test\" split of the training data, where we knew the answers beforehand. The log-loss is the average, over all scored rows, of the negative natural log of the probability predicted for the correct class. Predicting a high probability for the correct class therefore adds only a small amount to the log-loss, whereas predicting a low probability for the correct class adds a larger amount. For example, if the correct class for row 5 (out of 986) is class 3 and we predict a probability of 0.2 for class 3, then -log(0.2) = 1.609 is added to the numerator and 1 is added to the denominator. "
]
},
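{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To make the log-loss definition above concrete, the following cell is a minimal sketch that computes it by hand on a made-up example with three rows and three classes, and compares the result with `sklearn.metrics.log_loss`. The labels, the probabilities, and the helper name `manual_log_loss` are illustrative assumptions and are not taken from the competition data. As a point of reference, predicting 1/9 for each of the nine classes gives a log-loss of ln(9), roughly 2.197, regardless of the true labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import log_loss\n",
"\n",
"def manual_log_loss(y_true, proba, labels):\n",
"    # Average of -log(p), where p is the probability assigned to the true class of each row\n",
"    proba = np.asarray(proba, dtype=float)\n",
"    col = {label: i for i, label in enumerate(labels)}  # label -> column index\n",
"    p_correct = np.array([proba[row, col[y]] for row, y in enumerate(y_true)])\n",
"    return float(np.mean(-np.log(p_correct)))\n",
"\n",
"# Hypothetical toy data: classes 1..3, each row of probabilities sums to 1\n",
"toy_labels = [1, 2, 3]\n",
"toy_y = [1, 3, 2]\n",
"toy_proba = [[0.7, 0.2, 0.1],\n",
"             [0.5, 0.3, 0.2],  # true class 3 gets 0.2, so this row contributes -log(0.2)\n",
"             [0.1, 0.8, 0.1]]\n",
"\n",
"# The two values should agree up to floating-point error\n",
"print(manual_log_loss(toy_y, toy_proba, toy_labels))\n",
"print(log_loss(toy_y, toy_proba, labels=toy_labels))"
]
},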
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import sklearn"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenization took: 2.14 ms\n",
"Type conversion took: 21.38 ms\n",
"Parser memory cleanup took: 4.68 ms\n"
]
}
],
"source": [
"variants_train=pd.read_csv('/Users/cory/Desktop/Kaggle/msk/training_variants.txt',index_col=0,verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenization took: 0.12 ms\n",
"Type conversion took: 0.65 ms\n",
"Parser memory cleanup took: 0.00 ms\n"
]
}
],
"source": [
"variants_test= pd.read_csv('/Users/cory/Desktop/Kaggle/all/stage2_test_variants.csv',index_col=0,verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Filled 5 NA values in column ID,Text\n"
]
}
],
"source": [
"text_train = pd.read_table('/Users/cory/Desktop/Kaggle/msk/training_text.txt',sep=r'\\|\\|',engine='python',verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"text_train.columns=['Text']\n",
"text_test= pd.read_table('/Users/cory/Desktop/Kaggle/all/stage2_test_text.csv',sep=r'\\|\\|',engine='python',verbose=True)\n",
"text_test.index.names=['Id']\n",
"text_test.columns = ['Text']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"text_test = text_test.replace(np.nan,'',regex=True)\n",
"text_train = text_train.replace(np.nan,'',regex=True)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"Train=variants_train.join(text_train)\n",
"Test=variants_test.join(text_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### We take a glance at the training data and observe that it has 3321 rows but only 264 distinct genes. In comparison, 2996 of the 3321 variations are distinct. Of the 3321 text articles, only 1921 are distinct. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Gene</th>\n",
" <th>Variation</th>\n",
" <th>Class</th>\n",
" <th>Text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3321</td>\n",
" <td>3321</td>\n",
" <td>3321.000000</td>\n",
" <td>3321</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>264</td>\n",
" <td>2996</td>\n",
" <td>NaN</td>\n",
" <td>1921</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>BRCA1</td>\n",
" <td>Truncating Mutations</td>\n",
" <td>NaN</td>\n",
" <td>The PTEN (phosphatase and tensin homolog) phos...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>264</td>\n",
" <td>93</td>\n",
" <td>NaN</td>\n",
" <td>53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.365854</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2.309781</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>7.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>9.000000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Gene Variation Class \\\n",
"count 3321 3321 3321.000000 \n",
"unique 264 2996 NaN \n",
"top BRCA1 Truncating Mutations NaN \n",
"freq 264 93 NaN \n",
"mean NaN NaN 4.365854 \n",
"std NaN NaN 2.309781 \n",
"min NaN NaN 1.000000 \n",
"25% NaN NaN 2.000000 \n",
"50% NaN NaN 4.000000 \n",
"75% NaN NaN 7.000000 \n",
"max NaN NaN 9.000000 \n",
"\n",
" Text \n",
"count 3321 \n",
"unique 1921 \n",
"top The PTEN (phosphatase and tensin homolog) phos... \n",
"freq 53 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Train.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To see how the classes are distributed we make a bar chart and observe that class 7 appears most frequently while class 8 appears least frequently. Classes 1, 2, and 4 also appear relatively frequently, and classes 3, 5, 6, and 9 appear much less frequently. This imbalance could bias our classifier toward predicting the frequent classes more often."
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"<BarContainer object of 9 artists>"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt; plt.rcdefaults()\n",
"\n",
"# Count rows per class, sorted by class label so each bar lines up with its tick label\n",
"c = Train['Class'].value_counts().sort_index()\n",
"plt.bar(c.index, c.values, color=[\"blue\", \"red\", \"green\", \"purple\", \"magenta\", \"orange\", \"navy\", \"yellow\", \"grey\"], tick_label=c.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The test data has 986 rows with 279 distinct genes (about 28%), 945 distinct variations, and 874 distinct text articles (about 89%). This differs from the training data, where only about 7.9% of the gene entries were distinct and only slightly more than half of the text articles were distinct."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Gene</th>\n",
" <th>Variation</th>\n",
" <th>Text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>986</td>\n",
" <td>986</td>\n",
" <td>986</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>279</td>\n",
" <td>945</td>\n",
" <td>874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>TP53</td>\n",
" <td>Truncating Mutations</td>\n",
" <td>Among the best-studied therapeutic targets in ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>40</td>\n",
" <td>18</td>\n",
" <td>24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Gene Variation \\\n",
"count 986 986 \n",
"unique 279 945 \n",
"top TP53 Truncating Mutations \n",
"freq 40 18 \n",
"\n",
" Text \n",
"count 986 \n",
"unique 874 \n",
"top Among the best-studied therapeutic targets in ... \n",
"freq 24 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Test.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### We now import most of the tools we use to build the classifier. Recall that we use a random forest classifier, which is part of Python's scikit-learn package."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"import sklearn.ensemble\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.datasets import make_classification\n",
"\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"from sklearn import metrics\n",
"\n",
"import sys\n",
"np.set_printoptions(threshold=sys.maxsize)  # print arrays in full; newer NumPy versions reject np.nan here"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"\n",
"text_clf = Pipeline([('vect', CountVectorizer()),\n",
" ('tfidf', TfidfTransformer()),\n",
" ('clf', RandomForestClassifier()),\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### We take a random 70% of the training data and use a grid search to find the best parameters for our classifier. The remaining 30% is held out so that we can measure how good the classifier is before we classify the actual test data. If the classifier is not as good as we'd like, we can fine-tune it and repeat."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test = train_test_split(Train, train_size=.7)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"parameters = {'vect__ngram_range': [(1, 1), (1, 2)],\n",
" 'tfidf__use_idf': (True, False),\n",
" 'clf__max_depth': (None, 11,12)\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"gs_clf = gs_clf.fit(X_train['Text'], X_train['Class'])"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 81, 4, 2, 29, 7, 12, 26, 0, 0],\n",
" [ 10, 59, 0, 7, 4, 0, 60, 0, 0],\n",
" [ 2, 0, 6, 9, 2, 0, 12, 0, 0],\n",
" [ 34, 4, 1, 147, 5, 3, 18, 0, 0],\n",
" [ 19, 1, 3, 8, 28, 4, 12, 0, 0],\n",
" [ 9, 4, 0, 4, 5, 54, 10, 0, 0],\n",
" [ 6, 23, 2, 11, 5, 1, 230, 0, 0],\n",
" [ 1, 1, 0, 1, 0, 0, 0, 1, 1],\n",
" [ 1, 0, 0, 2, 0, 0, 3, 0, 3]])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predicted = gs_clf.predict(X_test['Text'])\n",
"\n",
"metrics.confusion_matrix(X_test['Class'], predicted)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To interpret the confusion matrix, note that the rows are the actual classes and the columns are the predicted classes. For example, row 3, column 7 shows that 12 items in class 3 were predicted to be in class 7. The entries along the diagonal count the items whose class was predicted correctly. We see that classes 1, 2, 4, and 7 are predicted frequently compared to classes 3, 5, 6, 8, and 9. This makes sense, since these are the classes that appear most often in the data. Note also that class 8 was predicted only once, and that prediction was correct. \n",
"\n",
"\n",
"\n",
"### As discussed in the introduction, Kaggle scores the predictions based on log-loss. We can calculate the log-loss on our \"test\" sample, the 30% of the training data that we held out. As can be seen below we have a log-loss of 4.065."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4.065640641565067"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics.log_loss(X_test['Class'], gs_clf.predict_proba(X_test['Text']), labels=gs_clf.classes_)"
]
},
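{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The reading of the confusion matrix given earlier can also be summarized per class. The cell below is a minimal sketch that derives recall (the fraction of each actual class that was predicted correctly) and precision (the fraction of each predicted class that was correct) from the matrix printed above. It assumes that the rows and columns are ordered as classes 1 through 9; the variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Confusion matrix copied from the output above (rows = actual class, columns = predicted class)\n",
"cm = np.array([[ 81,  4, 2,  29,  7, 12,  26, 0, 0],\n",
"               [ 10, 59, 0,   7,  4,  0,  60, 0, 0],\n",
"               [  2,  0, 6,   9,  2,  0,  12, 0, 0],\n",
"               [ 34,  4, 1, 147,  5,  3,  18, 0, 0],\n",
"               [ 19,  1, 3,   8, 28,  4,  12, 0, 0],\n",
"               [  9,  4, 0,   4,  5, 54,  10, 0, 0],\n",
"               [  6, 23, 2,  11,  5,  1, 230, 0, 0],\n",
"               [  1,  1, 0,   1,  0,  0,   0, 1, 1],\n",
"               [  1,  0, 0,   2,  0,  0,   3, 0, 3]])\n",
"\n",
"correct = np.diag(cm)              # correctly classified items per class\n",
"actual_totals = cm.sum(axis=1)     # number of items truly in each class\n",
"predicted_totals = cm.sum(axis=0)  # number of times each class was predicted\n",
"\n",
"recall = correct / actual_totals\n",
"# Leave precision at 0 for any class that was never predicted, to avoid dividing by zero\n",
"precision = np.divide(correct, predicted_totals, out=np.zeros(len(cm)), where=predicted_totals > 0)\n",
"\n",
"for k in range(len(cm)):\n",
"    print(f'class {k + 1}: recall={recall[k]:.2f}, precision={precision[k]:.2f}')"
]
},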
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"source": [
"Answer = gs_clf.predict_proba(Test['Text'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### This was submitted to Kaggle, where the log-loss on a subset of the test data was calculated to be 5.085, about one unit higher than the 4.065 we calculated on the held-out training data. \n",
"\n",
"\n",
"# Conclusion\n",
"\n",
"#### <ol> Using a random forest classifier and a grid search over three hyperparameters, with at most three candidate values each, we predicted class probabilities for a subset of the 986 gene / variation pairs from their associated text, with a log-loss of 5.085. This is about one unit higher than the 4.065 we scored on the held-out subset of the training data. This is not a great log-loss score, but considering the different nature of the actual test data it is good to see that it scores similarly to the data where we did know the class. Read below for what we could have done to fine-tune our algorithm if we were to work more on this problem.\n",
" \n",
"# **Alternative Approaches:**\n",
"\n",
" #### <ol> After reading more discussion of the problem, it appeared that the actual test data was different enough from the training data that the problem was not especially well-posed. For that reason we did not put the following approaches into practice. If we were to continue working on this problem, it would be interesting to try several different classifiers, such as a support vector machine, which is also very popular for text classification. We could also have fine-tuned our classifier with a larger grid search, which is computationally expensive. Further, we could have made stronger use of the gene / variation in our classifier. For example, we could have cut each text document down to the 100 words to the left and right of each mention of the gene / variation under consideration for that article. This may have made the computations less expensive and cut out some less useful information."
]
},
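{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The last idea above, keeping only the words near each mention of the gene or variation, could be sketched as follows. This is an untested illustration and was not part of our submission. The function name `window_around_mentions` is hypothetical, the match is a simple case-insensitive substring test on single whitespace-separated words (so a multi-word variation such as \"Truncating Mutations\" would need extra handling), and returning the full text when there is no mention is an assumed design choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def window_around_mentions(text, term, width=100):\n",
"    # Keep only the words within `width` words of any mention of `term`\n",
"    words = text.split()\n",
"    needle = term.lower()\n",
"    keep = set()\n",
"    for i, word in enumerate(words):\n",
"        # Substring match, so punctuation attached to a word does not hide a mention\n",
"        if needle in word.lower():\n",
"            keep.update(range(max(0, i - width), min(len(words), i + width + 1)))\n",
"    if not keep:\n",
"        return text  # assumed fallback: no mention found, keep the whole document\n",
"    return ' '.join(words[i] for i in sorted(keep))\n",
"\n",
"# Tiny made-up example with a window of one word on each side\n",
"sample = 'a b BRCA1 c d e f g'\n",
"print(window_around_mentions(sample, 'BRCA1', width=1))  # b BRCA1 c"
]
},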
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}