Dealing with medical codes in Data Science
Working with medical data can be tough. In healthcare we sometimes want to classify patients based on the free-text data we hold for them, or to predict a likely outcome from free-text clinical notes. Using free text requires natural language processing (NLP) methods.
Here we start with one of the simplest techniques — ‘bag of words’.
In a ‘bag of words’, free text is reduced to a vector that represents the number of times each word is used in the text. It is also possible to count sequences of two, three or more words (n-grams), in case words used together help to classify a patient.
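To make the idea concrete, here is a minimal sketch in plain Python. The function names `bag_of_words` and `to_vector` and the example note are made up for illustration; a library vectorizer would normally do this work.

```python
from collections import Counter

def bag_of_words(text, n=1):
    """Count n-grams (runs of n consecutive words) in a piece of text."""
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams)

def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order."""
    return [counts[term] for term in vocabulary]

# Hypothetical free-text note, used only for illustration.
note = "chest pain and shortness of breath no chest trauma"

unigrams = bag_of_words(note)        # single words
bigrams = bag_of_words(note, n=2)    # pairs of consecutive words

vocabulary = sorted(unigrams)
vector = to_vector(unigrams, vocabulary)

print(vocabulary)
print(vector)
print(bigrams["chest pain"])
```

The vector has one position per vocabulary word, so ‘chest’, which appears twice in the note, gets a count of 2. Word order is lost in the single-word version, which is why pairs or triples of words are sometimes counted as well.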
A classic ‘toy problem’ used to help teach or develop methods is to judge whether people rated a film as ‘like’ or ‘did not like’, based on the free text they entered into a widely used internet film review database (www.imdb.com).
Here we will take almost 50,000 reviews and convert each one into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model. We could use raw word counts, but in this case we will add an extra transformation called tf-idf (term frequency-inverse document frequency), which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of words used frequently across reviews.
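The weighting can be sketched by hand. The example below assumes the textbook form, count of the word in the review multiplied by log(N / number of reviews containing the word). Libraries such as scikit-learn use a smoothed and normalised variant, so their numbers will differ. The function name `tf_idf` and the three short reviews are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in c.items()}
        for c in counts
    ]

# Hypothetical mini-corpus, for illustration only.
reviews = [
    "the film was great",
    "the film was dull",
    "the cast was great fun",
]
weights = tf_idf(reviews)

for review, w in zip(reviews, weights):
    print(review, {word: round(value, 3) for word, value in w.items()})
```

In this sketch ‘the’ and ‘was’ appear in every review, so their weight drops to zero, while ‘dull’, which appears in only one review, receives the largest weight.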
To complete this we will need the following steps:
1) Load the data from the internet.
2) Clean the data: remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).
3) Split the data into training and test data sets.
4) Convert the cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.
5) Train a logistic regression model on the tf-idf transformed word vectors.
6) Apply the logistic regression model to our previously unseen test cases, and calculate the accuracy of our model.
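Steps 3 to 6 can be sketched end to end on a toy scale. The block below is an illustration under stated assumptions, not the model used for the real reviews: the twelve short ‘reviews’ are invented and already cleaned, the split simply holds out the last four rows, and the logistic regression is a hand-written gradient descent loop rather than a library implementation.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned reviews (1 = liked, 0 = did not like).
data = [
    ("great film love cast", 1), ("dull film hate cast", 0),
    ("good plot great fun", 1), ("bad plot dull poor", 0),
    ("love plot fun", 1), ("hate plot poor", 0),
    ("good cast love film", 1), ("bad cast hate film", 0),
    ("great fun film", 1), ("dull poor film", 0),
    ("love good plot", 1), ("hate bad plot", 0),
]

# Step 3: hold out the last four reviews as unseen test cases.
# (A real project would shuffle before splitting.)
train, test = data[:8], data[8:]

# Step 4: learn the vocabulary and idf weights from the training reviews only.
train_docs = [text for text, _ in train]
doc_freq = Counter(w for text in train_docs for w in set(text.split()))
vocab = sorted(doc_freq)
idf = {w: math.log(len(train_docs) / doc_freq[w]) for w in vocab}

def vectorize(text):
    """tf-idf vector for one review; words outside the vocabulary are ignored."""
    counts = Counter(text.split())
    return [counts[w] * idf[w] for w in vocab]

def predict_proba(weights, bias, x):
    """Probability that a review is positive under the logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Step 5: fit logistic regression with plain batch gradient descent.
X_train = [vectorize(text) for text, _ in train]
y_train = [label for _, label in train]
weights, bias = [0.0] * len(vocab), 0.0
learning_rate = 0.5
for _ in range(200):
    errors = [predict_proba(weights, bias, x) - y for x, y in zip(X_train, y_train)]
    for j in range(len(weights)):
        grad = sum(e * x[j] for e, x in zip(errors, X_train)) / len(X_train)
        weights[j] -= learning_rate * grad
    bias -= learning_rate * sum(errors) / len(X_train)

# Step 6: predict the held-out reviews and measure accuracy.
predictions = [int(predict_proba(weights, bias, vectorize(text)) >= 0.5) for text, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Note that the vocabulary and idf weights are learned from the training reviews only, so nothing about the test reviews leaks into the model before it is evaluated.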
Define a function to preprocess data
Our function works on raw text strings and:
1) changes to lower case
2) tokenizes (breaks down into words)
3) removes punctuation and non-word text
4) finds word stems
5) removes stop words
6) rejoins meaningful stem words
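A minimal sketch of such a function is shown below, using only the Python standard library. The stop-word list is a tiny illustrative sample, and `crude_stem` is a deliberately rough stand-in for a real stemmer (such as the Porter stemmer available in NLTK); both are assumptions made for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "as", "in", "is", "it", "of", "the", "to", "was"}

def crude_stem(word):
    """Rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Clean one raw text string and return the rejoined stem words."""
    text = text.lower()                                # 1) lower case
    text = re.sub(r"<[^>]+>", " ", text)               # 3) drop HTML-like tags
    tokens = re.findall(r"[a-z]+", text)               # 2) + 3) words made of letters only
    stems = [crude_stem(token) for token in tokens]    # 4) find word stems
    kept = [s for s in stems if s not in STOP_WORDS]   # 5) remove stop words
    return " ".join(kept)                              # 6) rejoin

# Hypothetical raw reviews, for illustration only.
raw_reviews = [
    "The acting was AMAZING, <br /> and the plots twisted!",
    "It is a dull film.",
]
cleaned = [preprocess(review) for review in raw_reviews]
print(cleaned)
```

The stems do not have to be real words (‘amazing’ becomes ‘amaz’ here); what matters is that different forms of the same word end up as the same token.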
Once the function is ready, we can apply it to each review.
This is just the initial part of the project, and I will share more in coming articles. Please reach out to me with any comments or suggestions you may have (@kristinelpetrosyan@gmail.com).