Using text analytics to learn about disease
Medical research and pharmacology are difficult; drawing conclusions often depends on various expressions of modality. People experience disease and treatment differently. To account for these differences in human experience, pharmacologists and medical researchers rely on text analytics software, like Semantria. This software structures and analyzes natural language text data relevant to a given subject. This text data might take the form of research papers, news media articles, social media posts, patient survey responses, and more. The resulting analysis allows the researcher to draw insights from massive data sets, shedding light on patients’ experience of disease as well as their response to treatment.
Decoding the public perception of ADHD and its treatment
Take ADHD as a case study that illustrates the effectiveness of text analytics. The goal is to perform text analytics on as complete a data set as possible. Information from thousands of pages of blogs, news websites, and medical research paper databases can be gathered relatively quickly. Gathering the data, however, is the easy part. The hard part is combing through everything that has been gathered and interpreting the massive amount of text in a meaningful way. Without text analytics, compiling even a quasi-complete picture of public perception of a particular disorder could take years.
What is Semantria?
Semantria leverages text analytics and sentiment analysis to determine what people are talking about and how they feel about it. It extracts, summarizes and analyzes key information across thousands of data points. These data points can include blog posts, news articles and scientific research papers. Through its natural language processing (NLP) elements, Semantria uncovers consensus and discord in the dialogue surrounding a topic.
How the analysis works
We examined the differences within the human experience of Attention Deficit Hyperactivity Disorder (ADHD). Lexalytics, an InMoment company, scraped data from Reddit, multiple ADHD blogs, news websites, and scientific papers sourced from the PubMed and HubMed databases. Once the data was collected, we executed an analysis on the text data set using the Semantria natural language processing API with queries built on top of the Healthcare/Pharma industry pack.
The results modeled the conversation surrounding ADHD by linking topics of discussion across four primary channels: medical research papers, news media articles, Reddit, and blogs. This model allowed us to draw key insights about how each channel views ADHD. While there were expected overlaps, there were also topics unique to each channel.
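To make the idea of linking topics across channels concrete, here is a minimal sketch in Python. It is not Semantria's API or query syntax: the keyword lists, function names, and sample documents are all hypothetical, and simple keyword matching stands in for the industry-pack queries described above. It only illustrates the shape of the result, a per-channel count of how many documents mention each topic.

```python
import re
from collections import Counter

# Hypothetical keyword lists standing in for topic queries.
# These are illustrative assumptions, not Semantria's actual queries.
TOPIC_KEYWORDS = {
    "medication": ["medication", "prescription"],
    "anxiety": ["anxiety", "anxious"],
    "forgetfulness": ["forget", "forgot"],
}


def topics_in(text):
    """Return the set of topics whose keywords start a word in the text."""
    lowered = text.lower()
    return {
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(word), lowered) for word in words)
    }


def tally_by_channel(documents):
    """Count, per channel, how many documents mention each topic."""
    counts = {}
    for channel, text in documents:
        counts.setdefault(channel, Counter()).update(topics_in(text))
    return counts


# Made-up sample documents, for illustration only.
sample_docs = [
    ("reddit", "I forgot my keys again and my anxiety spiked."),
    ("reddit", "New medication helps, but I still forget appointments."),
    ("news media", "Doctors discuss prescription options for ADHD."),
]

tallies = tally_by_channel(sample_docs)
for channel, counter in tallies.items():
    print(channel, dict(counter))
```

A real pipeline would replace keyword matching with proper NLP topic and sentiment extraction, but the downstream comparison across channels works on the same kind of per-channel tally.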
Topics unique to each channel
Scientific papers
The conversation surrounding prevention of the disorder was unique to the scientific papers. Emphasis regarding prevention centered on chemicals present in the environment at the time of pregnancy and nutrient/mineral deficiencies in the mother.
News media
In the news media, ways of treating the disorder were in the spotlight. Drug prescriptions and other methods of reducing symptoms, such as eating fewer processed foods, were the focus. Along with treatments, the media mostly discussed topics that are already widely known, like the danger of drinking alcohol during pregnancy. In general, the media drew broad conclusions about causality.
Reddit and blogs
Reddit and blog sources tended to veer away from the topics discussed in the scientific papers and news media, focusing instead on personal experiences. The key difference between the Reddit posts and the blogs is that the blogs tended to be more structured and formal.
A topic unique to Reddit is “rejection dysphoria”, which is an extreme emotional sensitivity and pain triggered by the perception — but not necessarily the reality — that the person has been rejected, teased or criticized by important people in their life.
Topics most frequently discussed
Among the research papers, Semantria determined that the top five topics discussed were anxiety, twitching/teeth grinding, forgetfulness, medication and environmental factors. These topics and their subtopics show that the research papers emphasize the side effects and prevention of ADHD. The subtopics included maternal iron deficiency, school, genetic factors, anxiety, and sleep deprivation.
The topics in the news media centered around medication, anxiety, forgetfulness, sleep problems and smoking. Here we see that anxiety, medication and forgetfulness are frequent topics, just as they are in the medical research papers. Despite this overlap, the news media looks at the story from the other direction. Instead of preventative solutions, it focused on treating the symptoms through sleep medications, mood medications, diet and exercise.
The focus on Reddit was personal experience. The most discussed topics were forgetfulness, anxiety, being late, journal/diary compilation, and distractions identified by the authors of the posts. The posts carry a constant flow of information about the symptoms of ADHD and the side effects of treatment, including subtopics of anxiety, hyperfocus, sleep problems, distractions and procrastination. Non-pharmaceutical treatments are also explored to a higher degree on Reddit than in research papers and news media, with emphasis on journaling and yoga.
ADHD bloggers centered the conversation on forgetfulness, anxiety, journal/diary compilation, distractions and medication. Here we see a lot of overlap with the main topics from Reddit, with the key difference being that “medication” replaces “being late” as a primary topic for bloggers.
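The overlaps and gaps described above can be checked with simple set operations. The following Python sketch uses only the top-five topic lists reported in this section; the variable and function names are our own illustrative choices and have nothing to do with Semantria's output format.

```python
# Top-five topics per channel, transcribed from the lists above.
top_topics = {
    "research papers": {
        "anxiety", "twitching/teeth grinding", "forgetfulness",
        "medication", "environmental factors",
    },
    "news media": {
        "medication", "anxiety", "forgetfulness",
        "sleep problems", "smoking",
    },
    "reddit": {
        "forgetfulness", "anxiety", "being late",
        "journal/diary compilation", "distractions",
    },
    "blogs": {
        "forgetfulness", "anxiety", "journal/diary compilation",
        "distractions", "medication",
    },
}


def shared_topics(channels):
    """Topics that appear in every channel's top five."""
    return set.intersection(*channels.values())


def unique_topics(channels):
    """For each channel, topics that appear in no other channel's top five."""
    result = {}
    for name, topics in channels.items():
        others = set().union(
            *(t for other, t in channels.items() if other != name)
        )
        result[name] = topics - others
    return result


shared = shared_topics(top_topics)
unique = unique_topics(top_topics)
# Topics that distinguish Reddit's top five from the blogs' top five.
reddit_vs_blogs = top_topics["reddit"] ^ top_topics["blogs"]

print("Shared by all channels:", sorted(shared))
for channel, topics in unique.items():
    print(f"Only in {channel}:", sorted(topics))
print("Reddit vs. blogs:", sorted(reddit_vs_blogs))
```

Run on these lists, the intersection contains anxiety and forgetfulness, and the only difference between the Reddit and blog lists is "being late" versus "medication", which matches the observation above.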
How diseases are perceived across channels
As shown, Semantria identifies the gaps and overlaps in the conversation surrounding a topic, and it does so with a speed that a human researcher cannot match. A human being simply can’t read thousands upon thousands of documents and hope to draw a conclusion in a reasonable amount of time. Through text analytics, however, we can find signal within a complex medical data set with relative ease. This is particularly useful for medical research and pharmacology when outlining the discussion surrounding a particular disease or disorder. Once we see where the overlaps and gaps are in the conversation, the divide can be bridged. Using text analytics software like Semantria to see how a disease or disorder is perceived across different channels allows for more effective communication. Consequently, medical professionals may connect more effectively with people struggling with treatment.