Social media data could be key to tracking disease patterns

Social media has grown rapidly in recent decades, with some of its biggest players, including Facebook and Twitter, founded in 2004 and 2006, respectively. In just the past 12 months, the number of active social media users has grown by more than 400 million, and Twitter alone reaches 211 million daily active users. All of this translates into large amounts of data that can be used to gain population-level insights.

The COVID-19 pandemic provided an opportunity for researchers such as those at the Optum Group to explore whether social media data could be associated with disease patterns and trends. In a HIMSS22 presentation, Danita Kiser, VP of Optum, took us on a deep dive into one such project, in which over 20 million posts on Twitter were reviewed.

The organization posed the question, “How strongly is social media data correlated with actual COVID-19 cases, and will that signal remain stable during the pandemic?”

“We have collected a series of geolocated tweets … read the tweets and tag them [to categorize them]. Then, using those classified tweets, we built natural language processing models to … categorize untagged data,” Kiser said.

The team of researchers then ran the models on real-time data, and the categorized tweets were tracked over time.
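The workflow Kiser describes (hand-tag a sample, train a model on it, then categorize untagged tweets) can be illustrated with a minimal sketch. The presentation did not say which models Optum built, so the bag-of-words Naive Bayes classifier below is only an assumed stand-in, and the example tweets and the names `train`, `classify` and `TAGGED` are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(tagged):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in tagged:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the most likely label under multinomial Naive Bayes
    with add-one (Laplace) smoothing. Unknown words are ignored."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy examples standing in for hand-tagged tweets (not real data).
TAGGED = [
    ("I tested positive for covid today", "confirmed"),
    ("my test came back positive", "confirmed"),
    ("I have a fever and a cough", "showing symptoms"),
    ("cough and sore throat all week", "showing symptoms"),
    ("covid is a hoax and fake", "hoax"),
    ("this whole virus is fake news", "hoax"),
]

model = train(TAGGED)
print(classify(model, "just tested positive"))
print(classify(model, "fever and cough"))
```

A production system would need far more tagged data and a stronger model, but the shape is the same: labeled examples go in, and a function that labels new tweets comes out.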

“We spent quite a bit of time collecting and monitoring…before we could define trends,” Kiser said.

More than 15,000 hand-tagged tweets were placed into categories, including “confirmed,” “showing symptoms,” “fixed,” and “hoax.” The team also tagged whether the content of each tweet concerned a place near where it was posted. What the group discovered was interesting.

At the start of the pandemic, there was a very strong correlation between tweets tagged “confirmed” and case counts.

“Tweets correlated most strongly when we shifted tweets by seven to 10 days. … People were tweeting about cases before the number of cases started to rise, [and this was found to be] a leading indicator of COVID cases,” said Kiser. “This was important because there were no leading indicators at the time.”

Interestingly, however, in the latter part of the Delta wave, the lag between tweets and cases became shorter. In Pennsylvania, for example, it shrank from seven days to two, meaning the number of cases rose much sooner after the tweets were posted.
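The idea of shifting tweets by a number of days and checking the correlation with cases can be sketched as a lagged Pearson correlation. This is a minimal illustration, not Optum's method: the function names are hypothetical, and the two series are synthetic, built so that cases follow tweets by exactly seven days.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def lagged_correlation(tweets, cases, lag):
    """Correlate tweet counts on day t with case counts on day t + lag."""
    if lag == 0:
        return pearson(tweets, cases)
    return pearson(tweets[:-lag], cases[lag:])


def best_lag(tweets, cases, max_lag):
    """Return the lag (in days) with the highest correlation, plus all scores."""
    scores = {lag: lagged_correlation(tweets, cases, lag)
              for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores


def bump(t):
    """Synthetic single wave of activity peaking on day 20 (invented data)."""
    return 100 * math.exp(-((t - 20) / 6) ** 2)


DAYS = 60
tweets = [bump(t) for t in range(DAYS)]
cases = [3 * bump(t - 7) for t in range(DAYS)]  # same wave, seven days later

lag, scores = best_lag(tweets, cases, max_lag=14)
print(lag)
```

Running the same search over successive time windows is one way a shrinking lag, such as the shift from seven days to two, could show up in the numbers.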

The biggest challenge was contending with a moving “ground truth.” The categories chosen ultimately correlated with this defined “truth,” but the known facts continued to evolve as people came to understand the disease better over time and navigated multiple COVID-19 variants.

Social media is a powerful tool for gaining insights at both the individual and population levels. Working with university partners and data scientists, the Optum Group found that, especially when COVID-19 cases are increasing, Twitter signals could be used as leading indicators to predict case counts.

The hope is that such data analysis can be used for future pandemic preparedness and response. As UnitedHealth Group senior director Gina Debogovich stated, “There are a multitude of data sources that can help us predict the course of the disease more accurately, but digital surveillance could be one of our most effective offensive mechanisms. … We need [to monitor] social media closely so we can proactively identify the next major outbreak.”