Deep learning for pollen allergy surveillance from twitter in Australia

Australian researchers investigated whether tweets can be used to map hay fever complaints with deep learning technology.


A few questions about this project to the authors of the article from the Institute for Sustainable Industries & Liveable Cities, Victoria University, Melbourne.


1. What was the goal of the research?

The research investigated the potential of social media data to complement the limited survey-based approaches in the context of Hay Fever surveillance. The assumption was that people share health-related information online, which can be relevant for healthcare professionals, policy makers and allergy sufferers. Still, a major challenge to effective knowledge extraction from highly unstructured user-generated content lay in identifying the relevant content. Social media data abounds in misspellings, abbreviations and a wide range of creative expressions referring to medical concepts (e.g. hay fever sob = watery eyes). State-of-the-art deep learning techniques were applied to automatically classify social media posts into Informative (i.e. symptoms, treatments) and Non-Informative (i.e. news, ads, warnings, ambiguous) categories. Health surveillance from social media has already proven successful in early outbreak detection of infectious diseases such as Influenza in a number of previous studies. Yet the topic of allergies is still underexplored.


2. Which social media channels were used, and which not?

Twitter was selected as the primary source for data collection due to its widespread use in research, the public character of tweets and the availability of APIs that make data easy to extract.

Twitter also recorded 5.3 million active users in December 2019 in Australia, where the study was conducted (//www.socialmedianews.com.au). The short text format of Twitter messages makes it an ideal source of real-time updates, which frequently concern health-related matters. The other popular social media platforms considered were Facebook, YouTube, Instagram and Snapchat. Still, there are privacy restrictions on large-scale data scraping from Facebook, while the latter three platforms are intended for sharing visual content such as images and video.


3. Did you find a correlation between posts and widely shared press releases related to hay fever?

Press releases about Hay Fever were not included in the scope of the study, but it's definitely an idea worth considering for future work. It would be interesting to see whether there is any correlation between media coverage of Hay Fever and its actual incidence. Yet the goal of the current study was to extract Pollen Allergy self-reports to estimate its prevalence within the population, thus any news, warnings and commercials were considered non-relevant and excluded from further analysis.


4. Which area did you cover?

The area covered so far was Australia. Similar types of studies have already been conducted in the UK and the US. To examine potential correlation with weather variables, tweets were collected from 3 major cities on the Australian east coast (Melbourne, Sydney, Brisbane), and the respective meteorological information from nearby weather stations was used.


5. How many posts did you use? Was that enough?

Overall, approx. 4,000 tweets that met the pre-specified criteria were collected. Given the relatively small size of the Australian population (in comparison with the US or UK), the number of tweets was sufficient and the results proved informative (e.g. a peak in tweeting about Hay Fever around the pollen season). Also, each survey-based method includes only a fraction of the population, and social media-based health surveillance aims to complement rather than replace traditional approaches to Hay Fever prevalence evaluation. Additionally, the finer-grained details extracted can aid knowledge discovery about the condition.


6. What I find an interesting angle is how people express their state of hay fever, like ‘hay fever sucks’ or ‘my head is killing me’. Is there some kind of top 10 of expressions or metaphors people use online? Or really funny or original ones?

Given the informal social media setting, the descriptions of symptoms, treatments and overall feelings were often figurative and creative, far from official medical concepts. That is why the Deep Learning approach was adopted: to automatically identify that a tweet such as ‘I’m not crying, it’s hay fever attacking me’ refers to the Hay Fever symptom ‘watery eyes’ despite the lack of its explicit mention. Overall, informal expressions such as ‘sniffling’, ‘sneezy’ and ‘snot’ were found to be popular when referring to the blocked nose symptom, and ‘crying’, ‘tear’ and ‘sob’ when referring to the watery eyes symptom. The findings highlight the importance of developing appropriate techniques for efficient health surveillance from social media, as a pre-defined list of medical terms proves limited when exposed to the wide variety of expressions posted by Twitter users.
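To illustrate why a fixed term list falls short, here is a minimal Python sketch of a keyword-based matcher. The term list and the function name `matches_medical_terms` are hypothetical and not taken from the study; the informal tweet is the one quoted in the answer above.

```python
# Hypothetical list of formal symptom terms (an assumption for illustration only).
MEDICAL_TERMS = {"watery eyes", "blocked nose", "sneezing", "itchy eyes"}


def matches_medical_terms(tweet: str) -> bool:
    """Return True if the tweet contains any formal term from the list."""
    text = tweet.lower()
    return any(term in text for term in MEDICAL_TERMS)


formal = "Watery eyes and a blocked nose all day #hayfever"
informal = "I'm not crying, it's hay fever attacking me"

print(matches_medical_terms(formal))    # True
print(matches_medical_terms(informal))  # False: the symptom is only implied
```

A matcher of this kind misses the second tweet entirely, which is the gap a learned classifier is meant to close.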


What is shown in table 6 in the paper?

The main purpose of this table is to demonstrate the internal workings of deep learning, as it is often criticised as a ‘black box’ approach. Here we can see that the term ‘antihistamines’ is closely associated with a wide range of commercial brands of that drug class, which is correct. This means that if we input this vector of associated terms into the model, it will be able to classify a tweet including terms such as ‘zyrtec’ or ‘telfast’ as hay fever treatment, even though no list of medicines was specified a priori (as compiling one would be labour-intensive or even infeasible). Similarly, the word ‘eyes’ is closely associated with e.g. ‘staring’ and ‘tears’, thus any variations in the description of watery eyes in tweets (for instance ‘tears streaming down my face #hayfever’) would be correctly classified as a hay fever symptom.

This shows deep learning robustness and flexibility, as the vectors of associated terms are trained fully automatically and are ready for use. Weights from the table indicate the similarity (based on distance in the pre-trained vector space) of each word to the keyword selected. Closely related words occur in similar contexts.

For example:

Pollen allergy, also known as hay fever, is a respiratory condition…

Allergic rhinitis, also referred to as hay fever, is a chronic respiratory condition…

… Hence, Pollen allergy and Allergic rhinitis must be associated as shown by high score (in this case synonymity relation).
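One common way to compute such similarity weights is cosine similarity between word vectors. The sketch below assumes that measure (the answer above only says the weights are based on distance in the vector space) and uses tiny invented vectors; the names `VECTORS`, `cosine_similarity` and `most_similar` are illustrative only.

```python
import math

# Toy 3-dimensional vectors, invented for illustration; real pre-trained
# embeddings typically have tens to hundreds of dimensions.
VECTORS = {
    "antihistamines": [0.9, 0.1, 0.2],
    "zyrtec": [0.8, 0.2, 0.3],
    "sunshine": [0.1, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Rank all other words by similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("antihistamines", VECTORS):
    print(f"{other}: {score:.2f}")
```

With these made-up numbers, ‘zyrtec’ ranks above ‘sunshine’ for the keyword ‘antihistamines’, mirroring the kind of ranking the table reports.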



7. What are the emoticons most used by hay fever patients?

Special characters (including emoticons) are frequently removed as part of the pre-processing step to reduce the level of noise in the dataset and improve classification performance. Yet a large proportion of swear words, as well as ‘crap’ and ‘shit’, was found among the tweets where users referred to their state of wellbeing (in particular during unfavourable weather conditions). The more negative sentiment of the Informative class of tweets compared with the Non-Informative class (from the perspective of the goal of the study) was a particularly useful distinguishing characteristic during automatic content classification. Below is the top 4 of the most used hay fever emojis:

1. Sneezing Face 🤧

2. Loudly Crying Face 😭

3. Weary Face 😩

4. Face With Medical Mask 😷

References: Du, J., Michalska, S., Subramani, S., Wang, H. and Zhang, Y., 2019. Neural attention with character embeddings for hay fever detection from twitter. Health information science and systems, 7(1), p.21. //doi.org/10.1007/s13755-019-0084-2
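The pre-processing step mentioned in the answer to question 7 can be sketched with a few regular expressions. The cleaning rules below (lowercasing, dropping links, mentions, emoji and punctuation) are assumptions for illustration; the answer does not list the exact rules used in the study, and the function name `clean_tweet` and the sample tweet are hypothetical.

```python
import re


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, emoji and punctuation.

    The rules are assumptions for illustration, not the study's pipeline.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop emoji, '#', punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


print(clean_tweet("Tears streaming down my face 😭 #hayfever @mate"))
# tears streaming down my face hayfever
```

Note that this simple rule also splits contractions (for example "I'm" becomes "i m"), so a real pipeline would likely treat apostrophes separately.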


8. Did you come across any funny Snapchat filters?

Snapchat was not included in the research, whose main goal was to classify textual content into relevant Hay Fever self-reports and non-relevant content (e.g. news, ads), with a primary focus on the informal expressions used by Twitter users.


9. Is the machine learning aspect of this research easy to explain? Was it trained by this research?

The Machine Learning approach is definitely less intuitive than traditional query-based data extraction. Currently, there are numerous open-source packages that make it possible to pre-process the data, train the model and optimise its hyperparameters in a relatively simple procedure. The most ‘labour-intensive’ part of the process was data extraction and its annotation by the domain expert, needed to subsequently train the model (supervised Machine Learning). The study used 4 Deep Learning architectures, following the steps above, and performed an extensive evaluation of their performance in terms of classification accuracy. Additionally, ‘word embeddings’, the state of the art in natural language processing, were incorporated into the model to improve text representation (and therefore classification performance). These are already available in pre-trained format, ready to input into the model (//nlp.stanford.edu/projects/glove/). Our research also evaluated the potential accuracy improvement from developing our own embeddings, specific to the domain, i.e. Hay Fever. This step is more time-consuming and not required for commercial purposes since the default performance was already satisfactory (88% accuracy).
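As a rough illustration of how pre-trained embeddings can be fed into a model, the sketch below parses vectors in the plain-text layout used by GloVe files (one word per line followed by its numbers) and averages them into a single tweet representation. Averaging is a simple stand-in chosen here for brevity, not the architecture used in the study, and the sample vectors and the function names `load_vectors` and `tweet_vector` are invented.

```python
import io


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines (the GloVe text layout) into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def tweet_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Invented two-dimensional sample standing in for a real pre-trained file.
sample = io.StringIO("hay 0.2 0.4\nfever 0.4 0.0\nsneezy 0.6 0.8\n")
vectors = load_vectors(sample)
print(tweet_vector(["hay", "fever", "sucks"], vectors))
```

Tokens missing from the vocabulary (here ‘sucks’) are simply skipped, which is one reason domain-specific embeddings can help with informal Twitter language.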


10. Is Twitter suited for this research as it might show more extreme emotions and not the general feeling?

It is likely that users express more extreme feelings regarding their condition on Twitter. At the same time, a lack of Hay Fever mentions on Twitter is also indicative of its potential low occurrence or mild severity during a specific time period. On the other hand, strong emotions expressed in tweets can signify an exacerbation of the problem in certain areas, and can be used to instantly identify e.g. Pollen Allergy Hot Zones. The geo-location feature of tweets also highlights the advantage of Twitter for real-time health surveillance purposes.

Overall, social media data is particularly useful in emergency situations, e.g. disaster response and outbreak detection, where a timely response is of the highest importance. Thus, it is mostly considered a complementary source of information to traditional statistics, which prove limited especially in critical circumstances. After all, social media is the largest and the most dynamic data set about human behaviour and real-world events (//bigdata-madesimple.com/top-5-social-media-scraping-tools-in-the-market/).


11. What weather aspect (e.g. temperature, humidity, wind speed) has the strongest correlation with the number of tweets?

It depends on the city. Also, not all of the weather parameters proved statistically significant (some had p > 0.05). Out of the significant ones, moderate positive correlations with the volume of hay fever relevant tweets were found for Evaporation (mm) and Sunshine (hrs), and a moderate negative correlation for Humidity (%). Based on that, overall dry conditions tended to co-occur with a higher number of hay fever mentions on Twitter, which is intuitive. The results are for Melbourne, which is considered the world’s allergy capital.
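The kind of correlation described here can be sketched as follows, assuming Pearson's r (the answer does not state which coefficient was used). The daily humidity and tweet counts are invented for illustration and are not data from the study; the sketch also leaves out the significance test (p-value).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented daily values, not data from the study.
humidity_pct = [80, 70, 65, 50, 45]
tweet_counts = [2, 4, 3, 8, 9]

print(round(pearson_r(humidity_pct, tweet_counts), 2))
```

A negative value, as these made-up numbers produce, corresponds to the pattern reported for Humidity: fewer hay fever tweets on more humid days.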


Below are the results from the study Text Mining and Real-Time Analytics of Twitter Data: A Case Study of Australian Hay Fever Prediction (//link.springer.com/chapter/10.1007/978-3-030-01078-2_12).

See also graph 4 in the paper.


12. Did you find differences between cities?

There were slight differences between the cities. Still, the sample size must be considered while making comparisons (i.e. Melbourne ~2k, Sydney ~1k, Brisbane ~200). Also, Melbourne had the largest proportion of statistically significant correlations, which makes it the most reliable.


Below are the results from the study Text Mining and Real-Time Analytics of Twitter Data: A Case Study of Australian Hay Fever Prediction (//link.springer.com/chapter/10.1007/978-3-030-01078-2_12).

The size and colour of the dots on the map represent the magnitude and direction of the correlation between the volume of tweets about hay fever and the Temperature variable.


References:

Rong, J., Michalska, S., Subramani, S., Du, J. and Wang, H., 2019. Deep learning for pollen allergy surveillance from twitter in Australia. BMC medical informatics and decision making, 19(1), p.208. //doi.org/10.1186/s12911-019-0921-x


Subramani, S., Michalska, S., Wang, H., Whittaker, F. and Heyward, B., 2018, October. Text mining and real-time analytics of twitter data: A case study of australian hay fever prediction. In International Conference on Health Information Science (pp. 134-145). Springer, Cham. //doi.org/10.1007/978-3-030-01078-2_12


Australian Government Bureau of Meteorology //www.bom.gov.au/


The research was supported by “Australian Government Research Training Program Scholarship”.