Stewart A, Denecke K
Using ProMED-Mail and MedWorm blogs for cross-domain pattern analysis in epidemic intelligence.
Stud Health Technol Inform.
In this work we motivate the use of user-generated content from medical blogs for gathering facts about disease reporting events to support biosurveillance investigation. The characteristics of blogs, namely noise and data abundance, make the extraction of such events more difficult. We address the problem of automatically inferring extraction patterns for disease reporting events in this noisier setting. The sublanguage used in outbreak reports is exploited to align with the sequences of disease reporting sentences in blogs. Based on our Cross Domain Pattern Analysis Framework, experimental results show that Phrase-Level sequences tend to produce more overlap across the domains than Word-Level sequences. The cross-domain alignment process is effective at filtering noisy sequences from blogs and extracting good candidate sequence patterns from an abundance of text.
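The idea of cross-domain filtering can be illustrated with a minimal sketch. It is not the authors' framework: it assumes that exact matching of word-level trigrams stands in for the alignment step, and the function names (`ngrams`, `sequence_counts`, `cross_domain_candidates`) and the example sentences are hypothetical. Blog sequences that also occur in outbreak-report sentences are kept as candidate patterns; everything else is treated as noise.

```python
from collections import Counter


def ngrams(sentence, n=3):
    """Return the word-level n-grams of a sentence (lowercased, whitespace-split)."""
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sequence_counts(sentences, n=3):
    """Count every n-gram sequence across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence, n))
    return counts


def cross_domain_candidates(report_sentences, blog_sentences, n=3):
    """Keep only blog sequences that also appear in outbreak reports.

    Assumption: exact n-gram overlap is a stand-in for the alignment
    described in the abstract, not the published method.
    """
    report = sequence_counts(report_sentences, n)
    blog = sequence_counts(blog_sentences, n)
    shared = report.keys() & blog.keys()
    # Most frequent blog sequences first; ties broken alphabetically.
    return sorted(shared, key=lambda seq: (-blog[seq], seq))


# Hypothetical example sentences, not taken from ProMED-Mail or MedWorm.
reports = [
    "ten cases of cholera were reported in the region",
    "two cases of measles were reported last week",
]
blogs = [
    "my cousin says five cases of cholera were reported near us",
    "honestly the weather was awful last weekend",
]

candidates = cross_domain_candidates(reports, blogs, n=3)
for seq in candidates:
    print(" ".join(seq))
```

In this toy input only the first blog sentence shares sequences with the reports, so the off-topic sentence contributes no candidates. A phrase-level variant would replace the whitespace tokens with phrase chunks before building the sequences.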