Can Generative Language Models Learn to Generate Valid Statements from Premises?
- Investigated whether a state-of-the-art (SOTA) pretrained generative language model (Google's T5) can learn to generate valid statements from two premise statements (e.g., “steel is a metal” + “metal is a thermal conductor” -> “steel is a thermal conductor”); a minimal fine-tuning sketch follows this list.
- We used the QASC dataset because it is in natural language and its premise–conclusion pairs are explicitly annotated.
- We found that the model generates simple statements well but fails on more complex ones, such as those requiring monotonicity reasoning or rephrasing.
- The project is implemented in Python and PyTorch.
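
A minimal sketch of the fine-tuning setup described above, assuming the Hugging Face `transformers` interface on top of PyTorch; the `combine:` task prefix, model size, and hyperparameters are illustrative assumptions, not the project's exact configuration.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.train()

# A QASC-style premise pair and its gold conclusion (example from above).
source = "combine: steel is a metal. metal is a thermal conductor."
target = "steel is a thermal conductor."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One sequence-to-sequence training step with cross-entropy loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: generate a candidate conclusion for the premise pair.
model.eval()
with torch.no_grad():
    generated = model.generate(**inputs, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```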
Hybrid Polarity Classifier for Biomedical Events
- Compared an LSTM-based classifier with a linguistic-knowledge-informed (rule-based) classifier for polarity detection of biomedical events.
- Integrated the LSTM classifier into Reach (an information-extraction system for biomedical publications, implemented in Scala).
- Implemented a hybrid classifier that combines the LSTM and linguistic classifiers. The router counts negation words in the text: if negation words are frequent, it dispatches to the LSTM classifier; otherwise it falls back to the linguistic classifier (see the sketch after this list).
- This project is implemented in Python, DyNet, and Scala.
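
A minimal sketch of the routing rule described above; the negation lexicon, the threshold, and the classifier callables (`lstm_clf`, `linguistic_clf`) are hypothetical placeholders, not the project's actual values.

```python
# Hypothetical negation-cue lexicon; the real system's list differs.
NEGATION_CUES = {"not", "no", "never", "without", "fails", "unable",
                 "lack", "absence", "neither", "nor"}
NEGATION_THRESHOLD = 2  # assumed cutoff for "many" negation words

def count_negations(text: str) -> int:
    """Count negation cue words in a whitespace-tokenized text."""
    return sum(token in NEGATION_CUES for token in text.lower().split())

def hybrid_polarity(text: str, lstm_clf, linguistic_clf) -> str:
    """Route to the LSTM classifier when negation cues are frequent;
    otherwise fall back to the linguistic (rule-based) classifier."""
    if count_negations(text) >= NEGATION_THRESHOLD:
        return lstm_clf(text)       # better at compositional negation
    return linguistic_clf(text)     # precise on simple, low-negation cases
```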