Shankai Yan

National Institutes of Health | NIH ·  National Center for Biotechnology Information

Doctor of Philosophy

About

23
Publications
11,834
Reads
474
Citations
Additional affiliations
September 2015 - present
City University of Hong Kong
Position
  • Research Assistant
Description
  • CS1102 Introduction to Computer Studies: This course aims to provide an introduction to computing concepts, skills, and the technologies behind the Internet. Students are introduced to software tools, web content scripting, and basic computer programming.
Education
September 2015 - August 2018
City University of Hong Kong
Field of study
  • Computer Science
September 2012 - January 2015
South China University of Technology
Field of study
  • Business Intelligence
September 2008 - June 2012
South China University of Technology
Field of study
  • Software Engineering

Publications (23)
Article
The study aims at developing a neural network model to improve the performance of Human Phenotype Ontology (HPO) concept recognition tools. We used the terms, definitions, and comments about the phenotypic concepts in the HPO database to train our model. The document to be analyzed is first split into sentences and annotated with a base method to g...
Article
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for...
Conference Paper
Full-text available
In recent years, big data deluges have resulted in exciting data science opportunities. In particular, there is always a desire to extract the most from different data sources. To address it, a promising and recurring task is to perform feature selection and feature extraction. Specifically, the objective is to obtain the non-redundant and informat...
Article
DNA computing is still in its infancy. Multiple aspects of DNA computing have been studied, but most of the research results have not been applied in practice. It has been proved to exhibit high data storage density and support efficient random data access. It also shows the potential to provide alternative facilities for...
Article
Motivation Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to i...
Preprint
Full-text available
The COVID-19 pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the p...
Preprint
Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify bio...
Article
Full-text available
Cyberbullying and hate speech are common issues in online etiquette. To tackle this problem of wide concern, we propose a text classification model based on convolutional neural networks for the de facto verbal aggression dataset built in our previous work and observe significant improvement, thanks to the proposed 2D TF-IDF features instead of...
Article
Full-text available
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings—which involve the learning of v...
Preprint
Full-text available
Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools...
Article
Full-text available
The recent advances in DNA sequencing technology, from first-generation sequencing (FGS) to third-generation sequencing (TGS), have constantly transformed the genome research landscape. Their data throughput is unprecedented, severalfold higher than that of past technologies. DNA sequencing technologies generate sequencing data that are big, sparse,...
Article
Full-text available
Motivation: Biomedical event extraction is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of massive information influx from literature. Limited by the event context, the existing event detectio...
Preprint
Inspired by the success of the General Language Understanding Evaluation benchmark, we introduce the Biomedical Language Understanding Evaluation (BLUE) benchmark to facilitate research in the development of pre-training language representations in the biomedicine domain. The benchmark consists of five tasks with ten datasets that cover both biomed...
Preprint
Motivation: Biomedical event detection is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of massive information influx from literature. Limited by the feature context, the existing event detectio...
Article
Full-text available
The early detection of cancers has the potential to save many lives. Multianalyte blood test results could provide valuable information for early cancer detection. A recent attempt to detect multiple cancer types from multianalyte blood test results by logistic regression and random forest was demonstrated to be successful in 2018 \cite{pmid29348365...
Article
The Gene Expression Omnibus (GEO) repository harbours an exponentially increasing number of gene expression studies. The expression data, as well as the related metadata, provides an abundant resource for knowledge discovery. Each study in GEO focuses on the gene expression perturbation of a specific subject (e.g. gene, drug, and disease). The iden...
Article
Full-text available
Transcription factors (TFs) are the major components of human gene regulation. In particular, they bind to specific DNA sequences and regulate neighboring genes in different tissues at different developmental stages. Non-synonymous single nucleotide polymorphisms in their protein-coding sequences could result in undesired consequences in humans. Th...
Article
Full-text available
Motivation: Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as...
Conference Paper
Full-text available
Verbal aggression and cyberbullying are issues of wide concern in netiquette. In this article, we introduce a text mining system that can detect whether a certain paragraph contains aggressive sentiment, and demonstrate its performance with different classification models. In addition, it is observed that our system works well on both our manu...
Article
Full-text available
Understanding genome-wide protein-DNA interactions forms the basis for further focused studies. In particular, chromatin immunoprecipitation (ChIP) with sequencing (ChIP-Seq) technology enables us to measure the genome-wide occupancy of a DNA-binding protein of interest in vivo. Multiple ChIP-Seq runs thus hold the potential for us to decip...
Conference Paper
Full-text available
In this paper, we propose a new algorithm for efficiently mining association rules in a corpus. Compared with classical transactional association rule mining problems, a corpus contains a large number of items and, moreover, far more itemsets, so traditional association rule mining algorithms cannot handle a corpus efficiently. To...

Projects (2)
Project
Since the 1990s, the whole genomes of many species have been sequenced by dedicated genome sequencing projects. In 1995, Haemophilus influenzae became the first free-living organism to be sequenced, by the Institute for Genomic Research. In 1996, the first eukaryotic genome (Saccharomyces cerevisiae) was completely sequenced. In 2000, the first plant genome, Arabidopsis thaliana, was sequenced by the Arabidopsis Genome Initiative. In 2006, the Human Genome Project (HGP) announced its completion. Following the HGP, the Encyclopedia of DNA Elements (ENCODE) project was started, revealing a large number of functional elements in the human genome in 2011. The drastically decreasing cost of sequencing also enabled the 1000 Genomes Project and the Roadmap Epigenomics Project to be carried out; their results were published in 2012 and 2015, respectively. Nonetheless, the massive genomic data generated by these projects pose an unforeseen challenge for large-scale data analysis. In our group, we aim to tackle these challenges with scalable machine learning and data mining methods.