To facilitate the process of tailor-making a deep neural network for exploring the dynamics of genomic DNA, we have developed a hands-on package called ezGeno. ezGeno automates the search process of various parameters and network structures and can be applied to any kind of 1D genomic data. Combinations of multiple abovementioned 1D features are also applicable.... The ezGeno package can be freely accessed at https://github.com/ailabstw/ezGeno.
Bringing Machine Intelligence To Life
One main interest of c4Lab is annotating variants and DNA sequences in the human genome. We build machine learning and deep learning models to predict variant pathogenicity, functional regions (e.g., enhancers, TFBSs, and eQTLs), and the effects of sequence changes on them.
We developed an efficient and effective GWAS method to detect epistasis for discovering sophisticated pathogenesis, which is especially important for complex diseases such as Alzheimer’s disease (AD).... GenEpi is a computational package that uncovers epistasis associated with phenotypes using the proposed machine learning approach. GenEpi identifies both within-gene and cross-gene epistasis through a two-stage modeling workflow. In both stages, GenEpi adopts two-element combinatorial encoding when producing features and constructs the prediction models by L1-regularized regression with stability selection. On simulated data, GenEpi outperforms other widely used methods in detecting the ground-truth epistasis. For real data, this study uses AD as an example to demonstrate the capability of GenEpi in finding disease-related variants and variant interactions that show both biological meaning and predictive power.
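As a rough illustration of what a two-element combinatorial encoding could look like, the sketch below one-hot encodes each SNP genotype and then forms a feature for every pair of genotype indicators. The genotype labels, SNP names, and function names are assumptions made for this example and are not taken from the GenEpi package; the L1-regularized regression and stability selection steps are not shown.

```python
from itertools import combinations, product

# Assumed genotype labels for a biallelic SNP (illustrative only).
GENOTYPES = ("AA", "AB", "BB")

def one_hot(genotype):
    """Return indicator features for a single SNP genotype."""
    return {g: int(genotype == g) for g in GENOTYPES}

def pairwise_features(sample):
    """Build interaction features for every SNP pair in one sample.

    `sample` maps a SNP name to its genotype label. Each feature is the
    product of two genotype indicators, so it equals 1 only when both
    SNPs carry the named genotypes.
    """
    features = {}
    for snp_a, snp_b in combinations(sorted(sample), 2):
        hot_a, hot_b = one_hot(sample[snp_a]), one_hot(sample[snp_b])
        for g_a, g_b in product(GENOTYPES, repeat=2):
            name = f"{snp_a}_{g_a}*{snp_b}_{g_b}"
            features[name] = hot_a[g_a] * hot_b[g_b]
    return features

# Hypothetical sample with three SNPs.
sample = {"rs1": "AB", "rs2": "BB", "rs3": "AA"}
features = pairwise_features(sample)
active = sorted(name for name, value in features.items() if value == 1)
print(len(features), active)
```

Each SNP pair contributes nine candidate features, of which exactly one is active per sample. A sparse model such as L1-regularized regression would then be one way to keep only the informative pairs.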
For organisms without a reference transcriptome, de novo transcriptome assembly must be carried out prior to quantification. This study investigates how assembly quality affects the performance of quantification based on de novo transcriptome assembly. ... We examined over-extended and incomplete contigs and demonstrated that assembly completeness has a strong impact on the estimation of contig abundance. We then investigated how the quantifiers behave under sequence ambiguity, which may be originally present in the transcriptome or accidentally produced by assemblers. The results suggest that the quantifiers often over-estimate the expression of family-collapse contigs and under-estimate the expression of duplicated contigs. For organisms without a reference transcriptome, it remains challenging to detect inaccurate estimation on family-collapse contigs. In contrast, we observed that under-estimation on duplicated contigs can be flagged by analyzing the read proportion of estimated abundance (RPEA) of contigs in the connected component inferred by the quantifiers. In addition, we suggest that quantification at the connected-component level is more accurate than quantification at the sequence level.
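The idea of grouping contigs into connected components and looking at each contig's share of reads can be sketched as follows. This is only an illustration under stated assumptions: contigs are linked when at least one read maps to both, and the per-contig proportion is taken as the contig's estimated read count divided by the total of its component. The study's exact RPEA definition may differ, and all names and numbers here are hypothetical.

```python
from collections import defaultdict

def connected_components(read_to_contigs):
    """Group contigs that share at least one multi-mapped read (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for contigs in read_to_contigs.values():
        root = find(contigs[0])
        for other in contigs[1:]:
            other_root = find(other)
            if other_root != root:
                parent[other_root] = root

    groups = defaultdict(set)
    for contig in list(parent):
        groups[find(contig)].add(contig)
    return list(groups.values())

def read_proportions(estimated_reads, components):
    """Return each contig's share of estimated reads within its component."""
    shares = {}
    for component in components:
        total = sum(estimated_reads.get(c, 0.0) for c in component)
        for contig in component:
            reads = estimated_reads.get(contig, 0.0)
            shares[contig] = reads / total if total else 0.0
    return shares

# Hypothetical read mappings and quantifier estimates.
read_to_contigs = {"r1": ["c1", "c2"], "r2": ["c2"], "r3": ["c3"], "r4": ["c1"]}
estimated_reads = {"c1": 30.0, "c2": 10.0, "c3": 20.0}

components = connected_components(read_to_contigs)
shares = read_proportions(estimated_reads, components)
component_totals = [sum(estimated_reads[c] for c in comp) for comp in components]
print(components, shares, component_totals)
```

Summing estimates over a component, as in `component_totals`, corresponds to reporting expression at the connected-component level rather than per sequence.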
Photo: 謝郁震
The Mikado pheasant (帝雉, Syrmaticus mikado) is a nearly endangered species indigenous to high-altitude regions of Taiwan. ... We completed the draft genome of the Mikado pheasant, which consists of 1.04 Gb of DNA and 15,972 annotated protein-coding genes. The Mikado pheasant displays expansion and positive selection of genes related to features that contribute to its adaptive evolution, such as energy metabolism, oxygen transport, hemoglobin binding, radiation response, immune response, and DNA repair. To investigate the molecular evolution of the major histocompatibility complex (MHC) across several avian species, 39 putative genes spanning 227 kb on a contiguous region were annotated and manually curated. The MHC loci of the pheasant revealed a high level of synteny, several rapidly evolving genes, and inverted regions compared to the same loci in the chicken. The complete mitochondrial genome was also sequenced, assembled, and compared against four long-tailed pheasants. The results from molecular clock analysis suggest that ancestors of the Mikado pheasant migrated from the north to Taiwan about 3.47 million years ago.
Bioinformatics has played an important role in annotating the human genome since its draft was first announced in 2001. As sequencing costs have fallen dramatically with the advance of next-generation sequencing technology, the need to annotate personal genomes precisely is right around the corner. ... This talk will start with the success of using structural bioinformatics to predict how a single nucleotide variation changes protein-DNA binding affinity. Next, the concept of deep learning will be introduced, along with how it has been used to annotate epigenomes and to explore the roles of cis-regulatory sequence variations. As the scale and complexity of personal genomic data analysis increase rapidly, deep learning will become one of the effective ways to associate personal genomic variations with diseases or drug responses. The talk will address the current status and challenges of using deep learning to annotate personal genomes, a topic that deserves more attention when designing bioinformatics education in the near future.
Recent advances in sequencing technology have opened a new era in RNA studies. Novel types of RNAs such as long non-coding RNAs (lncRNAs) have been discovered by transcriptomic sequencing and some lncRNAs have been found to play essential roles in biological processes.... However, only limited information is available for lncRNAs in Drosophila melanogaster, an important model organism. Therefore, the characterization of lncRNAs and identification of new lncRNAs in D. melanogaster is an important area of research. Moreover, there is an increasing interest in the use of ChIP-seq data (H3K4me3, H3K36me3 and Pol II) to detect signatures of active transcription for reported lncRNAs. In this study, we have developed a computational approach to identify new lncRNAs from two tissue-specific RNA-seq datasets, prepared with the poly(A)-enriched and the ribo-zero methods, respectively. We identified 462 novel lncRNA transcripts, which we combined with 4137 previously published lncRNA transcripts into a curated dataset. We then utilized 61 RNA-seq and 32 ChIP-seq datasets to improve the annotation of the curated lncRNAs with regard to transcriptional direction, exon regions, classification, expression in the brain, possession of a poly(A) tail, and presence of conventional chromatin signatures. Furthermore, we used 30 time-course RNA-seq datasets and 32 ChIP-seq datasets to investigate whether the lncRNAs reported by RNA-seq have active transcription signatures. The results showed that more than half of the reported lncRNAs did not have chromatin signatures related to active transcription. To clarify this issue, we conducted RT-qPCR experiments and found that ~95.24% of the selected lncRNAs were truly transcribed, regardless of whether they were associated with active chromatin signatures.
The oriental fruit fly is a highly destructive pest of fruit wherever it occurs, and some populations have developed insecticide resistance. The underlying basis of such phenomena can involve complex interactions of multiple genes. ... This study aims to identify the specific genes involved in resistance mechanisms.
Salamanders are unique among vertebrates in their ability to completely regenerate amputated limbs through the mediation of blastema cells located at the stump ends. ... This study aims to uncover the transcriptional mechanisms behind salamander limb regeneration.
Predicting the binding sites of a transcription factor in the genome is an important but challenging issue in studying gene regulation. PiDNA aims at first constructing reliable position weight matrices (PWMs) by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using ... the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
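Once a PWM is available, using it to rank candidate sites is a standard step, sketched below. The sketch does not reproduce how PiDNA builds its PWMs from structure models; the base counts, the uniform background, the pseudocount, and the function names are all assumptions chosen for illustration.

```python
import math

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into a log-odds PWM."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score_site(pwm, site):
    """Sum the log-odds of the observed base at each PWM position."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_site(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    windows = [
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)

# Made-up counts for a four-position motif resembling TATA.
counts = [
    {"T": 8, "A": 1},
    {"A": 8, "T": 1},
    {"T": 8, "C": 1},
    {"A": 8, "G": 1},
]
pwm = log_odds_pwm(counts)
score, start = best_site(pwm, "GGCTATAGC")
print(round(score, 2), start)
```

Ranking many candidate sites by `score_site` is the kind of operation that would let a PWM pick validated binding sites out of a pool of random ones.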
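Can machines be as intelligent as humans? In recent years, with the massive accumulation of digitized data, machine learning, a subfield of artificial intelligence (AI), has developed rapidly. Machines can now learn by themselves from large amounts of data and "appear" more intelligent than before. How do we get machines to learn on their own? ... Professor Chen, who specializes in bioinformatics, will introduce the definition and evolution of machine learning and guide us through its diverse applications, showing how, beyond search engines and smartphones, machine learning supports basic medical research: exploring the associations between genes and various diseases, which will play a key role in the coming era of personalized medicine.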
Insecticide resistance has recently become a critical concern for the control of many insect pest species. Genome sequencing and global quantification of gene expression through transcriptome analysis can provide useful information relevant to this challenging problem. ... This study aims to use whole-transcriptome analysis based on de novo assembly to understand resistance mechanisms at the molecular level.
Previous studies have used target genes shared by two TFs as a clue to infer TF-TF interactions. However, target genes with low binding affinity are frequently missed in experimental data, especially when a single strict threshold is employed. ... This study aims at improving the accuracy of inferring TF-TF interactions.
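The shared-target idea mentioned above can be illustrated with a simple overlap test. The sketch below is a generic baseline, not the method proposed in this study: it asks how surprising the number of shared target genes is under a hypergeometric null. The gene names, set sizes, and function name are hypothetical.

```python
from math import comb

def overlap_p_value(targets_a, targets_b, n_genes):
    """P(overlap >= observed) when targets_b is drawn at random from n_genes."""
    observed = len(targets_a & targets_b)
    size_a, size_b = len(targets_a), len(targets_b)
    total = comb(n_genes, size_b)
    tail = sum(
        comb(size_a, i) * comb(n_genes - size_a, size_b - i)
        for i in range(observed, min(size_a, size_b) + 1)
    )
    return tail / total

# Hypothetical target sets in a genome of 100 genes.
tf_a_targets = {f"gene{i}" for i in range(0, 10)}
tf_b_targets = {f"gene{i}" for i in range(5, 15)}
tf_c_targets = {f"gene{i}" for i in range(50, 60)}

p_shared = overlap_p_value(tf_a_targets, tf_b_targets, n_genes=100)
p_disjoint = overlap_p_value(tf_a_targets, tf_c_targets, n_genes=100)
print(p_shared, p_disjoint)
```

Because the target sets themselves depend on where the binding threshold is placed, a strict cutoff that drops low-affinity targets shrinks the observed overlap and can hide a real TF-TF relationship.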
DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind to specific sequences in the genome to initiate many important biological functions. ...This study aims at investigating the possibility of predicting the DNA sequences bound by DNA-binding proteins from the proteins' unbound structures (structures of the unbound state).
Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually widely separated in sequences. ...This paper proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to greatly reduce the mining cost.
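To make the notion of a motif with wildcard regions concrete, the sketch below converts a PROSITE-style pattern into a regular expression and matches it against a sequence. It only shows what such a motif looks like once found; it is not the WildSpan mining algorithm or its pruning strategies, and the pattern, sequence, and function name are made up for illustration.

```python
import re

# One pattern element: a residue, a residue class, or the wildcard x,
# optionally followed by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as C-x(2,4)-C into a regex."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = match.groups()
        token = "." if symbol == "x" else symbol
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)

# Hypothetical motif: two cysteines and a histidine separated by wildcards.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-H")
hit = re.search(regex, "AACAAACGGGHAA")
print(regex, hit.group(0) if hit else None)
```

The wildcard runs in the middle of the pattern are what make exhaustive mining expensive, since the number of candidate patterns grows quickly with the allowed gap lengths.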
We propose a method for discovering TFBSs, especially gapped motifs. We use ChIP-chip data to judge the binding strength of a TF to a putative target promoter and use orthologous sequences from related species to judge the degree of evolutionary conservation of a predicted TFBS.
More and more disordered regions have been discovered in protein sequences, and many of them are found to be functionally significant. Previous studies reveal that disordered regions of a protein can be predicted by its primary structure, the amino acid sequence. ... Recent studies further show that employing evolutionary information such as position specific scoring matrices (PSSMs) improves the prediction accuracy of protein disorder. As more and more machine learning techniques have been introduced to protein disorder detection, extracting more useful features with biological insights attracts more attention.
Protein sequence clustering has been widely exploited to facilitate in-depth analysis of protein functions and families. For some applications of protein sequence clustering, it is highly desirable that a hierarchical structure, also referred to as dendrogram, which shows how proteins are clustered at various levels, is generated. ... In this paper, the design of a novel incremental clustering algorithm aimed at generating summarized dendrograms for analysis of protein databases is described. The proposed incremental clustering algorithm employs a statistics-based model to summarize the distributions of the similarity scores among the proteins in the database and to control formation of clusters.
R304 Department of Biomechatronics
(new building), National Taiwan University