Introduction to InsideDNA bioinformatics platform for tools & analysis explains the basics of working with the InsideDNA platform. While it is very easy to work with our application and most things are self-evident, there are some tricks you may not discover on your first log-in. With this tutorial, we want to make sure you know all these tricks and can get the most out of the InsideDNA functionality. We therefore strongly recommend taking a brief look at this tutorial before diving into genome crunching and sequence analysis.
BBMerge helps merge paired-end reads of ancient DNA (aDNA) by error-correcting the reads. Analyses of ancient DNA samples help study the processes of evolution and analyze population genetics. aDNA contains postmortem mutations that result in sequencing errors. Next-generation sequencing methodologies retrieve DNA sequences and improve the quality of the overlapping bases.
RAD Tags Enrichment in RADSeq for Next-generation Population Genetics helps you check whether RAD tags are enriched with reads in a public RAD-Seq sample. RADSeq is a way to discover thousands of sequenced markers in any organism of choice. RADSeq can be applied to genomes of any size, enabling studies of non-model organisms and diverse populations. RAD tags enrichment amplifies reads of interest.
A guide to using HISAT2 to align RNA-seq reads against a human population and to visualize the mapped reads. Methods for gene expression analysis with HISAT2 are explained.
Differential expression analysis of RNA sequencing data using Cufflinks facilitates fast and principled analysis of complex data from RNA-seq experiments. The software not only provides comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data but is also suited to gene discovery, identifying new splice variants, and comparing gene and transcript expression under two or more conditions.
The Admixture tool helps infer population structure and assign individuals to their populations. Technological advances in genetic analysis and resequencing have made genetic data samples readily available for genealogical purposes. An individual’s geographical origins can be inferred by analyzing their genetic ancestry.
Sometimes isolating the reads of one specific organism may seem impossible when the DNA of multiple organisms is mixed together. BBSplit is a derivative of BBMap that is useful for read binning and refining. This metagenomics tool separates the DNA of a single organism from the mixture and makes it detectable. Binning Reads using BBSplit- A Metagenomics Tool is a guide that tells you how to refine metagenomic reads from an ancient human sample.
Variant annotation with SnpEff predicts the effects of variants on genes. Variant annotation is widely recognized as a key step in the analysis of genome sequencing data. It is essential to get every part of this crucial step right in order to focus on disease-relevant DNA variants. The aim of our study is to annotate genetic variants and answer questions about genes, intergenic spaces, changes in coding sequences, creation of stop codons, and frame shifts that result in loss of protein product function. The sample used for the study is a set of filtered SNPs from human tumor tissue.
High-throughput sequencing technology is rapidly becoming the standard method for measuring RNA expression levels (aka RNA-seq). RNA-seq enables the detailed identification of gene isoforms, translocation events, nucleotide variations and post-transcriptional base modifications. One of the main goals of these experiments is to identify the genes that are differentially expressed between two or more conditions. In this tutorial, we explain a differential expression analysis pipeline with RSEM for transcriptomes assembled de novo. With it you can learn how to compare and analyze the transcriptomes of several samples. For our study here we will use two samples obtained from two ganglia of the medicinal leech.
Analysis of ancient DNA is particularly tricky. Unfortunately, there are only a few bioinformatics tools that can tackle ancient DNA samples and produce results which account for DNA damage. In this tutorial we will work with a rare example of such a tool, MapDamage. MapDamage quantifies DNA damage patterns among ancient DNA sequencing reads generated by next-generation sequencing platforms. The model enables rescaling of base quality scores in SAM files according to their probability of being damaged. This is a crucial step for correct variant calling in later stages of sequence analysis.
BEDTools is an extensive suite of utilities for the analysis of genomic features. Several common genomic file formats, such as BAM, GFF, GTF, VCF and, most frequently, BED, are used as input for the BEDTools utilities. These utilities allow one to perform basic computation on and comparison of genomic features. Since input genomic features are represented as genomic intervals, BEDTools can intersect, merge, count, complement and shuffle genomic intervals from multiple files. In this tutorial, we will use BEDTools to study genome methylation and test a hypothesis about methylation within and outside CpG islands.
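To make the interval operations concrete, here is a minimal Python sketch of merging and intersecting intervals. It is not BEDTools' implementation: it assumes BED-style coordinates (0-based start, exclusive end) on a single chromosome, and the function names and example coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end): 0-based start, exclusive end, as in BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or directly adjacent intervals into single intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping part of every pair of intervals taken from a and b."""
    overlaps = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # an empty overlap is not reported
                overlaps.append((start, end))
    return sorted(overlaps)


# Made-up coordinates, for illustration only.
cpg_islands = merge([(100, 200), (150, 300), (500, 600)])
methylated = [(180, 220), (550, 650)]
print(cpg_islands)                         # [(100, 300), (500, 600)]
print(intersect(cpg_islands, methylated))  # [(180, 220), (550, 600)]
```

The pairwise loop in `intersect` is quadratic and only suitable for small examples; dedicated tools use sorted sweeps or interval indexes for genome-scale files.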
RNA-seq de novo assembly is one of the most frequent types of sequence analysis in biology and bioinformatics. However, just like a complete genome assembly, RNA-seq assembly is not trivial and often requires a large amount of RAM and many CPUs. In this tutorial, we explain how to use one of the most popular RNA-seq assemblers, Trinity. For the assembly pipeline and interpretation of the results we use transcripts from a ganglion of the medicinal leech. In the original study, RNA-seq data helped to understand how gene expression changes along the central nervous system of the species and to relate location to gene behavior.
Information on genetic variants in a sample (that is, the differences between a sample and a reference genome) is generally stored in the form of VCF files. Unfortunately, the structure of VCF files is not standardized; VCF files can include various characteristics of genetic variants; files can be sorted in different ways; and their headers can be slightly different. Such differences may complicate the analysis of genetic variants, especially when you use VCF files obtained from your colleagues or from a web database and you don’t have detailed information about how the VCF file was produced. In this tutorial, we will discuss some of the major headaches of working with VCF files and how to resolve them with GATK and Picard. We will filter variants in files downloaded from the European Nucleotide Archive, which contain information on genetic variants of a human tumor.
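As a rough illustration of why headers matter, the following Python sketch splits a VCF file into meta-information lines (starting with `##`), the column header line (starting with `#`) and variant records, and unpacks the INFO column. It does not use GATK or Picard; the function names and the example record are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple, Union

# The eight fixed columns that open the header line of a VCF file.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def split_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Separate '##' meta lines, the '#CHROM' header line and the variant records."""
    meta: List[str] = []
    header: List[str] = []
    records: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            header = line.lstrip("#").split("\t")
        elif line:
            # Pair each tab-separated value with its column name.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Turn 'DP=14;DB' into {'DP': '14', 'DB': True}; '.' means no INFO fields."""
    if info == ".":
        return {}
    fields: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # keys without '=' are flags
    return fields


# A made-up two-line header and one made-up record, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tG\tC\t50\tPASS\tDP=14;DB",
]
meta, header, records = split_vcf(example)
print(header[:8] == FIXED_COLUMNS)
print(parse_info(records[0]["INFO"]))
```

Comparing the `meta` lines and INFO keys of two files in this way is one quick check of whether they were produced with compatible settings before combining them.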
Several factors influence the performance of de novo genome assemblers: read coverage, GC-content, repeat fraction, etc. This report aims to elucidate the effect of read coverage on the performance of de novo assemblers, while the other aspects will be covered in future InsideDNA reports. In this study, we benchmark seven popular assemblers: SPAdes, Velvet, SOAP2, ABySS, MaSuRCA, DISCOVAR, and Newbler.
InsideDNA has 1000+ bioinformatics tools at your service to help you process your data quickly and with no extra effort. Here you can find detailed instructions on how to search for tools and find exactly what you want.
Here we tell you how to share tool settings or your results with a single click.
Bioinformaticians often have to manage large text files containing reads, genome sequences, alignments, genetic variants, and so on. Sometimes the files you receive are not quite (or indeed not at all) in the format required by the tools used in downstream analysis. In such cases you have to rearrange the data in your files: change the delimiters, reorder the columns or, quite likely, discard some unnecessary values. Manual processing of large files takes plenty of time, so bioinformaticians need some handy scripts for formatting files. In this tutorial we cover some of the most useful UNIX commands every bioinformatician (or biologist learning bioinformatics) should know.
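The kind of rearrangement described here (changing the delimiter, reordering columns and dropping unneeded ones) can be sketched in a few lines of Python. This is only an illustration written with Python's standard `csv` module, not the UNIX commands the tutorial covers; the function name `reformat` and the example rows are hypothetical.

```python
import csv
import io
from typing import Iterable, List


def reformat(lines: Iterable[str], keep: List[int],
             in_delim: str = ",", out_delim: str = "\t") -> str:
    """Re-delimit rows and keep only the listed columns, in the listed order."""
    reader = csv.reader(lines, delimiter=in_delim)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=out_delim, lineterminator="\n")
    for row in reader:
        # Column indexes are 0-based; columns not named in `keep` are discarded.
        writer.writerow([row[i] for i in keep])
    return buffer.getvalue()


# Made-up comma-separated rows: chromosome, start, end, gene name, strand.
rows = ["chr1,100,200,geneA,+", "chr2,300,400,geneB,-"]
print(reformat(rows, keep=[3, 0, 1]), end="")
# geneA	chr1	100
# geneB	chr2	300
```

For real files, passing an open file handle as `lines` streams the input row by row, so the whole file never has to fit in memory.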
The growing number of metagenomic studies in medicine and environmental sciences is creating an increasing demand on the computational infrastructure designed to analyze these very large datasets. Often, ultra-fast and precise taxonomic classifiers compromise on sensitivity (i.e., the number of reads correctly classified). CLARK (or, in fact, a suite of three tools: CLARK, CLARK-S and CLARK-L) is software that can classify short reads with high precision, high sensitivity and high speed at the same time. In this tutorial, we explain how to run CLARK on metagenomic samples against a large NCBI reference database. Importantly, we also introduce a very easy-to-use interface which allows biologists with no technical skills to obtain the desired result quickly and effortlessly.
The phenotype of an organism depends not only on its genomic sequences, but also on the activity of its genes or, in other words, on gene expression levels. Gene expression, in turn, is determined by the structure of chromatin, a complex aggregate of DNA and proteins that forms chromosomes. Various methods allow us to evaluate levels of gene expression (e.g. RNA-sequencing), but when the goal of the study is to investigate protein-DNA interactions and, for example, identify transcription factor binding sites, ChIP-sequencing (ChIP-seq) is the best strategy. In this detailed tutorial, we work with ChIP-seq data for the pathogenic fungus Candida albicans and explain an entire pipeline for analysing ChIP-seq data from A to Z.
Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows you to retrieve reads in any region swiftly. In one of our previous tutorials, we mapped reads with TopHat and obtained BAM files. However, before we can use these BAM files in downstream analysis, we need to learn the basic and more advanced operations that allow us to handle, filter and pre-process the files. In this tutorial, we explain how to manipulate BAM files with samtools, an excellent suite of bioinformatics commands for working with SAM/BAM files.
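Much of the filtering mentioned here relies on the FLAG and MAPQ columns of each alignment record. The Python sketch below decodes the bitwise FLAG field defined by the SAM format and applies a simple filter to a text SAM line. It is not samtools code; the function names, the labels in the table and the example reads are hypothetical, and the filter is roughly comparable to what `samtools view -F 4 -q 20` selects.

```python
from typing import List

# Bit values of the SAM FLAG field (column 2); the labels are our own shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def keep_read(sam_line: str, min_mapq: int = 20) -> bool:
    """Keep mapped reads whose mapping quality (column 5) is at least min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    return not (flag & 0x4) and mapq >= min_mapq


# Made-up alignment records, for illustration only.
mapped = "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"
unmapped = "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*"
print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(keep_read(mapped), keep_read(unmapped))  # True False
```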
Processing and analysis of SAM files is a key step in many bioinformatics pipelines. It is essential for both biomedical and comparative bioinformatics. As we explained before, SAM files contain coordinates of reads aligned (mapped) to a reference sequence. The reference can be a single assembled genome or a set of contigs or scaffolds. SAM files can be produced by various read mappers, for example Bowtie2, BWA or BBMap. In this tutorial, we continue to work with the drug-resistant strains of Mycobacterium tuberculosis and demonstrate how to post-process, sort and clean SAM files with mapped reads and do basic variant (SNP) calling with varcall from ea-utils.
Sequencing read mapping is a key step in next-generation sequencing data processing. It allows you to find the locations of newly sequenced reads and align them with respect to a reference sequence (e.g. a reference genome, transcriptome or de novo assembly). Both RNA-seq analysis and variant calling require read mapping to be as accurate as possible. In this tutorial, we explain how to do read mapping with Bowtie2, one of the most popular tools for read alignment. We also explain how to fine-tune some of the Bowtie2 parameters in order to achieve the highest mapping sensitivity. We continue with the data from the study of multi-drug resistance in Mycobacterium tuberculosis.
Analysis of RNA expression is one of the most important bioinformatics tasks. However, many things can go wrong with RNA-seq, which makes expression analysis very tricky. In this tutorial we provide quite a detailed guide to RNA-seq mapping and explain some of the important factors you need to consider when mapping. You will work with a fascinating RNA-seq dataset obtained from human brain tissue and used to study changes in gene expression patterns during aging in humans.
Read cleaning and filtering is an important pre-processing step for raw sequencing data. In our previous tutorial we explored the quality of the raw sequencing data and demonstrated how to correctly interpret the results of FastQC quality reports. Based on the results of the quality assessment, we will now clean and filter the sequencing reads. In this new tutorial we filter and trim sequences using the Trimmomatic and RemoveBadTiles tools. We explain how different parameters affect the quality of filtering and show tricks to improve your data quality. We continue to use an open dataset from a recently published paper with an exciting theme: a study of multi-drug resistance in Mycobacterium tuberculosis. This is the second tutorial in a set of tutorials on read quality improvement.
Quality control and filtering of sequencing reads is one of the most important steps in the pre-processing of sequencing reads. However, it is not always trivial to figure out which reads need adjustment and which can be left untouched. In this tutorial, we explain the basics of the Phred score concept and introduce important quality metrics used in the majority of quality control bioinformatics tools, such as FastQC. We also demonstrate how to understand and interpret these quality metrics. We use an open dataset from a recently published paper with an exciting theme: a study of multi-drug resistance in Mycobacterium tuberculosis. This is the first tutorial in a set of tutorials on read quality improvement.
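As a quick illustration of the Phred concept: a Phred score Q encodes the probability P that a base call is wrong as Q = -10 · log10(P), so Q20 means a 1-in-100 chance of error and Q30 a 1-in-1000 chance. The Python sketch below converts between the two and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding; the function names and the example quality string are hypothetical.

```python
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed by default)."""
    return [ord(char) - offset for char in quality_string]


# A made-up quality string, for illustration only.
scores = decode_quality("II5#")
print(scores)  # [40, 40, 20, 2]
print([phred_to_error_prob(q) for q in scores])
```

Older Illumina pipelines used an offset of 64 instead of 33, which is why the offset is left as a parameter here.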
BS-seq is an important method for the analysis of DNA methylation. It provides a snapshot of a cell’s epigenomic state and reveals genome-wide cytosine methylation at single-base resolution. One of the powerful tools for BS-seq data analysis is the Bismark suite. Bismark can discriminate between cytosines in CpG, CHG and CHH contexts and allows you to visualize and interpret methylation data. Output data can be loaded into a genome viewer. In this tutorial, we demonstrate basic usage of Bismark on the InsideDNA platform using both the UI and the command line.
The console (shell, command line) is an essential tool for the majority of bioinformatics tasks. As soon as you need to parse genomic data and analyze it in even a slightly non-trivial way, using the console is unavoidable. One of the most common issues with a console is that the user is bound either to their own machine (with limited RAM and CPU) or to a cluster that may not have all the tools installed, may be overloaded with tasks, or may simply lack the node capacity necessary to complete a task. Despite the rising popularity of containers, which ease the burden of tool installation, cluster capacity and load remain a pressing issue in bioinformatics. In this tutorial, we explain how InsideDNA resolves this issue by offering an HPC/PC-like console experience in a cloud environment. From now on, you don’t need to worry about either the number of available nodes or their capacity: should you need to assemble 100 genomes, each on 200 GB of RAM and 32 cores, InsideDNA will instantly scale to the needed capacity as you submit tasks.
Metagenomics is a hot area of scientific research. It is now extensively used in ecology, biofuel production, agriculture, and human and animal health. Nevertheless, with opportunities come new challenges: in particular, data processing is becoming increasingly time-consuming and computationally expensive. The most “expensive” (time- and compute-wise) task is to assign taxonomic labels to metagenomic DNA sequences. Recent bioinformatics advances try to overcome this problem with more efficient algorithms. Here we present one of the most recent software tools for metagenomics data analysis: CLARK. Compared to existing solutions, it is more than 5 times faster, yet for large datasets it requires a powerful machine, and it is not available on Windows. In this tutorial we show basic usage of the tool with InsideDNA. In one of the next tutorials we will demonstrate a full pipeline for bacterial, viral and human metagenomics data analysis.
One of the most powerful frameworks for inference of population genetics or genomics scenarios is Approximate Bayesian Computation (ABC). Compared to conventional approaches such as likelihood-based ones, an ABC-based strategy allows you to model complex population history scenarios effectively and offers a flexible way of assessing the fit of alternative hypotheses. One of the popular tools for ABC modeling is DIY-ABC. It has an easy-to-use interface but, unfortunately, requires a powerful cluster to generate the thousands of datasets needed for ABC evaluation. Here we describe a simple two-step wrapper pipeline for simulating large numbers of ABC datasets. Our wrapper will be particularly useful for those who don’t have UNIX skills or a cluster but still need a large number of ABC simulations.
Gene prediction is one of the most common tasks in bioinformatic analysis of newly sequenced genomes. AUGUSTUS is an excellent gene prediction tool which works with eukaryotic genomes. It can predict genes ab initio (de novo) or based on hints (e.g. RNA-seq/EST, protein alignments, syntenic genomic alignments). In this tutorial we explain how to use protein profiles to improve gene search in genomic fasta files. For this purpose, we discuss the AUGUSTUS protein profile extension (PPX) and explain all the steps necessary to run a prediction with the addition of a protein profile.
Restriction-site-associated genomic markers (RAD markers) are one of the cheapest and most convenient ways to obtain a large number of loci for multiple species or individual samples. Typically, RAD-seq generates relatively long SNP profiles across species or individuals, and these profiles can then be used for admixture analysis, genetic map generation, phylogenetic analysis or population comparative genomics. Here we present one of several tools currently available for RAD-seq data processing and analysis: pyRAD. pyRAD is a great pipeline because it incorporates different stages of RAD-seq processing, from initial raw data de-multiplexing and filtering to generation of VCF, nexus, phylip, and other files. However, the use of VSEARCH for ortholog clustering within the pyRAD pipeline requires sufficiently powerful machines, especially for large and high-coverage datasets.
When researchers need to reconstruct a relatively large phylogeny for multiple genes (e.g. sequenced de-novo and obtained from the NCBI database), after source sequences are obtained, aligned and combined into a single matrix, the last important step is phylogeny reconstruction. Here we present a simple way to reconstruct a phylogeny from a DNA matrix based on multiple genes with the PhyML tool.
When researchers need to reconstruct a relatively large phylogeny for multiple genes (e.g. sequenced de-novo and obtained from the NCBI database) there are typically two important steps: multiple sequence alignment and merging of the genes into a single large matrix while preserving both species and alignment orders. Here we present a simple way to align sequences for multiple genes and combine these aligned genes into a coherent DNA matrix for phylogeny reconstruction.
One of the typical tasks when comparing datasets between multiple genomes or transcriptomes is to build a Venn diagram of overlapping orthologs, gene clusters, or transcripts. The same task arises when comparing the overlap between gene clusters in metagenomics studies. However, the majority of tools for comparative genomic analysis do not provide any simple way to obtain a file suitable for plotting Venn diagrams. Here we present a simple pipeline of two tools built with the InsideDNA platform. This pipeline (1) transforms the OrthoMCL and MCL output csv file into a format suitable for the VennDiagram function in R and (2) plots the resulting file as a Venn diagram in tiff format.
One of the common burdens for evolutionary biologists dealing with phylogeny reconstruction is supplementing newly sequenced data with sequences already available in GenBank. Here we present the second tool in the pipeline, which helps automate the reconstruction of large phylogenies. The tool is called geneCoverage2fasta, and it automatically retrieves the most represented sequences via BLAST from the GenBank database in fasta format.
One of the most popular and powerful tools in bioinformatics for ortholog detection is OrthoMCL. Nevertheless, it is also one of the most difficult tools to install and run, because it requires many dependencies, including BLAST and a MySQL database. With the InsideDNA platform, we have made OrthoMCL as easy as possible to work with, and below we describe the entire pipeline for ortholog detection in the genomes of three bacterial species of Ralstonia. In total, it takes about 10 minutes to run the entire pipeline on three bacterial genomes.
One of the common burdens for evolutionary biologists dealing with phylogeny reconstruction is supplementing newly sequenced data with sequences already available in GenBank. This exercise is particularly common when one would like to build a large(r) phylogenetic tree. Here we present a small pipeline of two tools, geneCoverage and geneCoverage2fasta, which automates two critical steps of such tasks: evaluation of gene coverage for a given taxon (i.e. how many unique species were sequenced for different genes/gene products) and automatic retrieval of the most represented sequences via BLAST from the GenBank database in fasta format.