Questions related to Basic Bioinformatics
I have an unusual question: I am working on an Erasmus internship project with Drosophila mutants at two different timepoints and with WT, KO and KI conditions. A company analyzed the data using DESeq2 and I have only got loads of PDFs and the results_apeglm.xlsx file.
This contains transcripts per million for each gene, replicate and timepoint, along with the comparisons used for DEG calling, so I have padj and log2FC values. A snippet is attached as an example.
I now want to construct a graph and clustering in which genes whose expression changes in different directions between WT and KO over time become visible out of the hundreds of candidate DEGs. With this I want to narrow down the long list so it can be verified with qPCR and serve as a marker for the transformation from presymptomatic to symptomatic.
I am setting up my analysis in R and want to use the degPatterns() function from DEGreport, as it gives a nice visual output and clusters the genes for me.
How can I transform my Excel sheets into a matrix format that I can use with degPatterns()? The SummarizedExperiment example given in the vignette is sadly not really helpful to me.
Thank you all for reading, pondering and helping with my question! I would be very happy if there's a way to solve my data wrangling issue.
All the best,
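A minimal data-wrangling sketch (in Python; the column layout, names and values below are hypothetical, since the real ones live in results_apeglm.xlsx): after exporting the sheet to long-format rows of (gene, sample, TPM), they can be pivoted into the genes x samples matrix that clustering functions such as degPatterns() expect.

```python
from collections import defaultdict

def long_to_matrix(rows):
    """Pivot long-format (gene, sample, tpm) rows into a genes x samples matrix."""
    matrix = defaultdict(dict)
    samples = []
    for gene, sample, tpm in rows:
        if sample not in samples:
            samples.append(sample)          # keep first-seen sample order
        matrix[gene][sample] = float(tpm)
    header = ["gene"] + samples
    out = [header]
    for gene in sorted(matrix):
        # missing gene/sample combinations default to 0.0
        out.append([gene] + [matrix[gene].get(s, 0.0) for s in samples])
    return out

# Toy long-format records (hypothetical sample names):
rows = [("geneA", "WT_T1", "10.5"), ("geneA", "KO_T1", "3.2"),
        ("geneB", "WT_T1", "0.0")]
mat = long_to_matrix(rows)
```

Writing `mat` out with the csv module then gives a file that R can read back in as the expression matrix.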
I have a sequence (genome.fasta) and I want to check the gene located at 400-500 nt.
What bash script should I use (I have WSL on my Windows machine), or are there any conda packages?
Thank you in advance.
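For a task like this, a plain-Python sketch works without any conda packages (function names are just illustrative); it parses the FASTA and slices out the 1-based region:

```python
def read_fasta(lines):
    """Parse FASTA lines into a {header: sequence} dict."""
    records, header, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

def extract_region(seq, start, end):
    """Return the 1-based, inclusive region start..end of seq."""
    return seq[start - 1:end]

# Toy example; for a real file use: read_fasta(open("genome.fasta"))
demo = [">chr1", "ACGT" * 200]          # 800 nt toy sequence
seq = read_fasta(demo)["chr1"]
region = extract_region(seq, 400, 500)  # the 400-500 nt window
```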
I'm using AutoDock Vina in Python to dock multiple proteins and ligands, but I'm having trouble setting the docking parameters for each protein. How can I do this in Python? (I have attached my .py code; in it I assumed the same parameters for all proteins.)
I have tried to separate a direct coculture of MSCs (mesenchymal stromal cells) and macrophages to do bulk RNA-seq on the macrophages, as I want to find out how MSCs change gene expression in macrophages. I have tried different methods to separate the coculture as much as possible, but I can only manage to retrieve a cell population with 95% macrophages and 5% MSCs still present.
Therefore, I want to know if anyone has experience with analyzing data when the population is not completely pure for one cell type, and how to handle such data.
Is it wise to proceed with bulk RNA-seq when 5% of my cells are still MSCs, well aware that some of the observed gene expression could come from those 5% MSCs?
Can someone please list the analyses that need to be performed to understand and compare the mechanism of a protein in its native and inhibitor-bound states? (NOTE: not the interaction between protein and ligand.)
I am looking for a tool (software, script), a tutorial or some help to generate 10,000 random non-redundant peptide sequences with a fixed length of 9 amino acids, in which some positions are fixed and not random.
An example to illustrate:
When starting with: _ _ A _ M _ S S _
I want to generate a pool of 10,000 non-redundant peptides with random amino acids at positions 1, 2, 4, 6 and 9, while keeping positions 3 (A), 5 (M), 7 (S) and 8 (S) fixed.
I am sure this can be done quite easily with a simple Python script, but I have almost zero knowledge to write such a script. If someone can help me, or is aware of an online tool that does this (so far the closest thing I found is https://peptidenexus.com/article/sequence-scrambler but it cannot generate thousands of peptides nor fix positions), it would be much appreciated.
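A minimal Python sketch of such a script, using the fixed positions from the example above (the function name and the fixed random seed are illustrative choices):

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"               # the 20 standard amino acids
TEMPLATE = {3: "A", 5: "M", 7: "S", 8: "S"}  # 1-based fixed positions
LENGTH = 9

def random_peptides(n, template=TEMPLATE, length=LENGTH, seed=0):
    """Generate n non-redundant peptides with the template positions fixed."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        pep = "".join(template.get(i, rng.choice(AAS))
                      for i in range(1, length + 1))
        seen.add(pep)                       # set membership enforces uniqueness
    return sorted(seen)

peptides = random_peptides(10000)
```

A set guarantees non-redundancy; with 5 free positions there are 20^5 = 3,200,000 possible peptides, so 10,000 unique ones are found quickly.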
Certain software packages and sites can calculate a DNA hairpin Tm depending on the size of the loop and the stem sequence, for example Gene Runner. Yet the calculation method or citation is not provided. Is there a formula that could help?
Does anyone have experience with the table2asn tool of NCBI? It is used for submitting annotations of genomes using GFF files.
It can be found at this location: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF/
I am having a problem with it.
1) The windows tool I get running, but it gives me an error. The command used is the following: table2asn -M n -J -c w -t template.sbt -a r10k -l paired-ends -i FinalContigs.fsa -f 2700988623.gff -o output_file.sqn -Z output_file.dr -locus-tag-prefix xxxxx
template.sbt is a file generated by NCBI
FinalContigs.fsa is a file containing all the different contigs of the draft genome. The first contig is named Contig001, the second contig is Contig002, and so on.
2700988623.gff is the file containing the annotation. Column 1 gives the contig where each CDS and RNA is found, so Contig001, Contig002, etc.
After running the command (as told by NCBI), I get the following error:
Cannot resolve lcl|Contig002: unknown
So if anyone could assist me with getting the tool running in Windows, that would be very much appreciated as it is the final step of submitting my genome.
Thanks in advance!
I have two ChIP-seq datasets for different proteins, and I have aligned them to some fragments in the DNA. Some of these fragments get a zero read count for one of the datasets or for both. To be able to say that a fragment has much more of protein X than protein Y, I use Student's t-test.
I wonder if it would be better to remove the zero values from both datasets of RPKM values per fragment. Moreover, the zeros pose a problem when I want to take logs in the data visualization part.
What would you suggest?
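One common alternative to discarding all zeros is a pseudocount before the log transform, combined with dropping only the fragments that are zero in both samples; a small sketch (illustrative function names, not a statistical recommendation):

```python
import math

def log2_pseudo(values, pseudo=1.0):
    """log2-transform RPKM values with a pseudocount so zeros stay defined."""
    return [math.log2(v + pseudo) for v in values]

def drop_double_zeros(x, y):
    """Keep only fragments where at least one of the two samples is non-zero."""
    pairs = [(a, b) for a, b in zip(x, y) if not (a == 0 and b == 0)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

# Toy RPKM values for four fragments in two ChIP-seq samples:
x = [0.0, 4.0, 0.0, 8.0]
y = [0.0, 1.0, 2.0, 0.0]
x2, y2 = drop_double_zeros(x, y)   # fragment 1 (zero in both) is removed
```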
I am a final-year M.Phil student and I am reading many articles to find a topic for my research. My major is microbiology (molecular) and I am interested in integrating basic bioinformatics. My supervisor is helpful in this matter, but I would like to come up with 2 or 3 topics of my own to propose as well, and I am having a severe block on this. A recommendation would be helpful. Thank you in advance.
I have some files in bed and bedgraph format to analyze with IGV. My team and I tried to load them into IGV following the IGV site's tutorials, but it hasn't worked. The bedgraph files are large (5157) and we converted them to the binary .tdf format using the IGVTools "Count" command, but that hasn't worked either. Only with some files can we see a single flat line on the IGV screen, without any information. With FilexT we can see that the bed and bedgraph files are not damaged.
We think the problem is the step where we select the "Load from File" option in IGV. What can we do?
We use IGV 2.10.3.
Suppose I have a peptide sequence of 400 amino acids containing a particular domain (say, a DNA-binding domain). I am looking for software that will take the peptide sequence as input and give an option to specifically download the sequence of that domain within the peptide.
Hello friends, today I am raising a concern: what is a real palindromic DNA sequence? Of course you will say restriction enzyme sites, but through a video available at http://bit.ly/palindromicDNA I am raising the issue that, in the true sense, mirror repeats are palindromic in nature as defined by standard English dictionaries. There are many unique properties of mirror-repeat DNA which I will share later. Hopefully the biological scientific community will accept mirror repeats as true English palindromes. So please check out http://bit.ly/palindromicDNA
I have a question regarding gene prediction for long metagenomic reads (MinION nanopore).
I was trying to understand the process of gene prediction. In my attempt, I classified my metagenomic sequences against a reference database using the following methods:
1. I ran Prodigal to predict ORFs using the -p meta option and then ran the DIAMOND aligner with an e-value cutoff of 0.002.
Result: 689 queries aligned
2. I directly used DIAMOND for alignment of the metagenomic reads to the reference database using the same e-value cutoff (I did not give an identity parameter).
Result: 7292 queries aligned
3. I converted my DNA fasta file to protein sequences using GOTRANSEQ, and did the same analysis with the same parameters.
Result: 169 queries aligned.
There is a huge difference between methods 2 and 3. Confused!
Which approach is better for predicting protein gene sequence for long reads ?
Is the e-value a sufficient parameter for DIAMOND blastp analysis? Do I need to set an identity % in the first approach?
In addition, I would also like to confirm whether I can directly use the translated file from the Prodigal analysis (the -a output) as input for DIAMOND.
I am looking for a command that will merge the 3 chains in the original PDB into a single chain and then renumber all of the residues. I have tried using the alter command, but when I export the PDB I get only one chain of the initial trimer, not the merged chain.
Could someone please provide a working example (or point me to a good resource) of how 'lambda local' (along with lambda BG, 1k , 5k and 10k) is calculated for peak calling in MACS?
λlocal = max(λBG, [λregion, λ1k], λ5k, λ10k)
Does 'max' here mean that whichever of λBG, λregion, λ1k, λ5k and λ10k is highest is taken as the λ used to calculate the p-value? Or is λlocal a product/average/sum of all the other λs?
I have been unsuccessful in trying to understand it by referring to the script and the tutorial given in the link below:
Also, is the 'mfold' (10-30 fold enrichment) parameter estimated w.r.t lambda BG?
The GAD_Disease segment in DAVID 6.8 acquires its data from the Genetic Association Database (https://geneticassociationdb.nih.gov/). The GAD database was shut down in 2014. It has been 6 years since then, yet DAVID has not abandoned the GAD database. My concern is: what is the relevance of the data now, in 2020? Is it still scientifically relevant and usable in research, or should it be ignored?
I am looking forward to your informed opinion.
Thanks and regards,
In the fasta output of Prokka listing the names of genes, some genes do not have any name ("gene: NA"). My question is whether these genes are hypothetical or simply do not have a name.
If the former is the case, how does Prokka determine them?
I'm a molecular biologist, and I have a few projects coming up in transcriptomics and small RNA analysis. Can I get by without knowing any programming, using user-friendly software such as Geneious Prime or another program you can suggest, or is programming absolutely a must?
Hi to all,
I'm approaching to the haddock web-tool for the first time. I got the username and password for the easy interface.
I'd like to know whether I'm on the right track.
Once I've uploaded the pdb files to be docked, I have to specify both the active and the passive residues.
In order to determine the active residues I have performed an NMR titration of the unlabelled protein with the labelled ligand and vice versa. Then I've calculated the chemical shift perturbation.
Now I have to determine which among them are the active residues in the protein-ligand interaction.
So, should I submit the PDB to a SASA (solvent-accessible surface area) calculation program and choose the chemical-shift-perturbation residues that match those found to be solvent accessible by the SASA program?
Is that correct?
Do you recommend any software/webtool? (I know NACCESS, but there is a very tedious procedure I have to follow to get the codes to decrypt the rar files.)
What should I do for the passive residues? Is the HADDOCK option that determines them automatically reliable?
I actually have two queries. Biochemical studies suggested presence of an enzyme in an organism but the gene encoding the enzyme is not known. I would like to find out gene candidates based on homology search/sequence similarity, using sequences of similar enzymes present in other organisms. My questions are:
1. What points should I consider when selecting an already known gene/protein for use in a homology search?
2. To find the orthologue of the known gene/protein by bioinformatics, which database/software should I use?
Please suggest papers/websites/software for beginners.
Hi all, we have a bioinformatics challenge and we would love any help this community can offer. We have data from a target-enrichment experiment that was supposed to capture certain microsatellite motifs. The three enriched libraries were sequenced in a rapid run on an Illumina HiSeq 2500 (paired-end mode) and our data are in the standard Illumina fastq output. Our three libraries come from three different sources: the first library was developed from fresh fish tissue; the second from mammal tissue; and the third from the same mammal species but from fecal samples. For the fecal samples, we need to somehow filter out sequences belonging to the mammal only (i.e. not prey or microbiome). We have a reference genome for the mammal, but not for the fish. The data have been demultiplexed already (so for the fish we have 40 individual fish, each with its own .fq file containing all the read data).

Now we are facing the challenge of how to deal with these data. Although we are familiar with most basic bioinformatic tools and analyses, we do not have advanced programming skills. We need to find a way not only to find and identify the length of our microsats within the reads, but also (for the fecal library) to identify unique flanking sequences that correspond to our mammal, so that the reads of other species in the fecal libraries can be excluded. Would anyone have a suggestion on what approach(es) we could use? We have already (unsuccessfully) attempted to tackle this with SSR_pipeline. Thank you in advance for any help you can offer - it is very much appreciated! Daniel & Vania
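For the "find the microsats and their lengths" part of the question above, a short regex sketch illustrates one possible approach (the motif and minimum repeat count are placeholder parameters):

```python
import re

def find_microsats(seq, motif="AC", min_repeats=5):
    """Locate runs of a microsatellite motif; return (start, n_repeats) pairs."""
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif), min_repeats))
    return [(m.start(), len(m.group(0)) // len(motif))
            for m in pattern.finditer(seq)]

# Toy read: an (AC)8 run passes the threshold, an (AC)3 run does not.
read = "TTTT" + "AC" * 8 + "GGGG" + "AC" * 3
hits = find_microsats(read)
```

Applied per read over each .fq file (and per motif in the enrichment panel), this gives repeat positions and lengths without any advanced programming.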
I've been using the refine.bio website to download normalized transcriptome data; each downloaded dataset consists in a compressed directory with an expression matrix in .tsv format, its metadata in .tsv format too and an aggregated metadata file in .json format.
I'm trying to associate the expression matrix with its metadata using the R programming language, but I don't know how to do it, and I can't find the way in the site's documentation. I only know that I need to read these files with these commands:
library(rjson)  # provides fromJSON()
expression_df <- read.delim('SRP068114/SRP068114.tsv', header = TRUE,
                            row.names = 1, stringsAsFactors = FALSE)
metadata_list <- fromJSON(file = 'aggregated_metadata.json')
but I have no idea how to merge them for generating a full-informative matrix.
Can someone help me, please?
Thank you so much.
Where can we get the best web course starting from basic Python programming and using kernels up to analyzing data with machine-learning algorithms?
I think most biologists don't need to know much about how to write programs or design kernels, but they should be able to write essential code to analyze data with already available programs.
I found some online courses. One is pythonforbiologists.com, which appears instantly when searching with the relevant keywords, but I was not able to find detailed reviews.
Another is Python for Genomic Data Science on Coursera; the course content looks good, but the reviews say it lacks the later materials (weeks 3 and 4) on uses/applications, which makes it virtually impossible to finish the course.
Suggestions for an efficient web course on Python for biologists would be appreciated.
I just need to explain my data about a filarial nematode and need proper tools for analysis, plotting and tabulation. There are so many software tools on the internet, but so many problems when working with them. Any suggestions for tools to analyze, plot and tabulate genomic data of eukaryotic genomes?
Hi to all,
AMBER and CHARMm are among the force fields used to compute protein properties.
I'd like to know what the differences are between the AMBER and CHARMm force fields, and when AMBER or CHARMm should be used.
I'm trying to extract prophages from bacterial genomes. The principal issue I've been having is that I have multiple hits distributed along the bacterial genomes and I don't know how to extract them together in order to get the complete prophage genome. I know there is a BLAST outfmt that can give me subject-start and subject-end, but I want to know if there is an alternative that can give me only the initial and final positions.
In the example, only two strains are shown; those contain several hits for one bacteriophage. I want to retrieve the initial and final positions of those hits to extract the prophage from the databases.
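One way to get a single initial/final position per prophage is to merge the individual BLAST hit intervals yourself; a sketch (coordinates are toy values, and max_gap is a hypothetical parameter for bridging nearby hits):

```python
def merge_hits(hits, max_gap=0):
    """Merge overlapping/adjacent (start, end) hits into one span per cluster."""
    merged = []
    # normalize orientation (BLAST can report end < start on minus strand)
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if merged and start <= merged[-1][1] + max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # extend current cluster
        else:
            merged.append([start, end])               # start a new cluster
    return [tuple(m) for m in merged]

# Toy hits for one strain: two overlapping hits plus one distant hit.
hits = [(1200, 4500), (4400, 9000), (15000, 15800)]
spans = merge_hits(hits)
```

Raising max_gap (e.g. to a few kb) would fuse nearby clusters into one candidate prophage region whose min/max coordinates can then be used for extraction.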
I am currently trying to align several thousand (5,000-10,000) protein sequences to a related reference protein sequence to determine the percent identity of the large set of sequences to the reference.
Does anyone know of a program that I can use to align large sets of protein sequences to a reference sequence. I know that Geneious is capable of performing reference alignments, but it is almost prohibitively expensive and I am capable of using command line tools.
If anyone knows of any open-source tools that could help me perform these tasks I would greatly appreciate the suggestions.
Thank you in advance!
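Once the sequences have been aligned to the reference (e.g. with an open-source aligner such as MAFFT), the percent-identity step reduces to a short script; a sketch assuming the two sequences are already aligned, with '-' for gaps:

```python
def percent_identity(aligned_query, aligned_ref):
    """Percent identity between two already-aligned sequences (gaps as '-')."""
    assert len(aligned_query) == len(aligned_ref)
    matches = aligned = 0
    for q, r in zip(aligned_query, aligned_ref):
        if q == "-" or r == "-":
            continue                 # skip gapped columns
        aligned += 1
        matches += (q == r)
    return 100.0 * matches / aligned if aligned else 0.0

# Toy aligned pair: 6 ungapped columns, 5 identical.
pid = percent_identity("MK-TAYIA", "MKLSA-IA")
```

Run per sequence against the aligned reference, this gives the identity table for the whole set without any commercial software.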
Hi everyone. Let's suppose that we have an index which summarizes an event of a real-life situation. Let's suppose then that we want to compare the same index of two patients and one index is way bigger than the other one (crude ratio around 200). If we do the ratio between the inverse log10 of the index (1/Log10(index)) of the two patients we obtain a 10-fold ratio, which might be easy to "handle" and to understand in a publication. Would it be better to run statistical analysis with the crude ratio, the ratio of inversed log10, or both? Which other transformation would you suggest for presenting and working with such data? Thank you in advance for your time and suggestions.
A newbie question:
I have a mammalian protein for which I would like to know "its history". Is there any tool/platform I could use to check for the presence of this protein's orthologs in different taxa/kingdoms (e.g., plants and animals)? Most platforms don't say "There is no such protein in the sequenced genomes belonging to this kingdom" and instead offer some BLAST alternative. What would be the best way to conduct such a search?
Any suggestion is welcome! Thank you!
I'm working on a haplotype case-control association study. For the data analysis I employed two online analysis tools.
The first is SNPStats, which gave me extremely large odds ratio values, as shown in the attached snapshot.
The second is http://analysis.bio-x.cn/myAnalysis.php, which gave me empty values (---), as shown in the attached snapshot.
Please, how can I explain these odds ratio values?
Thanks for any help.
I don't know anything about bioinformatics, but I'm going to do RNA-seq and ChIP-seq analysis in R later this year and I would like to know if a MacBook Pro (2018) with an i5 processor, 256 GB SSD and 8 GB RAM would be enough for basic bioinformatics. My model organism has a 12 Mbp genome and I am unsure whether I need the 16 GB RAM and 512 GB SSD, which is way more expensive.
The fly protein is 671 aa while the human protein is 620 aa. What software can help me predict the domain conservation between the orthologs? Multiple alignment or pairwise alignment?
I want to build a fusion/chimera protein structure that includes 3 different protein domains belonging to three individual proteins, and then evaluate the quality of the fusion protein's 3D structure. The protein domains are linked to each other with a 15 aa linker. I want to use UCSF Chimera, MODELLER or any other suitable software for this purpose. How can I do that?
I am an undergraduate neuroscience and bioinformatics research assistant and my personal project has been to explore the gut microbiota of an EAE mouse model. There are three treatment groups: untreated control, Complete Freund's Adjuvant only, and MOG+CFA. Samples were taken from 6 control, 6 CFA, and 5 CFA+MOG mice over 5 time points (1 before and 4 after starting the EAE experiment).
I have since been focusing on analyzing a network created from the abundance data. Counts were normalized with DESeq2 and split by treatment group. These count matrices were used to calculate a Spearman's rank correlation coefficient matrix for each group. Permutation testing was applied, with 100 randomized matrices generated per group. In the original matrix, all coefficients that fell below a 5% significance level in their corresponding permutation distribution were set to 0 and correlations above it were set to 1. In addition, any coefficients below 0.5 in magnitude were set to 0. This binary matrix was used to create an unweighted network for each group.
I have since been focused on using over-representation testing on various features in the networks. Of key interest is how the networks are divided into communities/modules. I am using the fast greedy algorithm from igraph due to time/computing constraints but that could change based on suggestions.
Currently, I have been testing for whether a certain taxa-taxa interaction is over-represented in one module versus the rest of the network. I am using fisher's exact test where the 2x2 matrix can be split into within the network vs outside the network and the taxa-taxa interaction vs every other interaction. The counts correspond to each edge in the network that fulfills the criteria of each cell.
The data I get back is a matrix with a taxon-taxon interaction per row and a module per column. The values are p-values from the hypothesis testing. There are 3 such matrices, one for each network/treatment group. I also have a matrix of the counts for each interaction/edge in each module.
My question is how can I better use the results of this data to derive biological insight? I have looked into dividing up bacteria into functional classes and potentially machine learning applications, but there are no standout programs that I know of that could readily take this data. The goals of this project are to better understand the structural changes in the gut microbiota during EAE and possibly to discover specific features like keystone taxa or co-occurrence groups that are gained or lost in the MOG+CFA group. Of particular interest are any OTUs related to the Lactobacillus/Bacilli lineage.
Hi, this is quite a basic question I feel, but I'm not exactly sure of what the best practice would be. I am currently using IGV for an interactive view, but I would rather have this automated on the command line with any bespoke elements put in place by myself. What tools (command line, not interactive) would I need for the following:
I have indexed BAM data of specific reads along with their corresponding vcf file. I would like to go back all the way to my reference genome, so I can take each read with an identified SNP and pull out the sequence from the reference genome plus an extra N bp e.g., 50 bps, on either side from where the read aligns on to the reference. In short, I want to know what exists to the left and to the right of my aligned read. Could this be done using the #CHROM and POS data in the vcf?
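The #CHROM/POS route can indeed work: with the reference chromosome in memory, pulling the aligned span plus N bp either side is just slicing; a sketch (1-based POS as in VCF/BAM, toy sequence, illustrative names):

```python
def flank_region(ref_seq, pos, read_len, flank=50):
    """Return the reference slice covering an aligned read plus `flank` bp
    on either side. pos is the 1-based leftmost alignment position."""
    start = max(0, pos - 1 - flank)                       # clamp at chrom start
    end = min(len(ref_seq), pos - 1 + read_len + flank)   # clamp at chrom end
    return ref_seq[start:end]

# Toy 206-bp "chromosome" with a distinctive 6-bp read target at 1-based 101.
ref = "A" * 100 + "CGTCGT" + "A" * 100
window = flank_region(ref, pos=101, read_len=6, flank=50)
```

In practice the per-read POS and length come from the BAM (or the VCF's #CHROM/POS for the variant site itself), and the clamping handles reads near chromosome ends.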
I need to identify targets of a number of miRNAs, but I need suggestions on the best target prediction tool, or combination of tools, mainly for humans. Thanks.
I am interested in obtaining the set of all human proteins with DNA-binding ability in FASTA format in one step. Downloading them one by one is very tedious for me :-D
Do you know any easy (user-friendly) way to do it?
Thank you in advance
I'm currently trying to get the 3D structures of a set of peptides (ranging from 12 to 20 amino acids). Subsequently we want to do docking analyses against an enzyme.
Which software do you use for that? How do you refine the structure?
I am trying to submit sequences of the COX-1 gene (a mitochondrial gene) to GenBank. My sequences are almost 100% identical to a published one. Submission returns a message saying that they contain stop codons and that this prevents their publication. Now, I checked similar sequences from GenBank and they also contain stop codons; if this is a blocker, how were those published?
Could anyone help explain why this happens and how I can deal with this problem?
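A quick local sanity check before resubmitting is to count in-frame stop codons in each reading frame; internal stops usually indicate a frameshift or the wrong genetic code table (for example, the vertebrate mitochondrial code reads TGA as Trp), so the correct /transl_table may need to be specified. A sketch using the standard-code stops (illustrative only):

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}

def count_stops(seq, frame, stops=STANDARD_STOPS):
    """Count in-frame stop codons for a given reading frame (0, 1 or 2)."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return sum(c in stops for c in codons)

def best_frame(seq):
    """Frame (0-2) with the fewest stop codons: a quick frameshift check."""
    return min(range(3), key=lambda f: count_stops(seq, f))

# Toy sequence: frame 0 contains one stop, frames 1 and 2 contain none.
seq = "AAATAACCC"
```

For mitochondrial genes the stop set would differ by organism, so treat this as a diagnostic, not a replacement for NCBI's validator.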
I understand that from TCGA data I can easily plot a survival curve related to a gene (for example gene ABCD) for a certain type of cancer.
However, I would like to go a little deeper, as in to see if gene ABCD affects the survival outcome of a group of colorectal cancer patients that were treated with a certain drug, say oxaliplatin.
I want to see if overexpression of ABCD led to reduced survival in oxaliplatin-treated colorectal cancer patients, or if the converse is true.
Are there any data sets like that out there?
Thank you all.
Hi everyone! I would like to perform an energy minimization analysis to study the effect of phosphorylation on protein stability in terms of Gibbs free energy. Additionally, I would like to compare this stability with mutant versions of my protein lacking this phosphorylation site.
I have seen this test in a recently published paper using the Discovery Studio software, but I am not capable of reproducing it. I attach the graph of this test as well as the reference for the article.
What are some good introductory courses available online?
Paid is fine, as it would be good to have a certificate issued at the end.
- I mapped my whole genome against the reference strain and found 100 bp gaps between all the contigs. Is this something to be worried about? How can I overcome this shortcoming in my sequence?
Until now, I was relying on KOBAS ( KEGG Orthology Based Annotation System), which was quite user-friendly and provided p-values for distinct pathways. However, it seems to be no longer available. Does anyone know a comparable online tool for analysis of both microarray and RNA-seq data?
Colleagues, I need help with Venn diagrams and transcriptomics. I have three lists of IDs (example: c58516_g4_i4) - only IDs, not the sequences. I need to make a Venn diagram to know which IDs are shared among the three lists, which are shared between only two of them, and which are present only in their original list. I could do it manually, but it's a huge amount of IDs. Can you suggest some software for Windows or a script for Linux? Thanks!
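On Linux (or any machine with Python), the set arithmetic itself is short; a sketch with toy IDs, partitioning three lists into the seven Venn regions:

```python
def venn_partition(a, b, c):
    """Partition three ID collections into the seven Venn diagram regions."""
    a, b, c = set(a), set(b), set(c)
    return {
        "abc":     a & b & c,        # shared by all three lists
        "ab_only": (a & b) - c,      # shared by exactly A and B
        "ac_only": (a & c) - b,
        "bc_only": (b & c) - a,
        "a_only":  a - b - c,        # unique to each list
        "b_only":  b - a - c,
        "c_only":  c - a - b,
    }

parts = venn_partition(["id1", "id2", "id3"],
                       ["id2", "id3", "id4"],
                       ["id3", "id5"])
```

In practice each list would be read from a file (one ID per line); the region sizes are the counts a Venn-drawing tool needs.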
I need to extract the x,y coordinates of a PCA plot (generated in R) to plot in Excel (my boss prefers Excel).
The code to generate the PCA:
pca <- prcomp(data, scale=T, center=T)
If we take a look at pca$x, the first two PC scores for an example point are as follows:
29. 3.969599e+01 6.311406e+01
So for sample 29, the PC scores are 39.69599 and 63.11406.
However if you look at the output plot in R, the coordinates are not 39.69599 and 63.11406 but ~0.09 ~0.2.
Obviously some simple algebra can estimate how the PC scores are converted into the plotted coordinates but I can't do this for ~80 samples.
Can someone please shed some light on how R gets these coordinates and maybe a location to a mystery coordinate file or a simple command to generate a plotted data matrix?
NOTE: pca$x does not give me what I want
Redoing prcomp() without scale and center gives me this for PC1 and PC2 for the first 5 samples
1 -8.9825883 0.0113775
2 -16.3018548 9.1766104
3 -21.0626458 3.0629666
4 5.5305875 4.0334291
5 0.2349433 12.4872609
However the plot ranges from -0.15 to 0.4 for PC1 and -0.35 to 0.15 for PC2
This might be a basic question but I am wondering how does one approach this task? My aim is to compare specific genes from the published genomes of the closest sequenced species/genera to my organism.
I was surprised at first and asked whether I would be making a database similar to BLAST's, but I was told that by "database" they just expect me to download FASTA files from GenBank. Is that simply it?
I am currently working on making a variants database.
I generate separate VCF files for each sample and merge them using VCFtools (using the mergeVCF command) to make a multisample VCF file. Then I can calculate the minor allele frequency for each variant in the multisample VCF file using VCFtools.
My question is, what is your recommendation about the best way to proceed. Do you recommend any other pipeline? Please also correct me if I am wrong in choosing the pipelines mentioned above.
I have a short nucleotide sequence which has 6 ambiguity characters within it. Is there a way of taking this sequence with the ambiguity characters and producing a list of all the possible combinations of the sequence?
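Yes; with the IUPAC ambiguity table and a Cartesian product this is a few lines of Python (the function name is illustrative):

```python
from itertools import product

# IUPAC nucleotide ambiguity codes -> the bases they can represent
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def expand_ambiguities(seq):
    """List every unambiguous sequence encoded by an IUPAC-ambiguous one."""
    return ["".join(bases)
            for bases in product(*(IUPAC[ch] for ch in seq.upper()))]

variants = expand_ambiguities("ARYT")   # R = A/G, Y = C/T -> 4 combinations
```

With 6 ambiguity characters the list stays small (at most 4^6 = 4096 sequences, fewer if none of them are N).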
In a cancer research project I have a lot of data. When I train a model on all the features (redundant ones included) and later use the test data with this model, the accuracy is pretty good. I know relevant features are important. Can anyone explain why feature selection is needed when we already get a pretty good result with all the features?
I have done MiSeq sequencing on ice samples and I would like to run the analysis on them. Nevertheless, the two pipelines I'm familiar with (Mothur and Parallel-META) are not the best for ITS1 analysis. I have looked on Google for a tutorial (Mothur is absolutely fantastic for this) but I can't find anything similar for Skata.
What software is available to show how a SNP can change the shape or conformation of DNA or RNA?
Kindly suggest some easy-to-use software that helps to simulate the effect of genetic variants in coding or non-coding regions.
I am trying to identify the point of insertion in the genome that may have caused the size difference in fruits between wild-type and mutant of a plant species (also mentioned here : https://www.researchgate.net/post/Identifying_insertion_of_transposable_element_in_a_large_plant_genome_any_recommendations_on_my_situation).
Generally, procedures I used to confine the mutated region include short-read whole-genome sequencing -> k-mer frequency analysis -> contig assembly -> mapping of reads, aligning contigs with databases etc.. In short, my intermediate goal is to extract only reads containing the mutated region, assemble it, assess by read-mapping, before going to some wet-lab (e.g. PCR).
Actually, I am so frustrated with the long procedures I have applied to the sequencing data without having any control to ensure I did not overdo something (e.g. selection criteria set too stringently). Besides, I am overwhelmed by the available algorithms (e.g. assemblers) that output significantly different results which, again, are hard to compare without any control or gold standard. It is painful to screen for something based on theoretical calculation (while there is always an exception) or sometimes even intuition (e.g. the room I should leave in order not to miss any exceptions). On the other hand, if I screen everything conservatively, tons of candidate sequences remain.
Regarding the situation, I have the following questions :
1) What are the principles of controlling bioinformatics procedures?
2) What are the principles of balancing between the efficiency of screening and the risk of losing a real candidate?
3) Any other suggestions to effectively screen for the true mutation (including wet-lab)?
I am considering if I should accept the situation as it is with the quality and quantity of data on hand, anyway, I want to see how far I could go by optimizing the analysis method after getting advice on the questions.
Why do internal stop codons appear during submission of COX-1 or cyt b partial gene sequences to NCBI, and how can we get rid of them? This happens even when the nucleotide sequences are the same as those of closely related species already submitted to GenBank. How can we submit sequences if such error conditions persist?
What important factors should be considered while performing homology modeling?
Please suggest the best tools.
I am a hardcore experimental biologist. In particular, I am well-versed in molecular and cell Biology techniques. However, the area of Bioinformatics and Computational Biology have revolutionised the field of Biology.
Can anyone suggest basic hands-on-trainings/courses of international standard where we can get a good exposure on the following for beginners?
2. Systems Biology
3. Synthetic Biology
I have BLAST results in tabular format comprising nearly 0.5 million entries with the column titles "qseqid; sseqid; qlen; slen; qstart; qend; sstart; send; evalue; length; pident; qcovs; bitscore; staxid; sscinames; scomnames; sskingdoms; stitle".
I want to analyze these data at the family, order, class, genus and species levels in Excel. How do I import the taxonomic lineage for each taxon ID?
Is there any Perl or Python script or tool available to do this analysis or to fetch the data from the NCBI taxonomy database or another db?
Also, if my data have more than 1 million rows, which tool should I use for simple calculations like abundance (Excel only supports 1,048,576 rows in a single sheet)?
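For tables past Excel's row limit, streaming the file and counting on the fly avoids loading it at all; a sketch using only the standard library, keyed on the sscinames column from the header above (the demo data are toy values):

```python
import csv
from collections import Counter
from io import StringIO

def count_abundance(tsv_stream, column="sscinames"):
    """Stream a (possibly multi-million-row) tab-separated BLAST table and
    count hits per taxon without loading the whole table into memory."""
    reader = csv.DictReader(tsv_stream, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row[column]] += 1
    return counts

# Toy two-column table; a real run would pass open("blast_results.tsv").
demo = StringIO("qseqid\tsscinames\n"
                "q1\tEscherichia coli\n"
                "q2\tEscherichia coli\n"
                "q3\tBacillus subtilis\n")
abundance = count_abundance(demo)
```

The same loop can key on any of the listed columns (sskingdoms, staxid, ...) to get abundances at other taxonomic levels.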
I have written a few Tcl VMD scripts, which can also be used as VMD commands in the Tk console. I hope these will be helpful for VMD users.
Please have a look at these and let me know your valuable suggestions.
Thanks in advance.
K .anji babu
I want to analyze several genes for a set of bacteria simultaneously. Is it possible to analyze that set of genes in one run and draw their phylogenetic tree - I mean, putting all the sequences of each species in a row? Which software has this ability?
I am working on a new project to observe the transcription activator's function.
I was wondering how I could test the DNA binding site via RStudio or any software you would suggest. I would like to align yeast DNA from the negatively numbered positions to the TATA box, then see where the transcription activator might bind. If there is no binding site, I would go ahead and work on a possible mutation in the protein so that it binds the yeast DNA to start transcription.
I need suggestions about what type of software I could use to do the DNA alignment and see how the binding could occur.
I have worked in RStudio a little bit, but I am not sure if it is the correct road to follow.
Also, I don't know how to align the DNA sequence using negative-to-positive numbering in RStudio. I would like to hear suggestions for designing my code and project.
I am looking forward to hearing from you soon.
Thanks for your time.
I have a VCF file with a bunch of SNPs. I want to transform this file into a distance-based matrix (where HomoRef = 0, Heterozygous = 1, HomoAlternate = 2, for example). How can I do this?
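A sketch of the genotype-to-0/1/2 conversion in plain Python (toy VCF line; a real file would be streamed line by line, skipping the '#' header lines):

```python
def gt_to_dosage(gt_field):
    """Convert a VCF sample field to 0/1/2 alt-allele dosage (None if missing)."""
    gt = gt_field.split(":")[0]             # GT is the first colon-separated subfield
    alleles = gt.replace("|", "/").split("/")  # treat phased and unphased alike
    if "." in alleles:
        return None                         # missing genotype
    return sum(a != "0" for a in alleles)   # count non-reference alleles

def vcf_line_to_dosages(line):
    """Extract per-sample dosages from one VCF data line (samples start at column 10)."""
    fields = line.rstrip("\n").split("\t")
    return [gt_to_dosage(f) for f in fields[9:]]

# Toy data line with four samples: hom-ref, het, hom-alt, missing.
line = "chr1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1|1\t./."
dosages = vcf_line_to_dosages(line)
```

Stacking one such row per variant gives the variants x samples dosage matrix from which pairwise distances can be computed.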
Here is the thing, we have recently submitted a paper about tumor prognosis biomarker, and a reviewer asked us to validate our findings in a given microarray dataset on http://www.ebi.ac.uk/arrayexpress.
After I downloaded the raw data (.CEL files) and opened them with R (which I have never used before), I could only find manuals for full-scale expression analysis, which took hours on my computer. After that, I have to pick out that specific gene's expression, integrate it with the survival data, and go for the validations, which is quite time-consuming.
I am wondering if there is a way to only extract the expression data of that specific gene (e.g. probe 284065_at on HG U133 Plus 2) from the raw data files (in this case .CELs) and get the results I need much faster.
Meanwhile, does anyone know how I can combine that expression data directly with the survival data (or better, process the data directly online without using R on my computer)? Or if you know a way to improve my workflow (see below for the code I used), please share with us :)
P.S. the R code I am using is from this site: http://jura.wi.mit.edu/bio/education/bioinfo2007/arrays/array_exercises_1R.html