FANDOM


DescriptionEdit

What is the relative abundance of As, Ts, Cs and Gs in the genome? This script will calculate it from a given FASTA file!

OutputEdit

CHROM	A	C	G	T	N
1	26.5%	19.0%	19.0%	26.5%	9.0%
2	27.0%	19.3%	19.3%	27.0%	7.5%
3	26.8%	19.2%	19.2%	26.9%	8.0%
4	27.1%	19.1%	19.1%	27.2%	7.4%
5	27.3%	19.1%	19.2%	27.4%	7.0%
6	27.5%	19.2%	19.2%	27.5%	6.6%
7	27.6%	19.2%	19.2%	27.6%	6.4%
8	27.7%	19.2%	19.2%	27.7%	6.2%
9	27.5%	19.1%	19.2%	27.6%	6.6%
10	27.2%	19.4%	19.4%	27.2%	6.8%
11	27.5%	19.6%	19.6%	27.5%	5.7%
12	27.8%	19.7%	19.7%	27.9%	4.8%
13	27.5%	19.2%	19.2%	27.5%	6.6%
14	27.1%	18.9%	18.9%	27.2%	7.8%
15	26.8%	18.7%	18.7%	26.8%	9.0%
16	26.6%	18.8%	18.8%	26.6%	9.2%
17	26.6%	19.1%	19.1%	26.6%	8.6%
18	26.7%	19.1%	19.1%	26.8%	8.3%
19	26.5%	19.2%	19.2%	26.6%	8.5%
20	26.9%	19.3%	19.4%	27.0%	7.4%
21	26.8%	19.2%	19.2%	26.8%	7.9%
22	26.5%	19.1%	19.2%	26.6%	8.6%
X	27.6%	19.2%	19.2%	27.7%	6.4%
Y	27.4%	19.0%	19.0%	27.4%	7.3%
M	27.5%	19.1%	19.2%	27.6%	6.6%

From the above output, taken from build 36 of the human genome, we can see that A and T are the most common bases. We can also see that between 4.8% and 9.2% of a chromosome may have missing sequence (Ns), generally due to highly repeative regions of the genome that cannot be easily sequenced.

Running the scriptEdit

> perl ./count_nucleotides.pl input_FASTA_file.fa

RequirementsEdit

This script requires Perl. It calculates the base composition using a FASTA file. FASTA files for the human genome can be found from the UCSC website: http://hgdownload.cse.ucsc.edu/downloads.html#human

CodeEdit

Script 1 - count_nucleotides.plEdit

#!/usr/bin/perl

$narg = $#ARGV+1;
if ($narg < 1)
{
	die("Usage: perl count_nucleotides.pl FASTA_file.fa \n");
}
$in = $ARGV[0];

open(IN, '<', $in) or die("Could not open fasta file: $in\n");
@nuc = ("A", "C", "G", "T", "N");
%counts = ();
$header = "";
print "CHROM";
for ($i=0; $i<=$#nuc; $i++)
{
	print "\t" . $nuc[$i]; 
}
print "\n";
while ($line = <IN>)
{
	chomp($line);
	if (substr($line, 0, 1) eq '>')
	{
		if ($header ne "")
		{
			print "$header";

			$sum = 0;
			for ($i=0; $i<=$#nuc; $i++)
			{
				$sum += $counts{$nuc[$i]};
			}
			for ($i=0; $i<=$#nuc; $i++)
			{	
				$prc = sprintf("%.1f", $counts{$nuc[$i]}*100/$sum);
				print "\t" . $prc ."\%";
			}
			print "\n";
			%c = ();
		}
		$header = substr($line, 1);
		next;
	}
	$line = uc($line);

	for ($i=0; $i<=$#nuc; $i++)
	{
		$base = $nuc[$i];
		$count=($line=~s/$base/$base/g);
		$counts{$base} += $count;
	}
}

print "$header";

$sum = 0;
for ($i=0; $i<=$#nuc; $i++)
{
	$sum += $counts{$nuc[$i]};
}

for ($i=0; $i<=$#nuc; $i++)
{	
	$prc = sprintf("%.1f", $counts{$nuc[$i]}*100/$sum);
	print "\t" . $prc . "\%";
}
print "\n";

close(IN);

Ad blocker interference detected!


Wikia is a free-to-use site that makes money from advertising. We have a modified experience for viewers using ad blockers

Wikia is not accessible if you’ve made further modifications. Remove the custom ad blocker rule(s) and the page will load as expected.