Site author Richard Steane
The BioTopics website gives access to interactive resource material, developed to support the
learning and teaching of Biology at a variety of levels.
DNA, genes and chromosomes
This topic has connections with Nucleic acid structure and DNA replication (in Biological Chemicals).
It continues with 'DNA and protein synthesis' (links below).
DNA - different types in different places
DNA is the most variable molecule known to Man.
Some biological compounds with the same empirical formula exist in different forms (isomers), as parts of their molecules can be arranged differently.
Although DNA is composed of only a few kinds of sub-unit, these can be joined in any number and in any order, giving an effectively unlimited number of combinations.
The sub-units are called nucleotides. Each has three parts. Two are the same in every nucleotide: the sugar deoxyribose and a phosphate group. The third, called a base, varies: it can be adenine, cytosine, guanine or thymine, usually abbreviated to A, C, G or T. Nucleotides have similar names to their bases, and so can use the same letters.
It is this variation in molecular structure that enables DNA to perform its essential function in living organisms: storing genetic information and carrying it from cell to cell, and from generation to generation.
Size of DNA molecules is often expressed in terms of the number of base pairs they contain.
The distance between bases of DNA is 3.4 Angstroms (0.34 nm) so 1 million base pairs is 340 µm in length.
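The conversion from base pairs to length can be written out as a small calculation. This is only a sketch: the function name `dna_length_um` is made up for illustration, and it assumes the 0.34 nm spacing quoted above applies evenly along the whole molecule.

```python
# Spacing between neighbouring base pairs: 3.4 Angstroms = 0.34 nm (figure from the text).
RISE_PER_BASE_PAIR_NM = 0.34


def dna_length_um(base_pairs: int) -> float:
    """Length in micrometres of a fully stretched-out double helix."""
    length_nm = base_pairs * RISE_PER_BASE_PAIR_NM
    return length_nm / 1000  # 1000 nm in 1 µm


# One million base pairs, as in the example above.
print(f"{dna_length_um(1_000_000):.0f} µm")
```

The same function can be reused for any of the genome sizes quoted elsewhere on this page.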
Without doubt, the investigatory work that revealed the DNA sequence that makes up the Human Genome in 2003 has generated a lot of useful information about DNA, genes and chromosomes. And work on other organisms has resulted in interesting information that can be used for comparison.
Things you should know about DNA structure
The DNA molecule is a double helix in shape, and each helix is a polynucleotide, i.e. a polymer consisting of a number of nucleotides in a row, coiled up to give this shape.
The outside edges of each helix consist of alternating deoxyribose and phosphate groups, held together by strong phosphodiester bonds.
The middle part of the double helix consists of pairs of nitrogenous bases - adenine + thymine, or cytosine + guanine, clinging together by much weaker hydrogen bonds. These pairs are called complementary, as their molecules are shaped to fit together, and the hydrogen bonding relies on this closeness of fit.
The two polynucleotide strands run in opposite directions. This is called antiparallel.
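Complementary pairing and the antiparallel arrangement can be illustrated in a few lines. This is a minimal sketch: the function name `partner_strand` and the example sequence are invented for illustration, and both strands are written in the usual 5' to 3' direction.

```python
# A pairs with T, and C pairs with G (complementary base pairing).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def partner_strand(strand_5_to_3: str) -> str:
    """Return the complementary strand, also written 5' to 3'.

    Because the strands are antiparallel, the paired bases have to be
    reversed so that the result reads in the same direction.
    """
    return strand_5_to_3.translate(COMPLEMENT)[::-1]


strand = "ATGC"  # made-up example sequence
print(strand, "pairs with", partner_strand(strand))
```

Applying the function twice gives back the original strand, which is one way of seeing that each strand carries the full information.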
Prokaryotic cells - no loose ends
In prokaryotic cells (bacteria, cyanobacteria and archaea) the DNA is fairly short and circular. It appears as a diffuse blob in the cytoplasm, not associated with any other structures within the cell.
Different forms of DNA in a bacterial cell
Terms like short and circular are not very helpful.
This DNA forms into tangled loops, not as long or as organised as DNA in eukaryotic cells.
The main DNA is sometimes called the 'bacterial chromosome' or nucleoid. There may be other independent sections of DNA called plasmids, which are also circular. These often code for characteristics like antibiotic resistance.
The bacterium Escherichia coli has a DNA molecule 4.6 million base pairs in length.
This equates to 1564 µm, in a cell approximately 2 × 1 µm.
That makes a loop about 500 µm in diameter, so it must be folded back on itself at least 500 times to fit inside the cell.
Physical map of the Macrococcus caseolyticus plasmid pMCCL2
The bacterial genus of the top-hit entry for each orf is denoted by colouration.
MRSA (methicillin-resistant Staphylococcus aureus) has emerged as an antibiotic-resistant bacterial strain. Several genes for antibiotic resistance have been identified on plasmids which can be passed between related species.
For example, Macrococcus caseolyticus has a large plasmid, pMCCL2, consisting of 80,545 base pairs.
That is 27.39 µm in length - a loop 8.7 µm in diameter - if circular. It is obviously looped back on itself a few times within a bacterial cell.
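The loop diameters quoted for the E. coli chromosome and the pMCCL2 plasmid follow from treating each molecule as a perfect circle, so that diameter = circumference / π. A rough sketch of that calculation is given here; the function name `loop_diameter_um` is hypothetical and the perfect circle is of course an idealisation.

```python
import math

RISE_PER_BASE_PAIR_NM = 0.34  # spacing between base pairs, as quoted earlier


def loop_diameter_um(base_pairs: int) -> float:
    """Diameter in µm of circular DNA opened out into a perfect circle."""
    circumference_um = base_pairs * RISE_PER_BASE_PAIR_NM / 1000
    return circumference_um / math.pi


for name, size in [("E. coli chromosome", 4_600_000), ("pMCCL2 plasmid", 80_545)]:
    print(f"{name}: {loop_diameter_um(size):.1f} µm across")
```

Comparing these diameters with a cell of roughly 2 × 1 µm shows why both molecules must be folded many times over.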
Eukaryotic cells - all wound up
DNA is coiled around 'bobbins' of histone proteins
In eukaryotic cells (animals, plants, protoctists, fungi) the DNA is comparatively long, linear and associated with proteins called histones.
A chromosome consists of a DNA molecule and its associated proteins.
In humans, there are 46 chromosomes in each body cell, so the main DNA is divided into 46 sections.
Human chromosomes arranged by size
to form a karyotype - a genetic identification parade
Courtesy: National Human Genome Research Institute
This diagram shows the banded appearance of chromosomes when stained. The dotted horizontal line denotes the centromeres - points of attachment to the spindle during mitosis. These form the central parts of the X-shape composed of duplicated arms (chromatids) of chromosomes, seen at metaphase.
The diploid (2n) number of chromosomes in Man is 46.
There are two copies of each chromosome 1-22 and either two X chromosomes (in females) or one X and one Y chromosome (in males).
6 billion base pairs shared between 46 chromosomes gives an average of about 130 million (130,434,783) base pairs per chromosome.
Chromosome 1 (the largest) has 248,956,422 base pairs - with 2058 protein coding genes.
Chromosome 21 (the smallest) has 46,709,983 base pairs - with 234 protein coding genes.
The chromosomes are found within the nucleus of the cell and their DNA is described as nuclear DNA, as distinct from the next category.
The DNA in a normal human body cell is about 6 billion base pairs long. At 0.34 nm per base pair, this is a total of about 2.04 metres in length.
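The arithmetic behind the human figures can be checked in the same way. This sketch simply uses the round figure of 6 billion base pairs from the text; the variable names are made up for illustration.

```python
DIPLOID_BASE_PAIRS = 6_000_000_000  # round figure used in the text
CHROMOSOMES = 46
RISE_PER_BASE_PAIR_NM = 0.34

# Total length: base pairs x 0.34 nm, converted from nanometres to metres.
total_length_m = DIPLOID_BASE_PAIRS * RISE_PER_BASE_PAIR_NM * 1e-9

# Mean chromosome size if the DNA were shared out equally.
average_base_pairs = DIPLOID_BASE_PAIRS / CHROMOSOMES

print(f"Total length: {total_length_m:.2f} m")
print(f"Average per chromosome: {average_base_pairs:,.0f} base pairs")
```

The average sits between the sizes of chromosome 1 and chromosome 21, as it should.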
It is sometimes quoted as half this figure, but this presumably refers to the single set of chromosomes, which is only found in gametes (sperms and eggs).
DNA in the nucleus has several layers of organisation. It is wound onto histone 'bobbins', each of which is composed of 8 sub-units. It is said that 146 DNA base pairs are wound onto each histone bobbin, forming structures known as nucleosomes. These form 'beads on a string' sections, which fold into ribbons that coil into 30 nm diameter fibres. The fibres associate into larger rope-like structures, forming the chromosomes, which are much more visible during mitosis. Some of the coiling prevents access to the DNA, but there are regions where DNA is less tightly coiled, and loops on the edges allow access by enzymes for transcription.
Chemical changes to histones and DNA are involved in epigenetic control of gene expression in eukaryotes (A2 topic).
DNA and proteins make up a material known as chromatin, so called because of variations in staining reaction.
Mitochondria and chloroplasts - keeping their own information safe
The mitochondria and chloroplasts of eukaryotic cells also contain DNA which, like the DNA of prokaryotes, is short, circular and not associated with protein.
This points to the possibility that these organelles have developed from single-celled organisms, such as bacteria and cyanobacteria, that were taken into the cells of ancestors of present-day organisms, millions of years ago. It seems that the combination was more efficient in energy terms, and modern animals and plants have co-evolved alongside their endosymbiont partners.
This reinforces the notion that these organelles have a semi-independent existence within cells.
Mitochondrial DNA has proved to be a useful tool for tracing evolutionary relationships within a species.
Human mitochondrial DNA (mtDNA) includes 16,569 base pairs and encodes 13 proteins which are involved in the process of oxidative phosphorylation (aerobic respiration) which takes place in mitochondria, resulting in the generation of ATP.
Surprisingly, some of its genes are read from one side of the DNA, and the others from the other side ('light strand' and 'heavy strand').
Some of the DNA codes for its own versions of RNA, used in its own protein synthesis: two ribosomal RNAs (rRNAs) and 22 transfer RNAs (tRNAs), shown as blue spheres with single-letter amino acid code abbreviations.
Chloroplast DNAs (cpDNAs) are typically 120,000-170,000 base pairs long.
Viral DNA and RNA - putting it about
Viruses also have nucleic acids but they lack any of the cell's organelles which can operate on them.
Only when viral nucleic acid enters the environment of the host cell can it become operational.
Viruses are categorised according to their genetic material: DNA viruses, RNA viruses (riboviruses) or retroviruses.
DNA viruses can be double-stranded (ds) (e.g. Herpes simplex) or single-stranded (ss) (e.g. Protoparvovirus)
RNA viruses can be double-stranded, or single-stranded in either positive-sense or negative-sense forms.
Retroviruses have genes encoded in RNA, and this needs to be reverse-transcribed into DNA by an enzyme called reverse transcriptase before it can be copied in the usual way.
The reverse transcriptase and integrase enzymes are produced in the host's cell before being packaged into virus particles which leave the cell to infect others.
Example: HIV - human immunodeficiency virus - which has the potential to cause AIDS - Acquired Immune Deficiency Syndrome.
Think of a virus
Genes in the COVID 19 RNA
More about ORFs below
The coronavirus SARS-CoV-2, which causes COVID 19, is an RNA virus, and its genome consists of 29,903 RNA nucleotides - not in a loop.
These code for four structural proteins (S, E, M and N; S, E and M are on the outside of the virus particle, in its envelope), and others which inhibit host defences.
S is the spike protein, a target for some of the vaccines which are now under development. And other vaccines use sections of the mRNA itself.
A replicase gene produces 2 polypeptide chains that are then split by viral protease into 16 'nonstructural proteins' that are involved in replication of the viral genome. This protease has been singled out for attack by inhibitor molecules such as Paxlovid.
The D614G mutation is an allele causing a modification of the virus' surface spike protein, which has become increasingly common. This notation means that the 614th amino acid in its polypeptide chain is altered from being aspartate - aspartic acid - (one-letter amino acid code D) to glycine (G). This is likely to be the result of a change (A to G) in the middle base of a codon in the viral RNA. See the genetic code table below.
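The reasoning about the middle base can be tried out directly. The sketch below uses only the RNA codons for aspartate (D) and glycine (G) from the standard genetic code; which of the two aspartate codons the virus actually carries at this position is not stated above, so both are shown. The function name `change_middle_base` is made up for illustration.

```python
# Only the codons needed here: aspartate (D) and glycine (G).
CODONS = {
    "GAU": "D", "GAC": "D",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}


def change_middle_base(codon: str, new_base: str) -> str:
    """Return the codon with its second (middle) base replaced."""
    return codon[0] + new_base + codon[2]


for codon in ("GAU", "GAC"):
    mutant = change_middle_base(codon, "G")
    print(f"{codon} ({CODONS[codon]}) -> {mutant} ({CODONS[mutant]})")
```

Either way, a single A to G change in the middle position is enough to turn an aspartate codon into a glycine codon.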
And there is more information about changes in the Covid 19 genome and the proteins coded for in another topic on this site.
What is a gene?
In genetics, a gene is a unit of inheritance, and it may exist in different forms called alleles.
At a biochemical level, a gene is a section of DNA with a particular function.
It is frequently described as a base sequence, because genetic information is carried in the form of bases in the nucleotides, but their order is critically important, just like the letters in a word.
Each gene (or allele) has a particular position called a locus on a DNA molecule or chromosome. Often many individual genes are inactive for the life of a cell.
When a gene is active, the selected part of the DNA in the nucleus of a cell is copied into another (but single-stranded) nucleic acid: messenger RNA, which moves out into the cytoplasm of the cell. This transcription process takes place base by base, and the resulting copy of mRNA has a very close resemblance to one strand of the DNA, as there is a 1:1 relationship between DNA bases and mRNA bases.
DNA and RNA both have the bases adenine, cytosine and guanine, but DNA has thymine and RNA has uracil. Of course DNA has deoxyribose whereas RNA has ribose in the backbone of the molecule.
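The 1:1 relationship between DNA bases and mRNA bases can be shown as a simple letter-for-letter substitution. In this sketch the function name `transcribe` and the example sequence are invented; it assumes the DNA template strand is supplied in the order in which it is read (3' to 5'), so that the mRNA comes out written 5' to 3'.

```python
# Each DNA base on the template pairs with one RNA base.
# Note that A on the DNA pairs with U (not T) in the RNA.
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_3_to_5: str) -> str:
    """Build mRNA base by base against a DNA template strand."""
    return template_3_to_5.translate(DNA_TO_RNA)


template = "TACGGT"  # made-up template strand
print(template, "->", transcribe(template))
```

The mRNA produced is identical to the other (non-template) DNA strand except that U appears wherever that strand has T.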
It is said that the DNA of a gene codes for a section of mRNA which has a particular function: it uses its own sequence of bases to code for a sequence of amino acids which are joined together to form a polypeptide - the basis of a protein. This process, which takes place on ribosomes, either free in the cytoplasm or on the rough endoplasmic reticulum, is not such a straightforward copying process, and it is called translation. In fact it takes a sequence of three (DNA or RNA) bases to code for a single amino acid. This is called a triplet, also referred to as a codon.
It was once thought that the only function of a gene was to produce a protein required by the cell, according to the concept described as the 'central dogma of molecular biology':
DNA makes RNA makes protein.
However, it is now thought that DNA often codes for RNA which is not translated into a protein. This RNA can have a specific function in the cell.
For example ribosomes - the site of protein synthesis - are composed of ribosomal RNA (and protein). Several forms of transfer RNA (tRNAs) share the function of bringing in amino acids in protein synthesis. Each of these forms of RNA is coded for by DNA in the nucleus.
Inborn errors of metabolism
Sometimes details of genes are discovered when investigating inherited diseases - conditions which appear to be carried down from generation to generation.
In fact the effect of the normal version (allele) of the gene is often not noticed until it is necessary to explain how the other allele - a mutant form - has its effect, and the gene is often named after the defective form!
Example: the CFTR gene.
Chromosome 7 has 989 protein-coding genes
The locus of the CFTR gene is shown in red
Like many genes, this exists in different forms, and it is named in terms of the protein which is produced as a result of its transcription and translation.
The Cystic fibrosis transmembrane conductance regulator (CFTR) is a membrane protein and chloride channel in vertebrates that is encoded by the CFTR gene.
Normally it allows the release of chloride ions from epithelial cells in the lungs, intestine and pancreas. It maintains an ion gradient that causes osmosis to draw water out of the cells.
In cystic fibrosis, thickened mucus occurs in the lungs, frequently causing respiratory infections, and pancreatic ducts become blocked, causing malnutrition and diabetes. In affected males, infertility is usually caused by the malformation of the vasa deferentia.
As it is caused by a recessive allele, two copies of the mutant form need to be inherited for the condition to be seen. This allele is actually quite common - carried by 1 in 23 people in the population.
The CFTR gene is found on chromosome 7, on the long arm at position q31.2, as shown on the right.
Positions on chromosomes are generally coded as p (for the short arm - 'petit') or q (for the longer arm - 'queue' - tail), followed by numbers for distance along the arm.
More specifically, it consists of about 188,700 base pairs and spans from base pair 116,907,253 to base pair 117,095,955.
The most common mutation, DeltaF508 (ΔF508) results from a deletion of three nucleotides in the DNA, causing a loss of the amino acid phenylalanine (F) at the 508th position on the polypeptide. This causes incorrect folding of the protein so it does not make a functional membrane channel protein and it is broken down.
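The effect of removing exactly three nucleotides can be illustrated with a short stretch of RNA. This is a sketch only: the 15-base sequence is an illustrative assumption rather than data given above, the codon table contains just the five codons needed, and the function names are hypothetical.

```python
# Just the codons used in this illustration (standard genetic code).
CODONS = {"AUC": "I", "AUU": "I", "UUU": "F", "GGU": "G", "GUU": "V"}


def translate(rna: str) -> str:
    """Read the RNA three bases at a time from the first base."""
    return "".join(CODONS[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def delete(rna: str, start: int, length: int = 3) -> str:
    """Remove `length` bases beginning at index `start`."""
    return rna[:start] + rna[start + length:]


normal = "AUCAUCUUUGGUGUU"   # illustrative sequence: I I F G V
mutant = delete(normal, 5)   # three bases lost, straddling two codons

print(translate(normal), "->", translate(mutant))
```

Because the number of bases lost is a multiple of three, the reading frame is preserved: one amino acid (F) disappears and the rest of the chain is unchanged.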
The genetic code is not a secret (any more)
The way that messenger RNA specifies the amino acids to be included in the polypeptide produced by a gene is another astounding fact about life on this planet:
All living organisms use exactly the same system, which amounts to a language for the production of proteins.
The genetic code is described as being:
- universal - it applies to all living organisms
- non-overlapping - RNA bases 1, 2 and 3 code for the first amino acid, and 4, 5 and 6 code for the second ...
- degenerate - there are 64 (4³) possible combinations of base triplets but only 20 amino acids
So most amino acids can be coded for by more than 1 base triplet combination.
In fact, nine have 2 alternatives, one has 3, five have 4, and three have 6 alternatives!
This is shown in the table opposite. The information is included for reference only. Do not worry about the details!
In exams, you will probably not be required to recall details of the genetic code. However, you may be asked to relate the base sequence of nucleic acids to the amino acid sequence of polypeptides, when provided with suitable data about the genetic code.
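For anyone curious, the degeneracy of the code can be counted by machine. The sketch below builds the standard genetic code from a compact 64-letter string of one-letter amino acid codes (with * for stop), listed in the conventional order U, C, A, G for each base position. The variable names are made up for illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One letter per triplet, in the order UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

GENETIC_CODE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many triplets code for each amino acid (stop triplets left out).
triplets_per_amino_acid = Counter(
    amino_acid for amino_acid in GENETIC_CODE.values() if amino_acid != "*"
)
# How many amino acids have 1, 2, 3, 4 or 6 alternatives.
amino_acids_per_count = Counter(triplets_per_amino_acid.values())

for alternatives in sorted(amino_acids_per_count):
    print(alternatives, "triplet(s):", amino_acids_per_count[alternatives], "amino acids")
```

The three remaining triplets of the 64 are stop signals, which is why the amino acid triplets add up to 61.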
The genetic code was worked out by Har Gobind Khorana, Marshall W. Nirenberg and Robert W. Holley who received the Nobel Prize in 1968.
In one of the first experiments to reveal the genetic code, a cell-free system was used to translate a poly-uracil RNA sequence (i.e., UUUUUU...).
The result was a polypeptide consisting of many copies of only one amino acid: phenylalanine (poly-phenylalanine). See UUU opposite.
Similarly it was found that the poly-cytosine RNA sequence (CCCCCC...) coded for the polypeptide poly-proline.
In a continuation of this work, RNA was synthesised consisting of alternating bases of uracil and cytosine (UCUCUC...). The result was a polypeptide consisting of two amino acids, serine and leucine, alternating.
Explain this result, using information above and opposite.
> RNA reads as 2 alternating triplets: UCU (which codes for serine) and CUC (which codes for leucine)
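These early experiments can be re-run on paper, or in a few lines of code. The sketch below is not how the original work was done, of course; it simply applies the standard genetic code to the same synthetic RNA sequences. The `translate` function and its `frame` parameter are made up for illustration.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str, frame: int = 0) -> str:
    """Translate RNA into one-letter amino acid codes.

    `frame` (0, 1 or 2) is the position of the first base to be read.
    Any incomplete triplet at the end is ignored.
    """
    triplets = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(GENETIC_CODE[triplet] for triplet in triplets)


print("poly-U :", translate("U" * 9))           # F = phenylalanine
print("poly-C :", translate("C" * 9))           # P = proline
print("UC rpt :", translate("UC" * 6))          # S = serine, L = leucine
print("UC rpt, starting one base later:", translate("UC" * 6, frame=1))
```

Wherever reading starts on the alternating sequence, the triplets UCU and CUC still alternate, so the polypeptide is always alternating serine and leucine.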
The genetic codes for each amino acid (table of RNA triplets)
3lc* = 3-letter code for amino acid - do not confuse with DNA/RNA triplets
I have included a few triplets "out of order" in italics to put all the alternatives together.
Some similar amino acids are grouped together: for example glycine, alanine and valine.
The first 2 bases seem to be more important than the 3rd.
In the two-alternative examples the third base can be either C or U, or else A or G.
This corresponds with pyrimidines (bases with a single ring) or purines (two rings).
Coding for degeneracy
Triplet codes can be expressed using the letters A, C, G and U for the 4 bases, but in conjunction with other letters: N for any of them, R for purines (A or G) and Y for pyrimidines (C or U).
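This shorthand can be expanded back into ordinary triplets mechanically. The sketch below covers only the letters mentioned above (N, R and Y, plus the four bases); the dictionary and function names are hypothetical.

```python
from itertools import product

# What each letter stands for in the shorthand described above.
LETTER_MEANINGS = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG",     # purines
    "Y": "CU",     # pyrimidines
    "N": "ACGU",   # any base
}


def expand(pattern: str) -> list[str]:
    """List every ordinary triplet matched by a shorthand triplet."""
    choices = (LETTER_MEANINGS[letter] for letter in pattern)
    return ["".join(bases) for bases in product(*choices)]


for pattern in ("GGN", "GAY", "GAR"):
    print(pattern, "=", ", ".join(expand(pattern)))
```

In the standard genetic code GGN covers all four triplets for glycine, GAY the two for aspartate and GAR the two for glutamate, so one shorthand triplet can stand for a whole group of alternatives.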
It has been found that much of the DNA in eukaryotes does not actually code for proteins/polypeptides (or, more properly, RNA producing proteins/polypeptides).
It used to be thought that this was junk DNA and it was compared to odd copies of outdated and backup information that build up on computer systems in the production of an article or web page. However it has been found that much of it has a role in the functioning of the cell.
Within a gene there are sections of DNA, called introns, coding for RNA that is edited out as the functional mRNA, composed of exons, is produced.
Between the actual genes are sections of DNA which may code for RNA which has a regulatory function, as well as different types of RNA involved in the synthesis of ribosomes themselves.
Some non-coding sections are multiple repeats of the same base sequence.
Satellite DNA forms the centromere, which is the constriction point of the pair of chromatids in dividing cells, giving the chromosome its X-shape. This is the point of attachment of the chromosome to the spindle. It also forms heterochromatin, which is a form of densely packed DNA that stains differently in cells. It is important for controlling gene activity and maintaining the structure of chromosomes.
And telomeres at the ends of chromosomes have repeated noncoding DNA sequences. These protect the ends of chromosomes from being degraded during the copying of genetic material during mitosis and meiosis.
Non-coding regions of DNA contain short, repeating sequences called variable number tandem repeats (VNTRs). These are used in genetic fingerprinting.
ORFs - another way of defining genes (and finding them)
An open reading frame is a section of DNA or RNA, beginning with a start codon and ending with a stop codon, with a number of bases between them that is a multiple of three. This effectively identifies a coding section of nucleic acid: a gene.
It is especially useful when considering large amounts of base sequencing data (human or other species genomes) stored on a computer, or information from other sources such as bacterial plasmids, viruses and organelles.
There are programs which scan datafiles searching for start and stop codons, with codon counting. When processing RNA it is conventional to perform a triple scan: starting at a given base, then again at each of the next two bases, to cover all three reading frames. With DNA, it is necessary to perform a triple scan on both strands (six reading frames in all), in order to check for genes on either strand of the double helix, as the sense and antisense relationships are not necessarily the same along the whole length of the DNA molecule. In each case, the sequence is read in the 5' to 3' direction, the direction in which mRNA is translated. In addition, ORFs are checked for associated regulatory elements such as promoter regions.
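A very much simplified version of such a scan is sketched below. It is not any particular published program: the function names and the test sequence are invented, it works on DNA letters (ATG as the start codon; TAA, TAG and TGA as stop codons), and it ignores regulatory elements altogether.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """The other strand of the double helix, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_strand(dna: str) -> list[tuple[int, int, int]]:
    """Triple scan of one strand.

    Returns (start, end, frame) for each open reading frame found,
    where dna[start:end] runs from the start codon to the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a reading frame
            elif start is not None and codon in STOPS:
                found.append((start, i + 3, frame))
                start = None                   # close it and carry on
    return found


def find_orfs(dna: str) -> dict[str, list[tuple[int, int, int]]]:
    """Scan all six reading frames: three on each strand."""
    return {
        "+": orfs_in_strand(dna),
        "-": orfs_in_strand(reverse_complement(dna)),
    }


sequence = "CCATGAAATAGGG"  # made-up example containing ATG AAA TAG
print(find_orfs(sequence))
```

Positions reported for the "-" strand are counted along the reverse complement, not along the original sequence. A real program would also set a minimum length, since short ORFs turn up by chance.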
These approaches are especially useful in comparing genes from different organisms. This is important in the context of evolutionary studies, and developments such as epidemiology of infectious diseases.
Other related topics on this site
This series (genetic information)
DNA and protein synthesis
Genetic diversity and adaptation
Structure of Nucleic Acids (DNA and RNA)
Genes, DNA and Chromosomes - same words, different order (s)
How DNA controls protein synthesis by means of a base code
Interactive 3-D molecular graphic models on this site
The DNA molecule - rotatable in 3 dimensions
Human genome - From Wikipedia, the free encyclopedia
Genetic code - From Wikipedia, the free encyclopedia
NHGRI History and Timeline of Events
Biologists can genetically engineer plants and animals - Online textbook
Genomic Basis for Methicillin Resistance in Staphylococcus aureus
The COVID-19 Pandemic: A Summary
The coronavirus is mutating - does it matter?