AI in computational biology


(This article is a reproduction of lectures I have given at engineering colleges in Mumbai, for students and faculty.)

In a previous article, I suggested the DTOP (Data-Technology-Objective-Possibilities) framework (see https://www.linkedin.com/pulse/4-hints-get-started-ai-your-company-devesh-rajadhyax) to analyse AI use cases in a company. Today I am going to use the same framework to explain some applications of AI to Computational Biology (CB).

In this article I will focus more on the data aspect, and there is a reason. Biological data is probably the most important kind of data for us. However, very few engineers have a good understanding of this data. I will be very happy if this article encourages some engineers to study biological data in detail. Understanding of and experience with this data will become a much wanted skill in the near future.

Let me first put CB use cases in DTOP.  

Data: Biological data is of many types. I will mention three major types here and explain one of them in this part:

-      Genetic data: data associated with our genome. There are about 3 billion letters in our genome, and we are just one of millions of species. The structure of DNA was discovered by the legendary scientists Watson and Crick in 1953. The first draft of the human genome was published in 2001. Since then, genome data has given us many insights about medicine and life in general.

-      Protein data: much of the structure and machinery of our body is made up of proteins. The structure of each protein is unique and is a source of data. Understanding protein data is crucial for discovering medicines and treating patients.

-      Medicine data: Medicines are chemicals, so they have a molecular structure. They interact with proteins and make changes in what is called gene expression. All this data becomes valuable for inventing new drugs.

Technology: The AI technologies available to Computational Biologists are mostly Machine Learning (both Supervised and Unsupervised) and Deep Learning. The major tasks that these techniques perform in CB are pattern detection, similarity and classification, among others. I will not write much about the techniques themselves as many excellent sources are available for studying them.

Objectives: Biology or medicine is a subject close to our heart and a number of objectives can be stated. However, most objectives fall into three major buckets:

-      Diagnosis: The protein and genetic data can be utilised for detecting what is wrong, or even what will go wrong, with a person in the future.

-      Treatment: All three types of data are needed for coming up with new medicines. As of now, not much use of data is made to decide a treatment plan for a patient, but this is the goal of personalised medicine.

-      Research: In universities and labs across the globe, research is being conducted that does not have much to do with humans, but will someday lead to better medicine. Research to understand the functioning of genes common to all life is one such example. The genetic and protein data will be useful here, but not necessarily those of human beings.

Possibilities: Applying AI to the rich biological dataset can create many possibilities. It can point you to chemicals as possible medicines, identify genes responsible for a function, find the root cause of a disease, or flag a person as susceptible to a certain disorder. But before we understand the possibilities, we need a certain understanding of the underlying data.


Genetic data

Cell


Our body is made up of cells and nothing else. The cell is the stage for everything that happens to us and all living beings.

The cell is of course an unimaginably large source of data. But at present we are interested in only one type – the data that is encased by the cell nucleus. It is called DNA.


The DNA

DNA is a very, very large molecule that looks like a twisted rope ladder. It is made up of four types of chemicals called A, G, C and T, after their chemical names (adenine, guanine, cytosine and thymine). The DNA is usually grouped into a set of such ladders called chromosomes. Human beings have 23 pairs of them, 46 in all. The sequence of A, G, C, T is our data. Let’s go ahead and see what this data means.
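Since the sequence of letters is the data, the simplest thing we can do with it is count the letters. The snippet below is a minimal sketch of that idea; the function name `base_counts` and the ten-letter fragment are made up for illustration and do not come from any real gene.

```python
from collections import Counter


def base_counts(sequence):
    """Count how often each of the four DNA letters appears in a sequence."""
    counts = Counter(sequence.upper())
    # Report the four letters in a fixed order, with 0 for any that are absent.
    return {base: counts.get(base, 0) for base in "ACGT"}


# Toy fragment, invented for this example (not a real gene).
fragment = "GGCAGATTCC"

print(len(fragment))          # number of letters in the fragment
print(base_counts(fragment))  # how many A, C, G and T it contains
```

Real genomes are stored in files that run to billions of letters, but the data is still just a string over a four-letter alphabet, which is why ordinary text-processing tools go a long way.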

The gene

The sequence on our DNA appears to have no beginning and no end, but in reality it is divided into stretches of letters called genes.

GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGTATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACCATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCCCACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCCCTTTGGCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGCAGAGGGAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGGTGAACCAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGATGGACCTGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCTGTCTGGCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTCTGCTCCACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGGCTGGGGTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGGGAGTGGCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGGAGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCGGATCCCCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTTCAGGAGGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTGCTCCTCTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCCCCATCATCCCAGCTGCTCCCAAATAAACTCCAGAAG

(The sequence for HSBGPG - Human gene for bone gla protein (BGP))

There is no standard length for a gene. One gene is a recipe for making one protein. We will now see how.

Amino Acids

All living beings are made up of primitive compounds called Amino Acids. As of last count, there are 21 of them. This means all the life you see around you is made up of just 21 chemicals. That's some amazing modularity that makes data scientists happy.

Each amino acid is represented by three letters on DNA. Examples:

AGC – Serine

GCA – Alanine

See more for yourself:

(In a little quirk of biology, T is replaced by U, but I will tell you that one later when we learn about RNA)
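To make the three-letter rule concrete, here is a minimal sketch that cuts a DNA string into groups of three and looks each one up. The names `CODON_NAMES`, `split_into_codons` and `name_codons` are made up for this example. The Serine and Alanine entries are the two examples given above; the Glycine and Methionine entries are added from the standard genetic code. The table is deliberately tiny, so any other triplet is reported as "unknown".

```python
# Three-letter DNA words and the amino acids they stand for.
# AGC and GCA are the examples from the text; GGC and ATG are added
# from the standard genetic code purely for illustration.
CODON_NAMES = {
    "AGC": "Serine",
    "GCA": "Alanine",
    "GGC": "Glycine",
    "ATG": "Methionine",
}


def split_into_codons(dna):
    """Cut a DNA string into consecutive three-letter groups.

    Leftover letters at the end (fewer than three) are dropped.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def name_codons(dna):
    """Look up each three-letter group in the small table above."""
    return [CODON_NAMES.get(codon, "unknown") for codon in split_into_codons(dna)]


example = "ATGAGCGCAGGC"  # toy sequence built from the four table entries
print(split_into_codons(example))
print(name_codons(example))
```

A full table has 64 entries, one for every possible three-letter combination of the four letters.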

An amino acid is an organic compound, which means it is made from carbon. This is why they say that life on earth is carbon based.

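This is how amino acids look: a central carbon atom bonded to an amino group (NH2), a carboxyl group (COOH), a hydrogen atom and a side group R, where R represents a group of atoms, different for each amino acid.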

Amino Acids are the building blocks of proteins. That's the next link in this explanation.



Proteins

Proteins are long sequences of amino acids, turned and twisted in a particular way.

The figure on the right is a representation that shows various substructures in a protein, such as coils and wires. We should keep in mind that all these structures are made by the twisting and turning of amino acid sequences, and the colours and shapes are purely representational.

So now you know that a gene is an instruction set for making one protein. It is almost as if it says – ‘put some Alanine, add a pinch of Glycine,….., twist, twist, …, turn’. But how does the manufacturing happen? That brings us to the RNA and some messaging.


The manufacturing takes place in microscopic machines inside the cells called ribosomes. When a gene is to be made into a protein, a copy of the sequence of that gene is taken on a material called RNA. For our purposes, there are just two differences between DNA and RNA: a) the chemical T is replaced by the chemical U and b) RNA has just one strand, one side of the ladder.

This copy of the gene, called messenger RNA or mRNA, travels to the ribosome. The ribosome is like a machine in a plastic factory. From one side a generous supply of amino acids is fed to it. It then connects them as per the sequence in the mRNA, twists and turns as per the instructions and sends a manufactured protein molecule out from the other side.
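In data terms, the copying step is a simple string operation. The sketch below assumes the gene is written the way this article writes it, as a single string of A, G, C and T, and produces the RNA spelling by swapping T for U. The function name `transcribe` is my own choice, and the input string is a toy example.

```python
def transcribe(dna):
    """Return the RNA spelling of a DNA string: every T becomes a U."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA letter: {sorted(unexpected)}")
    return dna.replace("T", "U")


toy_gene = "ATGAGCGCATAA"  # toy sequence, not a real gene
print(transcribe(toy_gene))
```

The single-strand difference needs no code at all: a string already represents just one side of the ladder.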
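The connecting step of the ribosome can be imitated in a few lines. This is a minimal sketch, not a model of real biology: it assumes reading starts at the first letter of the mRNA, it uses the standard genetic code with one-letter amino acid abbreviations (M for Methionine, S for Serine, A for Alanine and so on), and it stops at the first 'stop' triplet, which the standard code uses to mark the end of a protein. It says nothing about the twisting and turning. The names `CODON_TABLE` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, written compactly: the 64 triplets over U, C, A, G
# (in that order) paired with one-letter amino acid codes. '*' marks 'stop'.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Read an mRNA string three letters at a time and chain the amino acids.

    Reading stops at the first stop triplet or when fewer than three
    letters remain.
    """
    mrna = mrna.upper()
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy mRNA: Methionine, then the Serine and Alanine triplets from earlier, then stop.
print(translate("AUGAGCGCAUAA"))  # prints MSA
```

The output is itself a sequence, this time over an alphabet of amino acids, which is the starting point for protein data.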

When a gene causes its protein to be manufactured in a particular cell, the gene is said to be ‘expressed’. This gives rise to gene expression data, which basically records which genes are expressed in a particular cell. Gene expression data is increasingly being used in detecting many disorders and discovering new drugs.
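One simple way to picture gene expression data is as a table with one row per cell and one column per gene. The sketch below is purely illustrative: the cell names, gene names and numbers are invented, and treating any level of 1 or more as 'expressed' is an assumption made for this example, not a standard rule.

```python
# Hypothetical expression levels: cell -> gene -> measured amount.
# All names and numbers are invented for illustration.
expression = {
    "cell_A": {"gene_1": 120, "gene_2": 0, "gene_3": 15},
    "cell_B": {"gene_1": 0, "gene_2": 300, "gene_3": 22},
}


def expressed_genes(levels, threshold=1):
    """Return the genes whose level reaches the threshold, in sorted order."""
    return sorted(gene for gene, level in levels.items() if level >= threshold)


def differing_genes(levels_a, levels_b, threshold=1):
    """Return genes that are expressed in one cell but not in the other."""
    a = set(expressed_genes(levels_a, threshold))
    b = set(expressed_genes(levels_b, threshold))
    return sorted(a ^ b)


print(expressed_genes(expression["cell_A"]))
print(expressed_genes(expression["cell_B"]))
print(differing_genes(expression["cell_A"], expression["cell_B"]))
```

Comparing which genes differ between two kinds of cells, for example healthy and diseased ones, is the kind of question this data lets us ask.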

Well, I think that is enough content for one article. Allow me to break this into a series and discuss some use cases based on genetic data in the next part. I have two in mind – identifying gene functions and repositioning drugs. 

By Devesh Rajadhyax

Co-Founder, Cere Labs
