Transcript of meeting in BioMOO, PPS Base 22nd May '96 15:00 GMT: Introduction to Bioinformatics

We would like to thank PPS Consultant Clare Sansom for holding this seminar.

This transcript can currently be found on the tape 'bioinfo_tape' in the PPS Base.


ClareS says, "Right - let's start."

ClareS says, "As an introduction, I should point out that this will be a very elementary bioinformatics introduction. I will be covering the basics of sequence databases, how to analyse them, and how sequence and structural information are related."

ClareS says, "Gustavo and GeorgF, who are here, certainly know *much* more about (e.g.) how sequence alignment algorithms work than I do. It may be possible to cover these in a later advanced session."

ClareS says, "Before I start properly, *please* do interrupt with any questions you may hav"

ClareS . o O ( have )

Mykol says, "I'm particularly interested in "

ClareS says, "The agenda will be the following:-"

ClareS says, "1) Sequence databases and how to find them on the Net"

ClareS says, "2) the anatomy of a SwissProt entry"

ClareS says, "3) Searching databases"

ClareS says, "4) Similarity, identity, homology"

ClareS says, "5) What can be learned about structure from sequence"

ClareS says, "6) Public domain software available"

ClareS says, "Does anyone have any comments on that agenda? Anything you are particularly interested in?"

Mykol says, "I think the agenda is good, and I'm particularly interested in the applications"

Jzt thinks it will also be useful to those not here who will read the transcript.

ClareS says, "OK. Here goes."

ClareS says, "Sequence databases..."

ClareS says, "Of course, there are two kinds of sequence database - gene sequence databases and protein sequence databases"

ClareS says, "But databases can also be classified into those which are built up from direct submissions and those which are derived from source databases"

ClareS says, "Of the first kind, the best known GENE sequence database is GenBank / EMBL and the best known PROTEIN Sequence database is Swiss-Prot."

ClareS says, "For this section, it would be best for you to have access to a web browser"

ClareS says, "as most of these databases are available to search on the Web, and I will be giving out some URLs."

ClareS says, "Most (if not all) sequence databases are freely available, and each has a web site."

ClareS says, "For example, SwissProt and a large number of associated databases are mounted at the ExPASy server in Geneva"

ClareS . o O ( No, I don't know precisely why ExPASy )

ClareS says, "The main page with links to all the databases is at"

ClareS says, "GenBank is at"

ClareS says, "One of the most important derived sequence databases is OWL."

ClareS says, "Owl is "a comprehensive, non-redundant sequence database""

Jzt finds the expasy page has a large health-warning on it!

ClareS says, "Comprehensive: it is a protein database derived from all the source databases, including translations from the nucleic acid databases"

PeteO says, "What do you mean by "source database"?"

ClareS says, "Non-redundant: each sequence is only included *once*. So if a particular sequence is included from one database, it won't be included from another one"

PeteO nods.

ClareS says, "Source databases are the databases which gene and protein sequences are originally submitted to, like GenBank and SwissProt. OWL is derived from these databases."

ClareS says, "SwissProt is taken as the primary source as its annotation is particularly good. So when OWL is created, if a GenBank sequence is also in SwissProt, it isn't included again - and so on, down a list of databases"
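
As a rough illustration of the "down a list of databases" idea above, here is a minimal Python sketch of building a non-redundant set from prioritised sources. It assumes that "redundant" means an exact sequence match, which may not be the criterion OWL actually uses; the function name, database contents and entry identifiers are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_nonredundant(
    sources: List[Tuple[str, Dict[str, str]]]
) -> Dict[str, Tuple[str, str]]:
    """Merge source databases, listed highest priority first.

    Returns {sequence: (database_name, entry_id)}, keeping only the first
    (highest-priority) occurrence of each sequence.
    Assumption: two entries are redundant only if their sequences are identical.
    """
    kept: Dict[str, Tuple[str, str]] = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key not in kept:  # not already taken from a higher-priority source
                kept[key] = (db_name, entry_id)
    return kept


# Hypothetical toy data: the same sequence appears in both sources.
swissprot = {"SP_ENTRY_1": "MALWMRLLPL"}
genbank_translations = {"GB_ENTRY_1": "MALWMRLLPL", "GB_ENTRY_2": "MKTAYIAKQR"}

merged = build_nonredundant(
    [("SwissProt", swissprot), ("GenBank", genbank_translations)]
)
for sequence, (db_name, entry_id) in merged.items():
    print(db_name, entry_id, sequence)
```

With the toy data, the SwissProt copy is kept and the GenBank duplicate of the same sequence is skipped.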

ClareS says, "OK so far?"

HorstJS nods.

JohnW nods

PierreH nods.

ClareS says, "OWL is on the UCL database browser:"

Kurt says, "If for example you had an alternatively spliced protein, would it get one or more entries in OWL?"

ClareS [to kurt]: I would imagine that it would get more than one entry, but I don't know for sure

ClareS says, "if that's not enough for you, there are lots of lists of databases on the net. The WWW Virtual Library of Biosciences (a very useful list of a huge number of sites) has many links"

ClareS has discovered that the Americans have woken up ;-)

PierreH says, "about alternatively spliced proteins: I checked for one in OWL, and there are 2 entries (for human insulin receptor)"

ClareS says, "The Virtual Library: Biosciences page is at"

ClareS says, "Thanks, Pierre"

ClareS says, "I'm now going to go through the kind of information that is available in a SwissProt entry. If you want to follow this closely, please find the ExPASy home page "

ClareS says, "is anyone who wants to follow this going to have any problems running netscape at the same time?"

Jzt says, "OK here"

ClareS says, "A couple of digressions. The first *important* one screams at you from the ExPASy home page."

Kurt says, "Are you sure of the Virtual Library: Biosciences page?"

Kurt says, "I get " unable to locate the server: golgi.harvard.."

ClareS says, "there is a funding crisis and the vital information now freely available in all the SwissProt related databases may disappear shortly"

ClareS says, "please email messages of support"

PeteO says, "I've accessed the Virtual Library (from California)."

ClareS says, "(but not now)"

ClareS [to kurt]: I have periodic problems accessing that page from here, but the server address is OK

Kurt says, "Seems to be working OK now!"

PierreH says, "I just connected with the URL, usually works"

ClareS says, "another minor digression: those of us in Europe have sequence information available via national (and other) nodes of the European Molecular Biology Network (EMBnet)"

JohnW says, "Yes, sometimes Netscape claims it doesn't have a DNS entry, then works happily when you try again"

ClareS says, "this includes some commercial programs and search tools, as well as the (currently) free databases"

ClareS says, "if anyone from Europe would like the URL of their national EMBnet node, let me know"

ClareS says, "but now, on with the search of SwissProt....."

ClareS says, "go down past the banner headline to "database entry points" and click "SwissProt""

DWild says, "sorry I arrived late - what is the URL you're discussing...?"

ClareS says, "you can search SwissProt by name, description, keywords, accession number (to find an individual entry) but *not* by sequence string, you'll do that with OWL later"

ClareS [to DWild]:

ClareS [to DWild]: then find "Database Entry Points... SwissProt"

ClareS says, "Click on "Direct access to Swiss-Prot... by description or identification", and then type the name of a protein you're interested in"

ClareS says, "(no-one is trying to do this on a browser which doesn't support forms, are they?)"

JohnW adds that the 'Graphical Example' link is very helpful!

ClareS says, "if you land up with a screen full of entries, select one at random"

ClareS says, "OK?"

Jzt has a large number of insulin entries.. how to choose?

ClareS [to jzt]: for now, at random...

JohnW [to jzt]: Remember that if you searched for 'insulin' you will also get 'insulin-like growth factor' etc

ClareS hopes that everyone has picked a protein with known structure - there are lots of nice cross-links to other databases including structural ones

ClareS [to JohnW]: Good point! - this illustrates one of the main pitfalls of keyword searching

PeteO says, "I'm getting a "500 Server Error: contact server administrator" when I try to search on insulin."

ClareS says, "You tend to have to refine your search criteria carefully - to find all insulins correctly, you will need to search for "insulin AND NOT growth" and then probably refine the criteria still further to remove false positives"
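
The "insulin AND NOT growth" refinement can be sketched as a simple boolean filter over description lines. This is only an illustration in Python, not the query engine the SwissProt server uses; the function name and the three description strings are hypothetical.

```python
from typing import List


def matches(description: str, include: List[str], exclude: List[str]) -> bool:
    """True if every `include` word is present and no `exclude` word is."""
    text = description.lower()
    has_all = all(word.lower() in text for word in include)
    has_excluded = any(word.lower() in text for word in exclude)
    return has_all and not has_excluded


# Hypothetical description lines, for illustration only.
descriptions = [
    "INSULIN PRECURSOR",
    "INSULIN-LIKE GROWTH FACTOR I PRECURSOR",
    "INSULIN RECEPTOR PRECURSOR",
]

hits = [d for d in descriptions if matches(d, ["insulin"], ["growth"])]
print(hits)
```

The growth factor is filtered out, but the receptor still slips through, which is the kind of false positive that needs a further refinement of the criteria.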

ClareS [to PeteO]: that's a problem with the server or the connections, if this happens, try again!

ClareS says, "OK, I will assume that most people are looking at an entry"

JohnW nods

ClareS says, "the first couple of lines just give the sequence identifier and the accession number, the next section is the bibliographic details"

ClareS says, "which is fairly standard, but note the links - to Medline etc"

ClareS says, "Enzymes also have the EC number given and linked to the appropriate place in the ENZYME classification database (also at ExPASy)"

ClareS says, "after comments, comes a very useful section: links to other databases"

ClareS says, "firstly, the nucleic acid sequences, in GenBank and EMBL"

ClareS says, "then, *if there is a structure*, a direct link to the PDB entry"

ClareS says, "you can retrieve the entry, load it into Rasmol, or look at the entries in two other structural databases: HSSP (secondary structure) and SCOP (fold classification). I don't have time to go into those now, but they are very useful"

ClareS says, "ProSite is another very useful database which is linked to SwissProt. Does everyone have a line starting PROSITE: in their entry? "

HorstJS says, "yes"

PeteO nods

Iddo nopes

ClareS says, "Prosite is a database of sequence patterns"

DWild says, "no - just PRODOM"

ClareS says, "Very many sequence patterns have been determined which have functional relevance"

ClareS says, "so a particular protein family *is known to* contain a certain sequence pattern"

ClareS says, "These patterns are listed in the PROSITE database"

ClareS says, "PRODOM is similar: it describes the domain structure of the proteins"

ClareS says, "*All* these databases are maintained by the indefatigable Amos Bairoch, at Geneva"

ClareS says, "and at the SwissProt site each database is cross-referenced to the relevant entry by a link"

ClareS says, "below the cross references, important residues are listed"

ClareS says, "they could be active sites, known disulphides, or just residues for which there is a sequencing conflict"

ClareS says, "if secondary structure is known, it is listed"

ClareS says, "and finally, at the foot of the entry, you get to the sequence - in single letter code"

ClareS says, "OK, is anyone unclear about anything so far?"

JohnW says, "Just one thing about the sequence-"

ClareS says, "of course, this isn't the only way you can search a sequence database"

JohnW says, "looking at insulin for example, some of the entries include the sequence of the C (excised) peptide, and some don't - i.e. some are the actual primary sequence of the protein, and some are of precursors"

ClareS says, "a lot of the time, you will be interested in which proteins are similar to a protein you're studying"

ClareS [to JohnW]: the keywords field, if not the title, should indicate whether you're looking at a precursor

ClareS [to JohnW]: if the sequence is a translation of a gene sequence, rather than a sequenced protein, it is very often the precursor

JohnW nods

Kurt says, "What does OWL do in such cases, does it just have the precursor?"

ClareS says, "Suppose you are only interested in which other proteins contain a short amino acid sequence which is known to be important in "your" protein."

ClareS [to kurt]: Owl will have the precursor and the mature sequence. In this way you can tell where the mature protein starts ;-) Precursors and mature proteins are not different enough for Owl to throw one out

ClareS [to kurt]: not *similar* enough

ClareS blushes

Kurt smiles

ClareS says, "You can search Owl for short peptides. Can everyone (who is following this - it's not compulsory!) find"

ClareS says, "Then click "OWL" (first entry of main database list)"

ClareS says, "then click "by sequence""

ClareS says, "enter a short chunk of single letter code in the box"

ClareS says, "for example, if your name happened to be David, you might be curious to know how many proteins contain the seq Asp-Ala-Val-Ile-Asp"

JohnW tries sequence probe 'HAPPY' and gets 4 matches

ClareS says, "seriously, 4-7 amino acids usually give a sensible number of matches"

Jzt wonders what's the limit of 'sensible'?

ClareS says, "less than 4, you retrieve thousands: more than 8, you're down to "your" protein and its very close homologues"
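
A back-of-envelope sketch of why 4-7 residues is the sensible range: if all 20 amino acids were equally likely (they are not, so this is only a crude assumption), a peptide of length k would turn up by chance about once per 20^k positions. The Python below pairs that estimate with a plain substring search; the function names, the toy sequences and the database size of 50 million residues are hypothetical.

```python
from typing import Dict, List


def find_peptide(peptide: str, sequences: Dict[str, str]) -> List[str]:
    """Return the ids of all sequences containing the exact peptide."""
    probe = peptide.upper()
    return [entry_id for entry_id, seq in sequences.items() if probe in seq.upper()]


def expected_random_hits(peptide_length: int, total_residues: int) -> float:
    """Rough expected number of chance matches, assuming uniform residue frequencies."""
    return total_residues / 20 ** peptide_length


# Hypothetical toy database.
toy_db = {"entry_a": "MKDAVIDLLG", "entry_b": "MKTAYIAKQR"}
print(find_peptide("DAVID", toy_db))

# Hypothetical database size, for illustration only.
for k in range(3, 9):
    print(k, expected_random_hits(k, 50_000_000))
```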

JohnW nods

Iddo says, "unless you enter something like an N-myristoylation site"

ClareS says, "if (and it's a big "if") the chunk of sequence in question has a known function, then you would be better searching in the Prosite database"

ClareS [to Iddo]: How many residues are there in an N-myristoylation site consensus sequence? I can't remember

Iddo [to ClareS]: neither can I. But I think more than 4

ClareS says, "One to look up in Prosite later, I think"

ClareS says, "You can also search OWL by regular expression, allowing some residues in a sequence to vary. The query language is the same sa Prosite, and too complicated to go into at the moment"

ClareS . o O ( as Prosite... )
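
For readers curious about what such a pattern search involves, here is a minimal Python sketch that converts a simplified subset of the PROSITE pattern notation (x for any residue, [..] for alternatives, {..} for exclusions, (n) or (n,m) for repeats, < and > for the sequence ends) into a regular expression. It is not the code used by OWL or PROSITE, the function name is hypothetical, and the example pattern is made up for illustration rather than taken from the database.

```python
import re

# One pattern element: optional "<", a core, an optional repeat count, optional ">".
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            regex = core                       # single residue or [alternatives]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(parts)


# Made-up example pattern, not a real PROSITE entry.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVM]-{P}")
print(regex)
print(bool(re.search(regex, "AACGGCAAALQ")))
```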

ClareS says, "patterns and regular expressions might be a suitable topic for a later advanced session, possibly?"

Gustavo . o O ( definitely )

Iddo [to ClareS]: does owl do imperfect matches (gaps, a few AA exchanged)?

ClareS says, "If you want to find out which proteins have similar sequences to a full sequence, you need to use one of the many homology searching programs around"

ClareS says, "Again, most of these programs are public domain"

ClareS says, "Probably the two best known are Blast and FastA"

ClareS says, "they are also pretty fast on a powerful machine"

ClareS says, "A lot of people also use Sweep (written by the UK's EMBnet nodemaster, Alan Bleasby)"

ClareS says, "Basically, the only difference between any of these programs is the precise algorithm used"

ClareS says, "you define certain parameters - e.g. numerical penalties for introducing or lengthening gaps in sequences - and give the program your sequence. "

ClareS says, "It will return a sorted list of the "most similar" proteins in the database"
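
To make the role of the gap parameters concrete, here is a minimal Python sketch of a global alignment score with separate penalties for introducing and for lengthening a gap. It is a textbook dynamic-programming scheme (affine gap penalties), not the faster heuristic methods that Blast and FastA use, and the +1/-1 match/mismatch scores and the default penalties are arbitrary assumptions; real programs score residue pairs with a substitution matrix.

```python
NEG_INF = float("-inf")


def global_alignment_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Best global alignment score of a and b with affine gap penalties.

    gap_open is charged for the first position of a gap and gap_extend for
    each further position. All scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # M: last column pairs two residues; X: residue of `a` against a gap;
    # Y: residue of `b` against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(global_alignment_score("ACDE", "ACDE"))
print(global_alignment_score("ACDE", "ACE"))
```

A database search would, in principle, compute a score like this against every entry and sort the entries by it; raising gap_open makes the alignment prefer mismatches over gaps.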

ClareS says, "Blast and FastA work with *either* gene *or* protein sequences, but you need to choose an appropriate database!"

ClareS can remember searching a protein database for a "protein" consisting only of Ala, Thr, Cys and Gly

ClareS says, "One advantage of FastA is that there is also a related program, TFastA. This allows you to search a protein sequence database with a NA sequence"

ClareS says, "it simply translates the gene sequence into each of the 6 reading frames (3 on each strand) and searches the protein database with each translated sequence"

ClareS . o O ( thus taking 6 times longer, approx..... )
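
The six-frame translation step can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code, with no handling of ambiguous bases beyond writing "X"; the function names and the frame labels (+1 to +3, -1 to -3) are assumptions of this sketch, not the conventions of any particular program.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' is a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Translate all three frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand_name}{offset + 1}"] = translate(strand[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translated sequences is then compared separately, which is where the roughly sixfold cost comes from.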

ClareS says, "is everyone following this, still?"

PeteO nods

Iddo nods

Kurt nods

Silke nods

JohnW nods

Jzt nods

HorstJS nods.

counter-ion turns its eyes toward JohnW.

DWild nods

ClareS says, "One thing which is important to know (and which there is no definite answer to) is what degree of sequence similarity implies a functional or evolutionary relationship"

ClareS says, "or, from a structural biologist's perspective, what degree of similarity to a protein of known structure implies that "your" protein is likely to have the same fold"

Ahotz nods

ClareS says, "it is well known that you need a *very high* degree of sequence IDENTITY (in the high 90%s) before you can have a reasonable chance of modelling the 3D structure of an unknown protein, based on a known one"

Iddo says, "There's the good example of HIV-1 RT, which is a dimer of two subunits identical in sequence, except that one is about 150 AA shorter. They fold quite differently"

ClareS says, "Yes, the modellers would never have predicted that- more likely, they would have assumed that the extra 150aa folded into a different domain"

Iddo says, "Actually, crystallographers think that the difference in the fold is due to the dimerization process itself..."

ClareS says, "if you have 1) a reasonably high % sequence identity (say > 30%, even), 2) a common FUNCTION, and 3) if possible, a common sequence motif, you can be reasonably sure that the proteins will be members of the same *fold family*..."

Silke [to Iddo]: does that mean that as monomers they would have a similar fold?

ClareS says, "a good example of this is the lipocalin family"

ClareS says, "members of this family have common motifs and a common function, but only about (typically) 20-30% sequence identity between pairs, and they share the same fold"
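
Since percentage identity figures come up repeatedly here, a minimal Python sketch of how one might be computed from a pairwise alignment follows. It assumes the two sequences are already aligned to equal length with "-" for gaps, and it divides by the number of positions where both have a residue; other conventions for the denominator exist and give different numbers. The function name and the example strings are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over positions with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # gapped columns are left out of this particular convention
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned pair, for illustration only.
print(percent_identity("ACD-EF", "ACEQEF"))
```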

Iddo [to Silke]: Probably yes. I don't recall if it's been proven

ClareS says, "one thing I would like to make clear now, is the definition of the word "homology""

ClareS says, "you will have noticed my referring to percentage sequence identity, or sequence similarity, but not homology"

ClareS says, "the word "homology" has a strict meaning - two proteins are homologous if they are evolutionarily related"

Iddo [to ClareS]: how does the lipocalin similarity look if viewed through a PAM/BLOSUM matrix?

ClareS [to Iddo]: not sure... worth trying

Silke [to Iddo]: what is a PAM/BLOSUM matrix?

ClareS [to Iddo]: Would matrices be another suitable topic for an advanced class?

Iddo [to ClareS]: yes. Sorry to interrupt... there was a delay here.

ClareS says, "Well, I have covered *most* of the agenda, and it is nearly 18:00 (BST) so I propose to end the formal part of the meeting at 18h. After that I can stay around for a while, if anyone has any questions. Is there anything anyone would particularly like me to cover (if I can) in the next 5 minutes or so?"

Jzt says, "Many thanks for covering so much material today, Clare."

Iddo claps

JohnW nods in agreement

Gustavo would just like to point out that, even though there is a strict definition of the word 'homology', *too many* people use homology/similarity/etc interchangeably.

Paulyta jumps up and down in excitement

PeteO says, "Thanks for the excellent discussion:-)"

Kurt . o O ( homology/similarity/e )

Jzt worries about Paulyta

ClareS says, "One thing I haven't covered, explicitly, is public domain programs. Most of the programs I have discussed today *are* public domain."

JohnW does a double back flip...oh never mind

Kurt . o O ( terms for the glossary )

Gustavo nods to Kurt.

ClareS says, "One program I haven't mentioned is GCG"

ClareS says, "that is certainly NOT public domain, and it is *extremely* useful - it is a suite of programs for all the tasks I have mentioned today, plus many others - gene to protein translation, phylogenetic trees, secondary structure prediction, transmembrane helices, glycosylation sites, etc..."

Gustavo says, "For those who don't know yet - Eitan and I have established here a 'GCG HelpDesk' (east from the Central Room), where people can help each other on any issues related to GCG, or sequence analysis in general..."

Silke says, "is there any documentation on GCG available?"

ClareS says, "the *good news* for those of us in Europe who don't have direct access to GCG, is that all EMBnet nodes have either GCG or a similar suite available for their users."

DWild says, "will there be a transcript of this discussion available by ftp or mail?"

ClareS says, "and membership of an EMBnet node is free at least for academics"

JohnW says, "If it's OK with everyone, I will put a transcript of this on the WWW"


GCG documentation


HorstJS says, "where can you retrieve a list of EMBnet nodes?"

ClareS says, "the GCG manual is also available for everyone (not just people with seqnet usernames) at Seqnet, the UK EMBnet node"

ClareS says, ""

DWild [to JohnW] - please could you let us know the URL for the transcript?

Kurt thanks Clare for the EXCELLENT class and the many useful URLs, then waves and leaves

ClareS [to HorstJS]:

HorstJS says, "thank you! :-)"

ClareS [to HorstJS]: the EMBnet doc. is mirrored throughout Europe - i.e. - but I can't remember exactly which other countries have mirrors

ClareS wonders if she can get a commission from Alan B.

JohnW [to DWild]: Yes, I will put it up at

ClareS says, "I think I really ought to have turned the recorder off before now..."

ClareS turns the ClareS_recorder off.