MIPE
Minimal Information for PCR Experiments
An XML schema for the exchange of PCR related data

Jan Aerts

News

July 20, 2005

A manuscript describing the MIPE format is now available from the Online Journal of Bioinformatics website. Reference: Aerts J and Veenendaal T. MIPE - a XML-format to facilitate the storage and exchange of PCR-related data. OJB 6(2): 106-112 (2005).

In addition, Steffen Moeller has created a Debian package for MIPE, which is available at http://bioinformatics.pzr.uni-rostock.de/~moeller/debian/mipe. To install, run 'dpkg --install mipe_1.1-1_all.deb'.

May 3, 2005

A manuscript describing the MIPE format has been submitted.

April 5, 2005

The format has now reached version 1.0 and is defined in an XML Schema file instead of a DTD file. The scripts have undergone minor changes as well. CAUTION: I have not yet had the time to test these properly! In addition, the comments on this website still reflect older versions of MIPE. When time permits, I'll update them.

Purpose

To provide a standard format (i.e. MIPE) for the exchange and/or storage of all information associated with PCR experiments, using a flat text file. This will:

allow for exchange of PCR data between researchers/laboratories
enable traceability of the data
prevent problems when submitting data to dbSTS or dbSNP
enable the writing of standard scripts to extract data (e.g. a list of PCR primers, SNP positions or haplotypes for different animals)

Although this format can be used for data storage, its primary focus should be data exchange. For larger repositories, relational databases are more appropriate for storing these data. The MIPE format could then be used as a standard format for import into and/or export from these databases. (For an example of using text files for data exchange, see Lincoln Stein's article How Perl Saved the Human Genome Project.)

If I have the time, I'll post an SQL schema for a relational database that stores PCR-related data on this site. In addition, a small script will be written to import data into and export data from this database implementation.

If the MIPE format almost-but-not-completely serves your needs...

As this is an open format, please don't hesitate to contact me if the MIPE-format almost-but-not-completely serves your needs. It has been developed to serve ours, but can easily be extended (which is what I hope to do with your help).

Download

Download from the SourceForge website, or check out the latest sources from CVS using:

cvs -d:pserver:anonymous@cvs.sf.net:/cvsroot/mipe export -D tomorrow all

Developers can check out using:

cvs -d:ext:username@cvs.sf.net:/cvsroot/mipe checkout all

A Debian package is available on http://bioinformatics.pzr.uni-rostock.de/~moeller/debian/mipe.

Rationale

The MIPE format is built on two basic parts:

the designed PCR product: which primers were used and what sequence is expected to be amplified when this PCR is run
the used PCR product: after running the PCR and sequencing the product, what is the resulting sequence, is it the reverse complement of the designed sequence, and were any polymorphisms (e.g. SNPs) detected by amplifying different individuals?

A schematic overview of the relationship between these parts is provided below.

Design

The design part of a MIPE record contains information on the source that was used to design the PCR primers (e.g. an accession number or DNA sequence) and information on the PCR primers.

Use

The use part of a MIPE record contains information on the results from a PCR resequencing experiment. These include the DNA sequence of the amplified fragment, whether or not this is the reverse complement of the DNA sequence as presented in the design part, any polymorphisms with associated assays and samples with associated genotypes.
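Whether a sequenced fragment is the reverse complement of the designed sequence can be checked mechanically. The following is a minimal sketch in Python, not part of the MIPE scripts; the function names are hypothetical, and it assumes upper- or lower-case DNA strings using the standard IUPAC nucleotide codes.

```python
# Complement table for the four bases and the IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_reverse_complement(design_seq, use_seq):
    """True if use_seq occurs in the reverse complement of design_seq."""
    return use_seq.upper() in reverse_complement(design_seq)


print(reverse_complement("AACG"))
print(is_reverse_complement("AACGGT", "CCGT"))
```

A result like this could be used to decide the value stored in the use part, but how the flag is actually determined is up to the researcher.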

Implementation

XML

XML stands for eXtensible Markup Language. It looks much like HTML (HyperText Markup Language), which uses tags to mark up text for web pages. An example of HTML text is:

This <i>word</i> is in italic.

It is displayed on webpages as:

This word is in italic.

This example shows that HTML stores information on how words should be presented to the end-user.

XML - contrary to HTML - stores information on what words mean. For example, the XML text <seq>AGGTCCACCTWGGSCC</seq> represents a so-called element, consisting of an opening tag (<seq>), the content (AGGTCCACCTWGGSCC) and a closing tag (</seq>). The closing and opening tags give information on what the thing in between them actually is. Note that spaces after the opening tag or before the closing tag are of significance and are not automatically removed. Therefore <id> some_id </id> is not the same as <id>some_id</id>.
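The significance of whitespace can be seen with any XML parser. A minimal sketch using Python's standard xml.etree.ElementTree module (the choice of Python here is an assumption for illustration; the MIPE scripts themselves are Perl):

```python
import xml.etree.ElementTree as ET

padded = ET.fromstring("<id> some_id </id>")
tight = ET.fromstring("<id>some_id</id>")

# The parser keeps the surrounding spaces as part of the element content.
print(repr(padded.text))           # ' some_id '
print(repr(tight.text))            # 'some_id'
print(padded.text == tight.text)   # False
```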

It is possible to nest elements within other elements. For example: the different properties of a SNP can be represented as follows:

  <snp>
    <pos>591</pos>
    <amb>R</amb>
    <sbe>
      <oligo>OL04-231</oligo>
      <specific>GAATACCAGCTACT</specific>
      <tail>TTTTTTTTTTTTTTTTTTTTTTTTTTTT</tail>
    </sbe>
    <remark>this is a remark about the SNP</remark>
  </snp>
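Nested elements like these can be read back with a simple path lookup. The sketch below uses Python's standard library and the SNP example above; it is an illustration only and not one of the MIPE scripts.

```python
import xml.etree.ElementTree as ET

SNP_XML = """
<snp>
  <pos>591</pos>
  <amb>R</amb>
  <sbe>
    <oligo>OL04-231</oligo>
    <specific>GAATACCAGCTACT</specific>
    <tail>TTTTTTTTTTTTTTTTTTTTTTTTTTTT</tail>
  </sbe>
  <remark>this is a remark about the SNP</remark>
</snp>
""".strip()

snp = ET.fromstring(SNP_XML)

# Direct children are addressed by tag name, nested ones by a path.
position = int(snp.findtext("pos"))
ambiguity = snp.findtext("amb")
sbe_oligo = snp.findtext("sbe/oligo")
sbe_primer = snp.findtext("sbe/tail") + snp.findtext("sbe/specific")

print(position, ambiguity, sbe_oligo)
```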

Some guidelines for good practice:

Every element (opening tag + content + closing tag) is put on one line, except when it contains subelements. In the latter case, the opening and closing tags are put on separate lines (see example above: SNP).
Subelements are indented two spaces compared to the parent element.
There should be no empty elements: an opening tag should not be followed immediately by a closing tag.

The first line of an XML file always states the XML version:

<?xml version="1.0"?>

An XML file is called well-formed when every opening tag has a matching closing tag and nested elements are closed from the inside out, i.e. in the reverse order in which they were opened. For example: <tag1><tag2>text</tag1></tag2> is not well-formed, while <tag1><tag2>text</tag2></tag1> is.
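Well-formedness can be tested by simply trying to parse the text. A minimal sketch with Python's standard library (the helper name is_well_formed is hypothetical):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<tag1><tag2>text</tag1></tag2>"))  # False
print(is_well_formed("<tag1><tag2>text</tag2></tag1>"))  # True
```

Note that this only checks well-formedness; it says nothing about whether the file follows the MIPE rules.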

MIPE

To be MIPE compliant, a well-formed XML file has to adhere to a set of rules as specified in the XSD file. Such an XML file is not only well-formed, but also valid. The XSD file sets rules like:

A SNP has no more than 1 ambiguity code.
A PCR primer can have 0 or more remarks.
...
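To illustrate what such a cardinality rule means, the first rule can be checked by hand as sketched below. This is only an illustration in Python with a hypothetical helper name; it is not a substitute for validating against mipe.xsd, which covers all rules at once.

```python
import xml.etree.ElementTree as ET


def snps_with_too_many_amb(root):
    """Return every <snp> element that has more than one <amb> child."""
    return [snp for snp in root.iter("snp") if len(snp.findall("amb")) > 1]


# Made-up fragment: the second SNP breaks the rule.
fragment = ET.fromstring(
    "<use>"
    "<snp id='a'><amb>R</amb></snp>"
    "<snp id='b'><amb>R</amb><amb>Y</amb></snp>"
    "</use>"
)
print([snp.get("id") for snp in snps_with_too_many_amb(fragment)])
```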

For MIPE, the XSD file is called mipe.xsd. The path to the corresponding XSD file is set in the second line of the XML file itself, underneath the line with the XML version (see above). So a MIPE file should start with the following two lines (although the path to the XSD file should be set appropriately):

<?xml version="1.0"?>
<mipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="http://mipe.sourceforge.net/mipe.xsd">
...
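Which XSD file a given MIPE file points to can be read from the root element. A minimal sketch in Python's standard library, assuming the header shown above; the function name schema_location is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the xsi: prefix in the MIPE header.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

EXAMPLE = (
    '<?xml version="1.0"?>\n'
    '<mipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\n'
    ' xsi:noNamespaceSchemaLocation="http://mipe.sourceforge.net/mipe.xsd">\n'
    '  <version>1.0</version>\n'
    '</mipe>\n'
)


def schema_location(xml_text):
    """Return the xsi:noNamespaceSchemaLocation of the root, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attributes as {namespace-uri}name.
    return root.get("{%s}noNamespaceSchemaLocation" % XSI)


print(schema_location(EXAMPLE))
```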

The order of elements in a MIPE compliant file has to be the same as specified in the XSD file. A description of all elements is presented here. Sorry, that's somewhat broken for the moment. Trying to export to HTML from Excel gives a great result...

According to the XSD file for MIPE, the outermost element - and there is only one - always is a <mipe> element (see Box 1).

Important: An XML file that doesn't comply with the rules in the XSD (i.e. that is not valid) is not a MIPE file. The Linux command xsdvalid your_file.mipe checks whether the XML file complies with the corresponding XSD file. If something is wrong (most probably some element is missing or in the wrong place), this program reports the line number of the error.

An example MIPE compliant (or valid) file is shown in Box 1. The extreme minimal (and not really informative) MIPE file according to the XSD is represented in Box 2. A template file is available (i.e. template.mipe; be sure to change the second line to match the location of the mipe.xsd file).

Box 1: An example MIPE compliant file.

<?xml version="1.0"?>
<mipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="http://mipe.sourceforge.net/mipe.xsd">
  <version>1.0</version>
  <pcr id="PCR1">
    <id>PCR1</id>
    <modified>20040426</modified>
    <modified>20040428</modified>
    <researcher>Jan Aerts</researcher>
    <species>chicken</species>
    <design>
      <source>
        <file>CYP2D6.fas</file>
      </source>
      <range>125-642</range>
      <seq>ACCTACTACTACAAACTACAACAAAATTCACATCAAAACATACACCATACCTACTACTAT...</seq>
      <primer1>
        <oligo>OL04-242</oligo>
      </primer1>
      <primer2>
        <oligo>OL04-243</oligo>
      </primer2>
    </design>
    <use>
      <seq>CACCATCACAGCTCACTATCGCCTGCGGGATCTCTCATTTACACAATTCGAGCTCACATCTATCATATCTAA...</seq>
      <revcomp>1</revcomp>
      <snp id="SCW0006">
        <id>SCW0006</id>
        <pos>45</pos>
        <amb>R</amb>
        <rank>3</rank>
      </snp>
    </use>
  </pcr>
</mipe>
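As an illustration of how data can be pulled out of such a record, the sketch below lists the primers and SNPs per PCR product. It is written in Python with the standard library only and is not one of the MIPE scripts; the function name summarize is hypothetical, and the record is a trimmed copy of Box 1 with the header attributes, sequences and some other fields omitted.

```python
import xml.etree.ElementTree as ET

BOX1_TRIMMED = """<mipe>
  <version>1.0</version>
  <pcr id="PCR1">
    <id>PCR1</id>
    <species>chicken</species>
    <design>
      <range>125-642</range>
      <primer1>
        <oligo>OL04-242</oligo>
      </primer1>
      <primer2>
        <oligo>OL04-243</oligo>
      </primer2>
    </design>
    <use>
      <revcomp>1</revcomp>
      <snp id="SCW0006">
        <id>SCW0006</id>
        <pos>45</pos>
        <amb>R</amb>
        <rank>3</rank>
      </snp>
    </use>
  </pcr>
</mipe>"""


def summarize(xml_text):
    """Return one dict per <pcr> element with its primers and SNPs."""
    root = ET.fromstring(xml_text)
    records = []
    for pcr in root.iter("pcr"):
        records.append({
            "pcr": pcr.get("id"),
            "primers": [pcr.findtext("design/primer1/oligo"),
                        pcr.findtext("design/primer2/oligo")],
            # Each SNP as (id, position on the use sequence, ambiguity code).
            "snps": [(snp.get("id"), int(snp.findtext("pos")), snp.findtext("amb"))
                     for snp in pcr.iterfind("use/snp")],
        })
    return records


for record in summarize(BOX1_TRIMMED):
    print(record)
```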

Box 2: A minimal MIPE compliant file.

<?xml version="1.0"?>
<mipe xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="http://mipe.sourceforge.net/mipe.xsd">
  <version>1.0</version>
</mipe>

Usage

To use the MIPE format, you don't have to install anything, as it is a format and not a program. You save the mipe.xsd file in a convenient place, get your data into a flat text file (either by hand or by writing a script that queries a database), and check that the file complies with the rules set out in mipe.xsd. To do this on a Linux/Unix machine, type xsdvalid your_filename. Make sure to change the second line of your file to reflect the location where your mipe.xsd file is saved.

Accompanying scripts

Getting things out of a MIPE file

mipe2pcroverview.pl: Prints high level data on each or selected PCR product.
mipe2pcrprimers.pl: Prints data on fw and rev PCR primers for PCR products.
mipe2snps.pl: Prints data on each SNP for PCR products.
mipe2sbeprimers.pl: Prints data on SBE primers for each SNP in each PCR product.
mipe2putativesbeprimers.pl: For each SNP, prints flanking regions for SBE primer design.
mipe2genotypes.pl: Prints genotypes for all samples for all SNPs.
mipe2html.pl: Pretty-prints MIPE file in HTML format, to be opened in a web browser (see example).

Changing the contents of a MIPE file (not thoroughly tested yet)

snp2mipe.pl: Add SNP data to existing MIPE file.
sbe2mipe.pl: Add SBE data to existing MIPE file.
snpPosOnDesign.pl: Calculates SNP position on DESIGN sequence, based on position on USE sequence.
snpPosOnSource.pl: Calculates SNP position on SOURCE sequence, based on position on DESIGN sequence.
removePcrFromMipe.pl: Remove a PCR product from a MIPE file.
removeSnpFromMipe.pl: Remove a SNP from a MIPE file.
removeSbeFromMipe.pl: Remove an SBE from a MIPE file.

Example usage: suppose you have SNPs in a MIPE file with (Polyphred) ranks from 1 to 6, and want to keep only the ones with a rank < 4:

mipe2snps.pl your_mipe_file.mipe > snp_list.csv
Edit file to contain only the PCR product IDs and SNP IDs of the SNPs with rank >=4.
removeSnpFromMipe.pl your_mipe_file.mipe < snp_list.csv
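The same filtering can also be sketched directly on the XML tree. The code below is an alternative illustration in Python, not a description of how removeSnpFromMipe.pl works; the function name is hypothetical, and it assumes that each snp element is a direct child of a use element and carries a rank subelement, as in Box 1.

```python
import xml.etree.ElementTree as ET


def drop_snps_with_rank_at_least(root, threshold):
    """Remove, in place, every <snp> whose <rank> is >= threshold.

    Returns the ids of the removed SNPs. SNPs without a <rank> are kept.
    """
    removed = []
    # Copy both sequences first so removal does not disturb iteration.
    for use in list(root.iter("use")):
        for snp in list(use.findall("snp")):
            rank = snp.findtext("rank")
            if rank is not None and int(rank) >= threshold:
                use.remove(snp)
                removed.append(snp.get("id"))
    return removed


# Made-up record with three SNPs of rank 3, 5 and 4.
example = ET.fromstring(
    "<mipe><pcr id='PCR1'><use>"
    "<snp id='s1'><rank>3</rank></snp>"
    "<snp id='s2'><rank>5</rank></snp>"
    "<snp id='s3'><rank>4</rank></snp>"
    "</use></pcr></mipe>"
)
print(drop_snps_with_rank_at_least(example, 4))
print([snp.get("id") for snp in example.iter("snp")])
```

The modified tree would still need to be written back to a file and validated against mipe.xsd.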

Wishlist

More developers so that this format can really reflect the needs of the community.
Is it possible to set the 'xsi:noNamespaceSchemaLocation' such that it points to a file on SourceForge, instead of having to copy the mipe.xsd file locally? => EASY: use 'xsi:noNamespaceSchemaLocation="http://mipe.sourceforge.net/mipe.xsd"'
Is there a way to check the sanity of the data without using the CheckSanity.pl script? Is it possible to define in the XML Schema that, for example, the snp_id in a genotype element has to be present as a snp element? Or that the PCR primers can be found in the source and design sequences?

Contact

Jan Aerts, Roslin Institute
jan$DOT$aerts$AT$bbsrc$DOT$ac$DOT$uk

Last modified: July 20, 2005
