Hack-a-thon session #4 for Test Cases #9 and #12

19 - 21 September 2013 in Amsterdam, the Netherlands

Test Case #9

Test Case #12

In September we will organise the fourth hack-a-thon. This hack-a-thon will take place in Amsterdam at SARA.

Go to Google Maps. Find Watergraafsmeer in Amsterdam and, within Watergraafsmeer, the Science Park. In the top right corner of the triangular half of the Science Park you will find a boomerang-shaped building. That's it.

SARA and the eScience Centre share the building. Google Maps places SARA at the location where we will get a room for the hack-a-thon. If you search for eScience Centre Amsterdam, you will find the same building, which houses both the eScience Centre (which will help us too) and SARA (which will officially host us).

Hotel CASA 400, Eerste Ringdijkstraat 4, 1097 BC Amsterdam

TC Leaders: Oren Tzfadia and Erik Alexandersson

Monika Brandt
Agnieszka Danek (Silesian University of Technology, Gliwice, Poland)
Estelle Wera (SLU)
Itziar Frades (SLU)
Didi Amar (Weizmann Institute)
Tatyana Goldberg (Technische Universitaet Muenchen)
Erik Alexandersson (SLU)
Oren Tzfadia (Weizmann Institute)
Sanjeev Kumar Sharma (James Hutton Institute)
Gregoire Rossier (SIB)

The hack-a-thon will start on Thursday 19 September at 12:00 at SARA.

We will end on Saturday with lunch (at around 13:00).

Provisional program, schedule is flexible

Wednesday 18 September
Arrival of first participants

Thursday 19 September
morning: Arrival of participants
12:00: Lunch at SARA
12:45: Welcome and AllBio project overview (Greg)
13:00 - 18:00 Hackathon part 1
13:00: Introductory slides (Oren)
13:30: Overview of the test case (Erik)
13:45: Overview of the data sets, pre-processing and validation scheme (Didi, Oren)
14:20: Coffee break
14:35: Organize all data sets on computers and the SARA server. 1st round of pre-processing runs. In parallel, preparation of the ‘gold standard’ discussion (Sanjeev, Erik)
16:15: Round table discussion on Potato metabolic genes as ‘gold standard’ for the annotation validation schemes (Sanjeev, Erik)
16:45: 1st round of validation scheme script runs
17:45: Return to CASA hotel
18:15: Day 1 summary, write a wrap-up mini report (at the hotel)
20:00: Dinner at CASA hotel

Friday 20 September
<8:30: Breakfast at CASA hotel
8:30: Taxi to SARA
9:00 - 12:45: Hackathon part 2
9:00: Refine scripts after analysis of 1st round results. Re-running the validation score. Define ‘stable’ scores and scoring schemes
10:45: Coffee break
11:00: Round table discussion – preparing for writing the summary report and delineating bullet points for a manuscript road map
12:45: Lunch at SARA
13:30 - 18:00: Hackathon part 3
13:30: Split into task sub-groups and run the specific computational/biological analyses needed
16:00: Video conference (Skype) with Kate Dreher from the USA. Coffee provided
17:00: Continue running computational tasks and summarizing results
18:00: Taxi to CASA hotel
19:30 - 20:30 Free slot if needed
20:45: Walk for dinner at Restaurant Merkelbach (Middenweg 72)

Saturday 21 September
<8:45: Breakfast at CASA hotel
8:45: Check out
9:00 - 12:00: Hackathon part 4 (hotel meeting room)
9:00: Recap video conference with Kate
9:45: Hackathon summary and perspectives
12:00: Lunch at CASA hotel
13:00: End of the event


For non-model organisms, genes predicted in the sequenced genome are relatively poorly functionally annotated. Instead, researchers have to rely on information derived from sequence identity to model organisms.

We aim to gather information from several completely sequenced plant genomes and to combine/modify existing tools and pipelines for efficiently analyzing large-scale ‘-omics’ data. In doing so, we seek to generate a robust and automated framework to assign genes to functional categories and to classify them in a biological context, such as biological pathways.

Another challenge is to compare, link and annotate transcript sequences derived from RNA-seq of not yet sequenced genomes with already sequenced genomes. The potato genome was sequenced last year, and the time is ripe for genome comparisons, function assignment of genes, and transcriptome and proteomics analysis. During the sequencing project, the potato genome consortium ran into several problems due to sequence heterogeneity, and eventually the genome assembly could only be completed successfully using a homozygous doubled-monoploid potato clone (S. tuberosum group Phureja). The genome structure of this clone differs greatly from that of the cultivars that are commonly studied, i.e. crop potato cultivars grown for food or as starch for industrial use.

Currently, OrthoMCL clusters are used together with BLAST for gene family analysis. Visualisation of ‘-omics’ data has been done in the commercial software QluCore, but it does not handle multiple data types simultaneously well and does not visualise functional pathways. Gene predictions were done ab initio with parameters trained for A. thaliana and also based on sequence similarity with four other plant genomes. Functional annotation of predicted genes was done by identifying orthologous and paralogous gene families in 12 sequenced plant species with OrthoMCL.

Available data: RNA-seq data (GSII Illumina paired-end reads) exist for 3 different potato cultivars and 3 wild potato species. In addition, gene expression data (Agilent microarray based on version 3.4 of the potato genome) and quantitative secretome proteomics data from various states exist (all samples are from leaves).

Rough Sketch of the Plan before hack-a-thon:

  • All hack-a-thoners share and review the project “blueprint”.
  • Data collection.
  • Building virtual “warehouse” of relevant existing tools.
  • Comprehensive related literature survey.
  • Data pre-processing and protocols set up.
  • Simulations of “dry” (practice) runs for optimizing “wet” (real) runs to be performed at the time of the hack-a-thon meeting in Amsterdam.
  • Define computational needs (computer clusters, memory usage, and required package installations).

During the hack-a-thon, we would like to test and evaluate ~5-6 annotation pipelines:

  • Trinotate
  • Blast2GO
  • PotatoCyc
  • OrthoMCL / InParanoid
  • Phytozome
  • KEGG

Two things are CRUCIAL for us to complete BEFORE going to Amsterdam:

  1) Collect the outputs from each pipeline.
  2) Parse the output of each pipeline into a tab-delimited file with the following columns:
     GeneID (Potato ID)
     Term (GO term ID or EC ID)
     Score (p-value/e-value, if available)
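As a starting point for the parsing step, here is a minimal Python sketch of a writer for the common tab-delimited format. It is only an illustration under stated assumptions: the function names (normalise_term, write_common_format), the header row, the "EC:" prefix and the identifier patterns are all hypothetical choices, not an agreed convention. Each pipeline would still need its own small reader that turns its native output into (gene ID, term, score) tuples.

```python
import csv
import re
import sys

# Assumed identifier shapes (not an agreed convention): GO terms as "GO:" plus
# 7 digits, EC numbers as four dot-separated fields where a field may be "-".
GO_RE = re.compile(r"^GO:\d{7}$")
EC_RE = re.compile(r"^(?:EC[: ]?)?(\d+(?:\.(?:\d+|-)){3})$", re.IGNORECASE)


def normalise_term(term):
    """Return a canonical GO or EC identifier, or None if unrecognised."""
    term = term.strip()
    if GO_RE.match(term):
        return term
    match = EC_RE.match(term)
    if match:
        return "EC:" + match.group(1)
    return None


def write_common_format(records, handle):
    """Write (gene_id, term, score) records as GeneID<TAB>Term<TAB>Score.

    score may be None when a pipeline reports no p-value/e-value; it is then
    written as an empty field. Returns the number of data rows written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["GeneID", "Term", "Score"])
    written = 0
    for gene_id, term, score in records:
        canonical = normalise_term(term)
        if canonical is None:
            continue  # skip terms that are neither GO nor EC identifiers
        writer.writerow([gene_id.strip(), canonical, "" if score is None else score])
        written += 1
    return written


if __name__ == "__main__":
    # Placeholder records; the gene IDs are made up for illustration only.
    demo = [
        ("geneA", "GO:0008150", 0.001),
        ("geneB", "EC 1.1.1.1", None),
    ]
    write_common_format(demo, sys.stdout)
```

Keeping the writer separate from the per-pipeline readers means that all six outputs end up in exactly the same layout, which should make the validation scripts easier to share.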

public/hack4.txt · Last modified: 2013/09/18 15:19 by greg