Software

This page contains links to the source code for our publications. Note that code without a public repository is generally no longer maintained. We will, however, help with any questions and issues as best we can.

Contact: larsab@cs.uit.no

Open source projects

    We have developed the META-pipe pipeline for marine metagenomics data analysis. We plan to open source most of these repositories. Version 2.0 consists of several backend systems, servers, and services:

    1. Marine Metagenomics Portal. Marine reference databases and more.
    2. Galaxy pipeline provided as part of the NeLS infrastructure. It is intended for Norwegian users, so a FEIDE account is needed to log in.
    3. Spark-based execution manager and Go components. Closed source.
    4. META-pipe job manager. Closed source.
    5. META-pipe authentication service that is integrated with Elixir AAI.
    6. META-pipe web application. Closed source.
    7. Object storage server. Closed source.
    8. Tool to set up the META-pipe backend on the OpenStack cPouta cloud.
    9. Tool to set up the META-pipe backend on OCCI-enabled endpoints.
    10. Scripts to set up the META-pipe backend on AWS EMR.
    11. META-pipe deployment scripts. These will not be open sourced.
    12. Marine Metagenomics Portal code.
    13. Marine reference databases web app. Closed source.
    14. Galaxy-Pulsar integration on the Stallo Supercomputer. This is specific to the Stallo machine and will not be open sourced.
    15. Auto scaling framework, simulator and runtime.

    Source code for META-pipe 1.0 is in the following repositories. Note that these are not maintained anymore.

    1. META-pipe 1.0. Implemented for execution on HPC clusters.
    2. Patches for META-pipe specific metarep (1.4.0) sequence retrieval modifications.

    These repositories are from research projects that use data, infrastructure, or problems from the META-pipe project:

    1. GeStore. This is a system for enabling transparent incremental updates for metagenomic pipelines. Several publications describe the system in detail.
    2. nsroot. Minimalist process isolation tool implemented with Linux namespaces. Described in this short paper.
    3. COMBUST I/O. Abstractions facilitating parallel execution of programs implementing common I/O patterns in a pipelined fashion as workflows in Spark. A detailed description is found in Jarl Fagerli's master thesis.
    4. Mario is a system for interactive data analysis. It is built on top of the HBase storage system and provides data processing using commonly used bioinformatics applications, interactive tuning, automatic parallelization, and data provenance support. The README file in the source code provides installation instructions, and the Master thesis of Ernstsen gives a detailed description of the system.
    5. This benchmark was used to evaluate the performance of HBase using data and access patterns found in typical biological data processing tools. The dataset size is tunable. The README file in the source code provides installation instructions, and the Master thesis of Ernstsen gives a detailed description of the system.

    We have developed a system for data management and standardized preprocessing of the data in the NOWAC study.

    1. Nowac R package. Has information about the available datasets and analyses you can run on them. Closed source.
    2. Pippeline. Standardized and interactive pipeline for NOWAC data preprocessing. Closed source.
    3. nowaclean. R package implementing the methods of the standard operating procedure for cleaning microarray data in the Norwegian Women and Cancer postgenome study. Described in the bioRxiv 144519 paper.
    4. geneset. R package of data sets and functions that facilitate gene set analysis.
    5. seq. A collection of Docker containers with different bioinformatics tools, such as GATK, bwa, and Picard, installed.

    We have developed systems for data management, analysis, and exploration in the NOWAC project. These can also be used for other datasets.

    1. Kvik. A framework for developing interactive data exploration applications in genomics and systems biology. The Master thesis of Fjukstad, Fjukstad et al. 2015, and Fjukstad et al. 2017 describe the system.
    2. walrus. A system for running data analysis pipelines using Docker containers. Described in Reproducible Data Analysis Pipelines for Precision Medicine.
    3. Freia. Biological Path Visualization using Unity3D to visualize gene expression data integrated with pathway images. The Master thesis of Kenneth Knudsen has a detailed description of the tool.
    4. KEGGviewer. Simple Python Flask web viewer for KEGG images.

    In addition, we have developed many different data analyses for the NOWAC data:

    1. Mixt. Matched Interaction Across Tissues (MIxT) is a web application for exploring and comparing transcriptional profiles from two or more matched tissues across individuals. Online at mixt-blood-tumor.bci.mcgill.ca

    The air:bit project repositories are:

    1. Luft. Web application for visualizing air quality in Tromsø with data from The Norwegian Institute for Air Research (NILU) and Kongsbakken VGS. Online at luft.cs.uit.no.
    2. air:bit backend platform. The backend is deployed on Google Cloud Platform.
    3. Air quality sensor and web server. An Arduino-based portable air quality sensor kit and a Ruby on Rails web application deployed on Heroku.

    Source code for our research projects (random order):

    1. Histology learning tool for use in a browser with a Python backend.
    2. validator. An R package for running repeated k-fold cross-validation.
    3. So you want to use R on stallo. A brief guide to launching long-running embarrassingly parallel R jobs on the UiT supercomputer Stallo.
    4. Supporting data and code to "Empirical bayes shrinkage estimation of crime rate statistics".
    5. krongen. Creates Kronecker graphs that simulate networks with power law edge distributions.
    6. Handwritten digit recognition.
    7. Code for Replication study: Development and validation of deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs.
    8. DICOM anonymizer.
    9. Mr. Clean is a tool for combining different visualization tools, interaction devices, and display middleware for visual comparisons on high-resolution displays.
    10. M.O.R.T.A.L. is a programming language for domain specific high performance computing.
    11. Spell expression data processing pipeline. This is a data cleaning pipeline for microarray data.
    12. Troilkatt system. This is a system for scalable batch processing of biological data. Troilkatt is built on the Hadoop stack.
    13. BSV system. This is a system for scalable visualizations on multi-core and multi-display platforms. It provides programmatic control for visualizations implemented using Python visualization libraries.
    14. UiT GitHub course guide and template. The unofficial guide for using GitHub for UiT courses.
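
    The validator package listed above runs repeated k-fold cross-validation. As a rough illustration of that technique only, here is a minimal Python sketch using the standard library. It does not reproduce the package's R interface: the function names (repeated_kfold_indices, cross_validate, fit_mean, predict_mean), the mean-predictor model, and the toy data are all hypothetical and chosen for the example.

```python
import random
from statistics import mean


def repeated_kfold_indices(n_samples, k, repeats, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for f in range(k):
            test_idx = indices[f::k]  # every k-th shuffled index forms fold f
            test_set = set(test_idx)
            train_idx = [i for i in indices if i not in test_set]
            yield r, f, train_idx, test_idx


def cross_validate(xs, ys, fit, predict, k=5, repeats=3, seed=0):
    """Return the mean squared error averaged over all repeats and folds."""
    fold_errors = []
    for _, _, train_idx, test_idx in repeated_kfold_indices(
        len(xs), k, repeats, seed
    ):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return mean(ys)


def predict_mean(model, x):
    return model


if __name__ == "__main__":
    xs = list(range(20))
    ys = [2.0 * x + 1.0 for x in xs]
    print(cross_validate(xs, ys, fit_mean, predict_mean, k=5, repeats=3))
```

    Reshuffling before each repeat gives a different partition into folds every time, so averaging over repeats reduces the dependence of the estimate on any single random split.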