
Technical Needs of the Community

We can describe the technical needs of the community in terms of:

  • hardware - the infrastructure required by community members to perform computational research
  • scientific software - the programs and libraries used to perform analyses
  • supporting software - platforms, languages, and tools that facilitate computational science

In addition to these technical needs, community members also require access to training to use these tools effectively. See this page for more information on this topic.

Computing

Perhaps the most important resources for the Bio-IT community are those that enable scientists at EMBL to perform their research. From a computational perspective, these resources include: individual and shared compute infrastructure, i.e. compute and storage resources and the network infrastructure that enables information to be transferred to, from, and between them; research software; and research data, stored both locally and in remote catalogues/repositories.

Compute Needs

Many EMBL researchers find that the requirements of their analyses far outstrip the capacity and capability of their personal computers. As the volume and complexity of data produced in their experiments increase, the resources available to process, transfer, and store the data must scale accordingly to avoid delays and/or limitations to research.

Servers

Most research groups with a significant computational component to their work access additional compute capacity via their own high-powered machines, often referred to as group servers. These may be physical machines or virtual machines rented to the group by EMBL IT Services. Such servers are administered within the group and often connected to the rest of EMBL’s internal network, allowing access to networked storage such as “group shares” (see Storage Needs section below).

In addition to servers belonging to individual groups, some compute infrastructure is also provided by unit-specific computational support groups, such as Genome Biology Computational Support.

All such servers at EMBL run some variant of a Linux operating system, with the overwhelming majority running CentOS 7.

Cluster

The compute cluster also provides an environment for shared use amongst all EMBL researchers. The cluster includes many individual computers, referred to as nodes, with various hardware configurations. Although most nodes are configured for high-throughput data processing, the cluster includes some nodes with graphics processing units (GPUs), and some with large amounts of RAM for jobs with especially large memory requirements. Nodes are networked together alongside shared storage volumes, enabling fast transfer of information inside the cluster. As the cluster is a shared resource, with compute tasks run by multiple users simultaneously, somewhat unpredictably, and often in large numbers, the allocation of resources to these tasks is managed by a job scheduler, Slurm. As well as its own storage volume, every node is connected to a large, central storage volume, to allow data to be staged and shared between nodes.
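
To illustrate how a task's resource needs are communicated to the scheduler, the following is a minimal sketch that renders a Slurm batch script as text. The function name, default values, and job name are hypothetical and chosen only for illustration; the `#SBATCH` options shown are standard Slurm options, but the limits and GPU resource names available on the EMBL cluster are not described here and should be checked locally.

```python
def render_sbatch_script(job_name, command, cpus=1, mem_gb=4,
                         time_limit="01:00:00", gpus=0):
    """Return the text of a Slurm batch script for a single task.

    All defaults are illustrative assumptions, not EMBL cluster settings.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus > 0:
        # Generic resource request; the GPU resource name is site-specific.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines.append("")       # blank line between directives and the command
    lines.append(command)  # the actual work to run on the allocated node
    return "\n".join(lines) + "\n"
```

The returned text could be written to a file and submitted with `sbatch`; requesting only the CPUs, memory, and time a task needs helps the scheduler share nodes fairly between users.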

Cloud

TODO

Storage Needs

Everyone with an EMBL user account, administered by IT Services, has their own home directory. These home directories are accessible across the whole internal network but are subject to a strict size quota.

This quota is too limiting for any researcher working with modern methods in computational biology, so EMBL groups use networked volumes to store their research data. These volumes are usually provided by IT Services, though some groups use similar volumes provided by their unit-specific computational support staff. Storage provided by IT Services is made available in two different configurations.

These volumes are connected to the scratch volume, allowing users to “stage” their data ready for analysis in EMBL’s cluster environment (see above).
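
As a rough illustration of what "staging" involves, the sketch below copies an input directory into a per-job folder on a scratch volume before analysis. The function name, directory layout, and paths are hypothetical; the real mount points of group shares and the scratch volume are not given here, so the demonstration uses temporary directories as stand-ins.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(source_dir, scratch_root, job_name):
    """Copy source_dir to scratch_root/job_name/<source name>; return the new path."""
    source = Path(source_dir)
    target = Path(scratch_root) / job_name / source.name
    # dirs_exist_ok (Python 3.8+) lets a re-run refresh a partially staged copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Temporary directories stand in for a group share and the scratch volume.
    with tempfile.TemporaryDirectory() as tmp:
        share = Path(tmp) / "group_share" / "run1"
        share.mkdir(parents=True)
        (share / "reads.txt").write_text("ACGT\n")
        staged = stage_to_scratch(share, Path(tmp) / "scratch", "job42")
        print(sorted(p.name for p in staged.iterdir()))
```

Keeping each job's staged data in its own folder makes it easier to clean up the scratch volume once the analysis has finished.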

The final stage of the data lifecycle - long-term storage - is taken care of by an archiving service from IT Services.

For more details, see the relevant page on the EMBL Intranet.

Software Needs

Bioinformatics software can be difficult to install. To avoid the need for all researchers to handle installation of all tools and dependencies themselves, they have access to a central software repository. Software is provided as modules via the Lmod system, which requires users to load a tool before they can use it. The library of modules is built from a collection of EasyBuild descriptions, maintained by members of IT Services and volunteers from the Bio-IT community, and is deployed via the Continuous Integration system associated with the EMBL GitLab.
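
Because a module must be loaded before its tool is available, scripts that launch tools typically prefix the command with a `module load` step. The following is a minimal sketch of composing such a shell line from Python; the helper name is hypothetical, and the module name used in the usage note is an assumption, since the modules actually available (and their exact names and versions) depend on the local installation.

```python
import shlex


def with_modules(modules, command):
    """Build a shell line that loads Lmod modules and then runs a command.

    `modules` is a list of module names; `command` is a list of arguments.
    """
    load = " ".join(shlex.quote(name) for name in modules)
    run = " ".join(shlex.quote(arg) for arg in command)
    # "&&" ensures the tool only runs if the modules loaded successfully.
    return f"module load {load} && {run}"
```

For example, `with_modules(["SAMtools"], ["samtools", "--version"])` yields a line that could be placed in a batch script or typed into a shell where the `module` command is available.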

Containers offer another way to reduce the difficulty of installing and running software. They can be run with Singularity on the EMBL cluster, allowing users to describe the complete environment needed for a tool to run.
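
As a sketch of what running a tool inside a container looks like in practice, the function below assembles a `singularity exec` command line as an argument list. The function name, image file name, and paths in the usage note are hypothetical; `exec` and `--bind` are standard Singularity options, but any site-specific configuration on the cluster is not covered here.

```python
def singularity_exec(image, command, binds=None):
    """Return an argv list that runs `command` inside a Singularity image.

    `binds` maps host paths to the paths they should appear at in the container.
    """
    argv = ["singularity", "exec"]
    for host_path, container_path in (binds or {}).items():
        # Make a host directory visible inside the container.
        argv += ["--bind", f"{host_path}:{container_path}"]
    argv.append(image)
    argv += list(command)
    return argv
```

A call such as `singularity_exec("tool.sif", ["tool", "--help"], binds={"/scratch/me": "/data"})` produces a list that could be passed to `subprocess.run(..., check=True)` on a machine where Singularity is installed.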

Many users find it helpful to maintain some personal control over their compute environment, supplementing the options given by the centrally provided EasyBuild module system. For example, users who need their own library of Python modules and versions often use conda to manage these.

Genome Biology Computational Support (GBCS) provides an RStudio Server instance, which gives an RStudio interface, linked to many packages/libraries and multiple R environments, connected to powerful compute resources.

Some users with sophisticated analysis pipelines and/or a need to perform many similar analyses at high throughput make use of dedicated software to manage these pipelines. Many such tools exist, but the three used most commonly at EMBL are Snakemake, Nextflow, and Galaxy.
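
One core job of any pipeline manager is to work out an order in which steps can run so that each step starts only after the steps it depends on. The toy sketch below shows that idea using Python's standard library (`graphlib`, Python 3.9+). It is not how Snakemake, Nextflow, or Galaxy are implemented, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return the steps in an order that respects their dependencies.

    `dependencies` maps each step name to the set of steps it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: each key lists the steps that must finish first.
pipeline = {
    "trim_reads": set(),
    "align": {"trim_reads"},
    "call_variants": {"align"},
    "qc_report": {"trim_reads", "align"},
}
```

Calling `execution_order(pipeline)` returns one valid ordering of the four steps. Dedicated workflow tools build on the same dependency information to add features such as re-running only outdated steps and submitting independent steps to the cluster in parallel.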

GBCS also provides a Galaxy instance, which allows users to manage and execute analysis workflows on EMBL’s compute infrastructure via a graphical interface. Both the RStudio Server and Galaxy instances are maintained by the GBCS team, removing a burden of administration from the user and providing a more accessible interface to the local infrastructure.

In late 2019, IT Services deployed jupyterhub.embl.de, providing a JupyterLab environment accessible to all EMBL users via their web browser.

Collaboration & Project Management

In addition to tools and resources for individual researchers, community members also need to collaborate with scientists inside and outside EMBL.

Sharing Data

EMBL IT Services provides access to OwnCloud, a file sharing system that allows EMBL members to upload and download files, share them with others via public or private link, and restrict access (set a password for access, make the folder view/upload only, etc.).

These OwnCloud folders can be accessed over the Internet, or mounted in the user’s file browser, syncing between the user’s local machine and other devices. (See instructions on the IT Services pages.)

Sharing Software

The Bio-IT community maintains two platforms with which members can share their software with others:

Managing Projects

The EMBL GitLab can be used to manage projects. It is particularly well-suited to managing software projects, but the Issue tracker, discussion threads, and Markdown editor make it useful for other projects more generally. However, it lacks many of the features (e.g. Gantt charts) provided by dedicated project management tools/platforms.

Some community members also use an internal OpenProjects instance, which provides much more advanced project management features. Read more on the IT Services pages.

Managing Data

Recent years have seen an increase in the number of tools available to help community members manage and track their research data. These tools include:

  • STOCKS: The Genome Biology Computational Support team maintains this service, providing an electronic lab notebook system with modules for managing lab collections, tracking orders and samples, and sharing protocols.
  • dm.embl.de: EMBL IT Services maintains the Data Management App, which can be used to track the location and metadata of research data throughout the lifecycle of a project.
  • eLABJournal: Some research groups at EMBL are using eLABJournal, another electronic lab notebook solution.

Communication

Several platforms exist for the community to communicate on technical topics:

  • chat.embl.org: a Mattermost workplace chat system with Markdown support. The system allows easy sharing of code snippets, error messages, and screenshots, making it well-suited to discussion of computational topics. To allow users to focus only on the discussions most relevant to them, dedicated channels exist for many common topics, including cluster, R, Python, image analysis, etc.
  • A community-wide mailing list exists for announcements relevant to all members.
  • Bio-IT also hosts occasional interest group meetings (e.g. cluster and/or cloud computing), and seminars for in-person discussion.