Introduction to FIFE and Component Services¶
- Table of contents
- Introduction to FIFE and Component Services
- What is FIFE?
- New GPGrid
- Open Science Grid Overview
- art framework
- Data Management Overview
- IF Data Handling Client Tools (ifdhc)
- File Transfer System
- Conditions database
- Electronic Logbooks
- Continuous Integration (CI)
- Useful links to other services from CD and SCD
Goal: provide an overview of offline computing models to get users to think about what an experiment needs.
Why worry about this now?
There are many challenges in offline computing at Fermilab, and many ways to accomplish a particular task. Some methods make analysis easier later; others make offline computing difficult. Some tools that work as a single instance fail, and can bring down a server, when run in an automated mode. Some of the problems experiments have encountered have stopped an experiment from functioning, or even stopped other experiments. This can result in an experiment not being allowed on shared resources. Thinking through the options and making informed choices will save time, effort, and frustration. Examples of such problems include:
- Overloading a database server with too many direct connections from O(1000) parallel jobs. Instead, use web access to databases, which relies on squid servers to cache information.
- Overloading BlueArc volumes: O(100) parallel access requests crashed the BlueArc server for both FermiGrid and interactive nodes, for all experiments. Instead, use the Intensity Frontier Data Handling Client (ifdhc), which has resource provisioning to make sure BlueArc does not become overloaded. It should be used in batch jobs for all data transfers to and from BlueArc volumes; for higher-throughput workloads, consider using dCache.
What is FIFE?¶
FabrIc for Frontier Experiments (FIFE) is a set of central tools and services that address common challenges in offline computing. FIFE draws on the collective experience of current and past experiments to provide options for designing an experiment's offline computing. It is not a mandate from the Scientific Computing Division about how an experiment should or should not design its offline computing. FIFE is modular, so experiments can take what they need, and new tools from outside communities can be incorporated as they develop.
Some of the areas covered include:
- job submission and workflow to dedicated and opportunistic resources and user-friendly monitoring of submitted jobs.
- data management and handling with co-scheduling of data and job services
- database and dataset applications such as beam monitoring, conditions, and hardware
- collaborative tools such as electronic control room logbook and shift scheduler
- collaborations with experiments to build integrated solutions
Picture from: Mike Kirby
FermiGrid is the Fermilab Campus Grid. It provides cyberinfrastructure services (VOMS, GUMS, Squid, Site Gatekeeper, etc.). Services are deployed in a Highly Available (HA) infrastructure with automatic failover (FermiGrid-HA). Multiple Grid resources are operated for a variety of Fermilab experiments.
Computing grids are a way to handle batch queues, to make computing resources available to the appropriately authorized users. Priority schemes can be established to ensure those who provide resources have special access to their own resources while allowing others to have "as available" access. A large collection of software tools implement the grid. These tools manage authentication, authorization, accounting, job submission, job scheduling, resource discovery, work-flow management, and so on.
One of the goals of FIFE is to facilitate an experiment's access to and utilization of grid resources, both at Fermilab and as part of the Open Science Grid (OSG). The ability to utilize opportunistic OSG resources greatly expands the total computing power available to an experiment and enables increased physics output. Access to OSG resources is not without effort, and the FIFE team will help guide experiments in understanding OSG operation and in the design and implementation of OSG-based computing models.
In a typical grid installation, the worker nodes are multi-core machines and it is normal for there to be one batch slot per core. For example, in an installation with 100 worker nodes, each of which has a dual quad-core processor, there would be 800 batch slots. This means that up to 800 grid jobs can run in parallel. Some grid installations have more than one batch slot per core, perhaps 9 or 10 slots on a dual quad-core machine. This makes sense if the expected job mix has long IO latencies.
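The slot arithmetic above can be checked with a few lines of shell. This is only an illustrative sketch: the function name batch_slots and its arguments are assumptions made for this example, not part of any grid tool.

```shell
#!/bin/sh
# Hypothetical helper: total batch slots for a homogeneous installation.
# Arguments: number of worker nodes, batch slots configured per node.
batch_slots() {
    nodes=$1
    slots_per_node=$2
    echo $(( nodes * slots_per_node ))
}

# 100 worker nodes with 8 cores each, one slot per core.
batch_slots 100 8

# The same nodes oversubscribed to 10 slots each, for IO-bound job mixes.
batch_slots 100 10
```

The first call corresponds to the 800-slot example in the text; the second shows how the count changes when a site configures more than one slot per core.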
Picture from: Kimberly Myles and Canzone Diana
When a job is submitted to the grid, it is submitted to a particular head node, which looks after queuing, resource matching, scheduling and so on. When a job has reached the front of the queue and is ready to run, the head node sends the job to a slot on a worker node.
FermiGrid is designed to serve compute-bound jobs, and is shared by various experiments at Fermilab. The typical job reads about a GByte of data, produces little output, and takes a few hours to run. Jobs that run for less than 15 minutes, or that are I/O limited, will not run efficiently. Jobs that run for more than a day may have trouble completing because of scheduled maintenance or preemption from opportunistic resources.
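As a rough planning aid, the runtime guidance above can be written as a small check. This is a sketch only: classify_runtime is a hypothetical helper, and the thresholds are simply the 15-minute and one-day figures quoted in the text, not limits enforced by FermiGrid.

```shell
#!/bin/sh
# Hypothetical helper: compare an expected job runtime (in minutes)
# against the rule-of-thumb limits quoted in the text.
classify_runtime() {
    minutes=$1
    if [ "$minutes" -lt 15 ]; then
        # Short jobs spend too large a fraction of their time on overhead.
        echo "too short"
    elif [ "$minutes" -gt 1440 ]; then
        # Jobs longer than a day risk maintenance windows or preemption.
        echo "too long"
    else
        echo "ok"
    fi
}

classify_runtime 180
```

A job estimated at a few hours falls in the "ok" range; very short jobs are usually better grouped into fewer, longer jobs.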
Open Science Grid Overview¶
The Open Science Grid (OSG) provides software and services to enable the opportunistic usage and sharing of resources. The OSG doesn't own or allocate resources. The owners (generally large Virtual Organizations (VO), universities, or national labs) fully control usage policies on their resources. The OSG is jointly funded by the Department of Energy and the National Science Foundation.
The OSG promotes science by enabling a framework of distributed computing and storage resources, a set of services and methods that enable better access to computing resources for researchers and communities, and principles and software that enable distributed high-throughput computing (DHTC) for users and communities at all scales.
An important concept in OSG is the Virtual Organization (VO): a collection of researchers who join together to accomplish their goals. They typically share the same mission, but that is not a requirement. A VO joins OSG to share its computing and storage resources with the other VOs, to access the resources provided by other communities in OSG, and to share data and resources with other peer and international computing grids. The resources owned by a VO are often geographically distributed; a set of co-located resources is referred to as a site, and thus a VO may own a number of sites. In some cases, VOs do not bring resources to OSG and are only users of available resources on OSG.
Many resource providers allow the use of resources beyond those owned by the VO. The most common policy for resource utilization outside your own VO is "opportunistic use". The interpretation of this varies greatly. Some sites allow access to all their "spare" resources to anybody registered with a VO on OSG, others to only a small subset of resources and/or VOs. Some sites interpret "spare" to mean "until a more privileged user arrives", others guarantee a minimum wall clock time for your job once it starts.
To access the Open Science Grid, users need to belong to a Virtual Organization and to have a digital certificate. Virtual Organizations can be found at: http://myosg.grid.iu.edu/vosummary?all_vos=on&active=on&active_value=1&datasource=summary
A digital certificate is like a passport for the grid. Specifics may depend on the VO you belong to.
The best way to get access to more sites (and not lose access to sites!) is to be a well-behaved grid user. Follow best practices. If you will be developing your own grid application (as opposed to using a prebuilt one from your VO), make sure you read the Using the Grid guide: https://twiki.opensciencegrid.org/bin/view/Documentation/UsingTheGrid
Here is a page containing a few known site issues when submitting jobs to Open Science Grid sites: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Information_about_job_submission_to_OSG_sites
Open Science Grid References¶
Jobsub¶
Jobsub (Job Submission) is a suite of tools to manage batch/grid submission. These tools are designed to simplify the job submission process by:
- Defining common interfaces for experiments
- Integrating complex grid tools in a sensible manner
- Protecting shared resources from overload
The newly architected Jobsub is modular, with components that can be easily replaced or upgraded. The Jobsub server accepts requests using a well-defined REST-like API. REST (Representational State Transfer) focuses on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements. Internally, the Jobsub server uses jobsub_tools to perform its tasks. In the future, all the relevant features in jobsub_tools will be migrated to the new Jobsub client and made available to users.
The Jobsub architecture allows servers to communicate with multiple client types:
- command line (supported)
- portlet/app clients
Jobsub is network centric with a thin client, meaning it depends heavily on some other computer (its server) to fulfill its computational roles. Jobsub is scalable and can be deployed in a High Availability mode.
Picture from: Parag Mhashilkar
When scientists move between experiments, the submission parts of their scripts won’t need to change.
The format is:
jobsub_submit (options) executable (executable options)
Some common options include:
- Open, read, close a SAM data set
- Transfer data to (and back from) worker node using ifdh method
- Run job on local batch system or fermigrid or OSG site or Computational Clouds
- Restrict number of jobs running at one time
- Use non-default grid credentials
- Create and/or submit tarball
Jobsub submits user jobs to an HTCondor scheduler. GlideinWMS can monitor this scheduler to provision computational resources on the Grid or Cloud. Together, Jobsub and GlideinWMS shield the user from the complexity of running workflows on Grids and Clouds by providing an interface that resembles an HTCondor batch system. HTCondor maintains per-user and per-experiment quotas on these farms and allocates provisioned resources to queued jobs based on priority.
Jobsub client-server has been in production since Sep. 24, 2014.
Important tools include:
- jobsub_submit, used to submit jobs to condor pools. Various input options steer jobs to the different pools and control the input and output of data files for the user application running on the worker nodes.
- User tools to help manage and monitor jobs in HTCondor Pool.
- Jobsub Project Wiki: https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki
- Jobsub Project Documents: https://cdcvs.fnal.gov/redmine/projects/jobsub/documents
- Jobsub Client-Server Architecture: https://sharepoint.fnal.gov/org/scd/fife/fife_weeklycollaborationmeetings/FIFE%20Weekly%20Working%20Meeting%20Oct%2017/Document%20Library/1/Jobsub-FIFEMeeting-17Oct2013.pdf
- Jobsub User Guide: https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki#Client-User-Guide
- Presentation on Jobsub Tools: https://indico.fnal.gov/getFile.py/access?subContId=0&contribId=3&resId=0&materialId=slides&confId=6895
- Presentation to Microboone about using Resilient dCache for tarball uploads: https://cdcvs.fnal.gov/redmine/documents/1251
Authentication determines the identity of a user or a user's program. Authorization determines what the user is allowed to do. Fermilab strives for a single sign-on model, so most users have a single username.
Strong authentication is a form of computer security in which the identities of networked users, clients and servers are verified without transmitting passwords over the network. All applications, other than those intended for the general public, must support appropriate levels of authentication and authorization. In particular, any systems allowing arbitrary program execution or data transfer require authentication consistent with the computing authentication policy at http://security.fnal.gov/Policies/AuthenticationPolicy.htm, currently either a Kerberos principal (account) for use of general lab computing resources, or a PKI certificate for use of grid computing resources.
Kerberos is the network authentication program that Fermilab uses to implement strong authentication. In addition to establishing identity (authentication), it supports encrypted network connections, thereby providing confidentiality.
- Users must understand how to authenticate through proper use of credentials before using lab computers.
- Users must not allow anyone else to know or use their Kerberos password.
- The Kerberos password is not to be used for anything other than Fermilab Kerberos.
- Do not transmit Kerberos passwords across the network.
- In the rare circumstances where transmitting a Kerberos password is necessary, it must be strongly encrypted.
- Never store Kerberos passwords (or the corresponding character strings) on a computer, encrypted or not.
Any remote login or general file transfer services in the General Computing Environment (GCE) that are visible from outside the Fermilab network, such as the Open Science Environment (OSE), must be configured so as to require Kerberos authentication (or an exemption must be requested). See http://security.fnal.gov/StrongAuth for more details. Configuration rules for Kerberos-protected systems must not be circumvented. Similar services in the Open Science Environment must be configured to require appropriate grid certificates.
Service Certificate Management¶
Certain types of production jobs (jobs that run under the experiment production account, e.g. dunepro or novapro) require a service certificate for submission. Experiments may choose to obtain and manage such certificates on their own, or they may choose to have SCD manage them. In the latter case, the service certificate and key files are stored securely on a centrally managed machine, and a script pushes proxies based on these certificates out to the other machines where the experiment needs them. Experimenters can then use these proxies for job submission. Please visit the page below for more details.
art is an event-processing framework developed and supported by the Fermilab Scientific Computing Division (SCD) to build physics programs by loading physics algorithms, provided as plug-in modules.
art comprises a suite of tools, libraries, and applications for processing detector events, including experimental data and simulated events. The major component is a C++ event-processing framework. The art framework coordinates the processing of events by user-supplied pluggable simulation, reconstruction, filtering, and analysis modules. art is not designed for use in real-time environments, such as the direct interface with data-collection hardware. It is designed to allow users to provide executable code in chunks called art modules that “plug into” a processing stream and operate on event data.
An event is one unit of data. It may represent an interaction, a beam crossing, or any period of collected data. Each experiment that uses the art framework can define the unit of data that corresponds to the event as it wishes. An event contains all of the information associated with the time interval, but the precise definition of the time interval changes from one experiment to another. In a triggered experiment, an event may be defined as all of the information associated with a single trigger; in an untriggered, spill-oriented experiment, an event may be defined as all of the information associated with a single spill of the beam from the accelerator.
art provides a way to share a single event-processing framework, and its associated tools, across many experiments. In particular, the design draws a clear boundary between the framework/infrastructure and the user code; the art framework is developed and maintained by software engineers, whereas user code for the experiments is developed by physicists whose focus is on the science. Experiments use art as a shared framework, having it appear in their code bases as an external product. This allows each experiment to build their physics software on top of a robust foundation. The goal is to make it easier to develop and maintain physics software, thereby improving the overall quality of the physics results.
Scientific Computing Division distributes art and the products on which it depends via Unix Product Support (UPS). The art documentation is under development and a lot of material already exists (see the references below). This includes the early chapters of a workbook that is designed to bring new users up to speed quickly. This document is built around a detailed sample project.
- The art home page: https://cdcvs.fnal.gov/redmine/projects/art
- Documentation on the art wiki: https://cdcvs.fnal.gov/redmine/projects/art-workbook/wiki/Existing_Documentation
- The home page of the art workbook: https://sharepoint.fnal.gov/project/ArtDoc-Pub/SitePages/Home.aspx
- An older art tutorial in PowerPoint format: http://oink.fnal.gov/new_tut/tutorial.htm#
The Open Science Grid (OSG) Application Software Installation Service (OASIS) is the recommended method to distribute software on the Open Science Grid. It is implemented using CernVM File System (CVMFS or CernVM-FS) technology.
The CernVM File System provides a scalable, reliable and low-maintenance software distribution service via outgoing Hypertext Transfer Protocol (HTTP) connections. It was developed to assist High Energy Physics (HEP) collaborations in deploying software on the worldwide-distributed computing infrastructure used to run data processing applications. It is implemented as a Portable Operating System Interface (POSIX) read-only Filesystem in User Space (FUSE) module. Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs. Internally, CernVM-FS uses content-addressable storage and Merkle trees to maintain file data and meta-data.
Picture from Dan Bradley
CernVM-FS stores meta-data (path names, file sizes, …) in file catalogs. When a client accesses a repository, CernVM-FS downloads the file catalog first, and then it downloads the files as they are opened. A single file catalog for an entire repository can quickly become large and impractical. At the same time, clients typically do not need all of the repository's meta-data at once. CernVM-FS therefore uses nested catalogs to partition the directory tree of a repository into many catalogs.
CernVM-FS can be used by small and large HEP collaborations. In many cases, it replaces package managers and shared software areas on cluster file systems as the means to distribute the software used to process experiment data.
However, CernVM-FS comes with deployment costs:
- running and monitoring a CVMFS repository, Stratum-0 and Stratum-1 server network.
- having remote sites add your repository domain to their configuration; minimal effort for sites you “own”, but very difficult across the 100+ sites of the Open Science Grid.
Open Science Grid Operations runs a CVMFS repository server and a Stratum-0 and Stratum-1 server network. OSG does the work to get the OASIS repositories deployed on all sites for all Virtual Organizations. A Virtual Organization (VO) is a set of groups or individuals defined by some common cyber-infrastructure need. This can be a scientific experiment, a university campus or a distributed research effort. A VO represents all its members and their common needs in a grid environment. A VO also includes the group's computing/storage resources and services.
Picture from S. Fuess
Software can be installed for distribution through OASIS in one of two places:
- In the repository server at the OSG Grid Operations Center (GOC), or
- In a repository server hosted at an OSG site, including Fermilab.
The second method is the one recommended for all experiments based at Fermilab. The process is different for experiments large enough to be their own VO registered with the Open Science Grid (even if they are a sub-group under the fermilab VO) and for those that are so small that they are instead considered to be only a project that is part of the fermilab VO.
OASIS/CVMFS process for VOs that have Fermilab as a host institution¶
To distribute software to OSG using a CVMFS repository server hosted at Fermilab, installers must:
- Have a Kerberos ID at Fermilab.
- Be associated with a Virtual Organization that is registered in the OSG Information Management (OIM) system.
- Submit a Service Now ticket. The details on how exactly to request that are in the Integrating Experiments into FIFE documentation. For reference, the steps that the Fermilab repository service administrator goes through to enable it are at https://opensciencegrid.org/docs/data/external-oasis-repos/.
- ssh to the repository server using the account name and machine name that came from the SNOW ticket.
- The files for the repository are found in /cvmfs/reponame.opensciencegrid.org. They are read-only until a transaction is started.
- Run cvmfs_server transaction reponame.opensciencegrid.org to initiate a transaction. This makes the repository writable. If a transaction is already under way, the command will inform you.
- Update files in /cvmfs/reponame.opensciencegrid.org. If you make a mistake, all changes can be discarded and the transaction aborted by running cvmfs_server abort reponame.opensciencegrid.org.
- Run cvmfs_server publish reponame.opensciencegrid.org. This processes all the changes, makes them part of the published repository, and makes the repository read-only again.
- Updates should then appear on worker nodes typically within a half hour, but sometimes it will be longer if the update was large or if the Stratum 1 is busy with doing a large update of another repository.
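The transaction, update, and publish steps above could be wrapped in a small script along the following lines. This is a minimal sketch under stated assumptions: the REPO value is the placeholder name used in the text, and the CVMFS_SERVER variable and the update_files and publish_update functions are hypothetical names introduced here for illustration. Only the transaction, abort, and publish subcommands come from the steps above; abort may ask for confirmation when run interactively.

```shell
#!/bin/sh
# Placeholder repository name, as used in the text.
REPO=${REPO:-reponame.opensciencegrid.org}

# Assumption for this sketch: set CVMFS_SERVER=echo to print the
# commands instead of running them (a simple dry run).
CVMFS_SERVER=${CVMFS_SERVER:-cvmfs_server}

# Hypothetical hook: copy or edit files under /cvmfs/$REPO here.
# It should return non-zero if the update went wrong.
update_files() {
    :
}

publish_update() {
    # Open a transaction so the repository becomes writable.
    $CVMFS_SERVER transaction "$REPO" || return 1

    if update_files; then
        # Process the changes and make the repository read-only again.
        $CVMFS_SERVER publish "$REPO"
    else
        # Discard all changes made since the transaction started.
        $CVMFS_SERVER abort "$REPO"
        return 1
    fi
}
```

Running CVMFS_SERVER=echo publish_update after sourcing the script prints the commands that would be issued, which is a convenient way to review the sequence before touching a real repository.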
Installers should read more details about how to maintain repositories at http://cernvm.cern.ch/portal/filesystem/maintain-repositories. It is especially important to maintain a good .cvmfsdirtab file, and it should be in place before the initial publication of a large number of files. For assistance with determining good contents for .cvmfsdirtab you can contact the FIFE Support group via a Service Now ticket, but typically it is best to match the directories of each new release of every software package installed in the repository. Installers also have to be careful to follow the limitations on repository content as described at http://cernvm.cern.ch/portal/filesystem/repository-limits.
If installed files are copied into cvmfs from some other location, use /usr/bin/cvmfs_rsync on the publishing machine to do the synchronization rather than ordinary rsync. cvmfs_rsync understands the .cvmfscatalog files that are created for all the directories matched in the .cvmfsdirtab; the command avoids deleting the .cvmfscatalog files (which otherwise would happen every sync because they are not in the source location) unless the original source directory was deleted.
Nightly build repositories¶
Normally, even when a user deletes files from a CVMFS repository, they are not deleted from the stored files. That is because the files are stored de-duplicated (and compressed), and it is difficult for CVMFS to find out if another directory is referring to the same file contents. This is a problem mainly for cases that have a lot of churn, where files are frequently deleted, such as nightly builds. CVMFS has a solution for that called garbage collection. It is a relatively complex and time-consuming process, so it is not recommended for normal software distribution, which is expected to only rarely delete old files. So, the FIFE policy is to create a separate CVMFS repository when garbage collection is needed.
To request such a repository, go through the normal process to request a repository as above, but ask for a name that begins with the VO name, then a hyphen, followed by the purpose. For example, "nova-nightlies.opensciencegrid.org". Also ask that garbage collection be enabled. The user will be responsible for periodically running cvmfs_server gc reponame.opensciencegrid.org, no more often than daily and no less often than weekly. Check for error codes to make sure that it runs correctly. It cannot run at the same time as other transactions.
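The naming rule and the advice to check error codes can be sketched as follows. The function names nightly_repo_name and run_gc, and the CVMFS_SERVER override, are assumptions made for this example; the only real command used is cvmfs_server gc as given in the text, which may ask for confirmation when run interactively.

```shell
#!/bin/sh
# Hypothetical helper: build a repository name from the VO name and a purpose.
nightly_repo_name() {
    vo=$1
    purpose=$2
    echo "${vo}-${purpose}.opensciencegrid.org"
}

# Hypothetical wrapper: run garbage collection and report a failure.
# CVMFS_SERVER can be overridden for testing; it defaults to cvmfs_server.
run_gc() {
    repo=$1
    if ! ${CVMFS_SERVER:-cvmfs_server} gc "$repo"; then
        echo "garbage collection failed for $repo" >&2
        return 1
    fi
}

nightly_repo_name nova nightlies
```

A scheduled job that calls run_gc once a day or once a week, and alerts on a non-zero exit status, would satisfy the frequency and error-checking guidance above. It must not overlap with an open transaction on the same repository.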
OASIS/CVMFS process for projects based at Fermilab¶
For projects that are too small to be their own VO, they can be part of the "fermilab" OSG VO and have their code added to the shared repository fermilab.opensciencegrid.org if they submit a request as detailed in the Integrating Experiments into FIFE documentation. That repository is maintained by the FIFE Support group who will set up the repository server to automatically copy in the project code from a subdirectory or two on /grid/fermiapp and publish it to CVMFS. The projects that are using UPS/UPD should put that code under /grid/fermiapp/cvmfsfermilab/publish/products/projectname, and any other code in /grid/fermiapp/cvmfsfermilab/publish/projectname. The files are expected to be owned by a group account (typically matching the project name), and the people who manage the code log in to the group account to update the files. Normally the code will be automatically synchronized and published each night, but projects can also request an earlier update by running
It is also possible for a project that has too few people to be a VO but has a lot of files to have their own repository like VOs do. The advantage of having their own repository is that the performance of publishing is better. The way that is done is by officially registering the project with the OSG under the fermilab VO.
OASIS/CVMFS process for handling partially reused data files (StashCache)¶
Regular CVMFS is designed for distributing code, where all the jobs in a large batch of jobs execute the same code; the caches are engineered to expect a lot of repetition in the files accessed. If a VO wants to access data files in CVMFS in much the same way as code, where the number and size of files are similar to executable code and all jobs in a batch access the same files, that works OK. However if there is different data accessed by different jobs and the total dataset is relatively larger than the amount of code accessed by a typical job (on the order of a couple of GB), then regular CVMFS does not work well.
If the data files are at least partially reused, where some of the jobs read the same files, there is a separate OASIS/CVMFS solution for that. The example application that motivated this solution is Genie flux files. With this solution, the files are stored in dCache, they can be transparently cached at a number of OSG StashCache servers distributed around the OSG, and they can be accessed via a special per-VO osgstorage.org CVMFS repository. With the special repository, the POSIX metadata about the files is distributed normally through CVMFS, but when the files are actually accessed they are read either directly from dCache (when on Fermilab worker nodes) or via distributed StashCache servers (when running out on the OSG). There is a small 1 GB local cache for all osgstorage.org data files on worker nodes, but generally it is expected that the files will stream from dCache or the StashCache servers whenever they are read. To the end user, a file looks like a typical POSIX file path and can be read normally without any special copying.
The process to set up a per-VO osgstorage.org repository is as follows:
- The VO sets up a persistent dCache area /pnfs/fnal.gov/usr/VO/persistent/stash, making sure that the "persistent" directory is marked as persistent so it is never archived to tape and taken offline.
- A VO representative makes a ticket for FIFE support requesting a VO.osgstorage.org repository, indicating that dCache path.
- FIFE support then makes an OSG ticket for the University of Nebraska (assigned to the Nebraska StashCache Instance) to create the VO.osgstorage.org repository, indicating the dCache path and asking that a "stash" symlink be created at the top level pointing to the longer path.
- FIFE Support opens a request to the "dCache Disk Cache Storage" Service located at "Scientific Computing Services" -> "Scientific Data Storage and Access" -> "dCache Disk Cache Storage" to ask them to configure WebDAV for anonymous access to the VO's StashCache area, as well as unauthenticated xrootd access to that area.
- FIFE support makes a FNAL service desk ticket for "Scientific Computing Services" "High Throughput Computing" "Data & Application Caching Operations" to configure local worker nodes to read VO.osgstorage.org directly from dCache.
Once the osgstorage.org repository is set up, to publish files a VO only has to put the files under that persistent/stash directory in dCache, and the files should appear under /cvmfs/VO.osgstorage.org/stash/ within a couple of hours. As of this writing, not all OSG sites support the osgstorage.org CVMFS repositories; users who notice a site where it doesn't work and who want to run jobs there should open a FIFE support service desk ticket. In general jobsub users can add a job requirements option that will prevent their jobs from starting on sites that don't have the repository set up:
Where VO is the VO name used above.
Modifying the contents of a StashCache repository
Note that the process of adding or deleting files is different from a regular CVMFS repository; there is no logging in to the oasiscfs machine and no cvmfs publish command. To add files, simply copy them to the /pnfs/VO/persistent/stash/rest_of_path area (or whatever the base area is if it isn't /pnfs/VO/persistent/stash) using ifdh cp or any of the other usual copying tools. Do NOT try to copy directly to the /cvmfs/VO.osgstorage.org area. The syncing and updating will be done automatically, typically within 30 minutes. The files will then show up in /cvmfs/VO.osgstorage.org/stash/(rest of path) after the update. There will also be a new revision number in the repository. To ensure your jobs have access to the post-update files, you can add the following to your job requirements:
where NNN is the revision number you see after the update shows up in the /cvmfs area. To obtain the revision number, run attr -q -g revision /cvmfs/VO.osgstorage.org. As before, don't forget to replace VO with the actual name of your VO.
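The path mapping and the revision check described above can be sketched in shell. The function names below are hypothetical, and the sketch assumes the default base area /pnfs/VO/persistent/stash; the only external command used is the attr invocation quoted in the text.

```shell
#!/bin/sh
# Hypothetical helper: map a file under the dCache stash area to the
# path where it appears in CVMFS after the automatic sync.
stash_to_cvmfs() {
    vo=$1
    pnfs_path=$2
    base="/pnfs/$vo/persistent/stash"
    case $pnfs_path in
        "$base"/*)
            echo "/cvmfs/$vo.osgstorage.org/stash/${pnfs_path#"$base"/}"
            ;;
        *)
            echo "not under $base: $pnfs_path" >&2
            return 1
            ;;
    esac
}

# Read the current revision of the per-VO repository, as in the text.
repo_revision() {
    attr -q -g revision "/cvmfs/$1.osgstorage.org"
}

# Succeed if the revision seen locally is at least the one required.
revision_at_least() {
    current=$1
    wanted=$2
    [ "$current" -ge "$wanted" ]
}
```

For example, revision_at_least "$(repo_revision VO)" NNN succeeds once the update is visible on the local machine, with VO and NNN replaced as described above.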
Important note about overwriting StashCache files: As of May 2018 you can run into problems if you remove files and put new files with the same name in their place, because the new files have different checksums. Sometimes it works, but it definitely does not in all cases. If you need to update files, it is best to give them slightly different names or put them in a different directory.
- OASIS/CVMFS section of Integrating Experiments into FIFE
Data Management Overview¶
Data handling is the management, movement, and tracking of experiment files at Fermilab and remote grid sites to run experiment-written science applications. Fermilab provides a data handling system to maximize efficiency and minimize cost for managing and delivering data to science applications running as jobs on the grid, especially focusing on utilizing opportunistic resources. Data handling unites the appropriate input data files with running batch jobs, and returns the output files to the correct storage area. Task complexity depends on:
- amount of data produced by an experiment
- geographic diversity of its operations.
Efficiently running high-throughput jobs on widely dispersed grid computing elements requires a more advanced infrastructure than running at a single site co-located with the entire dataset.
Picture from Adam Lyon
Data management includes cataloging metadata, maintaining datasets, and coordinating file delivery to jobs. Within the offline processing model for FIFE, one of the largest challenges for analysis is the delivery of data to computing resources. The challenges include: storage element infrastructure, file catalogs, and transfer services. Large scale (many PetaByte) storage elements use two different protocols: BlueArc hardware via NFS and dCache.
The Sequential Access via Metadata (SAM) service is a robust and mature file catalog, delivery, and tracking system developed at Fermilab. It is based on file metadata and is independent of the storage element. The SAM service recently transitioned to a web-based interface (SAMWeb), which allows for greater integration into OSG operations. The Intensity Frontier Data Handling Client (IFDHC) service is designed to provide access to all Fermilab storage elements (BlueArc, dCache, and Enstore tape storage) through a single interface. IFDHC focuses on resource brokering that keeps any storage element from becoming overloaded while minimizing the time workers sit idle waiting for resource tokens.
A metadata catalogue stores information about the contents of the data files, and provides a way to query the catalogue to obtain a list of files matching whatever criteria the job needs. The metadata catalogue can also be used to store information on the lineage and processing history of the files.
Locating the data files tracked by the experiment requires a location catalogue. To be useful, the catalogue needs to be kept reasonably up to date, so it must be integrated with any file transfer service.
Picture from Adam Lyon
Each job should provide notifications of its data handling activities to the monitoring system – requesting next file, starting to transfer file to worker node, finished transferring file, etc. This helps with pinpointing problems where jobs get stuck waiting for files.
Data Management References¶
IF Data Handling Client Tools (ifdhc)¶
IFDH (Intensity Frontier Data Handling) is a suite of tools for data movement tasks for Fermilab experiments. IFDH encompasses moving input data from caches or storage elements to compute nodes (the "last mile" of data movement) and moving output data, potentially to those caches, as part of the journey back to the user. IFDH:
- is easy to use.
- does 'The Right Things' for data transfer, and avoids things that cause phone calls stating, "Your jobs are hanging the BlueArc."
- changes with the updated environment so users don't have to keep changing scripts. (For example, when the new UberCache replaces dCache, it will be supported automatically in ifdh and related commands.)
- can be called from python scripts, plain C++ code, and art framework code.
IFDH ensures large numbers of jobs do not cause data movement bottlenecks. It does this by throttling and locking. IFDH is called in job scripts (e.g. "ifdh cp"), hiding the low-level data movement. The underlying low-level tools can be selected or changed without the need for the user to alter their scripts. Logging and performance monitoring can also be added easily.
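As an illustration of the throttling idea described above, the following minimal Python sketch limits how many copies may run at once by using a semaphore. It is not how ifdh is implemented; the names ThrottledCopier, copy_all, and simulated_copy, and the max_concurrent parameter, are hypothetical and exist only for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_copy(src, dst):
    """Stand-in for a real transfer; sleeps briefly and returns the destination."""
    time.sleep(0.01)
    return dst


class ThrottledCopier:
    """Allow at most `max_concurrent` copies to run at the same time."""

    def __init__(self, copy_func=simulated_copy, max_concurrent=2):
        self._copy = copy_func
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of simultaneous copies observed

    def cp(self, src, dst):
        with self._slots:  # blocks until a transfer slot is free
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            try:
                return self._copy(src, dst)
            finally:
                with self._lock:
                    self._active -= 1


def copy_all(copier, pairs, workers=8):
    """Run many copies from parallel threads through one throttled copier."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: copier.cp(*pair), pairs))
```

Even with eight worker threads requesting copies, no more than max_concurrent transfers are in flight at any moment, which is the property that protects a shared storage element.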
The data handling system maximizes efficiency and minimizes cost for managing and delivering data to science applications running as jobs on the grid, especially focusing on utilizing opportunistic resources.
Picture from Adam Lyon
The software is distributed as UPS products, installed via UPD in the product areas /grid/fermiapp/products/common/db and /nusoft/app/externals on the Fermilab BlueArc file servers. The ifdh_art package provides art service access to the libraries from the ifdhc package. The usage of the packages is largely the same, except that art service handles are used to access the various types.
Users are expected to:
● declare datasets for their jobs to analyze, either with the web GUI or with ifdh define_dataset
● submit jobs with a DAG-based job submission script
● in those jobs, obtain input files with an “ifdh next_file” call
● copy output files back to a staging area with “ifdh cp”
● use the File Transfer Service to stage the output files to tape and/or semi-permanent disk storage
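The file-consumption loop inside a job can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the ifdh API: next_file and copy_out are hypothetical callables that stand in for the “ifdh next_file” and “ifdh cp” calls mentioned above, and process stands in for the experiment's own code.

```python
def run_job(next_file, process, copy_out, staging_area):
    """Consume input files one at a time until the dataset is exhausted.

    next_file -- hypothetical callable standing in for an "ifdh next_file" call;
                 returns a local path, or an empty string when no files remain
    process   -- experiment code; takes an input path and returns an output path
    copy_out  -- hypothetical callable standing in for "ifdh cp"; returns the
                 final location of the copied file
    """
    staged = []
    while True:
        input_path = next_file()
        if not input_path:  # nothing left in the dataset
            break
        output_path = process(input_path)
        staged.append(copy_out(output_path, staging_area))
    return staged
```

Passing the three operations in as arguments keeps the loop independent of the underlying transfer tools, which mirrors the way job scripts are insulated from low-level data movement.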
Sequential Access via Metadata (SAM) is a data handling system organized as a set of servers which work together to store and retrieve files and associated metadata, including a complete record of the processing which has used the files. As much as possible, SAM frees experiments from dealing with the specific details of their data so they can focus on the content rather than the technicalities. SAM provides:
- metadata catalogue - What’s in the data files?
- location catalogue - Where are my files?
- data delivery service - Give me my files.
SAM handles the file metadata so that the user does not need to know the file name to find data of interest. It delivers files without users having to know where they come from. It also does the bookkeeping, keeping track of datasets created and files processed. All of this information is kept in a database.
SAM was designed as a highly-automated, "hands-off" system requiring minimal routine intervention. This is good for experiments which can't provide dedicated expertise for operating data management systems. Data staging from tape and from storage element to storage element is automatic. The philosophy of SAM is, 'Bring the data to the jobs,' not 'bring the jobs to the data.'
SAM was developed for and used during Run II of the Tevatron. It has since evolved and been improved and modernized for use by multiple current and future Fermilab experiments.
Picture from: Robert Illingworth
The dataflow for analysis isn't as simple as taking raw data, reconstructing it, analyzing the reconstructed data and publishing paper(s). Even a small HEP experiment has millions of data files to keep track of, and a complicated process to ensure the correct ones are given to the next stage of processing.
Picture from: Robert Illingworth
The metadata catalogue of SAM answers the question of what is in the data files. SAM provides a file access service that turns a dataset into concrete locations. Job activity is tracked extensively, and output data is reliably added to the catalogue and archived on disk or tape.
SAM used to have a heavyweight client implementation, which made integration with experiment frameworks difficult, but the new and improved SAM uses lightweight HTTP REST (Representational State Transfer) interfaces. REST is an architectural style with coordinated architectural constraints applied to components, connectors, and data elements within a distributed hypermedia system. REST ignores the details of component implementation and protocol syntax to focus on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements.
HTTP REST allows the use of standard Grid Public Key Infrastructure (PKI) authentication methods. The use of simpler interfaces implemented using a familiar protocol enables a loose coupling between SAM and Grid components such as job submission and storage, making it possible to interoperate with other services, if required.
SAM interacts with the FIFE-Jobsub glideinWMS-based job-submission system. Glidein is a mechanism by which one or more grid resources (remote machines) temporarily join a local Condor pool; Condor is an open-source cluster computing solution. GlideinWMS is a Glidein-based Workload Management System used for scheduling and job control.
SAM contains a metadata catalogue to store information about experiment files. Metadata is data about data. It includes physical metadata: file size, checksum. It also includes the physics metadata such as: run number, detector configuration, or simulation parameters. This data is experiment-dependent. Since metadata is used to catalogue the data, it is best to add it early.
The metadata fields mostly consist of key-value pairs. A key-value pair is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. Key-value pairs are frequently used in lookup tables, hash tables and configuration files. Experiments can define their own fields rather than being restricted to a predefined set of metadata fields.
The metadata also stores provenance information for a file. Provenance, also known as pedigree or lineage, refers to the complete history of a file. Provenance data provides sufficient detail to facilitate reproduction and enable validation of results. For each file, the metadata in SAM stores the parent files from which it was derived, and the application and version used to create it.
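To make the idea of lineage concrete, here is a minimal Python sketch that walks parent links to recover the full ancestry of a file. It is an illustration only and does not reflect SAM's internal data model; the PARENTS mapping, the file names, and the ancestors function are hypothetical.

```python
def ancestors(parents, file_name):
    """Return every file that `file_name` was derived from, directly or indirectly."""
    found = set()
    pending = list(parents.get(file_name, ()))
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(parents.get(current, ()))
    return found


# Hypothetical lineage: raw data -> reconstructed file -> analysis ntuple.
PARENTS = {
    "reco_001.root": ["raw_001.dat"],
    "ntuple_001.root": ["reco_001.root"],
}
```

Calling ancestors(PARENTS, "ntuple_001.root") yields both the reconstructed file and the raw file it came from, which is the information needed to reproduce or validate a result.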
Every file stored in SAM must have metadata which is normally uploaded when the file is first added to the database.
The predefined metadata fields are at:
Each experiment can define its own fields, called parameters. Each parameter has a category and a name, both of which must be defined before use. Values may be anything, and can be used to store any data relevant to the file.
SAM provides a query language for retrieving sets of files by their metadata values which has been enhanced to allow simple queries like “run_number 13501 and file_format raw” which returns all raw data files from run 13501. It also allows more complex queries such as “run_number 13501 and file_format raw and notisparentof: (application reconstruction and version S12.02.14)” which returns all raw files from run 13501 which do not have a derived file reconstructed with the specified version of the software.
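The simple form of query above can be mimicked with a toy evaluator over an in-memory catalogue. The Python sketch below is not SAM's query parser: it handles only "key value and key value" conjunctions, and the CATALOGUE contents, file names, and function names are hypothetical.

```python
def parse_simple_query(query):
    """Turn 'key value and key value ...' into a {key: value} dictionary."""
    conditions = {}
    for clause in query.split(" and "):
        key, value = clause.split(None, 1)
        conditions[key] = value.strip()
    return conditions


def select_files(catalogue, query):
    """Return the sorted names of files whose metadata satisfy every condition."""
    conditions = parse_simple_query(query)
    return sorted(
        name
        for name, metadata in catalogue.items()
        if all(str(metadata.get(k)) == v for k, v in conditions.items())
    )


# Hypothetical catalogue: file name -> metadata key-value pairs.
CATALOGUE = {
    "raw_13501_a.dat": {"run_number": 13501, "file_format": "raw"},
    "raw_13501_b.dat": {"run_number": 13501, "file_format": "raw"},
    "reco_13501_a.root": {"run_number": 13501, "file_format": "root"},
    "raw_13502_a.dat": {"run_number": 13502, "file_format": "raw"},
}
```

With this catalogue, the query "run_number 13501 and file_format raw" selects only the two raw files from run 13501.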
Flexible datasets can be created as stored queries. These datasets are evaluated dynamically; if more files that match the query criteria have been added since the dataset was created they will be included when it is used. End users can define their own datasets. They aren’t restricted to a predefined set created by an administrator.
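The difference between a stored query and a stored file list can be shown with a short Python sketch. The Dataset class here is hypothetical and only illustrates dynamic evaluation; it is not how SAM represents dataset definitions.

```python
class Dataset:
    """A named, stored query that is re-evaluated each time it is used."""

    def __init__(self, name, predicate):
        self.name = name
        self._predicate = predicate  # the stored criteria, not a stored file list

    def files(self, catalogue):
        """Evaluate the query against the catalogue as it exists right now."""
        return sorted(
            file_name
            for file_name, metadata in catalogue.items()
            if self._predicate(metadata)
        )
```

Because only the criteria are stored, a file that is added to the catalogue after the dataset was defined shows up the next time files() is called.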
SAM is neutral as to the access method used for the ‘last-mile’ access of the data files from the storage system to the worker node. The job is responsible for the access method used.
Picture from: Robert Illingworth
Because SAM works with datasets rather than individual files, it can command prestaging of files from tape, or transfers from storage system to storage system, before the end user job needs to access the file. This enables more efficient file access than a purely access driven system while not requiring manual staging of datasets.
All SAM file activity is logged in a database. This keeps a complete record of which files were given to which jobs, and which files and jobs were reported as having completed successfully. SAM does all of this itself, so the user doesn't have to do the bookkeeping. The tracking of all file activity allows for simple recovery of files that were not correctly processed. This can happen for a variety of reasons:
- the storage system may not have been able to provide some of the dataset;
- the job may not have completed processing due to preemption or crashes;
- output may have been lost due to some failure of the copy back procedure.
SAM can automatically create a recovery dataset for any project. This consists of all the files that were not successfully processed in the first pass.
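Conceptually, a recovery dataset is the set difference between the files a project was supposed to process and the files reported as successfully processed. A minimal Python sketch of that bookkeeping, using a hypothetical function name and file names:

```python
def recovery_dataset(project_files, processed_ok):
    """Files from the original project that were not reported as successfully processed.

    The original ordering of `project_files` is preserved.
    """
    done = set(processed_ok)
    return [f for f in project_files if f not in done]
```

For example, if a project covered files f1 through f4 and only f1 and f3 were reported as successful, the recovery dataset contains f2 and f4.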
SAM provides on-site near-line direct access to files on tape, or on and off site access of files through a disk cached front end to the tape storage. The tape storage system is called Enstore, which was developed by Fermilab. Files get written to disks and then migrated to Enstore tapes. For file read requests, if the files do not reside in the disk cache, they first get retrieved from Enstore.
SAM is experiment agnostic, providing flexibility. The high level of automation reduces the administrative effort required from experiments.
File Transfer System¶
The File Transfer Service (FTS) is a daemon process that is designed to automate the transfer of files from one storage system to another. It can handle local data transfers, transfers to remote nodes, and third party transfers between remote nodes. FTS works in conjunction with the SAM data cataloguing system to track the location of transferred files. FTS can also be configured to manage the disposition of input files after the transfers have been successfully completed or after the transfer attempts have failed a specified number of retries. Normal user interaction with FTS consists of simply placing the files to be transferred into a designated dropbox area.
Operation of FTS is controlled by a configuration file which allows designation of dropbox areas, dropbox scanning intervals, stored file disposition, transfer destinations, and file types to be considered for transfer.
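The dropbox behaviour described above can be sketched as a single scanning pass. The Python below is an illustration under stated assumptions and is not the FTS implementation or its configuration format: scan_dropbox, its parameters, and the "delete the source after a successful copy" disposition are all hypothetical choices made for the example.

```python
import fnmatch
import shutil
from pathlib import Path


def scan_dropbox(dropbox, destination, patterns=("*.root",), max_retries=3,
                 copy=shutil.copy2):
    """Perform one scan of a dropbox directory.

    Files matching `patterns` are copied to `destination`. After a successful
    copy the source is removed; after `max_retries` failed attempts the file
    is left in place and reported as failed.
    """
    dropbox, destination = Path(dropbox), Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    transferred, failed = [], []
    for path in sorted(dropbox.iterdir()):
        matches = any(fnmatch.fnmatch(path.name, p) for p in patterns)
        if not path.is_file() or not matches:
            continue  # not a file type considered for transfer
        for _attempt in range(max_retries):
            try:
                copy(path, destination / path.name)
            except OSError:
                continue  # try again, up to max_retries attempts
            path.unlink()  # disposition of the input file after success
            transferred.append(path.name)
            break
        else:
            failed.append(path.name)  # every attempt raised an error
    return transferred, failed
```

A daemon would call a function like this repeatedly at the configured scanning interval and record the results for monitoring.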
User monitoring of FTS is available via a web page which tracks current and past transfers, transfer rates, and the operational status of critical underpinning components.
dCache is disk caching software capable of massive data throughput, developed jointly by DESY, Fermilab and NDGF. It stores and retrieves large amounts of data, distributed among a large number of heterogeneous disk server nodes, under a single virtual filesystem. A variety of standard protocols support the storage, retrieval and management of data.
dCache combines heterogeneous disk storage systems of several hundred terabytes and lets its data repository appear under a single filesystem tree. It takes care of data hot spots and failing hardware and, if configured to do so, ensures that at least a minimum number of copies of each dataset resides within the system to ensure full data availability in case of disk server maintenance or failure.
dCache separates the filename space of its data repository from the actual physical location of the datasets. dCache transparently handles all necessary data transfers between nodes and optionally between the external Storage Manager and the cache itself.
dCache can be used independently for high performance volatile storage, or in conjunction with Enstore as the high-speed front end of a tape-backed hierarchical storage system. In the latter use, dCache decouples the low latency and high speed of network transfer from the high latency sequential access of tapes and provides high performance access to frequently accessed files. Whether the file already exists in the disk cache, or needs to be first retrieved from tape, is transparent to the user. dCache systems at Fermilab include:
- US CMS Tier1 dCache System (http://cmsdca.fnal.gov/)
- CDF dCache System (http://cdfdca.fnal.gov/)
- D0 dCache (http://d0dca.fnal.gov)
- Public dCache System (all other experiments) (http://fndca.fnal.gov/)
The dCache was originally designed as a front-end for a set of Hierarchical Storage Managers, namely Enstore, EuroGate and DESY’s OSM. It has since been further developed and can be implemented stand-alone. When used as a front-end, dCache can be viewed as an intermediate “relay station” between client applications and the Hierarchical Storage Manager. dCache communication with Enstore is transparent to the user (over a high-speed Ethernet connection). Each experiment determines the protocol experimenters may use, and communicates this information to the Enstore administrators who manage the configurations. Local users can access data through dcap (a POSIX-like interface), kerberized FTP, and NFSv4.1. Files can also be accessed via protocols designed specifically for the WAN: SRM and GridFTP. The dCache decouples the potentially slow network transfer (to and from client machines) from the fast storage media I/O to keep Enstore from bogging down.
Data files uploaded to the dCache from a user’s machine are stored on highly reliable RAID (Redundant Array of Independent Disks) pending transfer to Enstore. Files already written to storage media that get downloaded to the dCache from Enstore are stored on ordinary disks.
The dCache is installed at Fermilab on a server machine on which the /pnfs root area is mounted. Since the PNFS namespace can only be mounted on machines in the fnal.gov domain, off-site users may only access Enstore via the dCache. On-site users are strongly encouraged to go through the dCache as well.
More details about the different subdivisions within dCache are here: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Understanding_storage_volumes
FermiCloud can be utilized by experiments to quickly instantiate specific configuration of computing nodes to understand portability and scalability of an application. Instead of investing in new hardware, experimenters can request a Virtual Machine be spun up on FermiCloud for testing with little or no cost to the experiment. This allows for rapid development of applications on multiple platforms and configurations while allowing for validation of results.
FermiCloud is an Infrastructure as a Service (IaaS) Cloud Computing capability in support of the Fermilab Scientific Program. Additional IaaS, PaaS and SaaS Cloud Computing capabilities are supported based on the FermiCloud infrastructure at Fermilab.
FermiCloud is based on OpenNebula, an open-source solution to build and manage enterprise clouds and virtualized data centers. OpenNebula orchestrates storage, network, virtualization, monitoring, and security technologies to enable the dynamic placement of groups of interconnected virtual machines on distributed infrastructures, combining data center and remote cloud resources.
The FermiCloud Project focuses on delivering on-demand services to support workflows of scientific users. This includes virtual machine batch submission, data movement, web caching, and testing and integration. These services run both on FermiCloud and on Amazon Web Services and eventually on other community and commercial clouds as well.
Use cases to date for FermiCloud have included monitoring servers, simulation of data acquisition systems, build nodes to test new operating systems, storage system testing, scale testing of new services, experimental GridFTP services, and specialized analysis servers.
Any registered employee, contractor, or visitor of Fermilab can use FermiCloud.
Picture from: Kimberly Myles and Canzone Diana
Creating a VM in FermiCloud means being its system administrator (root permissions) and the VM should be maintained according to Fermilab Computing policy available at:
http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1186;filename=Fermilab_NEW_Policy_on_Computing_-_2013.pdf and following Security Essentials for Fermilab System Administrators available at: http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=3431
Conditions database¶
The conditions database is a general purpose database for conditions data, that is, data which has time validity information. Conditions data is an umbrella term for information that describes detector and beam conditions. This is metadata generally valid for detector data taken during specific periods of time, sometimes referred to as Intervals Of Validity (IOV). It is a type of metadata necessary to make sense of the detector data, and it includes calibration, alignment, attenuation, pedestal, etc. for detector channels, as well as information about the intensity and characteristics of the beam.
Some of this information is required for processing and analysis of detector data, and thus access is required by many clients running simultaneously on interactive and grid resources. Much of this data is stored in central databases or files, and approaches to scale the delivery to thousands of clients are needed.
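An interval-of-validity lookup can be illustrated with a sorted list and a binary search. The Python sketch below is a toy model, not the conditions database interface: it assumes each value stays valid from its start time until the next entry begins, and the ConditionsTable class and its method names are hypothetical.

```python
import bisect


class ConditionsTable:
    """Values keyed by the start of their interval of validity.

    Assumption for this sketch: a value is valid from its start time until
    the start time of the next entry.
    """

    def __init__(self):
        self._starts = []  # sorted interval start times
        self._values = []  # value valid from the matching start time

    def add(self, valid_from, value):
        index = bisect.bisect_left(self._starts, valid_from)
        if index < len(self._starts) and self._starts[index] == valid_from:
            self._values[index] = value  # replace the value for an existing start
        else:
            self._starts.insert(index, valid_from)
            self._values.insert(index, value)

    def lookup(self, timestamp):
        """Return the value whose interval contains `timestamp`."""
        index = bisect.bisect_right(self._starts, timestamp) - 1
        if index < 0:
            raise KeyError(f"no conditions valid at {timestamp}")
        return self._values[index]
```

A job processing an event taken at time t would call lookup(t) to obtain, for example, the pedestal values that were in effect when that event was recorded.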
Typical values for the following parameters need to be obtained from experiments and/or estimated.
1. Expected request rate: Peak and Average
2. Data unit size
3. Latency requirements
4. Accepted failure rate
5. Some estimate of time correlation between requests
6. Number of clients
7. Location of clients
Also important are boundary conditions such as hardware to be used, network bandwidth available and which technologies to use and not to use.
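A few of the parameters listed above combine into simple back-of-the-envelope estimates. The Python sketch below is only an illustration: the LoadProfile class, its field names, and the assumption that a cache absorbs a fixed fraction of requests are hypothetical, and real numbers must come from the experiment.

```python
from dataclasses import dataclass


@dataclass
class LoadProfile:
    peak_requests_per_s: float
    average_requests_per_s: float
    data_unit_bytes: int
    cache_hit_fraction: float = 0.0  # assumed share of requests served by a cache

    def peak_bandwidth_bytes_per_s(self):
        """Network bandwidth needed to serve the peak request rate."""
        return self.peak_requests_per_s * self.data_unit_bytes

    def backend_peak_requests_per_s(self):
        """Peak request rate that still reaches the central database."""
        return self.peak_requests_per_s * (1.0 - self.cache_hit_fraction)
```

Strong time correlation between requests (many jobs asking for the same data at about the same time) generally raises the achievable cache hit fraction, which is why that estimate appears in the list.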
Picture from Igor Mandrichenko
Conditions Database References¶
Electronic Logbooks¶
An electronic logbook is computer-based software for recording (logging) states, events or conditions. Electronic logbooks derive from paper-based logbooks used in the maritime sector. A wide variety of implementations are available, although most are based on the classical client-server approach, with the client being a simple web browser.
Scientific research is performed by large collaborations of organizations and individuals. The logbook of a scientific collaboration is an important part of the collaboration record, and often contains experimental data.
The Electronic Collaboration Logbook (ECL) application is used by about two dozen different collaborations, experiments and groups at Fermilab. ECL is the latest iteration of the project formerly known as Control Room Logbook (CRL). The logbook is a collection of logbook entries. Each entry has an author and a timestamp. An entry can be a simple text or a filled form. An entry can have any number of images or documents attached to it. Each entry belongs to one and only one category.
Forms are used to define entries with a standard format and are created by the ECL administrator. Categories are defined by the ECL administrator and organized into a tree structure. An entry can have zero or more tags attached to it. A tag is an arbitrary text string defined by the ECL administrator.
Once an entry is committed to the database, the entry content cannot be modified, but it can be amended via comments or related entries.
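The "immutable entry, amend by comment" model can be sketched in Python with a frozen dataclass. This is an illustration of the data structure only, not the ECL schema or API; the Entry and Logbook classes and their fields are hypothetical.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    """A committed logbook entry; frozen, so its fields cannot be reassigned."""
    author: str
    text: str
    category: str  # each entry belongs to exactly one category
    tags: tuple = ()  # zero or more tags
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Logbook:
    def __init__(self):
        self._entries = []
        self._comments = {}  # entry id -> list of (author, text) amendments

    def commit(self, entry):
        """Store an entry and return its id."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def amend(self, entry_id, author, text):
        """Attach a comment instead of editing the original entry."""
        if not 0 <= entry_id < len(self._entries):
            raise IndexError("no such entry")
        self._comments.setdefault(entry_id, []).append((author, text))

    def view(self, entry_id):
        """Return the original entry together with its amendments."""
        return self._entries[entry_id], list(self._comments.get(entry_id, []))
```

Attempting to assign to a field of a committed Entry raises FrozenInstanceError, while amend leaves the original text intact and records the correction alongside it.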
An advantage of storing all logbook information in a database is the ability to search data by many different criteria. The ECL allows searching data by:
• Date range
• Presence of attachments
Shift Scheduler was added to ECL as an extension of its functionality. ECL and Shift Scheduler share a common Members and Organizations Database.
Electronic Logbooks References¶
Continuous Integration (CI)¶
Continuous integration is a software engineering practice in which changes to software code are immediately tested and reported.
The Continuous Integration Project (CI) aims to apply the continuous integration development practice to all projects, collaborations, and experiments that request it.
The goal is to provide rapid feedback that helps identify defects introduced by code changes as soon as possible.
Issues detected early on in development are typically smaller, less complex and easier to resolve.
The CI provides a CI Web Application to monitor the status of the experiment code tested and the history of previous tests.
From the CI Web Application the user can access detailed information about all the test results, the logs of the tests, and any plots generated by the tests themselves. For each test, statistics such as memory usage and run time are provided to monitor the performance of the code.
Continuous Integration References¶
Useful links to other services from CD and SCD¶
- pnfs at: http://computing.fnal.gov/xms/Science_%26_Computing/Scientific_Facilities/Mass_Storage
- Enstore information at: https://cdcvs.fnal.gov/redmine/projects/enstore