The CMS Collaboration at CERN has released more than 300 terabytes (TB) of high-quality open data. These include over 100 TB, or 2.5 inverse femtobarns (fb⁻¹), of data from proton collisions at 7 TeV, making up half the data collected at the LHC by the CMS detector in 2011. This follows a previous release from November 2014, which made available around 27 TB of research data collected in 2010.
Available on the CERN Open Data Portal, which is built in collaboration with members of CERN’s IT Department and Scientific Information Service, the collision data are released into the public domain under the CC0 waiver and come in two types. The so-called “primary datasets” are in the same format used by the CMS Collaboration to perform research. The “derived datasets”, on the other hand, require a lot less computing power and can be readily analysed by university or high-school students, and CMS has provided a limited number of datasets in this format.
Notably, CMS is also providing the simulated data generated with the same software version that should be used to analyse the primary datasets. Simulations play a crucial role in particle-physics research and CMS is also making available the protocols for generating the simulations that are provided. The data release is accompanied by analysis tools and code examples tailored to the datasets. A virtual-machine image based on CernVM, which comes preloaded with the software environment needed to analyse the CMS data, can also be downloaded from the portal.
These data are being made public in accordance with CMS’s commitment to long-term data preservation and as part of the collaboration’s open-data policy. “Members of the CMS Collaboration put in lots of effort and thousands of person-hours each of service work in order to operate the CMS detector and collect these research data for our analysis,” explains Kati Lassila-Perini, a CMS physicist who leads these data-preservation efforts. “However, once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly. The benefits are numerous, from inspiring high-school students to the training of the particle physicists of tomorrow. And personally, as CMS’s data-preservation co-ordinator, this is a crucial part of ensuring the long-term availability of our research data.”
The scope of open LHC data has already been demonstrated with the previous release of research data. A group of theorists at MIT wanted to study the substructure of jets — showers of hadron clusters recorded in the CMS detector. Since CMS had not performed this particular research, the theorists got in touch with the CMS scientists for advice on how to proceed. This blossomed into a fruitful collaboration between the theorists and CMS revolving around CMS open data. “As scientists, we should take the release of data from publicly funded research very seriously,” says Salvatore Rappoccio, a CMS physicist who worked with the MIT theorists. “In addition to showing good stewardship of the funding we have received, it also provides a scientific benefit to our field as a whole. While it is a difficult and daunting task with much left to do, the release of CMS data is a giant step in the right direction.”
Further, a CMS physicist in Germany tasked two undergraduates with validating the CMS Open Data by reproducing key plots from some highly cited CMS papers that used data collected in 2010. Using openly available documentation about CMS’s analysis software and with some guidance from the physicist, the students were able to re-create plots that look nearly identical to those from CMS, showing what can be achieved with these data. “I was pleasantly surprised by how easy it was for the students to get started working with the CMS Open Data and how well the exercise worked,” says Achim Geiser, the physicist behind this project. Simplified example code from one of these analyses is available on the CERN Open Data Portal and more is on its way.
Prior to the launch of the CERN Open Data Portal with the first batch of research-quality data from CMS, the Collaboration had provided certain curated datasets for use in high-school workshops. These “masterclasses”, developed by QuarkNet and conducted under the aegis of the International Particle Physics Outreach Group, bring particle-physics data to thousands of high-school students each year. These educational datasets are also available on the CERN Open Data Portal, along with an “event display” for visualising the particle-collision events.
“We are very pleased that we can make all these data publicly available,” adds Lassila-Perini. “We look forward to how they are utilised outside our collaboration, for research as well as for building educational tools.”