Info |
---|
It's a pleasure to present our second newsletter. We try to keep the release schedule close to one month, not exceeding two, and to balance being informative with not being too chatty. Apart from the regular project progress and IT news, there are several chapters on policies that will affect how observations are done and what is required to access data in the future. There's also a section on licensing of data that is no longer embargoed. Taras Yakobchuk introduces the new tool he is developing for visualizing and analyzing calibrated GRIS/GREGOR data. The tool is intended not only to help experts analyze data offered by SDC, but also to give access to laypersons who are not used to dealing with this type of data. Welcome to our third newsletter. Regretfully, we are approaching a two-month release schedule rather than the initially envisioned one-month plan, as much work is going on in parallel. Hopefully, you will still find this newsletter helpful and informative. The tools section is updated, and IT news is up to date. As always, we would like to encourage you to comment openly on any part, give feedback on any subject, or raise new topics to which you think we do not pay enough attention in the context of SDC. Feedback is always welcome and helps us to deliver a better product. |
📰 Editorial
🔒 Embargoes: Definitions, Planned Realization, and Proof of Concept
Embargoes restrict access to observational data and derived higher data products for a certain period after the observation campaign. All data in SDC is planned to be subject to such an embargo. The envisioned period for data where no Ph.D. student is involved (neither in the observation nor in the later evaluation of the data) is 1 year; if there are Ph.D. students involved, the period is prolonged to 2 years.
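The rule above can be written down as a small sketch. It assumes that the embargo period is counted from the last day of the observation campaign, which the text does not state explicitly; the function names `embargo_end` and `is_embargoed` are hypothetical and not part of any SDC software.

```python
from datetime import date


def embargo_end(campaign_end: date, phd_involved: bool) -> date:
    """Return the first day on which the data is no longer embargoed.

    Assumption: the period starts on the last day of the campaign.
    """
    years = 2 if phd_involved else 1
    try:
        return campaign_end.replace(year=campaign_end.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 1 March.
        return date(campaign_end.year + years, 3, 1)


def is_embargoed(campaign_end: date, phd_involved: bool, today: date) -> bool:
    """True while the data is still under embargo on the given day."""
    return today < embargo_end(campaign_end, phd_involved)
```

For a campaign ending on 15 April 2021, this would give 15 April 2022 without Ph.D. students and 15 April 2023 with Ph.D. students involved.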
From a technical standpoint, raw data (and derived data products) belong to a particular group of people. Initially, this group consists of the PI of the observation and all participating COIs. We assume that this group is relatively static during the embargo and that people are added to or removed from it only very rarely. Users are not meant to make such changes themselves; they need to contact an admin instead.
A campaign is a particular group of instruments used during a specific time interval. Any specific instrument can only pertain to one campaign at a time. All data gained with that instrument during that time belongs to the associated group of observers specified in the observing proposal. A posteriori, it's sufficient to know when and with what instrument the original data was taken to associate it with a campaign. Naturally, campaigns on the same instrument do not overlap, but each observer (PIs and COIs) might be part of multiple campaigns (even simultaneous ones if different instruments are involved).
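A minimal sketch of this a posteriori association follows. The `Campaign` class, its fields, and `find_campaign` are hypothetical names chosen for illustration; the real campaign database is still under development and may look quite different.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Campaign:
    name: str
    instruments: frozenset  # instrument names used in this campaign
    start: datetime         # inclusive
    end: datetime           # exclusive
    observers: frozenset    # PI and COIs


def find_campaign(campaigns: Iterable[Campaign], instrument: str,
                  taken_at: datetime) -> Optional[Campaign]:
    """Return the campaign that owns data from `instrument` at `taken_at`."""
    matches = [c for c in campaigns
               if instrument in c.instruments and c.start <= taken_at < c.end]
    if len(matches) > 1:
        # Violates the rule that an instrument pertains to one campaign at a time.
        raise ValueError("instrument belongs to more than one campaign at that time")
    return matches[0] if matches else None
```

A result of `None` corresponds to data that does not belong to any campaign.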
To keep track of campaigns, we are currently developing an online database that will provide a web interface so that any observer can register with this database before observation. This website's functionality is not yet completely defined, but it might well replace the entire current process of submitting observation requests by PDF and mail.
Authentication on the website is done via certificates, which are prerequisites for registering, submitting observation requests and accessing data. The certificates will then be mapped to Linux users, which are in turn sorted into the ephemeral Linux groups used to restrict access to embargoed data (one group per campaign).
Any data that does not belong to a campaign cannot be subject to an embargo and becomes freely available from the start!
Embargoes on derived higher-level products
Enforcing the embargo on raw data is relatively straightforward; this is not the case for higher-level data products. Nevertheless, derived data products should be subject to the same embargo as the original data. How this will be done is not completely clear at the moment. Users will probably be able to put data back into their personal scope (a namespace used to distinguish files with the same name). Each entity (an instrument, telescope, user, camera, etc.) will have its own scope. By default, personal scopes are world-readable temporary storage within Rucio (see below). This is a global setting (the same for all users) and can only be changed by admins. We might need to remove world access from those scopes to guarantee embargoes. That, however, would mean that not even other members of the campaign (COIs) would have access to derived data products during the embargo period. Should such access be required, we might end up with one scope per campaign. We would appreciate feedback on the necessity of this feature.
Data products of general interest to a greater public (data that will remain in SDC and become openly available after the embargo) will probably need to be reviewed anyway and be put back by some kind of privileged embargo-aware procedure.
Visibility of embargoed data
We use Rucio to distribute, manage and access data in SDC. A side effect of that decision will be that even embargoed data is listable and will show up as a result of matching data searches. The data will, however, not be downloadable while embargoed.
Third-party users (not part of the original campaign) need to contact the PI to get access. The latter would then have this new user added to the campaign by an administrator or provide the desired data by other means than direct access via Rucio.
Even embargoed data should provide some context data, like quick looks, to make searches useful. It needs to be discussed whether such low-quality data, which is not scientifically exploitable, can be exempt from the original embargo.
Proof of concept
We currently run a demo setup of Rucio with a storage unit based on dCache. Uploads will be performed using WebDAV. Technically, this is very close to the envisioned design of SDC.
In its simplest version, the one we will probably use in V 1.0 of SDC at the end of 2021, users will point any client to their certificate by setting environment variables appropriately. A user's key and certificate need to reside in (adequately protected) files in PEM format within the user's home directory:
Code Block |
---|
export CLIENT_CERT=~/usercert.pem
export CLIENT_KEY=~/userkey.pem |
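What "adequately protected" means is not spelled out here. As a hedged sketch, the following check assumes that the private key should not be accessible by group or others (for example mode 0600), a common convention for key files. The helper name `check_credentials` is hypothetical; only the variable names `CLIENT_CERT` and `CLIENT_KEY` are taken from the block above.

```python
import os
import stat


def check_credentials(environ=os.environ):
    """Return a list of problems found with the certificate and key files."""
    problems = []
    for var in ("CLIENT_CERT", "CLIENT_KEY"):
        path = environ.get(var)
        if not path:
            problems.append(f"{var} is not set")
            continue
        path = os.path.expanduser(path)
        if not os.path.isfile(path):
            problems.append(f"{var} points to a missing file: {path}")
            continue
        if var == "CLIENT_KEY":
            mode = stat.S_IMODE(os.stat(path).st_mode)
            # Assumption: the key must not be accessible by group or others.
            if mode & (stat.S_IRWXG | stat.S_IRWXO):
                problems.append(f"{path} is accessible by group or others")
    return problems
```

An empty list means both variables are set, both files exist, and the key file passes the assumed permission check.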
The user in Rucio needs a so-called identity corresponding to this certificate's Subject for the authentication mechanism X509:
Code Block |
---|
[root@client tmp]# rucio-admin account list-identities root
Identity: tutorial, type: USERPASS
Identity: /C=DE/O=GridGermany/OU=Leibniz-Institut fuer Sonnenphysik (KIS)/OU=SDC/CN=Peter Caligari, type: X509 |
On the storage units, this identity is then mapped to local user-IDs. For this proof of concept, this is done manually in a hard-coded file on the dCache node:
Code Block |
---|
[root@dcache0 ~]# cat /etc/dcache/multi-mapfile
"dn:/C=DE/O=GridGermany/OU=Leibniz-Institut fuer Sonnenphysik (KIS)/OU=SDC/CN=Peter Caligari" username:tester uid:1000 gid:1000,true |
Now uploads by any means (Rucio, gfal-copy, the dCache native method, or even curl using WebDAV) will result in files with an owner and group ID of 1000. Files are world-readable by default!
Enforcing an embargo is relatively straightforward:
upload the file to Rucio (but do not register it yet)
remove world-readability from the file
move it from the default group of the PI to the group representing the campaign
register it with Rucio (now the file would pop up in searches and listings).
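The two middle steps can be illustrated on an ordinary POSIX file. This is only a local analogue under stated assumptions: on the actual storage unit the change would go through dCache, and the upload and registration steps are deliberately left out because their interfaces are not described here. The function name `restrict_to_campaign` and the parameter `campaign_gid` are hypothetical.

```python
import os
import stat


def restrict_to_campaign(path: str, campaign_gid: int) -> None:
    """Remove world access from a file and hand it to the campaign group."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Step 2: drop read, write and execute permission for "others".
    os.chmod(path, mode & ~stat.S_IRWXO)
    # Step 3: keep the owner (-1) and switch the group to the campaign group.
    os.chown(path, -1, campaign_gid)
```

A file that starts out with mode 0644 ends up with mode 0640 and the campaign's group ID, so only the owner and members of the campaign group can read it.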
In a real-world setup, the mapping between dCache users and certificates will probably be done via an online database instead of a hard-coded file. This mechanism still needs to be developed.
Info |
---|
Feedback welcome: We would welcome any comments on anything said above, especially on the assumptions about embargoes, such as the infrequency of adding users to campaign groups, the inheritance of embargoes by higher-level data products, the lack thereof for quick-look data, and the like. Please submit any comments and suggestions, preferably before the end of April 2021. |
Project Status
SDC Project Status 02-2021
Solution Analysis and Design Phase steps.
The SDC project team has been working hard to find the best possible hardware and software components to build a robust platform for the solar community. The project has now entered a phase where we are creating a detailed solution design, meaning that we have already identified many technical pieces that are going to be included in the final version of SDC. The team is now trying to find the best possible ways to integrate these pieces. This means a lot of investigation into technical details and testing of different scenarios.
📋 Summary
Current project health | Current project status | Project constraints |
---|---|---|
| “Create Detailed Solution Design” phase in progress. | Resources and their availability. Technology POCs taking more time than predicted. |
At the core of everything at this point is RUCIO. It’s going to be the most essential piece of software for SDC. RUCIO is going to take care of data transfers, data dissemination, data embargoes, data security and a lot of automation at the same time. SDC’s primary goal is to automate data transfers from OT to the SDC archive and make sure that all necessary data policies are applied at the same time. RUCIO is also going to take care of the data lifecycle, meaning that the most relevant data is always available while the oldest data is archived into long-term storage.
The SDC team is also developing pipeline and analytic tools to allow solar scientists to get quick results from our science-ready L1 data. These tools include GRIS inversions, BBI speckle image reconstruction and GRIS data visualization.
SDC Technical Components (image by Carl Schaffer and Petri Kehusmaa).
📊 Project status
Tip |
---|
Accomplishments |
High-level solution design
Started collecting data policies
Clarified embargo policies
Listed essential use cases for SDC
Next steps
Continue selecting solution components and planning component integrations
Warning |
---|
Risks & project issues |
Lack of resources
Resource availability
Multiple process implementations at the same time
Governance
👩⚖️ Policies, Frameworks & Governance
📜 Proposal for Data License in SDC
There is a fundamental difference between copyright (in the sense of ownership) and the right to use data. It's the latter that is clarified by a license specifying how third parties can use the data.
After an initial embargo period (envisioned: 1 year, 2 years if Ph.D. students are involved), all data in SDC is expected to become freely available. Nevertheless, the first question that must be answered in this context is whether copyright on post-embargo data is retained by either the original observers or SDC. The choice of the license to which post-embargo data is subject depends on the answer to this question (see below).
Either way, post-embargo data should be put under a license (not to be confused with the license given to software tools and workflows developed in the framework of SDC). Basically, all treatises on this subject agree that it's a terrible idea to publish data without any license at all. Even data in the public domain should be subject to a license whose sole purpose is to make that fact irrevocably clear. Similarly, most sources agree that one should not come up with a completely novel license of one's own either. Incompatibilities with different national legal norms would be almost inevitable in this case. Likewise, combining the data offered with data from other sources would practically inevitably lead to licensing conflicts.
The Creative Commons licenses are standardized data licenses, compatible with most national standards, that can be amended modularly by attributes to prohibit specific data usage. The CC licenses relevant in the context of SDC are:
CC0 1.0 Universal: data is completely public domain, no rights whatsoever are retained, no citation of the source is required upon its use, and data can be freely reused, modified and redistributed (even commercially). Suitable if no copyright is retained.
CC BY 4.0 Attribution 4.0 International: requires giving credit to the source of the data upon reuse; reuse can be subjected to further restrictions, which are:
NC: non-commercial use only
ND: no derivatives (data must stay as-is and cannot be redistributed in modified form)
SA: share-alike (any data derivatives must also be redistributed under a similar license).
These additions are cumulative, so a license stating exclusively non-commercial use and redistribution only if the derived data products are offered freely under the same license would be:
CC BY-NC-SA 4.0
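How the building blocks accumulate into a license name can be sketched as follows. The function `cc_license` is a hypothetical helper for illustration. One detail goes beyond the text above: Creative Commons does not offer a license that combines ND and SA, since share-alike only applies to derivatives, so the sketch rejects that combination.

```python
def cc_license(nc: bool = False, nd: bool = False, sa: bool = False,
               version: str = "4.0") -> str:
    """Compose the short name of a CC BY license from its building blocks."""
    if nd and sa:
        # No such license exists in the Creative Commons suite.
        raise ValueError("ND and SA cannot be combined")
    parts = ["BY"]
    if nc:
        parts.append("NC")
    if nd:
        parts.append("ND")
    if sa:
        parts.append("SA")
    return f"CC {'-'.join(parts)} {version}"
```

Calling `cc_license(nc=True, sa=True)` yields the name given above, and `cc_license()` yields plain `CC BY 4.0`.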
Any additional restriction might prevent combining data from different sources, though. Suppose we subjected our data to the SA building block. You could then only combine it with third-party data if the latter's license also allowed publication under SA (see, e.g., the CC FAQ).
If we allow commercial use, even re-licensing the data and products thereof might be possible. There are licenses outside CC, such as ODC-By, that explicitly exclude re-licensing while still allowing commercial use; we did not look into those in detail, though.
Discussions within SDC tend towards using plain CC0 1.0 or CC BY 4.0 for any non-embargoed data, and, following general recommendations, declaring metadata completely public domain by using CC0 1.0. CC BY 4.0 is mainly used for publications and rarely for scientific data, though. The latter is primarily put under CC0 1.0 due to a lack of copyright in the first place. If the sole intention of using CC BY 4.0 is to have users acknowledge the use of our data in their publications, CC0 1.0 might be enough, as attribution can probably be enforced by other means.
SDC might require written (or implied) consent from the observer for their data to be published under the chosen license after the embargo expires. The application form for observing proposals should be modified to include such consent.
It's worth noting that attributing a license of this type to a particular data set is a once-and-forever decision: once put under such a license, it cannot be put under a different license later if one changes one's mind! Therefore, this is not a decision to be made hastily but demands a consciously well-considered and balanced judgment, right from the start.
We would welcome your view and feedback on the above subjects and on the intention to use either plain CC0 1.0 or CC BY 4.0. Especially if you do not agree with either! If you are fine with one of them, which one would you see more appropriate for SDC? If you would like to comment, please send a mail to
We will consider any contributions until the end of April 2021.
Links for further reading:
http://creativecommons.org.au/content/licensing-flowchart.pdf
https://chooser-beta.creativecommons.org
https://creativecommons.org/licenses/
https://www.dfg.de/foerderung/info_wissenschaft/2014/info_wissenschaft_14_68/
https://openaccess.mpg.de/Berliner-Erklaerung
For the positions of Leibniz, EU, and others, see e.g.:
https://www.mdc-berlin.de/system/files/migrated_files/fiona/ag-oa_0.pdf
🇪🇸->🇩🇪 Data Transport from OT to KIS before Rucio
Background
As long as on-the-spot campaigns were possible, data between OT and KIS was transported using external hard-drives. The latter also served as backup-disks for this valuable data. This raw data was then copied to the central storage at KIS, mainly used to process it and produce higher-level products. The latter were then written back to external disks and tapes again to free space for further data processing.
Similarly, OT's central storage is meant as temporary storage between observation and transport of the raw data to the observer's home institution. As of the beginning of 2021, we will have about 150 TB (usable) disk space.
Current Situation
Due to Covid-19, all observations at OT are done remotely only. Contrary to our expectations at the beginning of 2021, it currently looks more likely that the lockdown restrictions will be tightened rather than relaxed. Coordinating the data transport of different campaigns over the network has therefore become a pressing issue. We decided to mirror /instruments at OT to the same directory at KIS. Whatever data is written to the former will be replicated to KIS on a best-effort basis. Data on the KIS side is read-only, retains the permissions and ownership of the original data at OT, and will remain even if the source data at OT is deleted. Copying data to other directories at KIS for further processing should not consume disk space, as identical data is kept only once on the disks (the technology behind that is called deduplication).
Problems with this approach
While entirely automated (and as such very comfortable from a user's point of view), this approach has several drawbacks:
Any data is copied: excellent data just as well as rubbish. According to what was said above, once transferred to KIS, the latter cannot be deleted there again.
Some data is post-processed right at OT (after arriving on /instruments), and only the outcomes of this process are worth being transported. The procedure described above is prone to unnecessarily copying the unprocessed data to KIS and would not consider the processed data for transport (unless copied back to /instruments on Tenerife).
All raw data is copied. Even data from partners or international campaigns that were not intended for KIS but would instead have been transported directly to their respective home institutions is copied.
The procedure used for copying is intended for an entirely different use case. It's meant to maintain an off-site copy on a best-effort basis for disaster recovery. As such, it does not bother signalling the successful replication of individual files; what makes it to the replica before failure, made it, and about those files that did not, nothing can be done anyway. In our scenario, that means it becomes difficult to see when a file was replicated entirely (and can thus be deleted at OT to free up space there for buffering data from future observation).
To cope with these problems, we might again (as already proposed in 2020) introduce an intermediate folder for copying data to KIS. We would then no longer automatically copy all data in /instruments, but only data copied to that folder. That would allow for post-processing on-site and have only the outcome of that process transferred to KIS. Data not intended for KIS or data not worth keeping would simply not be copied to that folder at all. However, this reduces the degree of automation, as an additional manual step is required.
Once this intermediate directory is set up, we will announce the switch away from copying /instruments by mail. For the time being, one would still have to delete successfully replicated data by hand from the intermediate directory on Tenerife.
In late 2020 we tried to automate the process of checksumming the original data on Tenerife, replicating it to KIS, and, upon success, removing the original on Tenerife. This, however, turned out to be quite tricky to implement, as it requires the synchronization of two totally independent processes at two sites with no shared access. Even though we failed then, we might look into that effort again. Should we succeed this time, we'll let you know.
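The core of such a procedure can be sketched as below. It assumes that the receiving site reports a SHA-256 checksum of its replica back to the sending site by some channel, which is exactly the synchronization step that proved difficult; the sketch only shows the comparison and the conditional deletion. The function names `sha256sum` and `delete_if_replicated` are hypothetical and do not describe the earlier implementation attempt.

```python
import hashlib
import os


def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_if_replicated(source: str, replica_checksum: str) -> bool:
    """Delete the original only if the replica's reported checksum matches."""
    if sha256sum(source) != replica_checksum.lower():
        return False  # replica missing, incomplete or corrupted: keep the original
    os.remove(source)
    return True
```

The original is removed only after a positive match, so a failed or partial replication never leads to data loss on Tenerife.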
Space requirements & costs
Tenerife's disk space can accommodate data from around 3 campaigns and is thus sufficient if data is deleted promptly.
We have looked at different solutions to store new data from campaigns in 2021 at KIS before SDC V1.0 is fully established. A simple expansion of the existing system is impossible due to a manufacturer's switch in technology (DELL/EMC). One would have to buy a completely new cluster (again with minimal redundancy and size), making this option rather expensive. The price per TB of usable disk space ranges from approximately 360 €/TB to 800 €/TB (including VAT).
We also looked into moving some dormant data (data not accessed for more than, let's say, one year) to the public cloud. Costs there, however, strongly depend on the access pattern: should the need arise, contrary to the established access pattern, to download big chunks of that data, traffic costs become prohibitive. A traffic-independent flat rate amounts to about 5000 €/month for 300 TB of cloud storage; on top of that, license costs would need to be added to access this outsourced data seamlessly (those would not be required, however, if we used a local cloud appliance that consists of many slow disks, the third variant we looked at). The price per TB and year ranges from approximately 60 €/TB to 200 €/TB (including VAT); the upper end is the traffic-independent flat rate.
Please note that SDC will solve this problem: all raw data and most of the large static simulation outputs will go to SDC, freeing up enough space on the existing system for everyday work. So once SDC is up and running, there's no need to buy storage for this usage in the foreseeable future. Investing considerable amounts of money just to integrate smoothly with the existing system, without a need for it in the future, would be a waste.
Solution
We, therefore, decided to buy a first storage node of SDC and use it as temporary storage for new raw data and existing static data. SDC nodes consist of 19" two-socket servers with 24x16 TB disks, two SSDs for caching, and a separate disk mirror for the system. One tier at SDC will span several of these. We might well already use the technology envisioned for each node in SDC to present this temporary solution to the network. This would give us storage at a price of around 170€/TB (usable and including VAT).
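The figures quoted in this section can be put side by side with a few lines of arithmetic. This is a rough comparison under simplifying assumptions: it ignores power, administration, hardware lifetime and the additional license fees mentioned for the cloud variant, and the variable names are chosen for illustration only.

```python
# Raw capacity of one SDC storage node: 24 disks of 16 TB each.
disks, disk_size_tb = 24, 16
raw_capacity_tb = disks * disk_size_tb  # 384 TB raw; usable space is lower

# Traffic-independent cloud flat rate: about 5000 EUR/month for 300 TB.
cloud_monthly_fee_eur = 5000
cloud_capacity_tb = 300
cloud_per_tb_year = cloud_monthly_fee_eur * 12 / cloud_capacity_tb  # 200.0 EUR/TB per year

# One-time purchase price of the SDC node per usable TB.
node_per_tb = 170

# Time after which the cloud flat rate has cost as much as buying the node.
break_even_years = node_per_tb / cloud_per_tb_year
```

Under these assumptions the flat rate has cost as much as the node after about 0.85 years, that is, after roughly ten months of use.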
Products & Tools
🛠 SDC Products & Tools
GRISView (working title)
GRISView (working title) is an upcoming visualization and analysis tool to work with calibrated GREGOR/GRIS observational datasets. It is intended to facilitate easy data preview and present interactive tools for quick plotting, analysis and export. Tested features in the first release will include:
Advanced view, pan and zoom functions for map images and spectra
Both single map and time-series observations support
Multiple POI (point-of-interest) and ROI (rectangle-of-interest) to study spatial features
Interactive map isocurves generation and profile cuts
Distance measurements between map pixels in different units
Spectral line identification, markers and relative wavelength scale
Supports data format that is distributed by the SDC web archive
Written entirely in Python with GUI using Qt cross-platform framework
SDC data archive
Get access to data from GRIS/GREGOR and LARS/VTT instruments and the ChroTel full-disc telescope at OT.
Updates as of April 2021
Overview Calendar fields are now clickable and will link you to an overview of all observations performed on a given day
HTTPS has been implemented, and port numbers have been removed from URLs
Speckle reconstruction
https://gitlab.leibniz-kis.de/sdc/speckle-cookbook
This tutorial helps the user run KISIP (Wöger & von der Lühe, 2008) on her favourite BBI and/or HiFI imaging data.
Contact: Vigeesh Gangadharan (vigeesh@leibniz-kis.de)
Coming soon:
A Jupyter Notebook by Vigeesh Gangadharan to assist the user with VFISV inversions for GRIS data, including features such as wavelength calibration. Stay tuned!
📊 Conferences & Workshops
Forthcoming Conferences/Workshops of Interest 2021
Every second Thursday, 12:30-13:30 CET
PUNCH Lunch Seminar (see SDC calendar invitation for zoom links)
11 Feb 2021: PUNCH4NFDI and ESCAPE - towards data lakes
25 Feb 2021: PUNCH Curriculum Workshop
April week 12-16 (3 days, TBD)
ESCAPE WP4 Technology Forum
June 01-02 (16:00 - 17:30)
15th International dCache Workshop
June 10-11
13th International Workshop on Science Gateways | IWSG 2021
Topics:
Architectures, frameworks and technologies for science gateways
Science gateways sustaining productive collaborative communities
Support for scalability and data-driven methods in science gateways
Improving the reproducibility of science in science gateways
Science gateway usability, portals, workflows and tools
Software engineering approaches for scientific work
Aspects of science gateways, such as security and stability
June 28, 2021:
Data-intensive radio astronomy: bringing astrophysics to the exabyte era
Topics:
Data-intensive radio astronomy, current facilities and challenges
Data science and the exascale era: technical solutions within astronomy
Data science and the exascale era: applications and challenges outside astronomy
SDC participation in Conferences & Workshops
Nov. 26, 2020:
2nd SOLARNET Forum Meeting for Telescopes and Databases
Talk: Big Data Storage -- The KIS SDC case, NBG, PC & PK, 2nd SOLARNET Forum (Nov 26)
Nazaret Bello Gonzalez, Petri Kehusmaa, Peter Caligari
🤲 SDC Collaborations
SOLARNET https://solarnet-project.eu
KIS coordinates the SOLARNET H2020 Project that brings together European solar research institutions and companies to provide access to the large European solar observatories, supercomputing power and data. KIS SDC is actively participating in WP5 and WP2 in coordinating and developing data curation and archiving tools in collaborations with European colleagues.
Contact on KIS SDC activities in SOLARNET: Nazaret Bello Gonzalez nbello@leibniz-kis.de
ESCAPE https://projectescape.eu/
KIS is a member of the European Science Cluster of Astronomy & Particle Physics ESFRI Research Infrastructures (ESCAPE H2020, 2019 - 2022) Project, which aims to bring together people and services to build the European Open Science Cloud. KIS SDC participates in WP4 and WP5 to bring ground-based solar data into the broader Astronomical VO and to develop tools to handle large solar data sets.
Contact on KIS SDC activities in ESCAPE: Nazaret Bello Gonzalez nbello@leibniz-kis.de
KIS is one of the European institutes strongly supporting the European Solar Telescope project. KIS SDC represents the EST data centre development activities in a number of international projects like ESCAPE and the Group of European Data Experts (GEDE-RDA).
Contact on KIS SDC as EST data centre representative: Nazaret Bello Gonzalez nbello@leibniz-kis.de
PUNCH4NFDI https://www.punch4nfdi.de
KIS is a participant (not a member) of the PUNCH4NFDI Consortium. PUNCH4NFDI is the NFDI (National Research Data Infrastructure) consortium of particle, astro-, astroparticle, hadron and nuclear physics, representing about 9,000 scientists with a Ph.D. in Germany, from universities, the Max Planck Society, the Leibniz Association, and the Helmholtz Association. PUNCH4NFDI aims to set up a federated and "FAIR" science data platform, offering the infrastructures and interfaces necessary for the access to and use of data and computing resources of the involved communities and beyond. PUNCH4NFDI is currently competing with other consortia to be funded by the DFG (final response expected in spring 2021). KIS SDC aims to become a full member of PUNCH and to federate our efforts on ground-based solar data dissemination with the broad particle and astroparticle communities.
Contact on KIS SDC as PUNCH4NFDI participant: Nazaret Bello Gonzalez nbello@leibniz-kis.de & Peter Caligari cale@leibniz-kis.de
🖥 IT news
Ongoing & Future developments
Webpage
I have put both the preliminary designs and the envisioned page tree in our cloud for anybody to inspect and comment on. Comments are very welcome. I will collect and review them and, if feasible, try to include them in the next design round.
Network
Status of the dedicated 10 Gbit line between KIS & OT
We still need to connect it to the switches at OT, though and need to buy some network equipment for the connection between the University of Freiburg and KIS. We expect the line to be functional within a few weeks. We’ll keep you updated.
Test of (application) firewalls at KIS
We might not buy exactly this machine (this one is a mid-sized appliance from Palo Alto Networks). This is but a first test of this kind of router/firewall for IT at KIS.
We expect the test to terminate in the next few weeks so that we can make a decision on which machine we would like for OT and KIS and how and when to buy them.
Storage
We will use this machine as a test bed for the technology envisioned for SDC and already store on it all raw data from observations in 2021 as well as the large simulation files accumulating on mars.
SDC will consist of at least 4 similar nodes. This is the first one. As soon as the remaining hosts are set up, we will move any data still on this first host to the new SDC cluster and then join the host to the cluster as well.
The costs per TB of storage space in the cloud are strongly dependent on capacity and, above all, the access pattern. They vary between approx. 60-200 €/TB/a. Access-independent models, in which only a fixed fee is charged per stored GB, but no fees for downloading or uploading, are at the upper end of this scale. At the lower end are public providers such as Amazon, Google and Microsoft, which charge a relatively high fee for each type of data access in addition to the (relatively cheap) price of simple storage.
Additionally, licence fees of a similar magnitude for the software that moves files between the cloud and the local storage at the KIS are required.
We are currently obtaining concrete offers to outsource 100 TB for 1 year to a public cloud. The pricing models are so complicated that we can determine the resulting costs only through a limited real-world test.
We will intentionally design the integration so that it will become apparent to all users which files are in the cloud and which are not. Although this is cumbersome (and artificially induced), we deem this awareness essential (at least initially, where we have no experience of the potential costs involved). The exact model is still to be worked out, and we will inform you about it again in due course.
📜 Aperio
Traditionally, data processing in solar physics is done on files. While this option will remain available in SDC, it is not the best way to deal with large data sets, where computations need to be done where the data resides and not vice versa. To interface with data in SDC programmatically, APIs are needed for the most common programming languages like Python and IDL.
We are pleased to have won Aperio Software to develop a Python API for SDC. Aperio Software is heavily involved in the development of SunPy and Astropy, community efforts to develop Python packages for solar physics and astronomy. Drew Leonard, one of the founders of Aperio, developed the prototype for the VTF pipeline. The contract is divided into a design and an implementation phase. During the former, Drew will clarify, through workshops and one-on-one meetings, what is expected from the future API and what requirements it must meet. Expect the first version by mid-2022.
Project Status
SDC Project Status 03-2021 (06.07.2021)
Solution Development and Integration
The project has now shifted into a phase where we are building the actual SDC platform and creating/acquiring all necessary components. These components are in-house developed software for instrument pipelines and analysis, compute, network and storage hardware, middleware (RUCIO, Kubernetes, Docker, etc.), and governance/management/documentation software like Jira Service Management and Confluence.
There is still some work to be done to find all suitable solution components and thus shape the final scope of SDC. We aim to build SDC as a service platform for the solar community with a continuous focus on users and platform development.
📋 Summary
Current project health | Current project status | Project constraints
---|---|---
 | Finalizing some tasks for solution design and creating solution components. Governance model not finalized and implementation not started yet. | Resources and their availability. Technology POCs taking more time than predicted.
📊 Project status
Tip |
---|
Accomplishments |
High-level solution design
Some software components created (GRIS Viewer)
The hardware acquisition process started
RUCIO test environment established
Next steps
Continue selecting and creating solution components
Warning |
---|
Risks & project issues |
Lack of resources
Resource availability
Multiple process implementations at the same time
No agreed governance model
Governance
👩⚖️ Policies, Frameworks & Governance
ITIL v4 process model will be partially adopted for service management purposes
Data policies definition started
SDC governance model and scope to be decided
Products & Tools
🛠 SDC Products & Tools
Standardized GRIS Pipeline
The GRIS reduction pipeline was merged into a common version in collaboration with M. Collados (IAC, GRIS PI). The versions running at OT and in Freiburg now both produce data that is compatible with downstream SDC tools. The latest version of the pipeline can always be found on the KIS GitLab server. The current OT version will be synced to the ulises branch and merged into the main production branch periodically.
SDC GRIS VFISV-Inversion pipeline
Pipeline code for performing Milne-Eddington inversions of GRIS spectropolarimetric data is now available at:
https://gitlab.leibniz-kis.de/sdc/grisinv
The pipeline uses the Very Fast Inversion of the Stokes Vector (VFISV, Borrero et al. 2011) code v5.0 (node for spectrograph data) as the main backend to carry out a Milne-Eddington Stokes inversion for individual spectral lines.
The current implementation of the pipeline is a Python MPI wrapper around the VFISV code that makes it easy to work with GRIS data. The inversion for the desired spectral line is performed using VFISV, and the buffer with the inversion results is passed to the Python module. The Python module propagates the keywords from level 1 (L1), packages the inversion results, and either writes a FITS file (when used as a command-line interface) or returns a NumPy ndarray (when called from within a Python script).
For more information on installing and using the pipeline, check the above GitLab repository.
Please report any issues with the code using the link below:
https://gitlab.leibniz-kis.de/sdc/grisinv/-/issues/new?issue
SDC data archive
Get access to data from GRIS/GREGOR and LARS/VTT instruments and the ChroTel full-disc telescope at OT.
Updates as of July 2021
The detail pages for observations have been reworked (see an example here):
Added dynamic carousel of preview data products
Added flexible selection for downloading associated data
VFISV inversion results have been added for most of the GRIS observations. The website now includes information on line-of-sight velocity and magnetic field strength.
The development process has been streamlined:
automated test deployments for quicker iterations and fixes
Changes to the UI will occur in regular sprints. We’re currently collecting ideas here
Added historic ChroTel data for 2013, thanks to Andrea Diercke from AIP for contacting us and providing us with this supplemental archive.
GRISView
GRISView is a new visualization and analysis tool for working with calibrated GRIS/GREGOR datasets as distributed by the SDC website. It is written in Python, with a GUI built on the cross-platform Qt framework.
Currently implemented features include:
Quick panning and zooming of map images and spectra using the mouse
Multiple POIs (points of interest) and ROIs (rectangles of interest) for easy inspection of spectral changes across the map
Distance measurement between multiple map points, given in different units
Intensity profile plots along a given line segment, with linking of several profiles for checking radial profiles
Interactive color bars to view the histogram, adjust image contrast, and select and modify the color scheme
Contour generation for map images, with easy adjustment of levels and colors
Browsing spectra by moving the cursor with keyboard and mouse shortcuts, and quick navigation using a marker list
Relative scale for quick evaluation of wavelength differences at the cursor position
Viewing observation FITS files headers
Support for both individual observations and time-series
Next, it is planned to add the following:
Exporting current spectra and map plots as images and data files
Derived quantities visualization e.g. Q/I, V/I, DOLP (degree of linear polarization) etc.
Various normalizations of spectra e.g. to a selected signal level, local continuum, quiet Sun
Spectral line fitting and line parameters determination
Saving and restoring working sessions
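As an illustration of the derived quantities mentioned in the list above, here is a minimal sketch of how Q/I, V/I and DOLP are commonly computed from the four Stokes parameters at a single wavelength point. It is not GRISView code: the function name is hypothetical, and the usual definition DOLP = sqrt(Q² + U²) / I is assumed.

```python
import math


def derived_quantities(i, q, u, v):
    """Return Q/I, V/I and DOLP for one set of Stokes parameters.

    Hypothetical helper for illustration only, assuming the common
    definition DOLP = sqrt(Q^2 + U^2) / I.
    """
    if i == 0:
        raise ValueError("Stokes I must be non-zero to normalize")
    return {
        "Q/I": q / i,
        "V/I": v / i,
        "DOLP": math.hypot(q, u) / i,  # degree of linear polarization
    }


if __name__ == "__main__":
    # Made-up sample values, not taken from any observation.
    print(derived_quantities(i=2.0, q=0.6, u=0.8, v=-0.2))
```

For whole maps or spectra, the same expressions apply element-wise to arrays.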
Info |
---|
Feedback welcome: We strongly encourage all colleagues to try out this new tool and provide feedback. Instructions for installing and using the program can be found on the tool's GitLab page: https://gitlab.leibniz-kis.de/sdc/gris/grisview Please report any issues and bugs on the program's GitLab page or using the direct link: https://gitlab.leibniz-kis.de/sdc/gris/grisview/-/issues/new?issue |
Conferences & Workshops
📊 Conferences & Workshops
Forthcoming Conferences/Workshops of Interest 2021
Every second Thursday, 12:30-13:30 CET (currently on summer break)
PUNCH Lunch Seminar (see SDC calendar invitation for zoom links)
KIS internal Typo3 Editors' training
July 13 & 14, 2021, 10:00 - 12:00 CEST registration needed!
SDC Collaborations
🤲 SDC Collaborations
SOLARNET https://solarnet-project.eu
KIS coordinates the SOLARNET H2020 Project that brings together European solar research institutions and companies to provide access to the large European solar observatories, supercomputing power and data. KIS SDC is actively participating in WP5 and WP2 in coordinating and developing data curation and archiving tools in collaborations with European colleagues.
Contact on KIS SDC activities in SOLARNET: Nazaret Bello Gonzalez nbello@leibniz-kis.de
ESCAPE https://projectescape.eu/
KIS is a member of the European Science Cluster of Astronomy & Particle Physics ESFRI Research Infrastructures (ESCAPE H2020, 2019 - 2022) Project aiming to bring together people and services to build the European Open Science Cloud. KIS SDC participates in WP4 and WP5 to bring ground-based solar data into the broader Astronomical VO and the development tools to handle large solar data sets.
Contact on KIS SDC activities in ESCAPE: Nazaret Bello Gonzalez nbello@leibniz-kis.de
KIS is one of the European institutes strongly supporting the European Solar Telescope project. KIS SDC represents the EST data centre development activities in a number of international projects like ESCAPE and the Group of European Data Experts (GEDE-RDA).
Contact on KIS SDC as EST data centre representative: Nazaret Bello Gonzalez nbello@leibniz-kis.de
PUNCH4NFDI https://www.punch4nfdi.de
KIS is a participant (not a member) of the PUNCH4NFDI Consortium. PUNCH4NFDI is the NFDI (National Research Data Infrastructure) consortium of particle, astro-, astroparticle, hadron, and nuclear physics, representing about 9,000 scientists with a Ph.D. in Germany, from universities, the Max Planck Society, the Leibniz Association, and the Helmholtz Association. PUNCH4NFDI aims to set up a federated and "FAIR" science data platform, offering the infrastructures and interfaces necessary for the access to and use of data and computing resources of the involved communities and beyond. PUNCH4NFDI has been granted funds and will officially start its activities on October 1, 2021. KIS SDC aims to become a full member of PUNCH and to federate our efforts on ground-based solar data dissemination to the broad particle and astroparticle communities.
Contact on KIS SDC as PUNCH4NFDI participant: Nazaret Bello Gonzalez nbello@leibniz-kis.de & Peter Caligari cale@leibniz-kis.de
IT news
🖥 IT news
Ongoing & Future developments
Webpage
After the content has been moved, the server will be renamed to http://www.leibniz-kis.de, and the old site will be shut down.
One of the reasons for the relaunch was to improve support for the particular browsers used by people with disabilities. This requires specific fields in the back-end to be filled in so that the page content can be appropriately classified. We will have a training course on handling the Typo3 back-end in general, focusing on the above points, on
July 13 & 14, 2021, 10:00 CEST (Editors' training)
We currently plan to avoid any user login in the front end. This would let us do without cookies altogether and make those annoying GDPR pop-ups unnecessary. However, it also means we might not have any restricted areas on the website at all (including an Intranet)! This is a radical approach, and we might not be able to follow it through strictly (see below). In that case, the Intranet on the website will be limited to purely informational pages; any documents now downloadable on the old website should be migrated to the cloud (wolke7). In any case, Typo3 allows hosting multiple websites under a single installation, sharing the basic design and resources. Therefore, any websites requiring user registration and login (like the Intranet or a possible OT webpage) might be built as separate websites, keeping the publicly accessible website login-free.
Network
Status of the dedicated 10 Gbit line between KIS & OT
Test of (application) firewalls at KIS
We (IT) still very much advocate going for high-availability setups for both KIS (in Freiburg) and OT: KIS because it will host a significant part of SDC, and OT because there are no trained personnel on-site and replacements to the Canary Islands take time.
Storage
Starting in July, six more comparable hosts will be purchased through a public tender. These will have a similar setup and form storage Tier1 (near-line) of SDC at KIS. We expect the hosts to arrive in late September.
We use ZFS on virtualized Debian servers as a basis for the individual dCache nodes. ZFS uses copy-on-write, checksums all blocks on disk, and provides auto-healing. Zpools will most probably use RAIDZ or RAIDZ2, and any file will reside on at least 2 different servers. At the time of writing, the only other file system offering similar features is Btrfs, but support for Btrfs was recently pulled from some major distributions (e.g. CentOS, the distro that has mainly been used at KIS so far).
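The effect of these choices on usable capacity can be sketched as follows. RAIDZ reserves the equivalent of one disk per vdev for parity and RAIDZ2 two; keeping every file on two servers halves the capacity again. The function name and the disk counts and sizes in the example are hypothetical placeholders, not the actual SDC hardware, and ZFS metadata and padding overhead is ignored.

```python
def usable_capacity_tb(disks, disk_tb, parity_disks, replicas=2):
    """Approximate net capacity in TB of one RAIDZ vdev.

    parity_disks: 1 for RAIDZ, 2 for RAIDZ2.
    replicas:     number of servers each file is stored on (2 in the text).
    Rough estimate only: ZFS metadata and padding overhead is ignored.
    """
    if parity_disks < 0 or replicas < 1:
        raise ValueError("parity_disks must be >= 0 and replicas >= 1")
    if disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    pool_tb = (disks - parity_disks) * disk_tb
    return pool_tb / replicas


if __name__ == "__main__":
    # HYPOTHETICAL example: 12 disks of 16 TB each per vdev.
    for name, parity in (("RAIDZ", 1), ("RAIDZ2", 2)):
        print(name, usable_capacity_tb(12, 16, parity), "TB net")
```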
Current Resources
Compute nodes
hostname | # of CPUs & total cores | RAM [GB]
---|---|---
patty, legs & louie | 2 x AMD EPYC 7742, 128 cores | 1024
itchy & selma | 4 x Xeon(R) CPU E5-4657L v2 @ 2.40GHz, 48 cores | 512
scratchy | |
quake & halo | |
hathi | 4 x Intel(R) Xeon(R) CPU E5-4650L @ 2.60GHz, 32 cores | 512
Central storage space
Total available disk space for /home (
name | total [TB, gross] | free [TB, gross]
---|---|---
mars | 758 | 39
quake | 61 | 0
halo | 145 | 44.5
jane | 130 (-> 198) | 23
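The figures in the table can be summarized with a short script. This is just a convenience sketch: the values are copied from the table (using the current 130 TB for jane, not the planned 198 TB), and the variable and function names are ours.

```python
# (total, free) in TB gross, copied from the table above.
servers = {
    "mars": (758, 39),
    "quake": (61, 0),
    "halo": (145, 44.5),
    "jane": (130, 23),  # current size; the planned extension to 198 TB is not included
}


def free_fraction(total_tb, free_tb):
    """Fraction of the gross capacity that is still free."""
    return free_tb / total_tb


total = sum(t for t, _ in servers.values())
free = sum(f for _, f in servers.values())

if __name__ == "__main__":
    for name, (t, f) in servers.items():
        print(f"{name:6s} {free_fraction(t, f):6.1%} free")
    print(f"all    {free_fraction(total, free):6.1%} free ({free} of {total} TB)")
```

Overall, 106.5 TB of 1094 TB gross, a little under 10%, is currently free.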
References
📎 References
Products & Tools
SDC data archive: https://sdc.leibniz-kis.de/
Speckle reconstruction: https://gitlab.leibniz-kis.de/sdc/speckle-cookbook
Forthcoming Conferences/Workshops
June 01-02 (16:00 - 17:30): 15th International dCache Workshop
June 10-11, 2021: 13th International Workshop on Science Gateways | IWSG 2021
June 28, 2021: Data-intensive radio astronomy: bringing astrophysics to the exabyte era
Collaborations
SOLARNET: https://solarnet-project.eu
ESCAPE: https://projectescape.eu/
PUNCH4NFDI: https://www.punch4nfdi.de
Quick links
Computer load: http://ganglia.leibniz-kis.de
Drafts of web-page relaunch: https://wolke7.leibniz-kis.de/s/wEkPRsA5xKRgYbB