Data Technologies – CERN School of Computing 2019

Data Technologies
Alberto Pace
alberto.pace@cern.ch
CERN Data and Storage Services Group

Agenda
- Introduction to data management
- Data Workflows in scientific computing
- Storage Models
- Data management components
- Name Servers and databases
- Data Access protocols
- Reliability
- Availability
- Access Control and Security
- Cryptography
- Authentication, Authorization, Accounting
- Scalability
- Cloud storage
- Block storage
- Analytics
- Data Replication
- Data Caching
- Monitoring, Alarms
- Quota
- Summary
Lecture markers on the slide: 1st lecture, 2nd lecture, 3rd lecture, 4th lecture, 5th lecture

Introduction to data management

The mission of CERN
[Figure: Accelerating particle beams; Detecting particles (experiments); Large-scale computing (Analysis); Discovery. Marker: "We are here".]
The need for computing in research
- Scientific research in recent years has driven an explosion in computing requirements
- Computing has been the strategy to reduce the cost of traditional research
- Computing has opened new horizons of research, not only in High Energy Physics
- At constant cost, exponential growth of performance
- Return on computing investment is higher than in other fields: the budget available for computing has increased, so growth is more than exponential

The need for storage in computing
- Scientific computing for large experiments is typically based on a distributed infrastructure
- Storage is one of the main pillars
- Storage requires Data Management...
[Figure: Scientific Computing, with DATA, CPU and NET.]

"Why" data management?
- Data Management solves the following problems:
- Data reliability
- Access control
- Data distribution
- Data archives, history, long-term preservation
- In general: empower the implementation of a workflow for data processing

Can we make it simple?
- A simple storage model: all data in the same container
- Uniform, simple, easy to manage, no need to move data
- Can provide a sufficient level of performance and reliability
- "Cloud" Storage
- For large repositories, it is too simplistic!
Why multiple pools and quality?
- Derived data used for analysis and accessed by thousands of nodes
  Need high performance, low cost, minimal reliability (derived data can be recalculated)
- Raw data that need to be analyzed
  Need high performance, high reliability, can be expensive (small sizes)
- Raw data that has been analyzed and archived
  Must be low cost (huge volumes), high reliability (must be preserved), performance not necessary

So, ... what is data management?
- Examples from LHC experiment data models
- Two building blocks to empower data processing:
- Data pools with different quality of services
- Tools for data transfer between pools

Data pools
- Different quality of services
- Three parameters: (Performance, Reliability, Cost)
- You can have two but not three
[Figure: triangle with corners Slow, Expensive, Unreliable, on which Tapes, Disks, Flash / Solid State Disks and Mirrored disks are placed.]

But the balance is not as simple
- Many ways to split (performance, reliability, cost)
- Performance has many sub-parameters
- Cost has many sub-parameters
- Reliability has many sub-parameters
[Figure: Performance, Reliability and Cost, with sub-parameters: Latency / Throughput, Scalability, Consistency, Electrical consumption, HW cost, Ops cost (manpower).]
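The two building blocks above (pools with different quality of service, and a tool to move data between them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pool names `analysis`, `processing` and `archive`, the boolean reduction of each parameter to `fast` / `reliable` / `cheap`, and the helpers `select_pool` and `transfer` are all hypothetical and do not come from the lecture. The slides stress that each parameter really has many sub-parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A storage pool with a (very simplified) quality of service."""
    name: str
    fast: bool       # performance
    reliable: bool   # reliability
    cheap: bool      # cost
    files: set = field(default_factory=set)

    def __post_init__(self):
        # "You can have two but not three."
        if self.fast and self.reliable and self.cheap:
            raise ValueError("a pool cannot be fast, reliable and cheap at once")


# Hypothetical pools matching the three data classes of the slides.
POOLS = [
    Pool("analysis", fast=True, reliable=False, cheap=True),    # derived data
    Pool("processing", fast=True, reliable=True, cheap=False),  # raw data to analyse
    Pool("archive", fast=False, reliable=True, cheap=True),     # archived raw data
]


def select_pool(pools, *, fast=False, reliable=False, cheap=False):
    """Return the first pool offering every requested property."""
    for pool in pools:
        if ((not fast or pool.fast)
                and (not reliable or pool.reliable)
                and (not cheap or pool.cheap)):
            return pool
    raise LookupError("no pool offers the requested quality of service")


def transfer(filename, src, dst):
    """Move a file between pools: copy to the destination, then drop the source."""
    if filename not in src.files:
        raise KeyError(filename)
    dst.files.add(filename)
    src.files.remove(filename)


# Usage: raw data is processed first, then archived.
processing = select_pool(POOLS, fast=True, reliable=True)
archive = select_pool(POOLS, reliable=True, cheap=True)
processing.files.add("run1.raw")
transfer("run1.raw", processing, archive)
print(processing.name, archive.name)  # processing archive
```

Asking `select_pool` for all three properties raises `LookupError`, which is the code-level counterpart of "you can have two but not three".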
And reality is complicated
- Key requirements: Simple, Scalable, Consistent, Reliable, Available, Manageable, Flexible, Performing, Cheap, Secure.
- Aiming for "à la carte" services (storage pools) with on-demand "quality of service"
- And where is scalability?
[Figure: chart comparing Pool1 and Pool2 on a 0 to 80 scale. Axes: Read throughput, Write throughput, Read latency, Write latency, Scalability, Consistency, Metadata read throughput, Metadata write throughput, Metadata read latency, Metadata write latency.]

Where are we heading?
- Software solutions + cheap hardware
[Figure: two Slow / Expensive / Unreliable triangles. The first shows Tapes, Disks, Flash / Solid State Disks and Mirrored disks; the second shows "Software defined service + cheap hardware".]

Data Management Components
Name Server
- The name server is "the" database of a managed storage system; it contains the catalogue of all data (typically all files)
- It is a simple lookup-based, single-key database application for which several implementations exist:
- DNS (Domain Name System) software
- LDAP databases
- Hash tables / Object databases
- Relational Databases
- Name server reliability is critical
- Name server failure brings down the whole storage system
- Name server performance is critical
- See next slide ...

Criticality of the name server performance
- Every metadata operation requires a database transaction.
- It is essential to understand where the "name server" approach is placed ...
- The name server lookup time dictates the performance of the whole storage system
- The database becomes the bottleneck of the entire storage process: low performance is a symptom of a major architectural mismatch
- Comment: Cloud storage? An architecture that replaces the name server DB lookup with a "calculated" name resolution (... more to come ...)

Short digression on ... Uniform Resource Identifiers (URI)
- Example from the web ...
  http://csc.cern.ch/data/2012/School/page.htm
  Components, left to right: protocol, host / domain, volume, folder / directory, file
- Where is the database lookup when accessing a web page? At the host / domain level.
- Every host has its own namespace, managed locally.
- Excellent example of a "federated" namespace
- Extremely efficient, but some limitations
- Reference: http://www.ietf.org/rfc/rfc2396.txt

Similar problem in storage systems
- Example from storage ...
  storage://cern.ch/data/2012/School/page.htm
  Components, left to right: protocol, host / domain, volume, folder / directory, file
- In several implementations, the database lookup is placed at the "file" level
- Impacts all operations, including the most popular ones, open() and stat()
- Great flexibility but a huge performance hit, which implies more hardware and constant database tuning
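The contrast between a file-level name server lookup and a "calculated" name resolution can be sketched as follows. This is a minimal illustration, not the mechanism used by any particular storage system. The node names, the `NameServer` class, and the choice of a SHA-256 hash modulo the number of nodes are all assumptions made for the example. The lecture only announces "calculated" resolution without specifying an algorithm at this point. The split of the path into volume / directory / file (first segment, middle segments, last segment) is likewise an assumed reading of the slide labels.

```python
import hashlib
from urllib.parse import urlsplit

NODES = ["node-a", "node-b", "node-c"]  # hypothetical storage nodes


def split_uri(uri):
    """Split a URI into the components labelled on the slide (assumed mapping)."""
    parts = urlsplit(uri)
    segments = parts.path.strip("/").split("/")
    return {
        "protocol": parts.scheme,
        "host": parts.netloc,
        "volume": segments[0],
        "directory": "/".join(segments[1:-1]),
        "file": segments[-1],
    }


class NameServer:
    """Single-key catalogue: every resolution costs one database lookup."""

    def __init__(self):
        self._catalogue = {}
        self.lookups = 0  # counts how often the "database" is hit

    def register(self, path, node):
        self._catalogue[path] = node

    def resolve(self, path):
        self.lookups += 1
        return self._catalogue[path]


def calculated_resolve(path, nodes=NODES):
    """Derive the location from the name itself; no catalogue is consulted."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


uri = "storage://cern.ch/data/2012/School/page.htm"
path = urlsplit(uri).path

# File-level lookup: flexible placement, but open() and stat() each hit the database.
ns = NameServer()
ns.register(path, "node-b")
print(ns.resolve(path), ns.lookups)  # node-b 1

# Calculated resolution: same answer every time, with no shared database.
print(calculated_resolve(path) == calculated_resolve(path))  # True
```

The trade-off shown here mirrors the slides: the catalogue allows any file to be placed anywhere, at the price of a lookup per operation, while a calculated placement removes the lookup but fixes the location by the name (and, in this naive modulo form, by the number of nodes).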