NEON relies on computing software and hardware to manage thousands of sensors, billions of data points, and terabytes of output data. Sensors and technicians collect data from sites spread across the nation. The cyber infrastructure team coordinates the transfer of data from field sites to NEON's central data center. Working with science and engineering staff, the team 1) standardizes and automates data collection and processing tasks; 2) stores and processes data; and 3) develops relevant operational tools, such as monitoring, alerting, and mobile applications. Special attention is also paid to how data are documented, through human- and machine-readable formats.
NEON's primary data center is located in Denver, CO. The data center houses servers, storage, networking, and associated peripherals for the NEON project. NEON uses an elastic cloud storage (ECS) archive for primary data storage. The ECS comprises three storage systems that use the S3 protocol for data access. One is used for development (~1 PB), a second for production (~2 PB), and a third is a near-real-time backup of the production data. The capacity of the ECS can be expanded by adding more servers.
Raw Data: Instrumented Systems (IS) data are stored in NEON's primary data repository (PDR) first and then transferred to the ECS daily. In the near future, IS data will be transferred directly to the ECS. Observational Systems (OS) data are extracted from Fulcrum cloud storage and stored in the ECS. Airborne Observation Platform (AOP) data are extracted from hard drives and copied to the ECS.
Processed Data: L1 IS data are currently stored in the PDR, but are being moved into the ECS. Once this migration is complete, the PDR will store only metadata. OS and AOP data are stored in the ECS, except for data sent to third-party data hosting services (MG-RAST, PhenoCam, etc.).
NEON's design relies on algorithms and processes to convert raw field measurements and observations into calibrated, documented, and quality-controlled data products. Delivering the immense volume of diverse sensor-derived data that NEON collects in a user-friendly format requires large-scale automation and computing power. NEON scientists collaborate with cyber infrastructure staff to create data processing algorithms and frameworks that:
- Collect and centralize data from thousands of sensors and hundreds of field scientists;
- Process incoming data to create derived data products;
- Assess the quality and integrity of data products; and
- Deliver optimized, useable, high-value data products.
For example, NEON flags sensor-derived data that are out of normal range or implausible, such as a species size measurement outside of the known range. NEON also conducts random recounts, cross-checks collected data against existing data, and reconciles conflicting data using documented quality-control methods.
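The range test described above can be sketched as follows. This is an illustrative assumption, not NEON's published QA/QC algorithm (those are specified in the ATBDs); the function name, field, and threshold values are invented:

```python
# Hypothetical illustration of range-based quality flagging. The thresholds
# and example values are invented, not NEON's actual QA/QC parameters.

def flag_out_of_range(value, plausible_min, plausible_max):
    """Return a quality flag: 0 = pass, 1 = missing or out of plausible range."""
    if value is None or not (plausible_min <= value <= plausible_max):
        return 1
    return 0

# Example: flag a species size measurement (cm) against a known plausible range.
measurements = [12.4, 9.8, 210.0, 11.1]   # 210.0 cm is implausible here
flags = [flag_out_of_range(v, 5.0, 30.0) for v in measurements]
```

Flags of this kind travel with the data rather than replacing them, so end users can decide how to treat suspect values.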
NEON uses a collaborative process (Agile software development), with software engineers and scientists partnering to develop the code that supports data collection, processing, publication, and distribution. Input is gathered from multiple sources, including members of other departments at NEON, external collaborators, and end users. Development projects are scoped and then prioritized by internal mixed-department teams.
For processes that require people to collect data in the field, NEON scientists and software developers have leveraged the Fulcrum platform to develop a series of sophisticated, rule-based applications tailored to each specific data collection protocol. These custom applications are then served to field scientists on digital tablets, allowing for real-time quality assurance of the data during collection.
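The rule-based entry checks such an application might apply can be sketched generically. The field names and rules below are invented for illustration; actual Fulcrum apps encode protocol-specific rules:

```python
# Hypothetical sketch of rule-based validation at data-entry time, in the
# spirit of a field-collection app. Field names and rules are illustrative.

RULES = {
    "taxon_id": lambda v: isinstance(v, str) and len(v) > 0,
    "stem_diameter_cm": lambda v: isinstance(v, (int, float)) and 0 < v < 500,
}

def validate_record(record):
    """Return the list of field names that fail their entry rule."""
    return [field for field, rule in RULES.items()
            if not rule(record.get(field))]

# A negative diameter is rejected immediately, while the technician is still
# at the plot and able to re-measure.
errors = validate_record({"taxon_id": "ABMA", "stem_diameter_cm": -3})
```

Catching such errors at entry time is far cheaper than reconciling them downstream.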
Data Ingest and Processing
NEON develops and maintains custom software to ingest data from sensors and Fulcrum apps. Streaming data from sensors is continually monitored for issues with data quality and quantity. Potential failure points in the ingest pipeline are logged and validated. Software has also been developed to monitor near-real-time health of sensors at the field sites to facilitate rapid alerts of outages and to improve response time.
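The near-real-time health monitoring described above can be sketched as a staleness check: a sensor that has not reported within a threshold window triggers an alert. The sensor IDs and the 15-minute threshold are illustrative assumptions, not NEON's actual configuration:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of sensor outage detection. IDs and the threshold
# are invented for illustration.

OUTAGE_THRESHOLD = timedelta(minutes=15)

def stale_sensors(last_seen, now):
    """Return sensor IDs whose most recent reading is older than the threshold."""
    return sorted(sid for sid, ts in last_seen.items()
                  if now - ts > OUTAGE_THRESHOLD)

now = datetime(2023, 6, 1, 12, 0)
last_seen = {
    "SOILPLOT-001": datetime(2023, 6, 1, 11, 58),   # reported 2 min ago
    "TOWER-TEMP-3": datetime(2023, 6, 1, 11, 20),   # silent for 40 min
}
alerts = stale_sensors(last_seen, now)
```

In practice such a check would feed an alerting system so that field staff can respond to outages quickly.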
NEON developed pipelines to clean and process raw data into higher-level products. QA/QC measures, along with checks of system state of health and performance, are applied at multiple points in the data processing pipeline, as early as possible. For observational data, scientists produce machine-readable workbooks that describe data processing rules, using an in-house language called NEON Ingest Conversion Language, or nicl. These workbooks provide a flexible method by which processing rules can be updated as needed. For instrumented and AOP data, scientists are involved in developing the algorithms and modules within the processing code. The data processing algorithms that have been coded into the pipeline are described in detail by Algorithm Theoretical Basis Documents (ATBDs), which are available for download from the NEON Data Portal. The processing code is currently available to the scientific community mostly by request, but we are working toward open-sourcing it. Raw (L0) data are never deleted, except in cases of obvious errors with sensors, communications systems, or field collection.
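The workbook-driven approach can be sketched generically: processing rules live in data rather than in code, so scientists can update them without redeploying the pipeline. The dict representation below is an invented stand-in, not the actual nicl syntax:

```python
# Illustrative sketch of data-driven processing rules, in the spirit of a
# machine-readable workbook. This is NOT the nicl syntax; the rule format
# and field names are invented.

RULES = [
    {"field": "temp_f", "action": "rename", "to": "temp_c_raw"},
    {"field": "temp_c_raw", "action": "convert", "fn": lambda f: (f - 32) * 5 / 9},
]

def apply_rules(record, rules):
    """Apply rename and conversion rules, in order, to a single record."""
    out = dict(record)
    for rule in rules:
        if rule["action"] == "rename":
            out[rule["to"]] = out.pop(rule["field"])
        elif rule["action"] == "convert":
            out[rule["field"]] = rule["fn"](out[rule["field"]])
    return out

converted = apply_rules({"temp_f": 212.0}, RULES)
```

Because the rule list is plain data, changing a unit conversion or a column name means editing the workbook, not the pipeline code.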
Data publication involves writing processed data into formatted files and bundling the files with associated metadata and documentation into data products. These products are made available to the scientific community through NEON's data portal and API. The publication software is written to correctly associate data streams into a bundle, generate metadata files in both human- and machine-readable formats, and store the files in the ECS, where they can be accessed later by end users.
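The bundling step can be sketched as generating a machine-readable manifest alongside the data files. The product identifier, file names, and manifest fields below are placeholders chosen for illustration, not NEON's actual publication format:

```python
import hashlib

# Hedged sketch of the publication step: associate files into a bundle and
# generate a manifest describing them. Names and fields are illustrative.

def build_manifest(product_id, files):
    """files: mapping of file name -> bytes content. Returns a manifest dict."""
    return {
        "productId": product_id,
        "files": [
            {"name": name,
             "size": len(data),
             "md5": hashlib.md5(data).hexdigest()}
            for name, data in sorted(files.items())
        ],
    }

manifest = build_manifest(
    "DP0.EXAMPLE.001",   # placeholder product identifier
    {"data.csv": b"a,b\n1,2\n", "readme.txt": b"..."},
)
```

Recording a checksum per file lets end users verify that a downloaded package arrived intact.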
User interfaces and API
Documentation and Interoperability
Making data discoverable, interoperable, and ready for reuse requires consideration of many factors, including human- and machine-readable forms of documentation; well-defined naming conventions and unique identifiers for everything from data streams to files; and protocols for transferring information between systems. NEON develops standardized documentation in both forms.
Interoperability - Naming Conventions and Formats
Where possible, NEON uses existing vocabularies or ontologies to describe variables or data streams. These include Darwin Core terms, the Global Biodiversity Information Facility vocabularies, and the VegCore data dictionary. In addition, data files are formatted to enhance interoperability between NEON data products and with data from other research programs. This includes the use of CSV, HDF5, LAS/LAZ, and GeoTIFF file formats.
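Mapping internal column names onto a shared vocabulary can be sketched as a simple term translation. The internal names on the left are invented; the terms on the right are real Darwin Core terms:

```python
# Illustrative mapping from hypothetical internal column names to Darwin Core
# terms, so records interoperate with other biodiversity datasets.

DWC_MAP = {
    "taxon_name": "scientificName",
    "obs_date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}

def to_darwin_core(record):
    """Rename known columns to their Darwin Core terms; pass others through."""
    return {DWC_MAP.get(k, k): v for k, v in record.items()}

dwc = to_darwin_core({"taxon_name": "Peromyscus maniculatus", "lat": 40.0})
```

Once columns carry standard term names, records from different programs can be combined without per-dataset translation code.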
Human-readable documentation is provided in text and PDF files. Each data product includes README files that describe the data product, as well as any files that are included in a downloaded package. In addition, end users can choose to include PDF files that may describe data collection and sample processing protocols, sensor placement in the field, algorithms used in data processing, calibration procedures, and other components of the data life cycle.
Machine-readable documentation is developed using community standards and established schemas. For NEON, this is mostly through three mechanisms: 1) metadata files generated based on the Ecological Metadata Language (EML) schema, which describe data products and the files that comprise data packages; 2) metadata embedded into the Hierarchical Data Format (HDF5) that NEON uses for eddy covariance and AOP data products; and 3) JSON-LD files that follow schema.org conventions, extended with patterns defined by the Schema.org cluster within the Earth Science Information Partners (ESIP) organization.
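The third mechanism can be sketched as a minimal schema.org Dataset record serialized as JSON-LD. The `@context`, `@type`, and property names are real schema.org conventions; the values are placeholders, not an actual NEON record:

```python
import json

# Minimal illustrative JSON-LD record following schema.org conventions.
# The name and description values are placeholders.

record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example data product",
    "description": "Placeholder description of a data product.",
    "encodingFormat": ["text/csv"],
}

jsonld = json.dumps(record, indent=2)
```

Because search engines and data catalogs harvest schema.org markup, records like this make data products discoverable outside the project's own portal.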