Construction of the Automated Data Processing and Delayed-Mode Quality Control System for Profiling Floats

Yasushi TAKATSUKI*1 Yasuko ICHIKAWA*2 Taiyo KOBAYASHI*2 Keisuke MIZUNO*1 Kensuke TAKEUCHI*2

Abstract

An automated data processing and quality management system for profiling floats of the ARGO project has been constructed. The system automatically processes profiling float data about 20 hours after a float's descent and stores the data in a database system. The float data are made public on the World Wide Web. For the quality control of float data, we prepared historical databases such as WOA98 and HydroBase, and collect temporally and spatially neighboring ocean data from the GTS via the NEAR-GOOS Regional Real-Time Data Base. The system enables quality management by means such as overlay plots of the float data against the historical and neighboring data.

Keywords: ARGO project, Automated data processing, quality control, database, WWW

*1 Ocean Observation and Research Department
*2 Frontier Observational Research System for Global Change


1. Introduction

An international project was launched in 2000 for the construction of a new ocean observation system through the worldwide deployment of subsurface floats (profiling floats) that measure vertical profiles of water temperature and salinity from the sea surface down to a depth of 2,000 m. The floats used in this project are based on the same design as ALACE (Davis et al., 1992). A float drifts at a preset depth (normally 2,000 m) and resurfaces at specified intervals (normally 10 days). During its ascent to the surface, the float measures vertical profiles of temperature and salinity, and it transmits the data at the surface via satellite. The Argo project will collect 100,000 temperature/salinity profiles annually by deploying approximately 3,000 floats in oceans worldwide (Roemmich and Owens, 2000). The Argo project is part of the Global Ocean Observing System (GOOS) and will contribute to the Climate Variability and Predictability Study (CLIVAR) and the Global Ocean Data Assimilation Experiment (GODAE). The project is supported by the World Meteorological Organization (WMO) and the Intergovernmental Oceanographic Commission (IOC) of UNESCO.

In Japan, the 5-year project, "Construction of the Advanced Ocean Monitoring System (ARGO project)," was launched in 2000 as part of the Millennium Project (Mizuno, 2000). This project is Japan's own undertaking, but it also constitutes a practical contribution to the international Argo project. The Ocean Observation and Research Department of the Japan Marine Science and Technology Center and the Frontier Observational Research System for Global Change (JAMSTEC/FORSGC) are in charge of constructing an observational data-processing and management system for this project, which consists of storing the data from deployed floats, performing delayed-mode quality control, and distributing the data through the "database system." Here, we describe a summary of the automated data-processing and quality control system implemented in FY 2000, which is part of the "database system".

2. System Overview

Data quality control in the Argo project is performed in two steps: automated quality control performed within 1 to 2 days after data acquisition, and delayed-mode quality control performed within 3 months after data acquisition. The system carries out both real-time data processing, which performs data acquisition and automated quality control on a server, and delayed-mode quality-control processing, which compares the acquired data with historical and/or in-situ data using a quality-control program on a PC and also performs quality-flag assignment and data correction. Figure 1 shows the conceptual design of the system. The system is operated on a local network at JAMSTEC, so data retrieval from the Internet must be conducted by e-mail through a JAMSTEC mail server and by telnet/ftp through the firewall. For security reasons, data distribution to the Internet is at present performed using the HTTP protocol on the World Wide Web. The data are managed by the database system and manipulated with SQL.

Figure 1. Conceptual drawing of the data processing system for profiling floats.

3. System Design

The following points were taken into consideration in the system construction. First, all processing, from float data reception through automated quality control to distribution via the Web, is basically performed automatically, and services are to continue whenever possible even in the event of a system failure. Second, the quality-control procedures should require minimal knowledge of database administration and the like. The programs were written primarily in Perl, a scripting language, for easy maintenance, and IDL was used as the graphics tool.

3.1. Improvement of system availability

Figure 2 shows the hardware components of the server part. To prevent an interruption of service due to hardware failure, two server machines with identical components are provided, and clustering software (VERITAS Cluster Server) is installed on both. The clustering software monitors the other machine's operational status through a direct Ethernet connection and periodically monitors the status of the software services it manages. Each service managed by the clustering software is assigned a virtual host name ("floatdb1" for the database server function and "floatdb2" for the Web server function) and an IP address. By accessing the virtual host, the user can use a service without having to know which server machine actually provides it. Normally, the database server (floatdb1) runs on the machine float1adm and the Web server (floatdb2) runs on the machine float2adm. However, in the event of a functional failure on one of the machines, the failed service is stopped and the other machine takes it over.

Figure 2. Hardware components of Database server and Web server.

To enable data recovery in the event of a failure in the database file, a backup of the database is made daily by the automatic backup software (VERITAS Netbackup) and the DLT tape changer. Currently, the backup data is set to be stored on tape for at least 7 days. To protect the entire system from unexpected power failures, it is connected to an uninterruptible power supply (UPS) unit. The entire system is shut down if the power failure lasts longer than 10 minutes. After power is restored, the system restarts automatically.

3.2. Automated Data Processing

The system automatically processes float data at constant time intervals. The flow of the automated processing is shown in Fig. 3. The details of each step are provided in the following sections. The data processing functions described in Sections 3.2.1 to 3.2.4 are performed on the database server, and those in Section 3.2.5 are performed on the Web server.

Figure 3. Flowchart of automatic float data processing.

3.2.1. Retrieve and Classify Data

A float transmits data using the ARGOS system. The ARGOS system currently (as of July 2001) operates five polar-orbiting satellites. These satellites are in sun-synchronous polar orbits at an altitude of 850 km; one revolution around the Earth takes about 102 minutes, and the orbit shifts 25 degrees westward per revolution. Two satellites pass over a given location 6 to 7 times per day in equatorial regions and roughly 28 times per day in polar regions. Data transmission is possible for 8 to 15 minutes each time a satellite passes over a float (CLS/Argos, 1996). The ARGOS system normally delivers only the data received by two satellites, but use of the multi-satellite service makes it possible to acquire all available data received by the satellites in operation.

The maximum data volume that can be transmitted in a single ARGOS message is limited to 32 bytes. The data transmitted by a float in a single resurfacing (float internal information and the pressure, temperature, and salinity data measured for approximately 60 layers) amount to approximately 400 bytes. The float therefore divides the data into 12 to 14 message blocks and transmits them in sequence. Owing to restrictions of the ARGOS system, the transmission repetition period of the ARGOS Platform Transmitter Terminal (PTT) on board the floats deployed thus far has been either 44 or 90 seconds, so it is difficult to transmit all of the data during a single satellite pass over the float. Furthermore, as the position of the float at the surface is determined by the ARGOS system, all of the float data obtained while it is at the surface should be collected. The data received by the ARGOS satellites are delivered by e-mail via ground stations. With the current e-mail delivery settings, all received data should be delivered; however, compared with the data delivered on floppy disk once per month, the data sent by e-mail sometimes lack some data blocks. To guard against problems in e-mail delivery, the system also accesses the Service ARGOS host via telnet every 6 hours to retrieve data.

The data acquired by e-mail or telnet are classified by ARGOS ID and then stored in a work file for each ARGOS ID for later processing. During classification, data not in the ARGOS format are rejected as invalid.
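As an illustration of this classification step, a minimal Perl sketch is given below (Perl being the main implementation language of the system). The input record format, the ARGOS ID check, and the work-file names are assumptions made for the example and do not reproduce the formats actually delivered by Service ARGOS.

#!/usr/bin/perl
# Minimal sketch: sort raw ARGOS messages into one work file per ARGOS ID.
# The record format assumed here ("<argos_id> <64 hex characters>" per line)
# and the work-file names are illustrative, not the actual Service ARGOS formats.
use strict;
use warnings;

my %fh;    # open work-file handles, keyed by ARGOS ID
while (my $line = <STDIN>) {
    chomp $line;
    my ($id, $hex) = split ' ', $line, 2;
    # Anything that does not look like an ARGOS record is rejected as invalid.
    unless (defined $hex and $id =~ /^\d+$/ and $hex =~ /^[0-9A-Fa-f]{64}$/) {
        warn "rejected invalid record: $line\n";
        next;
    }
    unless ($fh{$id}) {
        open $fh{$id}, '>>', "work_$id.dat" or die "cannot open work_$id.dat: $!";
    }
    print { $fh{$id} } "$hex\n";
}
close $_ for values %fh;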

Figure 4 shows the distribution of the time required to deliver data by e-mail. This time includes the time required for a satellite to move into view of a ground station. Over 40% of the data are delivered within 1 hour of transmission, and over 98% within 12 hours; the longest elapsed time observed has been 22 hours.

The current system begins the decoding process when no new data have been added by either e-mail or telnet within the past 12 hours, since (1) the flyover interval of a satellite is 2 to 3 hours, (2) more than 20 hours may be required for the satellite to deliver the received data, and (3) the ARGOS system may recalculate the positioning data. Because data are acquired in parallel by e-mail and telnet, there has so far been no case during normal operation in which new data arrived after decoding. In the rare case that new data are acquired afterwards, they are handled as a duplicate-data error by the database on insertion if the data contain profile-number information. However, if the profile-number information is missing, the data will be inserted into the database as new data and must be confirmed and deleted manually. Therefore, all steps of the processing are recorded in the job report, and the decoded data are checked against the float schedule.

Figure 4. Histogram and cumulative receive rate as a function of the elapsed time after transmission.
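The stand-by rule described above can be expressed very simply. The following Perl fragment is a sketch under the assumption that the arrival time of the newest message can be taken from the modification time of a float's work file; the file name and subroutine are placeholders, not the actual system interfaces.

# Sketch of the stand-by rule: decoding starts only when no new message has
# arrived by e-mail or telnet for 12 hours. Here the modification time of a
# float's work file stands in for the arrival time of its newest message.
use strict;
use warnings;

use constant STANDBY_SEC => 12 * 3600;    # currently fixed at 12 hours

sub ready_to_decode {
    my ($last_received_epoch) = @_;       # epoch time of the newest message
    return (time() - $last_received_epoch) >= STANDBY_SEC;
}

my $workfile = 'work_12345.dat';          # hypothetical work file
if (-e $workfile and ready_to_decode((stat $workfile)[9])) {
    print "start decoding $workfile\n";
}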

3.2.2. Decoding Data

The received data may contain errors that occurred during transmission. Therefore, the first byte of each 32-byte message is assigned an 8-bit Cyclic Redundancy Check (CRC) code calculated from the remaining 31 bytes. The CRC effectively detects burst errors (successive bit errors) and has revealed that nearly 20% of the received data contain errors (Nakajima et al., 2001). An average of 120 messages are received every time a float resurfaces, so after the messages containing CRC errors are removed, an average of 90 messages remain. Because a single float profile consists of 12 to 14 message blocks, the whole profile cannot be decoded unless every block is received. The number of messages received up to July 2001 (Fig. 5) shows that approximately 1% of the message blocks were never received; this rate implies roughly one incomplete profile per ten float resurfacings. Fortunately, the non-received blocks tend to be concentrated in particular profiles, so of the 235 profiles obtained thus far, only 10 were incomplete (less than 5%).

Figure 5. Histogram of duplicate number for each data message block.
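For illustration, a minimal Perl sketch of the CRC screening is shown below. The generator polynomial used by the float firmware is not given in this report, so the common CRC-8 polynomial 0x07 is assumed here purely as an example and would have to be replaced by the actual one.

# Sketch of the CRC screening: the first byte of each 32-byte message is
# compared with a CRC-8 computed over the remaining 31 bytes. The polynomial
# 0x07 is an assumption for illustration only.
use strict;
use warnings;

sub crc8 {
    my (@bytes) = @_;
    my $crc = 0;
    for my $b (@bytes) {
        $crc ^= $b;
        for (1 .. 8) {
            $crc = ($crc & 0x80) ? ((($crc << 1) ^ 0x07) & 0xFF) : (($crc << 1) & 0xFF);
        }
    }
    return $crc;
}

sub message_ok {
    my ($hex) = @_;                                 # 64 hex characters = 32 bytes
    my @bytes = map { hex } unpack '(A2)*', $hex;
    return 0 unless @bytes == 32;
    return $bytes[0] == crc8(@bytes[1 .. 31]);
}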

With an 8-bit CRC, an error escapes detection with a probability of 1/2^8. Several cases have indeed been found in which the data sent from the floats contained errors that passed the CRC check (Fig. 6). However, since the probability of the same error occurring in different transmissions is very low, erroneous data are rejected in the actual processing by keeping the most frequently received data pattern. If there is no clear difference in pattern frequency, or if only one copy of the corresponding block was received, the block may still contain an error; such blocks are therefore recorded in the job report.

Figure 6. Example of error data that passed CRC check.
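A sketch of this majority rule is given below. The handling of ties and single copies simply marks the block as suspect for the job report; the data structures are illustrative only.

# Sketch of the majority rule: among the copies received for one message
# block, keep the byte pattern seen most often, and mark the block as suspect
# (to be recorded in the job report) when only one copy exists or when no
# pattern clearly dominates.
use strict;
use warnings;

sub select_block {
    my (@copies) = @_;    # hex strings of the copies of one block
    my %count;
    $count{$_}++ for @copies;
    my @sorted = sort { $count{$b} <=> $count{$a} } keys %count;
    my $best    = $sorted[0];
    my $suspect = (@copies == 1)
        || (@sorted > 1 && $count{$sorted[0]} == $count{$sorted[1]});
    return ($best, $suspect);
}

my ($block, $suspect) = select_block('A1B2', 'A1B2', 'A1F2');   # keeps 'A1B2'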

3.2.3. Automated Quality Control

For the Argo project, a common automated quality-control procedure is to be adopted internationally. Discussions on the actual processing method are currently underway, and a decision is expected at the Argo Data Management Team Meeting scheduled for September 2001. As soon as the contents of the automated quality control are decided, quality control at JAMSTEC/FORSGC will be conducted in accordance with the agreed procedure. In the meantime, some form of automated quality control is needed to prevent erroneous data from being distributed on the WWW. We therefore perform an automated quality-control procedure based on the Real-time Quality Control Manual of the Global Temperature and Salinity Pilot Program (IOC, 1990). The current automated quality-control procedures are described below.
  1. Positioning Check:
    The position of a float is calculated from the Doppler shift of the signal received by the ARGOS satellites. On average, about 10 positions are determined in this manner while a float is at the surface. The routine confirms that the drifting velocity calculated from these positions and times does not exceed 5 kt, and that it does not exceed the average drifting velocity during the surfacing by a factor greater than 2.5.
  2. Global Range Check of Temperature, Salinity, and Depth:
    The routine confirms that the data values fall within the ranges normally observed in the open sea: -2 to 35 deg-C for temperature, 0 to 40 for salinity, and 0 to 10,000 m for depth.
  3. Depth Check:
    The seafloor depth from ETOPO5 (NOAA, 1988) at the grid point nearest the float's resurfacing position is compared with the maximum depth recorded in the float data to confirm that the seafloor is deeper.
  4. Pressure Inversion Check:
    Confirms that the pressure value increases in the proper sequence.
  5. Range Check by Depth:
    Confirms that the temperature and salinity values by depth do not fall outside the range specified in Table 1.
  6. Freezing Point Check:
    Confirms that the water temperature is not lower than the freezing point calculated by the following equation (UNESCO, 1983):
    T = -0.0575 S + 1.710523×10^-3 S^(3/2) - 2.154996×10^-4 S^2 - 7.53×10^-4 P
    Here, S is the practical salinity and P is the pressure (dbar).
  7. Spike Check:
    Confirms that the test value calculated from the vertical profiles of temperature and salinity using the following equation does not exceed the threshold (2.0 deg-C for temperature, 0.3 for salinity):
    Vtest = |V2 - (V3 + V1)/2|-|V1 - V3|/2
    Here, V2 is the value of the layer to be tested, and V1 and V3 are the values of the layers directly above and below the tested layer, respectively.
  8. Slope Check:
    Confirms that the test value calculated from the vertical profiles of temperature and salinity using the following equation does not exceed the threshold (10 deg-C for temperature, 5.0 for salinity):
    Vtest = |V2 - (V3 + V1)/2|
    Here, V2 is the value of the layer to be tested, and V1 and V3 are the values of the layers directly above and below the tested layer, respectively.
  9. Density Inversion Check:
    Confirms that there are no inversions in the density calculated from the temperature and salinity values.
If the data fail any of the checks above, the corresponding quality flag is set to 3, which means "possibly erroneous value". A minimal sketch of several of these checks is given below.
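The following Perl sketch illustrates three of the checks above (spike, slope, and freezing-point tests) using the formulas given in the text; the profile layout and flag handling are simplified for illustration and do not reproduce the operational program.

# Sketch of three of the checks above (spike, slope and freezing-point tests),
# following the formulas given in the text. The profile layout (parallel
# arrays of P, T, S) and the use of flag 1 for data that pass the checks are
# simplifying assumptions.
use strict;
use warnings;

# Spike test: Vtest = |V2 - (V3 + V1)/2| - |V1 - V3|/2
sub spike { my ($v1, $v2, $v3) = @_; return abs($v2 - ($v3 + $v1) / 2) - abs($v1 - $v3) / 2; }

# Slope test: Vtest = |V2 - (V3 + V1)/2|
sub slope { my ($v1, $v2, $v3) = @_; return abs($v2 - ($v3 + $v1) / 2); }

# Freezing point (UNESCO, 1983); S = practical salinity, P = pressure (dbar)
sub freezing_point {
    my ($s, $p) = @_;
    return -0.0575 * $s + 1.710523e-3 * $s**1.5 - 2.154996e-4 * $s**2 - 7.53e-4 * $p;
}

sub qc_flags {
    my ($p, $t, $s) = @_;            # array references, surface to bottom
    my @flag = map { 1 } @$p;        # assume 1 = passed; 3 = possibly erroneous
    for my $i (0 .. $#$p) {
        $flag[$i] = 3 if $t->[$i] < freezing_point($s->[$i], $p->[$i]);
        next if $i == 0 or $i == $#$p;
        $flag[$i] = 3 if spike($t->[$i-1], $t->[$i], $t->[$i+1]) > 2.0
                      or spike($s->[$i-1], $s->[$i], $s->[$i+1]) > 0.3;
        $flag[$i] = 3 if slope($t->[$i-1], $t->[$i], $t->[$i+1]) > 10.0
                      or slope($s->[$i-1], $s->[$i], $s->[$i+1]) > 5.0;
    }
    return \@flag;
}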

Table 1. Range check value of temperature and salinity for each depth range used for automatic quality control.

3.2.4. Data Insert to the Database

The number of messages received by the ARGOS satellites and the date and time of the last update are inserted into the database, together with the pressure, temperature, salinity, and float internal information obtained through decoding, all positioning information from the ARGOS system, and the quality flags assigned by the quality-control processing. If the observed values are corrected or a quality flag is changed in later quality control, the entire revision history, including the values prior to revision, is also recorded in the database.
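As a purely illustrative example of how a flag revision can be recorded together with its history, a Perl/DBI sketch is shown below. The PostgreSQL connection, table names, and column names are hypothetical; the actual schema of the float database is not described in this report.

# Sketch of recording a flag change while preserving the revision history,
# using Perl DBI with a hypothetical schema.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=floatdb', 'argo', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

sub update_flag {
    my ($profile_id, $level, $new_flag, $operator) = @_;
    # Keep the value prior to revision in a history table.
    $dbh->do(q{
        INSERT INTO flag_history (profile_id, level, old_flag, changed_at, changed_by)
        SELECT profile_id, level, qc_flag, now(), ?
          FROM observation WHERE profile_id = ? AND level = ?
    }, undef, $operator, $profile_id, $level);
    # Then apply the new flag to the observation record.
    $dbh->do(q{
        UPDATE observation SET qc_flag = ? WHERE profile_id = ? AND level = ?
    }, undef, $new_flag, $profile_id, $level);
    $dbh->commit;
}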

3.2.5. Automated Update of Webpages

The website currently has the structure shown in Fig. 7. The system checks the database for new or updated data every 3 hours and generates or updates the pages for the corresponding float data. In addition, all pages are regenerated once every 24 hours.

Figure 7. World Wide Web Site structure of the "Japan ARGO Delayed-mode Data base" (http://www.jamstec.go.jp/ARGO/).

3.3. Float Information Management

Metadata such as the float type, serial number, deployment information, and settings for the drift and resurfacing times are inserted into the database in the same way as the float observation data. For convenience, insertion into and updating of the database are performed through a Web browser (Fig. 8). These pages are accessible only from the internal network, and a separate HTTP server is dedicated to float information management.

Figure 8. Top page of the "Float information management system."

3.4. Quality control of Float Data

The Argo project aims to achieve an accuracy of 0.005 deg-C in temperature and 0.01 in practical salinity. However, conductivity sensors are greatly affected by slight physical deformations and/or impurities on the sensor surface and are therefore prone to large deviations in accuracy, so it is extremely difficult to maintain high accuracy during long-term operation. Conductivity sensors with long-term stability are currently under development, but it is also important to control the quality of the data acquired by floats that have already been deployed. Studies by Freeland (1997) and Bacon et al. (2001) have attempted to correct the salinity data obtained by floats using historical data or temporally/spatially neighboring data. At JAMSTEC as well, quality-control methods are being examined for the ARGO project (Kobayashi et al., 2001).

The developed system enables, on a PC, comparison of the float data with historical data (Fig. 9), comparison between different float profiles, and comparison of the float data with temporally/spatially neighboring data. The system also makes it possible to change a quality flag on screen and to update the database according to such comparisons. The World Ocean Atlas 1998 (NOAA, 1999), published by the National Oceanographic Data Center of the National Oceanic and Atmospheric Administration, and the HydroBase data set (Macdonald et al., 2001) are prepared in the database as historical data for quality control. For temporally/spatially neighboring data, the global subsurface temperature and salinity data obtained through the GTS and managed by the Regional Real-Time Data Base (RRTDB) of the North-East Asian Regional GOOS (NEAR-GOOS; Yoshida and Toyoshima, 2001) are automatically retrieved every day by ftp and inserted into the database for comparison.

Figure 9. Sample screen shot of the "Float data quality control program." Comparison graph of Float data and historical data (WOA98).
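The comparison of Fig. 9 amounts to bringing the climatological profile onto the float's observed pressures. The following Perl sketch shows one simple way to do this with linear interpolation; the variable names and the interpolation scheme are illustrative assumptions, not the method used by the quality-control program.

# Sketch of the comparison shown in Fig. 9: the climatological profile at the
# nearest grid point is linearly interpolated to the float's observed
# pressures and the salinity differences are listed.
use strict;
use warnings;

sub interp {                              # value of the climatology at pressure $p
    my ($p, $clim_p, $clim_v) = @_;       # refs to increasing pressures and values
    return undef if $p < $clim_p->[0] or $p > $clim_p->[-1];
    for my $i (0 .. $#$clim_p - 1) {
        next unless $p <= $clim_p->[$i + 1];
        my $w = ($p - $clim_p->[$i]) / ($clim_p->[$i + 1] - $clim_p->[$i]);
        return $clim_v->[$i] + $w * ($clim_v->[$i + 1] - $clim_v->[$i]);
    }
    return undef;
}

sub compare_salinity {
    my ($float_p, $float_s, $clim_p, $clim_s) = @_;    # array references
    for my $i (0 .. $#$float_p) {
        my $cs = interp($float_p->[$i], $clim_p, $clim_s);
        next unless defined $cs;
        printf "%7.1f dbar  float S=%.3f  clim S=%.3f  diff=%+.3f\n",
               $float_p->[$i], $float_s->[$i], $cs, $float_s->[$i] - $cs;
    }
}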

4. System Improvement

Data have been made public on the WWW by the present system since April 1, 2001. On one occasion the database server went down because the file system became full owing to inappropriate script settings, but otherwise there have been no major problems in the automated data processing. The following items are being examined as future tasks as the Argo project progresses.

4.1. Full Adaptation to All Argo Floats

The present system is designed to handle only data from the profiling floats deployed by JAMSTEC. At the International Argo Data Management Group workshop held in October 2000, it was agreed that all float data should be expressed in a common format and exchanged through two global data centers. The common format is to be decided at the Argo Data Management Team Meeting scheduled for September 2001. When the format has been finalized and the global centers begin operation, the system at JAMSTEC will be revised so that the database can handle all float data and JAMSTEC can provide global Argo float data.

4.2. Time Required for Real-Time Processing

At present, 12 to 32 hours are required from the time a float ends transmission and descends until all real-time automated data processing is completed (Fig. 10). The average processing time is approximately 20 hours, and processing is complete within 24 hours for 90% of the data. To achieve the goal of the Millennium Project, which is to improve the accuracy of long-term prediction, the real-time data should be assimilated into models promptly. To reduce the time required for real-time processing of float data, it is necessary to shorten data acquisition, the most time-consuming step, and to examine a suitable standby time before decoding, which is currently fixed at 12 hours. Adopting a satellite communication system other than the ARGOS system may also reduce the time required for data acquisition.

Figure 10. Histogram of the elapsed time from the float descent to decode data.

4.3. Reconstruction of Error Data

As mentioned in Section 3.2.2, the present communication system has a high error rate, and CRC errors are sometimes detected in every received copy of a certain data block. In a performance test of the ARGOS system conducted by Sherman (1992), no difference was found in the error characteristics between two test patterns, a repetition of 1s and an alternation of 1s and 0s; it was therefore concluded that errors in the ARGOS system result not from simple bit loss but from noise bursts that corrupt several successive bits.

Figure 11 shows an example of a data block for which every received copy contains CRC errors. Some portions have many inverted bits, while others have only a few. Even when all copies received for a certain message block contain errors, it may therefore be possible to reconstruct the most likely data sequence by comparing the copies bit by bit and checking the result against the CRC. Figure 12 shows such a reconstruction for the data in Fig. 11. It would be preferable to reconstruct data with this method while also examining the decoded results. We are therefore examining convenient methods for reconstructing erroneous data on the quality-control PC, along the lines sketched below.

Figure 11. Example of bit error in ARGOS message.

Figure 12. Example of recovery data from bit error shown in Fig. 11.
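A minimal Perl sketch of this bit-wise reconstruction is given below. It assumes the copies of the block are available as equal-length raw byte strings and reuses a crc8() routine such as the one sketched in Section 3.2.2 (with its assumed polynomial); ties in the vote are resolved arbitrarily.

# Sketch of the bit-wise reconstruction: when every copy of a block fails the
# CRC, take a majority vote over the copies bit by bit and accept the result
# only if it then passes the CRC. Ties in the vote are resolved to 0 here.
use strict;
use warnings;

sub bitwise_majority {
    my (@copies) = @_;                 # equal-length raw byte strings
    my $len   = length $copies[0];
    my @votes = (0) x ($len * 8);
    for my $copy (@copies) {
        my @bits = split //, unpack 'B*', $copy;
        $votes[$_] += $bits[$_] for 0 .. $#bits;
    }
    my $half = @copies / 2;
    return pack 'B*', join '', map { $_ > $half ? 1 : 0 } @votes;
}

# my $candidate = bitwise_majority(@bad_copies);
# my @bytes     = unpack 'C*', $candidate;
# accept the block only if $bytes[0] == crc8(@bytes[1 .. 31])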

4.4. Efficient Delayed-Mode Quality Control

Delayed-mode quality control at JAMSTEC is basically performed on the data acquired by floats deployed by JAMSTEC. As the Argo project progresses, the number of floats whose data require quality control will increase significantly. It will therefore be necessary to periodically execute a quality-control program that supports the quality-control work and to perform quality control efficiently using the reports produced by the program. We must examine what information such reports should contain to facilitate the quality-control work, and incorporate it into the system.

5. Conclusions

A system for the automated data processing and quality control of float data was constructed as the database system that stores the data from deployed floats, performs delayed-mode quality control, and makes the data public, for the processing and management of the observation data collected by the project "Construction of the Advanced Ocean Monitoring System (ARGO Project)," which is part of the Millennium Project. The system has two main functions: automated real-time processing of float data, and advanced delayed-mode quality control. The automated real-time processing can distribute observation data on the WWW within one to two days of a float's resurfacing. For quality control, historical data (WOA98, HydroBase) and temporally/spatially neighboring data are provided to enable on-screen comparison with the float data.

In 2001, the system will be revised to handle data from all Argo floats, to enable the reconstruction of erroneous data, and to perform delayed-mode quality control more efficiently.

Acknowledgement

D. Swift of the University of Washington provided information on the data-processing system for the profiling floats operated at the University of Washington. Dr. R. Molinari and the staff of the GOOS Center at the Atlantic Oceanographic and Meteorological Laboratory (AOML) of the National Oceanic and Atmospheric Administration (NOAA) provided information on the quality-control processing conducted at the GOOS Center. We would like to express our deep appreciation to all of these individuals.

References

  1. Davis, R.E., D. C. Webb, L. A. Regier, and J. Dufour, "The Autonomous Lagrangian Circulation Explorer (ALACE)", J. Atm. and Oceanic Technol., 9 (3), 264-285 (1992).
  2. Roemmich, D. and W. B. Owens, "The Argo project: global ocean observations for understanding and prediction of climate variability", Oceanography, 13, 45-50 (2000).
  3. Mizuno, K., "A plan of the establishment of Advanced Ocean Observation System (Japan ARGO)" (in Japanese), Techno Marine, 854, 485-490 (2000).
  4. CLS/Service Argos, Users Manual 1.0. (CLS/Service Argos, Inc., January 1996).
  5. Nakajima, H., Y. Takatsuki, K. Mizuno, K. Takeuchi, and N. Shikama, "Data communication status of the ARGO floats" (in Japanese with English abstract), JAMSTECR, 44, 153-161 (2001).
  6. IOC, Manuals and Guides #22 "GTSPP Real-time Quality control manual" (1990).
  7. NOAA, Data Announcement 88-MGG-02, Digital relief of the Surface of the Earth. (NOAA, National Geophysical Data Center, Boulder, Colorado, 1988).
  8. UNESCO, "Algorithms for Computation of Fundamental Properties of Seawater." UNESCO Technical Papers in Marine Science, 44 (1983).
  9. Freeland, H., "Calibration of the Conductivity Cells on P-ALACE Floats" 1997 U.S. WOCE Report, 37-38, (1997).
  10. Bacon, S., L. R. Centurioni and W. J. Gould, "The Evaluation of Salinity measurements from PALACE Floats", J. Atm. and Oceanic Technol., 18 (7), 1258-1266 (2001).
  11. Kobayashi, T., Y. Ichikawa, Y. Takatsuki, T. Suga, N. Iwasaka, K. Ando, K. Mizuno, N. Shikama, and K. Takeuchi, "Quality control of Argo data based on high quality climatological data set (HydroBase) I" (in Japanese with English abstract), JAMSTECR, 44, 101-114 (2001).
  12. NOAA, World Ocean Atlas 1998 (WOA98), (NOAA, National Oceanographic Data Center, Ocean Climate Laboratory, April 1999).
  13. Macdonald, A. M., T. Suga and R. G. Curry, "An isopycnally averaged North Pacific climatology", J. Atm. and Oceanic Technol., 18 (3), 394-420 (2001).
  14. Yoshida, T. and S. Toyoshima, "Present status and future view of data management in NEAR-GOOS" (in Japanese), Kaiyo Monthly, 33 (5), 311-316 (2001).
  15. Sherman, J., "Observations of Argos Performance", J. Atm. and Oceanic Technol., 9 (6), 323-328 (1992).

Appendix. Data Format of the Profiling Float

The floats that have been deployed to date by the Japan Marine Science and Technology Center and the Frontier Observational Research System for Global Change (JAMSTEC/FORSGC) use the ARGOS system for data transmission. The data are transmitted in blocks of 32 bytes, shown here in hexadecimal notation (Fig. A1). There are currently two data formats, presented in Fig. A2. The conversion to water temperature (T), salinity (S), and pressure (P) is performed using the equations below, where BH and BL are the values of the higher and lower bytes, respectively (each in the range 0x00-0xFF, i.e. 0-255 in decimal notation).
[Type A1]
T = (BH*256 + BL)/1000 (deg-C)
S = (BH*256 + BL)/10000 + 30 (in PSS-78)
P = (BH*256 + BL)/10 (dbar)

[Type A2]
T = (BH*256 + BL)/1000 (deg-C, for 0 <= BH*256 + BL <= 62536)
T = (BH*256 + BL - 65536)/1000 (deg-C, for BH*256 + BL > 62536)
S = (BH*256 + BL)/1000 (in PSS-78)
P = (BH*256 + BL)/10 (dbar)
Furthermore, the power supply voltage (V), the internal pressure of the float (p), and the piston-motor drive time (t) are calculated using the following equations, where B is the value of the corresponding byte and BH and BL are again the values of the higher and lower bytes, respectively.
V = B/10 + 0.6 (V)
p = B*(-0.376) + 29.15 (inHg)
t = (BH*256 + BL)*2 (sec; type A2 only)
In addition, note that 5 dbar is added to the "pressure at the surface immediately before the last descent" in the encoding, so 5 dbar must be subtracted from the value converted with the equation above. All other items take the value of the corresponding byte.
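As an illustration of these conversion formulas, a short Perl sketch for the type A2 format is given below; the subroutine names are arbitrary, and the example values are not taken from an actual float message.

# Sketch of the conversion formulas above for the type A2 format. BH and BL
# are the higher and lower bytes of the corresponding word; subroutine names
# and the example values are arbitrary.
use strict;
use warnings;

sub word     { my ($bh, $bl) = @_; return $bh * 256 + $bl; }
sub temp_a2  { my $w = word(@_); return ($w > 62536 ? $w - 65536 : $w) / 1000; }  # deg-C
sub sal_a2   { return word(@_) / 1000; }                     # practical salinity (PSS-78)
sub pres     { return word(@_) / 10; }                       # dbar
sub voltage  { my ($b) = @_; return $b / 10 + 0.6; }         # V
sub int_pres { my ($b) = @_; return $b * (-0.376) + 29.15; } # inHg
sub piston_t { return word(@_) * 2; }                        # sec (type A2 only)

printf "T=%.3f deg-C  S=%.3f  P=%.1f dbar\n",
       temp_a2(0x0A, 0x28), sal_a2(0x86, 0xC5), pres(0x03, 0xE8);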

Figure A1. Sample message from ARGOS system.

Figure A2. Data format arrangement of profiling floats in operation.
(a) First message of type A1 (b) First message of type A2 (c) Other messages

Table A1. Data table of "Profile termination flag byte" in hexadecimal notation.
Value  Meaning
00     Pressure reached surface pressure (normally terminated)
02     Pressure reached zero
04     Pressure unchanged for 25 minutes (does not terminate the profile)
08     Piston fully extended before reaching the surface
10     UP time expired before reaching the surface and UP time was reset (type A1 only)