

Title:
SYSTEM FOR PROCESSING DATA AND METHOD THEREOF
Document Type and Number:
WIPO Patent Application WO/2015/163754
Kind Code:
A1
Abstract:
The present invention relates to a system and method for processing data. The system comprises a data source (100), a staging unit (101) and a data target (103). The data source (100) is configured for providing data including a dirty set of data. The staging unit (101) is coupled to the data source (100) for staging the data and cleansing anomalies from the dirty set of data to generate a clean set of data for transmission to the data target (103). The system can be characterized by a feedback unit (102) that is configured for providing corrective feedback based on a decision rule through a communication link connecting the data source (100), the staging unit (101) and the data target (103). The method includes the steps of providing data including a dirty set of data; staging the data and cleansing anomalies from the dirty set of data; and generating a clean set of data.

Inventors:
RAMACHANDRAN ARVIND (MY)
HII WAN KIN CHARLES (MY)
Application Number:
PCT/MY2015/050024
Publication Date:
October 29, 2015
Filing Date:
April 09, 2015
Assignee:
MIMOS BERHAD (MY)
International Classes:
G06F21/64; G06F17/30
Domestic Patent References:
WO2006113707A22006-10-26
Foreign References:
US20100010979A12010-01-14
Other References:
None
Attorney, Agent or Firm:
ABDULLAH, Mohamad Bustaman (Jalan Selaman 1,Dataran Palm, Ampang Selangor, MY)
Claims:
CLAIMS

1. A system for processing data, comprising:

a data source (100) for providing data including a dirty set of data;

a staging unit (101) coupled to the data source (100) configured for staging the data and cleansing anomalies from the dirty set of data to generate a clean set of data, wherein the staging unit (101) is connectable to a feedback unit (102) providing feedback to the data source (100); and

a data target (103) coupled to the staging unit (101) configured for receiving the clean set of data; and

characterized in that,

the feedback unit (102) configured for providing corrective feedback based on a decision rule through a communication link connecting the data source (100), the staging unit (101), and the data target (103), comprising:

a repository (102a) configured for receiving the data uncleansed by the staging unit (101) and for registering the data in associated data tables;

a security management unit (102b) coupled to the repository (102a) configured for applying a security credential on the data routed to the data source (100) for cleansing, and for verifying the security credential upon receiving the cleansed data; and

a validation unit (102c) coupled to the security management unit (102b) configured for authenticating data structure of the cleansed data prior to propagating to the data target (103).

2. A system according to Claim 1, wherein the feedback unit (102) further comprising a queue management unit (102d) deployed to the repository (102a) configured for generating a data queue comprising the data uncleansed by the staging unit (101).

3. A system according to Claim 2, wherein the queue management unit (102d) can perform interoperable format conversion for the data.

4. A system according to Claim 1, wherein the feedback unit (102) further comprising a communication management unit (102e) configured for facilitating communication therein and allowing for two-way communication between the security management unit (102b) and the data source (100).

5. A system according to Claim 1, wherein the repository (102a) including a metadata repository configured for storing metadata whose structure reflects source queue name, security credential, data source mapping and staging mapping.

6. A system according to Claim 1, wherein the data can be registered into the data tables comprising error table, merge table and correction table in respect to a staging table generated by the staging unit (101).

7. A system according to Claim 1, wherein the security management unit (102b) assuring the correct cleansed data transmitted from the data source (100) enters the feedback unit (102).

8. A system according to Claim 1, wherein the validation unit (102c) assuring consistency of the data structure between the cleansed data received from the data source (100) and the data registered in the data tables.

9. A method of processing data, comprising:

providing data including a dirty set of data;

staging the data and cleansing anomalies from the dirty set of data; and generating a clean set of data;

characterized in that,

the method further comprising the steps of:

generating corrective feedback based on a decision rule upon receiving the data uncleansed by the staging and cleansing step;

registering the data in associated data tables;

applying a security credential on the data;

propagating the data to a data source (100) for cleansing the data;

verifying the security credential upon receiving the cleansed data from the data source (100);

validating data structure of the cleansed data; and propagating the cleansed data to a data target (103) containing the clean set of data.

10. A method according to Claim 9 including generating data tables comprising error table, merge table and correction table in respect to a staging table generated in the staging and cleansing step.

Description:
SYSTEM FOR PROCESSING DATA AND METHOD THEREOF

FIELD OF THE INVENTION

The present invention relates generally to an arrangement and method for data processing. More particularly, the present invention relates to an improved system for processing data comprising dirty data and to the method thereof.

BACKGROUND OF THE INVENTION

Data quality is a perception or an assessment of data's fitness for serving its purpose in a given context. It is an essential characteristic that determines the reliability of data for making decisions. The aspects of data quality may include accuracy, completeness, relevance, and consistency across data sources. The quality of large real-world data depends on various factors, notably the data source that supplies the data.

Data can be gathered from a variety of different data sources in various formats or conventions. Each data source may be either stored separately or integrated with others to form a single source point. It is noted that much effort has been made at the data source, or so-called front-end process, to reduce entry errors; however, errors in the data set are still common. Data cleansing, which detects and removes errors, inconsistencies or anomalies from the data, can be very laborious, time-consuming and prone to additional errors when performed manually. Hence, computational procedures for cleansing the data have been proposed to address this issue. Although such computational procedures perform well, their limitation to a unidirectional data flow, which routes from a data source to a staging unit and later to a data target, can significantly compromise data quality. In addition, data including dirty data requires manual feedback to the data source, due to the lack of accessibility therebetween, to perform data cleansing. Also, to be cleansed, the data disadvantageously has to be routed again over the same unidirectional data flow, which may result in duplicate and redundant collections of data. Therefore, a need exists for a system and method for processing data including dirty data that provides feedback in the system such that the data is cleansed and pushed into a data target through a communication link related to the feedback.

SUMMARY OF THE INVENTION

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later. Accordingly, the present invention provides a system for processing data.

The system comprises a data source, a staging unit and a data target. The data source is configured for providing data including a dirty set of data. The staging unit is coupled to the data source for staging the data and cleansing anomalies from the dirty set of data to generate a clean set of data. Preferably, the staging unit is connectable to a feedback unit providing feedback to the data source. The data target coupled to the staging unit is configured for receiving the clean set of data.

The system of the present invention can be characterized by the feedback unit that is configured for providing corrective feedback based on a decision rule through a communication link connecting the staging unit, the data source and the data target. The feedback unit preferably comprises a repository, a security management unit and a validation unit. The repository is configured for receiving the data uncleansed by the staging unit and for registering the data in associated data tables. The repository includes a metadata repository configured for storing metadata whose structure reflects source queue name, security credential, data source mapping and staging mapping. Preferably, the data can be registered into the data tables comprising an error table, a merge table and a correction table in respect to a staging table generated by the staging unit.

The security management unit coupled to the repository is configured for applying a security credential on the data routed to the data source for cleansing, and for verifying the security credential upon receiving the cleansed data. Preferably, the security management unit assures that the correct cleansed data transmitted from the data source enters the feedback unit. More preferably, if the cleansed data meets verification, the cleansed data is transmitted to the validation unit; otherwise, the cleansed data may be discarded.

The validation unit is coupled to the security management unit and is configured for authenticating the data structure of the cleansed data prior to propagating it to the data target. Preferably, the validation unit assures consistency of the data structure between the cleansed data received from the data source and the data registered in the data tables. More preferably, if the data structure meets authentication, the cleansed data is transmitted to the data target and the data tables are updated; otherwise, the cleansed data may be discarded.

The feedback unit further comprises a queue management unit. The queue management unit is deployed to the repository for generating a data queue comprising the data uncleansed by the staging unit. It is preferred that the queue management unit can perform interoperable format conversion for the data. The feedback unit further comprises a communication management unit that is configured for facilitating communication therein. It is preferred that the communication management unit allows for two-way communication between the security management unit and the data source.

In accordance with another aspect, the present invention provides a method of processing data. The method comprises providing data including a dirty set of data; staging the data and cleansing anomalies from the dirty set of data; and generating a clean set of data. The method can be characterized by the steps of generating corrective feedback based on a decision rule upon receiving the data uncleansed by the staging and cleansing step, i.e. the earlier step; registering the data in associated data tables; applying a security credential on the data; propagating the data to a data source for cleansing the data; verifying the security credential upon receiving the cleansed data from the data source; validating the data structure of the cleansed data; and propagating the cleansed data to a data target containing the clean set of data.
Preferably, the method includes generating data tables comprising an error table, a merge table and a correction table in respect to a staging table generated in the staging and cleansing step.

An advantage of the present invention is that the system and method provide corrective feedback to cleanse the dirty data uncleansed by the staging unit. In addition, the corrective feedback provides two-way communication between the data source and the feedback unit compared to the conventional computational procedures for data cleansing. Advantageously, such two-way communication in the present invention permits the data to be corrected or cleansed in-situ in the data source before routing back to the feedback unit and the data target through the established communication link instead of propagating in the unidirectional data flow that may cause data redundancy and duplication.

Another advantage of the present invention is that the system and method excel in improving and maintaining high data quality especially in terms of accuracy, reliability, validity and consistency. The present invention comprises a security management unit and a validation unit for assuring the correct cleansed data transmitted from the data source enters the feedback unit and assuring consistency of the data structure between the cleansed data received from the data source and the data registered in the data tables, respectively.

The foregoing and other objects, features, aspects and advantages of the present invention will become better understood from a careful reading of the detailed description provided herein below with appropriate reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

Figure 1 is a block diagram showing the system for processing data according to one embodiment of the present invention; and

Figure 2 is a flow diagram showing the method of processing data according to one embodiment of the present invention.

It is noted that the drawings may not be to scale. The drawings are intended to depict only typical aspects of the invention, and therefore should not be considered as limiting the scope of the invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention aims to provide system and method for processing data including dirty data which overcomes the shortcomings of the conventional data processing systems and methods that produce low data quality especially in terms of accuracy, reliability, consistency and validity. Preferably, the present invention provides corrective feedback to cleanse the dirty data uncleansed by the earlier stage(s) and device(s) in the conventional systems and methods where the data flow is unidirectional.

The term "accuracy" as used herein refers to a characteristic of data, where the data has the correct value, is valid and is attached to the correct record or database. This term also can be used to describe the closeness or agreement of a data value to the truth or an expected range of values.

The term "reliability" as used herein refers to a characteristic of data, where it means the ability of a system, process or parameter to do what is expected or specified under set of conditions and designated time interval without failure.

The term "consistency" as used herein refers to a characteristic of data, where different values conform to a similar style or pattern.

The term "validity" as used herein refers to correctness and reasonableness of data. This term may refer to the process of checking database to ensure that the information gathered from different data sources is clean, accurate and in a standard format and data structure.

According to one preferred embodiment, the system of the present invention comprises a data source 100, a staging unit 101 and a data target 103, as schematically shown in Figure 1. The data source 100 preferably provides inputs to the system. The inputs may include data comprising a dirty set of data that can be obtained from one single database. The data can also be obtained by consolidation, which involves aggregation and summarization of data from heterogeneous, i.e. varied or diverse, sources. The staging unit 101 comprises a data cleansing and staging server unit that can be configured for staging the data and cleansing anomalies from the dirty set of data, whereby a clean set of data is generated. The data target 103 coupled to the staging unit 101 is adapted to receive the clean set of data. The data propagates unidirectionally, i.e. flows in one direction, from the data source 100 through the staging unit 101 to the data target 103.

The system for processing data can be characterized by a feedback unit 102 deployed at the staging unit 101. The feedback unit 102 is configured for providing corrective feedback based on a decision rule. The corrective feedback can be achieved by way of establishing a communication link that connects the feedback unit 102 to the data source 100, the staging unit 101 and the data target 103. The decision rule may be a process of determining, on a predefined scheduled basis, whether the data meets the requirements for which it is intended. The decision rule, for example, may comprise specific criteria that are used to control and check the characteristic, i.e. cleanliness, of the data. In this embodiment, the decision rule is configured to determine whether the dirty data is cleansed. If the data is cleansed, the data, which contains a clean set of data, is propagated to the data target 103; if otherwise, the data which is uncleansed by the staging unit 101 is subjected to the feedback unit 102.

The feedback unit 102 comprises a repository 102a, a security management unit 102b and a validation unit 102c according to one preferred embodiment of the present invention. The repository 102a is configured to receive the data uncleansed by the staging unit 101, routed thereto as the data fails to meet the decision rule set out in the system. The data received by the repository 102a is registered into a plurality of data tables associated with the data processing. The repository 102a may comprise two sub-repositories, such as a metadata repository and a data repository, residing therein. The metadata repository comprises metadata containing data about the source database. The metadata repository preferably stores the metadata whose structure reflects source queue name, security credential, data source mapping and staging mapping. The metadata can also include names, definitions, logical and physical data structure, data integrity, data accuracy, and other data about the data source. The data repository, on the other hand, comprises data of facts and statistics containing errors or anomalies collected for the purpose of data processing, cleansing and storing.
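The routing performed by the decision rule can be sketched as follows. The patent only requires that the rule determine whether the data is cleansed; the concrete criteria below (required fields present, a well-formed card number) and all names are hypothetical illustrations of such criteria.

```python
import re

def decision_rule(record):
    """Return True if the record is considered clean, False otherwise."""
    # Criterion 1 (assumed): every required field must be present and non-empty.
    required = ("patient_name", "card_no", "gender")
    if any(not record.get(field) for field in required):
        return False
    # Criterion 2 (assumed): the card number must match an expected pattern.
    if not re.fullmatch(r"\d{9}", record["card_no"]):
        return False
    return True

def route(record, data_target, feedback_queue):
    """Propagate clean records to the data target; divert dirty ones to the feedback unit."""
    if decision_rule(record):
        data_target.append(record)      # clean set of data -> data target (103)
    else:
        feedback_queue.append(record)   # uncleansed data -> feedback unit (102)

data_target, feedback_queue = [], []
route({"patient_name": "John", "card_no": "212121217", "gender": "M"},
      data_target, feedback_queue)
route({"patient_name": "", "card_no": "abc", "gender": "F"},
      data_target, feedback_queue)
```

In the embodiment above this check runs on a predefined schedule; here it is shown per record for simplicity.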

The data can be registered into the data tables, such as an error table, a merge table and a correction table, in respect to a staging table generated by the staging unit 101. For example, the metadata can be assigned to the staging table. The staging table defines staging areas of the data in preparation for moving the data to the data target 103. The staging table may be an intermediate storage area for data processing that resides between the data source 100 and the data target 103. Preferably, there is no user updating or analysis performed on the data at the staging areas. The staging table collects changes that need to be applied to a correction table to synchronize it with the contents of the table. The correction table preferably stores the corrected or cleansed data received from the data source 100. The merge table is preferably configured to merge and store the data uncleansed by the staging unit 101, i.e. dirty data that is supposedly unique over the record or database. The error table preferably comprises validation rules associated with the staging table.

A queue management unit 102d is deployed at the repository 102a. The queue management unit 102d is configured to generate a data queue comprising the data uncleansed by the staging unit 101, based upon the staging table. It is preferred that the queue management unit 102d can perform interoperable format conversion for the data, which sets a standard data format for the system. The standard data format, for example, may be XML format or JSON format.
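The interoperable format conversion attributed to the queue management unit 102d can be sketched as below. XML and JSON are the formats named in the text; the flat record-to-element mapping and the table name are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

def record_to_xml(table_name, record):
    """Serialize a flat record as XML, with the staging-table name as root."""
    root = ET.Element(table_name)
    for column, value in record.items():
        ET.SubElement(root, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

def record_to_json(table_name, record):
    """Serialize the same record as JSON, keyed by the staging-table name."""
    return json.dumps({table_name: record})

record = {"Patient_Name": "John", "Card_No": "212121217"}
xml_form = record_to_xml("staging_patients", record)
json_form = record_to_json("staging_patients", record)
```

Either serialization carries the same content, so downstream units can consume a single standard format regardless of how the source supplied the data.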

The security management unit 102b connected to the repository 102a is configured to apply or impose a security credential on the data prior to routing it to the data source 100 for correction and cleansing. The security management unit 102b is also configured to verify the security credential of the cleansed data propagated thereto by the data source 100 when the correction and cleansing has been completed. The security credential of the data is preferably in conformity with the one stored in the metadata repository. The security management unit 102b may function as a security checkpoint that identifies and separates unverified cleansed data received from the data source 100. This is important so as to assure that the correct cleansed data transmitted from the data source 100 enters the feedback unit 102. If the cleansed data meets the security credential verification, the data is transmitted to the next unit, which is the validation unit 102c, for data validation. The cleansed data which fails the verification may be discarded. This embodiment allows for monitoring and maintaining of the data quality, particularly in terms of its accuracy, reliability and consistency. In one embodiment, the security management unit 102b may comprise a security stamping module which applies or imposes a security credential on the data, and a data monitoring and verification module that checks and verifies the data based on the security credential. The security management unit 102b also can comprise a memory module for storing security credentials of the data and validation rules associated with the security credentials.
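The stamping and verification roles described above can be sketched as follows. The patent does not specify a credential mechanism, so the keyed HMAC over a record identifier used here is purely an illustrative assumption, as are the key and identifier names.

```python
import hashlib
import hmac

SECRET_KEY = b"feedback-unit-secret"  # hypothetical key held by the feedback unit (102)

def apply_credential(record_id):
    """Stamp the data routed to the data source (100) with a security credential."""
    return hmac.new(SECRET_KEY, str(record_id).encode(), hashlib.sha256).hexdigest()

def verify_credential(record_id, credential):
    """Verify the credential of cleansed data returned by the data source (100)."""
    return hmac.compare_digest(apply_credential(record_id), credential)

credential = apply_credential("staging_patients:1")
```

A credential bound to the record identifier, rather than the record contents, survives in-situ correction at the data source while still letting the checkpoint reject data it never dispatched.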

Advantageously, the system for processing data according to the present invention prevents the data to be corrected or cleansed from propagating through the unidirectional flow from the data source 100 to the staging unit 101 and later to the data target 103. Instead, via the communication link established for corrective feedback, the system permits the corrected or cleansed data to be routed back to the security management unit 102b. Hence, the present invention, which incorporates corrective feedback into the system, substantially avoids duplication and redundancy in the collection of data.

Upon the security credential verification, the cleansed data is transferred to the validation unit 102c. The validation unit 102c is configured for authenticating the data structure of the cleansed data. In this embodiment, the data structure of the cleansed data is considered valid if the structure conforms to the one registered in the metadata repository. The data structure is preferably constructed based on the name of the staging table as its root, with records and column names as sub-elements. To identify and authenticate the data structure, the format of the data, i.e. XML format or JSON format, is transformed to object data having the structure described earlier. Preferably, the validation unit 102c assures consistency of the data structure between the cleansed data received from the data source 100 and the data stored in the data tables. If the data structure meets the authentication, the cleansed data is transmitted to the data target 103 for storing. The authenticated cleansed data is also propagated to the repository 102a for updating of the data tables so as to avoid data redundancy and duplication. The cleansed data which fails the authentication may be discarded.
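The structure authentication described above can be sketched as follows: the cleansed data (here JSON) is transformed into an object whose root is the staging-table name and whose sub-elements are column names, then compared against the structure held in the metadata repository. The registered table and column names are hypothetical.

```python
import json

REGISTERED_STRUCTURE = {  # as it might be stored in the metadata repository
    "staging_patients": {"Staging_no", "Patient_Name", "Card_No", "Gender"},
}

def authenticate_structure(json_text):
    """Accept cleansed data only if its root and columns match the registered structure."""
    obj = json.loads(json_text)
    (root, fields), = obj.items()          # expect exactly one root element
    expected = REGISTERED_STRUCTURE.get(root)
    return expected is not None and set(fields) == expected

valid = ('{"staging_patients": {"Staging_no": 1, "Patient_Name": "John", '
         '"Card_No": "212121217", "Gender": "M"}}')
invalid = '{"staging_patients": {"Patient_Name": "John"}}'
```

Data passing this check would be forwarded to the data target 103 and mirrored back to the repository 102a; data failing it may be discarded, as the text states.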

In one embodiment, the validation unit 102c may comprise a data structure authentication module that checks and authenticates the data based on its structure. Essentially, the validation unit 102c comprises a memory module for storing data structures of the data and authentication rules associated with the data structure.

The system of the present invention further comprises a communication management unit 102e. The communication management unit 102e is configured to facilitate system communication internally and externally. It is preferred that the internal communication includes communication within the feedback unit 102. The external communication preferably includes communication outside of the feedback unit 102, such as communication with the data source 100, the staging unit 101 and the data target 103. In this embodiment, the communication management unit 102e allows for two-way communication between the security management unit 102b and the data source 100. The communication management unit 102e can comprise a signal generating module for generating a command signal. The command signal may be distributed over the system to manage and control communication therein. For example, a command signal is generated to instruct the staging unit 101 to transfer the cleansed data to the data target 103.
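The command signals distributed by the communication management unit 102e can be sketched as simple messages. The message shape, field names and the example action are assumptions; the patent only gives the example of instructing the staging unit 101 to transfer cleansed data to the data target 103.

```python
from dataclasses import dataclass, field

@dataclass
class CommandSignal:
    source: str           # issuing unit, e.g. the communication management unit (102e)
    target: str           # addressed unit, e.g. the staging unit (101)
    action: str           # instruction to carry out
    payload: dict = field(default_factory=dict)

def issue_transfer_command(record):
    """Generate the example command from the text: instruct the staging unit
    to transfer cleansed data to the data target (103)."""
    return CommandSignal(
        source="communication_management_unit",
        target="staging_unit",
        action="transfer_to_data_target",
        payload={"record": record},
    )

signal = issue_transfer_command({"Patient_Name": "John"})
```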

The method of processing data according to the present invention will be described by way of examples, with reference to Figure 2.

Example 1

In operation, the method is initialized by the step of providing data comprising a dirty set of data. The data is provided by the data source 100 and can be obtained from heterogeneous, i.e. varied or diverse, sources. In the next step, the data is staged based on a staging table and errors or anomalies are corrected or removed from the dirty set of data; thus, a clean set of data is generated. The clean set of data is next propagated to the data target 103. A decision rule is then applied to the data. The cleansed data is transferred to the data target 103 while the data uncleansed by the earlier steps is routed to the feedback unit 102, where corrective feedback is generated. Such data can be registered into the data tables associated with the data processing and cleansing. The data tables may include an error table, a merge table and a correction table. For correction or cleansing, a security credential is applied or imposed on the data by the security management unit 102b and the data is next transmitted to the data source 100.

The data cleansed by the data source 100 routes back to the feedback unit 102. The security management unit 102b verifies the security credential of the data prior to transmitting it to the validation unit 102c. If the security credential meets the verification, the cleansed data is propagated to the validation unit 102c; if otherwise, the data can be discarded. At the validation unit 102c, the data structure of the data is authenticated. If the data structure meets the authentication, the cleansed data is transmitted to the data target 103 and the data tables are updated; if otherwise, the data can be discarded.
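The flow of Example 1 can be sketched end to end. All names are illustrative, and the cleansing performed at the data source is mocked as a callback that fills a missing field; the credential and structure checks sketched earlier are collapsed into a single re-check for brevity.

```python
def process(records, cleanse_at_source):
    """Stage records, route dirty ones through the feedback loop, return the target."""
    data_target, feedback_queue = [], []
    # Staging and cleansing step with decision rule:
    for record in records:
        if all(record.values()):              # assumed rule: no empty fields
            data_target.append(record)        # clean -> data target (103)
        else:
            feedback_queue.append(record)     # uncleansed -> feedback unit (102)
    # Corrective feedback: route to the data source, then back through the feedback unit.
    for record in feedback_queue:
        cleansed = cleanse_at_source(record)  # in-situ correction at the source (100)
        if all(cleansed.values()):            # re-check before the data target
            data_target.append(cleansed)
    return data_target

records = [
    {"Patient_Name": "John", "Gender": "M"},
    {"Patient_Name": "Mary", "Gender": ""},   # dirty record: missing gender
]
# Hypothetical correction supplied by the data source:
cleaned = process(records, lambda r: {**r, "Gender": "F"})
```

The dirty record re-enters the data target only after the feedback loop, rather than being re-sent over the original unidirectional flow.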

Example 2

The data provided by the data source 100 is registered into a metadata repository and a data repository, results of which are shown in the following tables.

Metadata Repository

Table 1: A metadata repository table storing data about the source database.

Data repository

Table 3: A staging table demonstrating data of patients extracted from the data source 100.

Staging no. | Patient_Name | Card_No | Gender

(The rows of Table 3 are not legibly recoverable from the source.)

Table 4: An error table demonstrating validation rules associated with the staging table.

Table 5: A merge table demonstrating detected errors or anomalies of data

Table 6: A correction table demonstrating corrected or cleansed data through the feedback unit 102.

While this invention has been particularly shown and described with reference to the exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention as defined by the appended claims.