Methodology for Multi-Lingual
Global E-Governance programs have brought into sharp focus the need to manipulate data efficiently in multiple languages.
Survey results indicate that the demographics of the Internet are steadily becoming multilingual. The share of Internet users who are not native English speakers has grown from about half in the mid-1990s to about two-thirds now, and it is assumed that the majority of Internet information will be multilingual by 2010. It has been found that a user is likely to stay twice as long at a site, and is four times more likely to buy a product or consume a service, if the information is presented in their native language. Hence, it is important that information systems support efficient handling of multilingual data.
Efficient multilingual data management requires overcoming the limited multilingual data handling capability of existing databases. Doing so enables better searching and browsing in different languages, access to information stored in different languages, and faster globalization of businesses.
With the ever-growing importance of data quality in growth markets, many companies have an immediate need to cleanse unstructured data. However, one of the challenges in this exercise is language. Territories such as continents, where multiple languages are in use, must handle linguistic data effectively. Each country has a different official language, and data is available in both English and local languages.
To get the most out of this article, you should have basic skills in designing and running ETL and data quality jobs from Data Stage and Quality Stage Designer. ETL (extract, transform, load) refers to three database functions combined into one tool that pulls data out of one database and places it into another; extract is the process of reading data from a database. In addition, you should have:
- Understanding of the structure of a standardization rule set.
Once the investigation report (token report) has been generated, the task of the rule set developer is to use this report as a reference and enhance the quality of the standardization rule set. The standardization rule set is composed of a classifications file, a patterns file, tables, and a dictionary file. All variations of a token must be added to the classifications file, including spelling variations and lingual variations.
- Understanding of the semantics of input language.
This article explains the data cleansing operation with examples in which the input is multilingual. The output is used to enhance the quality of standardization by adding classifications and standardization rules.
- Pattern action coding skills.
New pattern action rules are written in the patterns file to handle the patterns in incoming data.
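The interplay between the classifications file and the patterns file can be illustrated with a minimal Python sketch. This is not rule set syntax from any product: the table contents, the class symbols (`T`, `D`, `^`, `+`), the two pattern rules, and the function names `classify` and `standardize` are all hypothetical, chosen only to show how spelling and lingual variants of a token map to one standard form, and how a rule then acts on the resulting pattern.

```python
# Hypothetical classification table: token variant -> (standard form, class).
# Class symbols are illustrative: T = street type, D = direction.
CLASSIFICATIONS = {
    "STREET": ("ST", "T"),
    "STR": ("ST", "T"),     # spelling variation
    "CALLE": ("ST", "T"),   # Spanish variation
    "RUE": ("ST", "T"),     # French variation
    "NORTH": ("N", "D"),
    "NORTE": ("N", "D"),    # Spanish variation
}


def classify(token):
    """Return (standard form, class) for a single token."""
    key = token.upper()
    if key in CLASSIFICATIONS:
        return CLASSIFICATIONS[key]
    if key.isdigit():
        return key, "^"     # numeric token
    return key, "+"         # unclassified word


def standardize(text):
    """Tokenize, build the pattern string, and apply the matching rule."""
    classified = [classify(tok) for tok in text.split()]
    values = [std for std, _ in classified]
    pattern = " ".join(cls for _, cls in classified)
    result = {"pattern": pattern}
    # Hypothetical rule: number, word, street type ("12 Main Street").
    if pattern == "^ + T":
        result.update(house_number=values[0], street_name=values[1],
                      street_type=values[2])
    # Hypothetical rule for type-first word order ("Calle Mayor 12").
    elif pattern == "T + ^":
        result.update(house_number=values[2], street_name=values[1],
                      street_type=values[0])
    return result
```

In this sketch, supporting a new language mostly means adding rows to the classification table; a new rule is needed only when the language puts tokens in an order that no existing pattern covers.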
Pre-processing multilingual input through paraphrasing
The first step in a data quality process is to bring about consistency in the input. In emerging economies where the official language of government is not English, inconsistencies in the input are likely. By running a paraphrasing job, you can bring the input data to a standard format for further action.
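As a rough illustration of what such a paraphrasing step might do, the Python sketch below normalizes the text and replaces local-language phrases with one standard form. The phrase table, the chosen standard form, and the function names are assumptions made for this example, not part of any particular tool.

```python
import re
import unicodedata


def _normalize(text):
    """Unify Unicode representation, whitespace, and letter case."""
    # NFC composes base letters and combining accents into single characters.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace, trim the ends, and fold case.
    return re.sub(r"\s+", " ", text).strip().casefold()


# Hypothetical paraphrase table: local-language phrase -> standard form.
# Keys go through the same normalization as the input so they always match.
PARAPHRASES = {
    _normalize(phrase): standard
    for phrase, standard in {
        "Société Anonyme": "limited company",
        "Sociedad Anónima": "limited company",
        "Gesellschaft mit beschränkter Haftung": "limited company",
    }.items()
}


def paraphrase(text):
    """Bring one input value to a consistent, standard form."""
    text = _normalize(text)
    # Try longer phrases first so a short phrase cannot pre-empt a longer one.
    for phrase in sorted(PARAPHRASES, key=len, reverse=True):
        text = text.replace(phrase, PARAPHRASES[phrase])
    return text
```

Plain substring replacement is used here for brevity; a production job would more likely match on token boundaries so that a phrase is never replaced inside a longer word.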
While existing databases provide some means of storing and querying multilingual data, they suffer from redundancy proportional to the number of languages supported.
We propose a system for multilingual data management in a distributed environment that stores data in encoded form, in an information-theoretic way, with minimum redundancy. Queries can be performed directly on the encoded data, and the result is obtained by decompressing it using the corresponding language dictionaries for text data, or without a dictionary for other data.
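The idea of querying encoded data and decoding only the result can be sketched in Python as follows. The sketch swaps in plain integer codes for whatever compression scheme the system actually uses, and the class `MultilingualStore`, its methods, and the two-column row layout are hypothetical names introduced only for illustration.

```python
class MultilingualStore:
    """Stores each text value once as an integer code.

    The words themselves live in per-language dictionaries, so adding a
    language adds dictionary entries but no new copies of the rows.
    """

    def __init__(self):
        self.rows = []     # encoded rows: (concept code, price)
        self.encode = {}   # (language, word) -> concept code
        self.decode = {}   # language -> {concept code: word}

    def add_term(self, code, translations):
        """Register one concept together with its word in each language."""
        for lang, word in translations.items():
            self.encode[(lang, word)] = code
            self.decode.setdefault(lang, {})[code] = word

    def insert(self, lang, product, price):
        """Store a row; text is encoded, the number is kept as-is."""
        self.rows.append((self.encode[(lang, product)], price))

    def query(self, lang, product, out_lang):
        """Match on the encoded rows, then decode the hits into out_lang."""
        code = self.encode[(lang, product)]
        hits = [row for row in self.rows if row[0] == code]
        return [(self.decode[out_lang][c], price) for c, price in hits]


store = MultilingualStore()
store.add_term(1, {"en": "apple", "fr": "pomme", "de": "Apfel"})
store.add_term(2, {"en": "bread", "fr": "pain", "de": "Brot"})
store.insert("en", "apple", 3)
store.insert("fr", "pain", 2)
store.insert("de", "Apfel", 5)
results = store.query("fr", "pomme", "en")
```

Here a query posed in French matches rows that were inserted in English and German, because all three are stored under the same code; only the final result passes through a language dictionary, and the numeric column never does.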
The system has been evaluated on both synthetic data and real data obtained from a real-life schema. We compared its performance with that of existing systems, and it outperformed them in terms of both space and time.
Efficient storage and query processing of data spanning multiple natural languages are of crucial importance in today’s globalized world. As the Internet has become a primary medium for information access and commerce, multilingual data management in a database environment is vital to making information available in the native language of Internet users.
Multilingual Data Management (MDM) involves three main considerations.
Firstly, there should be a technique by which the data is represented in a language-independent way.
Secondly, efficient paraphrasing is needed to translate among languages.
Thirdly, an efficient mechanism is required to perform different types of multilingual operations in a distributed database environment.
These are the crucial issues in multilingual data management. This perspective presents a system that stores multilingual data in a language-independent way so that database evolution is easier.
In this MDM approach, when information is provided in a specific language, the translator generates the corresponding information in the target language. Schema evolution, which is difficult in existing systems, is simpler in this system, which makes database consistency easier to maintain. Query performance is also significantly faster. Queries can be performed using a translator-based approach.