Correction Modules
Modules for the XML Correction Manager
When the user starts the correction of an XML file, a specific correction configuration has to be selected. The web interface of the Correction Manager offers a list of all correction config files in the correction config directory. A correction config file is an XML file which is valid against a specific schema. It specifies which correction modules are called, in what order, and with what parameters. The structure of such a file is relatively simple:
<modules xmlns="http://rebind.bgbm.org/modules" name="configuration-name">
<module name="module-name" description="module-description">
<setting name="module-setting-name" value="module-setting-value"/>
</module>
<setting name="general-setting-name" value="general-setting-value"/>
</modules>
The root element is modules. Its name attribute is optional. It is the name under which the configuration is displayed in the web interface of the Correction Manager.
There can be several module elements within the modules element.
Each module element must have a name attribute which specifies the module to be loaded. This should be either the fully qualified name of a Java class which implements the Module interface or the name specified by the getName() method of that class.
The description attribute is optional and is used to distinguish different instances of the same module which are run with different settings.
Each module element can have any number of setting elements. Each setting element has the mandatory attributes name and value. Which settings each module uses is specified in the module descriptions below.
The modules element can also have setting elements. These general settings are accessible to every module as well and are overridden if a setting with the same name is specified in the module element.
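The interplay of general and module-level settings can be sketched as follows. This is only an illustration built from the settings documented for the ElementTextReplacer below; the configuration name and the descriptions are made up, and the general settings are placed after the module elements simply because the template above does so.

```xml
<modules xmlns="http://rebind.bgbm.org/modules" name="example-configuration">
  <!-- No splitter of its own: falls back to the general splitter ';' -->
  <module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="uses the general settings">
    <setting name="address" value="abcd:Sex"/>
    <setting name="key" value="female;male;hermaphrodite"/>
    <setting name="value" value="F;M;X"/>
  </module>
  <!-- Own splitter: overrides the general splitter for this module only -->
  <module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="overrides the general splitter">
    <setting name="address" value="abcd:Rank"/>
    <setting name="key" value="[f.],[var.]"/>
    <setting name="value" value="f.,var."/>
    <setting name="splitter" value=","/>
  </module>
  <!-- General settings, visible to every module unless overridden -->
  <setting name="isBatch" value="true"/>
  <setting name="splitter" value=";"/>
</modules>
```

The modules run in document order, so the abcd:Sex correction is applied before the abcd:Rank correction.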
ElementTextReplacer
What it does: Replaces the text content of specific elements according to specific rules.
Full Name: org.bgbm.rebind.correction.modules.ElementTextReplacer
Settings:
- address
- The name (including the namespace prefix) of the element whose text should be replaced, or an XPath expression pointing to the element. Whether the value is interpreted as a name or as an XPath expression depends on the setting isXPath.
- Mandatory: yes
- Example Values: abcd:Sex or //abcd:RecordBasis[matches(.,'^Specimen$')]
- isXPath
- A flag indicating whether the address setting contains an XPath expression or just the name of an element.
- Mandatory: no
- Default Value: false
- Allowed Values: true or false
- key
- The part of the content that should be replaced. This can be either plain text or a RegEx, depending on the setting isRegEx. Regardless of whether it is plain text or a RegEx, it can contain several keys to be replaced or just one, depending on the setting isBatch. If batch mode is used, the character or string which separates the individual keys can be specified in the setting splitter.
- Mandatory: yes
- Example Values:
Hello World (plain text)
(H[ea]llo World)(\!?) (RegEx)
Hello World;Lorem Ipsum (plain text, batch mode with ';' as splitter)
H[ea]llo World\!?;[Ll]orem [iI]psum (RegEx, batch mode with ';' as splitter)
- value
- The new content with which the content specified in key will be replaced. This can be either plain text or a RegEx replacement string (which may reference groups such as $1), depending on the setting isRegEx. In either case it can contain several values or just one, depending on the setting isBatch. If batch mode is used, the character or string which separates the individual values can be specified in the setting splitter, and the key fragments are replaced by the corresponding value fragments (e.g. the third key fragment is replaced by the third value fragment). The number of fragments should therefore be the same for key and value; otherwise the replacement stops after the number of fragments in the shorter of the two.
- Mandatory: yes
- Example Values:
Hello World again (plain text)
$1 again $2 (RegEx)
Hello World again;Lorem ipsum dolor sit amet (plain text, batch mode with ';' as splitter)
$1 again $2;$& dolor sit amet (RegEx, batch mode with ';' as splitter)
- isRegEx
- A flag indicating whether the key and value settings are regular expressions or just plain text.
- Mandatory: no
- Default Value: false
- Allowed Values: true or false
- isBatch
- A flag indicating whether the key and value settings contain just one fragment which is supposed to be replaced, or several. If it is true, the character or string which separates the different fragments of key and value can be specified in the setting splitter.
- Mandatory: no
- Default Value: false
- Allowed Values: true or false
- splitter
- The character or string at which the key and value settings are broken into their fragments if they are in batch mode. The splitting is done with the function String.split(String), which interprets its parameter as a regular expression. This can cause errors when the splitter contains characters with syntactical meaning in RegEx. For example, <setting name="splitter" value="."/> matches every character, so only empty fragments are returned.
- Mandatory: no
- Default Value: ;
- Example Values: , or \.
Examples:
<module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="replaces 'Specimen'">
<setting name="address" value="//abcd:RecordBasis[matches(.,'^Specimen$')]"/>
<setting name="isXPath" value="true"/>
<setting name="key" value="^(Specimen)$"/>
<setting name="value" value="Preserved$1"/>
<setting name="isRegEx" value="true"/>
<setting name="isBatch" value="false"/>
<setting name="splitter" value=";"/>
</module>
<module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="corrects abcd:Sex">
<setting name="address" value="abcd:Sex"/>
<setting name="isXPath" value="false"/>
<setting name="key" value="female;male;hermaphrodite"/>
<setting name="value" value="F;M;X"/>
<setting name="isRegEx" value="false"/>
<setting name="isBatch" value="true"/>
<setting name="splitter" value=";"/>
</module>
<module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="corrects abcd:Rank">
<setting name="address" value="abcd:Rank"/>
<setting name="isXPath" value="false"/>
<setting name="key" value="[f.];[subvar.];[var.]"/>
<setting name="value" value="f.;subvar.;var."/>
<setting name="isRegEx" value="false"/>
<setting name="isBatch" value="true"/>
<setting name="splitter" value=";"/>
</module>
<module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="corrects abcd:Rank">
<setting name="address" value="//abcd:Rank[matches(.,'^(f|var)$')]"/>
<setting name="isXPath" value="true"/>
<setting name="key" value="^f$;^var$"/>
<setting name="value" value="f.;var."/>
<setting name="isRegEx" value="true"/>
<setting name="isBatch" value="true"/>
<setting name="splitter" value=";"/>
</module>
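Because the splitter is treated as a regular expression, a literal dot has to be escaped. The following is a minimal sketch of a batch replacement that uses '.' as the separator; the description text is illustrative, and the settings isXPath and isRegEx are omitted so that their documented default (false) applies.

```xml
<module name="org.bgbm.rebind.correction.modules.ElementTextReplacer" description="batch replacement with an escaped '.' as splitter">
  <setting name="address" value="abcd:Sex"/>
  <setting name="key" value="female.male.hermaphrodite"/>
  <setting name="value" value="F.M.X"/>
  <setting name="isBatch" value="true"/>
  <!-- "\." splits at a literal dot; an unescaped "." would match every character -->
  <setting name="splitter" value="\."/>
</module>
```

The same caution applies to other characters with a special meaning in regular expressions, such as | or *.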
EmptyElementDeleter
What it does: Removes empty elements, i.e. elements which have neither text content (other than whitespace) nor child elements nor attributes. Currently there is one hardcoded exception regarding attributes: if the only attribute is abcd:language, the element is deleted as well. Such exceptions will be adjustable via the settings in the future.
Full Name: org.bgbm.rebind.correction.modules.EmptyElementDeleter
Settings: none
Example:
<module name="org.bgbm.rebind.correction.modules.EmptyElementDeleter" description="first iteration"/>
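The description "first iteration" in the example suggests that the module may be listed more than once in a configuration. The sketch below runs it twice, on the assumption (not confirmed by this documentation) that a single run does not remove parent elements which only become empty once their empty children have been deleted. The configuration name is made up.

```xml
<modules xmlns="http://rebind.bgbm.org/modules" name="remove-empty-elements">
  <!-- First pass: removes elements that are empty from the start -->
  <module name="org.bgbm.rebind.correction.modules.EmptyElementDeleter" description="first iteration"/>
  <!-- Second pass: assumed to catch parents emptied by the first pass -->
  <module name="org.bgbm.rebind.correction.modules.EmptyElementDeleter" description="second iteration"/>
</modules>
```

The differing descriptions make the two instances distinguishable, which is what the description attribute is for.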
ElementDeleter
What it does: Deletes specific elements, including all their content and child elements.
Full Name: org.bgbm.rebind.correction.modules.ElementDeleter
Settings:
- xpath
- The XPath address of the element(s) to be removed. Also works for attributes.
- Mandatory: yes
- Example Values: //abcd:LogoURI or //abcd:TelephoneNumber[abcd:Device="Fax"]
Examples:
<module name="org.bgbm.rebind.correction.modules.ElementDeleter" description="delete abcd:language attributes">
<setting name="xpath" value="//*/@abcd:language"/>
</module>
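An XPath expression that contains double quotes, such as the second example value above, cannot be placed unchanged inside a double-quoted XML attribute. The following sketch shows two ways to write it so that the config file stays well-formed; the description texts are illustrative.

```xml
<!-- Variant 1: single quotes inside the XPath expression -->
<module name="org.bgbm.rebind.correction.modules.ElementDeleter" description="delete fax numbers">
  <setting name="xpath" value="//abcd:TelephoneNumber[abcd:Device='Fax']"/>
</module>
<!-- Variant 2: double quotes escaped as &quot; -->
<module name="org.bgbm.rebind.correction.modules.ElementDeleter" description="delete fax numbers (escaped quotes)">
  <setting name="xpath" value="//abcd:TelephoneNumber[abcd:Device=&quot;Fax&quot;]"/>
</module>
```

Both variants yield the same XPath string after XML parsing, so only one of them is needed in a real configuration.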
ElementRenamer
What it does: Renames specific elements or attributes and, optionally, moves them into a different namespace.
Full Name: org.bgbm.rebind.correction.modules.ElementRenamer
Settings:
- xpath
- The XPath address of the element(s) to be renamed. Also works for attributes.
- Mandatory: yes
- Example Values: //abcd:LogoURI or //abcd:TelephoneNumber[abcd:Device="Fax"]
- newName
- The new name of the element, without namespace prefix.
- Mandatory: yes
- Example Values: newElementName
- useOldNamespace
- A flag indicating whether the old namespace (and namespace prefix) of the element should still be used after renaming.
- Mandatory: no
- Default Value: true
- Allowed Values: true or false
- newNamespace
- The namespace URL of the new namespace, if useOldNamespace is set to false.
- Mandatory: no
- Default Value: (empty string)
- Example Values: http://example.com/ns/xyz
- newNamespacePrefix
- The namespace prefix of the new namespace, if useOldNamespace is set to false. If the colon at the end is missing, it will be added automatically.
- Mandatory: no
- Default Value: (empty string)
- Example Values: xyz:
Examples:
<module name="org.bgbm.rebind.correction.modules.ElementRenamer" description="rename abcd:language attributes">
<setting name="xpath" value="//*/@abcd:language"/>
<setting name="newName" value="language"/>
<setting name="useOldNamespace" value="false"/>
<setting name="newNamespace" value=""/>
<setting name="newNamespacePrefix" value=""/>
</module>
<module name="org.bgbm.rebind.correction.modules.ElementRenamer" description="rename wrong ISO Dates">
<setting name="xpath" value="//abcd:ISODateTimeBegin[not(matches(.,'^(\d\d\d\d(\-(0[1-9]|1[012])(\-((0[1-9])|1\d|2\d|3[01])(T(0\d|1\d|2[0-3])(:[0-5]\d){0,2})?)?)?|\-\-(0[1-9]|1[012])(\-(0[1-9]|1\d|2\d|3[01]))?|\-\-\-(0[1-9]|1\d|2\d|3[01]))$'))]"/>
<setting name="newName" value="DateText"/>
<setting name="useOldNamespace" value="true"/>
</module>
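Neither example above moves an element into a new, non-empty namespace. A minimal sketch of that case, using only the example values from the settings list, might look like this; the description and the choice of abcd:LogoURI as the target are illustrative.

```xml
<module name="org.bgbm.rebind.correction.modules.ElementRenamer" description="move abcd:LogoURI into the xyz namespace">
  <setting name="xpath" value="//abcd:LogoURI"/>
  <setting name="newName" value="newElementName"/>
  <!-- false: the old namespace is dropped and the two settings below apply -->
  <setting name="useOldNamespace" value="false"/>
  <setting name="newNamespace" value="http://example.com/ns/xyz"/>
  <!-- the trailing colon may be omitted; it is added automatically -->
  <setting name="newNamespacePrefix" value="xyz"/>
</module>
```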
DummyModule
What it does: Waits 1-11 seconds before returning a quote from either the homicidal computer HAL 9000 from the movie "2001: A Space Odyssey" or the manically depressed robot Marvin from the book/movie "The Hitchhiker's Guide to the Galaxy". This module does not alter the XML code in any way; it only sends the quote back to the Correction Manager. It is only used for testing purposes.
Full Name: org.bgbm.rebind.correction.modules.DummyModule
Settings: none
Examples:
<module name="org.bgbm.rebind.correction.modules.DummyModule" description="just wait a bit for a snappy robot remark" />
<module name="org.bgbm.rebind.correction.modules.DummyModule" description="Play it once again, Marvin, for old times' sake." />
Work in Progress
These modules are still under development, so some of these descriptions might not reflect the current state.
ABCDDateCorrector
What it does: Checks the dates in the abcd:ISODateTimeBegin elements within abcd:Date and abcd:DateTime. If they are not formatted according to the ISO standard, it tries to parse and fix them; if that fails, it renames the element to abcd:DateText. Conversely, if there are any abcd:DateText elements which are correctly formatted or can be converted, it turns them into abcd:ISODateTimeBegin elements.
Full Name: org.bgbm.rebind.correction.modules.ABCDDateCorrector
Settings: none
Examples:
<module name="org.bgbm.rebind.correction.modules.ABCDDateCorrector" description="fixing the dates" />
SimpleCountryCodeChecker
What it does: Compares the content of the abcd:ISO3166Code element with a list of country codes provided by Java and warns if the code does not occur there. It also has some hardcoded exceptions for the commonly used but unspecified codes:
- ZZ: Unknown
- XA: Unknown or unspecified Africa
- XB: Unknown or unspecified Middle and South America
- XC: Unknown or unspecified Asia
- XD: Unknown or unspecified Australia and Oceania
- XE: Unknown or unspecified Europe
- XF: Unknown or unspecified North America
Full Name: org.bgbm.rebind.correction.modules.SimpleCountryCodeChecker
Settings: none
Examples:
<module name="org.bgbm.rebind.correction.modules.SimpleCountryCodeChecker" description="checking the country codes" />