EDN Admin
Well-known member
I am currently trying to load XML files as plain well-formed XML (without validation or anything) so that I can run a few XPath queries. Most if not all of those files reference either DTDs via a DOCTYPE declaration or XML Schemas via a namespace (or noNamespaceSchemaLocation).
It is not possible to include each and every DTD/XSD that might be referenced from any of the files, so I simply disabled validation and set the XmlResolver to null to keep XmlReader from trying to find them.
However, by doing so, I run into problems when a document contains named character entities (such as &trade; for the trademark symbol), since they would typically be defined in the referenced DTD or Schema. As a result, parsing fails (obviously), since a document with an unresolved entity cannot even be guaranteed to be well-formed.
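The failure described above can be reproduced with a minimal sketch like the following. It assumes .NET 4 or later, where `DtdProcessing` replaced the older `ProhibitDtd` setting; the class name, the sample document and the `doc.dtd` system identifier are made up for illustration and do not come from the original files.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EntityRepro
{
    // Hypothetical sample: the DOCTYPE points at an external DTD that is
    // never fetched, and the body uses an entity that DTD would declare.
    public const string Sample =
        "<!DOCTYPE doc SYSTEM \"doc.dtd\">" +
        "<doc>Acme&trade; widgets</doc>";

    // Reads the whole document without validation and without resolving
    // external resources. Returns the error message, or null on success.
    public static string TryRead(string xml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,   // accept the DOCTYPE itself
            ValidationType = ValidationType.None,  // no DTD/XSD validation
            XmlResolver = null                     // never fetch external DTDs
        };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            // Expected for the sample: the entity was never declared.
            return ex.Message;
        }
    }
}
```

Calling `EntityRepro.TryRead(EntityRepro.Sample)` should return an error message rather than null, while a document without entity references reads through cleanly.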
I have all possible entities (at least the ones I know might appear in those documents) as DTD fragments (and possibly as XSD fragments, in case DTD is not enough) and could include them - luckily the ones referenced from the DTDs/Schemas are usually standard ISO entities, and thus the same for all of them.
I tried quite a few things to see if I can at least find out how XmlReader resolves the entities and how I could influence the result:
- I tried looking at a Reflector'd version of XmlReader (and XmlTextReader, and XmlTextReaderImpl and what-not), the open-source version from Mono and the MindTouch implementation of SgmlReader - none of them really helped me (mostly due to complexity).
- I tried implementing a custom XmlReader (and corresponding XmlWriter) to catch XmlNodeType.EntityReference and replace them by &entity; (and vice-versa in the writer) - this works to some degree, but breaks entities that are defined inline, directly in the internal subset of the DOCTYPE, making this solution rather pointless for things that I'd actually want resolved because we could.
- I tried messing around with the rest of XmlReader (especially ResolveEntity) to see if I could change its outcome, but everything seems to be handled internally or delegated to non-public classes, and the method simply returns void, doing its black magic hidden from plain sight.
- I tried implementing a custom XmlResolver that attempts to feed all my known ISO entity files to the XmlReader when something is requested - this failed because the XmlResolver was never called (for reasons I don't know; maybe it would have worked?).
- I tried overriding the SchemaInfo behavior by passing an XmlParserContext with my custom InternalSubset which defines all entities - this fails as soon as the original document contains its own DOCTYPE declaration ("Cannot have document with multiple DOCTYPE definitions" or something like that).
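For reference, the last attempt in the list (an XmlParserContext carrying entity declarations) can be sketched roughly as below. This is not the original code: the class name, the `rootName` parameter and the single `trade` declaration standing in for the ISO entity fragments are assumptions made for illustration. Whether the context is honored, or rejected with a DOCTYPE conflict as described above, depends on whether the input document brings its own DOCTYPE.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ContextEntities
{
    // Illustrative stand-in for the ISO entity fragments (one entity only).
    public const string EntityDecls = "<!ENTITY trade \"&#x2122;\">";

    // Reads the document with an XmlParserContext whose internal subset
    // declares the entities. Returns the concatenated text nodes, or a
    // string starting with "error: " if the reader rejects the input.
    public static string ReadText(string xml, string rootName)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            ValidationType = ValidationType.None,
            XmlResolver = null
        };
        // Arguments: name table, namespace manager, DOCTYPE name, public id,
        // system id, internal subset, base URI, xml:lang, xml:space.
        var context = new XmlParserContext(
            null, null, rootName, null, null, EntityDecls,
            string.Empty, string.Empty, XmlSpace.None);

        var text = new StringBuilder();
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings, context))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Text)
                        text.Append(reader.Value);
                }
            }
            return text.ToString();
        }
        catch (XmlException ex)
        {
            return "error: " + ex.Message;
        }
    }
}
```

Comparing `ContextEntities.ReadText("<doc>Acme&trade;</doc>", "doc")` with the same call on a document that starts with its own `<!DOCTYPE ...>` shows the difference between the two cases.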
All that messing around basically gets me back to zero. Searching the internet leaves me with the impression that I'm the first one trying to do something like that, or at least the first one who dares to ask.
Could anyone shed some light on what happens, how I could affect it, or general hints on how to achieve this?
Regards, BhaaL