On Windows 10, I use osx.exe from OpenSP to convert an SGML file to XML.
The SGML contains HTML entities such as &ndash; and &eacute;, and many more.
The parser forces me to declare them:
reference to entity "ndash" for which no system identifier could be generated
So, in my DTD, I tried to declare them as follows:
<!ENTITY ndash "&#8211;">
But then I obtained this error:
"8211" is not a character number in the document character set
Finally, I tested adding the character itself:
<!ENTITY ndash "–">
And I obtained these errors:
non SGML character number 226
non SGML character number 8364
non SGML character number 8220
In reply to @imhotap, here is the SGML declaration supplied with my document:
<!SGML "ISO 8879:1986"
-- Basic SGML declaration using Reference Concrete Syntax --
CHARSET
BASESET "ISO 646-1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
CAPACITY SGMLREF
TOTALCAP 35000
ENTCAP 35000
ENTCHCAP 35000
ELEMCAP 35000
GRPCAP 35000
EXGRPCAP 35000
EXNMCAP 35000
ATTCAP 35000
ATTCHCAP 35000
AVGRPCAP 35000
NOTCAP 35000
NOTCHCAP 35000
IDCAP 35000
IDREFCAP 35000
MAPCAP 35000
LKSETCAP 35000
LKNMCAP 35000
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR "-."
UCNMCHAR "-."
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTCNT 250
ATTSPLEN 960
BSEQLEN 960
DTAGLEN 16
DTEMPLEN 16
ENTLVL 16
GRPCNT 250
GRPGTCNT 96
GRPLVL 16
LITLEN 900
NAMELEN 50
NORMSEP 2
PILEN 240
TAGLEN 960
TAGLVL 40
FEATURES
MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR NO SUBDOC NO FORMAL NO
APPINFO NONE>
I then declare the entities in the DTD as follows:
<!DOCTYPE gp [
<!ENTITY % MYDTD SYSTEM ".\my_dtd.dtd">
<!ENTITY ndash SDATA "–">
%MYDTD;
]>
How can I handle these HTML entities in the SGML-to-XML conversion?
It's difficult to tell when you're not including the SGML that osx complains about, but the error messages you're receiving occur because osx is assuming an incorrect document character set. Most probably, osx is told which document character set to assume by a so-called SGML declaration, or SGML declaration reference, at the beginning of your file. Theoretically, though, it's also possible that osx assumes a different default character set on Windows machines or in your particular locale, is given a Windows-style byte order mark, or derives an SGML declaration via catalog resolution rules.
Or at least, osx doesn't complain on my Unix machine when run with the following test document using its implicit SGML declaration defaults:
For a detailed explanation of SGML declarations, see e.g. https://sgmljs.net/docs/sgmlrefman.html#sgml-declaration. Note that sgmljs.net SGML supports ISO 8879 Annex K (aka WebSGML) and makes use of predefined entities for HTML in the SGML declaration, as described in https://sgmljs.net/docs/w3c-html51-sgmldecl.html. OpenSP's osx, however, doesn't (fully) support WebSGML, so you need to declare these as entities in the DTD, just like you're already doing. You may be able to sidestep your problem by declaring these as SDATA entities to make the error messages go away; that is, by declaring these as:
If that doesn't work, or the resulting output file causes trouble, you could include the following SGML declaration, taken from https://sgmljs.net/docs/sgmlrefman.html#sgml-declaration-for-html5, as the first thing in your SGML. The important part is the line 160 55136 160 in the DESCSET (described character set) section, which tells the SGML parser that the 55136 UCS code points starting at 160 (that is, 160 through 55295) are allowed in the document. Note that the BASESET is assumed to be UTF-8, which might or might not match your document data. Moreover, this SGML declaration switches on tag inference, attribute name omission, and other options appropriate for HTML but not necessarily for your SGML; I have no way of telling.
Update: Based on the SGML declaration you specified in your update, here's an edited version of it that allows ndash and other UCS code points above 128. I've edited the base character set as in my earlier example above and added the described set ranges once again, but have otherwise copied the details of your SGML declaration.
Keep in mind that you're basically expanding the character set of your document. If you don't want to do that, you can map ndash to a plain U+002D HYPHEN-MINUS (-) instead and leave your SGML declaration as it is, keeping your document character set within 7-bit ASCII (i.e. the "IRV" code set).
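As a side note on the three "non SGML character number" errors in the question (226, 8364, 8220): they match what you get when the UTF-8 bytes of an en dash are read as Windows-1252. The Python sketch below only illustrates that arithmetic. It assumes, which the question does not confirm, that the file is saved as UTF-8 and that osx on Windows reads it using code page 1252.

```python
# The en dash (U+2013) as it would be stored in a UTF-8 encoded file.
utf8_bytes = "\u2013".encode("utf-8")

# Assumption: the parser decodes the file with Windows-1252 instead of UTF-8,
# so each of the three bytes becomes a separate character.
misread = utf8_bytes.decode("cp1252")

# Character numbers the parser would then see for that single en dash.
char_numbers = [ord(ch) for ch in misread]

print(list(utf8_bytes))  # [226, 128, 147]
print(char_numbers)      # [226, 8364, 8220]
```

If that is what is happening here, the encoding osx uses to read the file matters in addition to the DESCSET ranges: the parser has to decode the bytes as one en dash before it can check code point 8211 against the document character set.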