Does anyone recognize this MARC JSON format?

Question

Asked by Vladimir Alexiev on 11 March 2018 at 11:58

Does anyone recognize this format (see paste at bottom)? It's from Répertoire de vedettes-matière (RVM). It's neither of these two:

I can program in Perl. Also posted at https://github.com/LibreCat/Catmandu-MARC/issues/88.

I can hack it with just JSON::XS, but I don't know how to deal with this weird accent encoding (a few sample lines shown, out of 325):

{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A

Here's the strange MARC JSON:

{
"rows" : [
{
    "RecordNumber" : "1",
    "Tag" : "LDR",
    "Indicators" : "",
    "Content" : "00533nz   2200205n  4500"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "001",
    "Indicators" : "\"  \"",
    "Content" : "201-0000001"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "005",
    "Indicators" : "\"  \"",
    "Content" : "20121025110000.0"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "008",
    "Indicators" : "\"  \"",
    "Content" : "790704\\nfanvnnbabn\\\\\\\\\\\\\\\\\\\\\\b\\ana\\\\\\\\\\\\"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "016",
    "Indicators" : "\\\\",
    "Content" : "$a0509B3366"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "035",
    "Indicators" : "\\\\",
    "Content" : "$a(ISM)8013850"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "035",
    "Indicators" : "9\\",
    "Content" : "$a201-0000001"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "040",
    "Indicators" : "\\\\",
    "Content" : "$aCaQQLa$bfre"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "150",
    "Indicators" : "\\\\",
    "Content" : "$aAlg{grave}ebres de Von Neumann"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "450",
    "Indicators" : "\\\\",
    "Content" : "$wnne$aVon Neumann, Alg{grave}ebres de"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "450",
    "Indicators" : "\\\\",
    "Content" : "$aW*-alg{grave}ebres"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "550",
    "Indicators" : "\\\\",
    "Content" : "$wg$aC*-alg{grave}ebres"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "550",
    "Indicators" : "\\\\",
    "Content" : "$wg$aEspace de Hilbert"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "697",
    "Indicators" : "\\\\",
    "Content" : "$amm."
}
,
{
    "RecordNumber" : "1",
    "Tag" : "750",
    "Indicators" : "\\7",
    "Content" : "$aVon Neumann, Alg{grave}ebres de$2ram"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "750",
    "Indicators" : "\\0",
    "Content" : "$aVon Neumann algebras"
}
]
}
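
For readers who want to work with this layout directly, here is a minimal sketch of grouping the flat "rows" array back into records. It assumes that a backslash in "Indicators" marks a blank (as in the MARCMaker mnemonic format) and that the literal quotes around the control-field indicators can be dropped. `group_rows` and the output hash keys are hypothetical names chosen for illustration, not part of any MARC module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP;    # core module; JSON::XS offers the same decode_json()

# Hypothetical helper: collect the flat rows into one entry per RecordNumber,
# keeping the records in the order in which they first appear.
sub group_rows {
    my ($json_text) = @_;
    my $rows = decode_json($json_text)->{rows};
    my (%fields_of, @order);
    for my $row (@$rows) {
        my $id = $row->{RecordNumber};
        push @order, $id unless exists $fields_of{$id};
        my $ind = $row->{Indicators};
        $ind =~ tr/"//d;     # control fields carry literal quotes around two spaces
        $ind =~ tr/\\/ /;    # assumption: a backslash stands for a blank indicator
        push @{ $fields_of{$id} }, {
            tag        => $row->{Tag},
            indicators => $ind,
            content    => $row->{Content},
        };
    }
    return [ map { +{ id => $_, fields => $fields_of{$_} } } @order ];
}

# Two rows copied from the sample above (single-quoted heredoc, so nothing is interpolated).
my $sample = <<'JSON';
{ "rows" : [
  { "RecordNumber" : "1", "Tag" : "001", "Indicators" : "\"  \"", "Content" : "201-0000001" },
  { "RecordNumber" : "1", "Tag" : "035", "Indicators" : "9\\", "Content" : "$a201-0000001" }
] }
JSON

for my $rec (@{ group_rows($sample) }) {
    printf "%s [%s] %s\n", $_->{tag}, $_->{indicators}, $_->{content}
        for @{ $rec->{fields} };
}
```

The "Content" values are left untouched here; the accent mnemonics in them still need a separate conversion step.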

ADDED: this accent encoding is from MARCMaker. I used the following:

use MARC::File::MARCMaker; # https://metacpan.org/pod/MARC::File::MARCMaker
# for some reason can't be found by module name, so use:
# cpanm http://www.cpan.org/authors/id/E/EI/EIJABB/MARC-File-MARCMaker-0.05.tar.gz
my $marc_charset = MARC::File::MARCMaker::usmarc_default();
$content = MARC::File::MARCMaker::_maker2char ($content, $marc_charset);

But when I test it, e.g. on this text https://github.com/gmcharlt/marc-perl/blob/e8e0ecc92946d6dcb3c2270706041a30eff0f68d/marc-marcmaker/t/marcmaker.t#L92, it just translates the accents/ligatures to XML entities. I tried opening the translated text in a browser: some entities are not interpreted, and none of them accents the following character. So I guess I now need to use some "XML to Unicode" module to finish the translation:

This a test of diacritics like the uppercase Polish L in
Ł´od´z, the uppercase Scandinavia O in &Ostrok;st, the
uppercase D with crossbar in Đuro, the uppercase Icelandic
thorn in Þann, the uppercase digraph AE in Ægir, the
uppercase digraph OE in Œuvres, the soft sign in
rech&softsign;, the middle dot in col·lecci´o, the musical
flat in F♭, the patent mark in Frizbee®, the plus or minus
sign in ±54%, the uppercase O-hook in B&Ohorn;, the
uppercase U-hook in X&Uhorn;A, the alif in
mas&mlrhring;alah, the ayn in &mllhring;arab, the lowercase
Polish l in Włocław, the lowercase Scandinavian o in
K&ostrok;benhavn, the lowercase d with crossbar in đavola,
the lowercase Icelandic thorn in þann, the lowercase digraph
ae in være, the lowercase digraph oe in cœur, the lowercase
hardsign in s&hardsign;ezd, the Turkish dotless i in masalı,
the British pound sign in £5.95, the lowercase eth in
verður, the lowercase o-hook (with pseudo question mark) in
S&hooka;&ohorn;, the lowercase u-hook in T&uhorn; D&uhorn;c,
the pseudo question mark in c&hooka;ui, the grave accent in
tr`es, the acute accent in d´esir´ee, the circumflex in
cˆote, the tilde in ma˜nana, the macron in T¯okyo, the breve
in russki˘i, the dot above in ˙zaba, the dieresis (umlaut)
in L¨owenbr¨au, the caron (hachek) in ˇcrny, the circle
above (angstrom) in ˚arbok, the ligature first and second
halves in d&llig;i&rlig;ad&llig;i&rlig;a, the high comma off
center in rozdel&rcommaa;ovac, the double acute in
id˝oszaki, the candrabindu (breve with dot above) in
Ali&candra;iev, the cedilla in ¸ca va comme ¸ca, the right
hook in viet˛a, the dot below in te&dotb;da, the double dot
below in &under;k&under;hu&dbldotb;tbah, the circle below in
Sa&dotb;msk&ringb;rta, the double underscore in
&dblunder;Ghulam, the left hook in Lech Wał&commab;esa, the
right cedilla (comma below) in khŗong, the upadhmaniya (half
circle below) in &breveb;humantuˇs, double tilde, first and
second halves in &ldbltil;n&rdbltil;galan, high comma
(centered) in g&commaa;eotermika.

**jorol** · Answer 1 · 2018-03-12T13:18:50.983000

This is an encoding problem. The record leader says that the data is encoded in MARC-8 (leader position 09 is blank), whereas your JSON data should be encoded in UTF-8. _maker2char() uses the table from usmarc_default(), which maps the mnemonic accent encodings to MARC-8 encoded characters, not to Unicode. Use MARC::Charset to convert the result to UTF-8. This should work:

#!/usr/bin/env perl

use 5.014;

use utf8;
use strict;
use autodie;
use warnings;

use MARC::File::MARCMaker;
use MARC::Charset qw(marc8_to_utf8);

my $data = q{This is a test of diacritics like the uppercase Polish L in {Lstrok}{acute}od{acute}z
the uppercase Scandinavia O in {Ostrok}st
the uppercase D with crossbar in {Dstrok}uro
the uppercase Icelandic thorn in {THORN}ann
the uppercase digraph AE in {AElig}gir
the uppercase digraph OE in {OElig}uvres
the soft sign in rech{softsign}
the middle dot in col{middot}lecci{acute}o
the musical flat in F{flat}
the patent mark in Frizbee{reg}
the plus or minus sign in {plusmn}54%
the uppercase O-hook in B{Ohorn}
the uppercase U-hook in X{Uhorn}A
the alif in mas{mlrhring}alah
the ayn in {mllhring}arab
the lowercase Polish l in W{lstrok}oc{lstrok}aw
the lowercase Scandinavian o in K{ostrok}benhavn
the lowercase d with crossbar in {dstrok}avola
the lowercase Icelandic thorn in {thorn}ann
the lowercase digraph ae in v{aelig}re
the lowercase digraph oe in c{oelig}ur
the lowercase hardsign in s{hardsign}ezd
the Turkish dotless i in masal{inodot}
the British pound sign in {pound}5.95
the lowercase eth in ver{eth}ur
the lowercase o-hook (with pseudo question mark) in S{hooka}{ohorn}
the lowercase u-hook in T{uhorn} D{uhorn}c
the pseudo question mark in c{hooka}ui
the grave accent in tr{grave}es
the acute accent in d{acute}esir{acute}ee
the circumflex in c{circ}ote
the tilde in ma{tilde}nana
the macron in T{macr}okyo
the breve in russki{breve}i
the dot above in {dot}zaba
the dieresis (umlaut) in L{uml}owenbr{uml}au
the caron (hachek) in {caron}crny
the circle above (angstrom) in {ring}arbok
the ligature first and second halves in d{llig}i{rlig}ad{llig}i{rlig}a
the high comma off center in rozdel{rcommaa}ovac
the double acute in id{dblac}oszaki
the candrabindu (breve with dot above) in Ali{candra}iev
the cedilla in {cedil}ca va comme {cedil}ca
the right hook in viet{ogon}a
the dot below in te{dotb}da
the double dot below in {under}k{under}hu{dbldotb}tbah
the circle below in Sa{dotb}msk{ringb}rta
the double underscore in {dblunder}Ghulam
the left hook in Lech Wa{lstrok}{commab}esa
the right cedilla (comma below) in kh{rcedil}ong
the upadhmaniya (half circle below) in {breveb}humantu{caron}s
double tilde
first and second halves in {ldbltil}n{rdbltil}galan
high comma (centered) in g{commaa}eotermika.
Alg{grave}ebres de Von Neumann
{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A
};

my $marc_charset = MARC::File::MARCMaker::usmarc_default();
my $marc8 = MARC::File::MARCMaker::_maker2char($data, $marc_charset);

# prepare STDOUT for UTF-8 output (layer name written in its documented form)
binmode(STDOUT, ':encoding(UTF-8)');

# convert marc8 to utf8
my $utf8 = marc8_to_utf8($marc8);

say $utf8;

Output:

This is a test of diacritics like the uppercase Polish L in Łódź
the uppercase Scandinavia O in Øst
the uppercase D with crossbar in Đuro
the uppercase Icelandic thorn in Þann
the uppercase digraph AE in Ægir
the uppercase digraph OE in Œuvres
the soft sign in rechʹ
the middle dot in col·lecció
the musical flat in F♭
the patent mark in Frizbee®
the plus or minus sign in ±54%
the uppercase O-hook in BƠ
the uppercase U-hook in XƯA
the alif in masʼalah
the ayn in ʻarab
the lowercase Polish l in Włocław
the lowercase Scandinavian o in København
the lowercase d with crossbar in đavola
the lowercase Icelandic thorn in þann
the lowercase digraph ae in være
the lowercase digraph oe in cœur
the lowercase hardsign in sʺezd
the Turkish dotless i in masalı
the British pound sign in £5.95
the lowercase eth in verður
the lowercase o-hook (with pseudo question mark) in Sở
the lowercase u-hook in Tư Dưc
the pseudo question mark in củi
the grave accent in très
the acute accent in désirée
the circumflex in côte
the tilde in mañana
the macron in Tōkyo
the breve in russkiĭ
the dot above in żaba
the dieresis (umlaut) in Löwenbräu
the caron (hachek) in črny
the circle above (angstrom) in årbok
the ligature first and second halves in di͡adi͡a
the high comma off center in rozdelo̕vac
the double acute in időszaki
the candrabindu (breve with dot above) in Alii̐ev
the cedilla in ça va comme ça
the right hook in vietą
the dot below in teḍa
the double dot below in k̲h̲ut̤bah
the circle below in Saṃskr̥ta
the double underscore in G̳hulam
the left hook in Lech Wałe̦sa
the right cedilla (comma below) in kho̜ng
the upadhmaniya (half circle below) in ḫumantuš
double tilde
first and second halves in n͠galan
high comma (centered) in ge̓otermika.
Algèbres de Von Neumann
è
Z̊
h̥
s̥
a
A
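
When applying this to the JSON rows from the question, each "Content" value also carries subfield delimiters such as `$w` and `$a`. A minimal sketch of splitting them apart before converting each value, assuming that every `$` followed by a lowercase letter or digit starts a subfield and that a literal dollar sign never appears bare in the data; `split_subfields` is a hypothetical helper, not a function of MARC::File::MARCMaker:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split a field body into [code, value] pairs.
# Assumption: "$" plus one lowercase letter or digit always opens a subfield.
sub split_subfields {
    my ($content) = @_;
    my @pairs;
    while ($content =~ /\$([a-z0-9])([^\$]*)/g) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my @subfields = split_subfields('$wnne$aVon Neumann, Alg{grave}ebres de');
print join(' | ', map { "$_->[0]=$_->[1]" } @subfields), "\n";
# prints: w=nne | a=Von Neumann, Alg{grave}ebres de
```

Each value can then go through `_maker2char()` and `marc8_to_utf8()` as in the script above.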
