PHP, XML, and Character Encodings [repost]

2007-03-02 11:37:44 / Category: LAMP

http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Geekism

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCIDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCIDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
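
One limitation of scanning raw bytes like this is that the declaration has to be readable as ASCII. In UTF-16 every ASCII character is padded with a zero byte, so the pattern never matches. Here is a minimal sketch of a slightly more defensive detector, assuming PHP 7 or later. The function name detect_xml_encoding and the stricter pattern are illustrative assumptions, not what Magpie or FoF ship, and UTF-32 is ignored entirely.

```php
<?php
// Hypothetical helper: guess the encoding of an XML document from its
// first bytes. A sketch only, not the code FoF or Magpie ships.
function detect_xml_encoding(string $xml): string
{
    // A byte order mark is checked first; it is the only way this
    // function can recognize UTF-16 input.
    if (strncmp($xml, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($xml, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($xml, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }

    // Otherwise look for encoding="..." inside a declaration at the
    // start of the document.
    $rx = '/^\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }

    // No BOM and no declaration: fall back to the XML default.
    return 'UTF-8';
}
```

Called on a feed that starts with a big5 declaration, this returns "BIG5"; a document with no declaration at all comes back as "UTF-8".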

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence. All the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  // PHP can read these three natively
  $parser = xml_parser_create($encoding);
} else {
  // anything else: try to convert to UTF-8 before the parser sees it
  $encoded_source = NULL;

  if(function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  if($encoded_source != NULL) {
    // conversion worked, so make the prolog match the new encoding
    $source = str_replace ( $m[0],'<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
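
For anyone reading this on a newer PHP, the same idea can be sketched as one helper. This is only a sketch under assumptions: it targets PHP 7.1 or later (where mb_convert_encoding may throw on an unknown encoding name in PHP 8), the name prepare_xml_for_parser and its return shape are made up for illustration, and the iconv() fallback is an extra that FoF’s code above does not have.

```php
<?php
// Hypothetical helper: returns [xml, encoding], where encoding is what
// you would hand to xml_parser_create(). Not FoF's actual code.
function prepare_xml_for_parser(string $source): array
{
    $native = ['UTF-8', 'US-ASCII', 'ISO-8859-1'];
    $rx = '/<\?xml[^>]*encoding=[\'"](.*?)[\'"][^>]*\?>/';

    if (!preg_match($rx, $source, $m)) {
        return [$source, 'UTF-8'];      // no declaration: XML default
    }
    $encoding = strtoupper($m[1]);
    if (in_array($encoding, $native, true)) {
        return [$source, $encoding];    // the parser can read it directly
    }

    $converted = false;
    if (function_exists('mb_convert_encoding')) {
        try {
            $converted = @mb_convert_encoding($source, 'UTF-8', $encoding);
        } catch (\Throwable $e) {       // PHP 8 throws on unknown names
            $converted = false;
        }
    }
    if (!is_string($converted) && function_exists('iconv')) {
        $converted = @iconv($encoding, 'UTF-8', $source);
    }
    if (!is_string($converted)) {
        return [$source, 'UTF-8'];      // give up and parse it as-is
    }

    // Swap the old declaration for one that matches the new bytes.
    $decl = '<?xml version="1.0" encoding="utf-8"?>';
    return [str_replace($m[0], $decl, $converted), 'UTF-8'];
}
```

The caller would then create the parser with the returned encoding and set XML_OPTION_TARGET_ENCODING to "UTF-8" exactly as before.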


Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
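
To watch that happen, here is a small sketch that pulls all the character data out of a document as UTF-8. It assumes PHP 7 or later with the xml extension loaded; xml_text_as_utf8 is a made-up helper name, not part of PHP, Magpie, or FoF.

```php
<?php
// Hypothetical helper: collect all character data from an XML string,
// delivered as UTF-8.
function xml_text_as_utf8(string $xml): string
{
    $parser = xml_parser_create('');   // empty string: auto-detect the source
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

    $text = '';
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;                // may be called several times per node
    });

    if (xml_parse($parser, $xml, true) !== 1) {
        $message = (string) xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException($message);
    }
    return $text;
}
```

Feeding it `<a>caf&#233;</a>` gives back "café" as UTF-8 bytes, with the numeric entity resolved by the parser.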

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.

64 responses to "PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss"

Phil Ringnalda

If you really want to cover all the bases, don’t just look for mb_convert_encoding(): you can also look for (and rarely find) iconv() and recode(). Then, on *nix, try to shell_exec(’iconv…’), though you’ll fail in safe mode and most shared hosts disable shell_exec. But, there’s still one last hope! Most of them don’t realize that they should also disable the strange and terrible proc_open(), so you can actually fork a process to run iconv, and open input and output pipes to feed and read.

Or, sigh, write ten lines of Python to call Mark’s Universal Feed Parser and return the output as a PHP include. Sometimes, PHP really ticks me off.

steve

OOooooo… I didn’t know about those!

Or I could just create a web service: http://xml-transcoder.com/?to=utf-8&url=http://nasty.feed/in/EBCIDIC/or/some/such. Then you’d just seamlessly subscribe to the transcoded version of the feed.

isis

good job and funny title.

steve

Thanks, isis. I used your feed (since it’s big5) as one of the test cases!

zonble

Steve, may I have your permission to translate your post into Chinese and share it with Chinese readers? I think there are lots of people in Asia who would like to know how to handle international characters when programming in PHP.

steve

Yes! You certainly can! If you can wait until Monday, you can post your translation of this article along with a pointer to Feed on Feeds 0.1.7 which will contain the working code.

Mark Wu

Hi Steve:

I plan to integrate FoF into pLog; this tale really helped me learn a lot about PHP’s stupid encoding…

Regards, mark

Mark Wu

Only one thing: my ISP does not support iconv… my god… What can I do?

steve

Uh oh… how about ‘mbstring’? If you don’t have iconv or mbstring, then you will only be able to work with feeds that are in UTF-8, ISO-8859-1, or ASCII.

Thomas Clavier

For me, it isn’t a good idea to search for the charset in the XML header first. It’s better to look in the HTTP header, because according to the W3C, only if the charset isn’t specified in the HTTP header can you use the XML charset.

http://www.w3.org/TR/WD-html40-970708/charset.html

steve

To really get it right you have to check both. In practice, I haven’t yet found any feeds where checking the headers is necessary or even helpful.

Mark Wu

Hi Steve:

Thanks, it works! I asked my ISP to install iconv… and it works now!!

And I just use the hacked magpie-rss that shipped with FoF. With it, I can let pLog support the encoding and convert GB & BIG5 to UTF without changing any code… really, thanks!!

Please go to my site to see the result, you really helped me a lot!! ^-^.

RSS Feeds by Site ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedulebysite

RSS Feeds by Time ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedule

And, would you allow me to submit my code with your hacked magpie-rss version to pLog??

Thanks!

Regards, Mark

steve

Mark: Great news! The code is GPL, so you can share it with anybody you like, as long as they follow the terms of the GPL as well. The author of Magpie is looking at these changes now, and refining them, and they may be included in a future “official” version of MagpieRSS.

Anonymous

god. this is a boring blog.
possibly, because i’m stupid and know nothing of php encoding.
but for all our sakes, go to a party and get hammered.

David

Thomas makes a good point, but it’s even worse than that–not only are there character encodings specified by both HTTP and the XML prolog, but either or both of them could be completely wrong. I would imagine there are many feeds served as ISO-8859-1 that are actually windows-1252 (or whatever it is) and they just don’t happen to contain any of the characters that are different between the two formats, yet. When one of them does, your code might handle it, or maybe it’ll start spitting out gibberish again. And if someone gets UTF-8 and UTF-16 mixed up, I think you could wind up with everything shifted off by a byte, and now the feed is completely unreadable again.

Basically, until we can convince everyone to use UTF-8/16/32 for everything, this will be a bloody pain in the arse. It looks like PHP just makes it even more painful. I would second Phil’s recommendation: use Mark’s Universal Feed Parser, and when something breaks, just get him to fix it. :-D

steve

At this point I’m still in the “get it right when the feed is right” stage. FoF still doesn’t even do that, all the time. Once I’ve got that one licked, I may move on to “get it right even when the feed doesn’t”.

So Much Geek, So Little Time » Reject Incorrectly encoded Pingbacks

[…] ject Incorrectly encoded Pingbacks Filed under: Life — unteins @ 12:11 pm This article has some info about […]

Mark Wu

Hi Steve:

I wrote a blog about pLog RSSFetcher Plug-in. Thanks for such good work.

http://blog.markplace.net/index.php?op=ViewArticle&articleId=119&blogId=1

Regards, Mark

Harry Fuecks

Great post. Many thanks. Seems many php developers are largely oblivious to character encoding issues. Looking at some of the feed generation libraries out there, seems it’s a similar story - last time I looked only RssWriter pays any attention to UTF8. The author is Portuguese I believe, which may be why…

Gives me some ideas for further features for HTMLSax - right now it shouldn’t (not that I’ve tested carefully) choke on anything but also doesn’t support the user by taking care of encoding issues.

inertia

hi Steve,

I heard about this good tool from isis, and installed it on my shared hosting to test. One thing I can’t figure out: after reading all the docs carefully and asking my ISP whether mbstring and iconv were both compiled in, I still can’t fetch big5 blogs, e.g. isis’s blog. Where do you think I may have gone wrong?

I know this problem is quite “ambiguous”; say, the ISP didn’t compile it well, or some installation step went wrong. But I’d still like to get some ideas from you.

regards
inertia

Peter Van Dijck's Guide to Ease

Steve Minutillo :: messy-78  PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss: a must read if you’re wrestling with PHP, XML and character encodings!…

MeriBlog : Meri Williams' Weblog

Multitasking Mice Fertilitiy
Interesting article over at OK/Cancel all about multitasking PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss — worth reading just for the title ;-) Very comprehensive guidebook to developing with web standards I love the…

Grace and peace to you! » 2004 » June

[…] Grace and peace to you!

» PHP, XML, and Character Encodings: a […]

Pete Prodoehl

I too must suggest that using Mark’s Universal Feed Parser would be a good idea. In fact, since the process of harvesting the feeds, parsing them, and storing them in MySQL can be separated out from the whole UI/reading part of things, this is a great suggestion. I know it sort of makes fof a weird combo PHP/Python app, but it could be an option for those of us who don’t have a problem with that.

LinuxBrit

PHP, XML, and Character Encodings
Man, PHP can be really unbelievably stupid sometimes :(

Elaine

yipes…I started taking a look at the problem when I was mucking about with a personal variant of FoF and could never quite figure out what was going on. I always thought it was a problem with Magpie, but I had no idea it went that deep. thanks for all your work on FoF, and for sharing all the gory details!

Simon Jessey

Multibyte string functions are not part of the PHP default install, so many webhosts do not include it - my own webhost refused to add it. Just thought I’d let you know.

Dominic Mitchell

What about UTF-16? Your regex won’t be able to pick up the XML declaration then…

I love spanners. :-)

-Dom

steve

inertia: I’m lucky enough to be on a host where mbstring and iconv are both included and work perfectly. I’ve actually never even compiled PHP myself, so I don’t really know what can go wrong there, other than the obvious “check phpinfo()”

Simon: I know. There’s nothing I can really do about that. In the next version of FoF the installer will inspect your system and tell you what it finds, so at least this will be less confusing.

Dominic: I knew there’d be a hitch somewhere. Are you sure? Have you tried it with UTF-16? If it doesn’t work, is there any workaround?

David

Thanks, excellent post. Good to know someone’s working so we don’t have to :)

snapping links » lookin’ good

[…] n into the problem with character encoding (darn curly quotes)…so I went back to look at what Steve Minutillo had to say abou […]

Andrew

Thanks for your post. It’s unfortunate that php’s support for i18n is so poor here. For what it’s worth (probably not much), EBCDIC is not spelled with an extra ‘I’, even though people pronounce it as if it did. Man, I hope nobody sends out news feeds in EBCDIC O:-).

车东Blog^2

Lilina: building a personal portal with an RSS aggregator (Write once, publish anywhere)
While collecting RSS parsing tools recently, I found MagPieRSS and Lilina, which is built on top of it. Lilina’s main features:

1. Web-based RSS management: add, delete, OPML export, a backend RSS cache (to avoid putting too much load on the source servers), ScriptLet…

车东BLOG

Parsing UTF-8 and GBK RSS feeds in MagPieRSS: character-oriented programming in PHP explained
My first attempt at MagpieRSS failed because iconv and mbstring were not installed. Today I installed iconv and mbstring support on the server and took a careful look at how rss_fetch is used in lilina: the most important thing is to set the RSS output format to ’MAGPIE_O…

grace

how to do it even without using XML ? I just want my page to be able to key in chinese characters, submit them to mysql in code form, then display the chinese characters on the page.

What is the testing script which I can test for this?
Thank you

steve

grace: I don’t have any links to tutorials on that, but what you’re talking about is fairly easy to do. In fact, the free weblog engine that powers this site, Wordpress, is capable of doing just that. You could download Wordpress and examine how international characters are handled.

Valery's Mindlog

Sascha Carlins Linkdump

This page has been linkdumped
Charsets…

Andy

I’m developing a library called mbstring emulator which emulates the mbstring functions. I already published mbstring emulator for Japanese (supports Shift_JIS, EUC-JP, UTF-8), and now I’m working on a western language version (iso-8859-1 and utf-8). I’d like to know which encodings you need.

steve

BIG-5 would be nice.

ryh

Good job, I’m using this on my site.

+CMS

Andy your mbstring emulator is perfect. thx

Charl47

Hi steve
I’m using magpierss 7.0. Where exactly should I put your syntax? I don’t understand all of your explanation; I’m a newbie in RSS and I’m French, so it’s difficult for me to translate this.
Thanks.

steve

If you are using MagpieRSS 0.7, then this code is already built in! This was written before MagpieRSS 0.7 was created.

nwestwood

I’m using magpieRSS to read multiple feeds and create 1 feed with news of interest to our industry and it works, mostly. When I write out the data I retrieve from magpie, I get “not well-formed” XML, for example & symbols in the URLs that Feed Validator doesn’t like. Is there a way to get the data back and have it encoded correctly? or what do you suggest?

-thanks - Neal

steve

Magpie is working the way Feed on Feeds uses it. You could try asking on the magpierss mailing list with some more specifics on your problem.

junesnow17

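I don’t know whether the Chinese characters I type here will display correctly.
My English is poor, so I have no choice but to write in Chinese.

For various reasons, while learning PHP I found that the PHP version I had installed was too old, so I spent two days upgrading to the latest version.

As a result, all the member names in my forum turned into gibberish @@

After reading this article the problem still wasn’t solved, and my husband ended up restoring the database to the earlier version, but I am deeply touched that someone is paying attention to this issue.

Because of our writing system, those of us who use Chinese characters get overlooked, and when writing programs we often end up dizzy from the limitations around text. I hope that one day soon there will be a PHP that all of our people can use.

Don’t make us feel constrained at every turn anymore.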

Jari

Hello Steve
I was looking for this script for a long time. It looks great. But I get a parse error, unexpected ‘*’, in the line $rx=
I made a copy/paste in Dreamweaver. What’s wrong in the syntax? Thanks for helping me. Best regards

jari

Hello

I tried the magpie rss_parse.inc script that uses your post. Copying with notepad, I now have no syntax error. I want to read xml files in ISO or UTF-8. I tried the function php4_create_parser to test the encoding of the input files. I use the regular expression ‘//m’ with preg_match, but if I do an echo of this regular expression I get nothing. What’s wrong? How can I get the encoding of the xml files? Can you help me?

Best regards

Javier

Thanks so much man, this has solved a huge problem I had. I’ve been searching for a solution for a while; parsing XML with PHP 4.x can be frustrating, especially if your XML files need to use entities for different languages (in my case, Spanish).

I’ve learned a lot with this, thanks again ^_^

Asha

Hello,
I am using PHP and MYSQL to store Japanese data. I am facing an encoding problem. I am not able to store it in a table with proper encoding. Can anyone help me?

K

Too late probably, but just replace all “” from the string before executing the regexp, and it should work with UTF-16

K

ok, my last comment was “mangled”

Replace all 0-bytes from the string. That’s \ by the way

oink

i suggest to go a little further with the regular expression in a network world full of screwed up source codes:

preg_match( “/]+encoding\s*=\s*[’”]?([\w._]+(-[\w._]+)*)[’”]?[^>]\?>/i”, $xml, $m )

it’s possible i screwed up here myself, double check advised, i didn’t test it yet.

oink

well seems like your comment page doesn’t translate “smaller than”.

steve

Sorry about that… I think Wordpress probably has a similar screwed up regex that tries to sanitize comments that munged yours.

pepe

hola egañádos

Andrew

Hi, I am using MagpieRSS 0.7 but I still am getting character substitution! please see this site still in progress:
www.andrewzahn.com/crissa

can you tell what could cause this?
thanks

Jason Judge

Just a note on this bit:

“If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it.”

It should be noted that any UTF-8 stream can be treated as a valid ISO-8859 stream, since an ISO stream is a series of bytes. However, the reverse does not hold true. There are ISO-8859 (and other single-byte and multibyte streams) that turn out to be invalid as UTF-8 streams.

The reason is that a series of independent bytes is a series of bytes, but UTF-8 has strict rules in which ranges of byte values can follow certain other bytes.

The parser may not fall over now when it hits these invalid sequences, but I am not sure it is safe to assume that will always be the case.

I think it would be safer to send an unknown or unhandlable encoding into the parser as ISO8859, and then convert the entities afterwards. *That* is why the parser defaults to ISO and not UTF.

matt

Steve,

Great stuff here. I’m using Magpie v 7a. The parser looks to have incorporated your encoding fix for PHP4. On my page, particularly in the Yahoo News feeds, several of the characters are converted to question marks. I looked at the original feed, and the special characters are an apostrophe and an mdash. Apparently, all apostrophes aren’t equal as some are handled well and others are converted to ‘?’. The apostrophe causing problems slants backwards (almost like an accent).

All in all, I would say that the output is still pretty good, and completely readable. It would be icing on the cake to fix this problem with special characters.

The Yahoo feed is UTF-8. Here is the link to the actual XML file:

link

Off topic, comparing Yahoo News and Google News feeds. Yahoo is much better in terms of their advanced search options. Google has no provision to exclude based on keywords. However, Google incorporates nice thumbnails into their output descriptions. This makes them very appealing, and it will be interesting to see how long it lasts before others begin incorporating small thumbnails as well. I could see including a conditional that would only display stories which contained thumbnails.

unclepiak ลุงเปี๊ยก

sound interesting ! May I test some Thai-language text here?

Rasmus

I had the same problem using magpie .72 and found that the only solution that worked, was to ditch my UTF-8 feed and replace it with an ISO-8859 feed instead. After changing that setting in WordPress and clearing magpie’s cache, everything worked perfectly.

Oliver

Hey all.. Thanks for the post. I am actually trying to mod this plugin right now for my site for something having to do nothing with favicons and would love some help if anyone is willing to trade a few emails.

Matt

If you, like me, found no luck here in resolving your char. munging issue in Magpie 0.72 and RSS (Atom works fine, right?) then perhaps what I did may help you…

when you include and define your Magpie params be sure to include this line:

define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');

and in the head of your HTML document be sure (for browser compatibility and user preference consideration) to add/change the following:

(less than sign) meta http-equiv="Content-Type" content="text/html; charset=UTF-8" / (greater than sign)

of course you will replace the signs indicated in parentheses… without the parentheses… hehehe
