
Remove Non-ASCII Characters Using PHP and Regular Expressions

So there I was, taking forever to fix content from a WordPerfect document that had been converted to Word and then pasted as HTML. I had spent a while removing all the bullshit non-ASCII characters by hand when I said to myself: Wes, you idiot, you are a programmer, do this the right way and filter it out. So I did, and here is a simple way of doing it with regular expressions and PHP.


<?php
// Your nasty string of Word and non-ASCII chars
$Contentz = "These are shitty chars „ and we don't like them nor want them.";

// Content I want turned into a space
$badContent = array("&nbsp;");

// Replace each bad item with a space, then trim the ends
$Contentz = trim(str_replace($badContent, " ", $Contentz));

// Specific replacements for curly quotes, ellipses, and dashes that you want converted rather than removed
$theBad  = array("“", "”", "‘", "’", "…", "—", "–");
$theGood = array("\"", "\"", "'", "'", "...", "-", "-");
$Contentz = str_replace($theBad, $theGood, $Contentz);

// Whatever might be left over...
// Remove every byte that is not printable ASCII (0x20-0x7E) or a newline (0x0A),
// aka the leftover Microsoft Word and WordPerfect junk
$Contentz = preg_replace('/[^\x20-\x7E\x0A]+/', '', $Contentz);

echo $Contentz;
?>

Echoing $Contentz now prints the string with the non-ASCII characters removed.
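
If you need this in more than one place, the same steps can be wrapped in a function. This is only a rough sketch: the name strip_non_ascii is made up for this example, it assumes the input is UTF-8 and PHP 7 or later (for the \u{...} escapes), it uses strtr instead of the two parallel str_replace arrays, and it trims at the end rather than after the first step.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (name and signature are not from the post above).
 * Assumes $text is UTF-8 encoded.
 */
function strip_non_ascii(string $text): string
{
    // Turn HTML non-breaking-space entities into plain spaces.
    $text = str_replace('&nbsp;', ' ', $text);

    // Map common Word punctuation to ASCII equivalents before stripping.
    $replacements = [
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{2026}" => '...', // ellipsis
        "\u{2014}" => '-',   // em dash
        "\u{2013}" => '-',   // en dash
    ];
    $text = strtr($text, $replacements);

    // Drop every byte that is not printable ASCII (0x20-0x7E) or a newline.
    $clean = preg_replace('/[^\x20-\x7E\x0A]+/', '', $text);

    // preg_replace returns null on error, so fall back to an empty string.
    return trim($clean ?? '');
}

// Usage: the curly quotes and ellipsis are converted, the accented e is dropped.
echo strip_non_ascii("\u{201C}Hello\u{201D} \u{2026} caf\u{00E9}&nbsp;bar"), "\n";
// prints: "Hello" ... caf bar
```

Keep in mind this throws away accented letters and any non-Latin text, so it only makes sense when plain English output is all you need.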

Cheers,
Wes S. Ray

6 Responses to “Remove Non Ascii Characters using PHP and Regular Expressions”

  1. will says:

    Wow…I’ve needed something like this before. Way to go, and thanks!

  2. Maqatac says:

    I don’t understand why you haven’t made a million dollars in book sales yet!

  3. phonejail says:

    why do you keep sharing company secrets?

  4. 1 minute: Why are non-ASCII characters in your target document? Is it possible that some format information was lost while converting from the old WordPerfect document? Or was there French or Arabic text in it?

    For your solution simply use
    > strings mydoc.wpf > mydoc.txt

    Good night …

  5. admin says:

    Christian, look at the title (with PHP and regular expressions). Why do people create huge PHP classes to zip files when you can use “zip” from the CLI? This is just a solution for people who are new to programming and paste text into an IDE like Dreamweaver, leaving characters that look like spaces but parse as ╤⌂Ñ weird shit.

  6. [...] I have been looking for a great function to do this for awhile and I finally found one… does the job perfectly! I only need to concern myself with English so am not worried about losing non-ascii characters that might make up Arabic or some other language. Kudos to Wes for originally writing and posting this on his blog here: http://www.wessray.com/php/strip-and-remove-non-ascii-characters-using-php-regular-expressions/ [...]


Copyright ©2012 wessray.com