Utf8.php CI Explained

I’ve stopped work on this series to start a new one for CI 3.

Utf8.php defines CI_Utf8, a core CodeIgniter class. This post is part of a series that explains the CodeIgniter (CI) source code, and it covers Utf8.php as shipped with CI 2.1.3.

./system/core/Utf8.php

The Script’s Boilerplate

<?php  if ( ! defined('BASEPATH')) exit('No direct script access allowed');
/**
 * CodeIgniter
 *
 * An open source application development framework for PHP 5.1.6 or newer
 *
 * @package		CodeIgniter
 * @author		ExpressionEngine Dev Team
 * @copyright	Copyright (c) 2008 - 2011, EllisLab, Inc.
 * @license		http://codeigniter.com/user_guide/license.html
 * @link		http://codeigniter.com
 * @since		Version 2.0
 * @filesource
 */

// ------------------------------------------------------------------------

/**
 * Utf8 Class
 *
 * Provides support for UTF-8 environments
 *
 * @package		CodeIgniter
 * @subpackage	Libraries
 * @category	UTF-8
 * @author		ExpressionEngine Dev Team
 * @link		http://codeigniter.com/user_guide/libraries/utf8.html
 */
class CI_Utf8 {

Code: __construct()

/**
 * Constructor
 *
 * Determines if UTF-8 support is to be enabled
 *
 */
function __construct()
{
	log_message('debug', "Utf8 Class Initialized");

	global $CFG;

	if (
		preg_match('/./u', 'é') === 1					// PCRE must support UTF-8
		AND function_exists('iconv')					// iconv must be installed
		AND ini_get('mbstring.func_overload') != 1		// Multibyte string function overloading cannot be enabled
		AND $CFG->item('charset') == 'UTF-8'			// Application charset must be UTF-8
		)
	{
		log_message('debug', "UTF-8 Support Enabled");

		define('UTF8_ENABLED', TRUE);

		// set internal encoding for multibyte string functions if necessary
		// and set a flag so we don't have to repeatedly use extension_loaded()
		// or function_exists()
		if (extension_loaded('mbstring'))
		{
			define('MB_ENABLED', TRUE);
			mb_internal_encoding('UTF-8');
		}
		else
		{
			define('MB_ENABLED', FALSE);
		}
	}
	else
	{
		log_message('debug', "UTF-8 Support Disabled");
		define('UTF8_ENABLED', FALSE);
	}
}

Looking at the constructor piece-by-piece:

/**
 * Constructor
 *
 * Determines if UTF-8 support is to be enabled
 *
 */
function __construct()
{
	log_message('debug', "Utf8 Class Initialized");

	global $CFG;

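The code above is largely self-explanatory: it logs a debug message and brings the global $CFG configuration object into scope so the charset setting can be read.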

if (
	preg_match('/./u', 'é') === 1					// PCRE must support UTF-8
	AND function_exists('iconv')					// iconv must be installed
	AND ini_get('mbstring.func_overload') != 1		// Multibyte string function overloading cannot be enabled
	AND $CFG->item('charset') == 'UTF-8'			// Application charset must be UTF-8
	)
{

Q: What does this do?

preg_match('/./u', 'é') === 1

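The dot metacharacter matches any single character (by default, anything except a newline).

The u pattern modifier says the pattern and the subject are to be treated as UTF-8, so the multi-byte é counts as one character.

preg_match() returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred. The strict comparison with 1 therefore fails whenever PCRE cannot process the pattern as UTF-8.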
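To make the three return values concrete, here is a small sketch (not part of CI) that runs preg_match() with and without the u modifier. The é is written as the byte escape "\xC3\xA9" so the result does not depend on how the script file itself is encoded; the variable names are made up for illustration.

```php
<?php
// "é" encoded as UTF-8 is the two-byte sequence C3 A9.
$e_acute = "\xC3\xA9";

// With /u the two bytes are read as one character, so ^.$ matches.
$with_u = preg_match('/^.$/u', $e_acute);      // 1

// Without /u the dot matches a single byte, so ^.$ cannot cover both bytes.
$without_u = preg_match('/^.$/', $e_acute);    // 0

// A lone lead byte is not valid UTF-8; with /u preg_match() reports an error.
$invalid = @preg_match('/./u', "\xC3");        // FALSE

var_dump($with_u, $without_u, $invalid);
```

The constructor only needs the first case, but the other two show why the comparison is written as === 1 rather than a loose truth test.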

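Taken together, the four conditions are TRUE only if the environment is UTF-8 friendly: PCRE supports UTF-8, iconv is available, mbstring.func_overload is not set to 1, and the application charset is UTF-8.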
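As a rough illustration, the same four checks can be pulled into a standalone function and run outside CI. The name utf8_environment_ok() and its $charset parameter are hypothetical; $charset stands in for $CFG->item('charset').

```php
<?php
/**
 * Hypothetical helper mirroring the four checks in CI_Utf8::__construct().
 * $charset stands in for $CFG->item('charset').
 */
function utf8_environment_ok($charset)
{
    return preg_match('/./u', "\xC3\xA9") === 1        // PCRE understands UTF-8
        && function_exists('iconv')                     // iconv is available
        && ini_get('mbstring.func_overload') != 1       // same loose test CI uses
        && $charset == 'UTF-8';                         // application charset
}

var_dump(utf8_environment_ok('UTF-8'));       // depends on the PHP build
var_dump(utf8_environment_ok('ISO-8859-1'));  // bool(false)
```

Note that the charset comparison is case-sensitive, so a config value of 'utf-8' would not pass this check.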

So, assuming it is UTF-8 friendly:

log_message('debug', "UTF-8 Support Enabled");
define('UTF8_ENABLED', TRUE);

Record the UTF-8 friendly status in the log and as a constant.

// set internal encoding for multibyte string functions if necessary
// and set a flag so we don't have to repeatedly use extension_loaded()
// or function_exists()
if (extension_loaded('mbstring'))
{
	define('MB_ENABLED', TRUE);
	mb_internal_encoding('UTF-8');
}
else
{
	define('MB_ENABLED', FALSE);
}

Determine whether the mbstring extension is loaded and record the answer in the MB_ENABLED constant. If it is loaded, also set the internal encoding for the multibyte string functions to UTF-8.
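The internal encoding matters because mbstring functions fall back to it when no encoding argument is passed. The sketch below illustrates the "decide once, remember in a constant" idea; utf8_char_count() is a hypothetical helper, not something CI provides.

```php
<?php
// Same idea as the constructor: decide once, remember the answer in a constant.
defined('MB_ENABLED') OR define('MB_ENABLED', extension_loaded('mbstring'));

if (MB_ENABLED)
{
    mb_internal_encoding('UTF-8');
}

/**
 * Hypothetical helper: count characters, not bytes.
 */
function utf8_char_count($str)
{
    if (MB_ENABLED)
    {
        // No encoding argument needed: mb_strlen() uses the internal encoding.
        return mb_strlen($str);
    }

    // Fallback without mbstring: count matches of "any character" in UTF-8 mode.
    return preg_match_all('/./us', $str, $matches);
}

$word = "h\xC3\xA9llo";             // "héllo", 6 bytes
var_dump(strlen($word));            // int(6), bytes
var_dump(utf8_char_count($word));   // int(5), characters
```

Code elsewhere can then branch on MB_ENABLED instead of calling extension_loaded() or function_exists() each time.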

Otherwise:

else
{
	log_message('debug', "UTF-8 Support Disabled");
	define('UTF8_ENABLED', FALSE);
}	

Log "UTF-8 Support Disabled" and set the UTF8_ENABLED constant to FALSE.

Code: clean_string()

/**
 * Clean UTF-8 strings
 *
 * Ensures strings are UTF-8
 *
 * @access	public
 * @param	string
 * @return	string
 */
function clean_string($str)
{
	if ($this->_is_ascii($str) === FALSE)
	{
		$str = @iconv('UTF-8', 'UTF-8//IGNORE', $str);
	}

	return $str;
}

Q: What does this do?

iconv('UTF-8', 'UTF-8//IGNORE', $str)

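This re-encodes $str from UTF-8 to UTF-8. The //IGNORE suffix asks iconv to silently discard any byte sequences that are not valid UTF-8. The call is only made when _is_ascii() finds a non-ASCII byte, and the @ operator suppresses any notice iconv raises. See the PHP manual page for iconv().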
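A standalone sketch of the same logic is shown below. clean_utf8() is a hypothetical name, and the code assumes iconv is installed (CI only enables UTF-8 support when it is). How //IGNORE treats an illegal byte can differ between iconv builds, and some may return FALSE rather than a stripped string, so no output is claimed for the last call.

```php
<?php
/**
 * Standalone sketch of clean_string(); clean_utf8() is a hypothetical name.
 */
function clean_utf8($str)
{
    // Pure ASCII is already valid UTF-8, so skip the iconv() call.
    if (preg_match('/[^\x00-\x7F]/S', $str) == 0)
    {
        return $str;
    }

    // Assumption: iconv is installed.
    // The @ hides the notice iconv() raises on an illegal byte sequence.
    return @iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

var_dump(clean_utf8('plain ascii'));    // returned unchanged
var_dump(clean_utf8("caf\xC3\xA9"));    // valid UTF-8, expected to pass through
var_dump(clean_utf8("bad\xFFbyte"));    // result depends on the iconv build
```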

Code: safe_ascii_for_xml()

/**
 * Remove ASCII control characters
 *
 * Removes all ASCII control characters except horizontal tabs,
 * line feeds, and carriage returns, as all others can cause
 * problems in XML
 *
 * @access	public
 * @param	string
 * @return	string
 */
function safe_ascii_for_xml($str)
{
	return remove_invisible_characters($str, FALSE);
}

Q: What does this do?

remove_invisible_characters($str, FALSE)

remove_invisible_characters() is a CI common function, not a PHP built-in.

Passing FALSE as the second argument says that $str is not URL-encoded, so the function does not also look for URL-encoded control characters.

As the comment in the code says, it removes all ASCII control characters except horizontal tabs, line feeds, and carriage returns, as all others can cause problems in XML.
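The effect can be approximated outside CI with a single regular expression. The function below is a simplified stand-in under a made-up name, strip_control_chars(); it is a sketch of the behaviour described in the comment, not CI's own implementation.

```php
<?php
/**
 * Hypothetical stand-in for remove_invisible_characters($str, FALSE).
 * Keeps tab (\x09), line feed (\x0A) and carriage return (\x0D).
 */
function strip_control_chars($str)
{
    // \x00-\x08, \x0B, \x0C, \x0E-\x1F and \x7F (DEL) are removed.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/', '', $str);
}

$dirty = "a\x00b\tc\n\x07d";
var_dump(strip_control_chars($dirty) === "ab\tc\nd");   // bool(true)
```

The NUL and BEL bytes are dropped, while the tab and line feed survive.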

Code: convert_to_utf8()

/**
 * Convert to UTF-8
 *
 * Attempts to convert a string to UTF-8
 *
 * @access	public
 * @param	string
 * @param	string	- input encoding
 * @return	string
 */
function convert_to_utf8($str, $encoding)
{
	if (function_exists('iconv'))
	{
		$str = @iconv($encoding, 'UTF-8', $str);
	}
	elseif (function_exists('mb_convert_encoding'))
	{
		$str = @mb_convert_encoding($str, 'UTF-8', $encoding);
	}
	else
	{
		return FALSE;
	}

	return $str;
}
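The method prefers iconv, falls back to mbstring, and returns FALSE when neither is available. The two functions take their arguments in a different order, which is easy to miss. Here is a standalone copy of the logic under the hypothetical name to_utf8(), with an ISO-8859-1 input chosen purely for illustration:

```php
<?php
/**
 * Standalone copy of the method's logic; to_utf8() is a hypothetical name.
 */
function to_utf8($str, $encoding)
{
    if (function_exists('iconv'))
    {
        // iconv(from, to, string)
        return @iconv($encoding, 'UTF-8', $str);
    }

    if (function_exists('mb_convert_encoding'))
    {
        // mb_convert_encoding(string, to, from): note the different order
        return @mb_convert_encoding($str, 'UTF-8', $encoding);
    }

    return FALSE;
}

// "é" is the single byte E9 in ISO-8859-1 and the two bytes C3 A9 in UTF-8.
$converted = to_utf8("\xE9", 'ISO-8859-1');
var_dump($converted === FALSE ? FALSE : bin2hex($converted));
```

Because the return value is FALSE when no converter is available or the conversion fails, callers should compare against FALSE with === before using the result.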

Code: _is_ascii()

/**
 * Is ASCII?
 *
 * Tests if a string is standard 7-bit ASCII or not
 *
 * @access	public
 * @param	string
 * @return	bool
 */
function _is_ascii($str)
{
	return (preg_match('/[^\x00-\x7F]/S', $str) == 0);
}

preg_match() returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred.

What does the pattern /[^\x00-\x7F]/S match?

The S modifier has this description:

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

What does the caret/circumflex (^) inside the character class signify?

A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.

\x00-\x7F is the range of 7-bit ASCII characters (bytes 0 through 127).

Hence, _is_ascii($str) returns TRUE if $str contains no non-ASCII bytes, and FALSE otherwise.
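A quick way to check this reading is to run the same pattern on a few strings. is_ascii() below is a standalone copy under a hypothetical name, since _is_ascii() itself is a method of CI_Utf8.

```php
<?php
/**
 * Standalone copy of _is_ascii(); is_ascii() is a hypothetical name.
 */
function is_ascii($str)
{
    // The pattern looks for one byte outside 0x00-0x7F; no match means ASCII.
    return (preg_match('/[^\x00-\x7F]/S', $str) == 0);
}

var_dump(is_ascii('hello'));          // bool(true)
var_dump(is_ascii("h\xC3\xA9llo"));   // bool(false)
var_dump(is_ascii(''));               // bool(true): nothing to reject
```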

About samehramzylabib

See About on https://samehramzylabib.wordpress.com
This entry was posted in CI Source Code Explained.