Monday, October 5, 2015

Downsides of Mixed Identifiers When Porting Between Oracle and PostgreSQL Databases

Both the Oracle database and the PostgreSQL database use the presence or absence of double quotes to indicate case sensitive or case insensitive identifiers. Each of these databases allows identifiers to be named without quotes (generally case insensitive) or with double quotes (case sensitive). This blog post discusses some of the potential negative consequences of mixing quoted (or delimited) identifiers and case-insenstive identifiers in an Oracle or PostgreSQL database and then trying to port SQL to the other database.

Advantages of Case-Sensitive Quoted/Delimiter Identifiers

There are multiple advantages of case sensitive identifiers. Some of the advertised (real and perceived) benefits of case sensitive database identifiers include:

  • Ability to use reserved words, key words, and special symbols not available to identifiers without quotes.
    • PostgreSQL's keywords:
      • reserved ("only real key words" that "are never allowed as identifiers")
      • unreserved ("special meaning in particular contexts," but "can be used as identifiers in other contexts").
      • "Quoted identifiers can contain any character, except the character with code zero. (To include a double quote, write two double quotes.) This allows constructing table or column names that would otherwise not be possible, such as ones containing spaces or ampersands."
    • Oracle reserved words and keywords:
      • Oracle SQL Reserved Words that can only be used as "quoted identifiers, although this is not recommended."
      • Oracle SQL Keywords "are not reserved," but using these keywords as names can lead to "SQL statements [that] may be more difficult to read and may lead to unpredictable results."
      • "Nonquoted identifiers must begin with an alphabetic character from your database character set. Quoted identifiers can begin with any character."
      • "Quoted identifiers can contain any characters and punctuations marks as well as spaces."
  • Ability to use the same characters for two different identifiers with case being the differentiation feature.
  • Avoid dependency on a database's implementation's case assumptions and provide "one universal version."
  • Explicit case specification avoids issues with case assumptions that might be changeable in some databases such as SQL Server.
  • Consistency with most programming languages and operating systems' file systems.
  • Specified in SQL specification and explicitly spells out case of identifiers rather than relying on specific implementation details (case folding) of particular database.
  • Additional protection in cases where external users are allowed to specify SQL that is to be interpreted as identifiers.

Advantages of Case-Insensitive Identifiers

There are also advantages associated with use of case-insensitive identifiers. It can be argued that case-insensitive identifiers are the "default" in Oracle database and PostgreSQL database because one must use quotes to specify when this default case-insensitivity is not the case.

  • Case-insensitivity is the "default" in Oracle and PostgreSQL databases.
  • The best case for readability can be used in any particular context. For example, allows DML and DDL statements to be written to a particular coding convention and then be automatically mapped to the appropriate case folding for various databases.
  • Avoids errors introduced by developers who are unaware of or unwilling to follow case conventions.
  • Double quotes (" ") are very different from single quotes (' ') in at least some contexts in both the Oracle and PostgreSQL databases and not using case-sensitive identifier double quotes eliminates need to remember the difference or worry about the next developer not remembering the difference.
  • Many of the above listed "advantages" may not really be good practices:
    • Using reserved words and keywords as identifiers is probably not good for readability anyway.
    • Using symbols allowed in quoted identifiers that are not allowed in unquoted identifiers may not be necessary or even desirable.
    • Having two different variables of the same name with just different characters cases is probably not a good idea.

Default Case-Insensitive or Quoted Case-Sensitive Identifiers?

In Don’t use double quotes in PostgreSQL, Reuven Lerner makes a case for using PostgreSQL's "default" (no double quotes) case-insensitive identifiers. Lerner also points out that pgAdmin implicitly creates double-quoted case-sensitive identifiers. From an Oracle DBA perspective, @MBigglesworth79 calls quoted identifiers in Oracle an Oracle Gotcha and concludes, "My personal recommendation would be against the use of quoted identifiers as they appear to cause more problems and confusion than they are worth."

A key trade-off to be considered when debating quoted case-sensitive identifiers versus default case-insensitive identifiers is one of being able to (but also required to) explicitly specify identifiers' case versus not being able to (but not having to) specify case of characters used in the identifiers.

Choose One or the Other: Don't Mix Them!

It has been my experience that the worst choice one can make when designing database constructs is to mix case-sensitive and case-insensitive identifiers. Mixing of these make it difficult for developers to know when case matters and when it doesn't, but developers must be aware of the differences in order to use them appropriately. Mixing identifiers with implicit case and explicit case definitely violates the Principle of Least Surprise and will almost certainly result in a frustrating runtime bug.

Another factor to consider in this discussion is case folding choices implemented in Oracle database and PostgreSQL database. This case folding can cause unintentional consequences, especially when porting between two databases with different case folding assumptions. The PostgreSQL database folds to lowercase characters (non-standard) while the Oracle database folds to uppercase characters. This significance of this difference is exemplified in one of the first PostgreSQL Wiki "Oracle Compatibility Tasks": "Quoted identifiers, upper vs. lower case folding." Indeed, while I have found PostgreSQL to be heavily focused on being standards-compliant, this case folding behavior is one place that is very non-standard and cannot be easily changed.

About the only "safe" strategy to mix case-sensitive and case-insensitive identifiers in the same database is to know that particular database's default case folding strategy and to name even explicitly named (double quoted) identifiers with exactly the same case as the database will case fold non-quoted identifiers. For example, in PostgreSQL, one could name all identifiers in quotes with completely lowercase characters because PostgreSQL will default unquoted identifiers to all lowercase characters. However, when using Oracle, the opposite approach would be needed: all quoted identifiers should be all uppercase to allow case-sensitive and case-insensitive identifiers to be intermixed. Problems will arise, of course, when one attempts to port from one of these databases to the other because the assumption of lowercase or uppercase changes. The better approach, then, for database portability between Oracle and PostgreSQL databases is to commit either to using quoted case-sensitive identifiers everywhere (they are then explicitly named the same for both databases) or to use default case-insensitive identifiers everywhere (and each database will appropriately case fold appropriately in its own approach).

Conclusion

There are advantages to both identifiers with implicit case (case insensitive) and to identifiers with explicit (quoted and case sensitive) case in both Oracle database and PostgreSQL database with room for personal preferences and tastes to influence any decision on which approach to use. Although I prefer (at least at the time of this writing) to use the implicit (default) case-insensitive approach, I would rather use the explicitly spelled-out (with double quotes) identifier cases in all cases than mix the approach and use explicit case specification for identifiers in some cases and implicit specification of case of identifiers in other cases. Mixing the approaches makes it difficult to know which is being used in each table and column in the database and makes it more difficult to port the SQL code between databases such as PostgreSQL and Oracle that make different assumptions regarding case folding.

Additional Reading

1 comment:

Unknown said...

Nice Article !

Really this will help to people of PostgreSQL Community.
I have also prepared small note on this, Why identifiers are case sensitive in PostgreSQL.

http://www.dbrnd.com/2015/10/postgresql-why-identifiers-are-case-sensitive/