How to allow single and double quotes and quotation marks in a regular expression for a paragraph question in Google Forms?


In a paragraph question in Google Forms, the following setting is used to block the input of emojis, em dashes (character \x97), and € (character \x80): regular expression matches ^[\x0A-\xFF]*$.

Capture of Google Forms input: Regular expression Matches ^[\x0A-\xFF]*$

In a Chrome browser on a mobile device (not a desktop device) this regular expression restricts the input of: 

  1. Double quotes (character \x22)
  2. Single quotes (character \x27)
  3. Left single quotation mark (character \x91)
  4. Right single quotation mark (character \x92)
  5. Left double quotation mark (character \x93)
  6. Right double quotation mark (character \x94)

although the expression ^[\x0A-\xFF]*$ includes character 10 (\x0A) to character 255 (\xFF).

How can I update the regular expression ^[\x0A-\xFF]*$ to enable the 6 items above?

I've tried other regular expressions, such as ^([^\\\p{Emoji}]|\\[^p{Emoji}])*$, but this did not help; it made the situation worse.

Éric On

TL;DR

You confused the Windows Latin-1 (CP1252) and Unicode character sets in your numeric representations of characters, which is why your regular expression did not return the expected results. I corrected this and removed some non-pertinent characters from the class to obtain this regular expression for use in Google Forms: ^[\x0A\x0D\x20-\x7E\xA0-\xFF\x{2018}\x{2019}\x{201C}\x{201D}]*$.

Your problem on mobile devices may result from virtual keyboards inputting unexpected quotation marks that are not targeted by your regular expression (please read below).


Detailed answer

In the following, I used 255 for the decimal notation, and \xFF for the hexadecimal notation.

The problem is that you are designating characters by their numeric representation in the Windows Latin-1 (CP1252) character set, whereas the Google RE2 regular expression library used by Google Forms designates characters by their Unicode code points (probably like most, if not all, modern regular expression engines).
For positions \x00 to \x7F and \xA0 to \xFF, characters are identical in both sets, so the confusion is understandable. Positions \x80 to \x9F differ, however: CP1252 places printable characters such as the curly quotation marks there, whereas Unicode assigns those positions to control characters. In Unicode terms, the RE2 expression ^[\x0A-\xFF]*$ matches the following characters:

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

N.B.: the blanks above correspond to non-printable characters.

But to build RE2-compatible regular expressions that include characters outside those shared positions, you must use their Unicode values ("code points").

Let us compare the numeric representations of the characters considered in your question:

| Character | Description | Position in Windows Latin-1 character set | Position in the Unicode character set | Must match the regular expression |
|---|---|---|---|---|
| " | quotation mark (or double quote) | 34 or \x22 | 34 or \x22 | yes |
| ' | apostrophe (or single quote) | 39 or \x27 | 39 or \x27 | yes |
| ‘ | left single quotation mark | 145 or \x91 | 8216 or \x{2018} | yes |
| ’ | right single quotation mark | 146 or \x92 | 8217 or \x{2019} | yes |
| “ | left double quotation mark | 147 or \x93 | 8220 or \x{201C} | yes |
| ” | right double quotation mark | 148 or \x94 | 8221 or \x{201D} | yes |
| — | Em dash | 151 or \x97 | 8212 or \x{2014} | no |
| € | Euro sign | 128 or \x80 | 8364 or \x{20AC} | no |
| 😀 | grinning face | not included | 128512 or \x{1F600} | no |
| other emojis | other emojis | not included | ... or \x... | no |
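
The mapping in the table can be checked with a short Python sketch. It is only an illustration: it assumes Python's built-in "cp1252" codec is an acceptable stand-in for Windows Latin-1, and the helper name cp1252_to_unicode is made up for this example.

```python
import unicodedata

# CP1252 byte values mentioned in the question.
CP1252_BYTES = [0x22, 0x27, 0x91, 0x92, 0x93, 0x94, 0x97, 0x80]


def cp1252_to_unicode(byte_value: int) -> int:
    """Return the Unicode code point of the character stored at byte_value in CP1252."""
    return ord(bytes([byte_value]).decode("cp1252"))


# Print each CP1252 position next to its Unicode code point and official name.
for b in CP1252_BYTES:
    cp = cp1252_to_unicode(b)
    print(f"\\x{b:02X} -> U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Only the two straight quotes keep the same number in both sets; the other six characters land well above \xFF in Unicode.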

The comparison above shows that your regular expression ^[\x0A-\xFF]*$ will match lower-position characters, but not the left/right quotation marks, which stand at high positions (well above \xFF) in Unicode. So you need to extend the character class with the representations of these specific marks, like this: ^[\x0A-\xFF\x{2018}\x{2019}\x{201C}\x{201D}]*$.
Curly brackets are required by RE2 for hexadecimal numbers made of three digits or more.
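
As a rough offline check, the original and extended expressions can be compared in Python. This is a sketch under one assumption: Python's re module stands in for RE2 here, and because it does not accept the \x{2018} form, the same code points are written as \u2018 and so on. The sample strings are invented for the example.

```python
import re

# Original pattern from the question (this syntax is valid in both RE2 and Python).
ORIGINAL = re.compile(r"^[\x0A-\xFF]*$")
# Extended pattern; RE2's \x{2018} is written \u2018 in Python's re module.
EXTENDED = re.compile(r"^[\x0A-\xFF\u2018\u2019\u201C\u201D]*$")

# Hypothetical inputs, one per character family discussed above.
samples = {
    "straight quotes": "He said \"hi\" and it's fine",
    "curly quotes": "He said \u201Chi\u201D and it\u2019s fine",
    "em dash": "wait \u2014 what",
    "euro sign": "10 \u20AC",
    "emoji": "hello \U0001F600",
}

for label, text in samples.items():
    print(f"{label}: original={bool(ORIGINAL.match(text))} "
          f"extended={bool(EXTENDED.match(text))}")
```

Only the curly-quote sample changes from rejected to accepted; the em dash, Euro sign and emoji stay rejected by both patterns.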

Incidentally, it seems unnecessary to me to include all the control characters between positions \x0A and \x1F (only \x0A and \x0D seem pertinent to me). Also, positions \x7F to \x9F are assigned to control (thus non-printable) characters that are not to be input in your case. So a more pertinent, yet longer, expression would be ^[\x0A\x0D\x20-\x7E\xA0-\xFF\x{2018}\x{2019}\x{201C}\x{201D}]*$.

By the way, these expressions exclude the Euro sign, the Em dash and emojis as desired.
The mismatch with characters \x22 and \x27 on mobile devices may result from the virtual keyboard not inputting exactly the character targeted in the regular expression (quotation marks are numerous in Unicode and their shapes are sometimes very similar depending on the font; you could include more quotation marks in your character class).
Also, be aware that the Google RE2 library does not support the \p{Emoji} character class.
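
To find out which look-alike character a mobile keyboard actually inserted, one option is to list every character of a rejected response that falls outside the class. The sketch below is an assumption-laden illustration: it uses Python's re module rather than RE2 (hence \u2018 instead of \x{2018}), the function name rejected_characters is hypothetical, and the sample input is invented rather than taken from a real keyboard.

```python
import re
import unicodedata

# Final character class from the answer, in Python syntax
# (\uXXXX instead of RE2's \x{XXXX}).
ALLOWED = re.compile(r"[\x0A\x0D\x20-\x7E\xA0-\xFF\u2018\u2019\u201C\u201D]")


def rejected_characters(text: str) -> list[tuple[str, str, str]]:
    """List (character, code point, Unicode name) for each character outside the class."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for ch in text
        if not ALLOWED.fullmatch(ch)
    ]


# Hypothetical input containing two quote look-alikes:
# U+2033 DOUBLE PRIME and U+02BC MODIFIER LETTER APOSTROPHE.
print(rejected_characters("5\u2033 pipe, it\u02BCs ok"))
```

Any code point reported this way can then be added to the Google Forms character class in the \x{....} form if it should be accepted.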