Naive Bayes classifier

Naive Bayes

Definition: The naive Bayes assumption on events \(A_1, A_2, \ldots, A_n\) says that these events are mutually independent.

This assumption makes the computation of the classifier much easier. The details of this computation will be shown shortly.

Naive Bayes classifier

Definition: A classifier is a function \(M : Pow(F) \to C\), where \(F = \{f_1, f_2, \ldots, f_n\}\) is a set of features and \(C = \{c_1, c_2, \ldots, c_m\}\) is a set of categories. The job of the classifier is to take some subset \(A \subseteq F\) and predict which category \(c \in C\) it belongs to.

Example: Let \(D\) be a text document. We choose the features of \(D\) to be the words of \(D\). That is to say, if \(“proof” \in D\) then we may say that \(f_{“proof”}\) is a feature of \(D\).

Definition: Let \(c \in C\) and \(f_i \in F\). The Bayesian classifier \(M\) is defined as

\(M(f_1, f_2, \ldots, f_n) = \operatorname{argmax}_c Pr(c) \frac{ Pr(f_1, \ldots, f_n | c) }{ Pr(f_1, \ldots, f_n) }\)

Notice that the denominator \(Pr(f_1, \ldots, f_n)\) is the same for every class \(c \in C\). This means we can drop this term and still obtain the same maximizing argument. In other words, we achieve the same result with:

\(M(f_1, f_2, \ldots, f_n) = \operatorname{argmax}_c Pr(c) Pr(f_1, \ldots, f_n | c)\)

However, even with this simplification, we have a problem. This definition requires estimates of the form \(Pr(f_1, \ldots, f_n | c)\), and in general each particular combination of features may be rare or entirely absent from our dataset. Because of this, we relax the condition in the next definition, under the assumption that each feature acts independently of the others within a given category.

Definition: Let \(c \in C, f_i \in F\) and assume that the events \(f_1, \ldots, f_n\) are mutually independent given the category (the naive Bayes assumption). The naive Bayesian classifier is defined as

\(M(f_1, f_2, \ldots, f_n) = \operatorname{argmax}_c Pr(c) \prod_i Pr(f_i | c)\)

Python implementation

To create a working version of a Naive Bayes classifier, we need to start with some data. The following dataset is quite common in machine learning texts. It shows 14 days of various weather conditions, and whether those conditions are suitable for playing tennis. A sketch of a classifier trained on this data follows the table.

Day  Outlook   Temperature  Humidity  Wind    Play Tennis?
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
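
Below is a minimal sketch of how the classifier defined earlier can be trained on this table. The function names and layout here are my own, and for brevity there is no smoothing; a production classifier would at least add Laplace smoothing and work with log probabilities to avoid underflow.

from collections import Counter, defaultdict

# The tennis dataset above: (outlook, temperature, humidity, wind, play)
data = [
  ("Sunny", "Hot", "High", "Weak", "No"),
  ("Sunny", "Hot", "High", "Strong", "No"),
  ("Overcast", "Hot", "High", "Weak", "Yes"),
  ("Rain", "Mild", "High", "Weak", "Yes"),
  ("Rain", "Cool", "Normal", "Weak", "Yes"),
  ("Rain", "Cool", "Normal", "Strong", "No"),
  ("Overcast", "Cool", "Normal", "Strong", "Yes"),
  ("Sunny", "Mild", "High", "Weak", "No"),
  ("Sunny", "Cool", "Normal", "Weak", "Yes"),
  ("Rain", "Mild", "Normal", "Weak", "Yes"),
  ("Sunny", "Mild", "Normal", "Strong", "Yes"),
  ("Overcast", "Mild", "High", "Strong", "Yes"),
  ("Overcast", "Hot", "Normal", "Weak", "Yes"),
  ("Rain", "Mild", "High", "Strong", "No"),
]

def train(rows):
  # Pr(c): frequency of each category; Pr(f_i | c): per-category feature counts
  priors = Counter(row[-1] for row in rows)
  cond = defaultdict(Counter)
  for *features, label in rows:
    for i, value in enumerate(features):
      cond[label][(i, value)] += 1
  return priors, cond

def classify(features, priors, cond, n):
  # argmax_c Pr(c) * prod_i Pr(f_i | c)
  best, best_score = None, -1.0
  for label, count in priors.items():
    score = count / n
    for i, value in enumerate(features):
      score *= cond[label][(i, value)] / count
    if score > best_score:
      best, best_score = label, score
  return best

priors, cond = train(data)
print(classify(("Sunny", "Cool", "High", "Strong"), priors, cond, len(data)))
# Prints "No": Pr(No) * prod_i Pr(f_i | No) ~ 0.0206 beats Pr(Yes) * prod_i Pr(f_i | Yes) ~ 0.0053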

Stream Ciphers

Secure PRGs

Definition: A PRG \(G\) is said to be secure if, for any given statistical test \(A\), the statistical advantage \(ADV_{PRG}[A, G]\) is negligible, i.e. very close to \(0\). (Statistical tests and the advantage are defined below.)

Predictability

Definition: Let \(G\) be a PRG and \(s \in \mathbb{N}, k \in \{0,1\}^s\) a seed for \(G\). We say that \(G\) is predictable if there exist an \(i \in \mathbb{N}\) and an efficient algorithm \(A : \{0, 1\}^i \to \{0, 1\}\) such that

\( A(G(k)_{1..i}) = G(k)_{i+1} \)

with probability noticeably better than \(1/2\) over the random choice of the seed \(k\).

Basically, this equation says that, given bits \(1, 2, \ldots, i\) of the output \(G(k)\), we can use \(A\) to predict the \((i+1)\)-th bit of \(G(k)\) better than by guessing.

Theorem (due to Yao): If a PRG is not predictable, it is secure.
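
To make predictability concrete, here is a toy sketch (the generator and predictor are made up for illustration): a deliberately terrible PRG that just repeats its seed, together with a predictor \(A\) that is always right once it has seen \(s\) bits.

def weak_prg(seed, n):
  # A deliberately terrible PRG: repeat the seed bits until n bits are produced
  return [seed[i % len(seed)] for i in range(n)]

def predictor(prefix, s):
  # Given the first i >= s output bits, bit i+1 equals the bit s positions back
  return prefix[-s]

seed = [1, 0, 1, 1]                             # s = 4
out = weak_prg(seed, 16)
i = 10
assert predictor(out[:i], len(seed)) == out[i]  # predicts bit i+1 correctly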

Practical Ciphers

In the last section we talked about how to use an OTP cipher to obtain perfect secrecy, but also why that level of secrecy is impractical. How do we solve this? If we could shorten the key dramatically, then securely sharing the key with other parties would be much easier than securely sharing a key as long as the message itself.

Pseudo-random numbers

What we would like to do is to somehow take a small key, probably given by the user as a password, and generate a much larger key that can be used for actual encryption. We formalize this concept as follows:

Definition: A pseudo-random number generator, usually just abbreviated PRG, is a function \(G : \{ 0, 1 \}^s \to \{ 0, 1 \}^n \) such that \(s\) is much, much smaller than \(n\). The set \( \{ 0, 1 \}^s\) is called the seed space and is used to generate the much larger key space, as given by \(\{ 0, 1 \}^n\).
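
To see the shape of such a function, here is an illustrative sketch that stretches a short seed into \(n\) bytes by hashing the seed together with a counter. This only demonstrates the signature \(G : \{0,1\}^s \to \{0,1\}^n\); no claim of security is being made for it.

import hashlib

def prg(seed: bytes, n: int) -> bytes:
  # Stretch a short seed into n bytes: hash seed || counter, block by block
  out = b""
  counter = 0
  while len(out) < n:
    out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
    counter += 1
  return out[:n]

keystream = prg(b"a short password-derived seed", 1024)  # s much smaller than n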

Secure random numbers

We now have a notion of a pseudo-random generator, but this alone is not enough for secure random numbers. A stronger condition is needed for the output to resist a very smart attacker. To define that condition, we first formalize what an attacker can do.

Definition: A statistical test is an algorithm that can be run in polynomial time or better. It takes an element \(x \in \{ 0, 1 \}^n\) and produces a \(0\) if it believes that \(x\) was produced by a pseudo-random number generator and \(1\) otherwise.

Example: Let \(A(x) = lsb(x)\), where \(lsb\) produces the first bit of \(x\).

Example: Let \(A\) produce a \(1\) if \(\frac{ numberOf1s(x) }{ length(x) } > \frac{1}{2}\), and \(0\) otherwise.
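
In Python, these two toy tests might look like the following sketch (assuming \(x\) is given as a string of 0 and 1 characters):

def lsb_test(x):
  # Example 1: A(x) = lsb(x), here taken to be the first bit of x
  return int(x[0])

def majority_test(x):
  # Example 2: output 1 if more than half the bits of x are 1s, else 0
  return 1 if x.count("1") / len(x) > 0.5 else 0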

In reality, a statistical test will be much more complicated than these. Note also that an attacker could run an exponential-time algorithm against a PRG, but for now we only consider polynomial-time tests.

Definition: Let \(A\) be a statistical test. Then the statistical advantage of \(A\) on a PRG \(G\) is the formula:

\( ADV_{PRG}[A, G] = \left| Pr[ A(G(k)) = 1 ] - Pr[ A(r) = 1 ] \right| \)

where \(r\) is a truly random bit string of the same length as \(G(k)\), and \(k\) is chosen uniformly at random from the seed space.
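
One way to build intuition for this quantity is to estimate it by sampling. The sketch below pits the majority test from above against a made-up generator whose bits lean towards 1; the estimated advantage comes out far from 0, so the test distinguishes this generator from true randomness.

import random

def random_bits(n):
  # A truly random bit string r of length n
  return "".join(random.choice("01") for _ in range(n))

def biased_prg(seed, n):
  # A weak, made-up generator whose output bits are 1 about 60% of the time
  rng = random.Random(seed)
  return "".join("1" if rng.random() < 0.6 else "0" for _ in range(n))

def majority_test(x):
  # The second example test from above
  return 1 if x.count("1") / len(x) > 0.5 else 0

def advantage(test, gen, n, trials=10000):
  # Monte Carlo estimate of | Pr[A(G(k)) = 1] - Pr[A(r) = 1] |
  hits_g = sum(test(gen(random.getrandbits(32), n)) for _ in range(trials))
  hits_r = sum(test(random_bits(n)) for _ in range(trials))
  return abs(hits_g - hits_r) / trials

print(advantage(majority_test, biased_prg, n=256))  # roughly 0.5, far from 0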

The next section will be on stream ciphers.

The XOR function

Definition: The XOR function, oftentimes denoted by \(\oplus\), is defined bitwise over pairs of elements of the set \(\{ 0, 1 \}^n\); on individual bits it satisfies:

  • \(0 \oplus 0 = 0\)
  • \(0 \oplus 1 = 1\)
  • \(1 \oplus 0 = 1\)
  • \(1 \oplus 1 = 0\)

The XOR function has a huge role to play in crypto. It’s often said that the only thing cryptographers know how to do is XOR things together. Many crypto algorithms make heavy use of XOR, alongside some other methods of mixing plain text with some secret key.

XOR algebra

Lemma: (0 is an identity) For any \(x \in \{ 0, 1 \}\) we have \(0 \oplus x = x = x \oplus 0\)

Proof: We break this into cases:

  • If \(x = 0\) then \(0 \oplus 0 = 0 = 0 \oplus 0\).
  • If \(x = 1\) then \(0 \oplus 1 = 1 = 1 \oplus 0\).

Lemma: (self cancellation) For any \( x \in \{ 0, 1 \} \), we have \( x \oplus x = 0 \), and hence bitwise for any \( x \in \{ 0, 1 \}^n \).

Proof: By cases:

  • If \( x = 1 \) then \( 1 \oplus 1 = 0 \).
  • If \( x = 0 \) then \( 0 \oplus 0 = 0 \).

Lemma: (left and right cancellation) If \(a \oplus b = a \oplus c\), then \(b = c\); likewise, if \(b \oplus a = c \oplus a\), then \(b = c\).

Proof: Let \(a, b, c \in \{ 0, 1 \}\), and suppose that \(a \oplus b = a \oplus c\). Then:

\(a \oplus (a \oplus b) = a \oplus (a \oplus c) \)

By associativity of \(\oplus\), this regroups to:

\((a \oplus a) \oplus b = (a \oplus a) \oplus c \)

Since \(x \oplus x = 0\) then

\(0 \oplus b = 0 \oplus c \)

And \(0 \oplus x = x\), which gives

\(b = c\)

So we have left cancellation with \(\oplus\). The same proof works for the right hand side, so \(\oplus\) has both left and right cancellation.

XOR in practice

Most programming languages have a built-in bitwise XOR function. In Python, the caret (^) gives you that. To use XOR as a cipher, the following code takes a plaintext and a key of the same length.

def encrypt(plaintext, key):
  # XOR each plaintext character with the matching key character
  return bytearray(ord(p) ^ ord(k) for p, k in zip(plaintext, key))
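
Since \(\oplus\) is self-cancelling, XORing again with the same key decrypts. A quick sanity check (the key here is arbitrary, chosen only to match the plaintext length):

def decrypt(ciphertext, key):
  # XOR again with the same key to undo the encryption
  return "".join(chr(c ^ ord(k)) for c, k in zip(ciphertext, key))

ct = encrypt("attack at dawn", "munitionsdepot")
assert decrypt(ct, "munitionsdepot") == "attack at dawn"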

A proof of Bayes Rule

One important classifier in machine learning is the Naive Bayes classifier. In this post I’m going to state and prove the Bayes formula, which is used to run such a classifier, building up from some basic ideas in probability theory.

Preliminaries

Definition: A set is a collection of elements. We usually write sets using curly braces, e.g. \(A = \{1, 2, 3\}\).

Definition: A subset \(B\) of a set \(A\) is a set whose elements are all taken from \(A\).

Example: With \(A = \{1, 2, 3\}\) as above, consider \(B = \{1,2\}\). Then \(B\) is in fact a subset of \(A\), and we write this \(B \subset A\).

Definition: The function \(Pow : Set \to Set\) takes a set as an argument and returns the collection of all of its subsets.

Example: Let \(A = \{1, 2, 3\}\) as before. Then \(Pow(A) = \{ \{\}, \{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \{1,2,3\} \}\).
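
In Python, one way to compute \(Pow\) is to enumerate the subsets of every size, from the empty set up to the whole set (a small sketch using itertools):

from itertools import chain, combinations

def power_set(A):
  # Subsets of every size r = 0, 1, ..., |A|
  items = list(A)
  return [set(c) for c in chain.from_iterable(
    combinations(items, r) for r in range(len(items) + 1))]

print(power_set({1, 2, 3}))  # the 8 subsets from the example above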

Definition: An event is an element of the set \(E\), the set of all possible events.

Example: Flip a coin. If it lands on heads we may call that \(e_H \in E\) and call it \(e_T \in E\) if it lands on tails, giving \(E = \{e_H, e_T\}\).

Example: Roll a six-sided die. We can say that \(E = \{e_1, e_2, \ldots, e_6\}\) for all 6 possibilities of the die.

Definition: Let \(A, B\) be events. We write \(A \cap B\) to denote the fact that both \(A\) and \(B\) happened.

Definition: Two events are said to be mutually disjoint if they cannot occur at the same time.

Example: A coin landing on heads implies that it cannot also land on tails at the same time. So in this case \(e_H, e_T\) are mutually disjoint. However, a die landing on an even number and on a number less than 3 is entirely possible (namely on a 2), so \(e_{even}\) and \(e_{< 3}\) are not mutually disjoint: both occur exactly when the die shows \(e_2\).

Probabilities

Definition: The function \(Pr\), over a set of mutually disjoint events \(E\), is called the probability function and it has the following properties:

  • \(Pr : E \to [0,1]\) – in other words, for any \(e \in E\) we have \(0 \le Pr(e) \le 1\)
  • \( \sum_{e \in E} Pr(e) = 1 \) – in other words, the sum of all possible events in \(E\) must sum to \(1\).
  • Let \(e_1, e_2 \in E\) such that \(e_1 \neq e_2\). Then \(Pr(\{e_1, e_2\}) = Pr(e_1) + Pr(e_2)\).

These three properties are called Kolmogorov’s axioms, and from them follow all other properties about probability.
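
As a quick check, the fair six-sided die from the earlier example satisfies all three properties (using exact fractions):

from fractions import Fraction

Pr = {f"e_{i}": Fraction(1, 6) for i in range(1, 7)}  # a fair six-sided die
assert all(0 <= p <= 1 for p in Pr.values())          # first property
assert sum(Pr.values()) == 1                          # second property
assert Pr["e_1"] + Pr["e_2"] == Fraction(1, 3)        # third property, on {e_1, e_2}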

Joint Probability

Definition: The joint probability of two events \(A \subseteq E\) and \(B \subseteq E\) is defined as \(Pr(A \cap B)\).

Lemma: If \(A \cap B = \emptyset\) then \(Pr(A \cup B) = Pr(A) + Pr(B)\).

Proof: Let \(A = \{a_1, a_2, \ldots, a_n\}, B = \{b_1, b_2, \ldots, b_m\}\). Then \(Pr(A) = \sum_{a_i \in A} Pr(a_i)\) and \(Pr(B) = \sum_{b_j \in B} Pr(b_j)\) and similarly \(Pr(A \cup B) = \sum_{x_k \in A \cup B} Pr(x_k)\).

However, since \(A \cap B = \emptyset\), the elements \(x_k\) of \(A \cup B\) are exactly the elements of \(A\) together with those of \(B\), with nothing counted twice. This gives us:

\( Pr(A \cup B) = \sum_{x_k \in A \cup B} Pr(x_k) = \sum_{a_i \in A} Pr(a_i) + \sum_{b_j \in B} Pr(b_j) = Pr(A) + Pr(B) \)

Conditional Probability

Definition: The conditional probability of \(A\) given \(B\) is defined as \(Pr(A | B) = \frac{ Pr(A \cap B) }{ Pr(B) }\), assuming \(Pr(B) > 0\).

Lemma: (chain rule) For any \(A_1, A_2 \subseteq E\) we have \(Pr(A_1 \cap A_2) = Pr(A_1) Pr(A_2 | A_1)\)

Proof: By definition, \(Pr(A_2 | A_1) = \frac{ Pr(A_2 \cap A_1) }{ Pr(A_1) }\), therefore

\(Pr(A_1) Pr(A_2 | A_1) = Pr(A_1) \frac{ Pr(A_2 \cap A_1) }{ Pr(A_1) } = Pr(A_2 \cap A_1) = Pr(A_1 \cap A_2)\)

The proof above extends to any number of events \(A_1, A_2, \ldots, A_n\).
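
A quick numerical check of the chain rule, using the fair die with \(A_1\) the even outcomes and \(A_2\) the outcomes less than 3:

from fractions import Fraction

def pr(outcomes):
  # Each of the 6 faces of a fair die is equally likely
  return Fraction(len(outcomes), 6)

A1, A2 = {2, 4, 6}, {1, 2}
cond = pr(A1 & A2) / pr(A1)          # Pr(A2 | A1) = 1/3
assert pr(A1 & A2) == pr(A1) * cond  # both sides equal 1/6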

Lemma: (joint unitary rule) Let \(A_1, A_2, \ldots, A_n\) be mutually disjoint events such that \(A_1 \cup A_2 \cup \ldots \cup A_n = E\). Then for any \(B \subseteq E\) we have \(\sum_i Pr(A_i \cap B) = Pr(B)\)

Proof: Since the \(A_i\) are mutually disjoint, the sets \(A_i \cap B\) are also mutually disjoint, and their union is \((\bigcup_i A_i) \cap B = E \cap B = B\). By additivity, \(\sum_i Pr(A_i \cap B) = Pr(\bigcup_i (A_i \cap B)) = Pr(B)\).

Bayes Rule

Theorem: Let \(H_1, H_2, \ldots, H_n\) be a set of mutually disjoint hypotheses, such that \(\bigcup_{ 1 \le i \le n } H_i = E\). Let \(D \subseteq E\). Then

\(Pr(H_i | D) = \frac{ Pr(H_i)Pr(D | H_i) }{ \sum_i Pr(H_i) Pr(D | H_i) }\)

Proof: We approach this by finding equalities for the numerator and the denominator.

For the numerator \(Pr(H_i) Pr(D | H_i)\), we replace the second factor with its definition, giving \(Pr(H_i) \frac{ Pr(D \cap H_i) }{ Pr(H_i) } = Pr(D \cap H_i)\).

For the denominator, the same step as in the numerator gives, for each term of the sum, the identity:

\(Pr(H_i) Pr(D | H_i) = Pr(D \cap H_i) \)

Summing over all \(H_i\), the joint unitary rule gives:

\(\sum_i Pr(D \cap H_i) = Pr(D)\)

This gives

\(Pr(H_i | D) = \frac{ Pr(H_i)Pr(D | H_i) }{ \sum_i Pr(H_i) Pr(D | H_i) } = \frac{ Pr(D \cap H_i) }{ Pr(D) }\)

which is exactly the definition of the conditional probability \(Pr(H_i | D)\), completing the proof.
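
As a quick numerical check of the formula, here are made-up priors and likelihoods for two disjoint hypotheses:

# Made-up values: Pr(H_i) and Pr(D | H_i) for two disjoint hypotheses
priors = {"H1": 0.3, "H2": 0.7}
likelihood = {"H1": 0.8, "H2": 0.1}

# Denominator: sum_i Pr(H_i) Pr(D | H_i), which equals Pr(D)
pr_d = sum(priors[h] * likelihood[h] for h in priors)

# Bayes rule: Pr(H_i | D)
posterior = {h: priors[h] * likelihood[h] / pr_d for h in priors}
print(posterior)  # {'H1': ~0.774, 'H2': ~0.226}; the posteriors sum to 1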

In the next section I will construct a Naive Bayes classifier and provide some working Python code.

Crypto Notes

Preface

These notes are mostly based on the Coursera course. However, the commentary and ordering of topics are totally my own.

Building Ciphers

Most modern cryptography is about building ciphers. This series of notes will give an idea of how cryptographers prove that particular ciphers are secure, and to what degree they are secure. To begin, we need to define a very useful function in crypto, namely the XOR function.

The XOR function

I’ve moved the section about the XOR function into its own page.

Ciphers

In order to work with ciphers mathematically, we need a formal definition of a cipher.

Definition: A cipher is a pair \( (E, D) \) such that \( E : K \times M \to C \) and \( D : K \times C \to M \).

  • \( E \) is called the encryption algorithm and \( D \) is the decryption algorithm.
  • Both \( E \) and \( D \) must be, at worst, polynomial-time algorithms.
  • \( K \) is the key-space, \( M \) is the message-space and \( C \) is the cipher text-space.
  • The functions \( E \) and \( D \) must satisfy: \( D(k, E(k, m)) = m, \forall m \in M, \forall k \in K \).

Perfect secrecy

“A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.”

Kerckhoffs’s principle

As it so happens, it’s entirely possible to construct a cipher such that the cipher texts of two different messages are statistically indistinguishable. That is to say, given a cipher text \(c\), it is impossible to tell whether it comes from a plain text \(m_0\) or \(m_1\).

Definition: A cipher \( (E, D) \) is said to have perfect secrecy if

\( Pr[ E(k, m_0) = c ] = Pr[ E(k, m_1) = c ] \)

for all \( m_0, m_1 \in M \) and for all \( c \in C \), where \( k \) is chosen uniformly at random from \( K \).

The notion of perfect secrecy was formalized by Shannon, and helps us to understand the levels of secrecy that can be obtained by formal methods. In reality, most crypto algorithms do not have perfect secrecy, but there is one important example, namely the one time pad cipher, that does have perfect secrecy.

One time pads (OTPs)

Definition: The one time pad cipher is defined as \( E(k,m) = k \oplus m \) and \( D(k,c) = k \oplus c \).

Lemma: The OTP cipher is actually a cipher.

Proof: \( D(k, E(k, m)) = D(k, k \oplus m) = k \oplus (k \oplus m) = (k \oplus k) \oplus m = 0 \oplus m = m \)
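
A direct implementation of OTP over bytes (a minimal sketch; Python’s secrets module supplies uniformly random key material):

import secrets

def otp(key, message):
  # E(k, m) = k XOR m; by the lemma above, the same function also decrypts
  return bytes(k ^ m for k, m in zip(key, message))

m = b"attack at dawn"
k = secrets.token_bytes(len(m))  # |k| = |m|, chosen uniformly at random
c = otp(k, m)
assert otp(k, c) == m            # D(k, E(k, m)) = m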

One time pads are the most theoretically secure method of encryption. We will prove this in a moment, but first we need a useful lemma.

Lemma: For the OTP cipher, if \(E(k_0, m) = c\) and \(E(k_1, m) = c\) then \(k_0 = k_1\).

Proof: Since \(E(k, m) = k \oplus m\) then we have

\(k_0 \oplus m = E(k_0, m) = c = E(k_1, m) = k_1 \oplus m\)

So \(k_0 \oplus m = k_1 \oplus m\) and with cancellation on the right we have \(k_0 = k_1\). So there is exactly one key for every pair \((m, c)\).

OTPs have Perfect secrecy

Theorem: The OTP cipher has perfect secrecy.

Proof: To prove that OTP has perfect secrecy, we must show that

\( Pr[ E(k, m_0) = c ] = Pr[ E(k, m_1) = c ] \)

The value \( Pr[ E(k, m_0) = c ] \) can be reasoned as follows. Since \(k\) is chosen randomly from \(K\) then the probability that we will pick some specific \(k_i\) out of \(K\) is exactly \( 1 / |K| \).

We can re-write the equation \( E(k, m_0) = c \) as \( k \oplus m_0 = c \) or using right cancellation \( k = c \oplus m_0 \). So now we must determine

\( Pr[ k = m_0 \oplus c ] \)

and since \(m_0, c\) are fixed (they were chosen before the equation was formed), the only probability we need is

\( Pr[ k ] \)

which we determined was \( 1 / |K| \). The same trick works for \( Pr[ E(k, m_1) = c ] \), since \(m_1, c\) are also bound. Therefore

\( Pr[ E(k, m_0) = c ] = \frac{1}{|K|} = Pr[ E(k, m_1) = c ] \)

No such thing as perfect..

While OTP ciphers have perfect secrecy, they require encryption keys equal to or greater in length than the message itself. That is to say, if \( m \) is the message to be sent and \( |m| \) is the length of the message, then we need a key \( k \) such that \( |m| \le |k| \). If one tries, in some way, to reuse a short key to encrypt a long message, the cipher text becomes prone to certain types of cryptanalysis. Therefore, whatever attacker-resistant channel you have for sending the key, you could just as well use for sending the message itself.

OTPs also require that the keys generated are perfectly random, which is doable with a pair of dice or the random noise generated from the Big Bang, but is very difficult to produce in software, which is usually very deterministic.

These problems are a huge disadvantage for implementing perfect secrecy as a computer program.

In the next section I’ll talk about pseudorandom number generators and how to obtain a more practical level of secrecy, called semantic security.