{"title": "Learning Shuffle Ideals Under Restricted Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 765, "abstract": "The class of shuffle ideals is a fundamental sub-family of regular languages. The shuffle ideal generated by a string set $U$ is the collection of all strings containing some string $u \\in U$ as a (not necessarily contiguous) subsequence. In spite of its apparent simplicity, the problem of learning a shuffle ideal from given data is known to be computationally intractable. In this paper, we study the PAC learnability of shuffle ideals and present positive results on this learning problem under element-wise independent and identical distributions and Markovian distributions in the statistical query model. A constrained generalization to learning shuffle ideals under product distributions is also provided. In the empirical direction, we propose a heuristic algorithm for learning shuffle ideals from given labeled strings under general unrestricted distributions. Experiments demonstrate the advantage for both efficiency and accuracy of our algorithm.", "full_text": "Learning Shuf\ufb02e Ideals\n\nUnder Restricted Distributions\n\nDongqu Chen\n\nYale University\n\nDepartment of Computer Science\n\ndongqu.chen@yale.edu\n\nAbstract\n\nThe class of shuf\ufb02e ideals is a fundamental sub-family of regular languages. The\nshuf\ufb02e ideal generated by a string set U is the collection of all strings containing\nsome string u \u2208 U as a (not necessarily contiguous) subsequence. In spite of\nits apparent simplicity, the problem of learning a shuf\ufb02e ideal from given data is\nknown to be computationally intractable. In this paper, we study the PAC learn-\nability of shuf\ufb02e ideals and present positive results on this learning problem under\nelement-wise independent and identical distributions and Markovian distributions\nin the statistical query model. 
A constrained generalization to learning shuffle ideals under product distributions is also provided. In the empirical direction, we propose a heuristic algorithm for learning shuffle ideals from given labeled strings under general unrestricted distributions. Experiments demonstrate the advantage of our algorithm in both efficiency and accuracy.

1 Introduction

The learnability of regular languages is a classic topic in computational learning theory. Applications of this learning problem include natural language processing (speech recognition, morphological analysis), computational linguistics, robotics and control systems, computational biology (phylogeny, structural pattern recognition), data mining, time series and music ([7, 16, 18–21, 23, 25]). Exploring the learnability of this family of formal languages is significant to both the theoretical and applied realms. In the classic PAC learning model defined by Valiant [26], unfortunately, the class of regular languages, or equivalently the concept class of deterministic finite automata (DFA), is known to be inherently unpredictable ([1, 9, 22]). In a modified version of Valiant's model which allows the learner to make membership queries, Angluin [2] has shown that the concept class of regular languages is PAC learnable.

Throughout this paper we study the PAC learnability of a fundamental subclass of regular languages, the class of (extended) shuffle ideals. 
The shuf\ufb02e ideal generated by an augmented string U is the\ncollection of all strings containing some u \u2208 U as a (not necessarily contiguous) subsequence,\nwhere an augmented string is a \ufb01nite concatenation of symbol sets (see Figure 1 for an illustration).\nThe special class of shuf\ufb02e ideals generated by a single string is called the principal shuf\ufb02e ideals.\nIn spite of its simplicity, the class of shuf\ufb02e ideals plays a prominent role in formal language theory.\nThe boolean closure of shuf\ufb02e ideals is the important language family known as piecewise-testable\nlanguages ([24]). The rich structure of this language family has made it an object of intensive study\nin complexity theory and group theory ([12, 17]). In the applied direction, Kontorovich et al. [15]\nshow the shuf\ufb02e ideals capture some rudimentary phenomena in human language morphology.\nUnfortunately, even such a simple class is not PAC learnable, unless RP=NP ([3]). However, in\nmost application scenarios, the strings are drawn from some particular distribution we are interested\nin. Angluin et al. [3] prove under the uniform string distribution, principal shuf\ufb02e ideals are PAC\n\n1\n\n\fFigure 1: The DFA accepting precisely the shuf\ufb02e ideal of U = (a|b|d)a(b|c) over \u03a3 = {a, b, c, d}.\n\nlearnable. Nevertheless, the requirement of complete knowledge of the distribution, the dependence\non the symmetry of the uniform distribution and the restriction of principal shuf\ufb02e ideals lead to the\nlack of generality of the algorithm. Our main contribution in this paper is to present positive results\non learning the class of shuf\ufb02e ideals under element-wise independent and identical distributions\nand Markovian distributions. 
Extensions of our main results include a constrained generalization\nto learning shuf\ufb02e ideals under product distributions and a heuristic method for learning principal\nshuf\ufb02e ideals under general unrestricted distributions.\nAfter introducing the preliminaries in Section 2, we present our main result in Section 3: the ex-\ntended class of shuf\ufb02e ideals is PAC learnable from element-wise i.i.d. strings. That is, the dis-\ntributions of the symbols in a string are identical and independent of each other. A constrained\ngeneralization to learning shuf\ufb02e ideals under product distributions is also provided. In Section 4,\nwe further show the PAC learnability of principal shuf\ufb02e ideals when the example strings drawn\nfrom \u03a3\u2264n are generated by a Markov chain with some lower bound assumptions on the transition\nmatrix.\nIn Section 5, we propose a greedy algorithm for learning principal shuf\ufb02e ideals under\ngeneral unrestricted distributions. Experiments demonstrate the advantage for both ef\ufb01ciency and\naccuracy of our heuristic algorithm.\n\n2 Preliminaries\nWe consider strings over a \ufb01xed \ufb01nite alphabet \u03a3. The empty string is \u03bb. Let \u03a3\u2217 be the Kleene\nstar of \u03a3 and \u03a3\u222a be the collection of all subsets of \u03a3. As strings are concatenations of symbols, we\nsimilarly de\ufb01ne augmented strings as concatenations of unions of symbols.\n\nDe\ufb01nition 1 (Alphabet, simple string and augmented string) Let \u03a3 be a non-empty \ufb01nite set of\nsymbols, called the alphabet. A simple string over \u03a3 is any \ufb01nite sequence of symbols from \u03a3, and\n\u03a3\u2217 is the collection of all simple strings. An augmented string over \u03a3 is any \ufb01nite concatenation of\nsymbol sets from \u03a3\u222a, and (\u03a3\u222a)\n\n\u2217 is the collection of all augmented strings.\n\nDenote by s the cardinality of \u03a3. 
Because an augmented string only contains strings of the same\nlength, the length of an augmented string U, denoted by |U|, is the length of any u \u2208 U. We use\nexponential notation for repeated concatenation of a string with itself, that is, vk is the concatenation\nof k copies of string v. Starting from index 1, we denote by vi the i-th symbol in string v and use\n\u2217\nnotation v[i, j] = vi . . . vj for 1 \u2264 i \u2264 j \u2264 |v|. De\ufb01ne the binary relation (cid:118) on (cid:104)(\u03a3\u222a)\n, \u03a3\u2217(cid:105) as\nfollows. For a simple string w, w (cid:118) v holds if and only if there is a witness (cid:126)i = (i1 < i2 < . . . <\ni|w|) such that vij = wj for all integers 1 \u2264 j \u2264 |w|. For an augmented string W , W (cid:118) v if and only\nif there exists some w \u2208 W such that w (cid:118) v. When there are several witnesses for W (cid:118) v, we may\norder them coordinate-wise, referring to the unique minimal element as the leftmost embedding. We\nwill write IW(cid:118)v to denote the position of the last symbol of W in its leftmost embedding in v (if the\nlatter exists; otherwise, IW(cid:118)v = \u221e).\nDe\ufb01nition 2 (Extended/Principal Shuf\ufb02e Ideal) The (extended) shuf\ufb02e ideal of an augmented\nstring U \u2208 (\u03a3\u222a)L is a regular language de\ufb01ned as X(U ) = {v \u2208 \u03a3\u2217 | \u2203u \u2208 U, u (cid:118) v} =\n\u03a3\u2217U1\u03a3\u2217U2\u03a3\u2217 . . . \u03a3\u2217UL\u03a3\u2217. A shuf\ufb02e ideal is principal if it is generated by a simple string.\n\nA shuf\ufb02e ideal is an ideal in order theory and was originally de\ufb01ned for lattices. Denote by\nthe\nclass of principal shuf\ufb02e ideals and by X the class of extended shuf\ufb02e ideals. Unless otherwise\nstated, in this paper shuf\ufb02e ideal refers to the extended ideal. An example is given in Figure 1. 
The\nfeasibility of determining whether a string is in the class X(U ) is obvious.\nLemma 1 Evaluating relation U (cid:118) x and meanwhile determining IU(cid:118)x is feasible in time O(|x|).\n\n2\n\n\fIn a computational learning model, an algorithm is usually given access to an oracle providing\ninformation about the sample. In Valiant\u2019s work [26], the example oracle EX(c,D) was de\ufb01ned,\nwhere c is the target concept and D is a distribution over the instance space. On each call, EX(c,D)\ndraws an input x independently at random from the instance space I under the distribution D, and\nreturns the labeled example (cid:104)x, c(x)(cid:105).\nDe\ufb01nition 3 (PAC Learnability: [26]) Let C be a concept class over the instance space I. We\nsay C is probably approximately correctly (PAC) learnable if there exists an algorithm A with the\nfollowing property: for every concept c \u2208 C, for every distribution D on I, and for all 0 < \u0001 < 1/2\nand 0 < \u03b4 < 1/2, if A is given access to EX(c,D) on I and inputs \u0001 and \u03b4, then with probability\nat least 1 \u2212 \u03b4, A outputs a hypothesis h \u2208 H satisfying Prx\u2208D[c(x) (cid:54)= h(x)] \u2264 \u0001. If A runs in time\npolynomial in 1/\u0001, 1/\u03b4 and the representation size of c, we say that C is ef\ufb01ciently PAC learnable.\n\nIf the error parameter\nWe refer to \u0001 as the error parameter and \u03b4 as the con\ufb01dence parameter.\nis set to 0, the learning is exact ([6]). Kearns [11] extended Valiant\u2019s model and introduced the\nstatistical query oracle STAT(c,D). Kearns\u2019 oracle takes as input a statistical query of the form\n(\u03c7, \u03c4 ). Here \u03c7 is any mapping of a labeled example to {0, 1} and \u03c4 \u2208 [0, 1] is called the noise\ntolerance. STAT(c,D) returns an estimate for the expectation IE\u03c7, that is, the probability that \u03c7 = 1\nwhen the labeled example is drawn according to D. 
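Lemma 1's O(|x|) bound follows from a single greedy left-to-right scan. Below is a minimal Python sketch (the function name and the representation of an augmented string as a list of symbol sets are my own, not the paper's), checked against the Figure 1 ideal U = (a|b|d)a(b|c):

```python
def leftmost_embedding(U, x):
    """Greedy O(|x|) scan. Return the 1-based position of the last matched
    element of augmented string U in its leftmost embedding in x (I_{U ⊑ x}),
    or None if U is not a subsequence of x. U is a list of symbol sets."""
    j = 0  # index of the next element of U to match
    for i, symbol in enumerate(x, start=1):
        if j < len(U) and symbol in U[j]:
            j += 1
            if j == len(U):
                return i  # leftmost embedding complete at position i
    return None  # U ⋢ x

# x belongs to the ideal X(U) iff leftmost_embedding(U, x) is not None.
assert leftmost_embedding([{'a', 'b', 'd'}, {'a'}, {'b', 'c'}], "caacb") == 4
assert leftmost_embedding([{'a'}, {'b'}], "ba") is None
```

The greedy choice is what makes the embedding leftmost: each element of U is matched at the earliest position still available, so a single pass suffices.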
A statistical query can have a condition, so IEχ can be a conditional probability. The estimate is accurate within additive error τ.

Definition 4 (Legitimacy and Feasibility: [11]) A statistical query χ is legitimate and feasible if and only if, with respect to 1/ε, 1/τ and the representation size of c:

1. Query χ maps a labeled example ⟨x, c(x)⟩ to {0, 1};

2. Query χ can be evaluated in polynomial time;

3. The condition of χ, if any, can be evaluated in polynomial time;

4. The probability of the condition of χ, if any, is at least polynomially large.

Throughout this paper, the learnability of shuffle ideals is studied in the statistical query model. Kearns [11] proves that oracle STAT(c,D) is weaker than oracle EX(c,D): if a concept class is PAC learnable from STAT(c,D), then it is PAC learnable from EX(c,D), but not necessarily vice versa.

3 Learning shuffle ideals from element-wise i.i.d. strings

Although learning the class of shuffle ideals has been proven hard, in most scenarios the string distribution is restricted or even known. A common situation in practice is that we have some prior knowledge of the unknown distribution. One example is a string distribution where each symbol in a string is generated independently and identically from an unknown distribution; we call it element-wise i.i.d. because we view a string as a vector of symbols. This case is general enough to cover some popular distributions in applications, such as the uniform distribution and the multinomial distribution. In this section, we present as our main result a statistical query algorithm for learning the concept class of extended shuffle ideals from element-wise i.i.d. strings, and provide theoretical guarantees of its computational efficiency and accuracy in the statistical query model. The instance space is Σ^n. 
Denote by U the augmented pattern string that generates the target shuffle ideal and by L = |U| the length of U.

3.1 Statistical query algorithm

Before presenting the algorithm, we define the function θV,a(·) and the query χV,a(·,·) for any augmented string V ∈ (Σ∪)^{≤n} and any symbol a ∈ Σ as follows:

θV,a(x) = a, if V ⋢ x[1, n − 1];
θV,a(x) = x_i with i = IV⊑x + 1, if V ⊑ x[1, n − 1];

χV,a(x, y) = (y + 1)/2, given θV,a(x) = a,

where y = c(x) is the label of example string x. More precisely, y = +1 if x ∈ X(U) and y = −1 otherwise. Our learning algorithm uses statistical queries to recover the string U ∈ (Σ∪)^L one element at a time. It starts with the empty string V = λ. Having recovered V = U[1, ℓ] where 0 ≤ ℓ < L, we infer Uℓ+1 as follows. For each a ∈ Σ, the statistical query oracle is called with the query χV,a at the error tolerance τ claimed in Theorem 1. Our key technical observation is that the value of IEχV,a effectively selects Uℓ+1: the query results of χV,a form two separate clusters such that the maximum difference (variance) inside one cluster is smaller than the minimum difference (gap) between the two clusters, making them distinguishable. The set of symbols in the cluster with the larger query results is proved to be Uℓ+1. Notice that this statistical query only works for 0 ≤ ℓ < L. 
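As a concrete (hypothetical) illustration, the inference of Uℓ+1 can be simulated by replacing the oracle STAT with empirical frequencies over a labeled sample. Everything below, including the largest-gap clustering rule and all names, is an illustrative stand-in rather than the paper's analysis:

```python
def recover_next_element(V, alphabet, sample):
    """Empirical stand-in for one round of the SQ algorithm: estimate
    E[chi_{V,a}] = Pr[y = +1 | theta_{V,a}(x) = a] for each symbol a from
    labeled strings `sample` (list of (string, +1/-1) pairs), then return
    the cluster of symbols with the larger query results."""
    def embed_end(V, w):
        # 1-based endpoint of the leftmost embedding of V in w; 0 for empty V
        if not V:
            return 0
        j = 0
        for i, s in enumerate(w, 1):
            if j < len(V) and s in V[j]:
                j += 1
                if j == len(V):
                    return i
        return None

    rate = {}
    for a in alphabet:
        pos = tot = 0
        for x, y in sample:
            i = embed_end(V, x[:-1])
            # theta_{V,a}(x): 'a' if V is not embedded in x[1, n-1],
            # else the symbol right after the leftmost embedding
            sym = a if i is None else x[i]
            if sym == a:
                tot += 1
                pos += (y + 1) // 2
        rate[a] = pos / tot if tot else 0.0
    # split the symbols into two clusters at the largest gap in sorted rates
    order = sorted(alphabet, key=lambda a: rate[a])
    gaps = [rate[order[k + 1]] - rate[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return set(order[cut:])  # the cluster with the larger query results
```

For example, with the principal target U = a over Σ = {a, b} and the four labeled length-2 strings, the routine returns {'a'} as the first element.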
To\ncomplete the algorithm, we address the trivial case (cid:96) = L with query Pr[y = +1 | V (cid:118) x] and the\nalgorithm halts if the query answer is close to 1.\n\n3.2 PAC learnability of ideal X\n\nWe show the algorithm described above learns the class of shuf\ufb02e ideals from element-wise i.i.d.\nstrings in the statistical query learning model.\nTheorem 1 Under element-wise independent and identical distributions over instance space I =\n\u03a3n, concept class X is approximately identi\ufb01able with O(sn) conditional statistical queries from\nSTAT(X,D) at tolerance\n\nor with O(sn) statistical queries from STAT(X,D) at tolerance\n\u00014\n\n\u0001\n\n(cid:18)\n\n\u00af\u03c4 =\n\n1 \u2212\n\n\u03c4 =\n\n40sn2 + 4\u0001\n\n\u00012\n\n(cid:19)\n\n20sn2 + 2\u0001\n\n16sn(10sn2 + \u0001)\n\nWe provide the main idea of the proofs in this section and defer the details and algebra to Appendix\nA. The proof starts from the legitimacy and feasibility of the algorithm. Since \u03c7V,a computes a\nbinary mapping from labeled examples to {0, 1}, the legitimacy is trivial. But \u03c7V,a is not feasible\nfor symbols in \u03a3 of small occurrence probabilities. We avoid the problematic cases by reducing the\noriginal learning problem to the same problem with a polynomial lower bound assumption Pr[xi =\na] \u2265 \u0001/(2sn) \u2212 \u00012/(20sn2 + 2\u0001) for any a \u2208 \u03a3 and achieve feasibility.\nThe correctness of the algorithm is based on the intuition that the query result IE\u03c7V,a+ of a symbol\na+ \u2208 U(cid:96)+1 should be greater than that of a symbol a\u2212 (cid:54)\u2208 U(cid:96)+1 and the difference is large enough\nto tolerate the noise from the oracle. To prove this, we \ufb01rst consider the exact learning case. 
De\ufb01ne\nan in\ufb01nite string U(cid:48) = U [1, (cid:96)]U [(cid:96) + 2, L]U\u221e\n(cid:96)+1 and let x(cid:48) = x\u03a3\u221e be the extension of x obtained by\npadding it on the right with an in\ufb01nite string generated from the same distribution as x. Let Q(j, i)\nbe the probability that the largest g such that U(cid:48)[1, g] (cid:118) x(cid:48)[1, i] is j, or formally\nQ(j, i) = Pr[U(cid:48)[1, j] (cid:118) x(cid:48)[1, i] \u2227 U(cid:48)[1, j + 1] (cid:54)(cid:118) x(cid:48)[1, i]]\n\nBy taking the difference between IE\u03c7V,a+ and IE\u03c7V,a\u2212 in terms of Q(j, i), we get the query tolerance\nfor exact learning.\nLemma 2 Under element-wise independent and identical distributions over instance space I =\n\u03a3n, concept class X is exactly identi\ufb01able with O(sn) conditional statistical queries from\nSTAT(X,D) at tolerance\n\n\u03c4(cid:48) =\n\nQ(L \u2212 1, n \u2212 1)\n\n1\n5\n\nLemma 2 indicates bounding the quantity Q(L \u2212 1, n \u2212 1) is the key to the tolerance for PAC\nlearning. Unfortunately, the distribution {Q(j, i)} doesn\u2019t seem to have any strong properties we\nknow of providing a polynomial lower bound. Instead we introduce new quantity\n\nR(j, i) = Pr[U(cid:48)[1, j] (cid:118) x(cid:48)[1, i] \u2227 U(cid:48)[1, j] (cid:54)(cid:118) x(cid:48)[1, i \u2212 1]]\n\nbeing the probability that the smallest g such that U(cid:48)[1, j] (cid:118) x(cid:48)[1, g] is i. An important property of\ndistribution {R(j, i)} is its strong unimodality as de\ufb01ned below.\n\n4\n\n\fDe\ufb01nition 5 (Unimodality: [8]) A distribution {P (i)} with all support on the lattice of integers is\nunimodal if and only if there exists at least one integer K such that P (i) \u2265 P (i \u2212 1) for all i \u2264 K\nand P (i + 1) \u2264 P (i) for all i \u2265 K. 
We say K is a mode of distribution {P (i)}.\n\nThroughout this paper, when referring to the mode of a distribution, we mean the one with the largest\nindex, if the distribution has multiple modes with equal probabilities.\nDe\ufb01nition 6 (Strong Unimodality: [10]) A distribution {H(i)} is strongly unimodal if and only if\nthe convolution of {H(i)} with any unimodal distribution {P (i)} is unimodal.\n\nSince a distribution with all mass at zero is unimodal, a strongly unimodal distribution is also uni-\nmodal. In this paper, we only consider distributions with all support on the lattice of integers. So the\nconvolution of {H(i)} and {P (i)} is\n\n{H \u2217 P}(i) =\n\nH(j)P (i \u2212 j) =\n\nH(i \u2212 j)P (j)\n\n\u221e(cid:88)\n\nj=\u2212\u221e\n\n\u221e(cid:88)\n\nj=\u2212\u221e\n\nWe prove the strong unimodality of {R(j, i)} with respect to i via showing it is the convolution of\ntwo log-concave distributions by induction. We do an initial statistical query to estimate Pr[y = +1]\nto handle two marginal cases Pr[y = +1] \u2264 \u0001/2 and Pr[y = +1] \u2265 1\u2212\u0001/2. After that an additional\nquery Pr[y = +1 | V (cid:118) x] is made to tell whether (cid:96) = L. If the algorithm doesn\u2019t halt, it means\n(cid:96) < L and both Pr[y = +1] and Pr[y = \u22121] are at least \u0001/2 \u2212 2\u03c4. By upper bounding Pr[y = +1]\nand Pr[y = \u22121] using linear sums of R(j, i), the strong unimodality of {R(j, i)} gives a lower\nbound for R(L, n), which further implies one for Q(L \u2212 1, n \u2212 1) and completes the proof.\n\n3.3 A generalization to instance space \u03a3\u2264n\n\n\ufb01xed length i. Because instance space \u03a3\u2264n =(cid:83)\n\nWe have proved the extended class of shuf\ufb02e ideals is PAC learnable from element-wise i.i.d. \ufb01xed-\nlength strings. Nevertheless, in many real-world applications such as natural language processing\nand computational linguistics, it is more natural to have strings of varying lengths. 
Let n be the\nmaximum length of the sample strings and as a consequence the instance space for learning is \u03a3\u2264n.\nHere we show how to generalize the statistical query algorithm in Section 3.1 to the more general\ninstance space \u03a3\u2264n.\nLet Ai be the algorithm in Section 3.1 for learning shuf\ufb02e ideals from element-wise i.i.d. strings of\ni\u2264n \u03a3i, we divide the sample S into n subsets {Si}\nwhere Si = {x | |x| = i}. An initial statistical query then is made to estimate probability Pr[|x| = i]\nfor each i \u2264 n at tolerance \u0001/(8n). We discard all subsets Si with query answer \u2264 3\u0001/(8n) in the\nlearning procedure, because we know Pr[|x| = i] \u2264 \u0001/(2n). As there are at most (n \u2212 1) such\nSi of low occurrence probabilities. The total probability that an instance comes from one of these\nnegligible sets is at most \u0001/2. Otherwise, Pr[|x| = i] \u2265 \u0001/(4n) and we apply algorithm Ai on each\nSi with query answer \u2265 3\u0001/(8n) with error parameter \u0001/2. Because the probability of the condition\nis polynomially large, the algorithm is feasible. Finally, the total error over the whole instance space\nwill be bounded by \u0001 and concept class X is PAC learnable from element-wise i.i.d. strings over\ninstance space \u03a3\u2264n.\nCorollary 1 Under element-wise independent and identical distributions over instance space I =\n\u03a3\u2264n, concept class X is approximately identi\ufb01able with O(sn2) conditional statistical queries from\nSTAT(X,D) at tolerance\n\nor with O(sn2) statistical queries from STAT(X,D) at tolerance\n\n\u03c4 =\n\n160sn2 + 8\u0001\n\n\u00012\n\n(cid:19)\n\n(cid:18)\n\n\u00af\u03c4 =\n\n1 \u2212\n\n\u0001\n\n\u00015\n\n40sn2 + 2\u0001\n\n512sn2(20sn2 + \u0001)\n\n5\n\n\f3.4 A constrained generalization to product distributions\n\nelement-wise independence between its elements. 
That is, Pr[X = x] =(cid:81)|x|\n\nA direct generalization from element-wise independent and identical distributions is product dis-\ntributions. A random string, or a random vector of symbols under a product distribution has\ni=1 Pr[Xi = xi]. Al-\nthough strings under product distributions share many independence properties with element-wise\ni.i.d. strings, the algorithm in Section 3.1 is not directly applicable to this case as the distribution\n{R(j, i)} de\ufb01ned above is not unimodal with respect to i in general. However, the intuition that\ngiven IV (cid:118)x = h, the strings with xh+1 \u2208 U(cid:96)+1 have higher probability of positivity than that of the\nstrings with xh+1 (cid:54)\u2208 U(cid:96)+1 is still true under product distributions. Thus we generalize query \u03c7V,a\nand de\ufb01ne for any V \u2208 (\u03a3\u222a)\n\n\u2264n, a \u2208 \u03a3 and h \u2208 [0, n \u2212 1],\n\n\u02dc\u03c7V,a,h(x, y) =\n\n1\n2\n\n(y + 1)\n\ngiven IV (cid:118)x = h and xh+1 = a\n\nwhere y = c(x) is the label of example string x. To ensure the legitimacy and feasibility of the\nalgorithm, we have to attach a lower bound assumption that Pr[xi = a] \u2265 t > 0, for \u22001 \u2264 i \u2264 n and\n\u2200a \u2208 \u03a3. Appendix C provides a constrained algorithm based on this intuition. Let P (+|a, h) denote\nIE \u02dc\u03c7V,a,h. If the difference P (+|a+, h)\u2212 P (+|a\u2212, h) is large enough for some h with nonnegligible\nPr[IV (cid:118)x = h], then we are able to learn the next element in U. Otherwise, the difference is very\nsmall and we will show that there is an interval starting from index (h + 1) which we can skip\nwith little risk. The algorithm is able to classify any string whose classi\ufb01cation process skips O(1)\nintervals. 
Details of this constrained generalization are deferred to Appendix C.\n\n4 Learning principal shuf\ufb02e ideals from Markovian strings\n\nMarkovian strings are widely studied in natural language processing and biological sequence mod-\neling. Formally, a random string x is Markovian if the distribution of xi+1 only depends on the\nvalue of xi: Pr[xi+1 | x1 . . . xi] = Pr[xi+1 | xi] for any i \u2265 1. If we denote by \u03c00 the distribution\nof x1 and de\ufb01ne s \u00d7 s stochastic matrix M by M (a1, a2) = Pr[xi+1 = a1 | xi = a2], then a\nrandom string can be viewed as a Markov chain with initial distribution \u03c00 and transition matrix\nM. We choose \u03a3\u2264n as the instance space in this section and assume independence between the\nstring length and the symbols in the string. We assume Pr[|x| = k] \u2265 t for all 1 \u2264 k \u2264 n and\nmin{M (\u00b7,\u00b7), \u03c00(\u00b7)} \u2265 c for some positive t and c. We will prove the PAC learnability of class\nunder this lower bound assumption. Denote by u be the target pattern string and let L = |u|.\n\n4.1 Statistical query algorithm\n\nrecovered v = u[1, (cid:96)], we infer u(cid:96)+1 by \u03a8v,a =(cid:80)n\n\nk=h+1 IE\u03c7v,a,k, where\n\nStarting with empty string v = \u03bb, the pattern string u is recovered one symbol at a time. Having\n\n\u03c7v,a,k(x, y) =\n\n(y + 1)\n\ngiven Iv(cid:118)x = h, xh+1 = a and |x| = k\n\n1\n2\n\n0 \u2264 (cid:96) < L and h is chosen from [0, n \u2212 1] such that the probability Pr[Iv(cid:118)x = h] is polynomially\nlarge. The statistical queries \u03c7v,a,k are made at tolerance \u03c4 claimed in Theorem 2 and the symbol\nwith the largest query result of \u03a8v,a is proved to be u(cid:96)+1. Again, the case where (cid:96) = L is addressed\nby query Pr[y = +1 | v (cid:118) x]. 
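Such a Markovian source is easy to simulate. The sketch below (function names and the toy two-symbol chain are illustrative assumptions) draws x1 from π0 and each subsequent symbol from the transition distribution of the previous one; note the dict layout M[current][next] is the transpose of the paper's convention M(next, current):

```python
import random

def sample_markov_string(pi0, M, length, rng=random):
    """Draw a string of the given length from a Markov model:
    x1 ~ pi0 and x_{i+1} ~ M[x_i].
    pi0: dict symbol -> prob; M: dict symbol -> (dict symbol -> prob)."""
    def draw(dist):
        r, acc = rng.random(), 0.0
        for sym, p in sorted(dist.items()):
            acc += p
            if r < acc:
                return sym
        return sym  # guard against floating-point rounding

    x = [draw(pi0)]
    for _ in range(length - 1):
        x.append(draw(M[x[-1]]))
    return "".join(x)

random.seed(0)
s = sample_markov_string({'a': 0.5, 'b': 0.5},
                         {'a': {'a': 0.9, 'b': 0.1},
                          'b': {'a': 0.1, 'b': 0.9}}, 10)
assert len(s) == 10 and set(s) <= {'a', 'b'}
```

The lower bound assumption min{M(·,·), π0(·)} ≥ c corresponds here to every entry of `pi0` and of each row of `M` being at least c.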
The learning procedure is completed if the query result is close to 1.\n\n4.2 PAC learnability of principal ideal\n\nWith query \u03a8v,a, we are able to recover the pattern string u approximately from STAT(\nproper tolerance as stated in Theorem 2:\nTheorem 2 Under Markovian string distributions over instance space I = \u03a3\u2264n, given Pr[|x| =\nk] \u2265 t > 0 for \u22001 \u2264 k \u2264 n and min{M (\u00b7,\u00b7), \u03c00(\u00b7)} \u2265 c > 0, concept class\nis approximately\nidenti\ufb01able with O(sn2) conditional statistical queries from STAT(\n\n,D) at tolerance\n\n(u),D) at\n\n\u03c4 =\n\n\u0001\n\n3n2 + 2n + 2\n\n6\n\n\for with O(sn2) statistical queries from STAT(\n\n,D) at tolerance\n3ctn\u00012\n\n\u00af\u03c4 =\n\n(3n2 + 2n + 2)2\n\nPlease refer to Appendix B for a complete proof of Theorem 2. Due to the probability lower bound\nassumptions, the legitimacy and feasibility are obvious. To calculate the tolerance for PAC learning,\nwe \ufb01rst consider the exact learning tolerance. Let x(cid:48) be an in\ufb01nite string generated by the Markov\nchain de\ufb01ned above. For any 0 \u2264 (cid:96) \u2264 L \u2212 j, we de\ufb01ne quantity R(cid:96)(j, i) by\nR(cid:96)(j, i) = Pr[u[(cid:96) + 1, (cid:96) + j] (cid:118) x(cid:48)[m + 1, m + i]\u2227 u[(cid:96) + 1, (cid:96) + j] (cid:54)(cid:118) x(cid:48)[m + 1, m + i\u2212 1] | x(cid:48)\nm = u(cid:96)]\nIntuitively, R(cid:96)(j, i) is the probability that the smallest g such that u[(cid:96) + 1, (cid:96) + j] (cid:118) x(cid:48)[m + 1, m + g]\nis i, given x(cid:48)\nLemma 3 Under Markovian string distributions over instance space I = \u03a3\u2264n, given Pr[|x| =\nk] \u2265 t > 0 for \u22001 \u2264 k \u2264 n and min{M (\u00b7,\u00b7), \u03c00(\u00b7)} \u2265 c > 0, the concept class\nis exactly\n,D) at tolerance\nidenti\ufb01able with O(sn2) conditional statistical queries from STAT(\n\nm = u(cid:96). 
We have the following conclusion on the exact learning tolerance.\n\n(cid:40)\n\nn(cid:88)\n\nk=h+1\n\n\u03c4(cid:48) = min\n0\u2264(cid:96) 0 for \u22001 \u2264 k \u2264 n, \u03c00(u1) \u2265 c and M (u(cid:96)+1, u(cid:96)) \u2265 c > 0 for \u22001 \u2264 (cid:96) \u2264 L \u2212 1, concept\n,D)\nclass\nat tolerance\n\nis approximately identi\ufb01able with O(sn2) conditional statistical queries from STAT(\n\n(cid:27)\n\n(cid:26)\n\nor with O(sn2) statistical queries from STAT(\n\n\u03c4 = min\n\n(cid:26)\n\n,\n\n\u0001\n\n3n2 + 2n + 2\n\nc\n3\n,D) at tolerance\ntn\u0001c2\n\n\u00af\u03c4 = min\n\nctn\u00012\n\n(3n2 + 2n + 2)2 ,\n\n3(3n2 + 2n + 2)\n\n(cid:27)\n\n5 Learning shuf\ufb02e ideals under general distributions\n\nAlthough the string distribution is restricted or even known in most application scenarios, one might\nbe interested in learning shuf\ufb02e ideals under general unrestricted and unknown distributions without\nany prior knowledge. Unfortunately, under standard complexity assumptions, the answer is negative.\nAngluin et al. [3] have shown that a polynomial time PAC learning algorithm for principal shuf\ufb02e\nideals would imply the existence of polynomial time algorithms to break the RSA cryptosystem,\nfactor Blum integers, and test quadratic residuosity.\nTheorem 3 ([3]) For any alphabet of size at least 2, given two disjoint sets of strings S, T \u2282 \u03a3\u2264n,\nthe problem of determining whether there exists a string u such that u (cid:118) x for each x \u2208 S and\nu (cid:54)(cid:118) x for each x \u2208 T is NP-complete.\n\n7\n\n\fAs ideal\nover instance space \u03a3n? The answer is again no.\n\nis a subclass of ideal X, we know learning ideal X is only harder. 
Is the problem easier\n\nLemma 4 Under general unrestricted string distributions, a concept class is PAC learnable over\ninstance space \u03a3\u2264n if and only if it is PAC learnable over instance space \u03a3n.\n\nThe proof of Lemma 4 is presented in Appendix D using the same idea as our generalization in\nSection 3.3. Note that Lemma 4 holds under general string distributions. It is not necessarily true\nwhen we have assumptions on the marginal distribution of string length.\nDespite the infeasibility of PAC learning a shuf\ufb02e ideal in theory, it is worth exploring the possi-\nbilities to do the classi\ufb01cation problem without theoretical guarantees, since most applications care\nmore about the empirical performance than about theoretical results. For this purpose we propose a\nheuristic greedy algorithm for learning principal shuf\ufb02e ideals based on reward strategy as follows.\nsumes k elements in x if min{Iva(cid:118)x, n + 1} \u2212 Iv(cid:118)x = k. The reward strategy depends on the ratio\nr+/r\u2212: the algorithm receives r\u2212 reward from each element it consumes in a negative example or\n\nUpon having recovered v = (cid:98)u[1, (cid:96)], for a symbol a \u2208 \u03a3 and a string x of length n, we say a con-\nr+ penalty from each symbol it consumes in a positive string. A symbol is chosen as (cid:98)u(cid:96)+1 if it\nbrings us most reward. The algorithm will halt once(cid:98)u exhausts any positive example and makes a\n((cid:98)u[1, (cid:96) \u2212 1]) is returned\nexamples x such that (cid:98)u (cid:118) x and #(\u2212) is the number of negative examples x such that (cid:98)u (cid:118) x.\n\nfalse negative error, which means we have gone too far. Finally the ideal\nas the hypothesis. The performance of this greedy algorithm depends a great deal on the selection of\nparameter r+/r\u2212. 
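The consume-and-reward rule above can be sketched in code. This is a best-effort reconstruction from the section's description; the tie-breaking rule and the skipping of strings that v has already exhausted are my own choices:

```python
def greedy_shuffle_ideal(sample, alphabet, r_pos, r_neg):
    """Heuristic sketch: grow the pattern u one symbol at a time, scoring
    each candidate a by  r_neg * (elements consumed in negative examples)
                       - r_pos * (elements consumed in positive examples),
    where a consumes min{I_{ua <= x}, |x|+1} - I_{u <= x} elements of x.
    `sample` is a list of (string, +1/-1) pairs."""
    def embed_end(v, w):
        # 1-based endpoint of the leftmost embedding of v in w; 0 for empty v
        j = 0
        for i, s in enumerate(w, 1):
            if j < len(v) and s == v[j]:
                j += 1
                if j == len(v):
                    return i
        return 0 if not v else None

    u = ""
    while True:
        best_a, best_score = None, None
        for a in alphabet:
            score = 0.0
            for x, y in sample:
                start = embed_end(u, x)
                if start is None:
                    continue  # u already exhausted x; nothing left to consume
                end = embed_end(u + a, x)
                consumed = (len(x) + 1 if end is None else end) - start
                score += r_neg * consumed if y < 0 else -r_pos * consumed
            if best_score is None or score > best_score:
                best_a, best_score = a, score
        candidate = u + best_a
        # halt once the extended pattern exhausts a positive example,
        # i.e. would make a false negative; return the previous pattern
        if any(y > 0 and embed_end(candidate, x) is None for x, y in sample):
            return u
        u = candidate
```

Since the pattern grows by one symbol per round, it eventually exceeds the length of the shortest positive example, so the halting test is guaranteed to fire.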
A clever choice is r+/r\u2212 = #(\u2212)/#(+), where #(+) is the number of positive\n\nA more recommended but more complex strategy to determine the parameter r+/r\u2212 in practice is\ncross validation.\nA better studied approach to learning regular languages, especially the piecewise-testable ones, in\nrecent works is kernel machines ([13, 14]). An obvious advantage of kernel machines over our\ngreedy method is its broad applicability to general classi\ufb01cation learning problems. Nevertheless,\nthe time complexity of the kernel machine is O(N 3 + n2N 2) on a training sample set of size N\n([5]), while our greedy method only takes O(snN ) time due to its great simplicity. Because N\nis usually huge for the demand of accuracy, kernel machines suffer from low ef\ufb01ciency and long\nrunning time in practice. To make a comparison between the greedy method and kernel machines\nfor empirical performance, we conducted a series of experiments on a real world dataset [4] with\nstring length n as a variable. The experiment results demonstrate the empirical advantage on both\nef\ufb01ciency and accuracy of the greedy algorithm over the kernel method, in spite of its simplicity.\nAs this is a theoretical paper, we defer the details on the experiments to Appendix D, including the\nexperiment setup and \ufb01gures of detailed experiment results.\n\n6 Discussion\n\nWe have shown positive results for learning shuf\ufb02e ideals in the statistical query model under\nelement-wise independent and identical distributions and Markovian distributions, as well as a con-\nstrained generalization to product distributions. It is still open to explore the possibilities of learning\nshuf\ufb02e ideals under less restricted distributions with weaker assumptions. 
A lot more work also needs to be done on approximately learning shuffle ideals in applications with pragmatic approaches. In the negative direction, even a family of regular languages as simple as the shuffle ideals is not efficiently properly PAC learnable under general unrestricted distributions unless RP=NP. Thus, the search for a nontrivial properly PAC learnable family of regular languages continues. Another theoretical question that remains is how hard the problem of learning shuffle ideals is, or whether PAC learning a shuffle ideal is as hard as PAC learning a deterministic finite automaton.

Acknowledgments

We give our sincere gratitude to Professor Dana Angluin of Yale University for valuable discussions and comments on the learning problem and the proofs. Our thanks are also due to Professor Joseph Chang of Yale University for suggesting supportive references on strong unimodality of probability distributions, and to the anonymous reviewers for their helpful feedback.

References

[1] D. Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39(3):337–350, 1978.

[2] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, Nov. 1987.

[3] D. Angluin, J. Aspnes, S. Eisenstat, and A. Kontorovich. On the learnability of shuffle ideals. Journal of Machine Learning Research, 14:1513–1531, 2013.

[4] K. Bache and M. Lichman. NSF research award abstracts 1990-2003 data set. UCI Machine Learning Repository, 2013.

[5] L. Bottou and C.-J. Lin. Support vector machine solvers. Large Scale Kernel Machines, pages 301–320, 2007.

[6] N. H. Bshouty. Exact learning of formulas in parallel. Machine Learning, 26(1):25–41, Jan. 1997.

[7] C. de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38(9):1332–1348, Sept. 2005.

[8] B.
Gnedenko and A. N. Kolmogorov. Limit distributions for sums of independent random variables. Addison-Wesley Series in Statistics, 1949.

[9] E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.

[10] I. Ibragimov. On the composition of unimodal distributions. Theory of Probability and Its Applications, 1(2):255–260, 1956.

[11] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, Nov. 1998.

[12] O. Klíma and L. Polák. Hierarchies of piecewise testable languages. Proceedings of the 12th International Conference on Developments in Language Theory, pages 479–490, 2008.

[13] L. A. Kontorovich, C. Cortes, and M. Mohri. Kernel methods for learning languages. Theoretical Computer Science, 405(3):223–236, Oct. 2008.

[14] L. A. Kontorovich and B. Nadler. Universal kernel-based learning with applications to regular languages. The Journal of Machine Learning Research, 10:1095–1129, June 2009.

[15] L. A. Kontorovich, D. Ron, and Y. Singer. A Markov model for the acquisition of morphological structure. Technical Report CMU-CS-03-147, 10, June 2003.

[16] K. Koskenniemi. Two-level model for morphological analysis. Proceedings of the Eighth International Joint Conference on Artificial Intelligence - Volume 2, pages 683–685, 1983.

[17] M. Lothaire. Combinatorics on Words (Encyclopedia of Mathematics and Its Applications - Vol 17). Addison-Wesley, 1983.

[18] M. Mohri. On some applications of finite-state automata theory to natural language processing. Journal of Natural Language Engineering, 2(1):61–80, Mar. 1996.

[19] M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, June 1997.

[20] M. Mohri, P. J. Moreno, and E. Weinstein.
Efficient and robust music identification with weighted finite-state transducers. IEEE Transactions on Audio, Speech, and Language Processing, 18(1):197–207, Jan. 2010.

[21] M. Mohri, F. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88, 2002.

[22] L. Pitt and M. K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM (JACM), 40(1):95–142, Jan. 1993.

[23] O. Rambow, S. Bangalore, T. Butt, A. Nasr, and R. Sproat. Creating a finite-state parser with application semantics. Proceedings of the 19th International Conference on Computational Linguistics - Volume 2, pages 1–5, 2002.

[24] I. Simon. Piecewise testable events. Proceedings of the 2nd GI Conference on Automata Theory and Formal Languages, pages 214–222, 1975.

[25] R. Sproat, W. Gale, C. Shih, and N. Chang. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377–404, Sept. 1996.

[26] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, Nov. 1984.