Suppose I have a set of sequences of length n with symbols from an alphabet A = {a1, a2, ..., an}.

A particular sequence may appear multiple times in the set.

I would like to know which sequences have frequencies that are statistically significant. That is, I would like to identify sequences that appear more often or less often than would be expected if the sequences had been generated at random.

To do this with a brute force method, I would determine the probability that the sequence would occur at random.

In particular, a random sequence is formed by selecting each letter from A = {a1, a2, ..., an} with probabilities {p1, p2, ..., pn}, respectively.

So with simulation, I would generate random sequences of length n, and then look at their frequencies.

Then I would look at the difference between the observed sequence frequency in my sample data and the expected "random" frequency to determine which sequences are significant.
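To make the simulation idea concrete, here is a minimal sketch of what I have in mind. It assumes letters are drawn independently with fixed probabilities, and the function names (`simulate_counts`, `empirical_p_values`), the example data, and the choice of 1000 simulations are all made up for illustration. For each observed sequence, it estimates how often a random set of the same size contains that sequence at least as often (or at most as often) as my data does.

```python
import random
from collections import Counter


def simulate_counts(letters, probs, seq_len, n_seqs, rng):
    """Draw n_seqs random sequences and count how often each one appears."""
    draws = (
        "".join(rng.choices(letters, weights=probs, k=seq_len))
        for _ in range(n_seqs)
    )
    return Counter(draws)


def empirical_p_values(observed, letters, probs, n_sims=1000, seed=0):
    """Monte Carlo p-values for over- and under-representation.

    Assumes every sequence has the same length and that letters are
    drawn independently (an assumption of this sketch).
    """
    rng = random.Random(seed)
    seq_len = len(observed[0])
    obs_counts = Counter(observed)
    at_least = Counter()  # simulations where count >= observed count
    at_most = Counter()   # simulations where count <= observed count
    for _ in range(n_sims):
        sim = simulate_counts(letters, probs, seq_len, len(observed), rng)
        for seq, count in obs_counts.items():
            if sim[seq] >= count:
                at_least[seq] += 1
            if sim[seq] <= count:
                at_most[seq] += 1
    # The +1 terms keep the estimated p-values away from exactly zero.
    return {
        seq: {
            "observed": count,
            "p_over": (at_least[seq] + 1) / (n_sims + 1),
            "p_under": (at_most[seq] + 1) / (n_sims + 1),
        }
        for seq, count in obs_counts.items()
    }


# Hypothetical data: 50 sequences of length 3 over the alphabet {a, b}.
observed = ["aaa"] * 30 + ["aba"] * 5 + ["bbb"] * 5 + ["bab"] * 10
result = empirical_p_values(observed, ["a", "b"], [0.5, 0.5])
for seq, stats in sorted(result.items()):
    print(seq, stats)
```

Since this produces one p-value per distinct sequence, I assume some correction for multiple comparisons would also be needed before calling any single sequence significant.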

I had two main questions:

1. In general, how should I determine the probabilities of letters {p1, p2, ..., pn}? Can I simply look at my set of sequences and count the frequency at which letters appear and divide this by the total number of letter appearances?
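The counting approach from question 1 would look roughly like this. It is only a sketch of the pooled relative-frequency estimate I described; the function name `estimate_letter_probs` and the toy data are hypothetical.

```python
from collections import Counter


def estimate_letter_probs(sequences):
    """Estimate each letter's probability as its pooled relative frequency."""
    counts = Counter(letter for seq in sequences for letter in seq)
    total = sum(counts.values())  # total number of letter appearances
    return {letter: count / total for letter, count in counts.items()}


# Toy data: "a" appears 5 times and "b" 3 times out of 8 letters.
sequences = ["ab", "ab", "ba", "aa"]
probs = estimate_letter_probs(sequences)
print(probs)
```

One thing I am unsure about is that these estimates come from the same data I then test, so heavily repeated sequences also influence the letter probabilities.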

2. Is there an analytic way to determine which sequence is the most statistically significant?

In general, I do know of chi-square tests, Anderson-Darling tests, and Kolmogorov-Smirnov (KS) tests. But these tests compare two distributions. I am mainly interested in identifying "the most statistically significant" sequence, not whether the distribution matches a random distribution or not.

I also know of the seminal paper by Karlin and Altschul titled "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes", but I'm not sure if it is applicable, as the paper seems to assume a scoring scheme.

Randomness tests just tell me whether the sequence is random or not, not whether its frequency is statistically significant.

Thank you.