Probabilistic Lookup

Here’s a challenge for someone. Well, maybe it’s easy, but it’s challenging for me. For lack of a better name, I call it probabilistic lookup. Say you have a list of names, ordered by popularity (according to the U.S. Census Bureau):

Mary
Nancy
Pamela
Beatrice
Deborah
Enid
Lolita
Delta
Mayola
Concha

How can you randomly select a name such that names at the top of the list are more likely to be selected than names at the bottom of the list? In other words, if you randomly select 1,000 names from this list and then do a sorted frequency distribution of the selected names, they should (roughly) appear in the same order as the original list.

It needs to be generalizable to handle a much larger list (~10,000) . I’m not seeking statistical perfection, just a reasonably accurate way to give names higher on the list a greater chance of being selected, and names lower on the list a lower chance of being selected.

I’d prefer a formula solution, if possible.

Posted in Uncategorized

20 thoughts on “Probabilistic Lookup

  1. I can think of a number of ways.
    Here is a pretty simple one.
    Reverse the list order.
    Add a column
    Autonumber 1 through n+1 (assuming n names)
    In the next column use a formula
    The first cell is 1, the next one down = cell above + the autonumbered cell
    Drag all the way down to one beyond the last name.
    This is now your lookup table.

    Next use random() and multiply by the last value
    Use that as the input to your lookup table and return the name associated with that number.

    If you had 4 names the first name would have 1 chance in 10, the second 2 chances in ten etc…

    1 Bob
    3 Carol
    6 Ted
    10 Alice

  2. Hi John,

    How about this solution:

    Assign your name list a named range of Names.

    This formula will return the names at the top of the list more often:

    =INDEX(Names,INT(COUNTA(Names)-(SQRT(2*(RAND()*COUNTA(Names)*(COUNTA(Names)+1)/2)+0.25)-0.5))+1)

    The weights work like this:

    e.g. If there are 3 names in the list
    Name 1 is returned around 3/6 of the time
    Name 2 is returned around 2/6 of the time
    Name 3 is returned around 1/6 of the time

    e.g. If there are 4 names in the list
    Name 1 is returned around 4/10 of the time
    Name 2 is returned around 3/10 of the time
    Name 3 is returned around 2/10 of the time
    Name 4 is returned around 1/10 of the time

    etc.

    If you’re interested in the math, then I could explain it!

    Thanks,
    Shawn

  3. John – I’ve taken a stab at it. My solution requires some sort of frequency to be specified. Lacking one, I assumed the top item in a list of n items happens n times, the second item happens (n-1) times, and so forth. I suppose it’s possible some sort of megaformula could be made to handle it all in one calc, but I’d prefer not to try and wrap my brain around that…

    I’ve posted the file at my site so anybody can download and play with it: http://bill.simonifamily.net/2006/07/12/probabilistic-lookup-in-excel/

  4. Here’s a start. Assuming the names are in A1:A10, I assigned probabilities to each name in B1:B10 and then sorted both columns ascending based on column B. So I have:

    Delta 0.005
    Mayola 0.005
    Concha 0.005
    Lolita 0.01
    Enid 0.025
    Beatrice 0.05
    Deborah 0.05
    Pamela 0.1
    Nancy 0.25
    Mary 0.5

    I used a helper column (column C) by multiplying each probability by 10000 (cell G1) and then adding the values cumulatively down col. C. So C1 has: =B1*$G$1 and C2: =B2*$G$1+C1. Copy the formula in C2 down to C10. Now I have 3 columns that look like:

    Delta 0.00550
    Mayola 0.005100
    Concha 0.005150
    Lolita 0.01250
    Enid 0.025500
    Beatrice 0.051000
    Deborah 0.051500
    Pamela 0.12500
    Nancy 0.255000
    Mary 0.510000

    From here I used an INDEX/MATCH combo with RANDBETWEEN selecting a whole number between 0 and 9999:

    =INDEX(A1:A10,MATCH(RANDBETWEEN(0,G1-1),C1:C10,1)+1)

    Of course the biggest issue is that the numbers 0-49 will product #N/A, and the formula will never return “Delta.” That should be an easy workaround but I can’t find one at the moment. I’ll address later.

    I also picked “pretty” probabilities in this example.

  5. Hi

    Formula in Colum C =IF(RANDBETWEEN(1,26)>ROW()=TRUE,A2,0)
    Formula in Colum D =IF(C2=0,””,IF(AND(C20),”T”,”B”))
    This gives more random “T”‘s than “B”‘s

    Regards
    Sam

    A B C D
    SlNameTOP/BOTTOM
    1a1TTop1077%
    2b0Bottom323%
    3c3T13
    4d4T
    5e5T
    6f6T
    7g0
    8h8T
    9i9T
    10j0
    11k11T
    12l12T
    13m13T
    14n14B
    15o0
    16p0
    17q17B
    18r0
    19s0
    20t0
    21u21B
    22v0
    23w0
    24×0
    25y0
    26z0

  6. Hi John,

    If the list is in, say, cells A1 to A10, then you could make your selections with a formula like:

    =OFFSET($A$1,TRUNC(RAND()*RAND()*COUNTA($A$1:$A$10),0),0)

    Cheers,

    XLurker

  7. Six Steps:

    1. Copy list to A1:A10
    2. Sort list by smallest frequency of occurance to largest
    3. Enter 0 in Cell B1
    4. Enter =INT(AVERAGE(B1+2,(B1+2)*2)) in cell B2
    5. Copy formula down to Cell B11 (Yes…one row below the end the of list)
    6. Enter this formula anywhere =INDEX($A$1:$A$10,(MATCH(INT((RAND()*MAX($B:$B)+1)),$B$1:$B$11,1)))

    Copy formula down to 1000 rows, and you will get a distribution that should support the premise of your original list. The more rows you include, the closer you get to the correct distribution.
    The formula in step 4 could be tweaked for better results.

  8. Hi,

    here’s my try:

    =INDEX(INDIRECT(“A1:A”&COUNTA($A:$A)),COUNTA($A:$A)-INT(LOG10(10*(9/10*RAND()+0.1))*COUNTA($A:$A)))

    Assuming that your list of names starts at A1.

    -Asser

  9. First is to generate an index table:
    “Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.
    “Prob” is the cumulative probability of that name being returned. The top name will always be 1.000

    The formula in the Prob column is:
    = SUM(B2:$B$11)/SUM($B$2:$B$11)
    A B C
    1 Names Occurance Prob
    2 Mary 15.10% 1.000
    3 Nancy 14.90% 0.765
    4 Pamela 9.10% 0.533
    5 Beatrice 6.10% 0.392
    6 Deborah 5.60% 0.297
    7 Enid 4.80% 0.210
    8 Lolita 3.10% 0.135
    9 Delta 2.40% 0.087
    10 Mayola 1.90% 0.050
    11 Concha 1.30% 0.020

    To actually return names the formula is:
    =INDEX(Names,MATCH(RAND(),Prob,-1))

    Where:
    “Names” represents A$2:$A$11
    “Prob” represents C$2:$C$11

    Example Result for a block of 10000 selections
    FWIW formula to complie this was:
    =COUNTIF($J$1:$S$1000,A2) – to Give answer for “Mary”

    Mary      2337
    Nancy     2316
    Pamela    1325
    Beatrice   960
    Deborah    922
    Enid       783
    Lolita     520
    Delta      348
    Mayola     308
    Concha     181

    Mpemba

    Editor: Mpemba, I’ve been using your comment to play around with the formatting.

  10. Darn that formatting: it looked fine in the preview window (Courier) !

    I should add, if you want a working copy of the spreadsheet I’d gladly mail it.

    M

  11. Hi,

    If the list is in, say, cells A1 to A10, then how about making your selections with a formula like this:

    =OFFSET($A$1,TRUNC(RAND()^2*COUNTA($A$1:$A$10),0),0)

    The COUNTA is for scalability of the list. Set the power of 2 higher to skew the selection more to the top of the list, or between 0 and 1 to skew the selection to the bottom of the list.

    Cheers,

    XLurker

  12. Me again:
    In hindsight I should add that in post 4 my “Occurance” column does not have to be sorted in to ascending order. The order does not matter a jot using this method.

    I must also admit I found it surprising Mary was at the top of the example list. The 2005 data for the USA listed at:
    http://www.socialsecurity.gov/OACT/babynames/

    Top ten for the 2000s
    http://www.socialsecurity.gov/OACT/babynames/decades/names2000s.html
    Emily 149,420 1.2525%
    Madison 123,729 1.0372%
    Hannah 110,081 0.9228%
    Emma 106,428 0.8921%
    Ashley 91,644 0.7682%
    Abigail 89,848 0.7532%
    Alexis 89,512 0.7503%
    Olivia 88,971 0.7458%
    Samantha 88,669 0.7433%
    Sarah 85,747 0.7188%

  13. RNDBTWN? thght mst b mssng smthng, hng n tls > dd-ns > nlyss tlpk.

    It’s making a lot more sense now.

  14. Mpemba:
    You shouldn’t be surprised Mary was at the top of the list. If you go to the site you linked and check the “Popularity of a Name” on Female for the past 100 years using the name Mary, you’ll find it was a very popular name for a very long time.

  15. {=OFFSET($A$1,MATCH(RAND(),1-EXPONDIST(ROW(1:[n])-1,[tweakable constant],TRUE),-1),0)} … so entered as an array formula over the required output cells.

    A1 is one cell above the input list
    [n] is the upper bound for the number of input values + 1
    [tweakable constant] controls the decay in probabilities for the exponential function. 0.1 ‘works’ over 10,000 cells, 0.9 can be seen to ‘work’ for a shorter list. 0.1 is very ‘slow’ decay in probability.

  16. Statistical Perfection yields the best results. Sorry so late but I was helping someone with homework and came across this unsatisfactory answer. What you need is an “adjusted cumulative frequency” column to the left of the names. Luckily, the data includes the cumulative frequency:

    adj.cf name freq cum.freq(%) rank
    0 MARY 2.629 2.629 1
    .02629 PATRICIA 1.073 3.702 2
    .03702 LINDA 1.035 4.736 3
    .04736 BARBARA 0.980 5.716 4
    .05716 ELIZABETH 0.937 6.653 5

    =VLOOKUP(RAND(), $A$2:$B$6, 2)
    or if you use the actual data
    =VLOOKUP(RAND(), $A$2:$B$4275, 2)

    http://www.excelforum.com/showthread.php?t=628930
    shows similar question with Letters of alphabet

  17. Thanks to Dick and all the contributors of this site. I have picked up hundreds of tips here. Many thanks.

    I am trying to develop a function that will count the number of runs within a range with one run being defined as when the value encounters a sign change (i.e. from positive to negative or negative to positive). Consider the following range: starting in cell A1 and continuing down to cell A10 I have these numbers: 125, 200, 175, -50, -75, 135, -65, -225, -90, and 115. The first three values (125, 200, 175) are positive and are considered as one run (a winning streak). The next two values are negative (-50, -75) thus are considered another run (a losing streak). The next value (135) is positive and is considered another run. The next three values (-65, -225, -90) are negative and thus is considered another run. Finally, the last value in the range (115) is positive, thus is considered another run. Therefore, for this example range the total number of runs 5. I have developed functions to calculate the longest winning and losing streaks, but I am really hitting a wall developing a function to calculate the total number of runs within a range. Any suggestions on how to approach this problem would be greatly appreciated. Thanks.


Posting code? Use <pre> tags for VBA and <code> tags for inline.

Leave a Reply

Your email address will not be published.