Searching for Expertise in Social Networks: A Simulation of Potential Strategies. Zhang, Ackerman. ACM SIGGROUP 2005.

  1. Examines three algorithms that search for expertise / some piece of information
  2. Based on a simulation that comes from the Enron email data set
  3. Paper not graph theoretic focused, but about social characteristics
  4. Plenty of relevant citations
  5. “Recently, mathematical models have been proposed to explain why these simple heuristics are good at forming short paths.  These models assume that the social network usually has a structure, in which individuals are grouped together by occupation, location, interest, and so on.  As well, these groups are grouped together into bigger groups and so forth.  The difference in people’s group identities defines their social distance.  By choosing individuals who have the shortest social distance to the target at each step, people can gradually reach the target in a short path with only local information about their own immediate acquaintances.”
  6. One proposal for finding experts is, given a feature vector describing an expert, to forward the query to the neighbor with the most similar feature vector, with a bonus given to individuals who proved more successful at moving recommendations along
    1. The problem is that, in real life, how would one come up with such feature vectors? (The paper dealt with a simulation.)
    2. Also, their algorithm wasn’t compared to anything else
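The feature-vector proposal above can be sketched as a greedy routing loop. This is an illustrative reconstruction, not the paper's code; the graph, the profiles, and the use of Hamming distance as the similarity measure are all invented for the example.

```python
# Hypothetical sketch of similarity-based expertise routing: each person has a
# binary feature vector, and a query is greedily forwarded to the neighbor
# whose vector is closest to the target expert's. All data here is invented.

def hamming_distance(a, b):
    """Number of positions where two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def route_query(graph, profiles, start, target_profile, max_hops=10):
    """Greedily forward the query to the neighbor most similar to the target."""
    path = [start]
    current = start
    for _ in range(max_hops):
        neighbors = graph[current]
        if not neighbors:
            break
        # Pick the neighbor whose profile is closest to the target's.
        current = min(neighbors,
                      key=lambda n: hamming_distance(profiles[n], target_profile))
        path.append(current)
        if profiles[current] == target_profile:
            break
    return path

graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
profiles = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}
print(route_query(graph, profiles, "a", (1, 1)))  # ['a', 'b', 'd']
```

Note that the routing uses only local information: each step looks at the current node's neighbors, never the whole graph.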
  7. Another algorithm basically tries to send the message to the most connected neighbor, but was acknowledged not to always work well
  8. Another option is breadth-first search, although in practice this is probably too expensive to use
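The breadth-first option can be sketched as flooding the query outward level by level until it reaches someone who can answer it. The graph and the `is_expert` predicate below are invented for illustration; the cost the note mentions shows up as the number of nodes visited.

```python
# A minimal BFS sketch: expand outward from the asker until an expert is
# found. Finds a shortest chain by construction, but visits many people.
from collections import deque

def bfs_search(graph, start, is_expert):
    """Return the shortest chain from the asker to any expert, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if is_expert(person):
            return path
        for neighbor in graph.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = {"asker": ["x", "y"], "x": ["expert"], "y": [], "expert": []}
print(bfs_search(graph, "asker", lambda p: p == "expert"))
# → ['asker', 'x', 'expert']
```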
  9. This paper compares these approaches, and does so on social data
  10. Characteristics of social networks:
    1. Connections aren’t uniformly distributed; connections in social graphs are meaningful and vary significantly
    2. Connections can have different strengths, and may not be symmetrical (in particular, weak ties can be important in getting new information and adopting new ideas, they were also found to be important in the small world experiment)
  11. The study here is based on a simulation, not a live experiment, which is acknowledged as unusual for social graph studies
    1. Weren’t allowed access to actual company emails
    2. A controlled participant study would be too small, and a good sized one would be too uncontrolled
  12. In an attempt to keep the simulation realistic, it is based on Enron email data set
    1. Has data from 147 employees, and ~half a million messages
  13. For this experiment, a sub-sample of ~30k messages among the 147 employees was examined
  14. The degree distribution of the network is highly skewed
  15. This email data set deals only with management of the company, which was probably more tightly connected than the full corporation was
  16. They also created a second network which removed weak edges and high degree nodes in an effort to simulate a network more representative of the whole corporation
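The pruning step described above can be sketched as a two-stage filter on a weighted edge list. The graph, weights, and thresholds below are all invented; the paper's actual cutoffs are not reproduced here.

```python
# Illustrative sketch of building the second network: drop weak edges
# (below a weight threshold), then remove the top-k highest-degree nodes.
from collections import Counter

def prune(edge_weights, min_weight, drop_top_k):
    """Remove weak edges, then edges touching the top-k highest-degree nodes."""
    kept = {e: w for e, w in edge_weights.items() if w >= min_weight}
    degree = Counter()
    for u, v in kept:
        degree[u] += 1
        degree[v] += 1
    hubs = {n for n, _ in degree.most_common(drop_top_k)}
    return {(u, v): w for (u, v), w in kept.items()
            if u not in hubs and v not in hubs}

edges = {("a", "b"): 5, ("a", "c"): 3, ("a", "d"): 4, ("c", "d"): 3}
print(prune(edges, 2, 1))  # {('c', 'd'): 3} — hub "a" removed
```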
  17. They also acknowledge a potential issue with how they recognize expertise, which is the existence of a keyword in an email sent to someone
  18. Also try a few other algorithms:
    1. One that forwards the message to the neighbor who has the strongest (or weakest) tie
    2. One that sends the message to a friend that is the most dissimilar to the individual
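The alternative forwarding rules above can be sketched as three neighbor-selection functions. Everything here (edge weights, profiles) is invented; only the selection criteria come from the notes.

```python
# Hedged sketch of the alternative forwarding rules: pick a neighbor by
# tie strength (strongest or weakest) or by dissimilarity from oneself.

def strongest_tie(weights, current, neighbors):
    """Forward along the heaviest edge."""
    return max(neighbors, key=lambda n: weights[(current, n)])

def weakest_tie(weights, current, neighbors):
    """Forward along the lightest edge (exploiting weak ties)."""
    return min(neighbors, key=lambda n: weights[(current, n)])

def most_dissimilar(profiles, current, neighbors):
    """Forward to the neighbor whose profile differs most from one's own."""
    me = profiles[current]
    return max(neighbors,
               key=lambda n: sum(x != y for x, y in zip(profiles[n], me)))

weights = {("a", "b"): 5, ("a", "c"): 1}
profiles = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
print(strongest_tie(weights, "a", ["b", "c"]))     # b
print(weakest_tie(weights, "a", ["b", "c"]))       # c
print(most_dissimilar(profiles, "a", ["b", "c"]))  # c
```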
  19. In the experiment they manually generated 147 questions based on actual email text of the users
    1. At each round, a question and an asker are selected at random
  20. Each individual is scored based on a metric estimating their ability to answer a question.  If a question reaches someone with a score above a certain threshold for that question, it is considered answered
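The answering criterion above reduces to a per-question threshold test. The scores and thresholds below are hypothetical; the notes don't specify how the paper's scoring metric is computed.

```python
# Minimal sketch of the answering criterion: a query counts as answered
# once it reaches a person whose estimated expertise score for that
# question exceeds the question's threshold. All values are invented.

def is_answered(scores, thresholds, person, question):
    """True if this person's score for the question clears its threshold."""
    return scores.get((person, question), 0.0) >= thresholds[question]

scores = {("alice", "q1"): 0.9, ("bob", "q1"): 0.4}
thresholds = {"q1": 0.7}
print(is_answered(scores, thresholds, "alice", "q1"))  # True
print(is_answered(scores, thresholds, "bob", "q1"))    # False
```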
  21. Assume each individual has access to the profiles of each neighbor
  22. Aside from classical computational costs, other important costs are how many individuals are contacted (bothered), how long the resulting chain was, and total distribution of labor in all queries (how often individuals are reused during different queries – presumably people don’t want to spend their entire workday forwarding email)
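The three cost metrics above can be computed directly from a list of query chains (each chain being the sequence of people a query visited). The chains below are invented examples.

```python
# Sketch of the cost metrics: people contacted per query, chain length
# per query, and the distribution of labor across all queries.
from collections import Counter

def cost_metrics(chains):
    contacted = [len(set(chain)) for chain in chains]     # people bothered per query
    lengths = [len(chain) - 1 for chain in chains]        # hops in each chain
    load = Counter(p for chain in chains for p in chain)  # labor distribution
    return contacted, lengths, load

chains = [["a", "b", "d"], ["c", "b", "d"]]
contacted, lengths, load = cost_metrics(chains)
print(contacted)   # [3, 3]
print(lengths)     # [2, 2]
print(load["b"])   # 2 — "b" handled both queries
```

A skewed `load` counter is exactly the "few well-connected people forward everything" problem the notes raise later.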
  23. The success rates of all algorithms are in the high 90%s
  24. BFS can find the shortest paths (by construction), some other methods make much longer paths
  25. On the other hand, BFS contacts many people.  The methods based on Hamming distance (difference from self) and on overall connectedness contact the fewest, and are highly effective
  26. Methods that generate trajectories (unlike BFS) generally find their target in one rollout
  27. On the other hand, Hamming distance and connectedness metrics route many messages through few people (the well connected ones).  BFS spreads messages out well
  28. The number of messages passed by people with high out-degree is large even when the algorithms don’t explicitly depend on out-degree.  <Probably this is because their in-degree is also high, so they are easy to reach>
  29. <The pdf is weird – I have references, then some more stuff, along with the summary (almost like an appendix, but it’s not).  Don’t know what’s going on.  Pages go 7, 10, 9, 10, with no 8>
  30. Results on pruned graph shows that removal of weak links has a significant impact on transfer
  31. When the top-10 connected individuals are removed, success rates for some algorithms go from about 95% to 75%.  Not surprisingly, BFS is the most robust to this