Notes on Obsolete Content
Picture this: you join a new company. The team is excited to have you on board. Why wouldn’t they be? After all, they waited 6 months to find the right candidate, one who not only has the technical chops but is also a culture fit.
You finish the formalities and then finish a round of introductions over Zoom. A few faces look familiar. Of course, you did meet them during your onsite interviews. You try to recollect names. You fail. You move on.
Workstation. Check.
Access issues. Resolved. Thanks, Okta.
And then you start going through the documentation, which in most cases reads something like this:
Apollo team Dev onboarding
Check out the source code from VCS.
Go to the build folder and run the ant.xml file.
Wait a minute. Why would anyone still use ant build scripts and VCS? Then you check the last updated date on the Wiki and breathe a sigh of relief. Phew! It’s an old page that hasn’t been updated since 2011. Thank God.
Ok! I exaggerated a teeny tiny bit. But this is a fairly common scenario. It’s really hard to keep your team’s knowledge base up to date.
If it’s hard to keep things up to date for a 4-member team, imagine the plight of sites like Stack Overflow, Glassdoor, and Yelp, which are on a mission to be the go-to sites that users flock to for content in their respective verticals.
I am sure they have some content that’s obsolete or not relevant to the current day. I wanted to understand how sites deal with this.
I found a paper online titled ‘An Empirical Study of Obsolete Answers on Stack Overflow’.
[Edit]: I am so excited to learn that dealing with obsolete content is on Stack Overflow’s 2021 roadmap.
They also have a detailed Meta Stack Overflow post about the issue.
This post is a summary of the paper.
Problem statement
Stack Overflow accumulates an enormous amount of software engineering knowledge. With time, certain knowledge in answers may become obsolete. If obsolete answers are not identified, they may mislead answer seekers and cause unexpected problems. At the very least, they make for a bad user experience.
Data collection
i) The authors curated a collection of Stack Overflow answers that had one of the following keywords: “deprecated”, “out of date”, “outdated”, “obsolete”.
ii) The keywords in (i) must not appear in the question, since an obsolete question will likely have an obsolete answer.
In all, the authors collected 52,177 answer threads, which include 58,201 comments that mention obsolescence. Replication package: https://github.com/SAILResearch/replication-obsolete_answers_SO
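To make the filter concrete, here is a minimal Python sketch of the two rules above. It is not the authors’ actual pipeline: the function names, the case-insensitive substring matching, and the idea of passing the answer body and its comments as a list of strings are all my assumptions.

```python
import re
from typing import Iterable

# The four keywords quoted in the post. How they are matched is an assumption.
OBSOLESCENCE_KEYWORDS = ("deprecated", "out of date", "outdated", "obsolete")

# Case-insensitive substring match, so "Obsoleted" also counts as a hit.
_KEYWORD_RE = re.compile(
    "|".join(re.escape(keyword) for keyword in OBSOLESCENCE_KEYWORDS),
    re.IGNORECASE,
)


def mentions_obsolescence(text: str) -> bool:
    """Return True if the text contains any of the obsolescence keywords."""
    return _KEYWORD_RE.search(text) is not None


def is_candidate(question_text: str, answer_texts: Iterable[str]) -> bool:
    """Apply rules (i) and (ii) to one answer thread.

    answer_texts is a hypothetical input: the answer body plus its comments.
    """
    # Rule (ii): skip threads whose question already mentions obsolescence.
    if mentions_obsolescence(question_text):
        return False
    # Rule (i): keep the answer if any associated text has a keyword.
    return any(mentions_obsolescence(text) for text in answer_texts)
```

A keyword match only marks a candidate. The post later mentions false positives, so a real study would still need manual review.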
Case Study
What are the effects when an answer has been found obsolete?
i) More than half of the studied obsolete answers were probably already obsolete when they were posted.
Fig. 4 in the paper presents the time gap between the answer creation time and the time at which the obsolescence was noted.
ii) More than half of the users do not update their answers or add new answers after their answers are noted as obsolete.
It takes 227 days on average for users to provide the first update for an obsolete answer after the obsolescence is observed in a comment, while it takes 198 days on average to add the first new answer after the obsolescence is observed.
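For anyone who wants to compute a similar gap on their own data, here is a rough Python sketch of the "days until first update" measurement. The timestamps are made up for illustration and the function name is hypothetical. The paper’s exact definition of an update may differ.

```python
from datetime import datetime
from statistics import mean
from typing import Optional, Sequence


def days_to_first_update(
    flagged_at: datetime, edits: Sequence[datetime]
) -> Optional[int]:
    """Whole days from the obsolescence comment to the first later edit.

    Returns None when the answer was never edited after being flagged.
    """
    later_edits = [edit for edit in edits if edit > flagged_at]
    if not later_edits:
        return None
    return (min(later_edits) - flagged_at).days


# Made-up threads: (time of the obsolescence comment, edit times of the answer).
threads = [
    (datetime(2015, 1, 10), [datetime(2014, 12, 1), datetime(2015, 3, 11)]),
    (datetime(2016, 6, 1), []),  # flagged but never updated
    (datetime(2017, 2, 1), [datetime(2017, 2, 11)]),
]

gaps = [days_to_first_update(flagged, edits) for flagged, edits in threads]
# Average only over answers that were actually updated.
average_gap = mean(gap for gap in gaps if gap is not None)
```

Note that the average covers only the answers that did get updated. Answers that were never touched again, which is more than half according to the paper, have to be counted separately.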
Is there a correlation between the tags and obsolescence?
Answers that are related to certain tags (e.g., node.js, ajax, android, and objective-c) are more likely to become obsolete. Fair enough: a field that’s growing fast might see a high churn rate when it comes to APIs and new libraries.
For instance, Android has released 16 major versions and 28 levels of API from September 2008 to Aug 2018 and there are, on average, 115 API updates per month.
What are the potential reasons for answers to become obsolete?
The paper lists a variety of reasons why answers become obsolete.
31.7% of the studied answers (after removing false positives) became obsolete due to the evolution of their associated third-party libraries.
30.9% of the studied answers became obsolete due to the evolution of their programming languages.
12.9% of the studied obsolete answers are due to outdated tools, and 27.9% of these outdated tools are related to IDEs.
Who observes obsolete answers and what evidence do these observers provide?
Obsolete answers are more often flagged by outsiders (38.2%) than by askers (20.5%) or answerers (24.3%).
The majority (78.6%) of the obsolescence observations are supported with evidence (e.g., updated information, version information, or a reference).
Proposed Solutions
Suggestions for Stackoverflow
An automated tool could be built to identify existing obsolete answers on Stack Overflow or help answerers identify obsolete answers in real-time during answer creation.
An automated mechanism to detect obsolete references is needed. Of the 5.5 million links, 11.9% were inaccessible.
The heuristic-based approach for identifying obsolete answers using comments has an accuracy of 75%. Future work could improve the accuracy of this approach using machine learning techniques (e.g., classification).
Develop mechanisms to encourage users (especially question thread insiders) to pay more attention to the obsolescence of answers (their own or others’) and make efforts to maintain any obsolete answers.
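The suggestion about detecting obsolete references is the easiest one to picture in code. Below is a minimal Python sketch of a dead-link check, using only the standard library. It is an illustration under my own assumptions, not the mechanism the paper proposes: the URL pattern, the use of HEAD requests, and treating any status of 400 or above as inaccessible are all choices I made for the example.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, List, Optional

# Rough URL pattern (an assumption); real answers may need a proper parser.
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if it cannot be reached."""
    # HEAD avoids downloading the body; some servers reject it, which this
    # sketch would wrongly count as inaccessible.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, bad URL


def inaccessible_links(
    text: str, fetch_status: Callable[[str], Optional[int]] = http_status
) -> List[str]:
    """Return the links in text that fail to load."""
    dead = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        status = fetch_status(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

The fetch_status parameter is there so the check can be tried without touching the network, for example by passing a function that looks up statuses in a dictionary.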
Suggestions for Users
Answerers are encouraged to include relevant information about the valid version or the time of their knowledge when creating answers.
Answer seekers are encouraged to carefully go through the comments that are associated with answers in case these answers have become obsolete, especially for answers that are related to web and mobile development, such as node.js, ajax, android, and objective-c.
The paper ends on this note.
Image courtesy: asicentral.com