- Link Types
- Data Gathering
- Bounding off the Network
- Known Issues
- A Request
This page discusses the Blogosphere 1998 data set and asks for help reviewing and completing it. There is a separate analysis of the data.
Link Types
The data set distinguishes between three types of links: endorsements, list links and attributions.
Endorsements are links in a weblog post that point to another weblog in the network. Weblog editors of the period often kept essay pages that were separate from their weblogs, e.g. Dave Winer’s Davenet, Steve Bogart’s Scribbles or Cam Barrett’s Rants; links to a network site’s non-weblog parts have been excluded from the data set.
List links are situated outside the stream of weblog postings, often in a sidebar. They include items from Chris Gulker’s early blogroll of October 1997 as well as more recent blogrolls, such as the links placed at the top of John Wilson’s Untitled Weblog or at the bottom of Steve Bogart’s News, Pointers & Commentary in May 1998. List links also include Jorn Barger’s “sources page,” of which no copy from 1998 is preserved; the earliest extant version dates from April 1999. A new list link is inferred to have entered the sources page whenever Barger credited a weblog for the first time.
An attribution is a credit for a “borrowed” link, given to the source of that link, often another weblog. Tentatively used first by Chris Gulker in 1997, credit links were introduced and popularised by Jorn Barger in February 1998.
An attribution need not involve an HTML anchor tag. Barger’s original style of link attributions did not involve a direct hyperlink to the source, but rather a non-hyperlinked citation key pointing to a “sources page”. This style was adopted and used throughout 1998 by Bogart and Whump. In May 1998 Raphael Carter introduced the more familiar credit link style that uses a direct hyperlink to the source site. In the data set, credit links with a direct hyperlink are marked with an asterisk.
Credit links, although by far the most numerous type throughout the data, are under-reported, as multiple credits per day, especially on Robot Wisdom, were counted as a single credit only. This policy may be reviewed for future revisions of the data set.
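The counting policy can be illustrated with a minimal sketch. It assumes that "a single credit" means one credit per crediting site, credited site and day, which is one possible reading of the policy. The record layout, the function name `count_credits` and the sample sites are hypothetical and not taken from the data set.

```python
from datetime import date


def count_credits(credits):
    """Count credits per (crediting site, credited site) pair.

    Assumption: several credits from one site to the same source on the
    same day collapse into a single credit.
    """
    # A set drops repeated (source, target, day) records.
    unique_days = {(src, tgt, day) for src, tgt, day in credits}
    counts = {}
    for src, tgt, _day in unique_days:
        counts[(src, tgt)] = counts.get((src, tgt), 0) + 1
    return counts


# Hypothetical sample records: (crediting site, credited site, date)
credits = [
    ("site_a", "site_b", date(1998, 5, 4)),
    ("site_a", "site_b", date(1998, 5, 4)),  # same day, counted once
    ("site_a", "site_b", date(1998, 5, 5)),
    ("site_b", "site_a", date(1998, 5, 4)),
]

counts = count_credits(credits)
```

In this sample, three raw credits from `site_a` to `site_b` are recorded as two, which shows how the policy under-reports busy days.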
Data Gathering
As the data is more than ten years old, its preservation is less than perfect but much better than might have been expected. Persistence, a variety of complementary search strategies, as well as the occasional serendipitous discovery, managed to unearth a fairly complete set of data from a range of locations. These locations include the Internet Archive and archives kept by the webloggers themselves.
Yahoo’s Site Explorer has proven of some assistance. It is more suitable than Google for locating inlinks, as, unlike Google, it will also list inlinks of URLs that have expired.
Bounding off the Network
The question of which sites were part of the emerging network in 1998 cannot directly be answered using a pre-existing list. Chris Gulker’s list stopped being updated in February 1998 and has the serious drawback of being limited to users of the News Page software only. The compilation that has been hailed as the “canonical list of early blogs,” Jesse Garrett’s Ye Olde Skool, was not published until some time in 1999. The original iteration of Cam Barrett’s blogroll in January 1999 is closer to the period under investigation, and its immediate predecessor, Michal Wallace’s blogroll of December 1998, is even a product of that period, but all of these lists share the limitation of being a single person’s incomplete and biased selection.
The data set tries to overcome this limitation by using a rule-governed selection process instead of arbitrary preference. The process adopted starts from the four historical lists, but treats them as seeds only: all sites on the lists were scanned, iteratively, for references to other sites. To be included in the data set, a site must then fulfil both of these rules:
- Have a family resemblance with weblogs, especially under the then-current construction of the term
- Have a minimum degree centrality of 2
The first condition is designed both to eliminate false positives and to allow for sites that aren’t mentioned on any of the seed lists.
The second condition is a safeguard against the busywork of accounting for an extensive but largely irrelevant periphery. In practice, the need for a degree centrality of 2 or higher means that a site has to be referenced by more than one known network site, or, in the absence of a second reference, needs to reference at least one network site by itself.
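The degree rule can be sketched as follows. This is an illustration only, assuming that degree centrality here means the raw sum of inbound and outbound ties between distinct sites, with repeated links between the same pair counted once. The function names and sample sites are hypothetical, and the family-resemblance rule is an editorial judgement that the sketch does not model.

```python
from collections import defaultdict


def degree_by_site(ties):
    """Return in-degree plus out-degree per site over distinct directed ties."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for source, target in set(ties):  # collapse repeated links between a pair
        if source == target:
            continue  # self-links do not count as ties
        out_deg[source] += 1
        in_deg[target] += 1
    sites = set(in_deg) | set(out_deg)
    return {site: in_deg[site] + out_deg[site] for site in sites}


def meets_degree_rule(ties, min_degree=2):
    """Return the sites whose degree is at least min_degree."""
    return {s for s, d in degree_by_site(ties).items() if d >= min_degree}


# Hypothetical ties: (linking site, linked site)
ties = [
    ("site_a", "site_b"),
    ("site_a", "site_b"),  # repeated link, counted once
    ("site_b", "site_a"),
    ("site_c", "site_a"),
    ("site_a", "site_d"),
]

degrees = degree_by_site(ties)
included = meets_degree_rule(ties)
```

Here `site_b` qualifies with one inbound and one outbound tie, while `site_c` and `site_d`, each with a single tie, fall into the periphery that the rule is meant to exclude.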
Known Issues
The most striking characteristic of the data set is its relative completeness: contrary to what might be expected, there isn’t all that much missing.
Here are the sites that are wholly or partly missing, in order of increasing degree centrality.
- Degree centrality: 2
In March 1998, Andy Edmonds invited Barger to expect “some nice crosslink action” from Psyberspace soon. One isolated weblog page has survived from around the same time. The bulk of Edmonds’ early weblog is, unfortunately, not retrievable now, as the current registrant of Psyberspace.net has the Internet Archive’s holdings blocked by a robots.txt query exclusion. Psyberspace has an overall inlink count of 4.
- Degree centrality: 2
Phil Suh’s news page archives from 1997 and 1998 have disappeared without a trace. The Internet Archive has no matches. The site’s overall inlink count of 4 does not suggest that the missing archives contain a large number of relevant outlinks.
- Degree centrality: 3
Andy Affleck (né Williams) states in his archives that the posts from May 1998 to July 2000 are missing due to a “hard drive incident and a lack of backup problem”. The plausible inference that more than two years’ worth of regular postings have been lost appears to be mistaken, however. This page shows that Affleck stopped updating in early May 1998, then commented on the lack of updates in late July 1998, and apparently still hadn’t resumed in February 1999, when the page was archived. In the absence of additional material by Affleck from 1998, it seems reasonable to assume that it never existed. Ragged Castle’s inlink count is 7.
- Degree centrality: 4
The Drudge Report never maintained any archives, but the Internet Archive has a few pages. The absence of links from Drudge into this network is unproven but strongly presumed on the basis of Matt Drudge’s documented unwillingness to be identified as a blogger. Drudge’s inlink count is 20.
- Degree centrality: 6
Jim Romenesko’s Obscure Store & Reading Room in its current iteration maintains archives only back to April 2005. The Internet Archive has pages only for 2, 5 and 12 Dec 1998. It is strongly presumed that, except for the list link to Robot Wisdom, Obscure Store did not link into this network. Obscure Store’s inlink count is 67.
- Degree centrality: 10
Michael Sippey started the Web zine Stating the Obvious in 1995, and in this network diagram presents his site in affiliation with other zines.
In May 1997, Sippey added a new feature to his site, the Obvious Filter, which he nested one directory into his site. The Obvious Filter adopted the news page model, a fact that Dave Winer immediately recognised and celebrated in a brief note. The Obvious Filter ran until mid-September 1997 and appears to be fully preserved by the Internet Archive in these instalments:
28 – 31 May 1997, 1 – 13 Jun 1997, 16 – 30 Jun 1997, 1 – 16 Jul 1997, 16 – 31 Jul 1997, 1 – 15 Aug 1997, 15 – 31 Aug 1997, 2 – 15 Sep 1997, 16 Sep – 13 Oct 1997
In mid-October, Sippey chose to shut down his site for a while and relaunched in late December, offering a new implementation of the “Filter” as a weekly feature under the name “Filtered for Purity.” The archives of that feature run until late April 1998, at which point, or at a point shortly after, the “Filtered for Purity” links moved to the site’s front page. In June 1998 they were able to drive, according to Rebecca Eisenberg’s testimony, a “huge amount” of traffic. The Filtered for Purity feature remained on Stating the Obvious until the end of the year but fell victim to Sippey’s new-year resolution for 1999, in which he foreswore “the self-induced stress of producing daily content, even if that content wasn’t really content at all, but merely meta-content – links to and smartass commentary on other people’s content.” The Internet Archive has preserved a sampling of Filtered for Purity in its waning days, but its contents from May to late October 1998 are unaccounted for. As the extant Filter material has an overall outlink count of only 4, the missing months are unlikely to contain a larger number of outlinks to the rest of the network. The Obvious’ inlink count is 13.
The materials that are currently missing typically belong to sites that are somewhat peripheral to the network, showing low degree centrality, low inlink count, or low outlink count in the parts that have been preserved. Their eventual loss would therefore not significantly detract from our understanding of this network. For the sake of completeness, however, it would be desirable to close as many of the gaps in the data as possible.
A Request
If you can help out with any of the gaps in the data as discussed above, have spotted any omissions, or have offline archives that complete the data, please do get in touch!