Once a new item is created automated systems
Posted: Wed Jul 09, 2025 5:59 am
The mission of the Internet Archive is “Universal Access to All Knowledge.” The knowledge we archive is represented as digital data. We often get questions related to “How much data does Internet Archive actually keep?” and “How do you store and preserve that knowledge?”
All content uploaded to the Archive is stored in “Items.” Like items in traditional libraries, Internet Archive items are structured to contain a single book, movie, or music album: generally a single piece of knowledge that can be meaningfully cataloged and retrieved, along with descriptive information (metadata) that usually includes the title, creator (author), and other curatorial information about the content. From a technical standpoint, items are stored in a well-defined structure within Linux directories.
Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected hard drive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons, and if one storage location becomes unavailable, the content remains available from the alternate storage location.
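To make the mirroring idea concrete, here is a minimal sketch in Python. It is not the Archive's actual software: the `StorageLocation` class and the `replicate` and `fetch` functions are hypothetical names invented for illustration, and the in-memory dictionaries stand in for real disks. It only shows the behavior described above, where an item is copied to a primary and a secondary location and a read falls back to the other copy when one location is unavailable.

```python
from dataclasses import dataclass, field


@dataclass
class StorageLocation:
    """Hypothetical stand-in for one disk drive in one server."""
    name: str
    available: bool = True
    files: dict = field(default_factory=dict)  # filename -> bytes

    def read(self, filename: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.files[filename]


def replicate(item_files: dict, primary: StorageLocation,
              secondary: StorageLocation) -> None:
    """Copy every file of a new item to both storage locations."""
    for location in (primary, secondary):
        location.files.update(item_files)


def fetch(filename: str, primary: StorageLocation,
          secondary: StorageLocation) -> bytes:
    """Serve a file from whichever location is currently reachable."""
    for location in (primary, secondary):
        try:
            return location.read(filename)
        except ConnectionError:
            continue  # try the alternate copy
    raise ConnectionError("both storage locations are unavailable")


primary = StorageLocation("primary")
secondary = StorageLocation("secondary")
replicate({"book.pdf": b"scanned pages"}, primary, secondary)

primary.available = False  # simulate a failed drive or server
content = fetch("book.pdf", primary, secondary)
```

In this sketch the read simply tries the primary first. Since both copies can serve patrons, a real system might instead spread reads across them, but nothing above should be taken as a description of how that choice is actually made.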
We refer to this overall scheme as “paired storage.” Because of the dual-storage arrangement, when we talk about “how much” data we store, we usually refer to what really matters to the patrons: the amount of unique compressed content in storage, that is, the amount prior to replication into paired storage. So for the numbers below, the amount of physical disk space (“raw” storage) is typically twice the amount stated.
As we have pursued our mission, the need for storing data has grown. In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).
Currently, Internet Archive hosts about 20,000 individual disk drives. These are housed in specialized computers (we call them “datanodes”) that have 36 data drives (plus two operating system drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives) and interconnected via high-speed Ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same. This is because of improvements in disk drive technology. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.
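The capacity figures above can be checked with a little arithmetic. The short Python sketch below reproduces them. The constant and function names are made up for illustration, and the 0.9 formatted fraction is an assumption inferred from the two stated pairs (64.8 of 72 and 259.2 of 288), not a published figure.

```python
DRIVES_PER_NODE = 36       # data drives per datanode (from the text)
NODES_PER_RACK = 10        # datanodes per rack (from the text)
FORMATTED_FRACTION = 0.9   # assumption: inferred from 64.8/72 and 259.2/288


def node_capacity_tb(drive_tb: int) -> int:
    """Unformatted capacity of one datanode, in terabytes."""
    return DRIVES_PER_NODE * drive_tb


def formatted_capacity_tb(drive_tb: int) -> float:
    """Usable capacity of one datanode after formatting, in terabytes."""
    return round(node_capacity_tb(drive_tb) * FORMATTED_FRACTION, 1)


def raw_storage_pb(unique_pb: float, copies: int = 2) -> float:
    """Physical disk space needed when each item is kept in `copies` places."""
    return unique_pb * copies


drives_per_rack = DRIVES_PER_NODE * NODES_PER_RACK

# Per-node capacity for each drive size mentioned in the text.
for drive_tb in (2, 3, 4, 8):
    print(f"{drive_tb}T drives: {node_capacity_tb(drive_tb)} TB per node, "
          f"{formatted_capacity_tb(drive_tb)} TB formatted")

print(f"{drives_per_rack} data drives per rack")
print(f"30 PB unique content needs about {raw_storage_pb(30)} PB of raw disk")
```

Moving from 2T to 8T drives quadruples what a single datanode holds, which is how content can triple while the drive count stays roughly flat.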