DigiPres Machines

Early this summer, the department learned that a development proposal we had submitted to improve MIAP’s computer hardware had been successfully funded. This was excellent news – our video digitization stations were not the only pieces of equipment that had started lagging and creating choke points around archival projects. But besides breathing new life into some of our existing workflows, this project also presented an exciting new opportunity.

For the past couple of years, we’d had three contemporary (~2013) MacBook Pro laptops available in our digital forensics lab. These were intended less for forensics, really, and more as a general-purpose resource – for MIAP students to use in any course if, say, they forgot their own laptop at home, or their own laptop was having difficulty during a hands-on lab exercise. They were, I think, quite a success, and they continue to be used on the regular. So I ultimately viewed these three computers as a sort of pilot program, justifying a larger fleet of similar laptops that could potentially serve an entire class of 10 or 11 students at the same time (our usual cohort size) rather than just three people.


So, in addition to the MacBook Pros already there, we had the chance to add eight more laptops to this supply. Perhaps even more excitingly (for me, anyway), I had the chance to design, from the ground up, what a “MIAP laptop” should look like – a laptop that, from startup, would be of use to our students in any even vaguely digital-related course… which, yes, is all of them, but especially: Metadata, Digital Literacy, Handling Complex Media, or Digital Preservation. They might also be used for workshops, like my own Talking Tech series, or for other invited guest speakers. What would the foundation for all this look like? What programs would you install to educate and demonstrate a wide variety of digital preservation and administration tasks – disk imaging, metadata extraction, metadata manipulation and transformation, digital forensics, file transfer and packaging, fixity management, on and on – while minimizing the time spent installing and managing software later on?

What follows is a detailed break-down of each component of these machines: hardware, operating system, and software (both graphical and command line), along with, where prudent, what each program does and why it was selected. When applicable I’ll link out so you can download or investigate each component for yourself and your own needs (probably less broad than mine).

Thanks to all those who offered feedback to that initial request, or have inspired these choices via many other professional tasks and avenues.

The Hardware


  • 13” MacBook Air, mid-2016
  • 2.2 GHz i7 Intel Core processor
  • 8 GB RAM
  • 512 GB SSD storage

Why:

Apple hardware has proven durable – the MIAP media lab houses a fleet of legacy/leftover Mac laptops and desktops, stretching back to PowerBooks and a Macintosh SE, that are (knock on wood) largely still functional. Anecdotally, the majority of our students have come into the program as Mac users, and are therefore already familiar with using Apple hardware.

Also, MIAP’s digital preservation/literacy instruction (like the fields of digital preservation and open source software development at large) currently favors Unix environments (i.e. command line interfacing via the Bash shell and GNU/Linux and BSD utilities). While the rapid development of the Windows Subsystem for Linux in Windows 10 offers tantalizing opportunities for (cheaper) crossover Unix/Windows education in the near future, lingering quirks in the WSL give Apple hardware the “it just works” advantage when it comes to teaching new students digipres basics. (Pure Linux machines were not considered, on the assumption that students would be unfamiliar and thus uncomfortable with such platforms, and that our graduates, who are not necessarily going into systems administration, are more likely to encounter Mac and Windows machines in their archive/library/museum work environments.)

The MacBook Air line was chosen over the vanilla MacBook or MacBook Pro flavors of Apple laptops because of connectivity considerations – that is, available data ports. Given that the 2016 (and onward) MacBook and MacBook Pro lines have only USB Type-C ports, connecting to peripherals, external drives, and video projectors already available in the department would have required a host of new cables and adapters for data transfers and video output – as well as the need to track the location and use of said adapters. Thus, given its USB (Type-A) and Thunderbolt (Mini DisplayPort) options, the MacBook Air was considered the most flexible in terms of backwards compatibility and the most efficient in terms of maintaining current department equipment.

The lack of a built-in optical drive in the MacBook Air line is something of a drawback in terms of potentially demonstrating optical disc imaging. However, built-in optical drives were discontinued on all readily-available Apple laptops several years back, in any case. A handful of USB-attached Apple SuperDrives (CD/DVD capable) were also purchased to balance this potential shortcoming.

Intel i7 processors and at least 8 GB of RAM guarantee the ability to run virtual machines or emulation software at a comfortable processing speed (e.g. the minimum recommended requirements for demonstrating a BitCurator VM).

The higher cost of 512 GB of SSD storage (as opposed to the smaller 256 GB model) was considered worth the investment for the ability to move/manage/analyze larger collections and files locally without fear of running out of storage space. In either case, the SSDs will require close management to avoid clutter – storage should be inspected and evaluated at the end of each semester (files to be kept moved to the department’s backup storage, excess/test files removed).

 

The Operating System


  • Mac OSX 10.11 (El Capitan)

Why:

Choosing a Mac OS was a trickier consideration. As of summer 2017, Apple only continues to roll out security updates for OSX 10.10.5 (Yosemite) and above, which immediately removes any earlier OS version from consideration – exposing students who may be using these machines for tasks as mundane as web browsing to known OS security flaws is unacceptable. And generally speaking, it is the position of this writer that operating systems on professional workstations should be kept as up-to-date as possible. Beyond the security hazards of running older operating systems on a network (laid bare by the global WannaCry and NotPetya ransomware crises of summer 2017 alone), it anecdotally seems to me that archivists and librarians tend to hold on to old systems in order to continue using older, favored software that may or may not encounter issues with newer OS versions. I feel this approach simply ignores the underlying issue of software rot rather than facing it head-on. Rather than allowing workflows to become increasingly unsustainable to the point that they completely break and require redesign around entirely new systems, regular updating identifies problems in software/environment communication *before* they become full-stop crises, and allows workflows to be regularly tweaked and improved in an ongoing process (particularly when dealing with open-source software, where the issue can be communally raised and addressed) rather than totally reinvented.

That said – such regular updating and testing has identified some persistent issues with macOS 10.12 (Sierra) that make OSX 10.11 (El Capitan) more appealing for the moment. The OS updates in Sierra were largely based around consumer-driven features: the introduction of Siri and Apple Pay to macOS, picture-in-picture mode and expanded tabbing in third-party software, improved iCloud sync, etc. These features have made macOS Sierra a much more demanding OS, in terms of processing power, at no perceived benefit to digital archivists or our students. Other new “features” – such as “improved” download validation that can quarantine archived applications (downloaded in .zip, .tar files, etc.) without notifying the user – have proven actively problematic.

Continued rollout of Apple’s new file system (APFS), compatible only with Sierra and above, will be worth keeping an eye on, especially when 10.13 (High Sierra) arrives in fall 2017. But as long as security updates are maintained, the stability of El Capitan (plus increased functionality over Yosemite) makes it, for my money, the preferable option right now for a digipres Mac machine.

Since all new Apple machines come pre-installed with the latest OS (Sierra), this decision required actively downgrading the laptops. Doing this on a brand-new machine, where there is no need to back up user data, is luckily a relatively simple process (and, with solid-state drives, fast), with plenty of thorough instructions available on the internet.

The Software

Graphical User Interfaces (GUIs)

General Operation:


Open source web browser preferred. Given that these laptops were made available to the department of Cinema Studies and the MIAP program specifically under the idea that students should not face any technological barriers to their education in this program, I believe we have a responsibility not to ask students to trade their privacy for that privilege. While we do not control the privacy policies for the use or administration of NYU’s network or NYU’s Google Apps for Education, we do control this hardware, and – as is the case in public libraries – have a responsibility to at least maintain privacy at the point of physical access (i.e. multiple students potentially using the same device) and to limit outside tracking of our students to the best of our ability and advocacy.

Thus, our Firefox browsers (set as the OS default browser) come pre-installed with several open-source extensions and preferences designed to enhance web security, limit third-party tracking, and automatically clear personal data like browsing history and passwords.

Hidden scripts, activated at regular intervals by an admin account (after any overnight/project checkout, or once a month, whichever comes first), also automatically and securely delete files from the most common user directories: Downloads, Documents, Desktop, etc. (The AppleScripts were borrowed/tweaked from the staff at Bobst Library’s Computer Center – thanks, guys!)
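For anyone curious, the core of that cleanup boils down to something like the following – a minimal shell sketch of the same idea, not the actual Bobst AppleScripts, and the shared-account paths here are assumptions:

# illustrative only – clears the common user directories of a shared student account
$ rm -rf ~/Downloads/* ~/Documents/* ~/Desktop/*

(The real scripts also handle the secure deletion and run unattended from the admin account.)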


Open source software preferred. A flexible but very light source code editor, and a good introduction for new users to simple text/markdown/HTML/XML editing.


Open source, flexible hex code viewing/editing.


Robust, proprietary software suite for the creation and manipulation of XML-based data. Used for instruction and homework in Metadata course. Made available via academic licensing.


Open source software preferred. Maintains local compatibility with Google Docs, Sheets, Slides (used by many students and courses via NYU’s Google Apps for Education) for basic word processing, spreadsheets, presentations, etc.


Powerful and flexible program for database design, data management, and cataloging. Used for instruction and a major assignment in Metadata course. Proprietary, licensed; Filemaker’s move to cloud-based download and licensing structure, rather than via physical media (install disc) makes our usual model of temporarily lending students licensed copies of the software for their own computers increasingly untenable to manage. Investigation into a suitable, long-term open-source alternative for the Filemaker assignment is ongoing.


Software for designing, deploying and manipulating MySQL databases. Standard Editions of MySQL-branded software (Server, Workbench, etc.) have been closed/proprietary since being bought by Sun Microsystems (now Oracle), but the foundational software (lacking some features/modules) is currently made available via open-source “community editions” buried on Oracle’s site. Potential alternative to use for the Filemaker assignment in Metadata course, but for now at least offered for students to explore.


Open source software preferred. Flexible media playback software that maintains compatibility with most consumer and even many preservation-grade video formats and codecs for QC and access.


Open source alternative to built-in Mac “Archive Utility” app for opening zipped/archived/compressed files – and more powerful (can extract from a much wider variety of archive formats, e.g. .rar or .7z, and can better handle non-English characters).


Open source GUI for the rsync Unix command-line utility. The Mac port of Grsync hasn’t been actively developed since 2013, but Grsync is a stable and intuitive cross-platform interface that makes crafting rsync scripts easier for beginner or command-line shy users.


FUSE for OSX is an open source file system compatibility layer that allows for read/write mounting of third-party file systems in OSX – combined with the NTFS-3G driver (installed via Homebrew), allows NTFS (Windows)-formatted disks to be mounted with both read AND write capability on a Mac (out of the box, OSX will usually mount and read/display files on NTFS disks but can not write/change data on them).


Necessary Java libraries for running some of the Java-based GUI tools listed below.

 

Metadata Characterization/Extraction:

Open source software that extracts and displays technical metadata and text from a huge range of file types (critically, including non-AV material like PDFs, PPTs, DOCs, etc. etc.)

Open source tool for extracting metadata and creating reports on the files contained within disk images. This is a GUI based on top of the brunnhilde.py command line tool; may also be installed within the BitCurator virtual machine (see below). [requires some command line parts to be installed first – read through the instructions in the link]

Free and open source software for batch identification of file formats, using the PRONOM technical registry.


Open source software that displays basic technical metadata embedded in a wide variety of video and audio codecs.

 

Transcoding:


Open source video transcoding software. Perhaps most commonly used for ripping DVDs but also useful for creating web or disc-ready access copies of video files.


Alternative free (but not open) video conversion and editing software. Transcodes and demuxes a variety of formats (not, despite its name, just MPEGs), with some more sophisticated clipping/editing options than Handbrake.

 

Virtualization + Forensics:


VirtualBox, by Oracle, is a free and open source virtualization platform, allowing a variety of contemporary and legacy guest operating systems to be run safely on a modern host. In this instance, VirtualBox is used to run the BitCurator Linux distribution (derived from Ubuntu) – in essence, a comprehensive suite of forensics and data analysis software aimed at processing born-digital materials.

The full list of tools found in the BitCurator environment can be found on their wiki. In addition to the tools on this list, two additional pieces of Linux-only command-line software were installed (via Ubuntu’s CLI package manager, apt-get): dvdisaster, an optical disc data recovery tool; and dvgrab, designed to capture raw DV streams from tape.
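Inside the BitCurator VM’s terminal, that installation is just a couple of commands:

$ sudo apt-get update
$ sudo apt-get install dvdisaster dvgrab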

Edit (8/9/17): dvgrab is a great tool, but the FireWire connections necessary to hook up to DV tape decks will not pass through VirtualBox into a virtualized OS. So installing dvgrab in this particular setup was essentially moot. But, leaving the mention of it here because it would still be useful for any digipres-oriented machine running a dedicated Linux/BitCurator installation rather than a virtualization.

 

Packaging and Fixity:

The Library of Congress’ free and open source GUI for packaging and transferring files according to the BagIt specification.

Simple GUI tool developed for quick accessioning of digital media (including assigning unique identifiers, basic Dublin Core metadata, and checksum validation) into a repository prior to more detailed appraisal and description.

Utility for regular review and checksum validation of files in a given directory. Can email regular reports to the user on status of files in long-term storage.

 

Validation and Quality Control:

Free and open source tool for adding, extracting and validating metadata within Broadcast Wave format audio files.

Quality control and reporting tool for extracting and examining metadata from DV streams, either during reformatting from tape media or post-capture analysis.

Open source framework for file format identification and validation. (Identifies what format a file purports to be, and confirms whether or not it validly conforms to that format specification)


Extensible, open source policy checker and reporter, currently targeted at validating audiovisual files conforming to preservation-level specs of Matroska (wrapper), LPCM (audio codec) and FFV1 (video codec).


Now a volunteer, open source project (formerly funded by Google and known as Google Refine) for cleaning and transforming metadata.


Free and open source option for in-depth inspection of digital video signals; intended to assist in identification and correction of common errors during digitization of analog video. GUI is aimed more at in-depth analysis of single files (see command line tool for batch processing).

 

Command Line

 


The “missing” package manager for Macs. Allows easy install and use of a huge variety of open source command line software. Unless otherwise noted, all programs below installed via

$ brew install [packagename]

But first you should

$ brew tap amiaopensource/amiaos

to allow easy download of a number of useful programs and libraries made available from the Association of Moving Image Archivists’ Open Source Committee!

The Mac OS Terminal application, which is how most users access and use command line software, by default uses version 3.2 of the Bash shell, whereas the most recent stable version of Bash is all the way up to 4.4. The difference won’t be noticeable to the vast majority of users, particularly regarding common digipres tasks, but users or students interested in more advanced bash scripting may want to take advantage of the shell’s newer features. Once Homebrew is installed, updating the Bash shell is a very easy process, outlined in the link above.
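In rough outline (the linked instructions have the full detail, and this assumes Homebrew’s default /usr/local prefix), the upgrade looks like this:

$ brew install bash
$ sudo bash -c 'echo /usr/local/bin/bash >> /etc/shells'
$ chsh -s /usr/local/bin/bash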

 

File Conversion and Manipulation:

Extremely powerful and flexible media transcoding and editing software. Using ffmprovisr as a guide to just *some* of the possibilities is highly recommended! (A few example commands for the tools in this section follow the list below.)

  • imagemagick

Similar to ffmpeg, but targeted specifically at transformation and transcoding of still image files.

  • sox

Sound processing tool for transcoding, editing, analyzing, and adding many effects to audio files.

  • jq

Utility specifically for manipulating and transforming metadata in the JSON format.
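As promised above, a few hedged examples of the kinds of one-liners these tools make possible (all file names here are just placeholders):

# make an H.264/AAC access copy of a video file with ffmpeg
$ ffmpeg -i input.mov -c:v libx264 -pix_fmt yuv420p -c:a aac access_copy.mp4
# convert a TIFF to a JPEG with ImageMagick
$ convert scan_001.tif scan_001.jpg
# drop the first 30 seconds of a WAV file with SoX
$ sox interview.wav interview_trimmed.wav trim 30
# pretty-print a JSON file with jq
$ jq '.' metadata.json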

 

Metadata Extraction/Characterization:

  • exiftool

Reads embedded and technical metadata of media files; especially powerful/helpful with regard to still image formats. (Example commands for this section follow the list below.)

  • mediainfo

Command line tool for displaying and creating reports on basic technical metadata for a wide variety of video and audio formats.

Signature-based file format identification tool that draws on several established registries (PRONOM, MIME-info). [special installation instructions required; see link for details]

  • tree

Basic utility (familiar to anyone who has used the old DOS/Windows “tree” command) that recursively displays all files and subfolders within a directory in a human-readable format – a nice companion to the basic “ls” command.
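Some hedged example invocations (filenames and paths are placeholders):

# dump all the embedded metadata exiftool can find in an image
$ exiftool scan_001.tif
# write a MediaInfo report for a video file out to XML
$ mediainfo --Output=XML tape_01.mov > tape_01_mediainfo.xml
# show a directory tree two levels deep
$ tree -L 2 ~/Desktop/accession_2017_001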

 

File transfer:

  • rsync

Fast, versatile tool for securely transferring or syncing files, either remotely (to a server) or locally (between drives). (Example commands follow this list.)

  • wget

Allows for downloading files from the internet (via HTTP, HTTPS, or FTP) from the command line.

  • youtube-dl

Command line program for downloading media off of YouTube and other popular media hosting web platforms.
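A few hedged examples (paths and URLs are placeholders):

# mirror a local folder to an external drive, preserving metadata, with progress shown
$ rsync -avh --progress ~/Desktop/transfer/ /Volumes/EXTERNAL/transfer/
# download a single file over HTTPS
$ wget https://example.org/files/report.pdf
# grab a copy of a web video
$ youtube-dl "https://example.org/watch?v=XXXXXXXX"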

 

Validation and Quality Control:

  • md5deep

Allows for computing, comparing and validating hashes for any file, according to a number of major checksum algorithms (md5, SHA-1, SHA-256, etc.). (Example commands follow this list.)

  • mediaconch

Command line policy checker and reporter for validating Matroska, FFV1 and LPCM files.

  • qcli

Command line tool for generating QCTools reports. Aimed at batch processing reports for a number of video files at once.
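Hedged examples again (filenames are placeholders):

# recursively hash a directory and save an md5 manifest with relative paths
$ md5deep -r -l ~/Desktop/accession_2017_001 > manifest.md5
# run MediaConch's default implementation checks on a Matroska/FFV1 file
$ mediaconch tape_01.mkv
# generate a QCTools report for a capture
$ qcli -i tape_01.mov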

 

Disk imaging:

  • ddrescue

Smart data recovery tool, can be life-saving for dealing with “dead” and failing hard drives.

More advanced versions of AccessData’s Forensic Tool Kit software are available via commercial licensing, but the command line version of their FTK Imager utility for making forensic disk images (raw or E01 format) can be downloaded for free [follow the link to appropriate product and system spec – you will have to provide some basic information and will be emailed a download link]

  • libewf

Library necessary for handling EWF (Expert Witness Format) forensic disk images.

 

Miscellaneous:

A package of scripts developed at CUNY TV for many useful tasks and batch processing related to audiovisual files. Utilizes several of the other command line programs mentioned here, including ffmpeg. Follow the link for installation instructions and more details on all the scripts available and what they do!

  • ntfs-3g

A read/write NTFS file system driver. When Fuse for OSX is installed, you can use the ntfs-3g command to manipulate files on NTFS (Windows)-formatted disks on Mac OS.
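For example, with FUSE for OSX installed, mounting an NTFS-formatted external drive read/write looks roughly like this (the disk identifier is hypothetical – check yours with “diskutil list” first):

$ sudo diskutil unmount /dev/disk2s1
$ sudo mkdir /Volumes/NTFS
$ sudo ntfs-3g /dev/disk2s1 /Volumes/NTFS -olocal -oallow_other
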
Python packages:

Python is a very common programming language, and an older version (2.7.10) comes pre-installed on Mac OS. You can install some Python-specific tools and packages that are not available via Homebrew, but you will need Python’s own package manager – a tool called “pip”. Follow the link above for instructions on adding pip to the Python installation on your computer; once that is successful, you can install each of the following three packages simply by typing:

$ pip install [packagename]

  • bagit

The currently-favored command line implementation of the Library of Congress’ bagit software for packaging files according to the BagIt specification. [see this post for some more context on what that all is about]. Invoke in the command line using “$ bagit.py” (a worked example follows this list).

  • brunnhilde

Command line tool for extracting metadata and creating reports on the files within disk images. Requires the “siegfried” tool to have already been installed via Homebrew. This utility must be installed to run the Brunnhilde GUI app. Invoke in the command line using “$ brunnhilde.py”

  • opf-fido

Package name for FIDO (Format Identification for Digital Objects), a command line tool for…uh… identifying file formats of digital objects. Uses the PRONOM registry. Invoke by just using “$ fido”
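A quick hedged example of these in action (the folder and file names are placeholders):

# bag a folder in place, generating md5 manifests
$ bagit.py --md5 ~/Desktop/accession_2017_001
# identify a file's format against PRONOM with FIDO
$ fido ~/Desktop/accession_2017_001/data/report.pdf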

 

For fun!

Just to add a little more flavor to your command line education. Try ’em and find out.

  • cowsay
  • ponysay
  • gti
  • sl
  • cats [requires some additional installation – follow instructions in the link!]

 

 

* It’s a practical reality that basically every grad student coming into the program already owns their own laptop, and would generally prefer to use their own machine during class – especially since in-class work frequently ties into take-home assignments, group work, etc. I therefore have not yet designed any sort of official loan policy for these laptops to leave the classroom. But it has also proven true, especially in our Digital Literacy and Handling Complex Media classes, that for certain lab exercises it really helps with classroom prep to have a guarantee that everyone in the room will be on the same page.

 

Upgrading Video Digitization Stations

In the primary MIAP lab we have four Mac Pro stations set up mainly for video digitization and capture. They get most heavily used during our two Video Preservation courses: Video Preservation I, which focuses on technical principles and practice of digitization from analog video sources, and Video Preservation II, which focuses more on vendor relations and guiding outsourced mass digitization projects, but by necessity involves a fair amount of digital video quality control/quality assurance as well. They get used for assorted projects in Collections Management, the “Talking Tech” workshops I’ve started leading myself, and the Cinema Studies department’s archive as well.

Over the course of 2016, the hardware on these four stations was really starting to show its age. These machines were originally bought and set up in 2012 – putting them in the last generation of the older “tower”-style silver Mac Pro desktops, before Apple radically shifted its hardware design to the “trash bin”-style Mac Pros that you can buy today. The operating system hadn’t been updated in a while either: they were still running Mac OSX 10.10 (Yosemite), whose last full update came in August 2015 (with a few security updates still following, at least).


These stations were stable – at least in the sense that all the software we needed definitely worked, and they would get the job done of digitizing/capturing analog video. But the limitations on how quickly and efficiently they could do this work were more and more apparent. The amount of time it took to, say, create a bag out of 200 GB of uncompressed video, transcode derivative copies, run an rsync script to back video packages up to a local NAS unit, or move files to/from external drives (a frequent case, as both Video Preservation classes usually partner with other cultural organizations in New York City who come to pick up their newly-digitized material via hard drive) was getting excruciating relative to newer systems, wasting class time and requiring a lot of coordination/planning of resources as ffmpeg or rsync chugged along for hours, or even overnight.

So, I knew it was time to upgrade our stations. But how to go about it? There were two basic options:

1. Purchase brand-new “trash bin” Mac Pros to replace the older stations


2. Open up the innards of the old Mac Pros and swap in updated, more powerful components

Buying brand-new Windows stations was basically out, just given the way our classes have been taught, the software we work with, and my own personal knowledge/preference/ability to maintain hardware. And I was lucky that #1 was even an option at all – the considerable resources available at NYU allow for choices that I would not have many other places. But MIAP also has a lot of equipment needs, and I’d generally rather put larger portions of our budget toward harder-to-get analog video equipment and refurbishment than jump for splashy new hardware that we don’t actually need. So I drew up some thoughts on what I actually wanted to accomplish:

  • improved data transfer rate between desktops and external drives (the fastest connection available, at best, was the mid-2012 Mac Pro’s native FireWire 800 ports; and many times we were limited to USB 2.0)
  • improved application multi-tasking (allow for, say, a Blackmagic Media Express capture to run at the same time as the ffmpeg transcode of a previous capture)
  • improved single-application processing power (speed up transcoding, bag creation and validation, rsync transfer if possible)
  • update operating system to OSX 10.11 (El Capitan, a more secure and up-to-date release than Yosemite and MUCH more stable than the new 10.12 Sierra)
  • maintain software functionality with a few older programs, especially Final Cut 7 or equivalent native-DV capture software

Consulting with adjunct faculty, a few friends, and the good old internet, it became clear that a quick upgrade by way of just purchasing new Mac Pros would pose several issues: first, that the Blackmagic Decklink Studio 2 capture cards we used for analog video digitization would not be compatible, requiring additional purchases of stand-alone Blackmagic analog-to-digital converter boxes on top of the new desktops to maintain current workflows. It is also more difficult to cheaply upgrade or replace the storage inside the newer Mac Pros, again likely requiring the eventual purchase of stand-alone RAID storage units to keep up with the amount of uncompressed video being pumped out; whereas the old Mac Pro towers have four internal drive slots that can be swapped in and out within minutes, with minimal expertise, and be easily arranged into various internal RAID configurations.

In other words, I decided it was much cheaper and more efficient to keep the existing Mac Pro stations, which are extremely flexible and easy to upgrade, and via new components bring them more or less up to speed with what completely new Mac Pros could offer anyway. In addition to the four swappable storage slots, the old Mac Pro towers feature easy-to-replace RAM modules, and PCI expansion slots on the back that offer the option to add extra data buses (i.e. more USB, eSATA, or Thunderbolt ports). You can also update the CPU itself – but while adding a processor with more cores would in theory (if I understand the theory, which is also far from a 100% proposition) be the single biggest boost to improving/speeding up processing, the Intel Quad-Core processors already in the old towers are no slouch (the default new models of the Mac Pro still have Quad-Cores), and would be more expensive and difficult to replace than those other pieces. Again, it seemed more efficient, and safer given my limited history with building computer hardware, to incrementally upgrade all the other parts, see what we’re working with, and someday in the future step up the CPU if we really, desperately need to breathe more life into these machines.

So, for each of the four stations, here are the upgrades made (listed as the general upgrade, followed by the specific model/pricing I went with; for any of these you could pursue other brands/models/sellers as well):

  • (1) 120 GB solid-state drive (for operating system and applications)

OWC Mercury Extreme Pro 6G SSD: $77/unit
OWC Mount Pro Drive Sled (necessary to mount SSDs in old Mac Pros): $17/unit

  • (1) 1 TB hard drive (for general data storage – more on this later)

Western Digital Caviar Blue 1 TB Internal HDD: $50/unit

  • (1) PCI Express Expansion Card, w/ eSATA, USB 3.0 and USB 3.1 capability

CalDigit FASTA-6GU3 Plus: $161/unit

  • (4) 8 GB RAM modules, for a total of 32 GB

OWC 32.0 GB Upgrade Kit: $139/unit


Summed up, that’s less than $500 per computer and less than $2000 for the whole lab, which is a pretty good price for (hopefully) speeding up our digitization workflow and keeping our Video Preservation courses functional for at least a couple more years.

The thinking: with all that RAM, multi-tasking applications shouldn’t be an issue, even with higher-resource applications like Final Cut 7, Blackmagic Media Express, ffmpeg, etc. With the OSX El Capitan operating system and all applications hosted on solid-state memory (the 120 GB SSD) rather than a hard drive, single applications should run much faster (as the drives don’t need to literally spin around to find application or system data). And with a new 1 TB hard drive added to each computer, the three non-OS drive slots on each machine are now all filled with 1 TB hard drives. I could have two of those configured in a RAID 0 stripe arrangement, to increase the read and write speed of user data (i.e. video captures) – the third drive can serve as general backup or as storage for non-video digitization projects, as needed.




The expansion cards will now allow eSATA or USB 3.0-speed transfers to compatible external drives. The USB 3.1 function on the specific CalDigit cards I got won’t work unless I upgrade the operating system to 10.12 Sierra, which I don’t want to do just yet. That’s basically the one downside compared to the all-new Mac Pros, which would’ve offered Thunderbolt transfer speeds faster than USB 3.0 – but for now, USB 3.0 is A) still a drastic improvement over what we had before, B) probably the most common connection on the consumer external drives we see anyway, and C) with an inevitable operating system upgrade we’ll “unlock” the USB 3.1 capability to keep up as USB 3.1 connections become more common on external drives.

(Once installed, the new card occupies the top row of PCI slots – followed by the two slots for the Blackmagic Decklink Studio 2 input/output and the AMD Radeon graphics card input/output at the bottom.)

Installing all these components was a breeze. Seriously! Even if you don’t know your way around the inside of a computer at all, the old Mac Pro towers were basically designed to be super customizable and easy to swap out parts, and there’s tons of clear, well-illustrated instructional videos available to follow.

As I mentioned in a previous post about opening up computers, the main issue was grounding. Static discharge damaging the internal parts of your computer is always a risk when you go rooting around and touching components, and since the MIAP lab is carpeted, I was a bit worried about accidentally frying a CPU with my shaky, novice hands. So I also picked up a $20 computer repair kit that included an anti-static wristband, which I wore while removing the desktops from their station mounts, cleaning them out with compressed air, and swapping in the mounted SSDs, new HDDs, expansion cards, and RAM modules.


With the hardware upgrades completed, it was time to move on to software and RAID configuration. Using a free program called DiskMaker X6, I had created a bootable El Capitan install disk on a USB stick (to save the time of having to download the installer to each of the four stations separately). Booting straight into this installer program (by plugging in the USB stick and holding down the Option key when turning on a Mac), I was able to quickly go through the process of installing OSX El Capitan onto the SSDs. For now that meant I could theoretically start up the desktop from either El Capitan (on the SSD) or Yosemite (still hosted on one of the HDDs) – but I wanted to wipe all the storage and start from scratch here.
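(If you’d rather skip the third-party app, Apple’s own createinstallmedia tool does roughly the same job from the command line – this sketch assumes the El Capitan installer is sitting in /Applications and the USB stick is mounted as /Volumes/Untitled:)

$ sudo /Applications/Install\ OS\ X\ El\ Capitan.app/Contents/Resources/createinstallmedia --volume /Volumes/Untitled --applicationpath /Applications/Install\ OS\ X\ El\ Capitan.app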

I accomplished this using Disk Utility, the built-in program for drive management included with OSX. Once I had backed up all important user data from all the hard drives, I completely reformatted all of them (sticking with the default Mac OS Extended (Journaled) formatting), including the brand-new 1 TB drives. So now each station had an operating system SSD running El Capitan and three blank 1 TB hard drives to play with. As mentioned earlier, I wanted to put two of those in a RAID 0 data stripe arrangement – a way of turning two separate drives into one logical “volume”. RAID 0 is a mildly dangerous arrangement in that the failure of either one of those drives means total data loss; but it offers a significant performance boost in read/write speed (hopefully decreasing the likelihood of dropped frames during capture, improving time spent on fixity checks and bagging, etc.), while maintaining the full 2 TB of storage across the two drives (most RAID arrangements, focused more on data security and redundancy than performance, result in a volume smaller than the total capacity of the physical disks). And files are not meant to be stored long-term on these stations anyway: they are either returned to the original institution, backed up to the more secure, RAID 6-arranged NAS, or backed up to our department’s fleet of external drives – if not some combination of those options.

So it was at this point that I discovered that in the upgrade from Yosemite to El Capitan, Apple actually removed functionality from the Disk Utility application. The graphic interface for Disk Utility in Yosemite and earlier versions of OSX featured an option to easily customize RAID arrangements with your drives. In El Capitan (and, notably, El Capitan only – the feature has returned in Sierra), you’re only allowed to erase, reformat and partition drives.



Which means to the Terminal we go. The command-line version of Disk Utility (invoked with the “diskutil” command) can still quickly create and format a RAID volume. First, I had to run a quick

$ diskutil list

…in order to see the file paths/names for the two physical disks that I wanted to combine to create one volume (previously named MIAP_Class_Projects and Station_X_Backup).


In this case, I was working with /dev/disk1 and /dev/disk3. Once I had the correct disks identified, I could use the following command:

$ diskutil appleRAID create stripe JHFS+ disk1 disk3

Let’s break this down:

diskutil – command used to invoke the Disk Utility application

appleRAID – option to invoke the underlying function of Disk Utility that creates RAIDs – it’s still there, they just removed it from the graphical version of Disk Utility in El Capitan for some reason ¯\_(ツ)_/¯

create stripe – tells Disk Utility that I want to create a RAID 0 (striped) volume

JHFS+ – tells Disk Utility I want the striped volume to be formatted using the journaled HFS+ file system (the default Mac OS Extended (Journaled) formatting)

disk1 disk3 – lists the two drives, with the names taken from the previous command above, that I want to combine for this striped volume

Note: Be careful! When working with Disk Utility, especially in the command line, be sure you have all the data you want properly backed up. You can see how you could easily wipe/reformat disks by putting in the wrong disk number in the command.

End result: two physical disks combined to form a 2 TB volume, renamed to MIAP_Projects_RAID:

(In Disk Utility’s GUI, the new 2 TB RAID volume now appears, while the two physical drives are still listed in the “Internal” menu on the left – but without subset logical volumes, unlike the SSD with its El Capitan OS volume or the WDC hard drive with the “CS_Archive_Projects” volume.)

Hooray! That’s about it. I did all of this with one station first, which allowed me the chance to reinstall all the software, both graphical and CLI, that we generally use in our courses, and to test our usual video capture workflows. As mentioned before, my primary concern was that older native-DV capture software like Final Cut 7 or Live Capture Plus would break, given that official OS support for those programs ended a long time ago, but as near as I can tell they still work in El Capitan. That’s no guarantee, but I’ll troubleshoot more when I get there (and keep around a bootable USB stick with OSX 10.9 Mavericks on it, just in case we have to revert to an older operating system to capture DV).


I wish that I had thought to actually run some timed tests before I made these upgrades, so that I would have some hard evidence of the improved processing power and time spent on transcoding, checksumming, etc. But I can say that having the operating system and applications hosted on solid-state memory, and the USB 3.0 transfer speeds to external drives, have certainly made a difference even to the unscientific eye. It’s basically like we’ve had brand-new desktops installed – for a fraction of the cost. So if you’re running a video digitization station, I highly recommend learning your way around the inside of a computer and its different components – whether you’re on PC or still working with the old Mac Pro towers, just swapping in some fresh innards could make a big difference and save the trouble and expense of all-new machines. I can’t speak to working with the new Mac Pros of course, but I would be very interested to hear from anyone using those for digitization about their flexibility – for instance, if I didn’t already have the old Mac Pros to work with, and had been completely starting from scratch, what would you recommend? Buying the new Pros, or hunting down some of the older desktop stations, for the greater ability to customize them?

Changing a CMOS Battery

Computers, computers, computers. I knew becoming a “technician” meant that I would learn more about computers and how they work, but I admit when I took an archival post I thought it might involve more film and analog video tech. Not that I don’t work with those things – in a couple weeks I will be demonstrating/troubleshooting film digitization workflows for telecine and scanners for our first-year basic media training/handling course. But I suppose the things that I have really expanded my knowledge of in the past year, and the things I am therefore excited and inclined to pass along to you, dear readers, have to do with the inner workings of computers and electronics. I have relied on them for so long, and really had little idea of all the intricate components that make them tick – and I’d like to quickly talk about another one of those pieces today.


A few months back, while dual-booting a Windows machine, I wrote about the BIOS – the “Basic Input/Output System” that lives on the motherboard of every computer. As I explained then, this is a piece of firmware (“firmware” is a read-only application or piece of code that is baked in to a piece of equipment/hardware and cannot be easily installed/uninstalled – hence “firm” instead of “soft”) that maintains some super-basic functionality for your computer even if the operating system were to corrupt or fail.

Recently, the computer on our digital audio workstation started automatically booting into the BIOS every time it was turned on. You could easily work around this – choosing to immediately exit the BIOS would boot Windows 7, and all seemed to be functioning normally. But the system clock would reset every time (back to the much more innocent time of January 2009). I wasn’t overly concerned, as any kind of actual failure with the operating system would’ve produced some kind of error message in the BIOS, and there was no such warning. But what was going on?


Doing a little digging on the internet, I found the giveaway was the system clock reset. The BIOS firmware itself lives in a tiny Read-Only Memory (ROM) chip on the motherboard, but the settings you configure in the BIOS are saved to an even tinier piece of memory, usually called CMOS RAM. It is very small because there are only a few pieces of data that actually need to be saved there every time you turn the computer on and off – just system settings like the clock, which take up a minuscule amount of space compared to video, audio, or even large text or application files. The trick is, this memory needs a constant source of power to actually hold on to anything – even these tiny pieces of data. Cut off its power supply and it will completely reset/empty itself (much the same way your main RAM – Random Access Memory – functions, but that’s another topic). This is what was happening – the BIOS setup was popping up every time we booted the computer because its settings had been reset and it just wanted us to set up our clock, fan speed, and other default system settings again (and again, and again, and again).


 


How was the power being cut off? Under most circumstances, just leaving a computer plugged into a wall outlet is enough to maintain a trickle of electricity to the CMOS (complementary metal-oxide-semiconductor) circuitry that holds those BIOS settings. Yes, as long as your device is still plugged into a wall outlet, there is electricity running to it, even if you have turned the device off. Which is why we keep the computer (and various other devices) on our digital audio workstation on a power strip that we periodically completely switch off – to conserve power/electricity.


But there is usually a failsafe to save the BIOS settings even under such circumstances: every modern motherboard has a small coin battery to power the CMOS memory, which maintains the electric flow that saves your BIOS settings even if the power supply unit on the back of your computer is completely disconnected. What happens when you completely unplug the computer AND that battery dies? Your BIOS will reset every time you boot up the computer – exactly what happened to us.

So, at the end of all that, easy fix: buy a new coin battery at the hardware store and replace the dead battery in the computer. Most modern CMOS batteries are type CR2032, but unfortunately manufacturers tend not to just list that information in their specs, so there’s no way to know for sure until you open up the computer itself and take a look. Hopefully the battery is in an easy-to-find spot that doesn’t require you to actually remove any expansion cards/RAM/hard drives/whatever else you’ve got going on inside your computer.


Here are some basic instructions for replacing a CMOS battery if this happens to you – but it really is as easy as it sounds. The main thing to keep in mind, when working in the inside of any computer, is to take steps to ground yourself and avoid the buildup of static electricity – electro-static discharge can seriously damage the components of your computer, without you even noticing it (the threshold for static electricity to damage circuits is much lower than it is for you to feel those shocks). There are some easy tips here for grounding yourself – I just made sure I was wearing cotton clothes, was working on an anti-static (foam) surface, and occasionally touched a nearby metal surface as I worked.

I’ll be back soon with (I hope) a much longer piece running down my experience at the Association of Moving Image Archivists’ conference in Pittsburgh this month and how I found it very powerfully related to my work this year. Cheers!

Windows Subsystem for Linux – What’s the Deal?

This past summer, Microsoft released its “Anniversary Update” for Windows 10. It included a lot of the business-as-usual sort of operating system updates: enhanced security, improved integration with mobile devices, updates to Microsoft’s “virtual assistant” Cortana (who is totally not named after a video game AI character who went rampant and is currently trying to destroy all biological life in the known universe, because what company would possibly tempt fate like that?)


But possibly the biggest under-the-radar change to Windows 10 was the introduction of Bash on Ubuntu on Windows. Microsoft partnered with Canonical, the company that develops the popular Linux operating system distribution Ubuntu, to create a full-fledged Linux/Ubuntu subsystem (essentially Ubuntu 14.04 LTS) inside of Windows 10. That’s like a turducken of operating systems.

(Which layer is the NT kernel, though?)

What does that mean, practically speaking? For years, if you were interested in command-line control of your Windows computer, you could use Powershell or the Command Prompt – the latter being essentially the same basic command-line system that Microsoft has been offering since the pre-Windows days of MS-DOS. Contrast that with Unix-based systems like Mac OSX and Ubuntu, which by default use an input system called the Bash shell – the thing you see any time you open the application Terminal.


 

The Bash shell is very popular with developers and programmers. Why? A variety of reasons. It’s an open-source system versus Microsoft’s proprietary interface, for one. It has some enhanced security features to keep users from completely breaking their operating system with an errant command (if you’re a novice command-line user, that’s why you use the “sudo” command sometimes in Terminal but never in Command Prompt – Windows just assumes everyone using Command Prompt is a “super user” with access to root directories, whereas Mac OSX/Linux prefers to at least check that you still remember your administrative password before deleting your hard drive from practical existence). The Bash scripting language handles batch processing (working with a whole bunch of files at once), scheduling commands to be executed at future times, and other automated tasks a little more intuitively. And, finally, Unix systems have a lot more built-in utility tools that make software development and navigating file systems more elegant (to be clear, these utility applications are not technically part of the Bash shell – they are built into the Mac OSX/Linux operating system itself and accessed via the Bash shell).

Bringing in a Linux subsystem and Bash shell to Windows is a pretty bold move to try and win back developers to Microsoft’s platform. There have been some attempts before to build Linux-like environments for Windows to port Mac/Linux software – Cygwin was probably the most notable – but no method I ever tried, at least, felt as intuitive to a Mac/Linux user as Bash on Ubuntu on Windows does.


Considering the increasing attention on open-source software development and command-line implementation in the archival community, I was very curious as to whether Bash on Ubuntu on Windows could start bridging the divide between Mac and Windows systems in archives and libraries. The problem of incompatible software and the difference in command-line language between Terminal and Command Prompt isn’t insurmountable, but it’s not exactly convenient. What if we could get all users on the same page with the software they use AND how they use them – regardless of operating system???

OK. That’s still a pipe dream. I said earlier that the Windows Subsystem for Linux (yes that’s what it’s technically called even though that sounds like the exact opposite of what it should be) was “full-fledged” – buuuuuut I kinda lied. Microsoft intends the WSL to be a platform for software development, not implementation. You’re supposed to use it to build your applications, but not necessarily to actually deploy them into a Windows-based workflow. To that end, there are some giant glaring holes compared to a pure Ubuntu installation: using Bash on Ubuntu on Windows, you can’t deploy any Linux software with a graphical user interface (GUI) (for example, the common built-in Linux text editing utility gedit doesn’t work – but nano, which allows you to text edit from within the Bash terminal window itself, does). It’s CLI or bust. Any web-based application is also a big no-no, so you’re not going to be able to sneakily run a Windows machine as a server using the Linux subsystem any time soon.

Edit: Oh and the other giant glaring thing I forgot to mention the first time around – there’s no external drive support yet. So the WSL can’t access removable media on USB or optical disc mounted on the Windows file system – only fixed drives. So disc imaging software, while it technically “works”, can only work with data already moved to your Windows system.

But with all those caveats in mind… who cares what Microsoft says is supposed to happen? What does it actually do? What works, and what doesn’t? I went through a laundry list of command-line tools that have been used or taught the past few years in our MIAP courses (primarily Video Preservation, Digital Preservation and Handling Complex Media), plus a few tools that I’ve personally found useful. First, I wanted to see if they installed at all – and if they did, I would try a couple of that program’s most basic commands, hardly anything in the way of flags or options. I wasn’t really trying to stress-test these applications, just see if they could indeed install and access the Windows file system in a manner familiar to Mac/Linux users.


Before I start the run-down, a note on using Bash on Ubuntu on Windows yourself, if interested. Here are the instructions for installing and launching the Windows Subsystem for Linux – since the whole thing is technically still in beta, you’ll need to activate “developer” mode. Once installed and launched, ALL of these applications will only work through the Bash terminal window – you can not access the Linux subsystem, and all software installed thereon, from the traditional Windows Command Prompt. (It goes the other way too – you can’t activate your Windows applications from the Bash shell. This is all about accessing and working with the same files from your preferred command-line environment.) And once again, the actual Ubuntu version in this subsystem is 14.04 LTS – which is not the latest stable version of that operating system. So any software designed only to work with Ubuntu 16.04 or the very latest 16.10 isn’t going to work in the Windows subsystem.

Once you’re in a Bash terminal, you can access all your files stored within the Windows file system by navigating into the “/mnt/” directory:

$ cd /mnt/

You should see different letters within this directory according to how many drives you have mounted in your computer, and their assigned letters/paths. For instance, for many Windows users all your files will probably be contained within something like:

/mnt/c/Users/your_user_name/Downloads or

/mnt/c/Users/your_user_name/Desktop , etc. etc.

And one last caveat: dragging and dropping a file into the Bash terminal to quickly get the full file path doesn’t work. It will give you a Command Prompt file path (e.g. “C:\Users\username\Downloads\file.pdf”) that the Bash shell can’t read. You’re going to have to manually type out the full file path yourself (tabbing over to automatically fill in directory/file names does still work, at least).

Let’s get to it!

Programs That Install Via Apt-Get:

  • bagit-java
  • bagit-python
  • cdrdao
  • ClamAV
  • ddrescue (install w/package name “gddrescue”, execute w/ command “ddrescue”)
  • ffmpeg (but NOT ffplay)
  • git
  • imagemagick
  • md5deep
  • mediainfo
  • MKVToolNix
  • Python/Python3/pip
  • Ruby/RubyGems
  • rsync
  • tree

Installing via Ubuntu’s “apt-get” utility is by far the easiest and most desirable method of getting applications installed on your Linux subsystem. It’s a package manager that works the same way as Homebrew on Mac, for those used to that system: just execute

$ [sudo] apt-get install nameofpackage

and apt-get will install the desired program, including all necessary dependencies (any software libraries or other software utilities necessary to make the program run). As you can see, the WSL can handle a variety of useful applications: disk imaging (cdrdao, ddrescue), transcoders (ffmpeg, imagemagick), virus scanning (ClamAV), file system and metadata utilities (mediainfo, tree), hash/checksum generation (md5deep).


You can also get distributions of programming languages like Python and Ruby and use their own package managers (pip, RubyGems) to install further packages/libraries/programs. I tried this out with Python by installing bagit-python (my preferred flavor of BagIt – see this previous post for the difference between bagit-python and the bagit-java program you can get via apt-get), and with Ruby by installing the Nokogiri library and running through this little Ruby exercise by Ashley Blewer. (I’d tried it before on Mac OSX but guess what, works on Windows through the WSL too!)
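For reference, getting pip and bagit-python going inside the WSL looked roughly like this:

$ sudo apt-get install python-pip
$ sudo pip install bagit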

A couple things to note: one, if you’re trying to install the Ubuntu version of ddrescue, there are confusingly a couple of different packages with similar names that serve the same purpose. There’s a nice little rundown of how that happened and how to make sure you’re installing and executing exactly the program you want on the Ubuntu forums.

Also, while ffmpeg’s transcoding and metadata-gathering features (ffprobe) work fine, its built-in media playback command (ffplay) will not, because of the aforementioned issue with GUIs (it has to do with X11, the window system that Unix systems use for graphical display, but never mind that for now). Actually, it sort of depends on how you define “work”, because while ffplay won’t properly play back video, it will generate some fucking awesome text art for you in the Bash terminal:

 

Programs That Require More Complicated Installation:

  • bulk_extractor (requires legacy JDK)
  • exiftool
  • Fslint
  • mediaconch
  • The Sleuth Kit tools

These applications can’t be installed via an apt-get package, but you can still get them running with a little extra work, thanks to other Linux features such as dpkg. Dpkg is another package management program – this one comes from Debian, a Linux operating system of which Ubuntu is a direct (more user-friendly) derivative. You can use dpkg to install Debian (.deb) packages (like the CLI of Mediaconch), although take note that unlike apt-get, dpkg does not automatically install dependencies – so you might need to go out and find other libraries/packages to install via apt-get or dpkg before your desired program actually starts working (for Mediaconch, for instance, you should just apt-get install Mediainfo first to make sure you have the libmediainfo library already in place).
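As a rough sketch of the MediaConch case (the exact .deb file name will depend on which release you’ve downloaded, so treat it as a placeholder):

$ sudo apt-get install mediainfo
$ sudo dpkg -i mediaconch_XXX_amd64.deb               (placeholder file name)
$ sudo apt-get install -f               (only if dpkg complains about missing dependencies)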

The WSL also has the standard build utilities of full Linux distributions (make, autoconf/automake, and so on), so you can use those to get packages like The Sleuth Kit (a bunch of digital forensics tools) or Fslint (a duplicate file finder) running. The best solution is to follow whatever Linux installation documentation there is for each of these programs – if you have questions about troubleshooting specific programs, let me know and I’ll try to walk you through my process.
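For what it’s worth, the general shape of a source build like The Sleuth Kit (assuming you’ve already grabbed and unpacked a source tarball and installed the build-essential package via apt-get) is the familiar configure/make routine, though again, defer to each project’s own Linux documentation:

$ ./configure
$ make
$ sudo make install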

fslint

 

Programs That Don’t Work/Install…Yet:

  • Archivematica
  • Guymager
  • vrecord

I had no expectation that these programs would work given the stated GUI and web-based limitations of the WSL, but this is just to confirm that, as far as I can tell, there’s no way to get them running. Guymager has the obvious GUI/X11 issue (plus the inability to recognize external devices, and the general dysfunction of the /dev/ directory). The vrecord team hasn’t yet gotten the program running on Linux at all, and the WSL would run into the GUI issue even if they do release a Linux version. And web applications definitely aren’t my strong suit, but in the long process of attempting an Archivematica installation, the WSL seemed to have separate issues with Apache, uWSGI and NGINX. That’s a lot of troubleshooting to likely no end, so best to probably leave that one aside.

That’s about all for now – I’m curious if anyone else has been testing the WSL, or has any thoughts about its possible usefulness in bridging compatibility concerns. Is there any reason we shouldn’t just be teaching everyone Bash commands now??

Update (10/20): So the very day that I post this, Microsoft released a pretty major update to the WSL, with two major effects: 1) new installations of WSL will now be Ubuntu 16.04 (Xenial), though existing users such as myself will not automatically upgrade from 14.04; and 2) the Windows and Linux command-line interfaces now have cross-compatibility, so you can launch Windows applications from the Bash terminal and Linux applications from Command Prompt or Powershell. Combine that with the comment below from Euan with directions to actually launch Linux applications with GUIs, and there’s a whole slew of options to continue exploring here. Look for further posts in the future! This subsystem is clearly way more powerful than Microsoft initially wanted to let on.

Using BagIt

do you know how to use Bagit?
cause I’m lost
the libcongress github is sooo not user friendly

I’m missing bagger aren’t I
ugh ugh I just want to make a bag!

That’s an email I got a few weeks back from a good friend and former MIAP classmate. I wanted to share it because I feel like it sums up the attitude of a lot of archivists towards a tool that SHOULD be something of a godsend to the field and a foundational step in many digital preservation workflows – namely, the Library of Congress’ BagIt.

What is BagIt? It’s a software library, developed by the LoC in conjunction with their partners in the National Digital Information Infrastructure and Preservation Program (NDIIPP) to support the creation of “bags.” OK, so what’s a bag?

amerbeauty_165pyxurz

Let’s back up a minute. One of the big challenges in digital archiving is file fixity – a fancy term for checking that the contents of a file have not been changed or altered (that the file has remained “fixed”). There’s all sorts of reasons to regularly verify file fixity, even if a file has done nothing but sit on a computer or server or external hard drive: to make sure that a file hasn’t corrupted over time, that its metadata (file name, technical specs, etc.) hasn’t been accidentally changed by software or an operating system, etc.
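At its most basic, a fixity check is nothing more than: record a checksum now, generate it again later, and compare. With standard command-line checksum tools that might look like this (the file name is just a placeholder):

$ md5sum mymovie.mov > mymovie.mov.md5               (record the checksum)
$ md5sum -c mymovie.mov.md5               (later: verify the file still matches)

If the file has changed in the meantime, that second command reports a FAILED instead of an OK, which is exactly the kind of warning you want before, not after, you delete your only other copy.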

But one of the biggest threats to file fixity is when you move a file – from a computer to a hard drive, or over a server, or even just from one folder/directory in your computer to another. Think of it kind of like putting something in the mail: there are a lot of points in the mailing process where a computer or USPS employee has to read the labeling and sort your mail into the proper bin or truck or plane so that it ends up getting to the correct destination. And there’s a LOT of opportunity for external forces to batter and jostle and otherwise get in your mail’s personal space. If you just slap a stamp on that beautiful glass vase you bought for your mother’s birthday and shove it in the mailbox, it’s not going to get to your mom in one piece.

screen-shot-2016-09-20-at-10-04-35-am
And what if you’re delivering something even more precious than a vase?

So a “bag” is a kind of special digital container – a way of packaging files together to make sure what we get on the receiving end of a transfer is the same thing that started the journey (like putting that nice glass vase in a heavily padded box with “fragile” stamped all over it).

Great, right? Generating a bag can take more time than people want, particularly if you’re an archive dealing with large, preservation-quality uncompressed video files, but it’s a no-brainer idea to implement into workflows for backing up/storing data. The thing is, as I said, the BagIt tools developed by the Library of Congress to support the creation of bags are software libraries – not necessarily, in and of themselves, fully-developed, ready-to-go applications. Developers have to put some kind of interface on top of the BagIt library for people to actually be able to interact with and use it to create bags.

So right off the bat, even though tech-savvier archivists may constantly be recommending to others in the field to “use BagIt” to deliver or move their files, we’re already muddling the issue for new users, because there literally is no one, monolithic thing called “BagIt” that someone can just google and download and start running. And I think we seriously underestimate how much of a hindrance that is to widespread implementation. Basically anyone can understand the principles and need behind BagIt (I hopefully did a swift job of it in the paragraphs above) – but actually sifting through and installing the various BagIt distributions currently takes time, and an ability to read between the lines of some seriously scattered documentation.

So here I’m going to walk through the major BagIt implementations and explain a bit about how and why you might use each one. I hope consolidating all this information in one place will be more helpful than the Library of Congress’ github pages (which indeed make little effort to make their instructions accessible to anyone unfamiliar with developer-speak). If you want to learn more about the BagIt specification itself (i.e. what pieces/files actually make up a bag, how data gets hierarchically structured inside a bag, how BagIt uses checksums and tag manifests to do the file fixity work I mentioned earlier), I can recommend this introductory slideshow to BagIt from Justin Littman at the LoC.
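In the meantime, here’s roughly the shape of a finished bag, just so the sections below have something concrete to point at: a few small text manifests plus a “data” folder that holds your actual files (this approximates the output of the tree utility mentioned earlier; file names are placeholders):

$ tree my_bag
my_bag
├── bag-info.txt
├── bagit.txt
├── data
│   └── mymovie.mov
├── manifest-md5.txt
└── tagmanifest-md5.txt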

1. Bagger (BagIt-java)

screen-shot-2016-09-20-at-9-27-12-am

The BagIt library was originally developed using a programming language called Java. For its first four stable versions, Bagit-java could be used either via command-line interface (a Terminal window on Macs or Linux/Ubuntu, Command Prompt in Windows), or via a Graphical User Interface (GUI) developed by the LoC itself called Bagger.

As of version 5 of Bagit-java, the LoC has completely ceased support of using BagIt-java via the command line. That doesn’t mean it isn’t still out there – if, for instance, you’re on a Mac and use the popular package manager Homebrew, typing

$ brew install bagit

will install the last stable version (4.12.1) of BagIt-java. But damned if I know how to actually use it, because in deprecating support the LoC seems to have removed (or maybe just never wrote?) any online documentation (github or elsewhere) of how to use BagIt-java via the command line. No manual page for you.

Instead you now have to use Bagger to employ BagIt-java (from the LoC’s perspective, anyway). Bagger is primarily designed to run on Windows or, with some tinkering, Linux/Ubuntu and Mac OSX.

maxresdefault
They do, funnily enough, include a screed about the iPhone 7’s lack of a 3.5mm headphone jack.

So once you actually download Bagger, which I primarily recommend if you’re on Windows, there’s some pretty good existing documentation for using the application’s various features, and even without doing that reading, it’s a pretty intuitive program. Big honking buttons help you either start making a new bag (picking the A/V and other files you want to be included in the bag one-by-one and safely packaging them together into one directory) or create a “bag in place”, which just takes an already-existing folder/directory and structures the content within that folder according to the BagIt specification. You can also validate bags that have been given/sent to you (that is, check the fixity of the data). The “Is Bag Complete” feature checks whether a folder/directory you have is, in fact, officially a bag according to the rules of the BagIt spec.

(FWIW: I managed to get Bagger running on my OSX 10.10.5 desktop by installing an older version, 2.1.3, off of the Library of Congress’ Sourceforge. That download included a bagger.jar file that my Mac could easily open after installing the legacy Java 6 runtime environment (available here). But, that same Sourceforge page yells at you that the project has moved to Github, where you can only download the latest 2.7.1 release, which only includes the Windows-compatible bagger.bat file and material for compiling on Linux, no OSX-compatible .jar file. I have no idea what’s going on here, and we’ve definitely fallen into a tech-jargon zone that will scare off laypeople, so I’m going to leave it at “use Bagger with Windows”)

Update: After some initial confusion (see above paragraph), the documentation for running Bagger on OSX has improved in two ways! First, the latest release of Bagger (2.7.2) came with some tweaks to the github repo’s documentation, including instructions for Linux/Ubuntu/OSX! Thanks, guys! Also check out the comments section on this page for some instructions for launching Bagger in OSX from the command line and/or creating an AppleScript that will let you launch Bagger from Spotlight/the Applications menu like you would any other program.

2. Exactly

screen-shot-2016-09-20-at-9-29-04-am
Ed Begley knows exactly what I’m talking about

Developed by the consulting/developer agency AVPreserve with University of Kentucky Libraries, Exactly is another GUI built on top of Bagit-java. Unlike Bagger, it’s very easy to download, install, and immediately run Exactly on Mac or Windows, and AVPreserve provides a handy quickstart guide and a more detailed user manual, both very useful if you’re just getting started with bagging. Its features are at once more limited and more expansive than Bagger’s. The interface isn’t terribly verbose, meaning it’s not always clear what the application is actually doing from moment to moment. But Exactly is more robustly designed for insertion of extra metadata into the bag (users can create their own fields and values to be inserted in a bag’s bag-info.txt file, so you could include administrative metadata unique to your own institution).

And Exactly’s biggest attraction is that it actually serves as a file delivery client – that is, it won’t just package the files you want into a bag before transferring, but will actually perform the transfer. So if you want to regularly move files to and from a dedicated server for storage with minimal fuss, Exactly might be the tool you want, though it could still use some aesthetic/verbosity design upgrades.

3. Command Line (BagIt-python)

screen-shot-2016-09-20-at-9-02-09-am

Let’s say you really prefer to work with command-line software. If you’re comfortable with CLI, there are many advantages – greater control over your bagging software and exactly how it operates. I mentioned earlier that the LoC stopped supporting BagIt-java for command-line, but that doesn’t mean you command-line junkies are completely out of luck. Instead they shifted support and development of command-line bagging to a different programming language: Python.

If you’re working in the command-line, chances are you’re on Mac OSX or maybe Linux. I’m going to assume from here on you’re on OSX, because anyone using Linux has likely figured all this out themselves (or would prefer to). And here’s the thing, if you’re a novice digital archivist working in the Terminal on OSX: you can’t install BagIt-python using Homebrew.

Instead, you’re going to need Python’s own package manager, a program called “pip.” In order to get Bagit-python, you’re going to need to do the following:

  1. Check what version of Python is on your computer. Mac OSX and Linux machines should come with Python already installed, but Bagit-python will require at least Python version 2.6 or above. You can check what version of Python you’re running in the terminal with: 

    $ python --version 

    If your version isn’t high enough, visit https://www.python.org/downloads/ and download/install Python 2.7.12 for your operating system. (do not download a version of Python 3.x – Bagit-python will not work with Python 3, as if this wasn’t confusing enough) 

  2. Now you’ll need the package manager/installer, pip. It may have come ready to go with your Python installation, or not. You can check that you have pip installed with: 

    $ pip --version 

    If you get a version number, you’re ready to go to step 3. If you get a message that pip isn’t installed, you’ll have to visit https://pip.pypa.io/en/stable/installing/ and click on the hyperlinked “get-pip.py”. A tab should open with a buncha text – just hit Command-S and you should have the option to save/download this page as a Python script (that is, as a .py file). Then, back in the Terminal, navigate into whatever directory you just downloaded that script into (Downloads perhaps, or Desktop, or wherever else), and run the script by invoking:

    $ python get-pip.py

    Pip should now be installed.

     

  3. Once pip is in place, you can just use it to install Bagit-python the same way you’d use Homebrew:

    $ pip install bagit               (sudo if necessary)

You should be all set now to start using Bagit-python via the command-line. You can invoke Bagit-python using commands that start with “bagit.py” – it’s a good idea to go over command line usage and options for adding metadata by visiting the help page first, which is one of the LoC’s better efforts at documentation: https://github.com/LibraryOfCongress/bagit-python or:

$ bagit.py --help

But the easiest usage is just to create a bag “in place” on a directory you already have, with a flag for whichever kind of checksum you want to use for fixity:

$ bagit.py --md5 /path/to/directory

As with Bagger, the directory will remain exactly where it is in your file system, but now will contain the various tag/checksum manifests and all the media files within a “data” folder according to the BagIt spec. Again, the power of command-line BagIt-python lies in its flexibility – the ability to add metadata, choose different checksum configurations, increase or decrease the verbosity of the software. If you’re comfortable with command-line tools, this is the implementation I would most recommend!
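For instance, a bag-in-place with sha256 checksums and a couple of bag-info fields baked in might look something like this (the flag names below are drawn from bagit-python’s standard bag-info options, so double-check them against bagit.py --help for whatever version you’ve installed; the organization and name values are placeholders):

$ bagit.py --sha256 --source-organization "Your Archive" --contact-name "Your Name" /path/to/directory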

Update:  Please also read this terrific tutorial by Kathryn Gronsbell for the 2016 NDSR symposium for a more detailed rundown of BagIt-python use and installation, including sample exercises!!!

4. BaggerJS

screen-shot-2016-09-20-at-9-31-51-am

It’s still in an early/experimental phase, but the LoC is also working on a web-based application for bagging called BaggerJS (built on a version of the BagIt library written in JavaScript, which yes, for those wondering at home, is a totally different programming language than Java that works specifically for web applications, because we needed more versioning talk in this post).

Right now you can select and hash files (generate checksums for file fixity), and upload valid bags to a cloud server compatible with Amazon’s S3 protocol. Metadata entry and other features are still incomplete, but if development continues, this might, like Exactly, be a nice, simplified way to perform light bag transfers, particularly if you use a cloud storage service to back up files. It also has the advantage of not requiring any installation whatsoever, so novice users can more or less step around the Java vs. Python, GUI vs. CLI questions.

https://libraryofcongress.github.io/bagger-js/

5. Integrated into other software platforms/programs

The BagIt library has also been integrated into larger, more complex software packages, designed for broader digital repository management. Creating bags is only one piece of what these platforms are designed to do. One good example is Archivematica, which can perform all kinds of file conformance checks, transcoding, automatic metadata generation and more. But it does package data according to the BagIt spec whenever it actually transfers files from one location to another.

And that’s the other, more complicated way to use the BagIt library – by building it into your own software and scripts! Obviously this is a more advanced step for archivists who are interested in coding and software development. But the various BagIt versions (Java, Python, JavaScript) and the spec itself are all in the public domain, and anyone could incorporate them into their own applications, or recreate the BagIt library in the programming language of their choice (there is, for instance, a BagIt-ruby version floating around out there, though it’s apparently deprecated and I’ve never heard of anyone who used it).
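Just to give a taste of that last option without leaving the Terminal: because bagit-python is an ordinary Python library, the same bagging operation from section 3 can be called from your own script, or even a one-liner (the directory path and metadata values here are placeholders):

$ python -c "import bagit; bagit.make_bag('/path/to/directory', {'Contact-Name': 'Your Name'})"

That make_bag() function is essentially what the bagit.py command wraps, so anything you can do at the command line you can also build into your own workflows.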

Dual-Boot a Windows Machine

It is an inconvenient truth that the MIAP program is spread across two separate buildings along Broadway. They’re only about five minutes apart, and the vast majority of the time this presents no problems for students or staff, but it does mean that my office and one of our primary lab spaces are in geographically separate locations. Good disaster planning, troublesome for day-to-day operations.

The Digital Forensics Lab (alternately referred to as the Old Media Lab or the Dead Media Lab, largely depending on my current level of frustration or endearment towards the equipment contained within it) is where we house our computing equipment for the excavation and exploration of born-digital archival content: A/V files created and contained on hard drive, CD, floppy disk, zip disk, etc. We have both contemporary and legacy systems to cover decades of potential media, primarily Apple hardware (stretching back to a Macintosh SE running OS 7), but also a couple of powerful modern Windows machines set up with virtual machines and emulators to handle Microsoft operating systems back to Windows 3.1 and MS-DOS.

Having to schedule planned visits from my office over to the main Tisch building in order to test, update, or otherwise work with any of this equipment is mildly irksome. That’s why my office Mac is chock full of emulators and other forensic software that I hardly use on any kind of regular basis – when I get a request from a class for a new tool to be installed in the Digital Forensics Lab, it’s much easier to familiarize myself with the setup process right where I am before working with the legacy equipment; and I’m just point-blank unlikely to trek over to the other building for no other reason than to test out new software that I’ve just read about or otherwise think might be useful for our courses.

sleepy-office-worker-at-desk-with-multiple-coffees
#ProtestantWorkEthic

This is a long-winded way of justifying why the department purchased, at my request, a new Windows machine that I will be able to use as a testing ground for Windows-based software and workflows (I had previously installed a Windows 7 virtual machine on my Mac to try to get around some of this, but the sluggish performance of a VM on a desktop not explicitly set up for such a purpose was vaguely intolerable). The first thing I was quite excited to do with this new hardware was to set up a dual-boot operating system: that is, make it so that on starting up the computer I would have the choice of using either Windows 7 or Windows 10, which is the main thing I’m going to talk about today.

IMG_2329
Swag

Pretty much all of our Windows computers in the archive and MIAP program still run Windows 7 Pro, for a variety of reasons – Windows 8 was geared so heavily towards improved communication with and features for mobile devices that it was hardly worth the cost of upgrading an entire department, and Windows 10 is still not even a year old, which gives me pause in terms of the stability and compatibility of software that we rely on from Windows 7. So I needed Windows 7 in order to test how new programs work with our current systems. However, as it increases in market share and developers begin to migrate over, I’m increasingly intrigued by Windows 10, to the point that I also wanted access to it in order to test out the direction our department might go in the future. In particular I very much wanted to try out the new Windows Subsystem for Linux, available in the Windows 10 Anniversary Update coming this summer – a feature that will in theory make Linux utilities and local files accessible to the Windows user via a Bash shell (the command-line interface already seen on Mac and Ubuntu setups). Depending on how extensive the compatibility gets, that could smooth over some of the kinks we have getting all our students (on different operating systems) on the same page in our Digital Literacy and Digital Preservation courses. But that is a more complicated topic for another day.

When my new Windows machine arrived, it came with a warning right on the box that even though the computer came pre-installed with Windows 7 and licenses/installation discs for both 7 and Windows 10,

You may only use one version of the Windows software at a time. Switching versions will require you to uninstall one version and install the other version.

1d8acd8c6e8e337ce31bef84a8636491

This statement is only true in the broadest sense – that is, if you have no sense of partitioning, a process by which you can essentially separate your hard drive into distinct, discrete sections. The computer can basically treat separate partitions as separate drives, allowing you to format the different partitions with entirely separate file systems, or, as we will see here, install completely different operating systems.

Now, as it happens, it also turned out to be semi-true for my specific setup, but only temporarily, and because of some kinks specific to the manufacturer who provided this desktop (hi, HP!). I’ll explain more in a minute, but right now would be a good point to note that I was working with a totally clean machine, and therefore endangering no personal files in this whole partitioning/installation process. If you also want to set up some kind of dual-boot partition, please please please make sure all of your files are backed up elsewhere first. You never know when you will, in fact, have to perform a clean install and completely wipe your hard drive just to get back to square one.

1a0a18d74db871e6358d7526b271c0e749d9cedb8afd2411816625802370c924
“Arnim Zola sez: back up your files, kids!”

So, as the label said, booting up the computer right out of the box, I got a clean Windows 7 setup. The first step was to make a new blank partition on the hard drive, on to which I could install the Windows 10 operating system files. In order to do this, we run the Windows Disk Management utility (you can find it by just hitting the Windows Start button and typing “disk management” into the search bar):

start

Once the Disk Management window pops up, I could see the 1TB hard drive installed inside the computer (labelled “Disk 0”), as well as all the partitions (also called “volumes”) already on that drive. Some small partitions containing system and recovery files (from which the computer could boot into at least some very basic functionality even if the Windows operating system were to corrupt or fail) were present, but mostly (~900 GB) the drive is dedicated to the main C: volume, which contains all the Windows 7 operating files, program files, personal files if there were any, etc. By right-clicking on this main partition and selecting “Shrink Volume,” I can set aside some of that space for a new partition, on to which we will install the Windows 10 OS. (note: all illustrative photos were gathered after the fact, so some numbers aren’t going to line up exactly here, but the process is the same)

hesx3

If you wanted to dual-boot two operating systems that use completely incompatible file systems – for instance, Mac and Windows – you would have to set aside space for not only the operating system’s files, but also all of the disk space you would want to dedicate to software, file storage, etc. However, Windows 7 and 10 both use the NTFS file system – meaning Windows 10 can easily read and work with files that have been created on or are stored in a Windows 7 environment. So in setting up this new partition I only technically had to create space for the Windows 10 operating system files, which run about 25 GB total. In practice I wanted to leave some extra space, just in case some software comes along that can only be installed on the Windows 10 partition, so I went ahead and doubled that number to 50 GB (since Disk Management works in MB, we enter “50000” into the amount of space to shrink from the C: volume).

shrink_volume

Disk Management runs for a minute and then a new Blank Partition appears on Disk 0. Perfect! I pop in the Windows 10 installation disc that came with the computer and restart. In my case, the hardware automatically knew to boot up from the installation disc (rather than the Windows 7 OS on the hard drive), but it’s possible others would have to reset the boot order to go from the CD/DVD drive first, rather than the installed hard drive (this involves the computer’s BIOS or UEFI firmware interface – more on that in a minute – but for now if it gives you problems, there’s plenty of guides out there on the Googles).

Following the instructions for the first few parts of the Windows 10 installer is straightforward (entering a user name and password, name for the computer, suchlike), but I ran into a problem when finally given the option to select the partition on to which I wanted to install Windows 10. I could see the blank, unformatted 50 GB partition I had created, right there, but in trying to select it, I was given this warning message:

Windows cannot be installed to this disk. The selected disk is of the GPT partition style.

Humph. In fact I could not select ANY of the partitions on the disk, so even if I had wanted to do a clean install of Windows 10 on to the main partition where Windows 7 now lived, I couldn’t have done that either. What gives, internet?

So for many many many years (in computer terms, anyway – computer years are probably at least equivalent to dog years), PCs came installed with a firmware interface called the BIOS – Basic Input/Output System. In order to install or reinstall operating system software, you need a way to send very basic commands to the hard drive. The BIOS was able to do this because it lived on the PC’s motherboard, rather than on the hard drive – as long as your BIOS was intact, your computer would have at least some very basic functionality, even if your operating system corrupted or your hard drive had a mechanical failure. With the BIOS you could reformat your hard drive, select whether you booted the operating system from the hard drive or an external source (e.g. floppy drive or CD drive), etc.

header
Or rule a dystopian underwater society! …wait

In the few seconds when you first powered on a PC, the BIOS would look to the very first section of a hard drive, which (if already formatted) would contain something called a Master Boot Record: a table that holds information about the partitions present on that hard drive – how many partitions there are, how large each of them is, what file system is present on each, which one(s) contain bootable operating system software, and which partition to boot from first (if multiple partitions have a bootable OS).

windows-cannot-be-installed-to-this-disk
You probably saw something like this screen by accident once when your cat walked across your keyboard right as you started up the computer.

Here’s the thing: because of the limitations of the time, the BIOS and MBR partition style can only handle a total of four primary partitions on any one drive, and can only boot from a partition if it is less than about 2.2 TB in size. For a long time, that was plenty of space and functionality to work with, but with rapid advancements in the storage size of hard drives and the processing power of motherboards, the BIOS and MBR partitioning became increasingly severe and arbitrary roadblocks. So from the late ’90s through the mid-’00s, an international consortium developed a more advanced firmware interface, called UEFI (Unified Extensible Firmware Interface), which employs a new partition style, GPT (GUID Partition Table). With GPT, there’s theoretically no limit to the number of partitions on a drive, and UEFI can boot from partitions as large as 9.4 ZB (yes, that’s zettabytes). For comparison’s sake, 1 ZB is about equivalent to 36,000 years of 1080p high-definition video. So we’re probably set for motherboard firmware and partition styles for a while.

n2cnt4
We’re expected to hit about 40 zettabytes of known data in 2020. Like, total. In the world. Our UEFI motherboards are good for now.

UEFI cannot read MBR partitions as-is, though it has a legacy mode that can be enabled to restrict its own functionality to that of the BIOS, and thereby read MBR. And if the UEFI motherboard is set to boot only from the legacy BIOS, it cannot understand or work with GPT partitions. Follow?

So GETTING BACK TO WHAT WE WERE ACTUALLY DOING… the reason I could not install a new, Windows 10-bootable partition on to my drive was that the UEFI motherboard in my computer had booted in legacy BIOS mode – for some reason.

jdhvc
Me.

Honestly, I’m not sure why this is. Obviously this was not a clean hard drive when I received it – someone at HP had already installed Windows 7 on to this GPT-partitioned hard drive, which would’ve required the motherboard to be in UEFI boot mode. So why did it arrive with legacy BIOS boot mode not only enabled, but set first in the preferential boot order? My only possible answer is that after installing Windows 7, they went back in and set the firmware to legacy BIOS boot mode in order to improve compatibility with the Windows 7 OS – which was developed and released back when BIOS was still the default on new equipment.

This was a quick fix – restart the computer, follow the brief on-screen instructions to enter the firmware setup (usually by pressing the ESC key, though it can vary with your setup), and navigate through the firmware settings to re-enable UEFI boot mode (I also left legacy BIOS boot enabled, though lower in the boot order, for the above-stated reasoning about compatibility with Windows 7 – so now, theoretically, my computer can start up from either MBR or GPT drives/disks with no problem).

Phew. Are you still with me after all this? As a reward, here’s a vine of LeBron James blocking Andre Iguodala to seal an NBA championship, because that is now you owning computer history and functionality.

From this point on, we can just pop the Windows 10 installation disc back in and follow the instructions like we did before. I can now select the unformatted 50 GB partition on which to install Windows 10 – and the installation wizard basically runs itself. After a lot of practical username-and-password setup nonsense, now when I start up my computer, I get this screen:

boot-screen-640x480

And I can just choose whether to enter the Windows 7 or 10 OS. Simple as that. I’ll go more into some of what this setup allows me to do (particularly the Windows Subsystem for Linux) another day, as this post has gone on waaaayy too long. Happy summer, everyone!