A simple ETL library
Drudgery is a simple ETL library that supports the following sources/destinations:
Supported Rubies:
Install the gem directly:
gem install drudgery
Or, add it to your Gemfile:
gem 'drudgery'
And, if using the :sqlite3
extractor or loader:
gem 'sqlite3', '~> 1.3'
And, if using the :active_record
extractor or loader:
gem 'activerecord', '~> 3.0'
And, if using the :active_record_import
loader:
gem 'activerecord-import', '>= 0.2.9'
Extracting from CSV and loading into ActiveRecord:
m = Drudgery::Manager.new
m.prepare do |job|
job.extract :csv, 'src/addresses.csv'
job.transform do |data, cache|
first_name, last_name = data.delete(:name).split(' ')
data[:first_name] = first_name
data[:last_name] = last_name
data[:state] = data.delete(:state_abbr)
data
end
job.load :active_record, Address
end
m.run
Extracting from SQLite3 and bulk loading into ActiveRecord:
db = SQLite3::Database.new('db.sqlite3')
m = Drudgery::Manager.new
m.prepare do |job|
job.batch_size 5000
job.extract :sqlite3, db, 'addresses' do |extractor|
extractor.select(
'name',
'street_address',
'city',
'state_abbr AS state',
'zip'
)
extractor.where("state LIKE 'A%'")
extractor.order('name')
end
job.transform do |data, cache|
first_name, last_name = data.delete(:name).split(' ')
data[:first_name] = first_name
data[:last_name] = last_name
data
end
job.load :active_record_import, Address
end
m.run
Provide Drudgery with a logger and info will be logged about each job.
When log level is INFO
expect to see basic output for each job (e.g.
when it starts and completes).
logger = Logger.new('log/etl.log')
logger.level = Logger::INFO # Logger defaults to log level DEBUG
Drudgery.logger = logger
When log level is DEBUG
expect to see output for each record
extracted, transformed and loaded (VERY NOISY).
Drudgery also provides progress output to STDERR courtesty of the
progressbar
gem. Progress output is on by default, but can be
disabled with the following:
Drudgery.show_progress = false
The following extractors are provided: :csv
, :sqlite3
, :active_record
You can use your own extractors if you would like. They need to implement the following methods:
#name
- returns extractor's name#record_count
- returns count of records in source#extract
- must yield each record and record indexclass ArrayExtractor
attr_reader :name
def initialize(source)
@source = source
@name = 'array'
end
def extract
index = 0
@source.each do |record|
yield [record, index]
index += 1
end
end
def record_count
@source.size
end
end
source = []
m = Drudgery::Manager.new
m.prepare do |job|
m.extract ArrayExtractor.new(source)
m.load :csv, 'destination.csv'
end
Or, if you define your custom extractor under the Drudgery::Extractors namespace:
module Drudgery
module Extractors
class ArrayExtractor
attr_reader :name
def initialize(source)
@source = source
@name = 'array'
end
def extract
index = 0
@source.each do |record|
yield [record, index]
index += 1
end
end
def record_count
@source.size
end
end
end
end
source = []
m = Drudgery::Manager.new
m.prepare do |job|
m.extract :array, source
m.load :csv, 'destination.csv'
end
Drudgery comes with a basic Transformer class. It symbolizes the keys of
each record and allows you to register a processor to process data. The
processor should implement a #call
method and return a Hash
or nil
.
custom_processor = Proc.new do |data, cache|
data[:initials] = data[:name].split(' ').map(&:capitalize).join()
data
end
transformer = Drudgery::Transformer.new
transformer.register(custom_processor)
transformer.transform({ :name => 'John Doe' }) # == { :name => 'John Doe', :initials => 'JD' }
You could also implement your own transformer if you need more custom
processing power. If you inherit from Drudgery::Transfomer
, you need
only implement the #transform
method that accepts a hash argument as an
argument and returns a Hash
or nil
.
class CustomTransformer < Drudgery::Transformer
def transform(data)
# do custom processing here
end
end
m = Drudgery::Manager.new
m.prepare do |job|
m.extract :csv, 'source.csv'
m.transform CustomTransformer.new
m.load :csv, 'destination.csv'
end
The following loaders are provided:
:csv
:sqlite3
:active_record
:active_record_import
You can use your own loaders if you would like. They need to implement the following methods:
#name
- returns the loader's name#load
- accepts an array of records and then write them to the
destinationclass ArrayLoader
attr_reader :name
def initialize(destination)
@destination = destination
@name = 'array'
end
def load(records)
@destination.push(*records)
end
end
destination = []
m = Drudgery::Manager.new
m.prepare do |job|
m.extract :csv, 'source.csv'
m.load ArrayLoader.new(destination)
end
Or, if you define your custom loader under the Drudgery::Loaders namespace:
module Drudgery
module Loaders
class ArrayLoader
attr_reader :name
def initialize(destination)
@destination = destination
@name = 'array'
end
def load(records)
@destination.push(*records)
end
end
end
end
destination = []
m = Drudgery::Manager.new
m.prepare do |job|
m.extract :csv, 'source.csv'
m.load :array, destination
end