When I started working with HBase, I realized that there were no good examples or tutorials for a C or C++ client. So I decided to show how to create and compile a working HBase client, which may become a workhorse for any project that needs to process very large data sets.
Contents
- Quick reference
- Installation
- Starting
- Generating the C++ code with Thrift
- C++ HBase client code
- Compilation
- Run the HBase client
Quick reference
Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. More details – https://en.wikipedia.org/wiki/Apache_Hadoop
Thrift is an interface definition language and binary communication protocol that is used to define and create services for numerous languages. More details – https://en.wikipedia.org/wiki/Apache_Thrift
HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. More details – https://en.wikipedia.org/wiki/Apache_HBase
Installation
In this tutorial I’m working on Ubuntu Server 15.04, but installation on other Unix-like systems is quite similar.
Java installation
Hadoop, Thrift, and HBase all require the JAVA_HOME environment variable to be set, so first you need to install Java. I will install the prebuilt OpenJDK packages.
sudo apt-get install openjdk-7-jre
The preferred location for JAVA_HOME or any system-wide variable is /etc/environment. Open /etc/environment in a text editor such as nano or gedit and add the following (your Java path may differ):
JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
Load the variables by sourcing the file:
source /etc/environment
Then check the variable:
echo $JAVA_HOME
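If the variable is set, echo prints the path you configured; on a stock Ubuntu amd64 install of OpenJDK 7 that would be something like the following (the exact path depends on your system):
/usr/lib/jvm/java-7-openjdk-amd64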
Hadoop installation
Download Hadoop from http://hadoop.apache.org/releases.html, or fetch it directly from the Apache archive.
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
tar xzvf hadoop-2.7.0.tar.gz
cd hadoop-2.7.0/
Set up the environment variables.
export HADOOP_PREFIX="/home/alex/Programs/hadoop-2.7.0" # Change this to where you unpacked hadoop to.
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
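To confirm the variables point at a working installation, you can ask Hadoop for its version:
$HADOOP_PREFIX/bin/hadoop version
It should report Hadoop 2.7.0 along with build details.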
For a single-node installation, let's edit the main HDFS configuration file at $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml.
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/alex/Programs/hadoop-2.7.0/hdfs/datanode</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/alex/Programs/hadoop-2.7.0/hdfs/namenode</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
</configuration>
The fs.defaultFS property belongs in a different file, $HADOOP_PREFIX/etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
<description>NameNode URI</description>
</property>
</configuration>
To configure YARN, the relevant file is $HADOOP_PREFIX/etc/hadoop/yarn-site.xml.
<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of CPU cores that can be allocated for containers.</description>
</property>
</configuration>
Thrift installation
First install all the tools and libraries required to build and install Thrift.
sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
Download, build, and install Thrift.
wget http://archive.apache.org/dist/thrift/0.9.2/thrift-0.9.2.tar.gz
tar xzvf thrift-0.9.2.tar.gz
cd thrift-0.9.2/
./configure
make
sudo make install
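To verify that the compiler is installed and on your PATH:
thrift -version
It should print Thrift version 0.9.2.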
HBase installation
Visit http://archive.apache.org/dist/hbase/hbase-0.98.9/ and download the file with “bin” in its name. Extract the downloaded file and change into the newly created directory.
wget http://archive.apache.org/dist/hbase/hbase-0.98.9/hbase-0.98.9-hadoop2-bin.tar.gz
tar xzvf hbase-0.98.9-hadoop2-bin.tar.gz
cd hbase-0.98.9-hadoop2
For a pseudo-distributed HBase setup, edit conf/hbase-site.xml, which is the main HBase configuration file.
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
</property>
</configuration>
You do not need to create the HBase data directory. HBase will do this for you.
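One more thing worth checking: HBase reads JAVA_HOME from conf/hbase-env.sh. If HBase fails to find Java at startup, uncomment and set the variable there (the path below is the same assumption as before; adjust it to your system):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64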
Starting
Let’s start all the services.
Starting Hadoop
It’s time to set up the folders and start the daemons.
## Start HDFS daemons
# Format the namenode directory (DO THIS ONLY ONCE, THE FIRST TIME)
$HADOOP_PREFIX/bin/hdfs namenode -format
# Start the namenode daemon
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
# Start the datanode daemon
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
## Start YARN daemons
# Start the resourcemanager daemon
$HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager
# Start the nodemanager daemon
$HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
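A quick way to check that all four daemons came up is jps, the JVM process lister that ships with the JDK (it may be missing if you installed only the JRE):
jps
The output should include NameNode, DataNode, ResourceManager, and NodeManager.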
Starting Thrift
HBase ships with a Thrift server; start it with the bundled daemon script.
hbase-0.98.9-hadoop2/bin/hbase-daemon.sh start thrift
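By default the HBase Thrift server listens on port 9090, which is the port the client below connects to. You can confirm it is listening, for example with netstat:
netstat -ln | grep 9090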
Starting HBase
The bin/start-hbase.sh script is provided as a convenient way to start HBase.
./hbase-0.98.9-hadoop2/bin/start-hbase.sh
If you want to play with HBase, you can start the HBase shell.
./hbase-0.98.9-hadoop2/bin/hbase shell
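Inside the shell you can, for example, check the cluster status and list the existing tables (standard HBase shell commands):
status
list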
Generating the C++ code with Thrift
Let’s generate the C++ bindings from the Hbase.thrift interface definition that ships with HBase.
thrift --gen cpp hbase-0.98.9-hadoop2/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
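Thrift writes the generated sources into a gen-cpp directory. For this IDL it should contain something like the following (Hbase_server.skeleton.cpp is a stub server and is not needed for a client):
ls gen-cpp
Hbase.cpp  Hbase.h  Hbase_constants.cpp  Hbase_constants.h  Hbase_server.skeleton.cpp  Hbase_types.cpp  Hbase_types.h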
Copy the generated gen-cpp directory and the Thrift lib directory to your project directory, for example /var/www/hclient.
cp -R gen-cpp /var/www/hclient/gen-cpp
cp -R thrift-0.9.2/lib /var/www/hclient/lib
- we copy the whole lib folder on the assumption that other languages may be used in the project later
C++ HBase client code
/* Author: Alex Bod
* Website: http://www.alexbod.com
* License: The GNU General Public License, version 2
* main.cpp: C++ Hbase client using the Thrift Server
*/
#include <poll.h>
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <boost/lexical_cast.hpp>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TBufferTransports.h>
#include <thrift/protocol/TBinaryProtocol.h>
#include "gen-cpp/Hbase.h"
using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace apache::hadoop::hbase::thrift;
typedef std::vector<std::string> StrVec;
typedef std::map<std::string,std::string> StrMap;
typedef std::vector<ColumnDescriptor> ColVec;
typedef std::map<std::string,ColumnDescriptor> ColMap;
typedef std::vector<TCell> CellVec;
typedef std::map<std::string,TCell> CellMap;
/* The function to print rows */
static void printRow(const std::vector<TRowResult> &);
/* The function to print versions */
static void printVersions(const std::string &row, const CellVec &);
int main(int argc, char** argv)
{
/* Connection to the Thrift Server */
boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
/* Create the Hbase client */
HbaseClient client(protocol);
try {
transport->open();
std::string t("demo_table");
/* Scan all tables, look for the demo table and delete it. */
std::cout << "scanning tables..." << std::endl;
StrVec tables;
client.getTableNames(tables);
for (StrVec::const_iterator it = tables.begin(); it != tables.end(); ++it) {
std::cout << " found: " << *it << std::endl;
if (t == *it) {
if (client.isTableEnabled(*it)) {
std::cout << " disabling table: " << *it << std::endl;
client.disableTable(*it);
}
std::cout << " deleting table: " << *it << std::endl;
client.deleteTable(*it);
}
}
/* Create the demo table with two column families, entry: and unused: */
ColVec columns;
StrMap attr;
columns.push_back(ColumnDescriptor());
columns.back().name = "entry:";
columns.back().maxVersions = 10;
columns.push_back(ColumnDescriptor());
columns.back().name = "unused:";
std::cout << "creating table: " << t << std::endl;
try {
client.createTable(t, columns);
} catch (const AlreadyExists &ae) {
std::cerr << "WARN: " << ae.message << std::endl;
}
ColMap columnMap;
client.getColumnDescriptors(columnMap, t);
std::cout << "column families in " << t << ": " << std::endl;
for (ColMap::const_iterator it = columnMap.begin(); it != columnMap.end(); ++it) {
std::cout << " column: " << it->second.name << ", maxVer: " << it->second.maxVersions << std::endl;
}
/* Test UTF-8 handling */
std::string invalid("foo-\xfc\xa1\xa1\xa1\xa1\xa1");
std::string valid("foo-\xE7\x94\x9F\xE3\x83\x93\xE3\x83\xBC\xE3\x83\xAB");
/* Non-utf8 is fine for data */
std::vector<Mutation> mutations;
mutations.push_back(Mutation());
mutations.back().column = "entry:foo";
mutations.back().value = invalid;
client.mutateRow(t, "foo", mutations, attr);
/* Empty strings are not valid as row keys or column names:
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:";
mutations.back().value = "";
client.mutateRow(t, "", mutations, attr); */
/* This row name is valid utf8 */
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:foo";
mutations.back().value = valid;
client.mutateRow(t, valid, mutations, attr);
/* Non-utf8 is now allowed in row names because HBase stores values as binary */
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:foo";
mutations.back().value = invalid;
client.mutateRow(t, invalid, mutations, attr);
/* Run a scanner on the rows we just created */
StrVec columnNames;
columnNames.push_back("entry:");
std::cout << "Starting scanner..." << std::endl;
int scanner = client.scannerOpen(t, "", columnNames, attr);
try {
while (true) {
std::vector<TRowResult> value;
client.scannerGet(value, scanner);
if (value.size() == 0)
break;
printRow(value);
}
} catch (const IOError &ioe) {
std::cerr << "FATAL: Scanner raised IOError" << std::endl;
}
client.scannerClose(scanner);
std::cout << "Scanner finished" << std::endl;
/* Run some operations on a bunch of rows */
for (int i = 0; i <= 11; i++) {
/* Format row keys as "00000" to "00011" */
char buf[32];
snprintf(buf, sizeof(buf), "%05d", i);
std::string row(buf);
std::vector<TRowResult> rowResult;
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "unused:";
mutations.back().value = "DELETE_ME";
client.mutateRow(t, row, mutations, attr);
client.getRow(rowResult, t, row, attr);
printRow(rowResult);
client.deleteAllRow(t, row, attr);
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:num";
mutations.back().value = "0";
mutations.push_back(Mutation());
mutations.back().column = "entry:foo";
mutations.back().value = "FOO";
client.mutateRow(t, row, mutations, attr);
client.getRow(rowResult, t, row, attr);
printRow(rowResult);
/* Sleep to force later timestamp */
poll(0, 0, 50);
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:foo";
mutations.back().isDelete = true;
mutations.push_back(Mutation());
mutations.back().column = "entry:num";
mutations.back().value = "-1";
client.mutateRow(t, row, mutations, attr);
client.getRow(rowResult, t, row, attr);
printRow(rowResult);
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:num";
mutations.back().value = boost::lexical_cast<std::string>(i);
mutations.push_back(Mutation());
mutations.back().column = "entry:sqr";
mutations.back().value = boost::lexical_cast<std::string>(i*i);
client.mutateRow(t, row, mutations, attr);
client.getRow(rowResult, t, row, attr);
printRow(rowResult);
mutations.clear();
mutations.push_back(Mutation());
mutations.back().column = "entry:num";
mutations.back().value = "-999";
mutations.push_back(Mutation());
mutations.back().column = "entry:sqr";
mutations.back().isDelete = true;
client.mutateRowTs(t, row, mutations, 1, attr); /* Shouldn't override latest */
client.getRow(rowResult, t, row, attr);
printRow(rowResult);
CellVec versions;
client.getVer(versions, t, row, "entry:num", 10, attr);
printVersions(row, versions);
assert(versions.size());
std::cout << std::endl;
try {
std::vector<TCell> value;
client.get(value, t, row, "entry:foo", attr);
if (value.size()) {
std::cerr << "FATAL: shouldn't get here!" << std::endl;
return -1;
}
} catch (const IOError &ioe) {
/* Blank */
}
}
/* Scan all rows/columns */
columnNames.clear();
client.getColumnDescriptors(columnMap, t);
std::cout << "The number of columns: " << columnMap.size() << std::endl;
for (ColMap::const_iterator it = columnMap.begin(); it != columnMap.end(); ++it) {
std::cout << " column with name: " + it->second.name << std::endl;
columnNames.push_back(it->second.name);
}
std::cout << std::endl;
std::cout << "Starting scanner..." << std::endl;
scanner = client.scannerOpenWithStop(t, "00020", "00040", columnNames, attr);
try {
while (true) {
std::vector<TRowResult> value;
client.scannerGet(value, scanner);
if (value.size() == 0)
break;
printRow(value);
}
} catch (const IOError &ioe) {
std::cerr << "FATAL: Scanner raised IOError" << std::endl;
}
client.scannerClose(scanner);
std::cout << "Scanner finished" << std::endl;
transport->close();
} catch (const TException &tx) {
std::cerr << "ERROR: " << tx.what() << std::endl;
}
}
/* The function to print rows */
static void printRow(const std::vector<TRowResult> &rowResult)
{
for (size_t i = 0; i < rowResult.size(); i++) {
std::cout << "row: " << rowResult[i].row << ", cols: ";
for (CellMap::const_iterator it = rowResult[i].columns.begin();it != rowResult[i].columns.end(); ++it) {
std::cout << it->first << " => " << it->second.value << "; ";
}
std::cout << std::endl;
}
}
/* The function to print versions */
static void printVersions(const std::string &row, const CellVec &versions)
{
std::cout << "row: " << row << ", values: ";
for (CellVec::const_iterator it = versions.begin(); it != versions.end(); ++it) {
std::cout << (*it).value << "; ";
}
std::cout << std::endl;
}
Download the tar archive directly or clone it from GitHub.
Compilation
g++ -Wall -o hclient main.cpp gen-cpp/Hbase_types.cpp gen-cpp/Hbase_constants.cpp gen-cpp/Hbase.cpp -lthrift
Dependencies explained
gen-cpp/Hbase_types.cpp gen-cpp/Hbase_constants.cpp gen-cpp/Hbase.cpp: the generated HBase Thrift bindings
-lthrift: links against the Thrift library
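If the Thrift headers and library ended up under a non-standard prefix (make install defaults to /usr/local), you may need to point the compiler and linker at them explicitly; a sketch, assuming the default /usr/local prefix:
g++ -Wall -I/usr/local/include -L/usr/local/lib -o hclient main.cpp gen-cpp/Hbase_types.cpp gen-cpp/Hbase_constants.cpp gen-cpp/Hbase.cpp -lthrift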
Run the HBase client
./hclient
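If the program fails to start with an error about libthrift.so not being found, the runtime linker cannot see /usr/local/lib; either run sudo ldconfig once, or export the library path for the current session:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
./hclient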