hadoop-client properties

Panagiotis Liakos p.liakos at di.uoa.gr
Tue Apr 12 11:23:26 UTC 2016


Hi again,

I investigated this issue a little further and found out that with
YARN there is no longer a single JobTracker running jobs; instead,
each job has its own ApplicationMaster that manages its execution
flow.
I was not able to build Apache Giraph using the flags suggested for
Hadoop YARN in the README file of its release (i.e., "mvn
-Phadoop_yarn -Dhadoop.version=2.2.0 <goals>").
Therefore, I built Giraph with the Hadoop 2 profile (-Phadoop_2) and
submitted my jobs as MapReduce applications.
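Concretely, the build and a sample submission look roughly like the
following; the jar name, input/output paths, and worker count are
placeholders that depend on the Giraph release and your HDFS layout:

```shell
# Build Giraph against Hadoop 2 (profile name taken from the Giraph README)
mvn -Phadoop_2 -DskipTests clean package

# Submit the shortest-paths example as a MapReduce job
# (jar name and HDFS paths are placeholders)
hadoop jar giraph-examples/target/giraph-examples-*-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/ubuntu/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/ubuntu/output/shortestpaths \
  -w 2
```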

This works well for the SimpleShortestPathsComputation example of
Giraph as long as I have set the 'mapreduce.jobtracker.address'
property I mentioned in my previous e-mail.
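For completeness, this is the property from my previous e-mail as it
goes into mapred-site.xml on the submitting client ('yarn' is the
value Giraph checks for instead of 'local'):

```xml
<!-- mapred-site.xml on the submitting client -->
<property>
    <name>mapreduce.jobtracker.address</name>
    <value>yarn</value>
</property>
```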
However, I am interested in executing the PageRank algorithm. When I
tried to do that my job failed and I found out in the job history logs
the error: "Aggregation is not enabled."

After searching for this, I figured out that I additionally have to
set the following properties in the yarn-site.xml file (the first one
would probably be enough):

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>259200</value>
</property>
<property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
</property>
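Once aggregation is enabled, the aggregated logs of a finished
application can be fetched with the yarn CLI (the application id below
is from my run; yours will differ):

```shell
# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1460456563820_0003
```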

So I added these properties to all my slave nodes as well as to the
resourcemanager. I could not find a way to restart Hadoop so that
these changes would take effect (the stop scripts do not seem to
work). My workaround was to rebuild my environment, add these
properties before creating any relations, and then create my
relations.
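On a plain (non-juju) Hadoop installation, the usual way to pick up
yarn-site.xml changes would be to bounce the YARN daemons with the
bundled scripts; I note it here in case it works in other setups
(paths assume $HADOOP_HOME is set):

```shell
# On the master: restart the ResourceManager
$HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager

# On each slave: restart the NodeManager
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
```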

And then "Job job_1460456563820_0003 completed successfully!!!" and I
finally have my PageRank values :)

Perhaps if I were able to build Giraph for Hadoop YARN I would be able
to submit my jobs as YARN applications without changes to the client,
slave, and resourcemanager configuration. However, I believe that in
order to execute MapReduce jobs one has to at least set the
'mapreduce.jobtracker.address' property, as is also suggested in this
blog post:
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#mapreduce

--Panagiotis Liakos


2016-04-11 16:30 GMT+03:00 Panagiotis Liakos <p.liakos at di.uoa.gr>:
> Hi all,
>
> I am trying to setup a cluster with juju in the local environment to
> submit jobs with Apache Giraph. You can find the details of my setup
> at the end of this e-mail.
>
> I have downloaded and built Apache Giraph on my hadoop-client and I
> want to try some examples that execute on two workers.
>
> After a number of failed attempts I found out that I have to set the
> property 'mapreduce.jobtracker.address' (or the deprecated
> 'mapred.job.tracker') to 'yarn' in order to run Giraph with more
> than one worker.
>
> In particular, Giraph considered that this property was set to 'local'.
> At first I found out that I can set a custom attribute with:
> -ca giraph.SplitMasterWorker=false
> to execute my job with one worker.
> Then, after finding the code responsible for this behavior
> (https://github.com/apache/giraph/blob/7e48523b520afee8e727d1e1aaab801a3bd80f06/giraph-core/src/main/java/org/apache/giraph/job/GiraphJob.java#L143)
> I was able to set the correct hadoop property and execute my job with
> 2 workers.
>
> My question is, why is this property not set in the juju client charm?
> Does it enable some otherwise undesired behavior?
> I see that 'mapreduce.framework.name' is set to 'yarn', but
> apparently this is not enough for Giraph.
>
> Thank you.
>
> --Panagiotis Liakos
>


