%pyspark in Zeppelin: No module named pyspark error

Konstantinos Tsakalozos kos.tsakalozos at canonical.com
Thu Jul 14 13:40:07 UTC 2016


Hi Gregory,

I did some more testing today and submitted a patch for review.

The line:
"spark.driver.extraClassPath
/usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar"
will fix only spark-shell.
For pyspark, the line to add to spark-defaults.conf is slightly different:
"spark.jars /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar"
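
For reference, here is a minimal sketch of how the relevant section of
/etc/spark/conf/spark-defaults.conf might look with both lines in place
(the jar path is the one from this thread; adjust it to match your
deployment):

```properties
# Fixes the missing LZO jar for spark-shell (driver classpath only)
spark.driver.extraClassPath /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
# Fixes the same issue for pyspark (spark.jars also distributes the jar to executors)
spark.jars /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
```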

We have a patch under review
https://github.com/juju-solutions/layer-apache-spark/pull/25 so that
you will not have to do any editing.

Thanks,
Konstantinos



On Wed, Jul 13, 2016 at 8:47 PM, Konstantinos Tsakalozos
<kos.tsakalozos at canonical.com> wrote:
> Hi Gregory,
>
> Here is what I have so far.
>
> When in yarn-client mode, pyspark jobs fail with "pyspark module not
> present": http://pastebin.ubuntu.com/19266710/
> Most probably this is because the execution end-nodes are not Spark nodes;
> they are just Hadoop nodes without pyspark installed.
> You will need to run your job on a Spark cluster set up in standalone
> execution mode, scaled to match your needs.
> Relating spark to the hadoop-plugin will give you access to HDFS.
>
> In this setup you will need to manually go and add the following line:
> "spark.driver.extraClassPath
> /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar"
> inside /etc/spark/conf/spark-defaults.conf
> We are working on a patch to remove this extra manual step.
>
> A couple of asks from our side:
> - Would it be possible to share with us the job you are running, so that
> we can verify we have addressed your use case?
> - You mentioned problems with using the spark charm that is based on
> Apache Bigtop. Would it be possible to provide us with more info on what
> is not working there?
>
> We would like to thank you for your feedback as it allows us to improve our
> work.
>
> Thanks,
> Konstantinos
>
>
> On Tue, Jul 12, 2016 at 9:55 PM, Kevin Monroe <kevin.monroe at canonical.com>
> wrote:
>>
>> I think I accidentally discarded Kostas' message.  Sorry about that!
>>
>> Gregory, Kostas is working on reproducing your environment. We should
>> know more in the next day or so.
>>
>> ---------- Forwarded message ----------
>> From: Konstantinos Tsakalozos <kos.tsakalozos at canonical.com>
>> Date: Tue, Jul 12, 2016 at 10:39 AM
>> Subject: Re: %pyspark in Zeppelin: No module named pyspark error
>> To: Gregory Van Seghbroeck <gregory.vanseghbroeck at intec.ugent.be>
>> Cc: Kevin Monroe <kevin.monroe at canonical.com>, bigdata at lists.ubuntu.com
>>
>>
>> Hi Gregory,
>>
>> Thank you for the info you provided. I will need some time to set up the
>> deployment you just described and try to reproduce the error. I guess any
>> pyspark job should have the same effect.
>>
>> Thanks,
>> Konstantinos
>>
>> On Tue, Jul 12, 2016 at 11:31 AM, Gregory Van Seghbroeck
>> <gregory.vanseghbroeck at intec.ugent.be> wrote:
>>>
>>> Hi Kevin,
>>>
>>>
>>>
>>> Thanks for the response! I really like the Juju and Canonical community.
>>>
>>>
>>>
>>> The juju version is 1.25.3.
>>>
>>> The status will be a problem, since I removed most of the services. That
>>> being said, I don’t think we are using the bigtop spark charms yet, so
>>> this might be the problem. Here is a list of the services I deployed
>>> before:
>>>
>>> - cs:trusty/apache-hadoop-namenode-2
>>> - cs:trusty/apache-hadoop-resourcemanager-3
>>> - cs:trusty/apache-hadoop-slave-2
>>> - cs:trusty/apache-hadoop-plugin-14
>>> - cs:trusty/apache-spark-9
>>> - cs:trusty/apache-zeppelin-7
>>>
>>>
>>>
>>> The reason we don’t use the bigtop charms yet is that we see problems
>>> with the hostnames on the containers. Some of the relations use
>>> hostnames, but these cannot be resolved, so I have to add the mapping
>>> between IPs and hostnames manually to the /etc/hosts file.
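>>>
>>> For example, the entries I add look like this (the IPs and hostnames
>>> below are just placeholders for illustration; the real values come from
>>> the deployed containers):
>>>
>>>     10.0.3.101  juju-machine-1-lxc-0
>>>     10.0.3.102  juju-machine-2-lxc-0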
>>>
>>>
>>>
>>> The image I pasted in, showing our environment, was a screenshot of the
>>> Zeppelin environment. These parameters looked okay from what I could
>>> find online.
>>>
>>>
>>>
>>> Kind Regards,
>>>
>>> Gregory
>>>
>>>
>>>
>>>
>>>
>>> From: Kevin Monroe [mailto:kevin.monroe at canonical.com]
>>> Sent: Monday, July 11, 2016 7:20 PM
>>> To: Gregory Van Seghbroeck <gregory.vanseghbroeck at intec.ugent.be>
>>> Cc: bigdata at lists.ubuntu.com
>>> Subject: Re: %pyspark in Zeppelin: No module named pyspark error
>>>
>>>
>>>
>>> Hi Gregory,
>>>
>>>
>>>
>>> I wasn't able to see your data after "Our environment is set up as
>>> follows:"
>>>
>>>
>>>
>>> <big black box for me>
>>>
>>>
>>>
>>> Will you reply with the output (or a pastebin link) of the following:
>>>
>>>
>>>
>>> juju version
>>>
>>> juju status --format=tabular
>>>
>>>
>>>
>>> Kostas has found a potential zeppelin issue in the bigtop charms, where
>>> the bigtop spark offering may be too old.  Knowing your juju and charm
>>> versions will help me determine whether your issue is related.
>>>
>>>
>>>
>>> Thanks!
>>>
>>> -Kevin
>>>
>>>
>>>
>>> On Mon, Jul 11, 2016 at 7:36 AM, Gregory Van Seghbroeck
>>> <gregory.vanseghbroeck at intec.ugent.be> wrote:
>>>
>>> Dear,
>>>
>>>
>>>
>>> We have deployed Zeppelin with juju and connected it to Spark. According
>>> to juju, everything went well, and we can see this is indeed the case:
>>> when we try to execute one of the Zeppelin tutorials, we see some nice
>>> graphs. However, if we try to use the Python interpreter (%pyspark), we
>>> always get an error.
>>>
>>>
>>> Kind Regards,
>>>
>>> Gregory
>>>
>>>
>>> --
>>> Bigdata mailing list
>>> Bigdata at lists.ubuntu.com
>>> Modify settings or unsubscribe at:
>>> https://lists.ubuntu.com/mailman/listinfo/bigdata
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>


