%pyspark in Zeppelin: No module named pyspark error

Gregory Van Seghbroeck gregory.vanseghbroeck at intec.ugent.be
Thu Jul 14 14:33:10 UTC 2016


Hi Konstantinos,

Thanks a lot!! I'll give it a try after my holidays.

I still have to answer your question about the Bigtop charms. Here goes ... my apologies for being vague about versions and such, it's from a while back.
What I did was deploy a small HDFS setup using the Bigtop charms. We always set things up in LXC containers on bare metal servers. Management of these bare metal servers is out of our hands; it is provided by our Emulab system.
Everything seemed to go fine, except for the relations part. The resource manager needs FQDNs to set things up properly, and unfortunately resolving those FQDNs fails. It has to do with how the physical system is set up and how the networking is handled between the LXC containers; that networking layer is something one of my colleagues (Merlijn Sebrechts, probably not a stranger to you, or at least not to the community) created for us. My workaround at the time was to manually add all the FQDNs to the /etc/hosts file. Sufficient at that point, but not workable in the long run. So I asked my colleague if he could simply add this to the charms that provide the networking, but he responded that something like this should actually be handled in the Bigtop charms, since the failing relation is part of those charms.
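For illustration, the entries I added to /etc/hosts looked roughly like this (the addresses and hostnames below are just placeholders, not the real ones from our deployment):

    10.0.3.101   resourcemanager-0.example.internal   resourcemanager-0
    10.0.3.102   namenode-0.example.internal          namenode-0
    10.0.3.103   slave-0.example.internal             slave-0
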
I probably should have gone directly to the authors of the Bigtop charms, but you were kind enough to respond to this issue as well.

If things are not clear, I'll try to reproduce this issue on our system and will come back to you in a week or so.

Kind regards and thanks again for your help.
Gregory

-----Original Message-----
From: Konstantinos Tsakalozos [mailto:kos.tsakalozos at canonical.com] 
Sent: Thursday, July 14, 2016 3:40 PM
To: Gregory Van Seghbroeck <gregory.vanseghbroeck at intec.ugent.be>
Cc: bigdata at lists.ubuntu.com; Kevin Monroe <kevin.monroe at canonical.com>
Subject: Re: %pyspark in Zeppelin: No module named pyspark error

Hi Gregory,

I did some more testing today and submitted a patch for review.

The line

    spark.driver.extraClassPath /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar

will only fix spark-shell. For pyspark, the line to be added to spark-defaults.conf is slightly different:

    spark.jars /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar

We have a patch under review (https://github.com/juju-solutions/layer-apache-spark/pull/25) so that you will not have to do any of this editing yourself.

Thanks,
Konstantinos



On Wed, Jul 13, 2016 at 8:47 PM, Konstantinos Tsakalozos <kos.tsakalozos at canonical.com> wrote:
> Hi Gregory,
>
> Here is what I have so far.
>
> In yarn-client mode, pyspark jobs fail with "pyspark module not
> present": http://pastebin.ubuntu.com/19266710/
> Most probably this is because the execution end-nodes are not Spark
> nodes; they are just Hadoop nodes without pyspark installed.
> You will need to run your job on a Spark cluster set up in
> standalone execution mode, scaled to match your needs.
> Relating spark to the hadoop-plugin will give you access to HDFS.
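> As a rough sketch of that setup (the names below are from memory, so please
> double-check against the charm; I am assuming the Spark service is named
> "spark", the plugin service "plugin", and that the execution-mode config
> option is called spark_execution_mode):
>
>     juju set spark spark_execution_mode=standalone
>     juju add-unit spark -n 2          # scale to match your needs
>     juju add-relation spark plugin    # gives the Spark nodes access to HDFS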
>
> In this setup you will need to manually add the following line to
> /etc/spark/conf/spark-defaults.conf:
>
>     spark.driver.extraClassPath /usr/lib/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
>
> We are working on a patch to remove this extra manual step.
>
> A couple of asks from our side:
> - Would it be possible to share with us the job you are running, so
> that we can verify we have addressed your use case?
> - You mentioned problems with using the spark charm that is based on 
> Apache Bigtop. Would it be possible to provide us with more info on 
> what is not working there?
>
> We would like to thank you for your feedback as it allows us to 
> improve our work.
>
> Thanks,
> Konstantinos
>
>
> On Tue, Jul 12, 2016 at 9:55 PM, Kevin Monroe 
> <kevin.monroe at canonical.com>
> wrote:
>>
>> I think I accidentally discarded Kostas' message. Sorry about that!
>>
>> Gregory, Kostas is working on reproducing your environment. We should
>> know more in the next day or so.
>>
>> ---------- Forwarded message ----------
>> From: Konstantinos Tsakalozos <kos.tsakalozos at canonical.com>
>> Date: Tue, Jul 12, 2016 at 10:39 AM
>> Subject: Re: %pyspark in Zeppelin: No module named pyspark error
>> To: Gregory Van Seghbroeck <gregory.vanseghbroeck at intec.ugent.be>
>> Cc: Kevin Monroe <kevin.monroe at canonical.com>, 
>> bigdata at lists.ubuntu.com
>>
>>
>> Hi Gregory,
>>
>> Thank you for the info you provided. I will need some time to set up
>> the deployment you just described and try to reproduce the error. I
>> guess any pyspark job should have the same effect.
>>
>> Thanks,
>> Konstantinos
>>
>> On Tue, Jul 12, 2016 at 11:31 AM, Gregory Van Seghbroeck 
>> <gregory.vanseghbroeck at intec.ugent.be> wrote:
>>>
>>> Hi Kevin,
>>>
>>>
>>>
>>> Thanks for the response! I really like the Juju and Canonical community.
>>>
>>>
>>>
>>> I can tell you the Juju version: it is 1.25.3.
>>>
>>> The status will be a problem, since I removed most of the services.
>>> That being said, I don't think we are using the Bigtop Spark charms
>>> yet, so this might be the problem. Here is a list of the services I
>>> deployed before (a rough sketch of how they were wired up follows the list):
>>>
>>> - cs:trusty/apache-hadoop-namenode-2
>>> - cs:trusty/apache-hadoop-resourcemanager-3
>>> - cs:trusty/apache-hadoop-slave-2
>>> - cs:trusty/apache-hadoop-plugin-14
>>> - cs:trusty/apache-spark-9
>>> - cs:trusty/apache-zeppelin-7
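>>>
>>> Roughly, the deployment and relations looked like this (reproduced from
>>> memory, following the usual apache-hadoop plugin-style wiring, so the
>>> exact relation endpoints may be slightly off):
>>>
>>>     juju deploy cs:trusty/apache-hadoop-namenode-2 namenode
>>>     juju deploy cs:trusty/apache-hadoop-resourcemanager-3 resourcemanager
>>>     juju deploy cs:trusty/apache-hadoop-slave-2 slave
>>>     juju deploy cs:trusty/apache-hadoop-plugin-14 plugin
>>>     juju deploy cs:trusty/apache-spark-9 spark
>>>     juju deploy cs:trusty/apache-zeppelin-7 zeppelin
>>>     juju add-relation namenode resourcemanager
>>>     juju add-relation slave namenode
>>>     juju add-relation slave resourcemanager
>>>     juju add-relation plugin namenode
>>>     juju add-relation plugin resourcemanager
>>>     juju add-relation spark plugin
>>>     juju add-relation zeppelin spark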
>>>
>>>
>>>
>>> The reason we don't use the Bigtop charms yet is that we see
>>> problems with the hostnames on the containers. Some of the relations
>>> use hostnames, but these cannot be resolved, so I have to add the
>>> mappings between IPs and hostnames manually to the /etc/hosts file.
>>>
>>>
>>>
>>> The image I pasted in, showing our environment, was a screenshot
>>> taken from Zeppelin. Those parameters looked OK from what I
>>> could find online.
>>>
>>>
>>>
>>> Kind Regards,
>>>
>>> Gregory
>>>
>>>
>>>
>>>
>>>
>>> From: Kevin Monroe [mailto:kevin.monroe at canonical.com]
>>> Sent: Monday, July 11, 2016 7:20 PM
>>> To: Gregory Van Seghbroeck <gregory.vanseghbroeck at intec.ugent.be>
>>> Cc: bigdata at lists.ubuntu.com
>>> Subject: Re: %pyspark in Zeppelin: No module named pyspark error
>>>
>>>
>>>
>>> Hi Gregory,
>>>
>>>
>>>
>>> I wasn't able to see your data after "Our environment is set up as 
>>> follows:"
>>>
>>>
>>>
>>> <big black box for me>
>>>
>>>
>>>
>>> Will you reply with the output (or a pastebin link) of the following:
>>>
>>>
>>>
>>> juju version
>>>
>>> juju status --format=tabular
>>>
>>>
>>>
>>> Kostas has found a potential Zeppelin issue in the Bigtop charms,
>>> where the Bigtop Spark offering may be too old. Knowing your Juju
>>> and charm versions will help me determine whether your issue is related.
>>>
>>>
>>>
>>> Thanks!
>>>
>>> -Kevin
>>>
>>>
>>>
>>> On Mon, Jul 11, 2016 at 7:36 AM, Gregory Van Seghbroeck 
>>> <gregory.vanseghbroeck at intec.ugent.be> wrote:
>>>
>>> Dear,
>>>
>>>
>>>
>>> We have deployed Zeppelin with Juju and connected it to Spark.
>>> According to Juju everything went well, and we can see this is indeed
>>> the case: when we run one of the Zeppelin tutorials we see some nice graphs.
>>> However, if we try to use the Python interpreter (%pyspark) we
>>> always get an error.
>>>
>>>
>>> Kind Regards,
>>>
>>> Gregory
>>>
>>>
>>> --
>>> Bigdata mailing list
>>> Bigdata at lists.ubuntu.com
>>> Modify settings or unsubscribe at:
>>> https://lists.ubuntu.com/mailman/listinfo/bigdata
>>>



